
Expected Shortfall

We have thus far explored the fundamentals of risk assessment, focusing on the key principles it must satisfy to achieve sound quantification: monotonicity, diversification, and, for financial purposes, cash-invariance. This foundation has enabled us to highlight the fundamental flaws of mean-variance analysis and value at risk (VaR) in meeting these criteria. However, from a practical standpoint, we are still far from identifying a fully satisfactory approach.

When considering a risk quantification instrument \( R \), the following points are crucial:

  1. Soundness: The instrument \( R \) must satisfy the properties of diversification and monotonicity to ensure robust risk quantification.
  2. Understandability: \( R \) should be intuitively comprehensible from a financial perspective, even for individuals not deeply versed in the intricacies of mathematics. Ultimately, you need to convince your boss, the regulator, and the public that the methodology you employ is sensible and reliable.
  3. Implementability: The computation of \( R \) must be feasible. At the end of the day, you need to produce a quantifiable result. This means it should be possible to create a programmatic function, based on available data, to compute the value of your risk measure (prototyping).
  4. Efficiency and Robustness: The implementation of \( R \) should meet industry standards—being fast, reliable, and free of bugs. Risk computations are not a one-time experiment; they need to be conducted daily. Large financial institutions, by regulatory requirement, must aggregate and assess vast and complex positions to provide timely results on a daily basis.

Up to now, our focus has been primarily on the first point—establishing the groundwork for soundness. However, the other points are equally vital in practice. Since the 2008 financial crisis, the shortcomings of value at risk (VaR) have been widely acknowledged. While these shortcomings (particularly related to soundness) were long known to academics, addressing the other points took time before a new industry standard could emerge. This standard is the expected shortfall (also known under equivalent terms such as average value at risk or conditional value at risk).

Expected Shortfall

Since the main issue with value at risk is that it only provides information about a single point of the CDF and is blind to everything beyond it, the idea is to consider the tail beyond value at risk.

Definition: Expected Shortfall

The expected shortfall of an integrable random variable \( L \) at level \( 0 < \alpha < 1 \) is defined as

\[ ES_{\alpha}(L) = \frac{1}{\alpha}\int_0^\alpha V@R_{s}(L) ds = \frac{1}{\alpha}\int_{1-\alpha}^1 q_L(s) ds \]

Note

The Expected Shortfall (ES) was introduced by Artzner, Delbaen, Eber, and Heath in 1999 to address the shortcomings of Value at Risk (V@R). Expected Shortfall is known by several other names (with equivalent definitions, modulo some subtleties), including:

  • Average Value at Risk (AV@R),
  • Conditional Value at Risk (CV@R),
  • Expected Tail Loss (ETL), and
  • Superquantile.

Figure: Expected Shortfall

As shown in the figure, the expected shortfall (ES) addresses the shortcomings of value at risk (V@R) by considering the loss area beyond V@R. Specifically, if two loss distributions, \( \tilde{L} \) and \( L \), share the same V@R but \( \tilde{L} \) exhibits larger losses beyond the V@R (i.e., has fatter tails than \( L \)), then—even with identical V@R values—the expected shortfall (the area beyond V@R) of \( \tilde{L} \) will exceed that of \( L \).

This observation addresses the second point on our wish list, as ES naturally rectifies V@R's limitations regarding tail risk. However, it does not resolve the first issue on our list. Specifically, while it is clear that V@R fails to satisfy diversification, it remains puzzling why ES should satisfy this property. The desirable properties of V@R—monotonicity, law invariance, cash-invariance, and positive homogeneity—extend to ES through its integral formulation. However, since V@R is not convex, it is unclear why ES should exhibit convexity based on this representation.

Furthermore, while this formulation satisfies the third point (as ES is computed as an integral of a quantifiable object), doubts remain about its efficiency. Calculating the integral of the quantile involves evaluating numerous quantiles between \( 1-\alpha \) and \( 1 \), which is computationally intensive and prone to error. This challenge is especially pronounced for extreme quantiles (e.g., \( 99.999\% \) or \( 99.99999\% \)), where sampling the distribution in highly unlikely regions becomes unstable.
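As a rough numerical sketch (assuming a standard normal loss profile and scipy; the grid size is an arbitrary illustrative choice), the quantile-integral definition can be approximated by averaging quantiles over a grid on \( (1-\alpha, 1) \):

import numpy as np
from scipy.stats import norm

alpha = 0.01
# Approximate (1/alpha) * integral of q_L(s) over (1-alpha, 1) by an average of quantiles
s = np.linspace(1 - alpha, 1, 10_000, endpoint=False)  # avoid s = 1, where the quantile is infinite
es_approx = norm.ppf(s).mean()
print(es_approx)  # roughly 2.66 for a standard normal loss at the 1% level

Even in this simple case a fine grid is needed near \( s = 1 \), which is exactly the instability mentioned above.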

To address these issues, we now explore another class of risk assessment instruments introduced by operations research scientists Ben-Tal and Teboulle: the optimized certainty equivalent.

Optimized Certainty Equivalent

At the core of the definition of the optimized certainty equivalent is a special penalization function called a loss function.

  • Definition: Loss Function

    A function \(\ell \colon \mathbb{R} \to \mathbb{R}\) is called a loss function if

    • \(\ell\) is convex
    • \(\ell\) is increasing
    • \(\ell(0) = 0\) and \(\ell^\prime(0) = 1\)(1)

      1. Note that \(\ell\) does not need to be differentiable; for example, \(\ell(x) = x^+/\alpha\) for \(0< \alpha <1\) is not. It just needs to satisfy \(\ell^{\prime}_-(0) \leq 1 \leq \ell^\prime_+(0)\), where \(\ell_-^\prime\) and \(\ell^\prime_+\) are the left and right derivatives, which always exist for convex functions.
    • \(\lim_{x \to \infty}\ell(x)/x >1\) and \(\lim_{x \to -\infty} \ell(x)/x <1\).

    Classical examples satisfying this definition:

    • piecewise linear: \(\ell(x)= x^+/ \alpha\) with \(0< \alpha <1\);
    • quadratic: \(\ell(x)=x^++(x^+)^2/2\);
    • exponential: \(\ell(x)=e^x-1\)

Figure: Loss Functions

The loss function penalizes a loss (losses are considered positive in our case) \( x \geq 0 \) by assigning a value \( \ell(x) \geq x \). For gains (negative values of \( x \)), it is also penalizing: the credited amount \( \ell(x) \) satisfies \( x \leq \ell(x) \leq 0 \), so its magnitude is smaller than the gain itself.
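As a small sketch (assuming numpy; the grid and the level \( \alpha \) are illustrative choices), the three classical loss functions above can be written as one-liners and the penalization property \( \ell(x) \geq x \) together with \( \ell(0) = 0 \) checked numerically:

import numpy as np

alpha = 0.05                                   # illustrative level for the piecewise linear loss
loss_functions = {
    "piecewise linear": lambda x: np.maximum(x, 0) / alpha,
    "quadratic":        lambda x: np.maximum(x, 0) + np.maximum(x, 0) ** 2 / 2,
    "exponential":      lambda x: np.exp(x) - 1,
}

x = np.linspace(-3, 3, 601)
for name, ell in loss_functions.items():
    assert np.all(ell(x) >= x - 1e-12), name   # penalization: ell(x) >= x everywhere
    assert ell(0.0) == 0.0, name               # normalization: ell(0) = 0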

Thus, given a loss profile \( L \), you compute \( E[\ell(L)] \geq E[L] \), which represents the penalized loss estimation of the loss profile. The idea introduced by Ben-Tal and Teboulle is to reduce the value of these penalized losses by allocating some cash \( m \), transitioning from \( E[\ell(L)] \) to \( E[\ell(L-m)] \). However, in terms of total costs, you must account for the cash allocated, leading to the total cost valuation:

\[ m + E[\ell(L-m)]. \]

With the decision variable being the amount of cash allocated, minimizing the total cost gives rise to the definition of the optimized certainty equivalent.

Definition: Optimized Certainty Equivalent

Given a loss function \( \ell \), the optimized certainty equivalent \( R \) of a bounded random variable (or, under appropriate integrability conditions, an unbounded one) is defined as:

\[ R(L) = \inf \left\{ m + E\left[ \ell(L - m) \right] \colon m \in \mathbb{R} \right\}. \]

Proposition

Given a loss function \( \ell \), the optimized certainty equivalent \( R \) is a cash-invariant and law-invariant risk measure.

Furthermore, it holds that:

\[ R(L) = m^\ast + E\left[ \ell(L - m^\ast) \right], \]

where \( m^\ast \) satisfies the first-order condition: (1)

\[ E[\ell^\prime(L - m^\ast)] = 1. \]

  1. If \( \ell \) is not differentiable, the condition becomes:

    \[ E[\ell^\prime_-(L-m^\ast)] \leq 1 \leq E\left[ \ell^\prime_+(L-m^\ast) \right]. \]
Proof

We show that \( R \), as defined, is monotone, cash-invariant, and convex.

  • Monotonicity: Suppose \( L_1(\omega) \geq L_2(\omega) \) for all \( \omega \). Since \( \ell \) is increasing:

    \[ m + \ell\left( L_1 - m \right) \geq m + \ell\left( L_2 - m \right). \]

    Taking the expectation:

    \[ m + E\left[\ell\left( L_1 - m \right)\right] \geq m + E\left[\ell\left( L_2 - m \right)\right]. \]

    Since \( m + E[\ell(L_2 - m)] \geq R(L_2) \), it follows that:

    \[ m + E\left[\ell\left( L_1 - m \right)\right] \geq R(L_2). \]

    Taking the infimum over \( m \) yields:

    \[ R(L_1) = \inf\left\{ m + E\left[\ell\left( L_1 - m \right)\right] \colon m \in \mathbb{R} \right\} \geq R(L_2). \]
  • Cash-Invariance: Let \( m \in \mathbb{R} \). Then:

    \[ \begin{align*} R(L - m) & = \inf\left\{ \tilde{m} + E\left[\ell\left( L - m - \tilde{m} \right)\right] \colon \tilde{m} \in \mathbb{R} \right\} \\ & = \inf\left\{ \hat{m} - m + E\left[\ell\left( L - \hat{m} \right)\right] \colon \hat{m} \in \mathbb{R} \right\} \quad \text{(change of variable \( \hat{m} = m + \tilde{m} \))} \\ & = R(L) - m. \end{align*} \]
  • Convexity: Let \( L_1 \) and \( L_2 \) be two loss profiles, and \( 0 \leq \lambda \leq 1 \). For \( m_1, m_2 \in \mathbb{R} \), define \( m = \lambda m_1 + (1-\lambda)m_2 \) and \( L = \lambda L_1 + (1-\lambda)L_2 \). Since \( \ell \) is convex:

    \[ m + E\left[\ell(L - m)\right] \leq \lambda\left( m_1 + E\left[ \ell(L_1 - m_1) \right] \right) + (1-\lambda)\left( m_2 + E[\ell(L_2 - m_2)] \right). \]

    Since \( R(L) \leq m + E[\ell(L - m)] \), taking the infimum over \( m_1 \) and \( m_2 \) sequentially yields:

    \[ R(\lambda L_1 + (1-\lambda)L_2) = R(L) \leq \lambda R(L_1) + (1-\lambda)R(L_2). \]
  • Law-Invariance: Law-invariance follows directly, as \( R \) depends on \( L \) only through expectations of the form \( E[\ell(L - m)] \), which are determined by the CDF of \( L \).

To show the final assertion, define:

\[ g(m) = m + E\left[ \ell(L - m) \right], \]

for which \( R(L) = \inf g(m) \). Since \( \ell \) is convex, \( g \) is also convex. From \(\ell\) being increasing and the asymptotic assumptions on \(\ell\), it follows that \(\ell(x) \geq a_1 x - c_1\) for \(x\) positively large enough with \(a_1>1\), and \(\ell(x)\geq a_2 x - c_2\) for \(x\) negatively large enough with \(a_2<1\). Since \(L\) is bounded, for \(m\) positively large enough (larger than the upper bound of \(L\), so that \(L - m\) is negative) we have

\[ g(m) = m + E[\ell(L-m)] \geq m + a_2E\left[ L -m \right] - c_2 = \underbrace{(1-a_2)}_{>0} \underbrace{m}_{>0} + a_2 E[L] - c_2 \xrightarrow[m \to \infty]{} \infty \]

The same argument for large enough negative values of \(m\) yields

\[ g(m) = m + E[\ell(L-m)] \geq m + a_1E\left[ L -m \right] - c_1 = \underbrace{(1-a_1)}_{<0} \underbrace{m}_{<0} + a_1 E[L] - c_1 \xrightarrow[m \to -\infty]{} \infty \]

Altogether, this shows that \(g(m) \to \infty\) for \(m\to \pm \infty\), that is, in mathematical terms, \(g\) is coercive.

Figure: Minimum of a convex function

This ensures that \( g \) attains its minimum at \( m^\ast \), satisfying the first-order condition:

\[ E\left[ \ell^\prime_-(L-m^\ast) \right] \leq 1 \leq E\left[ \ell^\prime_+(L-m^\ast) \right]. \]

If \( \ell \) is differentiable, this simplifies to:

\[ E\left[ \ell^\prime(L - m^\ast) \right] = 1. \]

This completes the proof.

This proposition provides several key takeaways:

  1. The optimized certainty equivalent (OCE) is a risk measure independent of the specific definition of \( \ell \), as long as \( \ell \) is a valid loss function.
  2. By its definition and the convexity of the problem, the computation of OCE is straightforward, reducing to a one-dimensional unconstrained convex optimization problem. This allows for the application of efficient, state-of-the-art algorithms.
  3. The simplicity of this optimization problem allows for circumventing classical gradient descent by providing an explicit expression for the first-order condition.
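To make the last two points concrete, here is a minimal sketch (assuming a Monte Carlo sample of \( L \) and scipy; the standard normal sample and the loss function are illustrative choices) of the OCE as a one-dimensional unconstrained minimization:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
L_sample = rng.standard_normal(100_000)        # illustrative sample of the loss profile

def oce(ell, sample):
    # Optimized certainty equivalent: minimize m + E[ell(sample - m)] over m
    objective = lambda m: m + ell(sample - m).mean()
    result = minimize_scalar(objective)        # one-dimensional convex problem
    return result.x, result.fun                # (m*, R(L))

alpha = 0.05
m_star, R = oce(lambda x: np.maximum(x, 0) / alpha, L_sample)   # piecewise linear loss
print(m_star, R)   # m* close to the 5% V@R, R close to the 5% expected shortfall of the sample

For a loss function with an explicit derivative, the minimization can alternatively be replaced by one-dimensional root finding on the first-order condition.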

The Exponential Function: Entropic Risk Measure

Consider the loss function \( \ell(x) = (e^{\gamma x} - 1)/\gamma \) for some \( \gamma > 0 \). By the first-order condition:

\[ 1 = E[\ell^\prime(L-m^\ast)] = E[e^{\gamma (L - m^\ast)}] = e^{-\gamma m^\ast}E\left[ e^{\gamma L} \right]. \]

Solving for \( m^\ast \):

\[ m^\ast = \frac{\ln\left(E[e^{\gamma L}]\right)}{\gamma}. \]

Substituting \( m^\ast \) back into \( R \) yields:

\[ R(L) = \frac{1}{\gamma} \ln \left( E\left[ e^{\gamma L} \right] \right). \]

Hence, for the exponential loss function, the OCE can be computed explicitly, and the resulting risk measure is known as the entropic risk measure.
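For example, if \( L \sim \mathcal{N}(\mu, \sigma^2) \), the moment generating function gives \( E[e^{\gamma L}] = e^{\gamma \mu + \gamma^2 \sigma^2/2} \), and therefore

\[ R(L) = \frac{1}{\gamma} \ln \left( E\left[ e^{\gamma L} \right] \right) = \mu + \frac{\gamma \sigma^2}{2}, \]

so the risk charge grows linearly in the variance and in the risk-aversion parameter \( \gamma \).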

While this measure is prevalent in other domains (e.g., statistical mechanics, physics, and machine learning) and is computationally efficient, it is unsuitable as a financial risk measure. The exponential penalization assigns extremely high values to large losses, making it impractical for scenarios with rare but severe losses.

For example, consider the loss profile:

\[ L = \begin{cases} 1,000,000,000 & \text{with probability } 0.00001, \\ -10,000 & \text{otherwise}. \end{cases} \]

Despite the low probability of the extreme loss, the exponential penalization makes the risk computation numerically infeasible, and even with exact arithmetic the resulting risk would be astronomically large.
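A quick sanity check (with numpy and, for illustration, \( \gamma = 1 \)) shows that the exponential moment cannot even be evaluated in floating-point arithmetic:

import numpy as np

gamma = 1.0
extreme_loss = 1_000_000_000.0
print(np.exp(gamma * extreme_loss))   # overflows to inf (RuntimeWarning), so E[exp(gamma * L)] is numerically unusable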

The Piecewise Linear Function

The exponential function example demonstrates how a strong penalization can lead to explicit representations but may not be practical for financial risk measures. Let us now consider the opposite extreme: a function that penalizes less, specifically a piecewise linear loss function:

\[ \ell(x) = \frac{1}{\alpha}x^+, \]

where \( 0 < \alpha < 1 \).

Since \( \ell \) is not differentiable, the characterization uses the left and right derivatives:

\[ \ell_-^\prime(x) = \frac{1}{\alpha} 1_{(0, \infty)}(x) = \begin{cases} \frac{1}{\alpha} & x > 0, \\ 0 & \text{otherwise}. \end{cases} \quad \text{and} \quad \ell_+^\prime(x) = \frac{1}{\alpha} 1_{[0, \infty)}(x) = \begin{cases} \frac{1}{\alpha} & x \geq 0, \\ 0 & \text{otherwise}. \end{cases} \]

Applying the first-order condition:

\[ E[\ell^\prime_-(L-m^\ast)] \leq 1 \leq E[\ell^\prime_+(L-m^\ast)]. \]

Substituting the derivatives:

\[ \frac{1}{\alpha}P[L > m^\ast] \leq 1 \leq \frac{1}{\alpha}P[L \geq m^\ast]. \]

This simplifies to:

\[ P[L < m^\ast] \leq 1-\alpha \leq P[L \leq m^\ast]. \]

Thus, \( m^\ast \) is the \( 1-\alpha \) quantile of \( L \):

\[ m^\ast = q_L(1-\alpha) = V@R_{\alpha}(L). \]

Therefore, for the piecewise linear loss function:

\[ R(L) = \inf\left\{ m + \frac{1}{\alpha}E\left[ (L-m)^+ \right] \colon m \in \mathbb{R} \right\} = V@R_{\alpha}(L) + \frac{1}{\alpha}E\left[ \left( L - V@R_{\alpha}(L) \right)^+ \right]. \]
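A minimal numerical check (assuming a standard normal loss profile and scipy; purely illustrative) confirms that the minimizer is the V@R and the minimum is the expected shortfall:

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

alpha = 0.05

def g(m):
    # objective m + E[(L - m)^+] / alpha for a standard normal loss L
    tail, _ = quad(lambda x: (x - m) * norm.pdf(x), m, np.inf)
    return m + tail / alpha

m_grid = np.linspace(0.0, 3.0, 301)
values = [g(m) for m in m_grid]
print(m_grid[np.argmin(values)])   # close to the 5% V@R: norm.ppf(0.95) ~ 1.645
print(min(values))                 # close to the 5% expected shortfall ~ 2.06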

Expected Shortfall and Optimized Certainty Equivalent

On one hand, we previously noted that it is not entirely clear how to show that Expected Shortfall (ES) is a risk measure when derived as the integral of V@R. On the other hand, the optimized certainty equivalent (OCE) with a piecewise linear loss function shows some structural similarities to V@R. It turns out that these two concepts are strongly connected, as demonstrated by the following proposition:

Proposition

For bounded loss profiles (or even integrable ones), the Expected Shortfall with confidence level \( 0 < \alpha < 1 \) coincides with the optimized certainty equivalent using a piecewise linear loss function with a factor of \( 1/\alpha \).

In other words:

\[ ES_{\alpha}(L) = \frac{1}{\alpha}\int_{0}^\alpha V@R_{s}(L)ds = \inf \left\{ m +\frac{1}{\alpha}E[(L-m)^+] \colon m \in \mathbb{R} \right\} = V@R_{\alpha}(L) + \frac{1}{\alpha}E\left[ \left( L - V@R_{\alpha}(L) \right)^+ \right]. \]

In particular, Expected Shortfall is a cash-invariant and law-invariant risk measure.

This remarkable result addresses the key questions about ES: it confirms that ES is a sound risk measure and provides a computationally efficient approach. Instead of directly computing the integral of the quantile (which can be computationally intensive and error-prone), ES can be expressed as the sum of V@R (already an industry standard) and the expected loss beyond V@R, which can be computed easily either using the PDF of \( L \) or Monte Carlo methods with importance sampling.
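For instance, for a normally distributed loss \( L \sim \mathcal{N}(\mu, \sigma^2) \), the tail expectation is available in closed form and one obtains the well-known expression

\[ ES_{\alpha}(L) = \mu + \sigma\, \frac{\varphi\left( \Phi^{-1}(1-\alpha) \right)}{\alpha}, \]

where \( \varphi \) and \( \Phi \) denote the PDF and CDF of the standard normal distribution.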

Proof

The connection between ES and OCE arises from the fact that the quantile function \( q_L(s) \) of \( L \) shares the same CDF as \( L \) itself. Formally, given a loss profile (random variable) \( L \) with CDF \( F_L(m) = P[L \leq m] \) and quantile function \( q_L(s) = \inf\{m \colon F_L(m)\geq s\} \), the quantile \( q_L(s) \) can be viewed as a random variable defined on the probability space \( (\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P}) \), where:

  • \( \tilde{\Omega} = (0,1) \),
  • \( \tilde{\mathcal{F}} \) is the \( \sigma \)-algebra generated by intervals of \((0,1)\), and
  • \( \tilde{P} \) is the Lebesgue measure \( dx \) (the measure of interval lengths).

It can be shown that \( q_L(s) \) has the same CDF as \( L \), i.e., \( F_{q_L}(m) = F_L(m) \). Indeed, by the definition of \( q_L(s) \):

\[ (0, F_L(m)) \subseteq \{s \colon q_L(s) \leq m\} \subseteq (0, F_L(m)], \]

and under \( \tilde{P} \), these sets yield:

\[ F_L(m) = \tilde{P}[(0, F_L(m))] \leq \tilde{P}[q_L \leq m] \leq \tilde{P}[(0, F_L(m)]] = F_L(m). \]

Therefore, \( F_{q_L}(m) = F_L(m) \).

Using this fact, and noting that \( q_L(s) \geq q_L(1-\alpha) \) for \( s \geq 1-\alpha \):

\[ \begin{align*} ES_{\alpha}(L) & = \frac{1}{\alpha} \int_{0}^\alpha V@R_{s}(L)\,ds \\ & = \frac{1}{\alpha} \int_{1-\alpha}^1 q_L(s) ds \\ & = q_L(1-\alpha) + \frac{1}{\alpha}\int_{1-\alpha}^1 \left( q_L(s) - q_{L}(1-\alpha) \right)ds \\ & = V@R_{\alpha}(L) + \frac{1}{\alpha}\int_{\mathbb{R}} \left( m - V@R_{\alpha}(L) \right)^+ dF_{q_L}(m) \\ & = V@R_{\alpha}(L) + \frac{1}{\alpha}\int_{\mathbb{R}} \left( m - V@R_{\alpha}(L) \right)^+ dF_{L}(m) \\ & = V@R_{\alpha}(L) + \frac{1}{\alpha}E\left[ \left( L - V@R_{\alpha}(L) \right)^+ \right]. \end{align*} \]

Remark on the Distribution of the Quantile and Random Sampling

This kind of magic trick used to show the relationship between the piecewise linear optimized certainty equivalent and the expected shortfall relies on the fundamental fact that the distribution of a random variable \(X\) on some probability space \((\Omega, \mathcal{F}, P)\) is the same as the distribution of its quantile \(q_X\) on \((\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})\), where \(\tilde{\Omega} = (0,1)\), \(\tilde{\mathcal{F}} = \mathcal{B}((0,1))\) is the \(\sigma\)-algebra generated by intervals, and \(\tilde{P}\) is the Lebesgue measure \(dx\) that measures interval lengths, that is, \(\tilde{P}[(a, b]] = b-a\).

This result is widely known and extensively used, particularly for random sampling. Suppose you want to sample \( x_1, \ldots, x_N \) from the distribution of a random variable \( X \) (e.g., normal, Student's t, gamma). A computer, however, generates (quasi-)random numbers \( u_1, \ldots, u_N \) uniformly distributed between \( 0 \) and \( 1 \). By the equivalence between the distributions of \( X \) and \( q_X \), defining \( x_n = q_X(u_n) \) for \( n = 1, \ldots, N \) produces a random sample \( x_1, \ldots, x_N \) from the distribution of \( X \).

import numpy as np
from scipy.stats import norm      # (1)
import plotly.graph_objs as go    # (2)

N = 10000
u = np.random.rand(N)             # uniform sample
x0 = norm.ppf(u)                  # normal quantile evaluated at u (inverse transform sampling)
x1 = norm.rvs(size=N)             # sample from normal

# Plot the two histograms
fig = go.Figure()
fig.add_histogram(
    x=x0,
    histnorm='probability',
    name='Quantile of Uniform Sample',
)
fig.add_histogram(
    x=x1,
    histnorm='probability',
    name='Standard Normal Sample',
)
fig.show()
  1. The scipy.stats library provides access to many distributions, including their cdf, pdf, and ppf (quantile function).
  2. plotly is used here for plotting; alternatively, matplotlib can be used.

This principle underpins Monte Carlo integration, where the goal is to compute \( E[f(X)] \). By the law of large numbers and the central limit theorem, it holds that:

\[ \frac{1}{N}\sum_{n=1}^N f(x_n) \xrightarrow[N \to \infty]{} E[f(X)], \]

where \( x_1, \ldots, x_N \) is a random sample from the distribution of \( X \). In practice, a random sample \( u_1, \ldots, u_N \) is drawn from a uniform distribution on \( (0, 1) \), and then \( x_n = q_X(u_n) \) is computed and used in the arithmetic mean of \( f(x_n) \) for \( n = 1, \ldots, N \).
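A minimal sketch of this procedure (with the illustrative choice \( f(x) = x^2 \) and \( X \) standard normal, so that \( E[f(X)] = 1 \)):

import numpy as np
from scipy.stats import norm

N = 100_000
u = np.random.rand(N)       # uniform sample on (0, 1)
x = norm.ppf(u)             # x_n = q_X(u_n): inverse transform sampling
print(np.mean(x ** 2))      # Monte Carlo estimate of E[X^2] = 1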

As of now, we know that Expected Shortfall (ES) is a sound risk measure: it is understandable, implementable, and, due to its representation, efficient to compute. Prior to the introduction of ES, financial institutions commonly computed \( V@R \). To transition to ES, they only need to compute the additional term \( E\left[ (L - V@R_{\alpha}(L))^+ \right]/\alpha \), which is computationally efficient (either analytically or via Monte Carlo methods).

The computation of ES in simple cases is demonstrated below:

import numpy as np
from scipy.stats import norm, t       # Normal and Student's t distributions
from scipy.optimize import root       # Root finding
from scipy.integrate import quad      # One-dimensional integration
import plotly.graph_objs as go        # Plotting library

# Define the basic computation of the quantile (X is a random variable)
def quantile(X, s):
    def fun(m):
        return X.cdf(m) - s
    result = root(fun, 0)  # Find the root
    return result.x[0]

# Compute ES using the integral of quantile representation
def ES1(X, alpha):
    def fun(s):
        return quantile(X, s)
    result, err = quad(fun, 1 - alpha, 1)  # Integrate quantile between 1-alpha and 1
    return result / alpha

# Compute ES using the OCE representation
def ES2(X, alpha):
    var = quantile(X, 1 - alpha)
    def fun(x):
        return (x - var) * X.pdf(x)
    result, err = quad(fun, var, np.inf)  # Integrate beyond V@R
    return var + result / alpha

# Define distributions
X1 = norm
X2 = t(df=2)  # Student's t distribution with 2 degrees of freedom (heavy tails, infinite variance)

alpha = 0.01  # Confidence level (1%)

# Display results
print(f"""
V@R (Normal):\t{quantile(X1, 1 - alpha)}
ES (slow, Normal):\t{ES1(X1, alpha)}
ES (fast, Normal):\t{ES2(X1, alpha)}

V@R (Student):\t{quantile(X2, 1 - alpha)}
ES (slow, Student):\t{ES1(X2, alpha)}
ES (fast, Student):\t{ES2(X2, alpha)}
""")

# Exercise:
# Compare and plot the differences between V@R and ES for Normal and Student's t distributions for 0.0001 < alpha < 0.05.
# Use %timeit to compare the computation times of ES1 and ES2.

We saw that the expected shortfall has multiple representations, and simple transformations can yield additional formulations.

The expected shortfall admits the following representations:

\[ \begin{align*} ES_{\alpha}(L) & = \frac{1}{\alpha}\int_0^\alpha V@R_{s}(L)ds && \text{Quantile representation}\\ & = \inf\left\{m + \frac{1}{\alpha}E[(L-m)^+]\colon m \in \mathbb{R}\right\} && \text{OCE representation}\\ & = V@R_{\alpha}(L) + \frac{1}{\alpha}E\left[ \left( L - V@R_{\alpha}(L) \right)^+ \right] && \text{V@R plus tail expectation}\\ & = \frac{1}{\alpha}\int_{V@R_{\alpha}(L)}^\infty x \, dF_L(x) && \text{Tail average (for continuous } F_L \text{)} \end{align*} \]

Furthermore, the expected shortfall is positively homogeneous, that is,

\[ ES_{\alpha}(\lambda L) = \lambda ES_{\alpha}(L)\]

for every \(\lambda>0\). In particular, combined with convexity, positive homogeneity yields subadditivity: \(ES_{\alpha}(L_1 + L_2)\leq ES_{\alpha}(L_1) + ES_{\alpha}(L_2)\).
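Indeed, the standard one-line argument applies positive homogeneity with \( \lambda = 2 \) followed by convexity with weight \( 1/2 \):

\[ ES_{\alpha}(L_1 + L_2) = 2\, ES_{\alpha}\!\left( \frac{L_1 + L_2}{2} \right) \leq 2\left( \frac{1}{2} ES_{\alpha}(L_1) + \frac{1}{2} ES_{\alpha}(L_2) \right) = ES_{\alpha}(L_1) + ES_{\alpha}(L_2). \]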