
Probability

Probability Space

In probability, we consider:

  • A state space \( \Omega \) of states \( \omega \in \Omega \): a description of the possible states of an outcome about which there is uncertainty.
  • Events \( A \subseteq \Omega \): collections of states that can happen. The family of all considered events \( A \subseteq \Omega \) is denoted by \( \mathcal{F} \).

Examples

  1. Coin Flipping:

    • State space: \( \Omega = \{H, T\} \), where \( H \) and \( T \) denote the states "Head occurs" and "Tail occurs" as the possible outcomes of throwing a coin.

    • Events: \( A = \{H\} \) is the event that head will occur.

  2. Temperature tomorrow:

    • State space: \( \Omega = \mathbb{R} \), where \( x \in \Omega \) represents the possible temperature at 8:00 am tomorrow.

    • Events: \( A = [13,19] \) is the event that tomorrow at 8:00 am, the temperature will lie between \( 13 \) and \( 19 \) degrees.

  3. Financial decision:

    • State space: \( \Omega = [-1,10]^2 \), where for \( (x, y) \in \Omega \), \( x \) and \( y \) represent the interest rates that the central banks of the USA and EU, respectively, will fix next month.

    • Events: \( A = [0.25,0.75] \times [0.9,1.8] \cup \{1\} \times [1.7,2.1] \) is the event that next month the USA fixes an interest rate between \( 0.25\% \) and \( 0.75\% \) while the EU fixes one between \( 0.9\% \) and \( 1.8\% \), OR the USA fixes an interest rate of \( 1\% \) while the EU fixes one between \( 1.7\% \) and \( 2.1\% \).

  4. Texas Holdem:

    For Texas Hold'em, we have a deck \(D\) of 52 cards (13 ranks in each of the four suits).

    After pre-flop, flop, turn and river (if you're still there), you have to choose the best 5-card hand among the combinations you can form from the 5 cards on the table and the two in your hand.

    • State space: \( \Omega = \{\{c_1, c_2, c_3, c_4, c_5\} \colon c_i \in D, \text{ and } c_i \neq c_j \text{ for } i \neq j\} \). Note the set notation: the order of the cards does not matter. Furthermore, all cards are distinct, since dealing occurs without replacement.

    • Event: The event \(A\) that I have a royal flush corresponds to \(A\) containing exactly four states: the hands \(\{10, J, Q, K, A\}\) in spades, in hearts, in diamonds, and in clubs. A counting sketch follows below.
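A quick Python sketch (not part of the original notes, and anticipating the uniform probability introduced later) of how large this state space is and how rare the royal-flush event \(A\) is:

```python
from math import comb

# Size of the state space: 5-card subsets of a 52-card deck,
# drawn without replacement, order irrelevant.
n_hands = comb(52, 5)
print(n_hands)          # 2598960

# The royal-flush event A contains exactly 4 states (one per suit);
# under the uniform probability its likelihood is
print(4 / n_hands)      # about 1.54e-06
```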

Events are meant to be measured afterwards; however, we require some structure among them. We want to speak about the occurrence of one event or another, about two events happening simultaneously, or about an event not happening. Hence the following definition of a measurable space.

Measurable Space

A measurable space is a tuple \( (\Omega, \mathcal{F}) \), where

  • \(\Omega\) is a set (state space)
  • \(\mathcal{F}\) is an algebra of subsets of \(\Omega\) (Events)

An algebra is a collection of sets satisfying the following properties

  • \( \emptyset \) (nothing happens) and \( \Omega \) (anything can happen) are events.
  • If \(A\) is an event, then so is \(A^c\);
  • If \(A\) and \(B\) are events, then \(A\cup B\) is an event (the event that \(A\) or \(B\) happens is itself an event).

Warning

Note that this is the intuitive definition of a measurable space, but for mathematical reasons, we require the algebra of events \(\mathcal{F}\) to be \(\sigma\)-stable: instead of requiring only that unions of two, or finitely many, events be events, we also require that countable unions be events, that is, if \((A_n)\) is a sequence of events, then \(\cup A_n\) is an event. Such a family is called a \(\sigma\)-algebra.

If the state space \(\Omega\) is finite or countable, the classical choice for the algebra of events is the power set \(2^{\Omega}\), the collection of all subsets. If the state space is uncountable, such as \(\mathbb{R}\), the power set is truly large and leads to mathematical issues. In the case of \(\mathbb{R}\), for instance, the measurable sets are those generated by the intervals.
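For a finite state space, the power set and the algebra axioms can be checked mechanically. Here is a minimal Python sketch, using the two-state coin-toss space as an illustration:

```python
from itertools import chain, combinations

def power_set(omega):
    # All subsets of a finite state space, i.e. the algebra 2^Omega.
    s = list(omega)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

omega = frozenset({"H", "T"})
F = power_set(omega)
print(len(F))  # 2**2 = 4 events

# Algebra axioms in the finite case: complements and unions stay in F.
assert frozenset() in F and omega in F
assert all(omega - A in F for A in F)
assert all(A | B in F for A in F for B in F)
```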

Proposition

The third assumption for an algebra can equivalently be replaced by

  • If \(A\) and \(B\) are events, then \(A\cap B\) is an event.

Proof

Let \(A\) and \(B\) be events. From the second assumption it follows that \(A^c\) and \(B^c\) are events. The equivalence between the two assertions (intersection vs. union) now follows from De Morgan's rules

\[ A\cap B = (A^c \cup B^c)^c \quad \text{and}\quad A\cup B = (A^c \cap B^c)^c \]

Examples

Here are some classical examples we will see throughout the lecture.

  • Coin toss:

    • State Space: \(\Omega = \{-1, 1\}\) two states for head and tail
    • Events: \(\mathcal{F} = 2^\Omega = \{\emptyset, \Omega, \{1\}, \{-1\}\}\)

    There are here exactly \(2^2 =4\) events.

  • Finite state space:

    • State Space: \(\Omega = \{\omega_1, \ldots, \omega_N\}\)
    • Events: \(\mathcal{F} = 2^\Omega\)

    There are here exactly \(2^{\#\Omega} = 2^N\) events (already with \(N\) beyond 100 this is more than a computer can take).

  • Random Walk:

    The random walk consists of tossing a coin several times in a row, recording every single result.

    • State Space: \(\Omega = \{\omega = (\omega_1, \ldots, \omega_T)\colon \omega_i = \pm 1\}\) where each state is the sequence of results of the coin tosses.
    • Events: \(\mathcal{F}=2^\Omega\).

    As above, the cardinality of \(\mathcal{F}\) is equal to \(2^{\# \Omega}\). However, there are \(2^T\) possible sequences, and so the cardinality of the family of events is equal to \(2^{2^T}\). You can imagine that even for small \(T\) this size is gigantic.

Random Variables

Aside from being able to measure events, we also want to measure the events that a function of the state satisfies. For instance, in the case of the coin toss, suppose that you play a game where if head occurs you win 100 and if tail occurs you lose everything. As a function of the state it writes as \(X \colon \Omega \to \mathbb{R}\) where \(X(\omega) = 100\) if \(\omega = 1\) and \(X(\omega) = 0\) otherwise. We want to be able to speak about the event that you strictly win something, which is clearly \(\{1\}\). In the general case, we define random variables as those functions of the state for which the event that the function stays below a given level can be measured.

Definition

Let \( (\Omega, \mathcal{F}) \) be a measurable space. A function

\[ \begin{equation*} \begin{split} X\colon \Omega & \longrightarrow \mathbb{R}\\ \omega & \longmapsto X(\omega) \end{split} \end{equation*} \]

is called a random variable if for every level \(x\), the set

\[ A = \left\{ \omega \in \Omega \colon X(\omega)\leq x \right\} =: \{X\leq x\} \]

is an event, that is \(A \in \mathcal{F}\).

Requiring only the events where \(X\) lies below some level may seem arbitrary; however, since \(\mathcal{F}\) is a \(\sigma\)-algebra, this requirement is in fact quite general.

Proposition

It is equivalent for \( X: \Omega \to \mathbb{R} \) to be a random variable to require:

  1. \( \{X > x\} \in \mathcal{F} \) for any \( x \).
  2. \( \{X < x\} \in \mathcal{F} \) for any \( x \).
  3. \( \{X \geq x\} \in \mathcal{F} \) for any \( x \).
  4. \( \{x \leq(<) X \leq(<) y\} \in \mathcal{F} \) for any \( x \leq y \).
Proof
  1. Follows from \( \{X > x\} = \{X \leq x\}^c \), and \( \mathcal{F} \) is closed under complementation.
  2. \( \{X < x\} = \cup_{n} \{X \leq x - 1/n\} \), and \( \mathcal{F} \) is closed under countable union.

The other assertions follow from similar arguments.

This definition is compatible with many of the standard operations. In other terms, sums, products, and compositions with continuous functions of random variables remain random variables.

Proposition

Let \( X \) be a random variable and \( f:\mathbb{R}\to \mathbb{R} \) be a continuous function. Then

\[ \begin{equation*} \begin{split} Y\colon \Omega &\longrightarrow \mathbb{R}\\ \omega & \longmapsto Y(\omega) = f(X(\omega)) \end{split} \end{equation*} \]

is a random variable denoted \( Y = f(X) \).

Let \( X, Y \) be random variables as well as \( (X_n) \) be a converging sequence of random variables. The following are random variables:

  • \( aX + bY \) for every \( a, b \in \mathbb{R} \);
  • \( XY \);
  • \( \max(X, Y) \) and \( \min(X, Y) \);
  • \( \sup X_n \) and \( \inf X_n \);
  • \( \lim X_n \).
Proof

The first part of the proof is not trivial and has to do with topology as well as the definition of continuous functions. The argument goes as follows: for \(y\) in \(\mathbb{R}\), the set \(F = \{x \in \mathbb{R}\colon f(x) \leq y\}\) is a closed set since \(f\) is continuous (lower semi-continuous would be enough). Following the previous proposition, it is possible to show that if \(X\) is a random variable and \(F\) is closed, then \(\{X \in F\}\) is an event. It follows that

\[ \begin{align*} \{Y \leq y\} & = \left\{\omega \in \Omega\colon f(X(\omega))\leq y\right\}\\ & = \left\{ \omega \in \Omega \colon X(\omega) \in \{x \in \mathbb{R}\colon f(x)\leq y\} \right\}\\ & = \left\{ X \in F \right\} && \text{which is an event.} \end{align*} \]

The assertions about linear combinations, products, maxima and minima follow from the first part together with the continuity of the corresponding functions. For the \(\sup\) and \(\inf\), the claim follows from \(\{\sup X_n \leq x\} = \cap \{X_n \leq x\}\) and \(\{\inf X_n <x\} = \cup \{X_n <x\}\); a similar argument handles the limit of a converging sequence of random variables.

If you are interested, you can ask for lecture notes on probability.

Example: Indicator and Simple Random Variables

We turn to the simplest yet one of the most important examples of random variables in probability.

  • Indicator Function


    Definition

    Let \( (\Omega, \mathcal{F}) \) be a measurable space and let \( A \in \mathcal{F} \) be an event. The function

    \[ \begin{equation*} \begin{split} 1_A \colon \Omega & \longrightarrow \mathbb{R}\\ \omega & \longmapsto 1_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \notin A \end{cases} \end{split} \end{equation*} \]

    is called the indicator function of \( A \).

    Exercise

    The indicator function \(1_A\) of an event \(A\) is a random variable. Indeed, let \(x\) be in \(\mathbb{R}\). It follows that

    \[ \{1_A \leq x\} = \begin{cases} \emptyset & \text{if } x<0\\ A^c & \text{if }0\leq x <1 \\ \Omega & \text{if }x \geq 1 \end{cases} \]
  • Plot


    (Figure: plot of the indicator function \(1_A\).)

This definition is strongly related to a truth table: \( 1 \) for true, \( 0 \) for false. Clearly \( 1_{\emptyset} = 0 \) and \( 1_{\Omega} = 1 \).
Show that:

  1. If \( A \) and \( B \) are events such that \( A \cap B = \emptyset \), then \( 1_{A \cup B} = 1_A + 1_B \).
  2. If \( A \) and \( B \) are events, then \( 1_{A \cap B} = 1_A 1_B \).
  3. If \( A \subseteq B \) are events, then \( 1_A \leq 1_B \).
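These identities are easy to check state by state; here is a small Python sketch doing so on a hypothetical four-point state space:

```python
# indicator(S) returns the function 1_S on a small hypothetical space.
omega = {1, 2, 3, 4}
A, B = {1, 2}, {3}                       # disjoint events
indicator = lambda S: (lambda w: 1 if w in S else 0)

for w in omega:
    # 1_{A u B} = 1_A + 1_B for disjoint A and B
    assert indicator(A | B)(w) == indicator(A)(w) + indicator(B)(w)
    # 1_{A n B} = 1_A * 1_B
    assert indicator(A & B)(w) == indicator(A)(w) * indicator(B)(w)

C = {1, 2, 3}                            # A subset of C gives 1_A <= 1_C
assert all(indicator(A)(w) <= indicator(C)(w) for w in omega)
```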
  • Simple Random Variable


    Definition: Simple Random Variables

    For a family \( A_1, A_2, \ldots, A_n \) of disjoint events and numbers \( \alpha_1, \ldots, \alpha_n \), we can define the simple random variable

    \[ X(\omega) = \sum_{k=1}^n \alpha_k 1_{A_k}(\omega) = \begin{cases} \alpha_k & \text{if } \omega \in A_k, \\ 0 & \text{otherwise} \end{cases} \]

    According to the previous proposition, it follows that \( X \) is also a random variable.

    Note that, intuitively, sums and products of simple random variables remain simple random variables; however, to show it, one has to be careful and rewrite both random variables over a common family of disjoint events.

  • Plot


    (Figure: plot of a simple random variable.)

Example: Random Variable on Finite State Space

Let \( \Omega = \{\omega_1, \omega_2, \ldots, \omega_N\} \) be a finite state space and \( \sigma \)-algebra \( \mathcal{F} = 2^\Omega \). We consider a financial market with one stock \( S \) where \( S_0 > 0 \) denotes the price today and \( S_1 \) represents the possible price of the stock tomorrow. The possible evolution for the stock is given as a function:

\[ \begin{equation*} \begin{split} S_1\colon \Omega & \longrightarrow [0, \infty)\\ \omega_n &\longmapsto S_1(\omega_n) = s_n \end{split} \end{equation*} \]

We can also write the stock price function as a simple random variable (showing therefore that it is a random variable):

\[ S_1 = \sum_{n=1}^N s_n 1_{A_n} \]

where \( A_n = \{\omega_n\} \). In other terms, the stock price is entirely given by the vector \( (s_1, \ldots, s_N) \). Without any loss of generality, since we have one stock, we may assume that \( s_1 < s_2 < \ldots < s_N \). Since the stock price is positive, we also have \( 0 \leq s_1 \). The returns \( R_1 = \frac{S_1 - S_0}{S_0} \) are also a random variable that can be described as a vector \( (r_1, \ldots, r_N) \), where

\[ r_n = \frac{s_n - S_0}{S_0} \]
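As a small numerical illustration (with hypothetical prices), the stock and its returns are just vectors indexed by the states:

```python
# Hypothetical three-state market: price today S0, possible prices s_n
# tomorrow, and the induced returns r_n = (s_n - S0)/S0.
S0 = 100.0
s = [90.0, 100.0, 115.0]                 # the vector describing S1
r = [(sn - S0) / S0 for sn in s]         # the vector describing R1
print(r)                                 # [-0.1, 0.0, 0.15]
```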

Probability Measure

Definition: Probability Measure

A probability measure \( P \) on the measurable space \( (\Omega, \mathcal{F}) \) is a function \( P: \mathcal{F} \to [0,1] \) that associates to each event \(A\) the likelihood of this event.

It has the following basic properties:

  • \( P[\emptyset] = 0 \) and \( P[\Omega] = 1 \): clearly, the probability that nothing happens is \(0\) and the probability that anything happens is \(1\).
  • \( P[A \cup B] = P[A] + P[B] \) if \(A\) and \(B\) are two disjoint events. The countable version of this property is in fact assumed, that is, \( P[\cup A_n] = \sum P[A_n] \) for every sequence of pairwise disjoint events \( (A_n) \subseteq \mathcal{F} \).

The triple \( (\Omega, \mathcal{F}, P) \) is called a probability space.

The assumptions for a probability measure are few, however together with the definition of the algebra we can rapidly derive classical properties that are common knowledge.

Lemma

Let \( P \) be a probability measure. For any events \( A \), \( B \), or sequence \( (A_n) \) of events, the following hold:

  • \( P[B] = P[A] + P[B \setminus A] \geq P[A] \) whenever \( A \subseteq B \);
  • \( P[A^c] = 1 - P[A] \);
  • \( P[A \cup B] + P[A \cap B] = P[A] + P[B] \);
  • If \( A_1 \subseteq A_2 \subseteq \ldots \subseteq A_n \subseteq \ldots \), then:

    \[ P\left[ \cup A_n \right] = \lim P[A_n] \]
  • If \( A_1 \supseteq A_2 \supseteq \ldots \supseteq A_n \supseteq \ldots \), then:

    \[ P\left[ \cap A_n \right] = \lim P[A_n] \]

    In particular, it equals \( 0 \) if \( \cap A_n = \emptyset \).

Proof

We prove some of the points, leaving the others as an exercise.

For the first point, let \( A \subseteq B \). We have \( B = A \cup (B \setminus A) \), where this union is disjoint. By the second property of a probability measure and the positivity of probability:

\[ P[B] = P[A \cup (B \setminus A)] = P[A] + P[B \setminus A] \geq P[A] \]

Taking \( B = \Omega \), and using \( P[\Omega] = 1 \), the second point follows.

Using similar arguments, prove the third point.

For the fourth point, construct the sequence of disjoint sets:

\[ B_1 = A_1, \quad B_2 = A_2 \setminus A_1, \quad \ldots, \quad B_n = A_n \setminus A_{n-1} \]

By induction, it is easy to show:

\[ A_n = \cup_{k=1}^n A_k = \cup_{k=1}^n B_k, \quad \text{and} \quad \cup_n A_n = \cup_n B_n \]

By additivity of the probability measure:

\[ P[A_n] = P\left[ \cup_{k=1}^n A_k \right] = P\left[ \cup_{k=1}^n B_k \right] = \sum_{k=1}^n P[B_k] \nearrow \sum_{k=1}^\infty P[B_k] \]

Thus:

\[ \lim P[A_n] = \sum_{k=1}^\infty P[B_k] \]

By the second property of a probability measure:

\[ P\left[ \cup_{k=1}^\infty A_k \right] = P\left[ \cup_{k=1}^\infty B_k \right] = \sum_{k=1}^\infty P[B_k] \]

Combining these equations shows \( \lim P[A_n] = P[\cup A_n] \).

Follow similar reasoning to prove the last point.

Note: Shorthand Notations in Probability

In probability theory, the following shorthand notations are commonly used:

\[ P[X \in B] := P[\{\omega \in \Omega : X(\omega) \in B\}], \quad P[X = x] := P[\{\omega \in \Omega : X(\omega) = x\}] \]
\[ P[X \leq x] := P[\{\omega \in \Omega : X(\omega) \leq x\}], \quad \ldots \]

Examples

  1. Probability on Finite Sets: Suppose \( \Omega = \{\omega_1, \ldots, \omega_N\} \) is finite. Each probability measure \( P \) on \( \mathcal{F} = 2^\Omega \) is entirely determined by the values \( p_n = P[\{\omega_n\}] \) for \( n = 1, \ldots, N \) (see the sketch after these examples). Indeed, every event \(A\) is of the form \(A=\{\omega_n\colon n \in I\}\) for some \(I\subseteq \{1, \ldots, N\}\). It follows that

    \[ P[A] = \sum_{\omega \in A} P[\{\omega\}] = \sum_{n \in I} p_n \]

    This vector \(\boldsymbol{p}=(p_1, \ldots, p_N) \) has the property that \(p_n = P[\{\omega_n\}]\geq 0\) and \(\sum p_n =P[\Omega] = 1\).

    Conversely, if you give yourself a vector \(\boldsymbol{p}=(p_1, \ldots, p_N)\) with \(p_n \geq 0\) and \(\sum p_n = 1\), it defines a probability \(P\) on \(\mathcal{F}\) with the definition

    \[ P[A]:=\sum_{n \in I} p_n \]

    where \(A = \{\omega_n \colon n \in I\}\). As an exercise, verify that this defines a probability measure.

    The set of such vectors is denoted by

    \[ \Delta := \left\{ \boldsymbol{p} \in \mathbb{R}^N \colon p_n \geq 0, \, \sum p_n = 1 \right\} \]

    An important case is when \( p_n = 1/N \) for all \( n \). This is called the uniform probability distribution.

  2. Probability on the Coin Toss Space: Let \( \Omega = \{\omega = (\omega_1, \ldots, \omega_T) : \omega_t = \pm 1\} \), a finite state space. Assuming the probability of heads is \( p \) and coin tosses are independent, the probability is:

    \[ P[\{\omega = (\omega_1, \ldots, \omega_T)\}] = p^l q^{T-l} \]

    where \( q = 1 - p \) and \( l \) is the number of times \( \omega_t = 1 \) for \( t = 1, \ldots, T \).

  3. Normal Distribution: For \( \Omega = \mathbb{R} \) and \( \mathcal{F} \) the \( \sigma \)-algebra of \( \mathbb{R} \) generated by intervals, define for any event \(A\) the probability

    \[ P[A] = \frac{1}{\sigma \sqrt{2\pi}} \int_A e^{-\frac{(x-\mu)^2}{2\sigma^2}} \lambda(dx) \]

    where \( \lambda \) is the Lebesgue measure on \( \mathbb{R} \), the one measuring intervals. This is the normal distribution. For example, temperatures in Shanghai at this time of year may follow a normal distribution around 24°C with variance 1.
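A minimal Python sketch of the first two examples (with hypothetical weights): a finite probability measure is a vector in the simplex, and the coin-toss weights \(p^l q^{T-l}\) indeed sum to \(1\) over all \(2^T\) sequences.

```python
from itertools import product

# Example 1: P[A] sums the weights of the states in A.
p = {"w1": 0.2, "w2": 0.5, "w3": 0.3}     # hypothetical vector in the simplex
P = lambda A: sum(p[w] for w in A)
print(P({"w1", "w3"}))                     # 0.5
assert abs(P(p.keys()) - 1.0) < 1e-12      # P[Omega] = 1

# Example 2: coin-toss space, P[{omega}] = p**l * q**(T-l) with q = 1 - p.
T, ph = 4, 0.6                             # hypothetical horizon and head probability

def weight(omega):
    l = sum(w == 1 for w in omega)         # number of heads in the sequence
    return ph ** l * (1 - ph) ** (T - l)

total = sum(weight(omega) for omega in product([1, -1], repeat=T))
print(total)                               # 1.0 up to floating point
```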

Integration

The historical idea behind integration was to measure areas below a function. The expectation in probability brings exactly the same intuition to this more abstract level.

Consider the simple example of the indicator function \(1_A\): it represents a rectangle of height \(1\) whose width is the measure of \(A\), that is, \(P[A]\). Hence, the area of the rectangle, or expectation of the indicator function, is given by \(E[1_A]=1 \times P[A]\).

Extending this concept is straightforward for any positive simple random variable.

  • Integration of Simple Random Variable


    Definition: Expectation 1.0

    Let \((\Omega,\mathcal{F},P)\) be a probability space. Given a simple random variable

    \[ X = \sum_{k\leq n} \alpha_k 1_{A_k} \]

    we define the expectation of \(X\) with respect to \(P\) as

    \[ E[X]:=\sum_{k\leq n} \alpha_k P[A_k] \]
  • Plot


    (Figure: expectation of a simple random variable as an area.)

Warning

One needs to be careful that this definition is independent of the representation of the simple random variable. Indeed, we have \(X= 1_A + 1_B = 1_{A\cup B}\) if \(A\) and \(B\) are disjoint for instance. Luckily, by the properties of the probability measure, this random variable has the same expectation for the two representations.

Proposition

The two following important properties of the expectation of simple random variables can be rapidly checked.

  • Monotonicity: \(E[X]\leq E[Y]\) whenever \(X\leq Y\).
  • Linearity: \(E[aX+bY]=aE[X]+bE[Y]\).

The proof is easy and left to you.

Exercise

Given a simple random variable \(X\) show that

  1. If \(X\) is positive, then \(E[X]>0\) if and only if \(P[X>0] >0\).
  2. If \(X\) is positive, then \(E[X] = 0\) if and only if \(P[X = 0]=1\).

We can now define the expectation of an arbitrary positive random variable. The idea is to approximate the random variable from below by simple ones and pass to the limit.

  • First Approximation


    (Figure: first approximation by a simple random variable.)

  • Second Approximation


    (Figure: second, finer approximation.)

Note

Though the definition of the expectation does not rely on the explicit construction of an approximating sequence, it is possible to formalize the idea in the picture.

Given a random variable \(X\), the strategy is as follows: For every natural number \(n\), divide the ever-growing vertical interval \([0, n)\) into \(2^n\) subintervals \(\left[k \frac{n}{2^n}, (k+1)\frac{n}{2^n}\right)\) for \(k=0, \ldots, 2^n-1\). Define now

\[ \alpha_k^n = k \frac{n}{2^n} \quad \text{and}\quad A_k^n = \left\{ k\frac{n}{2^n} \leq X < (k+1)\frac{n}{2^n} \right\} \]

It follows that the sequence \((X_n)\) of simple random variables defined as

\[ X_n = \sum_{k=0}^{2^n-1} \alpha_k^n 1_{A_k^n} \]

is increasing and converges to \(X\).
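A small Python sketch of this construction, evaluating \(X_n\) at a single hypothetical value \(x = X(\omega)\); the printed values approach \(x\) from below as the grid refines:

```python
# Level-n approximation: X_n takes the value k*n/2**n on the event
# A_k^n = {k*n/2**n <= X < (k+1)*n/2**n}, and 0 outside [0, n).
def approximate(x, n):
    step = n / 2 ** n                        # width of the subintervals
    k = int(x // step)
    return k * step if k < 2 ** n else 0.0

x = 2.718                                    # hypothetical value X(omega)
for n in [2, 4, 8, 16]:
    print(n, approximate(x, n))              # 0.0, 2.5, 2.6875, 2.7177...
```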

Definition: Expectation 1.5

Given a positive random variable \(X\), its expectation is defined as

\[ E[X] := \sup \left\{ E[Y] \colon Y\text{ simple random variable and } Y\leq X \right\} \]

This is well defined but possibly equal to \(\infty\). For two positive random variables \(X\) and \(Y\) and positive numbers \(a\) and \(b\), it also holds that \(E[aX + bY] = aE[X] + b E[Y]\), as well as \(E[X]\leq E[Y]\) whenever \(X\leq Y\).

To consider general random variables, we need to assume integrability.

Definition: Expectation 2.0

A random variable \(X\) is called integrable if \(E[X^+]<\infty\) and \(E[X^-]<\infty\), where \(X^+ = \max(X, 0)\) and \(X^- = \max(-X, 0)\) denote the positive and negative parts of \(X\). The expectation of an integrable random variable is then defined as

\[E[X] = E[X^+]-E[X^-]\]

On the set of integrable random variables, which is a vector space, the expectation is also linear and monotone.

The following fundamental theorem is due to Lebesgue. It tells under which conditions it is possible to swap limit and expectation.

Theorem

Let \((X_n)\) be a sequence of random variables. The following holds true

  1. Monotone Convergence: If \((X_n)\) are positive and increasing, that is, \(X_1\leq X_2 \leq \cdots\), it holds that

    \[ \sup E[X_n] = \lim E[X_n] = E[\sup X_n] = E[\lim X_n] \]
  2. Fatou's Lemma: If \((X_n)\) are positive then it holds

    \[ E\left[ \liminf X_n \right]:=E\left[ \sup_n \inf_{k\geq n} X_k\right] \leq \liminf E[X_n] \]
  3. Lebesgue's Dominated Convergence: If \(X_n(\omega) \to X(\omega)\) for all \(\omega\) (convergence in probability is actually enough) and \(|X_n|\leq Y\) for some integrable random variable \(Y\), then it holds

    \[ \lim E[X_n] = E[\lim X_n] = E[X] \]
Proof

We start with the monotone convergence. Write \(X = \sup X_n = \lim X_n\).

By monotonicity, we clearly have \(E[X_n]\leq E[X]\) for every \(n\), therefore \(\sup E[X_n]\leq E[X]\).

Conversely, suppose that \(E[X]<\infty\), and pick \(\varepsilon>0\) and a positive simple random variable \(Y\) such that \(Y\leq X\) and \(E[X]-\varepsilon\leq E[Y]\). For \(0<c<1\), define the sets \(A_n=\{X_n\geq cY\}\). Since \(X_n\) is increasing to \(X\), it follows that \((A_n)\) is an increasing sequence of events. Furthermore, since \(cY\leq Y\leq X\) and \(cY<X\) on \(\{X>0\}\), it follows that \(\cup A_n=\Omega\). By positivity of \(X_n\) and monotonicity, it follows that

\[ cE[1_{A_n}Y]\leq E[1_{A_n}X_n]\leq E[X_n] \]

and so

\[ c\sup E[1_{A_n}Y]\leq \sup E[X_n] \]

Since \(Y=\sum_{l\leq k} \alpha_l 1_{B_l}\) for positive numbers \(\alpha_1,\ldots,\alpha_k\) and events \(B_1,\ldots, B_k\), it follows that

\[ E\left[ 1_{A_n}Y \right]=\sum_{l\leq k}\alpha_l P[A_n\cap B_l]. \]

However, since \(P\) is a probability measure and \(A_n\) is increasing to \(\Omega\), it follows from the continuity from below of probability measures (fourth point of the lemma above) that \(P[A_n\cap B_l]\nearrow P[\Omega\cap B_l]=P[B_l]\), and so

\[ \sup E[1_{A_n}Y]=\sum_{l\leq k}\alpha_l \sup P[A_n\cap B_l]=\sum \alpha_l P[B_l]=E[Y]. \]

Consequently

\[ E[X]\geq \lim E[X_n]=\sup E[X_n]\geq cE[Y] \geq cE[X]-c\varepsilon \]

which, by letting \(c\) converge to \(1\) and \(\varepsilon\) to \(0\), yields the result.

The case where \(E[X]=\infty\) is similar and left to the reader.

As for Fatou's lemma, define \(Y_n =\inf_{k\geq n} X_k\), which by assumption is an increasing sequence of positive random variables. It follows from monotone convergence that

\[ \sup_n E\left[ Y_n \right] = E[\sup_n Y_n] = E\left[\sup_n \inf_{k\geq n}X_k\right] = E[\liminf X_n] \]

On the other hand, it clearly holds that \(X_k \geq Y_n\) for every \(k\geq n\) and therefore \(\inf_{k\geq n} E[X_k] \geq E[Y_n]\). Combined with the previous inequality we get

\[ E[\liminf X_n] = \sup_n E[Y_n]\leq \sup_n \inf_{k\geq n}E[X_k] = \liminf E[X_n] \]

As for the dominated convergence of Lebesgue, we have by assumption that \((X_n+Y)\) is a sequence of positive random variables, which by Fatou's lemma yields

\[ E[X+Y] =E\left[\liminf (X_n +Y) \right] \leq \liminf E[X_n] +E[Y] \]

Similarly, \((Y - X_n)\) is a sequence of positive random variables, for which it also holds that

\[ E[Y-X] = E\left[\liminf (Y - X_n)\right] \leq E[Y] +\liminf\left( -E[X_n] \right) = E[Y] - \limsup E[X_n] \]

Combining both inequalities yields

\[ \limsup E[X_n] \leq E[X] \leq \liminf E[X_n] \]

Since \(\liminf E[X_n] \leq \limsup E[X_n]\) always holds, these inequalities are in fact equalities; the limit therefore exists and \(E[X] = \lim E[X_n]\).

Example

  • Integration for the simple coin toss: Let \(\Omega =\{\omega_1,\omega_2\}\) and \(p=P[\{\omega_1\}]\) and \(q=(1-p)\).
    Every random variable \(X:\Omega \to \mathbb{R}\) is entirely determined by the values \(X(\omega_1) = x_1\) and \(X(\omega_2)=x_2\).
    It follows that

    \[ E[X]=pX(\omega_1)+qX(\omega_2) = p x_1 + (1-p)x_2 \]
  • Integration in the finite state case: Let \(\Omega=\{\omega_1,\ldots,\omega_N\}\) be a finite state space.
    The probability measure is entirely given by the vector \(\boldsymbol{p}=(p_1,\ldots,p_N)\in \mathbb{R}^N\), where \(p_n=P[\{\omega_n\}]\geq 0\) and \(\sum p_n=1\). Every random variable \(X:\Omega \to \mathbb{R}\) can be seen as a vector \(\boldsymbol{x} \in \mathbb{R}^N\), where \(x_n=X(\omega_n)\). It follows that the expectation of \(X\) under \(P\) is given by

    \[ E[X]=\sum p_n X(\omega_n)=\sum p_n x_n=\boldsymbol{p}\cdot \boldsymbol{x} \]

    In other terms, the expectation of \(X\) boils down to the scalar product of the probability vector \(\boldsymbol{p}\) with the vector of values \(\boldsymbol{x}\) of the random variable.
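A one-line numerical check (with hypothetical vectors \(\boldsymbol{p}\) and \(\boldsymbol{x}\)) that the expectation is the scalar product \(\boldsymbol{p}\cdot \boldsymbol{x}\):

```python
# E[X] = sum_n p_n * x_n on a finite state space.
p = [0.2, 0.5, 0.3]                      # hypothetical probability vector
x = [-1.0, 0.0, 2.0]                     # values x_n = X(omega_n)
E = sum(pn * xn for pn, xn in zip(p, x))
print(E)                                 # 0.2*(-1) + 0.5*0 + 0.3*2 = 0.4
```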

Measure Change

The concept of the expectation of a random variable \( E[X] \) depends, by definition, on the probability measure \( P \). We should therefore write \( E^P[X] \) to signify this dependence. If, on the same measurable space \( (\Omega, \mathcal{F}) \), we are given another probability \( Q \), the question arises: how is \( E^P[X] \) related to \( E^Q[X] \)?

Remark

Before diving into this question, let us first see how, starting from a probability \( P \), we can define a new probability \( Q \). Suppose we are given a random variable \( Z \) such that:

  1. \( Z \) is positive.
  2. \( E^P[Z] = 1 \).

We can define the function:

\[ \begin{aligned} Q \colon \mathcal{F} & \longrightarrow [0,1] \\ A & \longmapsto Q[A] = E^P[Z \cdot 1_A] \end{aligned} \]

This function, for any event \( A \), returns the expectation of \( Z \) over \( A \). It turns out that this function, under the assumptions on \( Z \), defines a new probability measure.
Specifically:

  • \( Q[\emptyset] = E^P[Z \cdot 1_\emptyset] = E^P[0] = 0 \),
  • \( Q[\Omega] = E^P[Z \cdot 1_\Omega] = E^P[Z] = 1 \).

Additivity also holds: for any two disjoint events \( A \) and \( B \), \( 1_{A \cup B} = 1_A + 1_B \). Hence:

\[ Q[A \cup B] = E^P[Z \cdot 1_{A \cup B}] = E^P[Z \cdot 1_A] + E^P[Z \cdot 1_B] = Q[A] + Q[B]. \]
Warning

To fully define \( Q \) as a probability measure, you must also check \(\sigma\)-additivity. That is, for every sequence \( (A_n) \) of pairwise disjoint events, it must hold:

\[ Q\left[\bigcup A_n\right] = \sum Q[A_n]. \]

Define the random variables \( X_n = Z \cdot 1_{\cup_{k \leq n} A_k} = Z \cdot \left( \sum_{k \leq n} 1_{A_k} \right) \) and let \( X = Z \cdot 1_{\cup A_n} \). Since \( |X_n| \leq Z \), where \( Z \) is integrable, dominated convergence implies:

\[ \lim E^P[X_n] = E^P[X]. \]

Meanwhile:

\[ E^P[X_n] = \sum_{k \leq n} E^P[Z \cdot 1_{A_k}] = \sum_{k \leq n} Q[A_k], \]

and \( E^P[X] = Q\left[\bigcup A_n\right] \).

Hence, any positive random variable \( Z \) with expectation 1 under \( P \) defines a new probability measure \( Q \).

Furthermore, for any bounded random variable \( X \), it holds that:

\[ E^Q[X] = E^P[Z \cdot X]. \]

To see this, consider a simple random variable \( X = \sum \alpha_k \cdot 1_{A_k} \):

\[ \begin{aligned} E^Q[X] &= \sum \alpha_k Q[A_k] \\ &= \sum \alpha_k E^P[Z \cdot 1_{A_k}] \\ &= E^P[Z \cdot X]. \end{aligned} \]

The general case follows by approximating \( X \) with simple random variables.

Additionally, \( Q \) is dominated by \( P \) in the sense that \( P[A] = 0 \) implies \( Q[A] = E^P[Z \cdot 1_A] = 0 \).

From this, we see that a positive random variable \( Z \) with expectation 1 allows us to define a new probability \( Q \), dominated by \( P \), and connects expectations under \( Q \) to those under \( P \). The challenging and powerful task is to establish the reciprocal relationship. The key lies in the concepts of absolute continuity or equivalence between probability measures, and the Radon-Nikodym Theorem.

Definition

Given two probability measures \( P \) and \( Q \), we define:

  1. \( Q \) is absolutely continuous with respect to \( P \) (\( Q \ll P \)) if:

    \[ P[A] = 0 \quad \text{implies} \quad Q[A] = 0. \]
  2. \( Q \) is equivalent to \( P \) (\( Q \sim P \)) if both \( Q \ll P \) and \( P \ll Q \), i.e.:

    \[ P[A] = 0 \quad \text{if and only if} \quad Q[A] = 0. \]

By definition:

\[ Q \ll P \quad \text{if and only if} \quad P[A] = 1 \text{ implies }Q[A] = 1, \]

or equivalently:

\[ Q \ll P \quad \text{if and only if} \quad Q[A] > 0 \text{ implies }P[A] > 0. \]

In the equivalent case:

\[ Q \sim P \quad \text{if and only if} \quad P[A] = 1 \text{ if and only if } Q[A] = 1, \]

or equivalently:

\[ Q \sim P \quad \text{if and only if} \quad P[A] > 0 \text{ if and only if } Q[A] > 0. \]

Absolute continuity implies that events unlikely under \( P \) are also unlikely under \( Q \). Equivalence means that \( P \) and \( Q \) agree on which sets are unlikely.

Radon-Nikodym Theorem

On a measurable space \( (\Omega, \mathcal{F}) \), if a probability measure \( Q \) is absolutely continuous with respect to another probability measure \( P \), there exists a (\( P \)-almost surely) unique random variable \( Z \) such that:

\[ \begin{aligned} Z &\geq 0, \\ E^P[Z] &= 1, \\ E^Q[X] &= E^P[Z \cdot X] \quad \text{ for any bounded random variable } X. \end{aligned} \]

This unique random variable is called the density of \( Q \) with respect to \( P \) and is denoted \( \frac{dQ}{dP} \).

The notation \( \frac{dQ}{dP} \) is cosmetic; it does not represent a literal ratio. It simplifies expressions such as:

\[ E^P\left[ \frac{dQ}{dP} \cdot X \right] = \int X \frac{dQ}{dP} \, dP = \int X \, dQ = E^Q[X]. \]

This theorem underpins many results in stochastic processes and finance, such as the Black-Scholes-Merton formula. However, proving it requires knowledge of functional analysis, which is beyond this lecture's scope. The proof is simpler in a finite state space.

Exercise

Let \( \Omega = \{\omega_1, \ldots, \omega_n\} \) be a finite state space with \( \sigma \)-algebra \( \mathcal{F} = 2^\Omega \). Suppose \( P \) is a probability measure given by \( \boldsymbol{p} = (p_1, \ldots, p_n) \), where \( P[\{\omega_i\}] = p_i > 0 \) and \( \sum p_i = 1 \). Let \( Q \) be another probability measure on \( (\Omega, \mathcal{F}) \) given by \( \boldsymbol{q} = (q_1, \ldots, q_n) \), where \( Q[\{\omega_i\}] = q_i \geq 0 \) and \( \sum q_i = 1 \).

Since \( P[A] = 0 \) implies \( A = \emptyset \), it follows that \( Q[A] = Q[\emptyset] = 0 \).
Hence, \( Q \ll P \).

Find a random variable \( \frac{dQ}{dP} \colon \Omega \to \mathbb{R} \) such that \( \frac{dQ}{dP} \geq 0 \), \( E^P\left[\frac{dQ}{dP}\right] = 1 \), and:

\[ E^Q[X] = E^P\left[\frac{dQ}{dP} \cdot X\right] \]

for every random variable \( X \colon \Omega \to \mathbb{R} \). Show that \( \frac{dQ}{dP} \) is unique.

In this finite setting, \( \frac{dQ}{dP} \) can be represented by a vector \( \boldsymbol{z} = (z_1, \ldots, z_n) \) with \( z_i = \frac{dQ}{dP}(\omega_i) \). The conditions reduce to finding \( \boldsymbol{z} \) such that \( z_i \geq 0 \), \( \sum z_i p_i = 1 \), and for every vector \( \boldsymbol{x} = (x_1, \ldots, x_n) \):

\[ \sum x_i q_i = E^Q[X] = E^P\left[\frac{dQ}{dP} \cdot X\right] = \sum x_i z_i p_i. \]
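A numerical sketch of this exercise (with hypothetical vectors \(\boldsymbol{p}\) and \(\boldsymbol{q}\)): since every \(p_i > 0\), the natural candidate \(z_i = q_i / p_i\) satisfies all three conditions.

```python
# Candidate density z_i = q_i / p_i in the finite-state setting.
p = [0.25, 0.25, 0.5]                    # hypothetical P, all p_i > 0
q = [0.1, 0.4, 0.5]                      # hypothetical Q
z = [qi / pi for qi, pi in zip(q, p)]

E_P = lambda v: sum(vi * pi for vi, pi in zip(v, p))
assert all(zi >= 0 for zi in z)          # z >= 0
assert abs(E_P(z) - 1.0) < 1e-12         # E^P[dQ/dP] = 1

x = [3.0, -1.0, 2.0]                     # an arbitrary random variable X
E_Q = sum(xi * qi for xi, qi in zip(x, q))
assert abs(E_P([xi * zi for xi, zi in zip(x, z)]) - E_Q) < 1e-12
print(z)                                 # [0.4, 1.6, 1.0]
```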

Independence

A fundamental concept in probability, distinct from general measure theory, is independence. Intuitively, two events \( A \) and \( B \) are independent if their probability of joint occurrence equals the product of their respective probabilities.

This concept can be extended to random variables and families of events, with significant implications for results in probability theory.

Definition

Given a probability space \( (\Omega, \mathcal{F}, P) \):

  1. Two events \( A \) and \( B \) are called independent if:

    \[ P[A \cap B] = P[A] P[B]. \]
  2. Two families of events \( \mathcal{C} \) and \( \mathcal{D} \) are independent if any event \( A \) in \(\mathcal{C}\) is independent of any event \( B \) in \(\mathcal{D}\).

  3. Two random variables \( X \) and \( Y \) are independent if the \(\sigma\)-algebras generated by their information,

    \[ \sigma(X) = \sigma(\{X \leq x\} : x \in \mathbb{R}) \quad \text{and} \quad \sigma(Y) = \sigma(\{Y \leq x\} : x \in \mathbb{R}), \]

    are independent.

  4. A collection of families of events \( \mathcal{C}^i \) (with \( i \) indexing the families) is independent if for every finite selection of events \( A^{i_1}, \ldots, A^{i_n} \), where \( A^{i_k}\) is in \(\mathcal{C}^{i_k} \), it holds that:

    \[ P\left[ A^{i_1} \cap \cdots \cap A^{i_n} \right] = \prod_{k=1}^n P[A^{i_k}]. \]
  5. A family (or sequence) of random variables \( (X_i) \) is independent if the family of \(\sigma\)-algebras \( \sigma(X_i) \) is independent.

Warning

The first three points focus on pairwise independence for events, families, or random variables. However, for collections with more than two elements, pairwise independence is insufficient. For example, a sequence of random variables requires a stronger notion of independence that accounts for all finite subsets.

Exercise

Consider a four-element probability space \( \Omega = \{\omega_1, \omega_2, \omega_3, \omega_4\} \) with uniform probability \( P[\{\omega_i\}] = \frac{1}{4} \). Construct three events \( A_1 \), \( A_2 \), and \( A_3 \) such that:

  • \( A_1 \) is independent of \( A_2 \),
  • \( A_1 \) is independent of \( A_3 \),
  • \( A_2 \) is independent of \( A_3 \),
  • but \( A_1 \), \( A_2 \), and \( A_3 \) together are not independent.

Formally:

\[ \begin{aligned} P[A_1 \cap A_2] &= P[A_1] P[A_2], \\ P[A_1 \cap A_3] &= P[A_1] P[A_3], \\ P[A_2 \cap A_3] &= P[A_2] P[A_3], \\ P[A_1 \cap A_2 \cap A_3] &\neq P[A_1] P[A_2] P[A_3]. \end{aligned} \]

If you struggle, ask ChatGPT—it can handle this.
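Alternatively, here is one classical construction (a spoiler for the exercise), checked numerically in Python:

```python
from itertools import combinations

# On the uniform four-point space, these events are pairwise
# independent but not independent as a triple.
omega = {1, 2, 3, 4}
P = lambda A: len(A) / 4                      # uniform probability
A1, A2, A3 = {1, 2}, {1, 3}, {1, 4}

for A, B in combinations([A1, A2, A3], 2):
    assert P(A & B) == P(A) * P(B)            # pairwise independence
print(P(A1 & A2 & A3), P(A1) * P(A2) * P(A3)) # 0.25 vs 0.125
```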

Independence is a strong assumption, but it depends on the probability measure. Even if two events are independent under a specific \( P \), independence might fail under a different measure. This concept is crucial in foundational results such as the law of large numbers and the central limit theorem, which are cornerstones of Monte Carlo methods.

Let us now present a proposition related to independent random variables, which will be further explored in the context of stochastic processes and conditional expectations.

Proposition

Let \( X \) and \( Y \) be two independent bounded random variables. Then:

\[ E[X Y] = E[X] E[Y]. \]

Proof sketch

Consider the case where \( X = 1_A \) and \( Y = 1_B \) are indicator functions.
Independence of \( X \) and \( Y \) implies that \( A \) and \( B \) are independent.
Hence:

\[ E[XY] = E[1_A 1_B] = E[1_{A \cap B}] = P[A \cap B] = P[A] P[B] = E[X] E[Y]. \]

This reasoning extends easily to simple random variables, as the \(\sigma\)-algebras generated by \( X \) and \( Y \) correspond to the events on which they are defined.

For the general case, approximate \( X \) and \( Y \) by sequences of simple random variables \( (X_n) \) and \( (Y_n) \), and use the properties of independence and limits of expectations.
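A quick Monte Carlo illustration (not a proof) of this proposition, with two independently simulated random variables:

```python
import random

random.seed(0)
n = 100_000
xs = [random.uniform(0, 1) for _ in range(n)]        # X uniform on [0, 1]
ys = [random.choice([-1.0, 1.0]) for _ in range(n)]  # Y = +/-1, independent of X

mean = lambda v: sum(v) / len(v)
print(mean([x * y for x, y in zip(xs, ys)]))  # E[XY], close to 0.5 * 0 = 0
print(mean(xs) * mean(ys))                    # E[X]E[Y], also close to 0
```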

Conditional Expectation

The conditional expectation is the first step towards stochastic processes. It is basically the best approximation, in terms of expectation, given some information. In other terms, given a sub-\(\sigma\)-algebra of events \(\mathcal{G}\subseteq \mathcal{F}\), what is the best approximation of \(X\), in expectation, knowing the events in \(\mathcal{G}\)?

Conditional Expectation

Let \((\Omega, \mathcal{F}, P)\) be a probability space, \(X\) a random variable and \(\mathcal{G}\subseteq \mathcal{F}\) a \(\sigma\)-algebra.

Then, there exists a unique (up to \(P\)-almost sure equality) random variable \(Y\) with the properties

  1. \(Y\) is \(\mathcal{G}\)-measurable;
  2. \(E[Y1_A] = E[X1_A]\) for any event \(A\) in \(\mathcal{G}\).
Proof

The proof of the theorem is a consequence of the Radon-Nikodym theorem. Indeed, define the measures \(Q^+\) and \(Q^-\) by

\[ \begin{equation*} \begin{split} Q^\pm \colon \mathcal{G} &\longrightarrow [0, \infty)\\ A & \longmapsto Q^\pm[A] = E[X^\pm 1_A] \end{split} \end{equation*} \]

which are measures defined on the sub-\(\sigma\)-algebra of events \(\mathcal{G}\). These measures are absolutely continuous with respect to \(P\), and therefore there exist unique \(\mathcal{G}\)-measurable densities \(dQ^\pm/dP\).

Defining

\[ Y = \frac{dQ^+}{dP} - \frac{dQ^-}{dP} \]

gives a unique \(\mathcal{G}\)-measurable random variable satisfying by definition the expectation property.

Since the random variable satisfying the two conditions is unique up to almost sure equality, we can therefore use it as a definition.

Conditional Expectation

The conditional expectation of a random variable \(X\) with respect to \(\mathcal{G}\) is denoted by \(E[X|\mathcal{G}]\) and is defined as the unique random variable which is \(\mathcal{G}\)-measurable and such that \( E[ E[X |\mathcal{G}]1_A] = E[X 1_A]\) for all events \(A\) in \(\mathcal{G}\).

The conditional expectation shares most of the properties of the traditional expectation.

Proposition

Let \(X\) be a random variable and \(\mathcal{G} \subseteq \mathcal{F}\) a sub-\(\sigma\)-algebra. It holds that

  • Expectation: \(E[E[X|\mathcal{G}]] = E[X]\)
  • Conditional Linearity: \(E[Y X + Z |\mathcal{G}] = Y E[X |\mathcal{G}] + Z\) for any random variables \(Y\) and \(Z\) which are \(\mathcal{G}\)-measurable.
  • Tower Property: \(E[E[X | \mathcal{G}_2] | \mathcal{G}_1] = E[X|\mathcal{G}_1]\) if \(\mathcal{G}_1\subseteq \mathcal{G}_2\);
  • Trivial: \(E[X |\mathcal{F}_0] = E[X]\) if \(\mathcal{F}_0 = \{\emptyset, \Omega\}\);
  • Independence: \(E[X | \mathcal{G}] = E[X]\) if \(X\) is independent of \(\mathcal{G}\).
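To make this concrete on a finite state space: when \(\mathcal{G}\) is generated by a partition, \(E[X|\mathcal{G}]\) is, on each atom of the partition, the \(P\)-weighted average of \(X\) over that atom. A minimal sketch (with hypothetical weights and values):

```python
# Conditional expectation on a finite space, G generated by a partition.
p = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}    # hypothetical probability weights
X = {1: 10.0, 2: 0.0, 3: 5.0, 4: 1.0}   # hypothetical random variable
partition = [{1, 2}, {3, 4}]            # atoms generating G

def cond_exp(w):
    atom = next(B for B in partition if w in B)
    pa = sum(p[v] for v in atom)                  # P[atom]
    return sum(X[v] * p[v] for v in atom) / pa    # average of X on the atom

# Defining property E[E[X|G] 1_A] = E[X 1_A] on the atoms A of G.
for A in partition:
    lhs = sum(cond_exp(w) * p[w] for w in A)
    rhs = sum(X[w] * p[w] for w in A)
    assert abs(lhs - rhs) < 1e-12
print({w: round(cond_exp(w), 4) for w in p})
```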