Brownian Motion
Imagine that a pollen particle is suspended in a glass of water. If we were to observe and record the vertical position of the particle over time, we would find that its movements were random. And if we were to plot this position, we’d get a jagged path through time (Figure 1 1 1 , left). This path would be just one of many possible paths, and if we were to repeat this observational experiment many times, we would not expect to see the same path again. Given this randomness, how can we reason about this phenomenon? Can we say anything interesting or useful about the particle? For most of human history, this was a seemingly impossible task. A key insight, a conceptual pillar in probability theory, is to separate what actually happened (Figure 1 1 1 , left) from other possible outcomes (Figure 1 1 1 , right). This approach allows us to reason about the world through counterfactuals: what are all the possible paths the pollen could have taken? How likely is each path? What can we say about the distribution of outcomes? This understanding of the pollen particle as a random process is a deep idea, and it took many decades and scientists to understand. The phenomenon was first observed in the 1830s by the Scottish botanist Robert Brown. Brown used a microscope to observe pollen particles suspended in water, and to his surprise, he saw the particles moving! At first, he thought this meant that the pollen particles were alive, but he tested and then rejected this hypothesis by observing the same effect with particles that he was convinced were inanimate, such as glass powder, minerals, and even pulverized fragments of the Egyptian Sphinx (Góra, 2006) ! For roughly half-a-century, the phenomenon remained a mystery, although it became known as Brownian motion . Then starting in 1905, Albert Einstein published a series of papers in which he hythesized that the pollen particles were moving because they were being bombarded by invisible molecules in the liquid (Einstein, 1905) . In the following year, the Polish physicist Marian Smoluchowski independently published essentially the same theory (Von Smoluchowski, 1906) . At the time, this theory was controversial, because the idea of molecules was not yet widely accepted. However, using statistical mechanics, Einstein and Smoluchowski were able to make testable predictions about the behavior of the particles, and another scientist, Jean Baptiste Perrin, verified the model a few years later (Perrin, 1909) . And since Einstein’s breakthrough work, Brownian motion has been widely studied and more deeply understood. In the mathematical community, Brownian motion was formalized by Norbert Wiener (Wiener, 1923) , and thus Brownian motion is often referred to as a Wiener process , particularly by mathematicians. The goal of this post is to better understand Brownian motion. Brownian motion is an important concept because it can be used to model many phenomenon, from particles suspended in liquids to the prices of stocks. Ultimately, we’ll reconstruct the marginal distribution of our pollen particle at any given point in time. As we will see this, this is the normal distribution. This deep connection means that we can make mathematically precise probabilistic statements about a completely random process. Let’s begin with a simplified model of our pollen particle in discrete time. This is a stochastic process called a random walk . In the next section, we’ll extend this to continuous time, which is Brownian motion. Imagine we can discretize time and then observe a single discrete “tick” on the clock. What happens to the pollen particle during this one tick? In our simple model of the world, we’re going to imagine that we flip a coin, not necessarily fair, and that the pollen particle moves up or down the same amount based on the outcome of that coin toss. The coin toss models the fact that the pollen particle is being randomly bombarded by water molecules and thus its position at the next time point is random. So the pollen particle cannot stay in place; after one tick of the clock, it moves up or down. Formally, let S 0 S_0 S 0 be the particle’s initial position (non-random), and let S 1 S_1 S 1 be a univariate random variable denoting the vertical position of the pollen particle after one tick. We assume that the initial position is zero ( S 0 = 0 S_0 = 0 S 0 = 0 ), since this makes our calculations and notation easier and since it is simply a vertical shift in the final path. So we flip a coin with bias p p p , where p p p is the probability of heads ( H H H ) and q : = 1 − p q := 1-p q : = 1 − p is the probability of tails ( T T T ). If the coin is heads, then the pollen particle moves up u u u , and if the coin is tails, then the pollen particle moves down u u u (to − u -u − u ). Let’s denote the outcome of each coin flip as a random variable Z i Z_i Z i , taking values in { − 1 , 1 } \{-1, 1\} { − 1 , 1 } . Then the position after a single coin flip is u Z 1 u Z_1 u Z 1 (Figure 2 2 2 ). Now consider the position S n S_n S n after n n n time steps. At each time point i i i , we flip a coin, which we assume is independent of all other coin tosses, to discover whether the pollen particle is displaced up or down from its current position. Then S n S_n S n is simply the sum S n = u Z 1 + u Z 2 + ⋯ + u Z n . (1) S_n = u Z_1 + u Z_2 + \dots + u Z_n. \tag{1} S n = u Z 1 + u Z 2 + ⋯ + u Z n . ( 1 ) Since each Z i Z_i Z i is random, S n S_n S n is also random. Clearly, as we repeatedly flip our coin, the set of possible locations of the pollen particle expands linearly with n n n . We can visualize all these possible locations as a directed graph or tree, sometimes called a binomial tree (Figure 3 3 3 , left)—we’ll explain the name in a moment. The tree layers (vertical slices) are zero-indexed, and so the root node occurs at time n = 0 n=0 n = 0 . Each node is a possible location, and the n n n -th layer is all possible locations by time n n n . The directed edges (left to right) are valid moves of the pollen particle. A path in this binomial tree is a sequence of steps which starts at the tree’s root (left-most node) and continues right at each time step until it reaches a leaf node (right-most node). A valid path is one that always moves left-to-right, from root to leaf. A valid path cannot, for example, move straight down at the same time point or move backwards. To help us identify nodes, let’s introduce the counting number k k k , which indexes the leaf nodes, taking values in k ∈ { 0 , 1 , … , n } k \in \{0, 1, \dots, n\} k ∈ { 0 , 1 , … , n } . Like the time index n n n , the number k k k is a zero-based index. Let’s denote the bottom leaf node with k = 0 k=0 k = 0 and the top leaf node with k = n k=n k = n . To illustrate this, I’ve visualized the tree with the nodes labeled with tuples ( n , k ) (n, k) ( n , k ) (Figure 3 3 3 , right). Now that we understand this simple, discrete-time model for our pollen particle, let’s tackle our motivating question: which outcomes (leaf nodes) are most likely? Any given path is random, but can we say something about the distribution of outcomes? To start, let’s compute the probability of arriving at the highlighted leaf node in Figure 4 4 4 . This is really the probability of arriving at a given node ( n , k ) (n, k) ( n , k ) , which in turn is really the probability of flipping k k k heads in n n n coin tosses. Let’s use K n K_n K n for this random variable. Arriving at this node requires that we flip two heads and one tails. The probability of this is P ( { two heads and one tails } ) = p 2 q . (2) \mathbb{P}\left(\{ \text{two heads and one tails} \}\right) = p^2 q. \tag{2} P ( { two heads and one tails } ) = p 2 q . ( 2 ) However, there are three ways flip two heads in three coin tosses, { H H T , H T H , T H H } , (3) \{ HHT, HTH, THH \}, \tag{3} { H H T , H T H , T H H } , ( 3 ) which is another way of saying that there are three paths to the highlighted node. Since each path is a mutually exclusive outcome, we compute our desired probability by summing the probability of all outcomes in Equation 2 2 2 by the number of paths: P ( { arriving at node ( 3 , 2 ) } ) = P ( K 3 = 2 ) = 3 p 2 q . (4) \mathbb{P}\left(\{ \text{arriving at node $(3, 2)$} \}\right) = \mathbb{P}(K_3 = 2) = 3 p^2 q. \tag{4} P ( { arriving at node ( 3 , 2 ) } ) = P ( K 3 = 2 ) = 3 p 2 q . ( 4 ) For example, if p = 1 / 2 p=1/2 p = 1 / 2 , then this probability would be 3 / 8 3/8 3 / 8 . To compute this probability in general, we just need a way to compute the number of ways to get k k k successes or heads in n n n trials. Since order matters, the number of ways to pick k k k heads from n n n coin tosses is n ( n − 1 ) ( n − 2 ) … ( n − k + 1 ) = n ! ( n − k ) ! . (5) n (n-1) (n-2) \dots (n-k+1) = \frac{n!}{(n-k)!}. \tag{5} n ( n − 1 ) ( n − 2 ) … ( n − k + 1 ) = ( n − k ) ! n ! . ( 5 ) First, we can choose any of n n n coin tosses to be a heads. Then we can pick any of n − 1 n-1 n − 1 coins tosses to be heads. And so on, until we have k − 1 k-1 k − 1 heads. (The last pick is completely constrained.) However, this overcounts the possible paths. For example, this does not distinguish between H 1 H 2 H_1 H_2 H 1 H 2 and H 2 H 1 H_2 H_1 H 2 H 1 , where the subscript i i i denotes the i i i -th coin toss. So we need to divide the permutation in Equation 5 5 5 by the number of ways we can order elements in a k k k -sized set. This is k k k factorial. Putting this together, we see that the number of ways to get to each node in the binomial tree is ordered ways to pick k heads from n tosses permutations of k heads = n ( n − 1 ) ( n − 2 ) … ( n − k + 1 ) k ( k − 1 ) ( k − 2 ) … 1 . (6) \frac{\text{ordered ways to pick $k$ heads from $n$ tosses}}{\text{permutations of $k$ heads}} \;\; = \;\; \frac{n(n-1)(n-2) \dots (n-k+1)}{k(k-1)(k-2) \dots 1}. \tag{6} permutations of k heads ordered ways to pick k heads from n tosses = k ( k − 1 ) ( k − 2 ) … 1 n ( n − 1 ) ( n − 2 ) … ( n − k + 1 ) . ( 6 ) This number in Equation 6 6 6 is often called the binomial coefficient , pronounced “ n n n choose k k k ”, and is denoted as ( n k ) ≜ n ! k ! ( n − k ) ! (7) {n \choose k} \triangleq \frac{n!}{k! (n-k)!} \tag{7} ( k n ) ≜ k ! ( n − k ) ! n ! ( 7 ) Putting it all together, the probability of arriving at the k k k -th node in the n n n -th layer of a binomial tree is P ( { K n = k } ) = ( n k ) p k q n − k . (8) \mathbb{P}(\{K_n = k\}) = {n \choose k} p^k q^{n-k}. \tag{8} P ( { K n = k } ) = ( k n ) p k q n − k . ( 8 ) The fact that these probabilities sum to one is just a trivial application of the binomial theorem . See A1 . Computing the mean and variance of K n K_n K n is relatively straightforward. We can view K n K_n K n as the sum of independent Bernoulli random variables, so K n = Z 1 + 1 2 + Z 2 + 1 2 + ⋯ + Z n + 1 2 . (9) K_n = \frac{Z_1 + 1}{2} + \frac{Z_2 + 1}{2} + \dots + \frac{Z_n + 1}{2}. \tag{9} K n = 2 Z 1 + 1 + 2 Z 2 + 1 + ⋯ + 2 Z n + 1 . ( 9 ) The mean of each Z i Z_i Z i is 2 p − 1 2p - 1 2 p − 1 , and the first two moments are easy to compute: E [ K n ] = ∑ i = 1 n E [ Z i + 1 2 ] = n p , V [ K n ] = ∑ i = 1 n V [ Z i + 1 2 ] = n p q . (10) \begin{aligned} \mathbb{E}[K_n] &= \sum_{i=1}^n \mathbb{E}\!\left[\frac{Z_i + 1}{2}\right] = np, \\ \mathbb{V}[K_n] &= \sum_{i=1}^{n} \mathbb{V}\!\left[\frac{Z_i + 1}{2}\right] = npq. \end{aligned} \tag{10} E [ K n ] V [ K n ] = i = 1 ∑ n E [ 2 Z i + 1 ] = n p , = i = 1 ∑ n V [ 2 Z i + 1 ] = n p q . ( 1 0 ) For the variance calculation, we use Bienaymé’s identity and the fact that ( Z i + 1 ) / 2 (Z_i+1)/2 ( Z i + 1 ) / 2 are independent. The distribution of K n K_n K n was first discovered by Jakob Bernoulli and is called the binomial distribution after the binomial terms implicit in n n n choose k k k —again, see A1 —and hence the “binomial tree.” Of course, K n K_n K n is the distribution on the number of heads in n n n coin tosses, while we’re more interested in the location of the pollen particle. But there’s a simple relationship between the two. The location S n S_n S n is simply the number of up moves ( K n K_n K n ), minus the number of down moves ( n − K n n - K_n n − K n ), scaled by the size of the move u u u . In other words, it is: S n = u ( Z 1 + Z 2 + ⋯ + Z n ) = u ( K n − ( n − K n ) ) = u ( 2 K n − n ) . (11) \begin{aligned} S_n &= u \left( Z_1 + Z_2 + \dots + Z_n \right) \\ &= u \left(K_n - (n - K_n)\right) \\ &= u \left(2 K_n - n\right). \end{aligned} \tag{11} S n = u ( Z 1 + Z 2 + ⋯ + Z n ) = u ( K n − ( n − K n ) ) = u ( 2 K n − n ) . ( 1 1 ) Since n n n and u u u are non-random, the events { S n = u ( 2 k − n ) } \{S_n = u(2k-n)\} { S n = u ( 2 k − n ) } and { K n = k } \{K_n=k\} { K n = k } are identical. So we can say P ( { S n = u ( 2 k − n ) } ) = P ( { K n = k } ) = ( n k ) p k q n − k . (12) \mathbb{P}(\{S_n = u(2k-n)\}) = \mathbb{P}(\{K_n = k\}) = {n \choose k} p^k q^{n-k}. \tag{12} P ( { S n = u ( 2 k − n ) } ) = P ( { K n = k } ) = ( k n ) p k q n − k . ( 1 2 ) So the location of our pollen particle by a given layer n n n is determined by the distribution of a random variable K n K_n K n with the probability mass function (PMF) in Equation 8 8 8 . While K n K_n K n and S n S_n S n have the same probabilities, they clearly have different moments. The first two moments of S n S_n S n are: E [ S n ] = E [ u ( 2 K n − n ) ] = 2 n u ( p − 1 / 2 ) , V [ S n ] = V [ u ( 2 K n − n ) ] = V [ 2 u K n ] = 4 u 2 n p q . (13) \begin{aligned} \mathbb{E}[S_n] &= \mathbb{E}[u(2K_n - n)] = 2nu(p-1/2), \\ \mathbb{V}[S_n] &= \mathbb{V}[u(2K_n - n)] = \mathbb{V}[2 u K_n] = 4 u^2 npq. \end{aligned} \tag{13} E [ S n ] V [ S n ] = E [ u ( 2 K n − n ) ] = 2 n u ( p − 1 / 2 ) , = V [ u ( 2 K n − n ) ] = V [ 2 u K n ] = 4 u 2 n p q . ( 1 3 ) In the special case in which p = 1 / 2 p = 1/2 p = 1 / 2 , then E [ S n ] = 0 , V [ S n ] = n u 2 . (14) \begin{aligned} \mathbb{E}[S_n] &= 0, \\ \mathbb{V}[S_n] &= n u^2. \end{aligned} \tag{14} E [ S n ] V [ S n ] = 0 , = n u 2 . ( 1 4 ) We can explore this distribution by plotting the function for various parameterizations (Figure 5 5 5 ). Note that while K n K_n K n is binomially distributed and while S n S_n S n and K n K_n K n have the same probability function, S n S_n S n is not binomially distributed. That’s because the binomial distribution only has support over the non-negative integers. It’s a distribution over repeated coin flips. But S n S_n S n has support over the negative numbers. I don’t think the distribution of S n S_n S n has a name, but speaking loosely, it is essentially a binomial distribution mean-centered at zero. And another way to visualize this is to imagine larger and larger binomial trees (Figure 6 6 6 ). The distribution for the locations S n S_n S n for n ∈ { 6 , 20 , 80 } n \in \{6, 20, 80\} n ∈ { 6 , 2 0 , 8 0 } are the distributions in Figure 5 5 5 . To summarize so far, we have done something remarkable. We have modeled the motion of a completely random particle, and yet we can say something concrete and precise about its distribution of locations over time. To do this, however, we had to assume that time was discrete. So the natural next question is: what’s the distribution of our process in the continuous-time limit? At this point in our story, it is not far-fetched to guess that it’s the normal distribution. De Moivre proved the De Moivre–Laplace theorem , the earliest version of a central limit theorem (CLT), in 1738, so roughly a hundred years before Robert Brown observed Brownian motion. So scientists and mathematicians already knew that a sum of independent and identically distributed random variables converge to a Gaussian. The key insight in the development of Brownian motion was to realize that the bombardment of a pollen particle could be modeled as such as a sum. So now let’s imagine what happens when the molecular bombardments on our pollen particle increase in number but decrease proportionally in impact. So we have more bombardments but they move the pollen particle less per bombardment. This rescaling is critical, or else the variance of our process would explode. Put in physical terms, if we increased the number of bombardments of our pollen particle but did not scale down the size of the move, the pollen particle’s moves would grow implausibly large. To formalize this, let’s first fix p = 1 / 2 p = 1/2 p = 1 / 2 so that E [ Z i ] = 0 \mathbb{E}[Z_i] = 0 E [ Z i ] = 0 . We’ll handle the asymmetric case later. Now suppose that there are n n n bombardments per unit of physical time t t t , so for any fixed amount of physical time t > 0 t \gt 0 t > 0 , we can model the location of our pollen particle as B t ( n ) = u Z 1 + u Z 2 + ⋯ + u Z ⌊ t n ⌋ , u : = 1 n . (15) B_t^{(n)} = u Z_1 + u Z_2 + \dots + u Z_{\lfloor tn \rfloor}, \quad u := \frac{1}{\sqrt{n}}. \tag{15} B t ( n ) = u Z 1 + u Z 2 + ⋯ + u Z ⌊ t n ⌋ , u : = n 1 . ( 1 5 ) The notation ⌊ t n ⌋ {\lfloor tn \rfloor} ⌊ t n ⌋ just indicates flooring to an integer since t t t is a positive real number. And we need u u u to scale with n n n , and so we set u = 1 / n u = 1 / \sqrt{n} u = 1 / n . If we take n → ∞ n \rightarrow \infty n → ∞ , then we get a continuous-time limit of a random walk: B t = lim n → ∞ B t ( n ) . (16) B_t = \lim_{n \rightarrow \infty} B_t^{(n)}. \tag{16} B t = n → ∞ lim B t ( n ) . ( 1 6 ) So again, we hold physical time t t t fixed, and we make our binomial tree finer and finer (larger n n n for fixed t t t ). If we remove the grid of the binomial tree which clutters the visualization, and just visualize paths for finer and finer n n n , we can create visualizations similar to Figure 6 6 6 but for much larger n n n (Figure 7 7 7 ). Now we can ask the same question we asked in the discrete-time case: after physical time t t t , what is the distribution of our pollen particle’s position? As we observed above, it must be a normal distribution! Here, the insight is not that the binomial distribution converges to the normal distribution—again, this was known a hundred years before Robert Brown’s observations. The insight is that by modeling the continuous-time limit of a random walk as in Equation 15 15 1 5 , this rescaled random walk B t ( n ) B_t^{(n)} B t ( n ) converges to a normal distribution N ( 0 , t ) \mathcal{N}(0, t) N ( 0 , t ) . Let’s see this a bit more formally. The De Moivre–Laplace theorem states that a properly standardized binomial random variable converges to the normal distribution. In our notation, K n ∼ binom ( n , p ) K_n \sim \text{binom}(n, p) K n ∼ binom ( n , p ) with p = 1 / 2 p=1/2 p = 1 / 2 , and the theorem states: K n − E [ K n ] V [ K n ] = K n − n p n p q = K n − n / 2 n / 4 → d N ( 0 , 1 ) . (17) \frac{K_n - \mathbb{E}[K_n]}{\sqrt{\mathbb{V}[K_n]}} = \frac{K_n - np}{\sqrt{npq}} = \frac{K_n - n/2}{\sqrt{n/4}} \;\stackrel{d}{\rightarrow}\; \mathcal{N}(0, 1). \tag{17} V [ K n ] K n − E [ K n ] = n p q K n − n p = n / 4 K n − n / 2 → d N ( 0 , 1 ) . ( 1 7 ) Now observe that B t ( n ) B_t^{(n)} B t ( n ) is essentially this standardized quantity up to rescaling: B t ( n ) = u Z 1 + u Z 2 + ⋯ + u Z ⌊ t n ⌋ = 1 n ( Z 1 + Z 2 + ⋯ + Z ⌊ t n ⌋ ) = 1 n ( 2 K ⌊ t n ⌋ − ⌊ t n ⌋ ) = 1 n ⌊ t n ⌋ ⌊ t n ⌋ 2 ( K ⌊ t n ⌋ − ⌊ t n ⌋ / 2 ) = ⌊ t n ⌋ n K ⌊ t n ⌋ − ⌊ t n ⌋ / 2 ⌊ t n ⌋ / 4 . (18) \begin{aligned} B_t^{(n)} &= u Z_1 + u Z_2 + \dots + u Z_{\lfloor tn \rfloor} \\ &= \frac{1}{\sqrt{n}} \left( Z_1 + Z_2 + \dots + Z_{\lfloor tn \rfloor} \right) \\ &= \frac{1}{\sqrt{n}} \left(2 K_{\lfloor tn \rfloor} - {\lfloor tn \rfloor} \right) \\ &= \frac{1}{\sqrt{n}} \frac{\sqrt{\lfloor tn \rfloor}}{\sqrt{\lfloor tn \rfloor}} \; 2 \left(K_{\lfloor tn \rfloor} - {\lfloor tn \rfloor}/2 \right) \\ &= \sqrt{\frac{\lfloor tn \rfloor}{n}} \frac{K_{\lfloor tn \rfloor} - {\lfloor tn \rfloor}/2}{\sqrt{\lfloor tn \rfloor / 4}}. \end{aligned} \tag{18} B t ( n ) = u Z 1 + u Z 2 + ⋯ + u Z ⌊ t n ⌋ = n 1 ( Z 1 + Z 2 + ⋯ + Z ⌊ t n ⌋ ) = n 1 ( 2 K ⌊ t n ⌋ − ⌊ t n ⌋ ) = n 1 ⌊ t n ⌋ ⌊ t n ⌋ 2 ( K ⌊ t n ⌋ − ⌊ t n ⌋ / 2 ) = n ⌊ t n ⌋ ⌊ t n ⌋ / 4 K ⌊ t n ⌋ − ⌊ t n ⌋ / 2 . ( 1 8 ) By De Moivre–Laplace, we can say: K ⌊ t n ⌋ − ⌊ t n ⌋ / 2 ⌊ t n ⌋ / 4 → d N ( 0 , 1 ) . (19) \frac{K_{\lfloor tn \rfloor} - {\lfloor tn \rfloor}/2}{\sqrt{\lfloor tn \rfloor / 4}} \;\stackrel{d}{\rightarrow}\; \mathcal{N}(0, 1). \tag{19} ⌊ t n ⌋ / 4 K ⌊ t n ⌋ − ⌊ t n ⌋ / 2 → d N ( 0 , 1 ) . ( 1 9 ) And as n → ∞ n \rightarrow \infty n → ∞ , we can see that the prefactor converges to t \sqrt{t} t : ⌊ t n ⌋ n → t . (20) \sqrt{\frac{\lfloor tn \rfloor}{n}} \;\rightarrow\; \sqrt{t}. \tag{20} n ⌊ t n ⌋ → t . ( 2 0 ) Since the standardized binomial converges in distribution to N ( 0 , 1 ) \mathcal{N}(0, 1) N ( 0 , 1 ) and the prefactor converges to the constant t \sqrt{t} t , we can see that B t ( n ) → d t N ( 0 , 1 ) = N ( 0 , t ) . (21) B_t^{(n)} \;\stackrel{d}{\rightarrow}\; \sqrt{t} \; \mathcal{N}(0, 1) = \mathcal{N}(0, t). \tag{21} B t ( n ) → d t N ( 0 , 1 ) = N ( 0 , t ) . ( 2 1 ) That’s it! As an aside, I think that in a modern treatment, we would invoke Slutsky’s theorem to arrive at Equation 21 21 2 1 . Slutsky’s theorem states that if a sequence of random variables converges in distribution and is multiplied by a sequence converging to a constant, then the product converges in distribution to the constant times the limit. The geometric interpretation of this is that the marginal distribution after time t t t is simply the normal distribution N ( 0 , t ) \mathcal{N}(0, t) N ( 0 , t ) (Figure 8 8 8 ). ) are denoted with dashed lines. Now that we see the simplest version of the derivation in its entirety, we can make two important adjustments. First, notice that our bombardment factor u = 1 / n u = 1/\sqrt{n} u = 1 / n has no physical meaning. The denominator just ensures convergence, and so this bombardment has unit scale. But we can introduce a parameter σ \sigma σ which captures the physical scale of the bombardment. Concretely, let u = σ n . (22) u = \frac{\sigma}{\sqrt{n}}. \tag{22} u = n σ . ( 2 2 ) It’s easy to see that this will flow through the derivation in Equation 18 18 1 8 and give us σ B t ( n ) → d σ t N ( 0 , 1 ) = N ( 0 , σ 2 t ) . (23) \sigma B_t^{(n)} \;\stackrel{d}{\rightarrow}\; \sigma \sqrt{t} \; \mathcal{N}(0, 1) = \mathcal{N}(0, \sigma^2 t). \tag{23} σ B t ( n ) → d σ t N ( 0 , 1 ) = N ( 0 , σ 2 t ) . ( 2 3 ) But I think the more interesting adjustment is adding a drift parameter μ \mu μ . Of course, we could just shift our Brownian motion directly: μ + σ B t ( n ) → d N ( μ , σ 2 t ) . (24) \mu + \sigma B_t^{(n)} \;\stackrel{d}{\rightarrow}\; \mathcal{N}(\mu, \sigma^2 t). \tag{24} μ + σ B t ( n ) → d N ( μ , σ 2 t ) . ( 2 4 ) But this has no physical meaning for our process. It’s just an arbitrary shift, not a drift. A richer way to approach this is to encode it directly into the bias of our coin flip. Intuitively, if we flip a biased coin (so p ≠ 1 / 2 p \neq 1/2 p = 1 / 2 ), then the position our pollen particle will drift over time (Figure 9 9 9 ). However, there’s a problem with this approach: since p p p is constrained to [ 0 , 1 ] [0, 1] [ 0 , 1 ] , then E [ Z i ] \mathbb{E}[Z_i] E [ Z i ] is constrained to [ − 1 , 1 ] [-1, 1] [ − 1 , 1 ] , and thus the mean of our Brownian motion is constrained to [ − t , t ] [-t, t] [ − t , t ] : E [ Z i ] = 2 p − 1 , E [ B t ( n ) ] = 1 n ⌊ t n ⌋ E [ Z 1 ] . (25) \begin{aligned} \mathbb{E}[Z_i] &= 2p - 1, \\ \mathbb{E}\!\left[B_t^{(n)}\right] &= \frac{1}{\sqrt{n}} \lfloor tn \rfloor \mathbb{E}[Z_1]. \end{aligned} \tag{25} E [ Z i ] E [ B t ( n ) ] = 2 p − 1 , = n 1 ⌊ t n ⌋ E [ Z 1 ] . ( 2 5 ) A more elegant approach is to make p p p a function μ \mu μ . However, we cannot naively do this, since our drift could explode as n → ∞ n \rightarrow \infty n → ∞ . So we need to normalize μ \mu μ by n n n . Consider this definition for our bias parameter, now p n p_n p n : p n = 1 2 + μ 2 σ n . (26) p_n = \frac{1}{2} + \frac{\mu}{2 \sigma \sqrt{n}}. \tag{26} p n = 2 1 + 2 σ n μ . ( 2 6 ) Intuitively, the factor μ / ( 2 σ n ) \mu /(2 \sigma \sqrt{n}) μ / ( 2 σ n ) is the precise rate at which the bias of our coin has to vanish as we increase the number of bombardments n n n per unit of physical time t t t . So the mean of each Z i Z_i Z i is E [ Z i ] = 2 p n − 1 = μ σ n , (27) \mathbb{E}[Z_i] = 2 p_n - 1 = \frac{\mu}{\sigma \sqrt{n}}, \tag{27} E [ Z i ] = 2 p n − 1 = σ n μ , ( 2 7 ) and so the mean of our process—let’s denote it as X n ( n ) X_n^{(n)} X n ( n ) since it is no longer standardized—converges to μ t \mu t μ t as n → ∞ n \rightarrow \infty n → ∞ : E [ X t ( n ) ] = σ n ⌊ t n ⌋ E [ Z 1 ] = σ n ⌊ t n ⌋ μ σ n → μ t . (28) \mathbb{E}\!\left[X_t^{(n)}\right] = \frac{\sigma}{\sqrt{n}} \lfloor tn \rfloor \mathbb{E}[Z_1] = \frac{\sigma}{\sqrt{n}} \lfloor tn \rfloor \frac{\mu}{\sigma \sqrt{n}} \;\rightarrow\; \mu t. \tag{28} E [ X t ( n ) ] = n σ ⌊ t n ⌋ E [ Z 1 ] = n σ ⌊ t n ⌋ σ n μ → μ t . ( 2 8 ) Putting these two adjustments together—one for the drift and one for the size of the bombardment—we can see that the general result is non-standard Brownian motion: X t ( n ) → d N ( μ t , σ 2 t ) . (29) X_t^{(n)} \;\stackrel{d}{\rightarrow}\; \mathcal{N}(\mu t, \sigma^2 t). \tag{29} X t ( n ) → d N ( μ t , σ 2 t ) . ( 2 9 ) Alternatively, we could simply rewrite the main derivation (Equation 18 18 1 8 ) using u u u and p n p_n p n as defined in Equations 22 22 2 2 and 26 26 2 6 respectively. This is arguably the more elegant derivation, since we construct the marginal distribution from the ground up. See A2 for this derivation. Note that this isn’t a proof that the rescaled random walk converges to Brownian motion as a process . That requires more advanced mathematics such as Donsker’s theorem . Rather, it’s a claim about its marginal distribution at any fixed time t t t . But I think this provides amazing intuition for what Brownian motion really is without requiring much beyond elementary probability. I still remember sitting in class for a course on probability and random process and watching the professor churn through the algebra to produce the insight in Equation 19 19 1 9 . It felt surprising and then obvious. The normal distribution is everywhere precisely because it is the limiting distribution for sums of independent and identically distributed random variables. We can shift or scale our random walk. We can make it asymmetric. It doesn’t really matter. We’ll still converge to a normal. And in my mind, this derivation builds good intuition for other properties of Brownian motion. For example, we can say that Brownian motion is a martingale or that it has stationary Gausian increments. The mathematics needed to make these claims precise might require some work, but the basic intuition is encoded in the derivations and visualizations above. The binomial theorem is the following identity, which holds for any non-negative integer power n n n : ( x + y ) n = ∑ k = 0 n ( n k ) x k y n − k . (A1.1) (x + y)^n = \sum_{k=0}^n {n \choose k} x^k y^{n-k}. \tag{A1.1} ( x + y ) n = k = 0 ∑ n ( k n ) x k y n − k . ( A 1 . 1 ) This is easy to prove by induction. One can trivially check that the base case holds. And the inductive step is as follows: ( x + y ) n ( x + y ) = ∑ k = 0 n ( n k ) x k y n − k ( x + y ) = ∑ k = 0 n ( n k ) x k + 1 y n − k + ∑ k = 0 n ( n k ) x k y n − k + 1 : = A + B . (A1.2) \begin{aligned} (x + y)^n (x + y) &= \sum_{k=0}^n {n \choose k} x^k y^{n-k} (x + y) \\ &= \sum_{k=0}^n {n \choose k} x^{k+1} y^{n-k} + \sum_{k=0}^n {n \choose k} x^k y^{n-k+1} \\ &:= A + B. \end{aligned} \tag{A1.2} ( x + y ) n ( x + y ) = k = 0 ∑ n ( k n ) x k y n − k ( x + y ) = k = 0 ∑ n ( k n ) x k + 1 y n − k + k = 0 ∑ n ( k n ) x k y n − k + 1 : = A + B . ( A 1 . 2 ) If we write each sum A A A and B B B explicitly, it’s clear that we have n − 1 n-1 n − 1 overlapping terms: A = ( n 0 ) x 1 y n + ( n 1 ) x 2 y n − 1 + ⋯ + ( n n − 1 ) x n y 1 + ( n n ) x n + 1 y 0 , B = ( n 0 ) x 0 y n + 1 + ( n 1 ) x 1 y n + ⋯ + ( n n − 1 ) x n − 1 y 2 + ( n n ) x n y 1 . (A1.3) \begin{aligned} A &= {n \choose 0} x^1 y^n + {n \choose 1} x^2 y^{n-1} + \dots + {n \choose n-1} x^n y^1 + {n \choose n} x^{n+1} y^0, \\ \\ B &= {n \choose 0} x^0 y^{n+1} + {n \choose 1} x^1 y^n + \dots + {n \choose n-1} x^{n-1} y^2 + {n \choose n} x^n y^1. \end{aligned} \tag{A1.3} A B = ( 0 n ) x 1 y n + ( 1 n ) x 2 y n − 1 + ⋯ + ( n − 1 n ) x n y 1 + ( n n ) x n + 1 y 0 , = ( 0 n ) x 0 y n + 1 + ( 1 n ) x 1 y n + ⋯ + ( n − 1 n ) x n − 1 y 2 + ( n n ) x n y 1 . ( A 1 . 3 ) Collecting the n − 1 n-1 n − 1 like terms, we get: A + B = [ ( n 0 ) + ( n 1 ) ] x 1 y n + ⋯ + [ ( n n − 1 ) + ( n n ) ] x n y 1 + ( n 0 ) x 0 y n + 1 + ( n n ) x n + 1 y 0 . (A1.4) \begin{aligned} A+B &= \left[{n \choose 0} + {n \choose 1}\right] x^1 y^n + \dots + \left[{n \choose n-1} + {n \choose n}\right] x^n y^1 \\ &\quad + {n \choose 0} x^0 y^{n+1} + {n \choose n} x^{n+1} y^0. \end{aligned} \tag{A1.4} A + B = [ ( 0 n ) + ( 1 n ) ] x 1 y n + ⋯ + [ ( n − 1 n ) + ( n n ) ] x n y 1 + ( 0 n ) x 0 y n + 1 + ( n n ) x n + 1 y 0 . ( A 1 . 4 ) Finally, we can use the following identity to collapse bracketed binomial coefficients: ( n k ) = ( n − 1 k − 1 ) + ( n − 1 k ) . (A1.5) {n \choose k} = {n-1 \choose k-1} + {n-1 \choose k}. \tag{A1.5} ( k n ) = ( k − 1 n − 1 ) + ( k n − 1 ) . ( A 1 . 5 ) And we can rewrite the non-overlapping terms in terms of n + 1 n+1 n + 1 since ( n 0 ) = ( n + 1 0 ) = ( n n ) = ( n + 1 n + 1 ) = 1. (A1.6) {n \choose 0} = {n+1 \choose 0} = {n \choose n} = {n+1 \choose n+1} = 1. \tag{A1.6} ( 0 n ) = ( 0 n + 1 ) = ( n n ) = ( n + 1 n + 1 ) = 1 . ( A 1 . 6 ) This completes the inductive step: ( x + y ) n + 1 = ( n + 1 1 ) x 1 y n + ⋯ + ( n + 1 n ) x n y 1 + ( n + 1 0 ) x 0 y n + 1 + ( n + 1 n + 1 ) x n + 1 y 0 = ∑ k = 1 n + 1 ( n + 1 k ) x k y n + 1 − k . (A1.7) \begin{aligned} &(x+y)^{n+1} \\ &= {n+1 \choose 1} x^1 y^n + \dots + {n+1 \choose n} x^n y^1 + {n+1 \choose 0} x^0 y^{n+1} + {n+1 \choose n+1} x^{n+1} y^0 \\ &= \sum_{k=1}^{n+1} {n+1 \choose k} x^k y^{n+1-k}. \end{aligned} \tag{A1.7} ( x + y ) n + 1 = ( 1 n + 1 ) x 1 y n + ⋯ + ( n n + 1 ) x n y 1 + ( 0 n + 1 ) x 0 y n + 1 + ( n + 1 n + 1 ) x n + 1 y 0 = k = 1 ∑ n + 1 ( k n + 1 ) x k y n + 1 − k . ( A 1 . 7 ) Finally, the fact that the binomial distribution normalizes—discussed around Equation 8 8 8 —is simply a direct application of the binomial theorem for x = p x = p x = p and y = 1 − p y=1-p y = 1 − p . Let p n p_n p n be defined as p n = 1 2 + μ 2 σ n . (A3.1) p_n = \frac{1}{2} + \frac{\mu}{2 \sigma \sqrt{n}}. \tag{A3.1} p n = 2 1 + 2 σ n μ . ( A 3 . 1 ) Then clearly E [ Z i ] = 2 p n − 1 = μ σ n , μ K : = E [ K ⌊ t n ⌋ ] = ⌊ t n ⌋ p n = ⌊ t n ⌋ ( 1 2 + μ 2 σ n ) , σ K 2 : = V [ K ⌊ t n ⌋ ] = ⌊ t n ⌋ p n ( 1 − p n ) = ⌊ t n ⌋ ( 1 2 + μ 2 σ n ) ( 1 2 − μ 2 σ n ) = ⌊ t n ⌋ ( 1 4 − μ 2 4 σ 2 n ) . (A3.2) \begin{aligned} \mathbb{E}[Z_i] &= 2 p_n - 1 = \frac{\mu}{\sigma \sqrt{n}}, \\\\ \mu_{K} := \mathbb{E}[K_{\lfloor tn \rfloor}] &= {\lfloor tn \rfloor} p_n \\ &= {\lfloor tn \rfloor} \left( \frac{1}{2} + \frac{\mu}{2 \sigma \sqrt{n}}\right), \\\\ \sigma_{K}^2 := \mathbb{V}[K_{\lfloor tn \rfloor}] &= {\lfloor tn \rfloor} p_n (1 - p_n) \\ &= {\lfloor tn \rfloor} \left( \frac{1}{2} + \frac{\mu}{2 \sigma \sqrt{n}}\right) \left( \frac{1}{2} - \frac{\mu}{2 \sigma \sqrt{n}}\right) \\ &= {\lfloor tn \rfloor} \left( \frac{1}{4} - \frac{\mu^2}{4 \sigma^2 n}\right). \end{aligned} \tag{A3.2} E [ Z i ] μ K : = E [ K ⌊ t n ⌋ ] σ K 2 : = V [ K ⌊ t n ⌋ ] = 2 p n − 1 = σ n μ , = ⌊ t n ⌋ p n = ⌊ t n ⌋ ( 2 1 + 2 σ n μ ) , = ⌊ t n ⌋ p n ( 1 − p n ) = ⌊ t n ⌋ ( 2 1 + 2 σ n μ ) ( 2 1 − 2 σ n μ ) = ⌊ t n ⌋ ( 4 1 − 4 σ 2 n μ 2 ) . ( A 3 . 2 ) Let’s redefine X t ( n ) X_t^{(n)} X t ( n ) as the following sequence: X t ( n ) = u Z 1 + u Z 2 + ⋯ + u Z ⌊ t n ⌋ , u : = σ n . (A3.3) X_t^{(n)} = u Z_1 + u Z_2 + \dots + u Z_{\lfloor tn \rfloor}, \quad u := \frac{\sigma}{\sqrt{n}}. \tag{A3.3} X t ( n ) = u Z 1 + u Z 2 + ⋯ + u Z ⌊ t n ⌋ , u : = n σ . ( A 3 . 3 ) We can write this as: X t ( n ) = σ n [ Z 1 + Z 2 + ⋯ + Z ⌊ t n ⌋ ] = σ n [ 2 K ⌊ t n ⌋ − ⌊ t n ⌋ ] = σ n 2 [ K ⌊ t n ⌋ − ⌊ t n ⌋ 1 2 ] = σ n 2 [ K ⌊ t n ⌋ − ⌊ t n ⌋ 1 2 − ⌊ t n ⌋ ( μ 2 σ n ) + ⌊ t n ⌋ ( μ 2 σ n ) ] = σ n 2 [ K ⌊ t n ⌋ − ⌊ t n ⌋ ( 1 2 − μ 2 σ n ) ] + ⌊ t n ⌋ μ n = σ n 2 [ K ⌊ t n ⌋ − μ K ] + ⌊ t n ⌋ μ n = σ n 4 ⌊ t n ⌋ ⌊ t n ⌋ 1 − μ σ 2 n 1 − μ σ 2 n [ K ⌊ t n ⌋ − μ K ] + ⌊ t n ⌋ μ n = σ n ⌊ t n ⌋ ( 1 − μ σ 2 n ) [ K ⌊ t n ⌋ − μ K ⌊ t n ⌋ ( 1 − μ σ 2 n ) 4 ] + ⌊ t n ⌋ μ n = σ n ⌊ t n ⌋ ( 1 − μ σ 2 n ) [ K ⌊ t n ⌋ − μ K σ K ] + ⌊ t n ⌋ μ n (A3.4) \begin{aligned} X_t^{(n)} &= \frac{\sigma}{\sqrt{n}} \left[ Z_1 + Z_2 + \dots + Z_{\lfloor tn \rfloor} \right] \\ &= \frac{\sigma}{\sqrt{n}} \left[ 2 K_{\lfloor tn \rfloor} - {\lfloor tn \rfloor} \right] \\ &= \frac{\sigma}{\sqrt{n}} \; 2 \left[ K_{\lfloor tn \rfloor} - \lfloor tn \rfloor \frac{1}{2} \right] \\ &= \frac{\sigma}{\sqrt{n}} \; 2 \left[ K_{\lfloor tn \rfloor} - \lfloor tn \rfloor \frac{1}{2} - {\lfloor tn \rfloor} \left( \frac{\mu}{2 \sigma \sqrt{n}}\right) + {\lfloor tn \rfloor} \left( \frac{\mu}{2 \sigma \sqrt{n}}\right) \right] \\ &= \frac{\sigma}{\sqrt{n}} \; 2 \left[ K_{\lfloor tn \rfloor} - \lfloor tn \rfloor \left( \frac{1}{2} - \frac{\mu}{2 \sigma \sqrt{n}} \right) \right] + {\lfloor tn \rfloor} \frac{\mu}{n} \\ &= \frac{\sigma}{\sqrt{n}} \; 2 \left[ K_{\lfloor tn \rfloor} - \mu_K \right] + {\lfloor tn \rfloor} \frac{\mu}{n} \\ &= \frac{\sigma}{\sqrt{n}} \sqrt{4 \frac{\lfloor tn \rfloor}{\lfloor tn \rfloor} \frac{1 - \frac{\mu}{\sigma^2 n}}{1 - \frac{\mu}{\sigma^2 n}}} \left[ K_{\lfloor tn \rfloor} - \mu_K \right] + {\lfloor tn \rfloor} \frac{\mu}{n} \\ &= \frac{\sigma}{\sqrt{n}} \sqrt{\lfloor tn \rfloor \left( 1 - \frac{\mu}{\sigma^2 n} \right)} \left[ \frac{K_{\lfloor tn \rfloor} - \mu_K}{\sqrt{\frac{\lfloor tn \rfloor \left(1 - \frac{\mu}{\sigma^2 n}\right)}{4}}} \right] + {\lfloor tn \rfloor} \frac{\mu}{n} \\ &= \frac{\sigma}{\sqrt{n}} \sqrt{\lfloor tn \rfloor \left( 1 - \frac{\mu}{\sigma^2 n} \right)} \left[ \frac{K_{\lfloor tn \rfloor} - \mu_K}{\sigma_K} \right] + {\lfloor tn \rfloor} \frac{\mu}{n} \end{aligned} \tag{A3.4} X t ( n ) = n σ [ Z 1 + Z 2 + ⋯ + Z ⌊ t n ⌋ ] = n σ [ 2 K ⌊ t n ⌋ − ⌊ t n ⌋ ] = n σ 2 [ K ⌊ t n ⌋ − ⌊ t n ⌋ 2 1 ] = n σ 2 [ K ⌊ t n ⌋ − ⌊ t n ⌋ 2 1 − ⌊ t n ⌋ ( 2 σ n μ ) + ⌊ t n ⌋ ( 2 σ n μ ) ] = n σ 2 [ K ⌊ t n ⌋ − ⌊ t n ⌋ ( 2 1 − 2 σ n μ ) ] + ⌊ t n ⌋ n μ = n σ 2 [ K ⌊ t n ⌋ − μ K ] + ⌊ t n ⌋ n μ = n σ 4 ⌊ t n ⌋ ⌊ t n ⌋ 1 − σ 2 n μ 1 − σ 2 n μ [ K ⌊ t n ⌋ − μ K ] + ⌊ t n ⌋ n μ = n σ ⌊ t n ⌋ ( 1 − σ 2 n μ ) ⎣ ⎢ ⎡ 4 ⌊ t n ⌋ ( 1 − σ 2 n μ ) K ⌊ t n ⌋ − μ K ⎦ ⎥ ⎤ + ⌊ t n ⌋ n μ = n σ ⌊ t n ⌋ ( 1 − σ 2 n μ ) [ σ K K ⌊ t n ⌋ − μ K ] + ⌊ t n ⌋ n μ ( A 3 . 4 ) Finally, it’s clear that the prefactor converges to σ t \sigma \sqrt{t} σ t as n → ∞ n \rightarrow \infty n → ∞ : σ n ⌊ t n ⌋ ( 1 − μ σ 2 n ) → σ t (A3.5) \frac{\sigma}{\sqrt{n}} \sqrt{\lfloor tn \rfloor \left( 1 - \frac{\mu}{\sigma^2 n} \right)} \;\rightarrow\; \sigma \sqrt{t} \tag{A3.5} n σ ⌊ t n ⌋ ( 1 − σ 2 n μ ) → σ t ( A 3 . 5 ) while the last term converges to μ t \mu t μ t as n → ∞ n \rightarrow \infty n → ∞ : ⌊ t n ⌋ μ n → μ t . (A3.6) {\lfloor tn \rfloor} \frac{\mu}{n} \;\rightarrow\; \mu t. \tag{A3.6} ⌊ t n ⌋ n μ → μ t . ( A 3 . 6 ) K ⌊ t n ⌋ − μ K σ K → d N ( 0 , 1 ) , (A3.7) \frac{K_{\lfloor tn \rfloor} - \mu_K}{\sigma_K} \;\stackrel{d}{\rightarrow}\; \mathcal{N}(0, 1), \tag{A3.7} σ K K ⌊ t n ⌋ − μ K → d N ( 0 , 1 ) , ( A 3 . 7 ) then again by Slutsky’s theorem, we know X t ( n ) → d N ( μ t , σ 2 t ) . (A3.8) X_t^{(n)} \;\stackrel{d}{\rightarrow}\; \mathcal{N}(\mu t, \sigma^2 t). \tag{A3.8} X t ( n ) → d N ( μ t , σ 2 t ) . ( A 3 . 8 ) So X t ( n ) X_t^{(n)} X t ( n ) converges to a normal distribution with drift μ t \mu t μ t and volatility σ t \sigma \sqrt{t} σ t .