Posts in Statistics (20 found)
Raph Koster 1 week ago

Looking back at a pandemic simulator

It’s been six years now since the early days of the Covid pandemic. People who were paying super close attention started hearing rumors about something going on in China towards the end of 2019 — my earliest posts about it on Facebook were from November that year. Even at the time, people were utterly clueless about the mathematics of how a highly infectious virus spreads. I remember spending hours writing posts on various social media sites explaining that the Infection Fatality Rates and the R value were showing that we could be looking at millions dead. People didn’t tend to believe me: “SEVERAL MILLION DEAD! Okay, I’m done. No one is predicting that. But you made me laugh. Thanks.”

You can do the math yourself. Use a low average death estimate of 0.4%. Assume 60% of the population catches it and then we reach herd immunity (which is generous). But that’s with low assumptions…

It was like typing to a wall. In fact, it’s pretty likely that it still is, since these days the discourse is all about how bad the economic and educational impact of lockdowns was — and not about the fact that if the world had acted in concert and forcefully, we could have had a much better outcome than we did. The health response was too soft, the lockdown too lenient, and as a result, we took all the hits.

Of course, these days people also forget just how deadly it was and how many died, and so on. We now know that the overall IFR was probably higher than 0.4%, but very strongly tilted towards older people and those with comorbidities. We also now know that herd immunity was a pipe dream — instead we managed to get vaccines out in record time, and the ordinary course of viral evolution ended up reducing the death rate until now we behave as if Covid is just a deadlier flu (it isn’t; that thinking ignores the long-term impact of the disease).
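A quick sketch of that napkin math, using the roughly 328 million US population figure quoted later in the post:

```python
# Napkin math from the post: 60% of the US population infected on the
# way to herd immunity, at a deliberately low 0.4% infection fatality rate.
us_population = 328_000_000
infected = 0.60 * us_population      # about 196.8 million
deaths = 0.004 * infected
print(f"{deaths:,.0f} dead")         # on the order of 780,000
```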
The upshot: my math was not that far off — the estimated toll in the US ended up being 1.2 to 1.4 million souls, and worldwide it’s estimated as between 15 and 28.5 million dead. Plenty of denial of this, these days, and plenty of folks blaming the vaccines for what are most likely issues caused by the disease in the first place.

Anyway, in the midst of it all, tired of running math in my spreadsheets (yeah, I was tracking it all in spreadsheets, what can I say?), I started thinking about why only a few sorts of people were wrapping their heads around the implications. The thing they all had in common was that they lived with exponential curves. Epidemiologists, Wall Street quants, statisticians… and game designers. Could we get more people to feel the challenges in their bones?

So… I posted this to Facebook on March 24th, 2020:

Three weeks ago I was idly thinking of how someone ought to make a little game that shows how the coronavirus spreads, how testing changes things, and how social distancing works. The sheer number of people who don’t get it — numerate people, who ought to be able to do math — is kind of shocking. I couldn’t help worrying at it, and have just about a whole design in my head. But I have to admit, I kinda figured someone would have made it by now. But they haven’t.

It’s not even a hard game to make. Little circles on a plain field. Each circle simply bounces around. They are generated each with an age, a statistically real chance of having a co-morbid condition (diabetes, hypertension, immunosuppressed, pulmonary issues…), and crucially, a name out of a baby book. They can be in one of several states (the full list is at the end of this post). In addition, there’s a diagnosed flag. We render asymptomatic the same as healthy. We render each of the other states differently, depending on whether the diagnosed flag is set. They show as healthy until dead, if not diagnosed. If diagnosed, you can see what stage they are in (icon or color change). The circles move and bounce.
If an asymptomatic one touches a healthy one, they have a statistically valid chance of infecting. Circles progress through these states using simple stats. We track current counts on all of these, and show a bar graph. Yes, that means players can see that people are getting sick, but don’t know where.

The player has a handful of buttons (the full list is at the end of this post). The game ticks through days at an accelerated pace. It runs for 18 months worth of days. At the end of it, you have a vaccine, and the epidemic is over. Then we tell you what percentage of your little world died. Maybe with a splash screen listing every name and age of everyone who died. And we show how much money you spent. Remember, you can go negative, and it’s OK.

That’s it. Ideally, it runs in a webpage. Itch.io maybe. Or maybe I have a friend with unlimited web hosting. Luxury features would be a little ini file or options screen that lets you input real world data for your town or country: percent hypertensive, age demographics, that sort of thing. Or maybe you could crowdsource it, so it’s a pulldown…

Each weekend I think about building this. So far, I haven’t, and instead I try to focus on family and mental health and work. But maybe someone else has the energy. I suspect it might persuade and save lives.

Some things about this that I want to point out in hindsight. Per the American Heart Association, roughly a third or more of American adults have high blood pressure; per the American Diabetes Association, more than a tenth of Americans have diabetes; and per studies in JAMA, about 4% are immunocompromised (the detailed figures are at the end of this post). Next, realize that because the disease spreads mostly inside households (where proximity means one case tends to infect others), this means that protecting the above extremely large slices of the population means either isolating them away from their families, or isolating the entire family and other regular contacts. People tend to think the at-risk population is small. It’s not.

The response, for Facebook, was pretty surprising.
The post was re-shared a lot, and designers from across the industry jumped in with tweaks to the rules. Some folks re-posted it to large groups about public initiatives, etc. There was also, of course, plenty of skepticism that something like this would make any difference at all. The first to take up the challenge was John Albano, who had his game Covid Ops up and running on itch.io a mere six days later. You can still play it there!

Stuck in the house and looking for things to do. Soooo, when a fellow game dev suggested a game idea and basic ruleset along with “I wish someone would make a game like this,” I took that as a challenge to try. Tonight (this morning?), the first release of COVID OPS has been published.

John’s game was pretty faithful to the sketch. You can see the comorbidities over on the left, and the way the player has clicked on 72-year-old Rowan — who probably isn’t going to make it. As he updated it, he added in more detailed comorbidity data, and (unfortunately, as it turns out) made it so that people were immune after recovering from infection. And of course, like the next one I’ll talk about, John made a point of including real world resource links so that people could take action.

By April 6th, another team led by Khail Santia had participated in Jamdemic 2020 and developed the first version of In the Time of Pandemia. He wrote,

The compound I stay at is about to be cordoned. We’ve been contact-traced by the police, swabbed by medical personnel covered in protective gear. One of our housemates works at a government hospital and tested positive for antibodies against SARS-CoV-2. The pandemic closes in from all sides. What can a game-maker do in a time like this? I’ve been asking myself this question since the beginning of community quarantine. I’m based in Cebu City, now the top hotspot for COVID-19 in the Philippines in terms of incidence proportion.
This game would go on to be completed by a fuller team including a mathematical epidemiologist, and In the Time of Pandemia eventually ended up topping charts on Newgrounds when it launched there in July of 2020. The game went viral and got a ton of press across the Pacific Rim. The team worked closely with universities and doctors in the Philippines and validated all the numbers. They added local flavor to their levels, representing cities and neighborhoods that their local players would know.

Gregg Victor Gabison, dean of the University of San Jose-Recoletos College of Information, Computer & Communications Technology, whose students play-tested the game, said, “This is the kind of game that mindful individuals would want to check out. It has substance and a storyline that connects with reality, especially during this time of pandemic.” Not only does the game have to work on a technical basis, it has to communicate how real a crisis the pandemic is in a simple, digestible manner.

Dr. Mariane Faye Acma, resident physician at Medidas Medical Clinic in Valencia, Bukidnon, was consulted to assess the game’s medical plausibility. She enumerated critical thinking, analysis, and multitasking as skills developed through this game. “You decide who are the high risks, who needs to be tested and isolated, where to focus, [and] how much funds to allocate….The game will make players realize how challenging the work of the health sector is in this crisis.”

“Ultimately, the game’s purpose is to give players a visceral understanding of what it takes to flatten the curve,” Santia said.

I think most people have no idea that any of this happened or that I was associated with it. I only posted the design sketch on Facebook; it got reshared across a few thousand people. It wasn’t on other social media, I didn’t talk about it elsewhere, and for whatever reason, I didn’t blog about it. I have had both these games listed on my CV for a while.
Oh, I didn’t do any of the heavy lifting… all credit goes to the developers for that. There’s no question that way more than 95% of the work comes after the high-level design spec. But both games do credit me, and I count them as games I worked on. A while back, someone on Reddit said it was pathetic that I listed these. I never quite know what to make of comments like that (troll much?!?). No offense, but I’m proud of what a little design sketch turned into, and proud of the work that these teams did, and proud that one of the games got written up in the press so much; ended up being used in college classrooms; was vetted and validated by multiple experts in the field; and made a difference, however slight.

Peak Covid was a horrendous time. Horrendous enough that we have kind of blocked it from our memories. But I lost friends and colleagues. I still remember. Back then I wrote,

This is the largest event in your lifetime. It is our World War, our Great Depression. We need to rise to the occasion, and think about how we change. There is no retreat to how it used to be. There is only through.

A year later, the vaccine gave us that path through, and here we are now. But as I write this, we have the first human case of H5N5 bird flu; it was only a matter of time. Maybe these games helped a few people get through it all. They were played by tens of thousands, after all. Maybe they will help next time. I know that the fact that they were made helped me get through, that making them helped John get through, helped Khail get through — in his own words:

In the end, the attempt to articulate a game-maker’s perspective on COVID-19 has enabled me to somehow transcend the chaos outside and the turmoil within. It’s become a welcome respite from isolation, a thread connecting me to a diversity of talents who’ve been truly generous with their expertise and encouragement.
As incidences continue to rise here and in many parts of the world, our hope is that the game will be of some use in showing what it takes to flatten the curve and in advocating for communities most in need.

So… at minimum, they made a real difference to at least three people. And that’s not a bad thing for a game to aspire to.

The math:
- 328 million people in the US.
- 60% of that is 196 million catch it.
- 0.4% of that is 780,000 dead.

The states:
- asymptomatic but contagious
- symptomatic

How circles progress:
- 70% of asymptomatic cases turn symptomatic after 1d10+5 days. The others stay sick for the full 21 days.
- Percent chance of moving from symptomatic to severe is based on comorbid conditions, but the base chance is 1 in 5 after some number of days.
- Percent chance of moving from severe to critical is 1 in 4, modified by age and comorbidities, if in hospital. Otherwise, it’s double.
- Percent chance of moving from critical to dead is something like 1 in 5, modified by age and comorbidities, if in hospital. Otherwise, it’s double.
- Symptomatic, severe, and critical circles that do not progress to dead move to ‘recovered’ after 21 days since reaching symptomatic.
- Severe and critical circles stop moving.
- Hover on a circle, and you see the circle’s name and age and any comorbidities (“Alison, 64, hypertension.”)

The player’s buttons:
- Test. This lets them click on a circle. If the circle is asymptomatic or worse, it gets the diagnosed flag. But it costs you one test.
- Isolate. This lets them click on a circle, and freezes them in place. Some visual indicator shows they are isolated. Note that isolated cases still progress.
- Hospitalize. This moves the circle to hospital. Hospital only has so many beds. Clicking on a circle already in hospital drops the circle back out in the world. Circles in hospital have half the chance of progressing to the next stage.
- Buy test. You only have so many tests. You have to click this button to buy more.
- Buy bed. You only have so many beds. You have to click this button to buy more.
Money goes up when circles move. But you are allowed to go negative on money.

Lockdown. Lastly, there is a global button that, when pressed, freezes 80% of all circles. But it gradually ticks down and circles individually start to move again, and the button must be pressed again from time to time. While lockdown is running, it costs money as well as not generating it. If pressed again, it lifts the lockdown and all circles can move again.

At the time that I posted, I could tell that people were desperately unwilling to enter lockdown for any extended period of time; but “The Hammer and the Dance” strategy of pulsed lockdown periods was still very much in our future. I wanted a mechanic that showed population non-compliance.

There was also quite a lot of obsessing over case counts at the time, and one of the things that I really wanted to get across was that our testing was so incredibly inadequate that we really had little idea of how many cases we were dealing with and therefore what the IFR (infection fatality rate) actually was. That’s why tests are limited in the design sketch.

I was also trying to get across that money was not a problem in dealing with this. You could take the money value negative because governments can choose to do that. I often pointed out in those days that if the government chose, it could send a few thousand dollars to every household every few weeks for the duration of lockdown. It would likely have had less impact on the GDP and the debt than what we actually did.

I wanted names. I wanted players to understand the human cost, not just the statistics. Today, I might even suggest that an LLM generate a little biography for every fatality.

Another thing that was constantly missed was the impact of comorbidities. To this day, I hear people say “ah, it only affected the old and the ill, so why not have stayed open?” To which I would reply with:

Per the American Heart Association, among adults age 20 and older in the United States, the following have high blood pressure:
- For non-Hispanic whites, 33.4 percent of men and 30.7 percent of women.
- For non-Hispanic Blacks, 42.6 percent of men and 47.0 percent of women.
- For Mexican Americans, 30.1 percent of men and 28.8 percent of women.

Per the American Diabetes Association:
- 34.2 million Americans, or 10.5% of the population, have diabetes.
- Nearly 1.6 million Americans have type 1 diabetes, including about 187,000 children and adolescents.

Per studies in JAMA:
- 4.2% of the population of the USA has been diagnosed as immunocompromised by their doctor.
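The progression rules in the design sketch can be sanity-checked with a tiny Monte Carlo sketch. This is my own simplification, not code from either game: day counters are collapsed into a single roll per stage, and the age/comorbidity modifiers are left out.

```python
import random

def simulate_circle(in_hospital, rng=random.random):
    """Follow one infected circle through the sketch's stage rolls."""
    mult = 1.0 if in_hospital else 2.0   # chances double outside hospital
    if rng() >= 0.70:                    # 30% never turn symptomatic
        return "recovered"
    if rng() >= 1 / 5:                   # symptomatic -> severe, base 1 in 5
        return "recovered"
    if rng() >= min(1.0, mult * 1 / 4):  # severe -> critical
        return "recovered"
    if rng() >= min(1.0, mult * 1 / 5):  # critical -> dead
        return "recovered"
    return "dead"

random.seed(7)
n = 50_000
ifr_home = sum(simulate_circle(False) == "dead" for _ in range(n)) / n
ifr_hosp = sum(simulate_circle(True) == "dead" for _ in range(n)) / n
print(f"no hospital: {ifr_home:.1%}, hospitalized: {ifr_hosp:.1%}")
```

With these base rates the arithmetic works out to roughly 0.7 × 0.2 × (1/2) × (2/5) ≈ 2.8% of infected circles dying untreated, versus 0.7 × 0.2 × (1/4) × (1/5) = 0.7% in hospital — which is exactly the point of the limited-beds mechanic.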

1 view
emiruz 1 month ago

Modelling beliefs about sets

Here is an interesting scheme I encountered in the wild, generalised and made abstract for you, my intrepid reader. Let \(X\) be a set of binary variables. We are given information about subsets of \(X\), where each update is a probability ranging over a concrete set, the state of which is described by an arbitrary quantified logic formula. For example, \[P\bigg\{A \subset X \;\bigg|\; \exists_{x_i, x_j \in A} \big(x_i \ne x_j\big)\bigg\} = p\] The above assigns a probability \(p\) to some concrete subset \(A\), with the additional information that at least one pair of its members do not have the same value.
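To make the formula concrete, here is a quick enumeration (my own toy example, not from the post) for a subset \(A\) of three binary variables: the condition \(\exists_{x_i, x_j \in A}(x_i \ne x_j)\) rules out exactly the two constant assignments.

```python
from itertools import product

# All 2^3 joint states of a 3-element subset A of binary variables;
# keep those where at least one pair of members differs, i.e. drop
# the all-zeros and all-ones states.
states = list(product([0, 1], repeat=3))
mixed = [s for s in states if len(set(s)) > 1]
print(f"{len(mixed)} of {len(states)} states satisfy the formula")  # 6 of 8
```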

0 views
Alex Molas 1 month ago

Bayesian A/B testing is not immune to peeking

Introduction

Over the last few months at RevenueCat I’ve been building a statistical framework to flag when an A/B test has reached statistical significance. I went through the usual literature, including Evan Miller’s posts. In his well-known “How Not to Run an A/B Test” there’s a claim that with Bayesian experiment design you can stop at any time and still make valid inferences, and that you don’t need a fixed sample size to get a valid result. I’ve read this claim in other posts. The impression I got is that you can peek as often as you want, stop the moment the posterior clears a threshold (e.g. $P(A>B) > 0.95$), and you won’t inflate false positives. And this is not correct. If you’re an expert in Bayesian statistics this is probably obvious, but it wasn’t to me. So I decided to run some simulations to see what really happens, and I’m sharing the results here in case they’re useful for others.
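The claim is easy to probe with a small simulation of your own (the parameter choices below are illustrative, not RevenueCat's): give two identical arms, estimate P(A>B) under uniform Beta priors at every peek, and stop the moment it clears 95% in either direction. Since the arms are identical, every early stop is a false positive, and in runs like this the early-stop rate typically lands well above the nominal 5%.

```python
import random

def prob_a_beats_b(a_succ, a_tot, b_succ, b_tot, draws=500):
    """Monte Carlo estimate of P(rate_A > rate_B) under Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        pa = random.betavariate(1 + a_succ, 1 + a_tot - a_succ)
        pb = random.betavariate(1 + b_succ, 1 + b_tot - b_succ)
        wins += pa > pb
    return wins / draws

def peeking_experiment(true_rate=0.05, peek_every=50, max_n=1000, threshold=0.95):
    """Both arms share the same conversion rate, so any declared winner
    is a false positive. Stop as soon as the posterior clears the bar."""
    a = b = 0
    for n in range(1, max_n + 1):
        a += random.random() < true_rate
        b += random.random() < true_rate
        if n % peek_every == 0:
            p = prob_a_beats_b(a, n, b, n)
            if p > threshold or p < 1 - threshold:
                return True   # peeked, saw "significance", stopped early
    return False

random.seed(42)
trials = 100
false_positives = sum(peeking_experiment() for _ in range(trials)) / trials
print(f"early-stop (false positive) rate with peeking: {false_positives:.0%}")
```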

0 views
DYNOMIGHT 2 months ago

Dear PendingKetchup

PendingKetchup comments on my recent post on what it means for something to be heritable:

The article seems pretty good at math and thinking through unusual implications, but my armchair Substack eugenics alarm that I keep in the back of my brain is beeping. Saying that variance was “invented for the purpose of defining heritability” is technically correct, but that might not be the best kind of correct in this case, because it was invented by the founder of the University of Cambridge Eugenics Society who had decided, presumably to support that project, that he wanted to define something called “heritability”. His particular formula for heritability is presented in the article as if it has odd traits but is obviously basically a sound thing to want to calculate, despite the purpose it was designed for. The vigorous “educational attainment is 40% heritable, well OK maybe not but it’s a lot heritable, stop quibbling” hand waving sounds like a person who wants to show but can’t support a large figure. And that framing of education, as something “attained” by people, rather than something afforded to or invested in them, is almost completely backwards at least through college. The various examples about evil despots and unstoppable crabs highlight how heritability can look large or small independent of more straightforward biologically-mechanistic effects of DNA. But they still give the impression that those are the unusual or exceptional cases. In reality, there are in fact a lot of evil crabs, doing things like systematically carting away resources from Black children’s* schools, and then throwing them in jail. We should expect evil-crab-based explanations of differences between people to be the predominant ones.

*Not to say that being Black “is genetic”. Things from accent to how you style your hair to how you dress to what country you happen to be standing in all contribute to racial judgements used for racism.
But “heritability” may not be the right tool to disentangle those effects.

Dear PendingKetchup,

Thanks for complimenting my math (♡), for reading all the way to the evil crabs, and for not explicitly calling me a racist or eugenicist. I also appreciate that you chose sincerity over boring sarcasm and that you painted such a vibrant picture of what you were thinking while reading my post. I hope you won’t mind if I respond in the same spirit.

To start, I’d like to admit something. When I wrote that post, I suspected some people might have reactions similar to yours. I don’t like that. I prefer positive feedback! But I’ve basically decided to just let reactions like yours happen, because I don’t know how to avoid them without compromising on other core goals.

It sounds like my post gave you a weird feeling. Would it be fair to describe it as a feeling that I’m not being totally upfront about what I really think about race / history / intelligence / biological determinism / the ideal organization of society? Because if so, you’re right. It’s not supposed to be a secret, but it’s true.

Why? Well, you may doubt this, but when I wrote that post, my goal was that people who read it would come away with a better understanding of the meaning of heritability and how weird it is. That’s it. Do I have some deeper and darker motivations? Probably. If I probe my subconscious, I find traces of various embarrassing things like “draw attention to myself” or “make people think I am smart” or “after I die, live forever in the world of ideas through my amazing invention of blue-eye-seeking / human-growth-hormone-injecting crabs.” What I don’t find are any goals related to eugenics, Ronald Fisher, the heritability of educational attainment, if “educational attainment” is good terminology, racism, oppression, schools, the justice system, or how society should be organized. These were all non-goals for basically two reasons: My views on those issues aren’t very interesting or notable.
I didn’t think anyone would (or should) care about them. Surely, there is some place in the world for things that just try to explain what heritability really means? If that’s what’s promised, then it seems weird to drop in a surprise morality / politics lecture. At the same time, let me concede something else. The weird feeling you got as you read my post might be grounded in statistical truth. That is, it might be true that many people who blog about things like heritability have social views you wouldn’t like. And it might be true that some of them pretend at truth-seeking but are mostly just charlatans out to promote those unliked-by-you social views. You’re dead wrong to think that’s what I’m doing. All your theories of things I’m trying to suggest or imply are unequivocally false. But given the statistical realities, I guess I can’t blame you too much for having your suspicions. So you might ask—if my goal is just to explain heritability, why not make that explicit? Why not have a disclaimer that says, “OK I understand that heritability is fraught and blah blah blah, but I just want to focus on the technical meaning because…”? One reason is that I think that’s boring and condescending. I don’t think people need me to tell them that heritability is fraught. You clearly did not need me to tell you that. Also, I don’t think such disclaimers make you look neutral. Everyone knows that people with certain social views (likely similar to yours) are more likely to give such disclaimers. And they apply the same style of statistical reasoning you used to conclude I might be a eugenicist. I don’t want people who disagree with those social views to think they can’t trust me. Paradoxically, such disclaimers often seem to invite more objections from people who share the views they’re correlated with, too. Perhaps that’s because the more signals we get that someone is on “our” side, the more we tend to notice ideological violations. 
(I’d refer here to the narcissism of small differences, though I worry you may find that reference objectionable.) If you want to focus on the facts, the best strategy seems to be serene and spiky: to demonstrate by your actions that you are on no one’s side, that you don’t care about being on anyone’s side, and that your only loyalty is to readers who want to understand the facts and make up their own damned mind about everything else.

I’m not offended by your comment. I do think it’s a little strange that you’d publicly suggest someone might be a eugenicist on the basis of such limited evidence. But no one is forcing me to write things and put them on the internet. The reason I’m writing to you is that you were polite and civil and seem well-intentioned. So I wanted you to know that your world model is inaccurate. You seem to think that because my post did not explicitly support your social views, it must have been written with the goal of undermining those views. And that is wrong. The truth is, I wrote that post without supporting your (or any) social views because I think mixing up facts and social views is bad. Partly, that’s just an aesthetic preference. But if I’m being fully upfront, I also think it’s bad in the consequentialist sense that it makes the world a worse place.

Why do I think this? Well, recall that I pointed out that if there were crabs that injected blue-eyed babies with human growth hormone, that would increase the heritability of height. You suggest I had sinister motives for giving this example, as if I was trying to conceal the corollary that if the environment provided more resources to people with certain genes (e.g. skin color) that could increase the heritability of other things (e.g. educational attainment). Do you really think you’re the only reader to notice that corollary? The degree to which things are “heritable” depends on the nature of society. This is a fact. It’s a fact that many people are not aware of.
It’s also a fact that—I guess—fits pretty well with your social views. I wanted people to understand that. Not out of loyalty to your social views, but because it is true. It seems that you’re annoyed that I didn’t phrase all my examples in terms of culture war. I could have done that. But I didn’t, because I think my examples are easier to understand, and because the degree to which changing society might change the heritability of some trait is a contentious empirical question. But OK. Imagine I had done that. And imagine all the examples were perfectly aligned with your social views. Do you think that would have made the post more or less effective in convincing people that the fact we’re talking about is true? I think the answer is: Far less effective.

I’ll leave you with two questions.

Question 1: Do you care about the facts? Do you believe the facts are on your side?

Question 2: Did you really think I wrote that post with the goal of promoting eugenics?

If you really did think that, then great! I imagine you’ll be interested to learn that you were incorrect. But just as you had an alarm beeping in your head as you read my post, I had one beeping in my head as I read your comment. My alarm was that you were playing a bit of a game. It’s not that you really think I wanted to promote eugenics, but rather that you’re trying to enforce a norm that everyone must give constant screaming support to your social views and anyone who’s even slightly ambiguous should be ostracized. Of course, this might be a false alarm! But if that is what you’re doing, I have to tell you: I think that’s a dirty trick, and a perfect example of why mixing facts and social views is bad. You may disagree with all my motivations. That’s fine. (I won’t assume that means you are a eugenicist.) All I ask is that you disapprove accurately.

xox
dynomight

0 views
emiruz 3 months ago

A short statistical reasoning test

Here are a few practical questions of my own invention, which are easy to comprehend but very difficult to solve without statistical reasoning competence. They are provided in order of difficulty. The answers are at the end. If you find errors or have elegant alternative solutions, please email me (address in bio)!

QUESTIONS

1. Sorting fractions under uncertainty

You are given the number of trials and successes for a set of items, and you are asked to sort them by the fraction #successes / #trials.
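The author's answers are at the end of the original post; as one illustrative (and standard) approach, you can rank by a smoothed posterior estimate instead of the raw fraction, so that a 1-for-1 item doesn't outrank a 90-for-100 one. The items below are made up for the sketch.

```python
# Rank items by the posterior mean (s + 1) / (n + 2) of their success
# rate under a uniform Beta(1, 1) prior, rather than the raw s / n.
items = [("a", 1, 1), ("b", 90, 100), ("c", 0, 10)]   # (name, successes, trials)

def posterior_mean(successes, trials):
    return (successes + 1) / (trials + 2)

ranked = sorted(items, key=lambda t: posterior_mean(t[1], t[2]), reverse=True)
print([name for name, _, _ in ranked])   # ['b', 'a', 'c']
```

Sorting by the raw fraction would put "a" (1/1 = 100%) first; the smoothed estimate correctly prefers the item with far more evidence.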

0 views
DYNOMIGHT 3 months ago

Futarchy’s fundamental flaw — the market — the blog post

Here’s our story so far: Markets are a good way to know what people really think. When India and Pakistan started firing missiles at each other on May 7, I was concerned, what with them both having nuclear weapons. But then I looked at world market prices: See how it crashes on May 7? Me neither. I found that reassuring.

But we care about lots of stuff that isn’t always reflected in stock prices, e.g. the outcomes of elections or drug trials. So why not create markets for those, too? If you create contracts that pay out $1 only if some drug trial succeeds, then the prices will reflect what people “really” think. In fact, why don’t we use markets to make decisions? Say you’ve invented two new drugs, but only have enough money to run one trial. Why don’t you create markets for both drugs, then run the trial on the drug that gets a higher price? Contracts for the “winning” drug are resolved based on the trial, while contracts in the other market are cancelled so everyone gets their money back. That’s the idea of Futarchy, which Robin Hanson proposed in 2007.

Why don’t we? Well, maybe it won’t work. In 2022, I wrote a post arguing that when you cancel one of the markets, you screw up the incentives for how people should bid, meaning prices won’t reflect the causal impact of different choices. I suggested prices reflect “correlation” rather than causation, for basically the same reason this happens with observational statistics. This post, it was magnificent. It didn’t convince anyone.

Years went by. I spent a lot of time reading Bourdieu and worrying about why I buy certain kinds of beer. Gradually I discovered that essentially the same point about futarchy had been made earlier by, e.g., Anders_H in 2015, abramdemski in 2017, and Luzka in 2021. In early 2025, I went to a conference and got into a bunch of (friendly) debates about this. I was astonished to find that verbally repeating the arguments from my post did not convince anyone.
I even immodestly asked one person to read my post on the spot. (Bloggers: Do not do that.) That sort of worked. So, I decided to try again. I wrote another post called “Futarky’s Futarchy’s fundamental flaw”. It made the same argument with more aggression, with clearer examples, and with a new impossibility theorem that showed there doesn’t even exist any alternate payout function that would incentivize people to bid according to their causal beliefs.

That post… also didn’t convince anyone. In the discussion on LessWrong, many of my comments are upvoted for quality but downvoted for accuracy, which I think means, “nice try champ; have a head pat; nah.” Robin Hanson wrote a response, albeit without outward evidence of reading beyond the first paragraph. Even the people who agreed with me often seemed to interpret me as arguing that futarchy satisfies evidential decision theory rather than causal decision theory. Which was weird, given that I never mentioned either of those, don’t accept the premise that futarchy satisfies either of them, and don’t find the distinction helpful in this context.

In my darkest moments, I started to wonder if I might fail to achieve worldwide consensus that futarchy doesn’t estimate causal effects. I figured I’d wait a few years and then launch another salvo. But then, legendary human Bolton Bailey decided to stop theorizing, take one of my thought experiments, and turn it into an actual experiment. Thus, Futarchy’s fundamental flaw — the market was born. (You are now reading a blog post about that market.)

I gave a thought experiment where there are two coins and the market is trying to pick the one that’s more likely to land heads. For one coin, the bias is known, while for the other coin there’s uncertainty. I claimed futarchy would select the worse / wrong coin, due to this extra uncertainty. Bolton formalized this as follows: There are two markets, one for coin A and one for coin B.
Coin A is a normal coin that lands heads 60% of the time. Coin B is a trick coin that either always lands heads or always lands tails; we just don’t know which. There’s a 59% chance it’s an always-heads coin. Twenty-four hours before markets close, the true nature of coin B is revealed. After the markets close, whichever coin has a higher price is flipped and contracts pay out $1 for heads and $0 for tails. The other market is cancelled so everyone gets their money back. Get that? Everyone knows that there’s a 60% chance coin A will land heads and a 59% chance coin B will land heads. But for coin A, that represents true “aleatoric” uncertainty, while for coin B it represents “epistemic” uncertainty due to a lack of knowledge. (See Bayes is not a phase for more on “aleatoric” vs. “epistemic” uncertainty.) Bolton created that market independently. At the time, we’d never communicated about this or anything else. To this day, I have no idea what he thinks about my argument or what he expected to happen. In the forum for the market, there was a lot of debate about “whalebait”. Here’s the concern: Say you’ve bought a lot of contracts for coin B, but it emerges that coin B is always-tails. If you have a lot of money, then you might go in at the last second and buy a ton of contracts on coin A to try to force the market price above coin B’s, so the coin B market is cancelled and you get your money back. The conversation seemed to converge towards the idea that this was whalebait. Though notice that if you’re buying contracts for coin A at any price above $0.60, you’re basically giving away free money. It could still work, but it’s dangerous, and everyone else has an incentive to stop you. If I were betting in this market, I’d think that this was at least unlikely. Bolton posted about the market. When I first saw the rules, I thought it wasn’t a valid test of my theory and wasted a huge amount of Bolton’s time trying to propose other experiments that would “fix” it.
Bolton was very patient, but I eventually realized that it was completely fine and there was nothing to fix. At the time, this is what the prices looked like: That is, at the time, both coins were priced at $0.60, which is not what I had predicted. Nevertheless, I publicly agreed that this was a valid test of my claims: I think this is a great test and look forward to seeing the results. Let me reiterate why I thought the markets were wrong and coin B deserved a higher price. There’s a 59% chance coin B would turn out to be all-heads. If that happened, then (absent whales being baited) I thought the coin B market would activate, so contracts would be worth $1. So that’s 59% × $1 = $0.59 of value. But if coin B turned out to be all-tails, I thought there was a good chance prices for coin B would drop below coin A, so the market would be cancelled and you’d get your money back. So I thought a contract had to be worth more than $0.59. Let P be the probability that the coin B market is cancelled, given that coin B turns out to be all-tails. If you buy a contract for coin B for $0.70, then I think that’s worth 59% × $1 + 41% × P × $0.70. Surely P isn’t that low. So surely this is worth more than $0.59. More generally, say you buy a YES contract for coin B for $M. Then that contract would be worth 59% × $1 + 41% × P × $M. It’s not hard to show that the breakeven price is M = $0.59 / (1 − 0.41 × P). Even if you thought P was only 50%, then the breakeven price would still be $0.59 / (1 − 0.205) ≈ $0.7421. Within a few hours, a few people bought contracts on coin B, driving up the price. Then, Quroe proposed creating derivative markets. In theory, if there was a market asking whether coin A was going to resolve YES, NO, or N/A, supposedly people could arbitrage their bets accordingly and make this market calibrated. Same for a similar market on coin B. Thus, Futarchy’s Fundamental Fix - Coin A and Futarchy’s Fundamental Fix - Coin B came to be. These were markets in which people could bid on the probability that each coin would resolve YES, meaning the coin was flipped and landed heads; NO, meaning the coin was flipped and landed tails; or N/A, meaning the market was cancelled.
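Before going on: the contract-value and breakeven arithmetic above is easy to check numerically. Here’s a quick sanity check (my own, not from the market itself), where p is the probability that the coin B market is cancelled given that coin B turns out to be all-tails:

```python
# Value of a YES contract on coin B bought at price m:
# with prob 0.59 coin B is all-heads, the market activates, and the
# contract pays $1; with prob 0.41 it's all-tails, in which case with
# prob p the market is cancelled and you get your $m back.
def contract_value(m, p):
    return 0.59 * 1.0 + 0.41 * p * m

# The breakeven price solves m = contract_value(m, p):
def breakeven(p):
    return 0.59 / (1 - 0.41 * p)

# Inverting the breakeven relation gives the cancellation probability
# implied by a given market price:
def implied_p(m):
    return (1 - 0.59 / m) / 0.41

print(breakeven(0.5))   # ≈ 0.7421
print(implied_p(0.90))  # ≈ 0.84
```

Even at a fairly pessimistic p = 50%, the breakeven price is already well above the $0.59 “naive” value.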
Honestly, I didn’t understand this. I saw no reason that these derivative markets would make people bid their true beliefs. If they did, then my whole theory that markets reflect correlation rather than causation would be invalidated. Prices for coin B went up and down, but mostly up. Eventually, a few people created large limit orders, which caused things to stabilize. Here was the derivative market for coin A. And here was the one for coin B. During this period, not a whole hell of a lot happened. This brings us up to the moment of truth, when the true nature of coin B was to be revealed. At this point, coin B was at $0.90, even though everyone knows it only has a 59% chance of being heads. The nature of the coin was revealed. To show this was fair, Bolton did this by asking a bot to publicly generate a random number. Thus, coin B was determined to be always-heads. There were still 24 hours left to bid. At this point, a contract for coin B was guaranteed to pay out $1. The market quickly jumped to $1. I was right. Everyone knew coin A had a higher chance of being heads than coin B, but everyone bid the price of coin B way above coin A anyway. In the previous math box, we saw that the breakeven price should satisfy M = $0.59 / (1 − 0.41 × P). If you invert this and plug in M = $0.90, then you get P = (1 − $0.59/$0.90) / 0.41 ≈ 84%. I’ll now open the floor for questions. Isn’t this market unrealistic? Yes, but that’s kind of the point. I created the thought experiment because I wanted to make the problem maximally obvious, because it’s subtle and everyone is determined to deny that it exists. Isn’t this just a weird probability thing? Why does this show futarchy is flawed? The fact that this is possible is concerning. If this can happen, then futarchy does not work in general. If you want to claim that futarchy works, then you need to spell out exactly what extra assumptions you’re adding to guarantee that this kind of thing won’t happen. But prices did reflect causality when the market closed!
Doesn’t that mean this isn’t a valid test? No. That’s just a quirk of the implementation. You can easily create situations that would have the same issue all the way through market close. Here’s one way you could do that: On average, this market will run for 30 days. (The length follows a geometric distribution.) Half the time, the market will close without the nature of coin B being revealed. Even when that happens, I claim the price for coin B will still be above coin A. If futarchy is flawed, shouldn’t you be able to show that without this weird step of “revealing” coin B? Yes. You should be able to do that, and I think you can. Here’s one way: First, have users generate public keys. Second, they should post their public keys when asking for their bit. Third, whoever is running the market should save each key, pick a bit, and encrypt it to that user, who can then decrypt it. Or you could use email… I think this market captures a dynamic that’s present in basically any use of futarchy: You have some information, but you know other information is out there. I claim that this market will be weird. Say it just opened. If you didn’t get a bit, then as far as you know, the bias for coin B could be anywhere between 49% and 69%, with a mean of 59%. If you did get a bit, then it turns out that the posterior mean is 58.5% if you got a 0 and 59.5% if you got a 1. So either way, your best guess is very close to 59%. However, the information for the true bias of coin B is out there! Surely coin B is more likely to end up with a higher price in situations where there are lots of 1 bits. This means you should bid at least a little higher than your true belief, for the same reason as in the main experiment—the market activating is correlated with the true bias of coin B. Of course, after the markets open, people will see each other’s bids and… something will happen.
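The 58.5% / 59.5% posterior means can be verified by direct computation over the other 19 bits, which remain Binomial(19, 1/2) after you observe your own bit:

```python
from math import comb

# Coin B's bias is (49 + N)%, where N is the number of 1s among 20 fair
# bits. If you observe one bit b, the remaining 19 bits K follow
# Binomial(19, 1/2), with mean 9.5.
def posterior_mean_bias(b):
    mean_k = sum(k * comb(19, k) for k in range(20)) / 2**19  # = 9.5
    return 49 + b + mean_k  # posterior mean bias, in percent

print(posterior_mean_bias(0))  # 58.5
print(posterior_mean_bias(1))  # 59.5
```

Either way, your single bit barely moves your own estimate, even though the 20 bits jointly pin down the bias.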
Initially, I think prices will be strongly biased for the above reasons. But as you get closer to market close, there’s less time for information to spread. If you are the last person to trade, and you know you’re the last person to trade, then you should do so based on your true beliefs. Except, everyone knows that there’s less time for information to spread. So while you are waiting till the last minute to reveal your true beliefs, everyone else will do the same thing. So maybe people sort of rush in at the last second? (It would be easier to think about this if implemented with batched auctions rather than a real-time market.) Anyway, while the game theory is vexing, I think there’s a mix of (1) people bidding higher than their true beliefs due to correlations between the final price and the true bias of coin B and (2) people “racing” to make the final bid before the markets close. Both of these seem in conflict with the idea of prediction markets making people share information and measuring collective beliefs. Why do you hate futarchy? I like futarchy. I think society doesn’t make decisions very well, and I think we should give much more attention to new ideas like futarchy that might help us do better. I just think we should be aware of its imperfections and consider variants (e.g. committing to randomization) that would resolve them. If I claim futarchy does reflect causal effects, and I reject this experiment as invalid, should I specify what restrictions I want to place on “valid” experiments (and thus make explicit the assumptions under which I claim futarchy works), since otherwise my claims are unfalsifiable?
For reference, here are the two alternative setups described above. The first: Let coin A be heads with probability 60%. This is public information. Let coin B be an ALWAYS HEADS coin with probability 59% and an ALWAYS TAILS coin with probability 41%. This is a secret. Every day, generate a random integer between 1 and 30. If it’s 1, immediately resolve the markets. If it’s 2, reveal the nature of coin B. If it’s between 3 and 30, do nothing. The second: Let coin A be heads with probability 60%. This is public information. Sample 20 random bits, e.g. …. Let coin B be heads with probability (49+N)%, where N is the number of 1 bits. Do not reveal these bits publicly. Secretly send these bits to the first 20 people who ask.
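The claimed properties of the random-reveal setup (a mean run of 30 days, and a 50% chance the market closes before coin B is revealed) can be checked with a quick simulation; this is my own sketch of the daily-roll process as described:

```python
import random

random.seed(0)

# Each day, roll 1..30: 1 = resolve (market closes), 2 = reveal coin B's
# nature (market keeps running), anything else = nothing happens.
def run_market():
    day, revealed = 0, False
    while True:
        day += 1
        roll = random.randint(1, 30)
        if roll == 1:
            return day, revealed
        if roll == 2:
            revealed = True

results = [run_market() for _ in range(100_000)]
mean_len = sum(d for d, _ in results) / len(results)
frac_unrevealed = sum(not r for _, r in results) / len(results)
print(mean_len)         # ≈ 30 (geometric with p = 1/30)
print(frac_unrevealed)  # ≈ 0.5 (roll 1 and roll 2 are equally likely to come first)
```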

DYNOMIGHT 3 months ago

Heritability puzzlers

The heritability wars have been a-raging. Watching these, I couldn’t help but notice that there’s near-universal confusion about what “heritable” means. Partly, that’s because it’s a subtle concept. But it also seems relevant that almost all explanations of heritability are very, very confusing. For example, here’s Wikipedia’s definition: Any particular phenotype can be modeled as the sum of genetic and environmental effects: Phenotype (P) = Genotype (G) + Environment (E). Likewise the phenotypic variance in the trait, Var(P), is the sum of effects as follows: Var(P) = Var(G) + Var(E) + 2 Cov(G, E). In a planned experiment Cov(G, E) can be controlled and held at 0. In this case, heritability, H², is defined as H² = Var(G) / Var(P). H² is the broad-sense heritability. Do you find that helpful? I hope not, because it’s a mishmash of undefined terminology, unnecessary equations, and borderline-false statements. If you’re in the mood for a mini-polemic: Reading this almost does more harm than good. While the final definition is correct, it never even attempts to explain what G and P are, it gives an incorrect condition for when the definition applies, and instead mostly devotes itself to an unnecessary digression about environmental effects. The rest of the page doesn’t get much better. Despite being 6700 words long, I think it would be impossible to understand heritability simply by reading it. Meanwhile, some people argue that heritability is meaningless for human traits like intelligence or income or personality. They claim that those traits are the product of complex interactions between genes and the environment and that it’s impossible to disentangle the two. These arguments have always struck me as “suspiciously convenient”. I figured that the people making them couldn’t cope with the hard reality that genes are very important and have an enormous influence on what we are. But I increasingly feel that the skeptics have a point.
While I think it’s a fact that most human traits are substantially heritable, it’s also true that the technical definition of heritability is really weird and simply does not mean what most people think it means. In this post, I will explain exactly what heritability is, while assuming no background. I will skip everything that can be skipped but—unlike most explanations—I will not skip things that can’t be skipped. Then I’ll go through a series of puzzles demonstrating just how strange heritability is. How tall you are depends on your genes, but also on what you eat, what diseases you got as a child, and how much gravity there is on your home planet. And all those things interact. How do you take all that complexity and reduce it to a single number, like “80% heritable”? The short answer is: Statistical brute force. The long answer is: Read the rest of this post. It turns out that the hard part of heritability isn’t heritability. Lurking in the background is a slippery concept known as a genotypic value. Discussions of heritability often skim past it. Quite possibly, just looking at the words “genotypic value”, you are thinking about skimming ahead right now. Resist that urge! Genotypic values are the core concept, and without them you cannot possibly understand heritability. For any trait, your genotypic value is the “typical” outcome if someone with your DNA were raised in many different random environments. In principle, if you wanted to know your genotypic height, you’d need to raise many people with your DNA in many different random environments and average their adult heights. Since you can’t / shouldn’t do that, you’ll never know your genotypic height. But that’s how it’s defined in principle—the average height someone with your DNA would grow to in a random environment. If you got lots of food and medical care as a child, your actual height is probably above your genotypic height. If you suffered from rickets, your actual height is probably lower than your genotypic height. Comfortable with genotypic values? OK.
Then (broad-sense) heritability is easy. It’s the ratio H² = Var(genotypic height) / Var(phenotypic height). Here, Var is the variance, basically just how much things vary in the population. Among all adults worldwide, Var(phenotypic height) is around 50 cm². (Incidentally, did you know that variance was invented for the purpose of defining heritability?) Meanwhile, Var(genotypic height) is how much genotypic height varies in the population. That might seem hopeless to estimate, given that we don’t know anyone’s genotypic height. But it turns out that we can still estimate the variance using, e.g., pairs of adopted twins, and it’s thought to be around 40 cm². If we use those numbers, the heritability of height would be H² = 40 cm² / 50 cm² = 80%. People often convert this to a percentage and say “height is 80% heritable”. I’m not sure I like that, since it masks heritability’s true nature as a ratio. But everyone does it, so I’ll do it too. People who really want to be intimidating might also say, “genes explain 80% of the variance in height”. Of course, basically the same definition works for any trait, like weight or income or fondness for pseudonymous existential angst science blogs. But instead of replacing “height” with “trait”, biologists have invented the ultra-fancy word “phenotype” and write H² = Var(genotypic value) / Var(phenotypic value). The word “phenotype” suggests some magical concept that would take years of study to understand. But don’t be intimidated. It just means the actual observed value of some trait(s). You can measure your phenotypic height with a tape measure. Let me make two points before moving on. First, this definition of heritability assumes nothing. We are not assuming that genes are independent of the environment or that “genotypic effects” combine linearly with “environmental effects”. We are not assuming that genes are in Hardy-Weinberg equilibrium, whatever that is. No. I didn’t talk about that stuff because I don’t need to. There are no hidden assumptions. The above definition always works.
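Here’s a toy simulation of the ratio, using the 40 cm² / 50 cm² numbers from the text. The generative model is my own invention, and it does assume genes and environment add independently — which, as just noted, the definition itself does not require; it’s only a convenient way to manufacture an example:

```python
import random

random.seed(1)

# Toy model (illustrative only): genotypic height has variance ~40 cm²,
# and the environment adds independent noise with variance ~10 cm²,
# so phenotypic variance is ~50 cm² and heritability should be ~80%.
n = 200_000
genotypic = [random.gauss(165, 40**0.5) for _ in range(n)]
phenotypic = [g + random.gauss(0, 10**0.5) for g in genotypic]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

h2 = var(genotypic) / var(phenotypic)
print(h2)  # ≈ 0.80
```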
Second, many normal English words have parallel technical meanings, such as “field”, “insulator”, “phase”, “measure”, “tree”, or “stack”. Those are all nice, because they’re evocative and it’s almost always clear from context which meaning is intended. But sometimes, scientists redefine existing words to mean something technical that overlaps but also contradicts the normal meaning, as in “salt”, “glass”, “normal”, “berry”, or “nut”. These all cause confusion, but “heritability” must be the most egregious case in all of science. Before you ever heard the technical definition of heritability, you surely had some fuzzy concept in your mind. Personally, I thought of heritability as meaning how many “points” you get from genes versus the environment. If charisma was 60% heritable, I pictured each person as having 10 total “charisma points”, 6 of which come from genes, and 4 from the environment: If you take nothing else from this post, please remember that the technical definition of heritability does not work like that. You might hope that if we add some plausible assumptions, the above ratio-based definition would simplify into something nice and natural that aligns with what “heritability” means in normal English. But that does not happen. If that’s confusing, well, it’s not my fault. Not sure what’s happening here, but it seems relevant. So “heritability” is just the ratio of genotypic and phenotypic variance. Is that so bad? I think… maybe? How heritable is eye color? Close to 100%. This seems obvious, but let’s justify it using our definition that heritability is Var(genotypic value) / Var(phenotypic value). Well, people have the same eye color no matter what environment they are raised in. That means that genotypic eye color and phenotypic eye color are the same thing. So they have the same variance, and the ratio is 1. Nothing tricky here. How heritable is speaking Turkish? Close to 0%. Your native language is determined by your environment.
If you grow up in a family that speaks Turkish, you speak Turkish. Genes don’t matter. Of course, there are lots of genes that are correlated with speaking Turkish, since Turks are not, genetically speaking, a random sample of the global population. But that doesn’t matter, because if you put Turkish babies in Korean households, they speak Korean. Genotypic values are defined by what happens in a random environment, which breaks the correlation between speaking Turkish and having Turkish genes. Since 1.1% of humans speak Turkish, the genotypic value for speaking Turkish is around 0.011 for everyone, no matter their DNA. Since that’s basically constant, the genotypic variance is near zero, and heritability is near zero. How heritable is speaking English? Perhaps 30%. Probably somewhere between 10% and 50%. Definitely more than zero. That’s right. Turkish isn’t heritable but English is. Yes it is. If you ask an LLM, it will tell you that the heritability of English is zero. But the LLM is wrong and I am right. Why? Let me first acknowledge that Turkish is a little bit heritable. For one thing, some people have genes that make them non-verbal. And there’s surely some genetic basis for being a crazy polyglot who learns many languages for fun. But speaking Turkish as a second language is quite rare, meaning that the genotypic value of speaking Turkish is close to 0.011 for almost everyone. English is different. While only 1 in 20 people in the world speak English as a first language, 1 in 7 learn it as a second language. And who does that? Educated people. And educational attainment is itself substantially heritable, with broad-sense estimates around 40%. Some argue the heritability of educational attainment is much lower. I’d like to avoid debating the exact numbers, but note that these lower numbers are usually estimates of “narrow-sense” heritability rather than the “broad-sense” heritability we’re talking about. So they should be lower. (I’ll explain the difference later.)
It’s entirely possible that broad-sense heritability is lower than 40%, but everyone agrees it’s much larger than zero. So the heritability of English is surely much larger than zero, too. Say there’s an island where genes have no impact on height. How heritable is height among people on this island? 0%. There’s nothing tricky here. Say there’s an island where genes entirely determine height. How heritable is height? 100%. Again, nothing tricky. Say there’s an island where neither genes nor the environment influence height and everyone is exactly 165 cm tall. How heritable is height? It’s undefined. In this case, everyone has exactly the same phenotypic and genotypic height, namely 165 cm. Since those are both constant, their variance is zero and heritability is zero divided by zero. That’s meaningless. Say there’s an island where some people have genes that predispose them to be taller than others. But the island is ruled by a cruel despot who denies food to children with taller genes, so that on average, everyone is 165 ± 5 cm tall. How heritable is height? 0%. On this island, everyone has a genotypic height of 165 cm. So genotypic variance is zero, but phenotypic variance is positive, due to the ± 5 cm random variation. So heritability is zero divided by some positive number. Say there’s an island where some people have genes that predispose them to be tall and some have genes that predispose them to be short. But, the same genes that make you tall also make you semi-starve your children, so in practice everyone is exactly 165 cm tall. How heritable is height? ∞%. Not 100%, mind you, infinitely heritable. To see why, note that if babies with short/tall genes are adopted by parents with short/tall genes, there are four possible cases. If a baby with short genes is adopted into random families, they will be shorter on average than a baby with tall genes would be. So genotypic height varies. However, in reality, everyone is the same height, so phenotypic height is constant.
So genotypic variance is positive while phenotypic variance is zero. Thus, heritability is some positive number divided by zero, i.e. infinity. (Are you worried that humans are “diploid”, with two genes (alleles) at each locus, one from each biological parent? Or that when there are multiple parents, they all tend to have thoughts on the merits of semi-starvation? If so, please pretend people on this island reproduce asexually. Or, if you like, pretend that there’s strong assortative mating so that everyone either has all-short or all-tall genes and only breeds with similar people. Also, don’t fight the hypothetical.) Say there are two islands. They all live the same way and have the same gene pool, except people on island A have some gene that makes them grow to be 150 ± 5 cm tall, while on island B they have a gene that makes them grow to be 160 ± 5 cm tall. How heritable is height? It’s 0% for island A and 0% for island B, and 50% for the two islands together. Why? Well, on island A, everyone has the same genotypic height, namely 150 cm. Since that’s constant, genotypic variance is zero. Meanwhile, phenotypic height varies a bit, so phenotypic variance is positive. Thus, heritability is zero. For similar reasons, heritability is zero on island B. But if you put the two islands together, half of people have a genotypic height of 150 cm and half have a genotypic height of 160 cm, so suddenly (via math) genotypic variance is 25 cm². There’s some extra random variation so (via more math) phenotypic variance turns out to be 50 cm². So heritability is 25 / 50 = 50%. If you combine the populations, then genotypic variance is ½ × (150 − 155)² + ½ × (160 − 155)² = 25 cm². Meanwhile, phenotypic variance is 25 cm² + (5 cm)² = 50 cm². Say there’s an island where neither genes nor the environment influence height. Except, some people have a gene that makes them inject their babies with human growth hormone, which makes them 5 cm taller. How heritable is height? 0%. True, people with that gene will tend to be taller.
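The two-island arithmetic is short enough to write out directly (here ± 5 cm is read as environmental noise with variance 25 cm², as in the text):

```python
# Combined population: genotypic height is 150 cm (island A) or 160 cm
# (island B), each with probability 1/2; the environment adds noise
# with variance 25 cm².
geno_var = 0.5 * (150 - 155) ** 2 + 0.5 * (160 - 155) ** 2  # 25 cm²
pheno_var = geno_var + 25                                    # 50 cm²
h2 = geno_var / pheno_var

print(geno_var, pheno_var, h2)  # 25.0 50.0 0.5
```

On either island alone, the first term vanishes (everyone shares one genotypic height), which is why each island separately gets 0%.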
And the gene is causing them to be taller. But if babies are adopted into random families, it’s the genes of the parents that determine whether they get injected or not. So everyone has the same genotypic height, genotypic variance is zero, and heritability is zero. Suppose there’s an island where neither genes nor the environment influence height. Except, some people have a gene that makes them, as babies, talk their parents into injecting them with human growth hormone. The babies are very persuasive. How heritable is height? We’re back to 100%. The difference with the previous scenario is that now babies with that gene get injected with human growth hormone no matter who their parents are. Since nothing else influences height, genotype and phenotype are the same, have the same variance, and heritability is 100%. Suppose there’s an island where neither genes nor the environment influence height. Except, there are crabs that seek out blue-eyed babies and inject them with human growth hormone. The crabs, they are unstoppable. How heritable is height? Again, 100%. Babies with DNA for blue eyes get injected. Babies without DNA for blue eyes don’t. Since nothing else influences height, genotype and phenotype are the same and heritability is 100%. Note that if the crabs were seeking out parents with blue eyes and then injecting their babies, then height would be 0% heritable. It doesn’t matter that human growth hormone is a weird thing that’s coming from outside the baby. It doesn’t matter if we think crabs should be semantically classified as part of “the environment”. It doesn’t matter that heritability would drop to zero if you killed all the crabs, or that the direct causal effect of the relevant genes has nothing to do with height. Heritability is a ratio and doesn’t care. So heritability can be high even when genes have no direct causal effect on the trait in question. It can be low even when there is a strong direct effect. It changes when the environment changes.
It even changes based on how you group people together. It can be larger than 100% or even undefined. Even so, I’m worried people might interpret this post as a long way of saying heritability is dumb and bad, trolololol. So I thought I’d mention that this is not my view. Say a bunch of companies create different LLMs and train them on different datasets. Some of the resulting LLMs are better at writing fiction than others. Now I ask you, “What percentage of the difference in fiction writing performance is due to the base model code, rather than the datasets or the GPUs or the learning rate schedules?” That’s a natural question. But if you put it to an AI expert, I bet you’ll get a funny look. You need code and data and GPUs to make an LLM. None of those things can write fiction by themselves. Experts would prefer to think about one change at a time: Given this model, changing the dataset in this way changes fiction writing performance this much. Similarly, for humans, I think what we really care about is interventions. If we changed this gene, could we eliminate a disease? If we educate children differently, can we make them healthier and happier? No single number can possibly contain all that information. But heritability is something. I think of it as saying how much hope we have to find an intervention by looking at changes in current genes or current environments. If heritability is high, then given current typical genes, you can’t influence the trait much through current typical environmental changes. If you only knew that eye color was 100% heritable, that means you won’t change your kid’s eye color by reading to them, or putting them on a vegetarian diet, or moving to higher altitude. But it’s conceivable you could do it by putting electromagnets under their bed or forcing them to communicate in interpretive dance. If heritability is high, that also means that given current typical environments, you can influence the trait through current typical genetic changes.
If the world were ruled by an evil despot who forced red-haired people to take pancreatic cancer pills, then pancreatic cancer would be highly heritable. And you could change the odds someone gets pancreatic cancer by swapping in existing genes for black hair. If heritability is low, that means that given current typical environments, you can’t cause much difference through current typical genetic changes. If we only knew that speaking Turkish was ~0% heritable, that means that doing embryo selection won’t much change the odds that your kid speaks Turkish. If heritability is low, that also means that given current typical genes, you might be able to change the trait through current typical environmental changes. If we only knew that speaking Turkish was 0% heritable, then that means there might be something you could do to change the odds your kid speaks Turkish, e.g. moving to Turkey. Or, it’s conceivable that it’s just random and moving to Turkey wouldn’t do anything. But be careful. Just because heritability is high doesn’t mean that changing genes is easy. And just because heritability is low doesn’t mean that changing the environment is easy. And heritability doesn’t say anything about non-typical environments or non-typical genes. If an evil despot is giving all the red-haired people cancer pills, perhaps we could solve that by intervening on the despot. And if you want your kid to speak Turkish, it’s possible that there’s some crazy genetic modification that would turn them into an unstoppable Turkish-learning machine. Heritability has no idea about any of that, because it’s just an observational statistic based on the world as it exists today. Further reading: Heritability: Five Battles by Steven Byrnes covers similar issues in a way that’s more connected to the real world and less shy about making empirical claims. A molecular genetics perspective on the heritability of human behavior and group differences by Alexander Gusev.
I find the quantitative genetics literature to be incredibly sloppy about notation and definitions and math. (Is this why LLMs are so bad at it?) This is the only source I've found that didn't drive me completely insane. This post focused on "broad-sense" heritability. But there is a second heritability out there, called "narrow-sense". Like broad-sense heritability, we can define the narrow-sense heritability of height as a ratio. The difference is that rather than having height in the numerator, we now have "additive height". To define that, imagine doing the following for each of your genes, one at a time: insert that gene into a million random embryos, wait 25 years, and record how the resulting people's average height differs from the overall average. For example, say overall average human height is 150 cm, but when you insert gene #4023 from yourself into random embryos, their average height is 149.8 cm. Then the additive effect of your gene #4023 is -0.2 cm. Your "additive height" is the average human height plus the sum of the additive effects of each of your genes. If the average human height is 150 cm, you have one gene with a -0.2 cm additive effect, another gene with a +0.3 cm additive effect, and the rest of your genes have no additive effect, then your "additive height" is 150 cm - 0.2 cm + 0.3 cm = 150.1 cm. Note: This terminology of "additive height" is non-standard. People usually define narrow-sense heritability using "additive effects", which are the same thing but without including the mean. This doesn't change anything, since adding a constant doesn't change the variance. But it's easier to say "your additive height is 150.1 cm" than "the additive effect of your genes on height is +0.1 cm", so I'll do that. Honestly, I don't think the distinction between "broad-sense" and "narrow-sense" heritability is that important. We've already seen that broad-sense heritability is weird, and narrow-sense heritability is similar but different. So it won't surprise you to learn that narrow-sense heritability is differently weird.
But if you really want to understand the difference, I can offer you some more puzzles. Say there's an island where people have two genes, each of which is equally likely to be A or B. People are 100 cm tall if they have an AA genotype, 150 cm tall if they have an AB or BA genotype, and 200 cm tall if they have a BB genotype. How heritable is height? Both broad and narrow-sense heritability are 100%. The explanation for broad-sense heritability is like many we've seen already: genes entirely determine someone's height, and so genotypic and phenotypic height are the same. For narrow-sense heritability, we need to calculate some additive heights. The overall mean is 150 cm, each A gene has an additive effect of -25 cm, and each B gene has an additive effect of +25 cm. Let's work out the additive height for all four cases: AA is 150 - 25 - 25 = 100 cm, AB and BA are 150 - 25 + 25 = 150 cm, and BB is 150 + 25 + 25 = 200 cm. Since additive height is the same as phenotypic height, narrow-sense heritability is also 100%. In this case, the two heritabilities were the same. At a high level, that's because the genes act independently. When there are "gene-gene" interactions, you tend to get different numbers. Say there's an island where people have two genes, each of which is equally likely to be A or B. People with AA or BB genomes are 100 cm, while people with AB or BA genomes are 200 cm. How heritable is height? Broad-sense heritability is 100%, while narrow-sense heritability is 0%. You know the story for broad-sense heritability by now. For narrow-sense heritability, we need to do a little math: forcing either gene to be A (or to be B) leaves the other gene 50/50 between A and B, so the average height is 150 cm either way, and every additive effect is 0 cm. So everyone has an additive height of 150 cm, no matter their genes. That's constant, so narrow-sense heritability is zero. So why do we have both heritabilities? I think basically for two reasons. First, for some types of data (twin studies) it's much easier to estimate broad-sense heritability. For other types of data (GWAS) it's much easier to estimate narrow-sense heritability. So we take what we can get. Second, they're useful for different things.
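These two island puzzles are easy to brute-force. Here's a minimal sketch (the `heritabilities` helper and the genotype encoding are my own, not from any genetics library) that computes both heritabilities by enumerating the four equally likely genotypes:

```python
from itertools import product

def variance(values, probs):
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

def heritabilities(height):
    """Broad- and narrow-sense heritability for a two-locus toy island.

    height: dict mapping a genotype tuple like ("A", "B") to phenotypic
    height. Each locus is independently A or B with probability 1/2, and
    genes fully determine height (genotypic height = phenotypic height).
    """
    genotypes = list(product("AB", repeat=2))
    probs = [0.25] * len(genotypes)
    phenos = [height[g] for g in genotypes]
    mean = sum(p * h for p, h in zip(probs, phenos))

    # Additive effect of allele a at locus i: force locus i to allele a
    # in a random genotype, take the average height, subtract the mean.
    def additive_effect(i, a):
        total = 0.0
        for g, p in zip(genotypes, probs):
            forced = list(g)
            forced[i] = a
            total += p * height[tuple(forced)]
        return total - mean

    additive = [mean + sum(additive_effect(i, g[i]) for i in range(2))
                for g in genotypes]
    var_p = variance(phenos, probs)
    broad = variance(phenos, probs) / var_p    # genotypic = phenotypic here
    narrow = variance(additive, probs) / var_p
    return broad, narrow

# Puzzle 1: height depends only on the count of B alleles (no interaction).
h1 = {("A", "A"): 100, ("A", "B"): 150, ("B", "A"): 150, ("B", "B"): 200}
# Puzzle 2: pure gene-gene interaction (AA/BB short, AB/BA tall).
h2 = {("A", "A"): 100, ("A", "B"): 200, ("B", "A"): 200, ("B", "B"): 100}
```

Running `heritabilities` on `h1` gives 100% for both, while `h2` gives 100% broad-sense but 0% narrow-sense, matching the hand calculations.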
Broad-sense heritability is defined by looking at what all your genes do together. That's nice, since you are the product of all your genes working together. But combinations of genes are not well-preserved by reproduction. If you have a kid, then they breed with someone, their kids breed with other people, and so on. Generations later, any special combination of genes you might have is gone. So if you're interested in the long-term impact of you having another kid, narrow-sense heritability might be the way to go. (Sexual reproduction doesn't really allow for preserving the genetics that make you uniquely "you". Remember, almost all your genes are shared by lots of other people. If you have any unique genes, that's almost certainly because they're deleterious de novo mutations. From the perspective of evolution, your life just amounts to a tiny increase or decrease in the per-locus population frequencies of your individual genes. The participants in the game of evolution are genes. Living creatures like you are part of the playing field. Food for thought.) Phenotype (P) is never defined. This is a minor issue, since it just means "trait". Genotype (G) is never defined. This is a huge issue, since it's very tricky and heritability makes no sense without it. Environment (E) is never defined. This is worse than it seems, since in heritability, different people use "environment" and E to refer to different things. When we write P = G + E, are we assuming some kind of linear interaction? The text implies not, but why? What does this equation mean? If this equation is always true, then why do people often add other stuff like G × E on the right? The text states that if you do a planned experiment (how?) and make Cov(G, E) = 0, then heritability is Var(G) / Var(P). But in fact, heritability is always defined that way. You don't need a planned experiment and it's fine if Cov(G, E) ≠ 0.
And—wait a second—that definition doesn't refer to environmental effects at all. So what was the point of introducing them? What was the point of writing P = G + E? What are we doing? Create a million embryonic clones of yourself. Implant them in the wombs of randomly chosen women around the world who were about to get pregnant on their own. Convince them to raise those babies exactly like a baby of their own. Wait 25 years, find all your clones and take their average height.
Find a million random women in the world who just became pregnant. For each of them, take your gene and insert it into the embryo, replacing whatever was already at that gene's locus. Convince everyone to raise those babies exactly like a baby of their own. Wait 25 years, find all the resulting people, and take the difference of their average height from the overall average height. The overall mean height is 150 cm. If you take a random embryo and replace one gene with A, then there's a 50% chance the other gene is A, so they're 100 cm, and there's a 50% chance the other gene is B, so they're 200 cm, for an average of 150 cm. Since that's the same as the overall mean, the additive effect of an A gene is +0 cm. By similar logic, the additive effect of a B gene is also +0 cm.

DHH 4 months ago

Linux crosses magic market share threshold in US

According to Statcounter, Linux has claimed a 5% market share of desktop computing in the US. That's double where it was just three years ago. Really impressive. Windows is still dominant at 63%, and Apple sits at 26%. But for the latter, that's quite a drop from their peak of 33% in June 2023.

qouteall notes 5 months ago

Some Statistics Knowledge

What's the essence of probability? There are two views. Probability is related to sampling assumptions. Example: the Bertrand paradox: there are many ways to randomly select a chord on a circle, each with a different probability density over chords. A distribution tells how likely a random variable is to take each value. Independent means that two random variables don't affect each other: knowing one doesn't affect the distribution of the other. For dependent random variables, knowing one changes the distribution of the other. $P(X=x)$ means the probability that random variable $X$ takes value $x$. It can also be written as $P_X(x)$ or $P(X)$. Sometimes a probability density function $f$ is used to represent a distribution. A joint distribution tells how likely a combination of multiple variables is to take each value. For a joint distribution of X and Y, each outcome is a pair, denoted $(X, Y)$. If X and Y are independent, then $P(X=x, Y=y) = P((X,Y)=(x,y)) = P(X=x) \cdot P(Y=y)$. For a joint distribution of $(X, Y)$, if we only care about X, then the distribution of X is called the marginal distribution. You can only add probabilities when two events are mutually exclusive. You can only multiply probabilities when two events are independent, or when multiplying a conditional probability by the condition's probability. $P(E \mid C)$ means the probability of $E$ happening given that $C$ happens, so for E and C both happening: $P(E \cap C) = P(E \mid C) \cdot P(C)$. If E and C are independent, then $P(E \cap C) = P(E)P(C)$, so $P(E \mid C) = P(E)$. For example, consider a medical test for a disease.
The test result can be positive (indicating disease) or negative. But the test is not always accurate. There are two random variables: whether the test result is positive, and whether the person actually has the disease. This is a joint distribution with 4 cases, whose probabilities $a, b, c, d$ satisfy $a + b + c + d = 1$. For that distribution, there are two marginal distributions. If we only care about whether the person actually has the disease and ignore the test result, we get one marginal distribution; similarly, there is a marginal distribution of whether the test result is positive. The false negative rate is $P(\text{test is negative} \mid \text{actually has disease})$: the rate of negative tests among people who actually have the disease. The false positive rate is $P(\text{test is positive} \mid \text{actually doesn't have disease})$. Some people may intuitively think the false negative rate means $P(\text{test result is false} \mid \text{test is negative})$, which equals $P(\text{actually has disease} \mid \text{test is negative}) = \frac{b}{b+d}$. But that's not the official definition of false negative rate. Bayes' theorem allows "reversing" $P(A \mid B)$ into $P(B \mid A)$: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. The theoretical mean is the weighted average of all possible cases using theoretical probabilities. $E[X]$ denotes the theoretical mean of random variable $X$, also called the expected value of $X$. It's also often denoted $\mu$.
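The joint-distribution bookkeeping can be sketched in a few lines. The numbers here are made up purely for illustration (1% prevalence, 95% sensitivity, 98% specificity); nothing in the post specifies them:

```python
# Joint distribution over (test result, true status), assumed numbers:
# P(sick) = 0.01, P(pos | sick) = 0.95, P(neg | healthy) = 0.98.
joint = {
    ("pos", "sick"):    0.01 * 0.95,
    ("neg", "sick"):    0.01 * 0.05,
    ("pos", "healthy"): 0.99 * 0.02,
    ("neg", "healthy"): 0.99 * 0.98,
}

def marginal_truth(t):
    """Marginal probability of the true status, summing out the test."""
    return sum(p for (test, truth), p in joint.items() if truth == t)

def cond(test, truth):
    """Conditional probability P(test | truth) = P(test, truth) / P(truth)."""
    return joint[(test, truth)] / marginal_truth(truth)

# Official definition: P(negative | sick).
false_negative_rate = cond("neg", "sick")

# Bayes-style "reversal": P(sick | negative), the intuitive misreading.
p_neg = joint[("neg", "sick")] + joint[("neg", "healthy")]
p_sick_given_neg = joint[("neg", "sick")] / p_neg
```

With these assumed numbers, the false negative rate is 5%, while the "reversed" quantity P(sick | negative) is far smaller, which is exactly the distinction the paragraph above draws.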
For the discrete case, $E[X]$ is calculated by summing all theoretically possible values multiplied by their theoretical probabilities: $E[X] = \sum_x x \cdot P(X=x)$. For the continuous case: $E[X] = \int x f(x)\,dx$. Some rules related to the mean: for a constant $k$, $E[kX] = kE[X]$ and $E[X+k] = E[X] + k$. (The constant $k$ doesn't necessarily need to be globally constant. It just needs to be a value that's not affected by the random outcome, i.e. "constant in context".) Another important rule is that if $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$. This is because when $X$ and $Y$ are independent, $P(X=x_i, Y=y_j) = P(X=x_i) \cdot P(Y=y_j)$. Note that $E[X+Y] = E[X] + E[Y]$ always holds regardless of independence, but $E[XY] = E[X]E[Y]$ requires independence. In a sum, a common factor that doesn't depend on the sum index can be extracted out. So: $\sum_i \sum_j f(i)\,g(j) = \sum_i f(i) \sum_j g(j) = \left(\sum_i f(i)\right)\left(\sum_j g(j)\right)$, where $f(i)$ is irrelevant to $j$ and $\sum_j g(j)$ is irrelevant to $i$. Then $E[XY] = \sum_i \sum_j x_i y_j P(X=x_i) P(Y=y_j) = \left(\sum_i x_i P(X=x_i)\right)\left(\sum_j y_j P(Y=y_j)\right) = E[X]E[Y]$. (That's the discrete case. The continuous case is similar.) If we have $n$ samples of $X$, denoted $X_1, X_2, \ldots, X_n$, where each sample is a random variable, the samples are independent of each other, and each sample is taken from the same distribution (independently and identically distributed, i.i.d.), then we can estimate the theoretical mean by calculating the average. The estimated mean is denoted $\hat{\mu}$ ("mu hat"): $\hat{\mu} = \frac{1}{n}\sum_i X_i$. The hat $\hat{}$ means it's an empirical value calculated from samples, not the theoretical value. An important clarification: the mean of the estimated mean equals the theoretical mean, $E[\hat{\mu}] = \mu$.
Note that if the samples are not independent of each other, or are taken from different distributions, then the estimate may be biased. The theoretical variance, $\text{Var}[X]$, also denoted $\sigma^2$, measures how "spread out" the samples are: $\text{Var}[X] = E[(X - E[X])^2]$. If $k$ is a constant, $\text{Var}[kX] = k^2\,\text{Var}[X]$ and $\text{Var}[X + k] = \text{Var}[X]$. The standard deviation (stdev) $\sigma$ is the square root of the variance. Multiplying a random variable by a constant also multiplies the standard deviation. The covariance $\text{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])]$ measures the "joint variability" of two random variables $X$ and $Y$. A rule related to variance: $\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y] + 2\,\text{Cov}[X, Y]$. If $X$ and $Y$ are independent, then as previously mentioned $E[XY] = E[X] \cdot E[Y]$, so $\text{Cov}[X, Y] = 0$ and $\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y]$. The mean is sometimes also called location. The variance is sometimes called dispersion. If we have some i.i.d. samples but don't know the theoretical variance, how do we estimate it? If we know the theoretical mean, it's simple: $\hat{\sigma}^2 = \frac{1}{n}\sum_i (X_i - \mu)^2$. However, the theoretical mean is different from the estimated mean. If we don't know the theoretical mean and use the estimated mean instead, the estimate will be biased, and we need to divide by $n - 1$ instead of $n$ to avoid bias: $\hat{\sigma}^2 = \frac{1}{n-1}\sum_i (X_i - \hat{\mu})^2$. This is called Bessel's correction. Note that the more i.i.d. samples you have, the smaller the bias, so if you have many i.i.d. samples, the bias doesn't matter in practice. An intuitive explanation: originally, $n$ samples have $n$ degrees of freedom; if we keep the estimated mean fixed, only $n - 1$ degrees of freedom remain. The exact deduction of the correction is trickier. First, the estimated mean itself also has variance, since each sample is independent of the other samples.
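Bessel's correction is easy to see in a simulation. A minimal sketch, assuming standard-normal samples (true variance 1) and a deliberately small sample size so the bias is visible:

```python
import random

random.seed(0)
n = 5            # small sample size makes the bias visible
trials = 20000   # average many independent estimates

biased_sum = 0.0
unbiased_sum = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(xs) / n                        # estimated mean
    ss = sum((x - m) ** 2 for x in xs)     # squared deviations from it
    biased_sum += ss / n                   # dividing by n underestimates
    unbiased_sum += ss / (n - 1)           # Bessel's correction

biased = biased_sum / trials       # close to (n-1)/n * 1 = 0.8
unbiased = unbiased_sum / trials   # close to the true variance 1.0
```

Averaged over many trials, the divide-by-n estimator settles near 0.8 (that is, $(n-1)/n \cdot \sigma^2$), while the divide-by-(n−1) estimator settles near the true variance.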
As previously mentioned, if $X$ and $Y$ are independent, adding the variables also adds the variances: $\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y]$. So $\text{Var}[\hat{\mu}] = \text{Var}\left[\frac{1}{n}\sum_i X_i\right] = \frac{1}{n^2}\sum_i \text{Var}[X_i] = \frac{\sigma^2}{n}$. As previously mentioned $E[\hat{\mu}] = \mu$, so $\text{Var}[\hat{\mu}] = E[(\hat{\mu} - E[\hat{\mu}])^2] = E[(\hat{\mu} - \mu)^2] = \frac{\sigma^2}{n}$. This will be used later. A trick is to rewrite $X_i - \hat{\mu}$ as $(X_i - \mu) - (\hat{\mu} - \mu)$ and then expand: $\sum_i (X_i - \hat{\mu})^2 = \sum_i (X_i - \mu)^2 - 2(\hat{\mu} - \mu)\sum_i (X_i - \mu) + n(\hat{\mu} - \mu)^2$. Then take the mean of both sides. There are now three terms. The first one equals $n\sigma^2$: $E\left[\sum_i (X_i - \mu)^2\right] = n\sigma^2$. Since $\sum_i (X_i - \mu) = n(\hat{\mu} - \mu)$, the second one becomes $-2n\,E[(\hat{\mu} - \mu)^2]$. Now the three terms together give $E\left[\sum_i (X_i - \hat{\mu})^2\right] = n\sigma^2 - 2n\,E[(\hat{\mu} - \mu)^2] + n\,E[(\hat{\mu} - \mu)^2] = n\sigma^2 - n\,E[(\hat{\mu} - \mu)^2]$. $E[(\hat{\mu} - \mu)^2]$ is also $\text{Var}[\hat{\mu}]$, which as previously mentioned equals $\frac{\sigma^2}{n}$, so $E\left[\sum_i (X_i - \hat{\mu})^2\right] = n\sigma^2 - \sigma^2 = (n-1)\sigma^2$, and dividing by $n - 1$ gives an unbiased estimate. For a random variable $X$, if we know its mean $\mu$ and standard deviation $\sigma$, we can "standardize" it so that its mean becomes 0 and its standard deviation becomes 1: $Z = \frac{X - \mu}{\sigma}$. That's called the Z-score or standard score. Often the theoretical mean and theoretical standard deviation are unknown, so the z-score is computed using the sample mean and sample stdev: $z_i = \frac{X_i - \hat{\mu}}{\hat{\sigma}}$. In deep learning, normalization uses the Z-score. Note that in layer normalization and batch normalization, the variance usually divides by $n$ instead of $n - 1$. Computing the Z-score of a vector can also be seen as a projection: subtracting the mean projects $\boldsymbol{x}$ onto the hyperplane orthogonal to $\boldsymbol{1}$, giving $\boldsymbol{y} = \boldsymbol{x} - \hat{\mu}\boldsymbol{1}$ with $\sigma^2 = \frac{1}{n}|\boldsymbol{y}|^2$ and $\sigma = \frac{1}{\sqrt{n}}|\boldsymbol{y}|$ (or with $n - 1$).
Dividing by the standard deviation can be seen as projecting onto the unit sphere and then multiplying by $\sqrt{n}$ (or $\sqrt{n-1}$). So computing the Z-score can be seen as first projecting onto the hyperplane orthogonal to $\boldsymbol{1}$, then projecting onto the unit sphere and multiplying by $\sqrt{n}$ (or $\sqrt{n-1}$). Skewness measures which side has more extreme values. A large positive skew means there is a fat tail on the positive side (possible positive extreme values). A large negative skew means a fat tail on the negative side. If the two sides are symmetric, the skew is 0, regardless of how fat the tails are. Gaussian distributions are symmetric, so they have zero skew. Note that an asymmetric distribution can also have 0 skewness. There is a concept called moments that unifies mean, variance, skewness and kurtosis: the mean is the first moment $E[X]$, and the $k$-th central moment is $\mu_k = E[(X - E[X])^k]$, with variance the second central moment and skewness and kurtosis the standardized third and fourth. There is an unbiased way to estimate the third central moment $\mu_3$. The deduction of the unbiased third-central-moment estimator is similar to Bessel's correction, but trickier. A common way of estimating skewness from i.i.d. samples is to divide the unbiased third-central-moment estimator by the cube of the unbiased estimator of the standard deviation. But the result is still biased, as $E\left[\frac{X}{Y}\right]$ doesn't necessarily equal $\frac{E[X]}{E[Y]}$. Unfortunately, there is no completely unbiased way to estimate skewness from i.i.d. samples (unless you have other assumptions about the underlying distribution). The bias gets smaller with more i.i.d. samples. Larger kurtosis means a fatter tail: the more extreme values a distribution has, the higher its kurtosis. Gaussian distributions have a kurtosis of 3. Excess kurtosis is the kurtosis minus 3.
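The standardization described above can be sketched directly; the `ddof` switch (a name I'm borrowing here as a parameter, echoing NumPy's convention) mirrors the n versus n−1 choice mentioned for layer/batch normalization:

```python
import math

def zscores(xs, ddof=0):
    """Standardize a sample: subtract the mean, divide by the stdev.
    ddof=0 divides the variance by n (as layer/batch norm do);
    ddof=1 divides by n - 1 (Bessel's correction)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - ddof)
    sd = math.sqrt(var)
    return [(x - mean) / sd for x in xs]

z = zscores([2.0, 4.0, 6.0, 8.0])
# With ddof=0, the resulting z-scores have mean 0 and population variance 1.
```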
A common way of estimating excess kurtosis from i.i.d. samples is to take the unbiased estimator of the fourth cumulant ($E[(X - E[X])^4] - 3\,\text{Var}[X]^2$) and divide it by the square of the unbiased estimator of the variance. It's still biased. If we have some independent samples of $X$, we can estimate the mean $E[X]$ by calculating the average $\hat{E}[X] = \frac{1}{n}\sum_i X_i$. The variance of the calculated average is $\frac{1}{n}\text{Var}[X]$, which shrinks as we take more samples. However, if the variance of $X$ is large and the number of samples is small, the average will have a large variance and the estimated mean will be inaccurate. We can make the estimation more accurate by using a control variate: estimate $E[X]$ using $\hat{E}[X + \lambda(Y - E[Y])]$, where $\lambda$ is a constant and $Y$ is another random variable with known mean $E[Y]$. By choosing the right $\lambda$, this estimator can have lower variance than just averaging X. The $Y$ here is called a control variate. Some previous knowledge: $E[\hat{E}[A]] = E[A]$, and $\text{Var}[\hat{E}[A]] = \frac{1}{n}\text{Var}[A]$.
The mean of the estimator is $E[X]$, meaning that it is unbiased: $E[\hat{E}[X + \lambda(Y - E[Y])]] = E[X] + \lambda(E[Y] - E[Y]) = E[X]$. Now calculate the variance of the estimator (subtracting the constant $\lambda E[Y]$ doesn't change the variance): $\text{Var}[\hat{E}[X + \lambda(Y - E[Y])]] = \frac{1}{n}\text{Var}[X + \lambda Y] = \frac{1}{n}(\text{Var}[X] + \text{Var}[\lambda Y] + 2\,\text{Cov}[X, \lambda Y]) = \frac{1}{n}(\text{Var}[X] + \lambda^2\,\text{Var}[Y] + 2\lambda\,\text{Cov}[X, Y])$. We want to choose $\lambda$ to minimize the variance of the estimator, i.e. to minimize $\text{Var}[Y]\,\lambda^2 + 2\,\text{Cov}[X, Y]\,\lambda$. Quadratic function knowledge tells us that $ax^2 + bx + c \ (a > 0)$ is minimized at $x = \frac{-b}{2a}$, so the optimal lambda is $\lambda = -\frac{\text{Cov}[X, Y]}{\text{Var}[Y]}$. Using that optimal $\lambda$, the variance of the estimator is $\frac{1}{n}\left(\text{Var}[X] - \frac{\text{Cov}[X, Y]^2}{\text{Var}[Y]}\right)$. If X and Y are correlated, then $\frac{\text{Cov}[X, Y]^2}{\text{Var}[Y]} > 0$, so the new estimator has smaller variance and is more accurate than the simple one. The larger the correlation, the better. Information entropy measures the uncertainty of a distribution. If we want to measure the amount of information of a specific event, denote event $E$'s amount of information as $I(E)$; there are 3 axioms it should satisfy. According to the three axioms, the definition of $I$ (self-information) is $I(E) = \log_b \frac{1}{P(E)}$. The base $b$ determines the unit. We often use bits as the unit of the amount of information.
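A small simulation of the control-variate estimator. The setup is my own illustrative choice, not from the post: Y uniform on (0, 1) with known mean 0.5, and X = exp(Y), whose true mean is e − 1:

```python
import math
import random

random.seed(1)
n = 2000

# Y ~ Uniform(0, 1), known E[Y] = 0.5; X = exp(Y), true E[X] = e - 1.
ys = [random.random() for _ in range(n)]
xs = [math.exp(y) for y in ys]
mean_y = 0.5

xbar = sum(xs) / n
ybar = sum(ys) / n

# Optimal lambda = -Cov[X, Y] / Var[Y], estimated from the same samples.
cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
var_y = sum((y - ybar) ** 2 for y in ys) / n
lam = -cov / var_y

# The estimator E^[X + lambda * (Y - E[Y])].
cv_estimate = xbar + lam * (ybar - mean_y)
true_mean = math.e - 1
```

Because X and Y are strongly correlated here, the residual variance $\text{Var}[X] - \text{Cov}[X,Y]^2/\text{Var}[Y]$ is tiny, and the corrected estimate lands much closer to e − 1 than the sample size alone would suggest.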
An event with 50% probability has 1 bit of information, so with bits as the unit the base is 2: $I(E) = \log_2 \frac{1}{0.5} = 1$ bit. Then, for a distribution, the expected information of one sample is the expected value of $I(E)$. That defines information entropy $H$. In the discrete case: $H(X) = \sum_x P(x) \log \frac{1}{P(x)}$. If there exists $x$ where $P(x) = 0$, it can be ignored in the entropy calculation, as $\lim_{x \to 0} x \log x = 0$. Information entropy in the discrete case is always non-negative. In the continuous case, where $f$ is the probability density function, this is called differential entropy: $H(X) = \int_{\mathbb{X}} f(x) \log \frac{1}{f(x)}\,dx$. ($\mathbb{X}$ means the set of $x$ where $f(x) \neq 0$, also called the support of $f$.) In the continuous case the base is often $e$ rather than 2; here $\log$ by default means $\log_e$. In the discrete case, $0 \leq P(x) \leq 1$, so $\log \frac{1}{P(x)} \geq 0$ and entropy can never be negative. But in the continuous case, the probability density function can take values larger than 1, so differential entropy may be negative. If X and Y are independent, then $H((X,Y)) = E[I((X,Y))] = E[I(X) + I(Y)] = E[I(X)] + E[I(Y)] = H(X) + H(Y)$. If one fair coin toss has 1 bit of entropy, then $n$ independent tosses have $n$ bits of entropy. If I split one case into two cases, entropy increases. If I merge two cases into one case, entropy decreases.
This is because $p_1 \log \frac{1}{p_1} + p_2 \log \frac{1}{p_2} > (p_1 + p_2) \log \frac{1}{p_1 + p_2}$ (if $p_1 \neq 0, p_2 \neq 0$), which follows because $f(x) = \log \frac{1}{x}$ is convex, so $\frac{p_1}{p_1 + p_2}\log\frac{1}{p_1} + \frac{p_2}{p_1 + p_2}\log\frac{1}{p_2} > \log\frac{1}{p_1 + p_2}$; multiplying both sides by $p_1 + p_2$ gives the result above. The information entropy is the theoretical minimum amount of information required to encode a sample. For example, to encode the result of a fair coin toss, we use 1 bit: 0 for heads and 1 for tails (reversing is also fine). If the coin is biased towards heads, we can compress the information: use 0 for two consecutive heads, 10 for one head, 11 for one tail, which requires fewer bits on average per sample. That may not be optimal, but the most optimal lossless compression cannot beat the information entropy. In the continuous case, if $k$ is a positive constant, $H(kX) = H(X) + \log k$. Entropy is invariant to offsets of the random variable: $H(X + k) = H(X)$. A joint distribution of X and Y is a distribution where each outcome is a pair of X and Y. Its entropy is called joint information entropy. Here I will use $H((X,Y))$ to denote joint entropy (to avoid confusion with cross entropy).
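The entropy formulas above, sketched for a few discrete distributions (the `entropy` helper is a minimal hand-rolled version, not a library function):

```python
import math

def entropy(probs, base=2):
    """H = sum over outcomes of p * log(1/p). Zero-probability outcomes
    are skipped, since p * log(1/p) -> 0 as p -> 0."""
    return sum(p * math.log(1 / p, base) for p in probs if p > 0)

fair = entropy([0.5, 0.5])            # fair coin: 1 bit
biased = entropy([0.9, 0.1])          # biased coin: less than 1 bit
split = entropy([0.5, 0.25, 0.25])    # splitting a 0.5 case raises entropy
sure = entropy([1.0, 0.0])            # certain outcome: 0 bits
```

This also illustrates the split/merge claim: splitting the second 0.5 case of the fair coin into two 0.25 cases raises the entropy from 1 bit to 1.5 bits.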
If I fix the value of Y as $y$, then look at the distribution of X: that conditional distribution has entropy $H(X \mid Y = y)$. Taking the mean over different values of Y gives the conditional entropy: $H(X \mid Y) = \sum_y P(y)\, H(X \mid Y = y)$. Applying the conditional probability rule $P((X,Y)) = P(X \mid Y) P(Y)$, we similarly get $H((X,Y)) = H(X \mid Y) + H(Y)$. The exact deduction: $H((X,Y)) = E\left[\log\frac{1}{P(X \mid Y)P(Y)}\right] = E\left[\log\frac{1}{P(X \mid Y)}\right] + E\left[\log\frac{1}{P(Y)}\right] = H(X \mid Y) + H(Y)$. If $X$ and $Y$ are not independent, then the joint entropy is smaller than if they were independent: $H((X,Y)) < H(X) + H(Y)$. If X and Y are not independent, then knowing X also gives some information about Y. This can be deduced from mutual information, which will be explained below. Here $I_A(x)$ denotes the amount of information of a value (event) $x$ under distribution $A$. The difference in information of the same value under two distributions $A$ and $B$ is $I_B(x) - I_A(x) = \log\frac{P_A(x)}{P_B(x)}$. The KL divergence from $A$ to $B$ is the expected value of that difference with respect to $A$'s probabilities (here $E_A$ means the expected value calculated using $A$'s probabilities): $D_{KL}(A \parallel B) = E_A\left[\log\frac{P_A(x)}{P_B(x)}\right]$. You can think of KL divergence as the average extra information needed to encode samples of $A$ using a code optimized for $B$. KL divergence is also called relative entropy. KL divergence is asymmetric: $D_{KL}(A \parallel B)$ is different from $D_{KL}(B \parallel A)$. Often the first distribution is the real underlying distribution, and the second distribution is an approximation or the model output. If A and B are the same, the KL divergence between them is zero. Otherwise, KL divergence is positive; it can never be negative, as will be explained later. $P_B(x)$ appears in the denominator.
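The chain rule $H((X,Y)) = H(X \mid Y) + H(Y)$ can be checked numerically. A sketch with an arbitrary dependent joint distribution of my own choosing:

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# An illustrative dependent joint distribution over (X, Y).
joint = {("a", 0): 0.3, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.4}

# Marginals of X and Y.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# H(X | Y) = sum over y of P(y) * H(X | Y = y).
h_x_given_y = 0.0
for y0, p_y in py.items():
    cond = [p / p_y for (x, y), p in joint.items() if y == y0]
    h_x_given_y += p_y * H(cond)

joint_entropy = H(joint.values())
```

For this dependent joint distribution, `joint_entropy` equals `h_x_given_y + H(py.values())` exactly, and is strictly smaller than `H(px.values()) + H(py.values())`, matching both claims above.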
If there exists $x$ with $P_B(x) = 0$ and $P_A(x) \neq 0$, then the KL divergence is infinite. It can be read as "the model expects something to never happen, but it actually can happen". If there is no such case, we say that A is absolutely continuous with respect to B, written $A \ll B$. This requires the outcomes possible under B to include all outcomes possible under A. Another concept is cross entropy. The cross entropy from A to B, denoted $H(A, B)$, is the entropy of A plus the KL divergence from A to B: $H(A, B) = H(A) + D_{KL}(A \parallel B)$. Information entropy $H(X)$ can also be expressed as the cross entropy with itself, $H(X, X)$, similar to the relation between variance and covariance. (In some places $H(A, B)$ denotes joint entropy. I use $H((A, B))$ for joint entropy to avoid ambiguity.) Cross entropy is also asymmetric. In deep learning, cross entropy is often used as a loss function: if the entropy $H(A)$ of each training example's target distribution is fixed, minimizing cross entropy is the same as minimizing KL divergence. Jensen's inequality states that for a concave function $f$: $E[f(X)] \leq f(E[X])$. The reverse applies for convex functions. Here is a visual example showing Jensen's inequality. Say I have a discrete distribution with 5 cases $X_1, X_2, X_3, X_4, X_5$ (these are possible outcomes of the distribution, not samples), corresponding to the X coordinates of the red dots. The probabilities of the 5 cases are $p_1, p_2, p_3, p_4, p_5$, summing to 1. $E[X] = p_1 X_1 + p_2 X_2 + p_3 X_3 + p_4 X_4 + p_5 X_5$.
$E[f(X)] = p_1 f(X_1) + p_2 f(X_2) + p_3 f(X_3) + p_4 f(X_4) + p_5 f(X_5)$. Then $(E[X], E[f(X)])$ can be seen as an interpolation between the five points $(X_1, f(X_1)), \ldots, (X_5, f(X_5))$, using weights $p_1, \ldots, p_5$. The possible area of the interpolated point corresponds to the green convex polygon. For each point $(E[X], E[f(X)])$ in the green polygon, the point on the function curve with the same X coordinate, $(E[X], f(E[X]))$, is above it. So $E[f(X)] \leq f(E[X])$. The same applies when you add more cases to the discrete distribution: the convex polygon gains more points but stays below the function curve. The same applies to continuous distributions, where there are infinitely many cases. Jensen's inequality shows that KL divergence is non-negative. There is a trick: extracting $-1$ puts $P_A$ in the denominator, where it will cancel later: $D_{KL}(A \parallel B) = E_A\left[\log\frac{P_A(x)}{P_B(x)}\right] = -E_A\left[\log\frac{P_B(x)}{P_A(x)}\right]$. The logarithm function is concave, so Jensen's inequality gives $E_A\left[\log\frac{P_B(x)}{P_A(x)}\right] \leq \log E_A\left[\frac{P_B(x)}{P_A(x)}\right]$. The right side equals 0 because $\log E_A\left[\frac{P_B(x)}{P_A(x)}\right] = \log \sum_x P_A(x)\frac{P_B(x)}{P_A(x)} = \log \sum_x P_B(x) = \log 1 = 0$. Multiplying by $-1$ and flipping the inequality gives $D_{KL}(A \parallel B) \geq 0$. Then, how do we estimate the KL divergence $D_{KL}(A \parallel B)$?
Reference: Approximating KL Divergence

As KL divergence is $E_A\left[\log \frac{P_A(x)}{P_B(x)}\right]$, the simple way is to draw samples from A and average $\log \frac{P_A(x)}{P_B(x)}$:

$$D_{KL}(A, B) \approx \frac{1}{n} \sum_{i=1}^{n} \log \frac{P_A(x_i)}{P_B(x_i)}, \quad x_i \sim A$$

However, this estimate can be negative in some cases, while the true KL divergence can never be negative. This may cause issues. A better way to estimate KL divergence is:

$$D_{KL}(A, B) \approx \frac{1}{n} \sum_{i=1}^{n} \left[\log \frac{P_A(x_i)}{P_B(x_i)} + \frac{P_B(x_i)}{P_A(x_i)} - 1\right], \quad x_i \sim A$$

($P_A(x) = 0$ is impossible because $x$ is sampled from A.) It is always non-negative and has no bias. The term $\frac{P_B(x)}{P_A(x)} - 1$ is a control variate, negatively correlated with $\log \frac{P_A(x)}{P_B(x)}$.

Recall control variates: if we want to estimate $E[X]$ from samples more accurately, we can find another variable $Y$ that is correlated with $X$ and whose theoretical mean $E[Y]$ we know, then use $\hat E[X + \lambda Y] - \lambda E[Y]$ to estimate $E[X]$. The parameter $\lambda$ is usually chosen by minimizing variance.

The mean of this control variate is zero, because

$$E_{x \sim A}\left[\frac{P_B(x)}{P_A(x)} - 1\right] = \sum_x P_A(x) \left(\frac{P_B(x)}{P_A(x)} - 1\right) = \sum_x \left(P_B(x) - P_A(x)\right) = \sum_x P_B(x) - \sum_x P_A(x) = 0$$

Here $\lambda = 1$ is chosen not by minimizing variance, but by making the estimator non-negative.
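To make the comparison concrete, here is a minimal Monte Carlo sketch in Python (the two toy distributions are my own arbitrary choice) contrasting the naive sample average with the control-variate estimator described above:

```python
import math, random

random.seed(0)

# Two discrete distributions over the same three outcomes (toy values).
p_a = [0.5, 0.3, 0.2]   # sampling distribution A
p_b = [0.2, 0.4, 0.4]   # model distribution B

true_kl = sum(pa * math.log(pa / pb) for pa, pb in zip(p_a, p_b))

# Draw samples x ~ A.
outcomes = random.choices(range(3), weights=p_a, k=100_000)

# Naive estimator: average of log(pA/pB). Unbiased, but can be negative.
naive_samples = [math.log(p_a[x] / p_b[x]) for x in outcomes]

# Control-variate estimator: (r - 1) - log r with r = pB/pA.
# Still unbiased (the added term has zero mean) and never negative.
better_samples = []
for x in outcomes:
    r = p_b[x] / p_a[x]
    better_samples.append((r - 1) - math.log(r))

naive_est = sum(naive_samples) / len(naive_samples)
better_est = sum(better_samples) / len(better_samples)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / len(xs)

print(true_kl, naive_est, better_est)
```

With enough samples both estimators agree with the exact KL divergence, but every per-sample term of the control-variate version is non-negative and its variance is much smaller.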
If I define $k = \frac{P_B(x)}{P_A(x)}$, then $\log \frac{P_A(x)}{P_B(x)} + \lambda\left(\frac{P_B(x)}{P_A(x)} - 1\right) = -\log k + \lambda(k - 1)$. We want it to be non-negative: $-\log k + \lambda(k - 1) \geq 0$ for all $k > 0$, which means the line $y = \lambda(k - 1)$ must stay above $y = \log k$. The only solution is $\lambda = 1$, where the line is the tangent line of $\log k$ at $k = 1$.

If X and Y are independent, then $H((X, Y)) = H(X) + H(Y)$. But if X and Y are not independent, knowing X reduces the uncertainty of Y, and then $H((X, Y)) < H(X) + H(Y)$. Mutual information $I(X; Y)$ measures how "related" X and Y are:

$$I(X; Y) = \sum_{x, y} P(X = x, Y = y) \log \frac{P(X = x, Y = y)}{P(X = x) P(Y = y)}$$

For a joint distribution, if we only care about X, the distribution of X alone is a marginal distribution, and the same goes for Y. If we treat X and Y as independent, we get a "fake" joint distribution, built as if X and Y were independent. Denote that "fake" joint distribution as $Z$; then $P(Z = (x, y)) = P(X = x) P(Y = y)$. It's called the "outer product of the marginal distributions", because its probability matrix is the outer product of the two marginal distributions, so it's denoted $X \otimes Y$. Mutual information can then be expressed as the KL divergence between the joint distribution $(X, Y)$ and that "fake" joint distribution $X \otimes Y$:

$$I(X; Y) = D_{KL}((X, Y),\ X \otimes Y)$$

KL divergence is zero when two distributions are the same, and positive when they are not.
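As a small sanity check, here is a sketch (with a made-up 2×2 joint table) computing mutual information as the KL divergence between the joint distribution and the outer product of its marginals:

```python
import math

# Joint distribution P(X, Y) as a table (rows: x, cols: y). Toy numbers.
joint = [[0.3, 0.1],
         [0.1, 0.5]]

p_x = [sum(row) for row in joint]        # marginal distribution of X
p_y = [sum(col) for col in zip(*joint)]  # marginal distribution of Y

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

# I(X;Y) as the KL divergence between the joint distribution and the
# "fake" independent joint (the outer product of the marginals).
mi = sum(joint[i][j] * math.log(joint[i][j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2) if joint[i][j] > 0)

h_joint = entropy([joint[i][j] for i in range(2) for j in range(2)])

# If X and Y really were independent, mutual information would be zero.
independent = [[p_x[i] * p_y[j] for j in range(2)] for i in range(2)]
mi_indep = sum(independent[i][j] * math.log(independent[i][j] / (p_x[i] * p_y[j]))
               for i in range(2) for j in range(2))

print(mi, mi_indep)
```

The computed values also satisfy the identity $H((X, Y)) = H(X) + H(Y) - I(X; Y)$.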
So $H((X, Y)) = H(X) + H(Y) - I(X; Y)$, and if X and Y are not independent then $I(X; Y) > 0$, so $H((X, Y)) < H(X) + H(Y)$. Mutual information is symmetric: $I(X; Y) = I(Y; X)$. As $H((X, Y)) = H(X \vert Y) + H(Y)$, we get $I(X; Y) = H(X) + H(Y) - H((X, Y)) = H(X) - H(X \vert Y)$. If knowing Y completely determines X, then knowing Y makes the distribution of X collapse to one case with 100% probability, so $H(X \vert Y) = 0$ and $I(X; Y) = H(X)$.

Some places use the correlation coefficient $\frac{\text{Cov}[X, Y]}{\sqrt{\text{Var}[X]\text{Var}[Y]}}$ to measure the correlation between two variables. But the correlation coefficient is not accurate in non-linear cases; mutual information is a more faithful measure of correlation. Information Bottleneck theory says that training a neural network learns an intermediary representation that keeps the mutual information relevant to the output while discarding mutual information about irrelevant details of the input.

If we have two independent random variables X and Y and consider the distribution of the sum $Z = X + Y$, then

$$P(Z = z) = \sum_{x + y = z} P(X = x) P(Y = y)$$

For each z, it sums over the different x and y within the constraint $z = x + y$. The constraint $z = x + y$ allows determining $y$ from $x$ and $z$: $y = z - x$, so it can be rewritten as:

$$P(Z = z) = \sum_x P(X = x) P(Y = z - x)$$

In the continuous case, the probability density function of the sum, $f_Z$, is the convolution of $f_X$ and $f_Y$:

$$f_Z(z) = (f_X * f_Y)(z) = \int f_X(x) f_Y(z - x)\, dx$$

The convolution operator $*$ is commutative and associative, and convolution also works in 2D or more dimensions.
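The discrete sum-of-variables formula is exactly a convolution; a small sketch with two fair dice:

```python
# Distribution of a fair six-sided die: index i holds P(value = i + 1).
die = [1 / 6] * 6

def convolve(p, q):
    # Distribution of the sum of two independent discrete variables:
    # out[k] = sum over all i + j = k of p[i] * q[j].
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

two_dice = convolve(die, die)  # index k holds P(sum = k + 2)
print(two_dice)
```

The result is the familiar triangle shape: $P(\text{sum} = 7) = \frac{6}{36}$ is the peak and $P(\text{sum} = 2) = \frac{1}{36}$ is an edge.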
If $X = (x_1, x_2)$ and $Y = (y_1, y_2)$ are 2D random variables (two joint distributions), the density of $Z = X + Y = (z_1, z_2)$ is the 2D convolution of the densities of X and Y:

$$f_Z(z_1, z_2) = \iint f_X(x_1, x_2)\, f_Y(z_1 - x_1, z_2 - x_2)\, dx_1\, dx_2$$

Convolution also works in cases where the values are not probabilities; convolutional neural networks use a discrete version of convolution on matrices.

Normally when talking about probability we mean the probability of an outcome under a modelled distribution: $P(\text{outcome} \mid \text{modelled distribution})$. But sometimes we have some concrete samples from a distribution and want to know which model suits them best, so we talk about the probability that a model is true given some samples: $P(\text{modelled distribution} \mid \text{outcome})$.

If I have some samples, then some parameters make the samples more likely to come from the modelled distribution, and some parameters make them less likely. For example, if I model a coin flip using a parameter $\theta$ and observe 10 coin flips with 9 heads and 1 tail, then $\theta = 0.9$ is more likely than $\theta = 0.5$. That's straightforward for a simple model, but for more complex models we need a way to measure it: likelihood. The likelihood $L(\theta \vert x_1, x_2, ..., x_n)$ measures how probable the observed samples are under the model with parameter $\theta$:

$$L(\theta \vert x_1, ..., x_n) = \prod_{i=1}^{n} P(x_i \vert \theta)$$

For example, if I model a coin flip distribution using a parameter $\theta$, where the probability of heads is $\theta$ and tails is $1 - \theta$, and I observe 9 heads and 1 tail in 10 flips, then the likelihood of $\theta$ is:

$$L(\theta \vert \text{9 heads, 1 tail}) = \theta^9 (1 - \theta)$$

The more likely a parameter is, the higher its likelihood.
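A tiny sketch of the coin example (the grid search is just an illustration, not how maximum-likelihood estimation is usually done):

```python
# Likelihood of theta after observing a sequence with 9 heads and 1 tail:
# L(theta) = theta^9 * (1 - theta)
def likelihood(theta, heads=9, tails=1):
    return theta ** heads * (1 - theta) ** tails

# theta = 0.9 should be far more likely than theta = 0.5.
l_09 = likelihood(0.9)
l_05 = likelihood(0.5)

# Scan a grid of candidate parameters; the maximum-likelihood estimate
# is heads / (heads + tails) = 0.9.
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=likelihood)
print(l_09, l_05, theta_hat)
```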
If $\theta$ equals the true underlying parameter, the likelihood tends to take its maximum. Taking the logarithm turns multiplication into addition, making it easier to analyze. The log-likelihood function:

$$\log L(\theta \vert x_1, ..., x_n) = \sum_{i=1}^{n} \log P(x_i \vert \theta)$$

The score function is the derivative of the log-likelihood with respect to the parameter, for one sample:

$$s(\theta; x) = \frac{\partial \log L(\theta \vert x)}{\partial \theta}$$

If $\theta$ equals the true underlying parameter, the mean log-likelihood $E_x[\log L(\theta \vert x)]$ takes its maximum (this follows from the non-negativity of KL divergence). A smooth function has zero derivative at its maximum, so when $\theta$ is true, the mean of the score function $E_x[s(\theta; x)] = \frac{\partial E_x[\log L(\theta \vert x)]}{\partial \theta}$ is zero.

The Fisher information $\mathcal{I}(\theta)$ is the mean of the square of the score:

$$\mathcal{I}(\theta) = E_x\left[s(\theta; x)^2\right]$$

(The mean is calculated over different outcomes, not different parameters.) We can also think of Fisher information as always being computed under the assumption that $\theta$ is the true underlying parameter; then $E_x[s(\theta; x)] = 0$, so Fisher information is the variance of the score: $\mathcal{I}(\theta) = \text{Var}_x[s(\theta; x)]$. Fisher information also measures the curvature of the mean log-likelihood, in parameter space, around $\theta$, and it measures how much information a sample can tell us about the underlying parameter.

When the parameter is an offset and the offset is infinitely small, the score function is called the linear score. Let the infinitely small offset be $\theta$.
The offset probability density is $f_2(x \vert \theta) = f(x + \theta)$. Then

$$s(\theta; x)\big|_{\theta \approx 0} = \frac{\partial \log f(x + \theta)}{\partial \theta}\bigg|_{\theta \approx 0} = \frac{d \log f(x)}{dx}$$

In places that use the score function (and Fisher information) but do not specify which parameter, they usually mean this linear score function.

Recall that making a probability distribution more "spread out" increases its entropy. If there is no constraint, maximizing the entropy of a real-number distribution gives "infinitely spread out over all real numbers", which is not well-defined. But if there are constraints, maximizing entropy gives some common and important distributions: a bounded range gives the uniform distribution, a fixed finite variance gives the normal distribution, and a fixed mean on $X \geq 0$ gives the exponential distribution. There are other max-entropy distributions. See Wikipedia. We can rediscover these max-entropy distributions by using Lagrange multipliers and functional derivatives.

To find the distribution with maximum entropy under a constraint, we can use a Lagrange multiplier. If we want to find the maximum or minimum of $f(x)$ under the constraint $g(x) = 0$, we define the Lagrangian function $\mathcal{L}$:

$$\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)$$

Its two partial derivatives have special properties: $\frac{\partial \mathcal{L}}{\partial \lambda} = g(x)$ recovers the constraint, and $\frac{\partial \mathcal{L}}{\partial x} = 0$ makes the gradient of $f$ parallel to the gradient of $g$. Solving the equations $\frac{\partial \mathcal{L}(x, \lambda)}{\partial x} = 0$ and $\frac{\partial \mathcal{L}(x, \lambda)}{\partial \lambda} = 0$ finds the maximum or minimum under the constraint. Similarly, if there are many constraints, there are multiple $\lambda$s. Similar things also apply to functions with multiple arguments.

The argument $x$ can be a number or even a function, which involves the functional derivative. A functional is a function that takes a function as input and outputs a value (it's a higher-order function). The functional derivative (also called variational derivative) is the derivative of a functional with respect to its argument function.
To compute a functional derivative, we add a small "perturbation" to the function: $f(x)$ becomes $f(x) + \epsilon \cdot \eta(x)$, where epsilon $\epsilon$ is an infinitely small value that approaches zero, and eta $\eta(x)$ is a test function, which can be any function satisfying some regularity properties. The definition of the functional derivative:

$$\frac{\partial G(f + \epsilon\eta)}{\partial \epsilon}\bigg|_{\epsilon \to 0} = \int \frac{\partial G}{\partial f}\, \eta(x)\, dx$$

Note that it sits inside an integration. For example, this is a functional: $G(f) = \int x f(x) dx$. To compute the functional derivative $\frac{\partial G}{\partial f}$, we first compute $\frac{\partial G(f + \epsilon\eta)}{\partial \epsilon}$ and then try to put it into the form $\int \boxed{\frac{\partial G}{\partial f}} \cdot \eta(x)\, dx$:

$$\frac{\partial G(f + \epsilon\eta)}{\partial \epsilon} = \frac{\partial}{\partial \epsilon} \int x \left(f(x) + \epsilon\eta(x)\right) dx = \int x\, \eta(x)\, dx$$

Then, by pattern matching with the definition, we get $\frac{\partial G}{\partial f} = x$.

Calculating the functional derivative for $G(f) = \int x^2 f(x) dx$ the same way gives $\frac{\partial G}{\partial f} = x^2$.

Calculate the functional derivative for $G(f) = \int \left(-f(x) \log f(x)\right) dx$:

$$\frac{\partial G(f + \epsilon\eta)}{\partial \epsilon} = \int \left(-\log(f(x) + \epsilon\eta(x)) - 1\right) \eta(x)\, dx$$

As $\log$ is continuous and $\epsilon\eta(x)$ is infinitely small, $\log(f(x) + \epsilon\eta(x)) = \log(f(x))$, so $\frac{\partial G}{\partial f} = -\log f(x) - 1$.

If we constrain the variable's range, $a \leq X \leq b$, we can maximize its entropy using the functional derivative. We have the constraint $\int_a^b f(x) dx = 1$, which is $\int_a^b f(x) dx - 1 = 0$, giving the Lagrangian:

$$\mathcal{L}(f, \lambda_1) = \int_a^b f(x) \log\frac{1}{f(x)} dx + \lambda_1 \left(\int_a^b f(x) dx - 1\right)$$
Compute the derivatives:

$$\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) - 1 + \lambda_1 \qquad \frac{\partial \mathcal{L}}{\partial \lambda_1} = \int_a^b f(x) dx - 1$$

Solving $\frac{\partial \mathcal{L}}{\partial f} = 0$ gives $f(x) = e^{-1 + \lambda_1}$, a constant. Solving $\frac{\partial \mathcal{L}}{\partial \lambda_1} = 0$ then fixes the constant. The result is $f(x) = \frac{1}{b - a}\ (a \leq x \leq b)$: the uniform distribution.

The normal distribution, also called the Gaussian distribution, is important in statistics. It's the distribution with maximum entropy if we constrain its variance $\sigma^2$ to be a finite value. It has two parameters: the mean $\mu$ and the standard deviation $\sigma$. $N(\mu, \sigma^2)$ denotes a normal distribution. Changing $\mu$ moves the PDF along the X axis; changing $\sigma$ scales the PDF along the X axis.

We can rediscover the normal distribution by maximizing entropy under a variance constraint. For a distribution's probability density function $f$, we want to maximize its entropy $H(f) = \int f(x) \log\frac{1}{f(x)} dx$ under the constraints that $f$ integrates to 1 and has variance $\sigma^2$. We can simplify to make the deduction easier by assuming the mean is 0, so the variance constraint is $\int f(x)\, x^2\, dx = \sigma^2$. The Lagrangian function:

$$\mathcal{L} = \int_{-\infty}^{\infty} f(x) \log\frac{1}{f(x)} dx + \lambda_1 \left(\int_{-\infty}^{\infty} f(x) dx - 1\right) + \lambda_2 \left(\int_{-\infty}^{\infty} f(x) x^2 dx - \sigma^2\right)$$

$$= \int_{-\infty}^{\infty} \left(-f(x) \log f(x) + \lambda_1 f(x) + \lambda_2 x^2 f(x)\right) dx - \lambda_1 - \lambda_2 \sigma^2$$

Then compute the functional derivative:

$$\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) - 1 + \lambda_1 + \lambda_2 x^2$$

Then solve $\frac{\partial \mathcal{L}}{\partial f} = 0$:

$$f(x) = e^{-1 + \lambda_1 + \lambda_2 x^2} = e^{-1 + \lambda_1} e^{\lambda_2 x^2}$$

We get the rough form of the normal distribution's probability density function.
Then solve $\frac{\partial \mathcal{L}}{\partial \lambda_1} = 0$:

$$\int_{-\infty}^{\infty} e^{-1 + \lambda_1} e^{\lambda_2 x^2} dx = 1$$

That integration must converge, so $\lambda_2 < 0$.

A subproblem: solve $\int_{-\infty}^{\infty} e^{-k x^2} dx$ ($k > 0$). The trick is to first compute its square $\left(\int_{-\infty}^{\infty} e^{-k x^2} dx\right)^2$, turning the integration into two dimensions, and then substitute polar coordinates $x = r\cos\theta,\ y = r\sin\theta,\ x^2 + y^2 = r^2,\ dx\, dy = r\, dr\, d\theta$:

$$\left(\int_{-\infty}^{\infty} e^{-k x^2} dx\right)^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-k(x^2 + y^2)} dx\, dy = \int_0^{2\pi}\int_0^{\infty} e^{-k r^2}\, r\, dr\, d\theta$$

Then substitute $u = -k r^2,\ du = -2kr\, dr,\ dr = -\frac{1}{2kr} du$:

$$\int_0^{2\pi}\int_0^{\infty} e^{-k r^2}\, r\, dr\, d\theta = 2\pi \int_0^{-\infty} e^u \left(-\frac{1}{2k}\right) du = \frac{2\pi}{2k} = \frac{\pi}{k}$$

So $\int_{-\infty}^{\infty} e^{-k x^2} dx = \sqrt{\frac{\pi}{k}}$. Putting $-\lambda_2 = k$ into the normalization condition:

$$e^{-1 + \lambda_1} \sqrt{\frac{\pi}{-\lambda_2}} = 1 \quad\Rightarrow\quad e^{-1 + \lambda_1} = \sqrt{\frac{-\lambda_2}{\pi}}$$

Then solve $\frac{\partial \mathcal{L}}{\partial \lambda_2} = 0$. It requires another trick.
For the previous result $\int_{-\infty}^{\infty} e^{-k x^2} dx = \sqrt{\pi}\, k^{-\frac{1}{2}}$, take the derivative with respect to $k$ on both sides:

$$\int_{-\infty}^{\infty} (-x^2) e^{-k x^2} dx = -\frac{1}{2}\sqrt{\pi}\, k^{-\frac{3}{2}}$$

So $\int_{-\infty}^{\infty} x^2 e^{-k x^2} dx = \frac{1}{2}\sqrt{\frac{\pi}{k^3}}$. The variance constraint $\frac{\partial \mathcal{L}}{\partial \lambda_2} = 0$ becomes:

$$e^{-1 + \lambda_1} \cdot \frac{1}{2}\sqrt{\frac{\pi}{(-\lambda_2)^3}} = \sigma^2$$

Using $e^{-1 + \lambda_1} = \sqrt{\frac{-\lambda_2}{\pi}}$, we get:

$$\sqrt{\frac{-\lambda_2}{\pi}} \cdot \frac{1}{2}\sqrt{\frac{\pi}{(-\lambda_2)^3}} = \frac{1}{2}\sqrt{\frac{1}{\lambda_2^2}} = \sigma^2$$

so $\frac{1}{|\lambda_2|} = 2\sigma^2$. We previously found $\lambda_2 < 0$, so $\lambda_2 = -\frac{1}{2\sigma^2}$, and then $e^{-1 + \lambda_1} = \sqrt{\frac{1}{2\pi\sigma^2}}$. We have finally deduced the normal distribution's probability density function (when the mean is 0):

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2\sigma^2} x^2}$$

When the mean is not 0, substituting $x \to x - \mu$ gives the general normal distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2\sigma^2}(x - \mu)^2} = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

Entropy of normal distribution

We can then calculate the entropy of the normal distribution:

$$H(f) = \int f(x) \log\frac{1}{f(x)} dx = \int f(x) \left(\frac{1}{2}\log(2\pi\sigma^2) + \frac{(x - \mu)^2}{2\sigma^2}\right) dx = \frac{1}{2}\log(2\pi\sigma^2)\underbrace{\int f(x) dx}_{=1} + \frac{1}{2\sigma^2}\underbrace{\int f(x)(x - \mu)^2 dx}_{=\sigma^2}$$

$$= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2} = \frac{1}{2}\log(2\pi e \sigma^2)$$

If X follows a normal distribution and Y is any distribution with the same mean and variance, the cross entropy $H(Y, X)$ has the same value, $\frac{1}{2}\log(2\pi e \sigma^2)$, regardless of the exact probability density function of Y. The deduction is similar to the above:

$$H(Y, X) = \int f_Y(x) \left(\frac{1}{2}\log(2\pi\sigma^2) + \frac{(x - \mu)^2}{2\sigma^2}\right) dx = \frac{1}{2}\log(2\pi\sigma^2)\underbrace{\int f_Y(x) dx}_{=1} + \frac{1}{2\sigma^2}\underbrace{\int f_Y(x)(x - \mu)^2 dx}_{=\sigma^2} = \frac{1}{2}\log(2\pi e \sigma^2)$$

Central limit theorem

We have a random variable $X$ with mean 0 and finite variance $\sigma^2$. If we add up $n$ independent samples of $X$, $X_1 + X_2 + ... + X_n$, the variance of the sum is $n\sigma^2$. To make the variance constant, we can divide by $\sqrt{n}$, giving $S_n = \frac{X_1 + X_2 + ... + X_n}{\sqrt{n}}$. Here $S_n$ is called the standardized sum, because its variance does not change with the sample count. The central limit theorem says that the standardized sum approaches the normal distribution as $n$ increases. No matter what the original distribution of $X$ is (as long as its variance is finite), the standardized sum will approach the normal distribution.
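A quick simulation sketch of the central limit theorem, using the uniform distribution on $[-1, 1]$ (an arbitrary choice with mean 0 and variance $\frac{1}{3}$):

```python
import math, random

random.seed(3)

n = 30           # samples per standardized sum
trials = 50_000
var_x = 1 / 3    # variance of uniform(-1, 1); its mean is 0

# Standardized sums S_n = (X_1 + ... + X_n) / sqrt(n).
sums = [sum(random.uniform(-1, 1) for _ in range(n)) / math.sqrt(n)
        for _ in range(trials)]

mean_s = sum(sums) / trials
var_s = sum((v - mean_s) ** 2 for v in sums) / trials

# For a normal distribution, P(|S| < one standard deviation) ≈ 0.6827.
sigma = math.sqrt(var_x)
frac_within = sum(1 for v in sums if abs(v) < sigma) / trials
print(mean_s, var_s, frac_within)
```

The variance of $S_n$ stays at $\frac{1}{3}$ regardless of $n$, and the fraction of mass within one standard deviation already matches the normal value closely at $n = 30$.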
The information of the distribution of $X$ is "washed out" during the process. This "washing out of information" is also an increase of entropy. As $n$ increases, the entropy of the standardized sum always increases (except when X follows the normal distribution, in which case the entropy stays at the maximum): $H(S_{n+1}) > H(S_n)$ if $X$ is not normally distributed. The normal distribution has the maximum entropy under the variance constraint. As the entropy of the standardized sum increases, it approaches that maximum, and the distribution approaches normal. This is similar to the second law of thermodynamics, and it is called the Entropic Central Limit Theorem. Proving it is hard and requires a lot of prerequisite knowledge. See also: Solution of Shannon's problem on the monotonicity of entropy, Generalized Entropy Power Inequalities and Monotonicity Properties of Information.

In the real world, many things follow the normal distribution, like people's height and weight, errors in manufacturing, errors in measurement, etc. Height is affected by many complex factors (nutrition, health, genetic factors, exercise, environmental factors, etc.). The combination of these complex factors definitely cannot be simplified to a standardized sum of i.i.d. zero-mean samples $\frac{X_1 + X_2 + ... + X_n}{\sqrt{n}}$: some factors have large effects and some have small effects, and the factors are not necessarily independent. But people's height still roughly follows a normal distribution. This can be semi-explained by the second law of thermodynamics: the complex interactions of many factors increase the entropy of height, while at the same time many factors constrain the variance of height.

Why is there a variance constraint? In some cases variance corresponds to instability. A human that is 100 meters tall is impossible, as it's physically unstable.
Similarly, a human that's 1 cm tall could not maintain normal biological function. Unstable things tend to collapse and vanish (survivorship bias), and stable things remain. That's how variance constraints occur in nature. In some cases, variance corresponds to energy, and the variance is constrained by conservation of energy.

Although the normal distribution is common, not all distributions are normal; many things follow fat-tailed distributions. Also note that the central limit theorem works as $n$ approaches infinity. Even if a distribution's standardized sum approaches the normal distribution, the speed of convergence matters: some distributions converge to normal quickly, and some slowly. Some fat-tailed distributions have finite variance, but their standardized sums converge to the normal distribution very slowly.

In what follows, a bold letter (like $\boldsymbol x$) means a column vector:

$$\boldsymbol x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

Linear transform: for a (column) vector $\boldsymbol x$, multiplying a matrix $A$ onto it, $A\boldsymbol x$, is a linear transformation. A linear transformation can contain rotation, scaling, and shearing. For a row vector it's $\boldsymbol x A$. Two linear transformations can be combined into one, corresponding to matrix multiplication.

Affine transform: for a (column) vector $\boldsymbol x$, multiply a matrix onto it and then add an offset: $A\boldsymbol x + \boldsymbol b$. It can translate on top of the result of a linear transform. Two affine transformations can also be combined into one: if $\boldsymbol y = A\boldsymbol x + \boldsymbol b$ and $\boldsymbol z = C\boldsymbol y + \boldsymbol d$, then $\boldsymbol z = (CA)\boldsymbol x + (C\boldsymbol b + \boldsymbol d)$. (In some places affine transformations are called "linear transformations".)
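A small sketch checking that two chained affine transformations equal the single combined one (the matrices here are arbitrary toy values):

```python
def affine(M, b, v):
    # Apply the 2D affine transform M v + b.
    return (M[0][0] * v[0] + M[0][1] * v[1] + b[0],
            M[1][0] * v[0] + M[1][1] * v[1] + b[1])

def matmul(C, A):
    return [[sum(C[i][k] * A[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A, b = [[2.0, 0.0], [1.0, 1.0]], (1.0, -1.0)
C, d = [[0.0, 1.0], [3.0, 2.0]], (5.0, 0.0)
x = (0.5, 2.0)

# Apply y = A x + b, then z = C y + d.
z_two_steps = affine(C, d, affine(A, b, x))

# Combine into one affine transform: z = (C A) x + (C b + d).
CA = matmul(C, A)
Cb_plus_d = affine(C, d, b)   # C b + d is itself "apply (C, d) to b"
z_one_step = affine(CA, Cb_plus_d, x)
print(z_two_steps, z_one_step)
```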
The normal distribution has linear properties: scaling and shifting a normal variable keeps it normal, and a sum of independent normal variables is normal, so an affine transform $\boldsymbol y = A\boldsymbol x + \boldsymbol b$ of a vector of independent normal variables follows a multivariate normal distribution. Note that the elements of $\boldsymbol y$ are no longer necessarily independent. What if we apply two or many affine transformations? Two affine transformations can be combined into one, so the result is still a multivariate normal distribution.

To describe a multivariate normal distribution, an important concept is the covariance matrix. Recall covariance: $\text{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])]$. Some rules about covariance: $\text{Cov}[X, X] = \text{Var}[X]$, $\text{Cov}[X, Y] = \text{Cov}[Y, X]$, and covariance is linear in each argument. The covariance matrix:

$$\text{Cov}(\boldsymbol x, \boldsymbol y) = E\left[(\boldsymbol x - E[\boldsymbol x])(\boldsymbol y - E[\boldsymbol y])^T\right]$$

Here $E[\boldsymbol x]$ takes the mean of each element in $\boldsymbol x$ and outputs a vector. It's element-wise: $E[\boldsymbol x]_i = E[\boldsymbol x_i]$. Similar for matrices. The covariance matrix written out:

$$\text{Cov}(\boldsymbol x, \boldsymbol y) = \begin{pmatrix} \text{Cov}[x_1, y_1] & \text{Cov}[x_1, y_2] & \dots & \text{Cov}[x_1, y_n] \\ \text{Cov}[x_2, y_1] & \text{Cov}[x_2, y_2] & \dots & \text{Cov}[x_2, y_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}[x_n, y_1] & \text{Cov}[x_n, y_2] & \dots & \text{Cov}[x_n, y_n] \end{pmatrix}$$

Recall that multiplying by a constant and addition can be "moved out of $E[\,]$": $E[kX] = kE[X],\ E[X + Y] = E[X] + E[Y]$. If $A$ is a matrix that contains random variables and $B$ is a matrix that's not random, then $E[A \cdot B] = E[A] \cdot B$ and $E[B \cdot A] = B \cdot E[A]$, because multiplying a matrix comes down to multiplying by constants and adding up, which all can "move out of $E[\,]$". A vector can be seen as a special kind of matrix.
So, applying it to the covariance matrix:

$$\text{Cov}(A\boldsymbol x, \boldsymbol y) = E\left[(A\boldsymbol x - E[A\boldsymbol x])(\boldsymbol y - E[\boldsymbol y])^T\right] = A\, E\left[(\boldsymbol x - E[\boldsymbol x])(\boldsymbol y - E[\boldsymbol y])^T\right] = A\, \text{Cov}(\boldsymbol x, \boldsymbol y)$$

Similarly, $\text{Cov}(\boldsymbol x, B\boldsymbol y) = \text{Cov}(\boldsymbol x, \boldsymbol y)\, B^T$.

If $\boldsymbol x$ follows a multivariate normal distribution, it can be described by a mean vector $\boldsymbol\mu$ (the mean of each element of $\boldsymbol x$) and a covariance matrix $\text{Cov}(\boldsymbol x, \boldsymbol x)$. Say we start with some independent normal variables $x_1, x_2, ..., x_n$ with means $\mu_1, ..., \mu_n$ and variances $\sigma_1^2, ..., \sigma_n^2$. If we treat them as one multivariate normal distribution, the mean vector is $\boldsymbol\mu_x = (\mu_1, ..., \mu_n)$, and the covariance matrix is diagonal because they are independent:

$$\text{Cov}(\boldsymbol x, \boldsymbol x) = \begin{pmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_n^2 \end{pmatrix}$$

Then if we apply an affine transformation $\boldsymbol y = A\boldsymbol x + \boldsymbol b$, then $\boldsymbol\mu_y = A\boldsymbol\mu_x + \boldsymbol b$ and

$$\text{Cov}(\boldsymbol y, \boldsymbol y) = \text{Cov}(A\boldsymbol x + \boldsymbol b, A\boldsymbol x + \boldsymbol b) = \text{Cov}(A\boldsymbol x, A\boldsymbol x) = A\, \text{Cov}(\boldsymbol x, \boldsymbol x)\, A^T$$

The industry standard of 3D modelling is to model a 3D object as many triangles, called a mesh. It only models the visible surface of the object, using many triangles to approximate curved surfaces. Gaussian splatting provides an alternative method of 3D modelling.
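A Monte Carlo sketch (with arbitrary toy values for $A$, $\boldsymbol b$, and the variances) checking $\text{Cov}(\boldsymbol y, \boldsymbol y) = A\,\text{Cov}(\boldsymbol x, \boldsymbol x)A^T$ empirically:

```python
import random

random.seed(0)

sigmas = (1.0, 2.0, 0.5)           # std-devs of independent normals x_1..x_3
A = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]
b = (3.0, -1.0)

def sample_y():
    # One sample of y = A x + b with independent zero-mean normal x.
    x = [random.gauss(0.0, s) for s in sigmas]
    return tuple(sum(A[i][j] * x[j] for j in range(3)) + b[i] for i in range(2))

ys = [sample_y() for _ in range(200_000)]
mean = [sum(y[i] for y in ys) / len(ys) for i in range(2)]

cov_empirical = [[sum((y[i] - mean[i]) * (y[j] - mean[j]) for y in ys) / len(ys)
                  for j in range(2)] for i in range(2)]

# Theory: Cov(y, y) = A Cov(x, x) A^T with Cov(x, x) = diag(sigma_i^2).
cov_theory = [[sum(A[i][k] * sigmas[k] ** 2 * A[j][k] for k in range(3))
               for j in range(2)] for i in range(2)]
print(cov_empirical)
print(cov_theory)
```

Note that the offset $\boldsymbol b$ shifts the mean but drops out of the covariance entirely.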
The 3D scene is modelled by a lot of multivariate (3D) Gaussian distributions, each called a gaussian. When rendering, each 3D Gaussian distribution is projected onto a plane (the screen) and approximately becomes a 2D Gaussian distribution, where probability density now corresponds to color opacity. Note that the projection is a perspective projection (near things look big and far things small), which is not linear. After perspective projection, the 3D Gaussian distribution is no longer strictly a 2D Gaussian distribution, but it can be approximated by one. Triangle meshes are often modelled by people, while a Gaussian splatting scene is often trained from photos of a scene taken from different perspectives. A gaussian's color can be fixed or can change based on the view direction. Gaussian splatting also works in 4D, by adding a time dimension.

In a diffusion model, we add Gaussian noise to an image (or other things). The diffusion model takes the noisy input, and we train it to output the noise that was added. There are many steps of adding noise, and the model should output the noise added in each step. Tweedie's formula shows that estimating the added noise is the same as computing the score of the noised data distribution. To simplify, here we only consider one dimension and one noise step (the same also applies to many dimensions and many noise steps). If the original value is $x_0$ and we add a noise $\epsilon \sim N(0, \sigma^2)$, the noise-added value is $x_1 = x_0 + \epsilon$, so $x_1 \sim N(x_0, \sigma^2)$. The diffusion model only knows $x_1$ and doesn't know $x_0$; it needs to estimate $\epsilon$ from $x_1$.
(I use $p_{1 \vert 0}(x_1 \vert x_0)$ instead of the shorter $p(x_1 \vert x_0)$ to reduce confusion between the different distributions.) $p_{1 \vert 0}(x_1 \vert x_0)$ is a normal distribution:

$$p_{1 \vert 0}(x_1 \vert x_0) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x_1 - x_0}{\sigma}\right)^2}$$

Take the log:

$$\log p_{1 \vert 0}(x_1 \vert x_0) = -\frac{(x_1 - x_0)^2}{2\sigma^2} + \log\frac{1}{\sqrt{2\pi}\,\sigma}$$

The linear score function under this condition:

$$\frac{\partial \log p_{1 \vert 0}(x_1 \vert x_0)}{\partial x_1} = -\frac{x_1 - x_0}{\sigma^2}$$

Bayes' rule:

$$p_{0 \vert 1}(x_0 \vert x_1) = \frac{p_{1 \vert 0}(x_1 \vert x_0)\, p_0(x_0)}{p_1(x_1)}$$

Take the log and the partial derivative with respect to $x_1$ (note that $\log p_0(x_0)$ does not depend on $x_1$, so its derivative vanishes):

$$\frac{\partial \log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} = \frac{\partial \log p_{1 \vert 0}(x_1 \vert x_0)}{\partial x_1} - \frac{\partial \log p_1(x_1)}{\partial x_1}$$

Using the previous result $\frac{\partial \log p_{1 \vert 0}(x_1 \vert x_0)}{\partial x_1} = -\frac{x_1 - x_0}{\sigma^2}$:

$$\frac{\partial \log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} = -\frac{x_1 - x_0}{\sigma^2} - \frac{\partial \log p_1(x_1)}{\partial x_1}$$

Now suppose we already know the noise-added value $x_1$ but we don't know $x_0$, so $x_0$ is uncertain. We want to compute the expectation of $x_0$ under the condition that $x_1$ is known.
Rearranging the previous equation for $x_0$ and taking the conditional expectation given $x_1$:

$$E_{x_0}[x_0 \mid x_1] = x_1 + E_{x_0}\left[\sigma^2 \frac{\partial \log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} \bigg\vert\, x_1\right] + E_{x_0}\left[\sigma^2 \frac{\partial \log p_1(x_1)}{\partial x_1} \bigg\vert\, x_1\right]$$

Within it, $E_{x_0}\left[\frac{\partial \log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} \big\vert\, x_1\right] = 0$, because

$$E_{x_0}\left[\frac{\partial \log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} \bigg\vert\, x_1\right] = \int p_{0 \vert 1}(x_0 \vert x_1) \cdot \frac{1}{p_{0 \vert 1}(x_0 \vert x_1)} \cdot \frac{\partial p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} dx_0 = \int \frac{\partial p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} dx_0 = \frac{\partial \int p_{0 \vert 1}(x_0 \vert x_1) dx_0}{\partial x_1} = \frac{\partial 1}{\partial x_1} = 0$$

And $E_{x_0}\left[\sigma^2 \frac{\partial \log p_1(x_1)}{\partial x_1} \big\vert\, x_1\right] = \sigma^2 \frac{\partial \log p_1(x_1)}{\partial x_1}$, because it is unrelated to the random $x_0$. So:

$$E_{x_0}[x_0 \mid x_1] = x_1 + \sigma^2 \frac{\partial \log p_1(x_1)}{\partial x_1}$$

That's Tweedie's formula (for the 1D case).
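Tweedie's formula can be checked numerically in a case where everything is Gaussian. If the prior is $x_0 \sim N(0, s^2)$, then $p_1 = N(0, s^2 + \sigma^2)$, the score of $p_1$ is $-\frac{x_1}{s^2 + \sigma^2}$, and the formula predicts $E[x_0 \mid x_1] = x_1 \frac{s^2}{s^2 + \sigma^2}$. A sketch (toy variances, pure Monte Carlo):

```python
import math, random

random.seed(4)

s2 = 4.0      # variance of the prior x0 ~ N(0, s2) (a toy choice)
sigma2 = 1.0  # variance of the added gaussian noise

# Tweedie: E[x0 | x1] = x1 + sigma2 * d/dx1 log p1(x1).
# Here p1 = N(0, s2 + sigma2), so the score is -x1 / (s2 + sigma2).
def tweedie(x1):
    score = -x1 / (s2 + sigma2)
    return x1 + sigma2 * score

# Monte Carlo check: average the true x0 over pairs whose x1 lands
# near a chosen target value.
target, width = 2.0, 0.05
hits = []
for _ in range(500_000):
    x0 = random.gauss(0.0, math.sqrt(s2))
    x1 = x0 + random.gauss(0.0, math.sqrt(sigma2))
    if abs(x1 - target) < width:
        hits.append(x0)

mc_mean = sum(hits) / len(hits)
print(mc_mean, tweedie(target))
```

The denoised estimate $E[x_0 \mid x_1]$ shrinks $x_1$ toward the prior mean, and the simulated conditional average agrees with the formula.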
It can be generalized to many dimensions, where $x_0, x_1$ are vectors and $p_0, p_1, p_{0 \vert 1}, p_{1 \vert 0}$ are joint distributions whose dimensions are not necessarily independent. The Gaussian noise added to each dimension is still independent. The diffusion model is trained to estimate the added noise, which is the same as estimating the linear score.

If we have the constraint $X \geq 0$ and fix the mean $E[X]$ to a specific value $\mu$, then maximizing entropy gives the exponential distribution. It can be rediscovered via Lagrange multipliers:

$$\mathcal{L}(f, \lambda_1, \lambda_2) = \int_0^\infty f(x) \log \frac 1 {f(x)}\, dx + \lambda_1 \left( \int_0^\infty f(x)\, dx - 1 \right) + \lambda_2 \left( \int_0^\infty x f(x)\, dx - \mu \right)$$

$$=\int_{0}^{\infty} (-f(x)\log f(x) + \lambda_1 f(x) + \lambda_2 x f(x) )\, dx - \lambda_1 - \lambda_2\mu$$

$$\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) - 1 + \lambda_1 + \lambda_2 x \quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_1}=\int_0^{\infty}f(x)dx-1 \quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_2}=\int_0^{\infty} xf(x)dx-\mu$$

Then solve $\frac{\partial \mathcal{L}}{\partial f}=0$:

$$f(x) = e^{\lambda_1 - 1 + \lambda_2 x} = e^{\lambda_1-1} e^{\lambda_2 x}$$

Then solve $\frac{\partial \mathcal{L}}{\partial \lambda_1}=0$:

$$\int_0^\infty e^{\lambda_1-1} e^{\lambda_2 x}\, dx = 1$$

To make that integration finite, $\lambda_2 < 0$.
Let $u = \lambda_2 x,\ du = \lambda_2 dx,\ dx=\frac 1 {\lambda_2} du$:

$$\int_0^\infty e^{\lambda_2 x}\, dx = \frac 1 {\lambda_2}\left[ e^u \right]_{u=0}^{u=-\infty} = -\frac 1 {\lambda_2} \quad\Rightarrow\quad e^{\lambda_1-1} = -\lambda_2$$

Then solve $\frac{\partial \mathcal{L}}{\partial \lambda_2}=0$:

$$\int_0^\infty x\, e^{\lambda_1-1} e^{\lambda_2 x}\, dx = e^{\lambda_1-1} \frac 1 {\lambda_2^2} = \mu$$

Now we have $e^{\lambda_1-1} = -\lambda_2$ and $e^{\lambda_1-1} = \mu\lambda_2^2$. Solving gives $\lambda_2 = - \frac 1 {\mu},\ e^{1-\lambda_1} = \mu$. Then

$$f(x) = e^{\lambda_1-1} e^{\lambda_2 x} = \frac 1 \mu e^{-\frac x \mu}$$

In the common definition of the exponential distribution, $\lambda = \frac 1 \mu$ and $f(x) = \lambda e^{-\lambda x}$. Its tail function:

$$P(X > x) = \int_x^\infty \lambda e^{-\lambda y}\, dy = \left[ -e^{-\lambda y} \right]_{y=x}^{y=\infty} = e^{-\lambda x}$$

If some event happens at a fixed rate $\lambda$, the exponential distribution measures **how long we need to wait for the next event**, assuming how long we still need to wait is irrelevant to how long we have already waited (**memorylessness**). The exponential distribution can measure waiting times of this kind.

How to understand memorylessness? For example, a kind of radioactive atom decays once per 5 minutes on average. If the time unit is a minute, then $\lambda = \frac 1 5$. For a specific atom, the expected time we need to wait for it to decay is 5 minutes. However, if we have already waited 3 minutes and it still hasn't decayed, the expected remaining wait is still 5 minutes. If we have waited 100 minutes and it still hasn't decayed, the expected remaining wait is still 5 minutes. The atom doesn't "remember" how long we have waited.

Memorylessness means the probability that we still need to wait $\text{needToWait}$ amount of time is irrelevant to how long we have already waited:

$$P(X > \text{alreadyWaited} + \text{needToWait} \mid X > \text{alreadyWaited}) = P(X > \text{needToWait})$$

(We can also rediscover the exponential distribution from memorylessness alone.) Memorylessness is related to the maximum-entropy property. Maximizing entropy under constraints means maximizing uncertainty and minimizing information other than the constraints.
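The atom example can be simulated directly to see memorylessness (a sketch assuming NumPy; the 3-minute cutoff is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1 / 5                                    # one decay per 5 minutes on average
waits = rng.exponential(1 / lam, 1_000_000)    # NumPy's exponential takes the scale 1/lambda

survivors = waits[waits > 3]                   # atoms that haven't decayed by minute 3
print(waits.mean())              # ~5: unconditional expected wait
print((survivors - 3).mean())    # ~5 again: the extra wait after minute 3
```

Conditioning on having already waited does not change the expected remaining wait.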
The only two constraints are $X\geq 0$ (the wait time is positive) and $E[X]=\frac 1 \lambda$ (the average rate of the event). Other than these two constraints there is no extra information: no information says waiting reduces the remaining wait, and no information says waiting increases it. So the most unbiased choice is that waiting has no effect on the remaining wait time. If the radioactive atom had some "internal memory" that changes over time and controls how likely it is to decay, then the waiting-time distribution would encode extra information beyond the two constraints, making it no longer max-entropy.

The 80/20 rule: for example, 80% of wealth is held by the richest 20% (the real numbers may differ). It has a **fractal property**: even within the richest 20%, 80% of that wealth is held by the richest 20% within. From this fractal-like property we naturally get the **Pareto distribution**.

Say the total people count is $N$ and the total wealth is $W$. Then $0.2N$ people have $0.8W$ wealth. Applying the same rule within those $0.2N$ people: $0.2 \cdot 0.2 N$ people have $0.8 \cdot 0.8W$ wealth. Applying it again, $0.2^3 N$ people have $0.8^3 W$ wealth. Generalizing, $0.2^k N$ people have $0.8^k W$ wealth ($k$ can be generalized to a continuous real number).

Let the wealth variable be $X$ (assume $X > 0$) with probability density function $f(x)$, and let proportions of people correspond to probability. The richest $0.2^k$ proportion of people hold $0.8^k W$ wealth; if $t$ is the wealth threshold (minimum wealth) of that group, then

$$\int_t^\infty f(x)\, dx = 0.2^k$$

Note that $f(x)$ is a probability density function (PDF), which corresponds to the density of the proportion of people.
$N\cdot f(x)$ is then the people-count density over wealth. Multiplying it by the wealth $x$ and integrating gives the total wealth in a range:

$$\int_t^\infty N x f(x)\, dx = 0.8^k W$$

We can rediscover the Pareto distribution from these. The first step is to extract and eliminate $k$: from $\int_t^\infty f(x)\, dx = 0.2^k$ we get $0.8^k = \left( \int_t^\infty f(x)\, dx \right)^\beta$ with $\beta = \frac{\log 0.8}{\log 0.2}$, so

$$\int_t^\infty x f(x)\, dx = \frac W N \left( \int_t^\infty f(x)\, dx \right)^\beta$$

Then we take the derivative to $t$ on both sides:

$$-t f(t) = \frac W N \beta \left( \int_t^\infty f(x)\, dx \right)^{\beta-1} \cdot (-f(t))$$

$f(t) \neq 0$, so divide both sides by $-f(t)$:

$$t = \frac W N \beta \left( \int_t^\infty f(x)\, dx \right)^{\beta-1}$$

Take the derivative to $t$ again:

$$1 = \frac W N \beta(\beta-1) \left( \int_t^\infty f(x)\, dx \right)^{\beta-2} \cdot (-f(t))$$

Solving for $f(t)$, renaming the argument $t$ to $x$, and doing some adjustments, the PDF has the form $f(x) \propto x^{-\alpha-1}$, where $\alpha = \frac 1 {1-\beta} = -\frac {\log 0.2} {\log 0.8-\log 0.2} = \frac{\log 5}{\log 4} \approx 1.161$.

Now we have the shape of the PDF, but we still need the total probability to be 1, and there is no unknown parameter left in the PDF to adjust. The solution is to crop the range of $X$: set a minimum wealth $m$ (constraint $X \geq m$, with no constraint on the maximum), then normalizing over $[m, \infty)$ with the previous result gives

$$f(x) = \alpha m^\alpha x^{-\alpha-1} \quad (x \geq m)$$

Now we have rediscovered (a special case of) the Pareto distribution from just the fractal 80/20 rule. Generalizing to other cases like the 90/10 rule or the 80/10 rule gives the **Pareto (Type I) distribution**, with two parameters: the shape $\alpha$ (the 80/20 rule corresponds to $\alpha = \frac{\log 5}{\log 4} \approx 1.161$) and the minimum value $m$.

Note that in the real world one's wealth can be negative (more debts than assets); the Pareto distribution is just an approximation, and $m$ is the threshold above which it starts to be a good one. If $\alpha \leq 1$, the theoretical mean is infinite. With finite samples the sample mean is of course finite, but when the theoretical mean is infinite, the more samples we have, the larger the sample mean tends to be, and the trend won't stop. If $\alpha \leq 2$, the theoretical variance is infinite.
Recall that the central limit theorem requires finite variance. The standardized sum of values drawn from a Pareto distribution with $\alpha \leq 2$ does not obey the central limit theorem, because the distribution has infinite variance.

The Pareto distribution is often described by its tail function (rather than its probability density function):

$$P(X > x) = \left( \frac m x \right)^\alpha \quad (x \geq m)$$

There are additive values, like length, mass, and money. For additive values we usually compute the arithmetic average $\frac 1 n (x_1 + x_2 + .. + x_n)$. There are also multiplicative values, like asset return rates and growth ratios. For multiplicative values we usually compute the geometric average $(x_1 \cdot x_2 \cdot ... \cdot x_n)^{\frac 1 n}$. For example, if an asset grows by 20% in the first year, drops 10% in the second year and grows 1% in the third year, then the average growth ratio per year is $(1.2 \cdot 0.9 \cdot 1.01)^{\frac 1 3}$.

Logarithms turn multiplication into addition and powers into multiplication. If $y = \log x$, then the log of the geometric average of $x$ is the arithmetic average of $y$:

$$\log \left( x_1 x_2 \cdots x_n \right)^{\frac 1 n} = \frac 1 n \left( \log x_1 + \log x_2 + \cdots + \log x_n \right)$$

The Pareto distribution maximizes entropy under a geometric-mean constraint $E[\log X]$.
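The asset-return example can be computed both ways to see the log correspondence:

```python
import math

returns = [1.2, 0.9, 1.01]   # +20%, -10%, +1%

# Geometric average directly, and via logs (mean of log x, then exp back).
geo = math.prod(returns) ** (1 / len(returns))
via_logs = math.exp(sum(math.log(r) for r in returns) / len(returns))
print(geo, via_logs)   # identical, ~1.0294 average growth per year
```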
With the constraints $X \geq m > 0$ and $E[\log X] = g$, use Lagrange multipliers to maximize entropy:

$$\mathcal{L}(f, \lambda_1, \lambda_2) = \int_m^\infty f(x) \log \frac 1 {f(x)}\, dx + \lambda_1 \left( \int_m^\infty f(x)\, dx - 1 \right) + \lambda_2 \left( \int_m^\infty f(x) \log x\, dx - g \right)$$

$$= \int_m^{\infty} (\ -f(x)\log f(x) + \lambda_1 f(x) + \lambda_2 f(x) \log x \ )\, dx -\lambda_1 - g \lambda_2$$

$$\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) - 1 + \lambda_1 + \lambda_2 \log x \quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_1} = \int_m^{\infty} f(x)\, dx -1 \quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_2} = \int_m^{\infty} f(x) \log x \ dx-g$$

Solve $\frac{\partial \mathcal{L}}{\partial f}=0$:

$$f(x) = e^{\lambda_1 - 1 + \lambda_2 \log x} = e^{\lambda_1-1} x^{\lambda_2}$$

Solve $\frac{\partial \mathcal{L}}{\partial \lambda_1}=0$:

$$\int_m^\infty e^{\lambda_1-1} x^{\lambda_2}\, dx = 1$$

To make $\int_m^{\infty} x^{\lambda_2}dx$ finite, $\lambda_2 < -1$. Then

$$\int_m^\infty x^{\lambda_2}\, dx = \left[ \frac{x^{\lambda_2+1}}{\lambda_2+1} \right]_{x=m}^{x=\infty} = -\frac{m^{\lambda_2+1}}{\lambda_2+1}$$

so $e^{\lambda_1-1} \cdot \left( -\frac{m^{\lambda_2+1}}{\lambda_2+1} \right) = 1$:

$$\frac{m^{\lambda_2+1}}{\lambda_2+1} = -e^{1-\lambda_1} \tag{1}\quad\quad\quad e^{\lambda_1-1}=-\frac{\lambda_2+1}{m^{\lambda_2+1}}$$

Solve $\frac{\partial \mathcal{L}}{\partial \lambda_2}=0$:

$$\int_m^\infty e^{\lambda_1-1} x^{\lambda_2} \log x\, dx = g$$

Temporarily ignore $e^{\lambda_1-1}$ and compute $\int_m^{\infty} x^{\lambda_2} \log x \ dx$. Let $u=\log x$, $x=e^u$, $dx = e^u du$:

$$\int_m^\infty x^{\lambda_2} \log x\, dx = \int_{\log m}^\infty u\, e^{(\lambda_2+1)u}\, du = \left[ e^{(\lambda_2+1)u} \left( \frac{u}{\lambda_2+1} - \frac 1 {(\lambda_2+1)^2} \right) \right]_{u=\log m}^{u=\infty} = -m^{\lambda_2+1} \left( \frac{\log m}{\lambda_2+1} - \frac 1 {(\lambda_2+1)^2} \right)$$

Then, using (1) $e^{\lambda_1-1}=-\frac{\lambda_2+1}{m^{\lambda_2+1}}$:

$$g = e^{\lambda_1-1} \int_m^\infty x^{\lambda_2} \log x\, dx = \log m - \frac 1 {\lambda_2+1} \quad\Rightarrow\quad \lambda_2+1 = \frac 1 {\log m - g}$$

Let $\alpha = -\frac 1 {\log m - g}$; it becomes

$$f(x) = e^{\lambda_1-1} x^{\lambda_2} = \alpha m^\alpha x^{-\alpha-1} \quad (x \geq m)$$

Now we have rediscovered the Pareto (Type I) distribution by maximizing entropy. In the process we required $\lambda_2 < -1$; from $\lambda_2+1 = \frac 1 {\log m - g}$ we know $\log m - g <0$, i.e. $m < e^g$.

For example, if wealth follows a Pareto distribution, how do we compute the wealth share of the top 1%? More generally, how do we compute the share of the top $p$ proportion? First compute the wealth threshold $t$ of that group:

$$P(X > t) = \left( \frac m t \right)^\alpha = p \quad\Rightarrow\quad t = m p^{-1/\alpha}$$

Then compute the share, using

$$\int_b^\infty x\, \alpha m^\alpha x^{-\alpha-1}\, dx = \alpha m^\alpha \left[ \frac{x^{-\alpha+1}}{-\alpha+1} \right]_{x=b}^{x=\infty} = \frac{\alpha m^\alpha}{\alpha-1} b^{-\alpha+1}$$

To make that integration finite we need $-\alpha+1< 0$, i.e. $\alpha > 1$. So

$$\text{share} = \frac{\int_t^\infty x f(x)\, dx}{\int_m^\infty x f(x)\, dx} = \frac{t^{1-\alpha}}{m^{1-\alpha}} = p^{\frac{\alpha-1}{\alpha}}$$

The share proportion is irrelevant to $m$. Some concrete numbers: with $\alpha = \frac{\log 5}{\log 4}$, the top 20% hold $0.2^{(\alpha-1)/\alpha} = 80\%$, and the top 1% hold about 53%.

A distribution is a power law distribution if its tail function $P(X>x)$ is roughly proportional to $x^{-\alpha}$, where $\alpha$ is called the exponent.
The "roughly" here means the tail can deviate from a pure power by a factor that becomes negligible when $x$ is large enough. Rigorously speaking, $P(X>x) \propto L(x) x^{-\alpha}$ where $L$ is a slowly varying function, which requires $\lim_{x \to \infty} \frac{L(rx)}{L(x)}=1$ for every positive $r$.

Note that in some places the power law is written as $P(X>x) \propto L(x) x^{-(\alpha-1)}$; in those places $\alpha$ is 1 larger than the $\alpha$ of the Pareto distribution. The same symbol $\alpha$ can mean different things in different places; here I use the $\alpha$ that's consistent with the Pareto distribution's $\alpha$. The lower the exponent $\alpha$, the more right-skewed the distribution is, and the more extreme values it has.

The paper *Power laws, Pareto distributions and Zipf's law* gives power-law exponent estimates for many real-world quantities, and the book *The Black Swan* also provides some real-world estimates. Note that such estimates are not accurate, because they are sensitive to rare extreme samples.

Some quantities have an estimated $\alpha < 1$: the intensity of solar flares, the intensity of wars, the frequency of family names. Recall that in the Pareto (Type I) distribution, if $\alpha \leq 1$ the theoretical mean is infinite: the sample mean keeps trending higher as we collect more samples, and the trend won't stop. If the intensity of war does follow a power law and the real $\alpha < 1$, then much larger wars exist in the future. Most of these quantities have an estimated $\alpha < 2$, and in the Pareto (Type I) distribution $\alpha \leq 2$ means infinite theoretical variance. Without a finite variance they do not obey the central limit theorem and should not be modelled with a Gaussian distribution.
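As a quick check of the share formula $p^{(\alpha-1)/\alpha}$ derived earlier, with the 80/20 shape $\alpha = \log 5 / \log 4$ (the chosen proportions are just examples):

```python
import math

alpha = math.log(5) / math.log(4)   # shape implied by the fractal 80/20 rule

def top_share(p, alpha=alpha):
    """Wealth share held by the richest proportion p under Pareto(alpha), alpha > 1."""
    return p ** ((alpha - 1) / alpha)

print(top_share(0.2))       # 0.8  -- the 80/20 rule itself
print(top_share(0.2 ** 2))  # 0.64 -- 80% of 80%: the rule applied within the top 20%
print(top_share(0.01))      # ~0.53 -- the top 1% holds about half
```

The fractal property falls out of the formula: shrinking the group by a factor of $0.2$ always shrinks its share by a factor of $0.8$.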
There are other distributions that can have extreme values; they all have less extreme values than power-law distributions, but more extreme values than the normal and exponential distributions.

If $T$ follows an exponential distribution, then $a^T$ follows a Pareto (Type I) distribution when $a>1$. The density is $f_T(t) = \lambda e^{-\lambda t}$ ($T\geq 0$) and the cumulative distribution function is $F_T(t) = P(T<t) = 1-e^{-\lambda t}$. If $Y=a^T$ with $a>1$, then

$$P(Y > y) = P(a^T > y) = P\left( T > \frac{\log y}{\log a} \right) = e^{-\lambda \frac{\log y}{\log a}} = y^{-\frac{\lambda}{\log a}}$$

Because $T\geq 0$, $Y \geq a^0=1$. Now $Y$'s tail function has the same form as a Pareto (Type I) distribution with $\alpha=\frac{\lambda}{\log a},\ m = 1$.

If the lifetime of something follows a power-law distribution, it has the **Lindy effect**: the longer it has existed, the longer it will likely continue to exist. Suppose the lifetime $T$ follows a Pareto distribution and something is still alive at time $t$; compute the expected lifetime under that condition. (The mean is a weighted average. The conditional mean is also a weighted average under the condition, but since the total integrated weight is no longer 1, we must divide by the total integrated weight.)

$$E[T \mid T > t] = \frac{\int_t^\infty x f(x)\, dx}{\int_t^\infty f(x)\, dx} = \frac{\frac{\alpha m^\alpha}{\alpha-1} t^{1-\alpha}}{m^\alpha t^{-\alpha}} = \frac{\alpha}{\alpha-1} t$$

(For that integration to be finite, $-\alpha+1<0$, i.e. $\alpha>1$.) The expected lifetime is $\frac{\alpha}{\alpha-1} t$ given that it has already lived to time $t$, so the expected remaining lifetime is $\frac{\alpha}{\alpha-1} t-t= \frac{1}{\alpha-1}t$: it increases with $t$. The Lindy effect often doesn't apply to physical things; it often applies to information, like technology, culture, art, and social norms.
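The Lindy computation can be checked by simulation (a sketch assuming NumPy; $\alpha = 3$, $m = 1$ and the ages are arbitrary toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 3.0                                  # toy shape, alpha > 1 so the mean exists
u = rng.uniform(size=2_000_000)
lifetimes = (1 - u) ** (-1 / alpha)          # inverse-CDF sampling of Pareto(alpha, m=1)

for t in (2.0, 4.0):
    alive = lifetimes[lifetimes > t]
    # Pareto prediction: expected remaining life = t / (alpha - 1), growing with age.
    print(t, (alive - t).mean(), t / (alpha - 1))
```

Doubling the age roughly doubles the expected remaining lifetime, which is the Lindy effect in numbers.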
If some numbers span multiple orders of magnitude, Benford's law says that about 30% of them have leading digit 1, about 18% have leading digit 2, and so on; digit $d$'s proportion is $\log_{10} \left(1 + \frac 1 d \right)$.

The Pareto distribution spans many orders of magnitude, so let's compute the distribution of the first digit when a number follows a Pareto distribution. $x$ starts with digit $d$ when $d \cdot 10^k \leq x < (d+1) \cdot 10^k$, $k=0, 1, 2, ...$

The Pareto distribution has a lower bound $m$. If we make $m$ random, analytically computing the probability of each leading digit becomes hard due to edge cases; in that case a Monte Carlo simulation is easier.

How do we randomly sample from a Pareto distribution? We know the cumulative distribution function $F(x) = P(X<x) = 1-P(X>x) = 1- m^\alpha x^{-\alpha}$. From it we get the quantile function, the inverse of $F$ ($F(x)=p,\ Q(p) = x$):

$$Q(p) = m (1-p)^{-\frac 1 \alpha}$$

If we sample $p$ uniformly between 0 and 1, then $Q(p)$ follows the Pareto distribution.

Given $x$, how do we calculate its first digit? If $10\leq x<100$ ($1 \leq \log_{10} x < 2$), the first digit is $\lfloor {\frac x {10}} \rfloor$. If $100 \leq x < 1000$ ($2 \leq \log_{10}x < 3$), the first digit is $\lfloor {\frac x {100}} \rfloor$.
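A Monte Carlo sketch of this experiment (assuming NumPy). To dodge floating-point overflow from extreme samples it works in $\log_{10}$ scale throughout, a point discussed next; $\alpha = 0.05$ and the random $m \in [1, 10)$ follow the setup above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 1_000_000, 0.05            # small alpha spans many orders of magnitude
m = rng.uniform(1, 10, n)             # random minimum between 1 and 10

# Inverse-CDF sampling, done entirely in log10 scale to avoid overflow:
# x = m * (1 - p)^(-1/alpha)  =>  log10 x = log10 m - log10(1 - p) / alpha
log10_x = np.log10(m) - np.log10(1 - rng.uniform(size=n)) / alpha

# First digit from the fractional part of log10 x.
first_digit = np.floor(10 ** (log10_x % 1)).astype(int)
observed = np.bincount(first_digit, minlength=10)[1:] / n

benford = np.log10(1 + 1 / np.arange(1, 10))
print(np.round(observed, 3))   # close to Benford: [0.301 0.176 0.125 ...]
```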
Generalizing, the first digit of $x$ is

$$d = \left\lfloor \frac{x}{10^{\lfloor \log_{10} x \rfloor}} \right\rfloor$$

Because the Pareto distribution has a lot of extreme values, directly computing the samples will likely exceed the floating-point range and give infinities. So we need to work in log scale: compute only with $\log x$ and avoid using $x$ directly. Sampling in log scale:

$$\log_{10} Q(p) = \log_{10} m - \frac{\log_{10}(1-p)}{\alpha}$$

Calculating the first digit in log scale, from the fractional part of $\log_{10} x$:

$$d = \left\lfloor 10^{\log_{10} x - \lfloor \log_{10} x \rfloor} \right\rfloor$$

As $\alpha$ approaches $0$, the result accurately follows Benford's law; the larger $\alpha$, the larger the deviation. If we fix the minimum value $m$ at a specific number, like $3$, then whenever $\alpha$ is not very close to $0$ the result deviates significantly from Benford's law. However, if we make $m$ a random value between 1 and 10, it stays close to Benford's law.

We have a null hypothesis $H_0$, like "the coin is fair", and an alternative hypothesis $H_1$, like "the coin is unfair". We want to test how likely $H_1$ is using data. The **p-value** is the probability of getting a result as extreme as, or more extreme than, the observed data, assuming the null hypothesis $H_0$ is true. If the p-value is small, the alternative hypothesis is likely true.

Say I do ten coin flips and get 9 heads and 1 tail. The p-value is the probability of a result as extreme or more extreme, and "extreme" here is two-sided, so the p-value is $P(\text{9 heads 1 tail}) + P(\text{10 heads 0 tails}) + P(\text{1 head 9 tails}) + P(\text{0 heads 10 tails})$, assuming the coin flip is fair.

Can we swap the null and alternative hypotheses? For two conflicting hypotheses, which one should be the null? The key is the **burden of proof**.
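The coin example's exact two-sided p-value can be computed by summing the binomial probabilities of all outcomes at least as extreme (a minimal sketch):

```python
from math import comb

def two_sided_p(heads, flips):
    """P(result as extreme or more extreme than `heads`) under H0: the coin is fair."""
    observed = abs(heads - flips / 2)
    return sum(
        comb(flips, k) * 0.5**flips          # P(exactly k heads) for a fair coin
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed    # two-sided: both tails count
    )

# 9 heads out of 10: counts k = 0, 1, 9, 10 are all "as extreme or more extreme".
print(two_sided_p(9, 10))   # 22/1024, about 0.0215
```

With a 0.05 threshold this would count as statistically significant.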
The null hypothesis is the default that most people tend to accept and that does not need proving. The alternative hypothesis is the special claim that you must support using the data. The lower the p-value, the higher your confidence that the alternative hypothesis is true; but due to randomness you can never be 100% sure.

If you are running an A/B test, keep collecting data, and draw a conclusion as soon as there is statistical significance (like a p-value below 0.05), that is not statistically sound: a random fluctuation along the way can produce a false positive. A more rigorous approach is to determine the required sample size before the A/B test. And the less data you have, the stricter the hypothesis test should be (a lower p-value threshold). According to the O'Brien-Fleming boundary, the p-value threshold should be about 0.001 when you have 25% of the data, 0.005 at 50%, 0.015 at 75%, and 0.045 at 100%.

If I have some samples and calculate values like the mean, variance, or median, the calculated value is called a **statistic**. Statistics themselves are random. If you are sure that "with 95% probability the real median is between 8.1 and 8.2", then $[8.1,8.2]$ is a confidence interval with 95% confidence level. Confidence intervals measure how uncertain a statistic is.

One way of computing a confidence interval is the **bootstrap**. It doesn't require assuming the statistic is normally distributed, but it does require the samples to be i.i.d. It works by resampling the data to create many replacements of the data, calculating the statistic on each replacement, and then reading the confidence interval off those statistics.
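A minimal sketch of the bootstrap procedure (assuming NumPy; the data and the choice of the median as the statistic are just examples):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Resample with replacement many times, collecting the statistic each time.
stats = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(10_000)
])

# 95% confidence interval: the 2.5% and 97.5% percentiles of the bootstrap statistics.
lo, hi = np.percentile(stats, [2.5, 97.5])
print(lo, hi)
```

With only 5 very spread-out samples the interval comes out wide, which is the honest answer: the median estimate is highly uncertain.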
For example, if the original samples are $[1.0, 2.0, 3.0, 4.0, 5.0]$, resampling means randomly selecting one element from the original data, repeated 5 times, giving things like $[4.0, 2.0, 4.0, 5.0, 2.0]$ or $[3.0, 2.0, 4.0, 4.0, 5.0]$ (they are likely to contain duplicates). Then compute the statistic for each resample. If the confidence level is 95%, the confidence interval's lower bound is the 2.5% percentile of these statistics and the upper bound is the 97.5% percentile.

When we train a model (including deep learning and linear regression), we want it to also work on new data that's not in the training set, but training itself only changes the model parameters to fit the training data. Overfitting means training makes the model "memorize" the training data without discovering the underlying real-world rule that generated it. Reducing overfitting is a hard topic. Ways to reduce overfitting:

- Regularization. Force the model to be "simpler", forcing it to compress the data. Weight sharing is also regularization (a CNN shares weights compared to an MLP). Add inductive bias to limit what the model can express. (The old way of regularizing was simply reducing the parameter count, but in deep learning there is the deep double descent effect, where more parameters can be better.)
- Make the model more expressive. If the model is not expressive enough to capture the real underlying rule that generates the training data, it is simply unable to generalize. One example: an RNN is less expressive than a Transformer due to its fixed-size state.
- Make the training data more comprehensive. Reinforcement learning, done properly, can provide more comprehensive training data than supervised learning, because of the randomness in interacting with the environment.
How do we test how overfit a model is?

Frequentist: probability is an objective thing; we learn a probability from the results of repeating a random event many times under the same conditions. Bayesian: probability is a subjective thing; it means how likely you think something is to happen, based on your initial assumptions and the evidence you see. Probability is relative to the information you have.

A discrete distribution can be:

- a table, giving the probability of each possible outcome;
- a function, whose input is a possible outcome and whose output is its probability;
- a vector (an array), whose i-th element is the probability of the i-th outcome;
- a histogram, where each pillar is a possible outcome and the pillar's height is its probability.

A continuous distribution can be described by a probability density function (PDF) $f$. A continuous distribution has infinitely many outcomes, and the probability of each specific outcome is (usually) zero, so we care about the probability of a range: $P(a<X<b)=\int_a^b f(x)dx$. The integral over the whole range must be 1: $\int_{-\infty}^{\infty}f(x)dx=1$. The value of a PDF can be larger than 1.

A distribution can also be described by its cumulative distribution function (CDF), $F(x) = P(X \leq x)$, which is the integral of the PDF: $F(x) = \int_{-\infty}^x f(t)dt$. It starts from 0, monotonically increases, and approaches 1. The quantile function $Q$ is the inverse of the CDF: $Q(p) = x$ means $F(x)=p$, i.e. $P(X \leq x) = p$. The top-25% cutoff is $Q(0.75)$; the bottom-25% cutoff is $Q(0.25)$.
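A concrete illustration of the CDF/quantile relationship, using the standard exponential distribution as the example:

```python
import math

# Standard exponential distribution (lambda = 1) as a concrete example:
# F(x) = 1 - exp(-x) for x >= 0, so Q(p) = -log(1 - p).
def cdf(x):
    return 1 - math.exp(-x)

def quantile(p):
    return -math.log(1 - p)

# The quantile function inverts the CDF:
assert abs(cdf(quantile(0.75)) - 0.75) < 1e-12

print(quantile(0.25))   # bottom-25% cutoff, ~0.288
print(quantile(0.75))   # top-25% cutoff, ~1.386
```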
**Prior** means the distribution I assume before knowing some new information. If I see new information and improve my understanding of the distribution, the new distribution I assume is the **posterior**.

The means of random variables add up:

$$E[X + Y] = E[X] + E[Y] \quad\quad\quad E\left[\sum_i X_i\right] = \sum_i E[X_i]$$

Multiplying a random variable by a constant $k$ multiplies its mean: $E[kX] = k \cdot E[X]$. A constant's mean is that constant: $E[k] = k$.

The theoretical mean is the weighted average using the theoretical probabilities. The estimated mean (empirical mean, sample mean) is the unweighted average over samples. The theoretical mean is an exact value, determined by the theoretical distribution; the estimated mean is an inexact random variable, because it's calculated from random samples.

**Layer normalization** works on a vector. It treats each element of the vector as a different sample from the same distribution, and replaces each element with its Z-score (using the sample mean and sample stdev). **Batch normalization** works on a batch of vectors. It treats the elements at the same index across the vectors of the batch as different samples from the same distribution, and computes Z-scores the same way.

The input: $\boldsymbol{x} = (x_1,x_2,...,x_n)$. The vector of ones: $\boldsymbol{1} = (1, 1, ..., 1)$.
Computing the sample mean can be seen as scaling by $\frac 1 n$ and taking the dot product with the vector of ones: ${\hat \mu}= \frac 1 n \boldsymbol{x} \cdot \boldsymbol{1}$. Subtracting the sample mean can be seen as subtracting $\hat {\mu} \cdot \boldsymbol{1}$; call the result $\boldsymbol y$:

$$\boldsymbol y = \boldsymbol x - {\hat \mu} \cdot \boldsymbol{1} = \boldsymbol x- \frac 1 n (\boldsymbol{x} \cdot \boldsymbol{1}) \cdot \boldsymbol{1}$$

Recall projection: projecting a vector $\boldsymbol a$ onto $\boldsymbol b$ gives $\left(\frac{\boldsymbol a \cdot \boldsymbol b}{\boldsymbol b \cdot \boldsymbol b}\right) \cdot \boldsymbol b$. Since $\boldsymbol 1 \cdot \boldsymbol 1 = n$, the term $\frac 1 n (\boldsymbol{x} \cdot \boldsymbol{1}) \cdot \boldsymbol{1}$ is the projection of $\boldsymbol x$ onto $\boldsymbol 1$. Subtracting it removes the component of $\boldsymbol x$ in the direction of $\boldsymbol 1$, so $\boldsymbol y$ is orthogonal to $\boldsymbol 1$: it lies in a hyperplane orthogonal to $\boldsymbol 1$.

The standard deviation can be seen as the length of $\boldsymbol y$ divided by $\sqrt{n}$ (or $\sqrt{n-1}$): $\hat\sigma^2 = \frac 1 n \boldsymbol y \cdot \boldsymbol y$, $\hat\sigma = \frac 1 {\sqrt{n}} \vert \boldsymbol y \vert$. Dividing by the standard deviation can be seen as projecting onto the unit sphere and then multiplying by $\sqrt n$ (or $\sqrt{n-1}$). So computing Z-scores can be seen as first projecting onto the hyperplane orthogonal to $\boldsymbol 1$, then projecting onto the unit sphere and multiplying by $\sqrt n$ (or $\sqrt{n-1}$).

The n-th moment: $E[X^n]$.
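The projection picture of Z-scores can be checked numerically (a sketch assuming NumPy; the data vector is arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # arbitrary sample vector
ones = np.ones_like(x)
n = x.size

mu = (x @ ones) / n                      # sample mean as a scaled dot product
y = x - mu * ones                        # remove the projection of x onto the ones vector
sigma = np.linalg.norm(y) / np.sqrt(n)   # stdev = |y| / sqrt(n)
z = y / sigma                            # the vector of Z-scores

print(y @ ones)             # ~0: y is orthogonal to (1, ..., 1)
print(np.linalg.norm(z))    # sqrt(n): z lies on a sphere of radius sqrt(n)
```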
Mean is the first moment. The n-th central moment: $E[(X-\mu)^n]$. Variance is the second central moment. The n-th standardized central moment: $E\left[\left(\frac{X-\mu}{\sigma}\right)^n\right]$. Skewness is the third standardized central moment; kurtosis is the fourth.

We have a random variable $Y$ that's correlated with $X$, and we know the true mean of $Y$: $E[Y]$.

Information entropy measures how uncertain a distribution is, and how much information a sample from that distribution carries on average. For the information of an event:

- If an event always happens, it carries zero information: $I(E) = 0$ if $P(E) = 1$.
- The rarer an event is, the more information (more surprise) it carries: $I(E)$ increases as $P(E)$ decreases.
- The information of two independent events happening together is the sum of their individual information. Using $(X, Y)$ to denote the combination of $X$ and $Y$: $I((X, Y)) = I(X) + I(Y)$ if $P((X, Y)) = P(X) \cdot P(Y)$. This implies the use of a logarithm.

A fair coin toss with two outcomes has 1 bit of information entropy: $0.5 \cdot \log_2 \frac{1}{0.5} + 0.5 \cdot \log_2 \frac{1}{0.5} = 1$ bit. If the coin is biased, for example 90% heads and 10% tails, its entropy is $0.9 \cdot \log_2 \frac{1}{0.9} + 0.1 \cdot \log_2 \frac{1}{0.1} \approx 0.47$ bits.
If it's even more biased, with 99.99% heads and 0.01% tails, its entropy is $0.9999 \cdot \log_2 \frac{1}{0.9999} + 0.0001 \cdot \log_2 \frac{1}{0.0001} \approx 0.0015$ bits. If a coin toss is fair but the coin has a 0.01% chance of standing up on the table (three outcomes with probabilities 0.0001, 0.49995, 0.49995), its entropy is $0.0001 \cdot \log_2 \frac{1}{0.0001} + 0.49995 \cdot \log_2 \frac{1}{0.49995} + 0.49995 \cdot \log_2 \frac{1}{0.49995} \approx 1.0014$ bits. (The standing-up event itself carries about 13.3 bits of information, but its probability is so low that it contributes little to the entropy.)

KL divergence is a "distance" between two distributions: if I "expect" distribution B but the distribution is actually A, how much "surprise" do I get on average? Equivalently, if I design a lossless compression algorithm optimized for B but use it to compress data from A, the compression is no longer optimal and contains redundant information; KL divergence measures how much redundant information there is on average.

The setting: we have two distributions, $A$ (the target distribution) and $B$ (the output of our model); we have $n$ samples from $A$: $x_1, x_2, ... x_n$; and we know the probability of each sample in each distribution, $P_A(x_i)$ and $P_B(x_i)$.

Mutual information $I(X;Y)$ is zero if the joint distribution $(X,Y)$ is the same as $X\otimes Y$, which means $X$ and $Y$ are independent.
Mutual information $I(X;Y)$ is positive if X and Y are not independent. It is never negative, because KL divergence is never negative.

The information bottleneck principle:

- Minimize $I(\text{Input}\,;\,\text{IntermediaryRepresentation})$: try to compress the intermediary representation and remove unnecessary information about the input.
- Maximize $I(\text{IntermediaryRepresentation}\,;\,\text{Output})$: try to keep as much of the information in the intermediary representation that's relevant to the output as possible.

Convolution gives the distribution of a sum of independent random variables. In the continuous case, it takes two probability density functions and gives a new probability density function. In the discrete case, it can take two functions (each mapping an outcome to its probability) and give a new function, or two vectors (whose i-th elements are the probabilities of the i-th outcome) and give a new vector.

Likelihood: how likely it is that we get samples $x_1, x_2, \ldots, x_n$ from the modelled distribution using parameter $\theta$. Read the other way, it measures how likely a parameter $\theta$ is to be the real underlying parameter, given independent samples $x_1, x_2, \ldots, x_n$. For example, with 9 heads and 1 tail observed: if I assume the coin flip is fair, $\theta = 0.5$, the likelihood is $0.5^{10} \approx 0.000977$. If I assume $\theta = 0.9$, the likelihood is $0.9^9 \cdot 0.1 \approx 0.0387$, which is larger. If I assume $\theta = 0.999$, the likelihood is $0.999^9 \cdot 0.001 \approx 0.00099$, which is smaller than when assuming $\theta = 0.9$.
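A quick sketch of that likelihood computation, assuming the observed data was 9 heads and 1 tail (an assumption on my part that is consistent with the quoted numbers):

```python
def likelihood(theta, heads=9, tails=1):
    """Probability of this exact flip sequence given coin bias theta.
    (Data assumed: 9 heads and 1 tail in 10 independent flips.)"""
    return theta ** heads * (1 - theta) ** tails

for theta in (0.5, 0.9, 0.999):
    print(theta, likelihood(theta))
```

Scanning over values of $\theta$ like this and keeping the maximizer is exactly maximum likelihood estimation; here it peaks at $\theta = 0.9$.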
The normal distribution is the maximum-entropy distribution under these constraints:

- It's a valid probability density function: $\int_{-\infty}^{\infty} f(x)\,dx = 1$ and $f(x) \geq 0$.
- The mean: $\int_{-\infty}^{\infty} x f(x)\,dx = \mu$.
- The variance: $\int_{-\infty}^{\infty} f(x)(x-\mu)^2\,dx = \sigma^2$.

Moving the probability density function along the X axis doesn't change entropy, so we can fix the mean at 0 (we can replace $x$ with $x-\mu$ after finishing the deduction). The term $\log\frac{1}{f(x)}$ already implicitly requires $f(x) > 0$. It turns out that the mean constraint $\int_{-\infty}^{\infty} x f(x)\,dx = 0$ is not necessary to deduce the result, so we can leave it out of the Lagrange multipliers. (Including it is also fine but makes things more complex.)

Closure properties of the normal distribution:

- Multiplying by a constant keeps it normal: $X \sim N \rightarrow kX \sim N$.
- Adding a constant keeps it normal: $X \sim N \rightarrow (X+k) \sim N$.
- The sum of two independent normal random variables is normal: $X \sim N, Y \sim N \rightarrow (X+Y) \sim N$.
- A linear combination of many independent normal variables is normal: $X_1 \sim N, X_2 \sim N, \ldots, X_n \sim N \rightarrow (k_1X_1 + k_2X_2 + \ldots + k_nX_n) \sim N$.

Take a (row) vector $\boldsymbol{x} = (x_1, x_2, \ldots, x_n)$ of independent random variables, each following a normal distribution (not necessarily the same one). If we apply an affine transformation to that vector (multiplying by a matrix $A$ and then adding an offset $\boldsymbol{b}$, so $\boldsymbol{y} = A\boldsymbol{x} + \boldsymbol{b}$), then each element of $\boldsymbol{y}$ is a linear combination of normals, $y_i = x_1 A_{i,1} + x_2 A_{i,2} + \ldots + x_n A_{i,n} + b_i$, so each element of $\boldsymbol{y}$ also follows a normal distribution. The vector $\boldsymbol{y}$ follows a multivariate normal distribution.

Properties of covariance:

- It's symmetric: $\text{Cov}[X,Y] = \text{Cov}[Y,X]$.
- If X and Y are independent, $\text{Cov}[X,Y] = 0$.
- Adding a constant: $\text{Cov}[X+k,Y] = \text{Cov}[X,Y]$. Covariance is invariant to translation.
- Multiplying by a constant: $\text{Cov}[k \cdot X, Y] = k \cdot \text{Cov}[X,Y]$.
- Addition: $\text{Cov}[X+Y,Z] = \text{Cov}[X,Z] + \text{Cov}[Y,Z]$.

Notation for the diffusion setup:

- $p_0(x_0)$ is the probability density of the original clean value (for image generation, it corresponds to the probability distribution of the images we want to generate).
- $p_1(x_1)$ is the probability density of the noise-added value.
- $p_{1 \vert 0}(x_1 \vert x_0)$ is the probability density of the noise-added value, given clean training data $x_0$. It's a normal distribution given $x_0$.
It can also be seen as a function that takes two arguments $x_0, x_1$. Likewise, $p_{0 \vert 1}(x_0 \vert x_1)$ is the probability density of the original clean value given the noise-added value; it too can be seen as a function of two arguments $x_0, x_1$.

The exponential distribution models waiting times such as:

- The lifetime of machine components.
- The time until a radioactive atom decays.
- The length of phone calls.
- The time interval between two packets at a router.

Related heavy-tailed distributions:

- Log-normal distribution: if $\log X$ is normally distributed, then $X$ follows a log-normal distribution. Put another way, if $Y$ is normally distributed, then $e^Y$ follows a log-normal distribution.
- Stretched exponential distribution: $P(X>x)$ is roughly proportional to $e^{-kx^\beta}$ with $\beta < 1$.
- Power law with exponential cutoff: $P(X>x)$ is roughly proportional to $x^{-\alpha} e^{-\lambda x}$.

Ways to improve generalization:

- Regularization. Force the model to be "simpler", to compress data. Weight sharing is also regularization (a CNN shares weights compared to an MLP). Add inductive bias to limit what the model can express. (The old way of regularizing was simply to reduce parameter count, but in deep learning there is the deep double descent effect, where more parameters can be better.)
- Make the model more expressive. If the model is not expressive enough to capture the real underlying rule that generates the training data, it's simply unable to generalize. For example, an RNN is less expressive than a Transformer due to its fixed-size state.
- Make the training data more comprehensive. Reinforcement learning, if done properly, can provide more comprehensive training data than supervised learning, because of the randomness in interacting with the environment.

Ways to detect overfitting:

- Separate the data into a training set and a test set.
Only train on the training set, and check model performance on the test set.

- Test sensitivity to random fluctuation. Add randomness to parameters, inputs, hyperparameters, etc., then look at model performance. An overfit model is more vulnerable to random perturbation, because memorization is more "fragile" than the real underlying rule.

Common statistical pitfalls:

- Survivorship bias and selection bias.
- Simpson's paradox and the base rate fallacy.
- Confusing correlation with causality.
- Trying too many different hypotheses. Spurious correlations.
- Collecting data until significance.
- Wrongly removing outliers.
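"Collecting data until significance" can be demonstrated with a short simulation (a sketch; all constants here are arbitrary choices): under a true null effect, repeatedly peeking at a t-test and stopping at the first p < 0.05 inflates the false positive rate well beyond the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_trial(max_n=100, start_n=10, step=10, alpha=0.05):
    """One experiment under the null (true mean 0): test repeatedly as
    data accumulates, and stop as soon as p < alpha."""
    data = rng.normal(0.0, 1.0, max_n)
    for n in range(start_n, max_n + 1, step):
        if stats.ttest_1samp(data[:n], 0.0).pvalue < alpha:
            return True  # declared "significant" despite no real effect
    return False

trials = 2000
false_positives = sum(peeking_trial() for _ in range(trials)) / trials
print(f"False positive rate with peeking: {false_positives:.2%}")
```

With ten looks at the data, the realized false positive rate comes out several times larger than 5%, which is the whole problem with optional stopping.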

DYNOMIGHT 5 months ago

Futarchy’s fundamental flaw

Say you’re Robyn Denholm, chair of Tesla’s board. And say you’re thinking about firing Elon Musk. One way to make up your mind would be to have people bet on Tesla’s stock price six months from now in a market where all bets get cancelled unless Musk is fired. Also, run a second market where bets are cancelled unless Musk stays CEO. If people bet on higher stock prices in Musk-fired world, maybe you should fire him. That’s basically Futarchy: use conditional prediction markets to make decisions. People often argue about fancy aspects of Futarchy. Are stock prices all you care about? Could Musk use his wealth to bias the market? What if Denholm makes different bets in the two markets, and then fires Musk (or not) to make sure she wins? Are human values and beliefs somehow inseparable? My objection is more basic: It doesn’t work. You can’t use conditional prediction markets to make decisions like this, because conditional prediction markets reveal probabilistic relationships, not causal relationships. The whole concept is faulty. There are solutions, ways to force markets to give you causal relationships. But those solutions are painful, and I get the shakes when I see everyone acting like you can use prediction markets to conjure causal relationships from thin air, almost for free. I wrote about this back in 2022, but my argument was kind of sprawling and it seems to have failed to convince approximately everyone. So I thought I’d give it another try, with more aggression. In prediction markets, people trade contracts that pay out if some event happens. There might be a market for “Dynomight comes out against aspartame by 2027” contracts that pay out $1 if that happens and $0 if it doesn’t. People often worry about things like market manipulation, liquidity, or herding. Those worries are fair but boring, so let’s ignore them. If a market settles at $0.04, let’s assume that means the “true probability” of the event is 4%.
(I pause here in recognition of those who need to yell about Borel spaces or von Mises axioms or Dutch book theorems or whatever. Get it all out. I value you.) Right. Conditional prediction markets are the same, except they get cancelled unless some other event happens. For example, the “Dynomight comes out against aspartame by 2027” market might be conditional on “Dynomight de-pseudonymizes”. If you buy a contract for $0.12, then:

- If Dynomight is still pseudonymous at the end of 2027, you’ll get your $0.12 back.
- If Dynomight is non-pseudonymous, then you get $1 if Dynomight came out against aspartame and $0 if not.

Let’s again assume that if a conditional prediction market settles at $0.12, that means the “true” conditional probability is 12%. But hold on. If we assume that conditional prediction markets give flawless conditional probabilities, then what’s left to complain about? Simple. Conditional probabilities are the wrong thing. If P(A|B)=0.9, that means that if you observe B, then there’s a 90% chance of A. That doesn’t mean anything about the chances of A if you do B. In the context of statistics, everyone knows that correlation does not imply causation. That’s a basic law of science. But really, it’s just another way of saying that conditional probabilities are not what you need to make decisions. And that’s true no matter where the conditional probabilities come from. For example, people with high vitamin D levels are only ~56% as likely to die in a given year as people with low vitamin D levels. Does that mean taking vitamin D halves your risk of death? No, because those people are also thinner, richer, less likely to be diabetic, less likely to smoke, more likely to exercise, etc. To make sure we’re seeing the effects of vitamin D itself, we run randomized trials. Those suggest it might reduce the risk of death a little. (I take it.) Futarchy has the same flaw. Even if you think vitamin D does nothing, if there’s a prediction market for whether some random person dies, you should pay much less if the market is conditioned on them having high vitamin D.
But you should do that mostly because they’re more likely to be rich and thin and healthy, not because of vitamin D itself. If you like math, conditional prediction markets give you P(A|B). But P(A|B) doesn’t tell you what will happen if you do B. That’s a completely different number with a different notation, namely P(A|do(B)). Generations of people have studied the relationship between P(A|B) and P(A|do(B)). We should pay attention to them. Say people bet for a lower Tesla stock price when you condition on Musk being fired. Does that mean they think that firing Musk would hurt the stock price? No, because there could be reverse causality: the stock price dropping might cause him to be fired. You can try to fight this using the fact that things in the future can’t cause things in the past. That is, you can condition on Musk being fired next week and bet on the stock price six months from now. That surely helps, but you still face other problems. Here’s another example of how lower prices in Musk-fired world may not indicate that firing Musk hurts the stock price. Suppose:

- You think Musk is a mildly crappy CEO. If he’s fired, he’ll be replaced with someone slightly better, which would slightly increase Tesla’s stock price.
- You’ve heard rumors that Robyn Denholm has recently decided that she hates Musk and wants to dedicate her life to destroying him. Or maybe not, who knows.

If Denholm fired Musk, that would suggest the rumors are true. So she might try to do other things to hurt him, such as trying to destroy Tesla to erase his wealth. So in this situation, Musk being fired leads to lower stock prices even though firing Musk itself would increase the stock price. Or suppose you run prediction markets for the risk of nuclear war, conditional on Trump sending the US military to enforce a no-fly zone over Ukraine (or not).
When betting in these markets, people would surely consider the risk that direct combat between the US and Russian militaries could escalate into nuclear war. That’s good (the considering), but people would also consider that no one really knows exactly what Trump is thinking. If he declared a no-fly zone, that would suggest that he’s feeling feisty and might do other things that could also lead to nuclear war. The markets wouldn’t reflect the causal impact of a no-fly zone alone, because conditional probabilities are not causal. So far nothing has worked. But what if we let the markets determine what action is taken? If we pre-commit that Musk will be fired (or not) based on market prices, you might hope that something nice happens and magically we get causal probabilities. I’m pro-hope, but no such magical nice thing happens. Thought experiment . Imagine there’s a bent coin that you guess has a 40% chance of landing heads. And suppose I offer to sell you a contract. If you buy it, we’ll flip the coin and you get $1 if it’s heads and $0 otherwise. Assume I’m not doing anything tricky like 3D printing weird-looking coins. If you want, assume I haven’t even seen the coin. You’d pay something like $0.40 for that contract, right? (Actually, knowing my readers, I’m pretty sure you’re all gleefully formulating other edge cases. But I’m also sure you see the point that I’m trying to make. If you need to put the $0.40 in escrow and have the coin-flip performed by a Cenobitic monk, that’s fine.) Now imagine a variant of that thought experiment . It’s the same setup, except if you buy the contract, then I’ll have the coin laser-scanned and ask a supercomputer to simulate millions of coin flips. If more than half of those simulated flips are heads, the bet goes ahead. Otherwise, you get your money back. Now you should pay at least $0.50 for the contract, even though you only think there’s a 40% chance the coin will land heads. Why? 
This is a bit subtle, but you should pay more because you don’t know the true bias of the coin. Your mean estimate is 40%. But it could be 20%, or 60%. After the coin is laser-scanned, the bet only activates if there’s at least a 50% chance of heads. So the contract is worth at least $0.50, and strictly more as long as you think it’s possible the coin has a bias above 50%. Suppose b is the true bias of the coin (which the supercomputer will compute). Then your expected return in this game is 𝔼[max(b, 0.50)] = 0.50 + 𝔼[max(b-0.50, 0)], where the expectations reflect your beliefs over the true bias of the coin. Since 𝔼[max(b-0.50, 0)] is never less than zero, the contract is always worth at least $0.50. If you think there’s any chance the bias is above 50%, then the contract is worth strictly more than $0.50. To connect to prediction markets, let’s do one last thought experiment, replacing the supercomputer with a market. If you buy the contract, then I’ll have lots of other people bid on similar contracts for a while. If the price settles above $0.50, your bet goes ahead. Otherwise, you get your money back. You should still bid more than $0.40, even though you only think there’s a 40% chance the coin will land heads. Because the market acts like a (worse) laser-scanner plus supercomputer. Assuming prediction markets are good, the market is smarter than you, so it’s more likely to activate if the true bias of the coin is 60% rather than 20%. This changes your incentives, so you won’t bet your true beliefs. I hope you now agree that conditional prediction markets are non-causal, and choosing actions based on the market doesn’t magically make that problem go away. But you still might have hope! Maybe the order is still preserved? Maybe you’ll at least always pay more for coins that have a higher probability of coming up heads? Maybe if you run a market with a bunch of coins, the best one will always earn the highest price? Maybe it all works out?
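The laser-scanner argument can be checked with a quick Monte Carlo sketch. Everything here is an assumption for illustration: I give the bettor a Beta(4, 6) belief over the coin’s bias, which has mean 0.40 but puts real probability on biases above 0.50.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed belief about the coin's true bias b: Beta(4, 6), mean 0.40.
b = rng.beta(4, 6, size=1_000_000)

# A plain bet on heads is worth your mean belief.
mean_belief = b.mean()                    # ~0.40

# With the scanner, the bet is refunded whenever b < 0.5, so paying
# $0.50 returns E[max(b, 0.5)] on average -- at least $0.50.
value_at_50c = np.maximum(b, 0.5).mean()

# The break-even price solves x = E[b | bet activates], which also
# exceeds $0.50 whenever any mass sits above the threshold.
fair_price = b[b >= 0.5].mean()

print(mean_belief, value_at_50c, fair_price)
```

Both scanner-conditioned numbers come out above $0.50 even though the bettor’s mean belief about the coin is only 40%, which is exactly the incentive distortion described above.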
Suppose there’s a conditional prediction market for two coins. After a week of bidding, the markets will close; whichever coin had contracts trading for more money will be flipped, and $1 paid to contract-holders for heads. The other market is cancelled. Suppose you’re sure that coin A has a bias of 60%. If you flip it lots of times, 60% of the flips will be heads. But you’re convinced coin B is a trick coin. You think there’s a 59% chance it always lands heads, and a 41% chance it always lands tails. You’re just not sure which. We want you to pay more for a contract for coin A, since that’s the coin you think is more likely to be heads (60% vs 59%). But if you like money, you’ll pay more for a contract on coin B. You’ll do that because other people might figure out if it’s an always-heads coin or an always-tails coin. If it’s always heads, great, they’ll bid up the market, it will activate, and you’ll make money. If it’s always tails, they’ll bid down the market, and you’ll get your money back. You’ll pay more for coin B contracts, even though you think coin A is better in expectation. Order is not preserved. Things do not work out. Naive conditional prediction markets aren’t causal. Using time doesn’t solve the problem. Having the market choose actions doesn’t solve the problem. But maybe there’s still hope? Maybe it’s possible to solve the problem by screwing around with the payouts? Theorem. Nope. You can’t solve the problem by screwing around with the payouts. There does not exist a payout function that will make you always bid your true beliefs. Suppose you run a market where if you pay x and the final market price is y and z happens, then you get a payout of f(x,y,z) dollars. The payout function can be anything, subject only to the constraint that if the final market price is below some constant c, then bets are cancelled, i.e. f(x,y,z)=x for y < c. Now, take any two distributions ℙ₁ and ℙ₂.
Assume that:

- ℙ₁[Y<c] = ℙ₂[Y<c] > 0 (and hence ℙ₁[Y≥c] = ℙ₂[Y≥c])
- ℙ₁[(Y,Z) | Y≥c] = ℙ₂[(Y,Z) | Y≥c] (h/t Baram Sosis), so in particular 𝔼₁[Z | Y≥c] = 𝔼₂[Z | Y≥c]
- 𝔼₁[Z | Y<c] ≠ 𝔼₂[Z | Y<c]

Then the expected return under ℙ₁ and ℙ₂ is the same. That is,

𝔼₁[f(x,Y,Z)] = x ℙ₁[Y<c] + ℙ₁[Y≥c] 𝔼₁[f(x,Y,Z) | Y≥c]
            = x ℙ₂[Y<c] + ℙ₂[Y≥c] 𝔼₂[f(x,Y,Z) | Y≥c]
            = 𝔼₂[f(x,Y,Z)].

Thus, you would be willing to pay the same amount for a contract under both distributions. Meanwhile, the difference in expected values is

𝔼₁[Z] - 𝔼₂[Z] = ℙ₁[Y<c] 𝔼₁[Z | Y<c] - ℙ₂[Y<c] 𝔼₂[Z | Y<c]
              + ℙ₁[Y≥c] 𝔼₁[Z | Y≥c] - ℙ₂[Y≥c] 𝔼₂[Z | Y≥c]
            = ℙ₁[Y<c] (𝔼₁[Z | Y<c] - 𝔼₂[Z | Y<c])
            ≠ 0.

The last line uses our assumptions that ℙ₁[Y<c] > 0 and 𝔼₁[Z | Y<c] ≠ 𝔼₂[Z | Y<c]. Thus, we have simultaneously that

𝔼₁[f(x,Y,Z)] = 𝔼₂[f(x,Y,Z)], but 𝔼₁[Z] ≠ 𝔼₂[Z].

This means that you should pay the same amount for a contract whether you believe ℙ₁ or ℙ₂, even though these entail different beliefs about how likely Z is to happen. Since we haven’t assumed anything about the payout function f(x,y,z), this means that no working payout function can exist. This is bad. Just because conditional prediction markets are non-causal does not mean they are worthless. On the contrary, I think we should do more of them! But they should be treated like observational statistics: just one piece of information to consider skeptically when you make decisions. Also, while I think these issues are neglected, they’re not completely unrecognized. For example, in 2013, Robin Hanson pointed out that confounding variables can be a problem: Also, advisory decision market prices can be seriously distorted when decision makers might know things that market speculators do not. In such cases, the fact that a certain decision is made can indicate hidden info held by decision makers. Market estimates of outcomes conditional on a decision then become estimates of outcomes given this hidden info, instead of estimates of the effect of the decision on outcomes. This post from Anders_H in 2015 is the first I’m aware of that points out the problem in full generality. Finally, the flaw can be fixed.
In statistics, there’s a whole category of techniques to get causal estimates out of data. Many of these methods have analogies as alternative prediction market designs. I’ll talk about those next time. But here’s a preview: None are free.
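As a sanity check on the theorem, here is a toy numeric example (the two distributions, the price levels, and the payout are all invented for illustration): Y takes one of two market prices, Z is binary, the activation threshold is c = 0.5, and both distributions agree on everything above the threshold while disagreeing about Z when the market is cancelled.

```python
c = 0.5
# Each distribution lists P(y, z): y is the final market price
# (0.2 = below c, 0.8 = above c), z is whether the event happens.
# P1 and P2 agree on P(Y >= c) and on the joint of (Y, Z) given Y >= c,
# but disagree about Z in the cancelled (Y < c) branch.
P1 = {(0.2, 0): 0.32, (0.2, 1): 0.08, (0.8, 0): 0.24, (0.8, 1): 0.36}
P2 = {(0.2, 0): 0.08, (0.2, 1): 0.32, (0.8, 0): 0.24, (0.8, 1): 0.36}

def expected_payout(P, f, x):
    """Expected return: refund x if y < c, else the payout f(x, y, z)."""
    return sum(p * (x if y < c else f(x, y, z)) for (y, z), p in P.items())

def E_Z(P):
    return sum(p * z for (y, z), p in P.items())

x = 0.30
f = lambda x, y, z: float(z)  # one example payout: $1 if z happens

print(abs(expected_payout(P1, f, x) - expected_payout(P2, f, x)) < 1e-12)
print(E_Z(P1), E_Z(P2))  # the two beliefs about Z really do differ
```

The expected payouts coincide (and the theorem says they must for any payout that refunds below the threshold), yet 𝔼₁[Z] = 0.44 while 𝔼₂[Z] = 0.68, so the market price cannot distinguish the two beliefs.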

DYNOMIGHT 5 months ago

Optimizing tea: An N=4 experiment

Tea is a little-known beverage, consumed for flavor or sometimes for conjectured effects as a stimulant. It’s made by submerging the leaves of C. sinensis in hot water. But how hot should the water be? To resolve this, I brewed the same tea at four different temperatures, brought them all to a uniform serving temperature, and then had four subjects rate them along four dimensions. Subject A is an experienced tea drinker, exclusively of black tea w/ lots of milk and sugar. Subject B is also an experienced tea drinker, mostly of black tea w/ lots of milk and sugar. In recent years, Subject B has been pressured by Subject D to try other teas. Subject B likes fancy black tea and claims to like fancy oolong, but will not drink green tea. Subject C is similar to Subject A. Subject D likes all kinds of tea, derives a large fraction of their joy in life from tea, and is the world’s preeminent existential angst + science blogger. For a tea that was as “normal” as possible, I used pyramidal bags of PG Tips tea (Lipton Teas and Infusions, Trafford Park Rd., Trafford Park, Stretford, Manchester M17 1NH, UK). I brewed it according to the instructions on the box, by submerging one bag in 250 ml of water for 2.5 minutes. I did four brews with water at temperatures ranging from 79°C to 100°C (174.2°F to 212°F). To keep the temperature roughly constant while brewing, I did it in a Pyrex measuring cup (Corning Inc., 1 Riverfront Plaza, Corning, New York, 14831, USA) sitting in a pan of hot water on the stove. After brewing, I poured the tea into four identical mugs with the brew temperature written on the bottom with a Sharpie Pro marker (Newell Brands, 5 Concourse Pkwy, Atlanta, GA 30328, USA). Readers interested in replicating this experiment may note that those written temperatures still persist on the mugs today, three months later. The cups were dark red, making it impossible to see any difference in the teas.
After brewing, I put all the mugs in a pan of hot water until they converged to 80°C, so they were served at the same temperature. I shuffled the mugs and placed them on a table in a random order. I then asked the subjects to taste from each mug and rate the teas for aroma, flavor, strength, and overall goodness. Each rating was to be on a 1-5 scale, with 1=bad and 5=good. Subjects A, B, and C had no knowledge of how the different teas were brewed. Subject D was aware, but was blinded as to which tea was in which mug. During taste evaluation, Subjects A and C remorselessly pestered Subject D with questions about how a tea strength can be “good” or “bad”. Subject D rejected these questions on the grounds that “good” cannot be meaningfully reduced to other words and urged Subjects A and C to review Wittgenstein’s concept of meaning as use, etc. Subject B questioned the value of these discussions. After ratings were complete, I poured tea out of all the cups until 100 ml remained in each, added around 1 gram (1/4 tsp) of sugar, and heated them back up to 80°C. I then re-shuffled the cups and presented them for a second round of ratings. For a single summary, I somewhat arbitrarily combined the four ratings into a “quality” score, defined as (Quality) = 0.1 × (Aroma) + 0.3 × (Flavor) + 0.1 × (Strength) + 0.5 × (Goodness). Here is the data for Subject A, along with a linear fit for quality as a function of brewing temperature. Broadly speaking, A liked everything, but showed weak evidence of any trend. And here is the same for Subject B, who apparently hated everything. Here is the same for Subject C, who liked everything, but showed very weak evidence of any trend. And here is the same for Subject D. This shows extremely strong evidence of a negative trend. But, again, while blinded to the order, this subject was aware of the brewing protocol. Finally, here are the results combining data from all subjects. This shows a mild trend, driven mostly by Subject D.
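The quality score is a straight weighted sum, and the per-subject analysis is a linear fit of quality against brew temperature. As a sketch (the ratings and the two intermediate temperatures below are made up for illustration; they are not the experiment’s data):

```python
import numpy as np

WEIGHTS = {"aroma": 0.1, "flavor": 0.3, "strength": 0.1, "goodness": 0.5}

def quality(ratings):
    """Combine the four 1-5 ratings into the summary quality score."""
    return sum(WEIGHTS[k] * v for k, v in ratings.items())

# Hypothetical ratings for four brews; only 79 and 100 are real brew temps.
temps = np.array([79.0, 86.0, 93.0, 100.0])
scores = np.array([
    quality({"aroma": 4, "flavor": 4, "strength": 3, "goodness": 4}),
    quality({"aroma": 4, "flavor": 3, "strength": 3, "goodness": 4}),
    quality({"aroma": 3, "flavor": 3, "strength": 4, "goodness": 3}),
    quality({"aroma": 3, "flavor": 2, "strength": 4, "goodness": 3}),
])

# Linear fit of quality vs. brew temperature, as in the per-subject plots.
slope, intercept = np.polyfit(temps, scores, 1)
print(f"slope: {slope:.3f} quality points per °C")
```

With these made-up ratings the slope comes out negative, i.e. a Subject-D-style preference for cooler brews.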
This experiment provides very weak evidence that you might be brewing your tea too hot. Mostly, it just proves that Subject D thinks lower-middle tier black tea tastes better when brewed cooler. I already knew that. There are a lot of other dimensions to explore, such as the type of tea, the brew time, the amount of tea, and the serving temperature. I think that ideally, I’d randomize all those dimensions, gather a large sample, and then fit some kind of regression. Creating dozens of different brews and then serving them all blinded at different serving temperatures sounds like way too much work. Maybe there’s an easier way to go about this? Can someone build me a robot? If you thirst to see Subject C’s raw aroma scores or whatever, you can download the data or click on one of the entries in this table:

Subject | Aroma | Flavor | Strength | Goodness | Quality
A       | x     | x      | x        | x        | x
B       | x     | x      | x        | x        | x
C       | x     | x      | x        | x        | x
D       | x     | x      | x        | x        | x
All     | x     | x      | x        | x        | x

Subject D was really good at this; why can’t everyone be like Subject D?

Weakty 6 months ago

Countin' Bikes

Today I took part in something called the Pedal Poll, which is a countrywide initiative to count how many people are biking, walking, driving, or using a motorized vehicle at a specific time and place. I counted 993 cyclists in the span of 2 hours. I think I would have gotten that other 7 to get over 1000 if I hadn't accidentally closed the app and had to restart it.

DYNOMIGHT 7 months ago

My more-hardcore theanine self-experiment

Theanine is an amino acid that occurs naturally in tea. Many people take it as a supplement for stress or anxiety. It’s mechanistically plausible, but the scientific literature hasn’t been able to find much of a benefit. So I ran a 16-month blinded self-experiment in the hopes of showing it worked. It did not work . At the end of the post, I put out a challenge: If you think theanine works, prove it. Run a blinded self-experiment. After all, if it works, then what are you afraid of? Well, it turns out that Luis Costigan had already run a self-experiment . Here was his protocol: He repeated this for 20 days. His mean anxiety after theanine was 4.2 and after placebo it was 5.0. A simple Bayesian analysis said there was an 82.6% chance theanine reduced anxiety. A sample size of 20 just doesn’t have enough statistical power to have a good chance of finding a statistically significant result. If you assume the mean under placebo is 5.0, the mean under theanine is 4.2, and the standard deviation is 2.0, then you’d only have a 22.6% chance of getting a result with p<0.05. I think this experiment was good, both the experiment and the analysis. It doesn’t prove theanine works, but it was enough to make me wonder: Maybe theanine does work, but I somehow failed to bring out the effect? What would give theanine the best possible chance of working? Theanine is widely reported to help with anxiety from caffeine. While I didn’t explicitly take caffeine as part of my previous experiment, I drink tea almost every day, so I figured that if theanine helps, it should have shown up. But most people (and Luis) take theanine with coffee , not tea. I find that coffee makes me much more nervous than tea. For this reason, I sort of hate coffee and rarely drink it. Maybe the tiny amounts of natural theanine in tea masked the effects of the supplements? Or maybe you need to take theanine and caffeine at the same time? 
Or maybe for some strange reason theanine works for coffee (or coffee-tier anxiety) but not tea? So fine. To hell with my mental health. I decided to take theanine (or placebo) together with coffee on an empty stomach first thing in the day. And I decided to double the dose of theanine from 200 mg to 400 mg. Coffee. I used one of those pod machines, which are incredibly uncool but presumably deliver a consistent amount of caffeine. Measurements. Each day I recorded my stress levels on a subjective 1-5 scale before I took the capsules. An hour later, I recorded my end stress levels, and my percentage prediction that what I took was actually theanine. Blinding. I have capsules that contain either 200 mg of theanine or 25 mcg of vitamin D. These are exactly the same size. I struggled for a while to see how to take two pills of the same type while being blind to the results. In the end, I put two pills of each type in identical-looking cups and shuffled the cups. Then I shut my eyes, took a sip of coffee (to make sure I couldn’t taste any difference), swallowed the pills from one cup, and put the others into a numbered envelope. Here’s a picture of the envelopes, to prove I actually did this and/or invite sympathy for all the coffee I had to endure: After 37 days I ran out of capsules. I’m going to try something new. As I write these words, I have not yet opened the envelopes, so I don’t know the results. I’m going to register some thoughts. My main thought is: I have no idea what the results will show. It really felt like on some days I got the normal spike of anxiety I expect from coffee and on other days it was almost completely gone. But in my previous experiment I often felt the same thing and was proven wrong. It wouldn’t surprise me if the results show a strong effect, or if it’s all completely random. I’ll also pre-register (sort of) the statistical analyses I intend to do: Please hold while I open all the envelopes and do the analyses. Here’s a painting.
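The roughly 23% power figure quoted earlier can be reproduced by simulation. This is a sketch under the same assumptions as that calculation: normally distributed anxiety scores with means 5.0 (placebo) and 4.2 (theanine), standard deviation 2.0, and n = 20 per condition.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_experiment(n=20, mu_placebo=5.0, mu_theanine=4.2, sd=2.0):
    """Simulate one study and return the two-sample t-test p-value."""
    placebo = rng.normal(mu_placebo, sd, n)
    theanine = rng.normal(mu_theanine, sd, n)
    return stats.ttest_ind(theanine, placebo).pvalue

sims = 5000
power = np.mean([one_experiment() < 0.05 for _ in range(sims)])
print(f"estimated power: {power:.1%}")
```

The estimate lands near the quoted ~23%: with an effect this small and samples this noisy, the study usually fails to reach p < 0.05 even when the effect is real.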
Here are the raw stress levels. Each line shows one trial, with the start marked with a small horizontal bar. Remember, this measures the combined effect of coffee and the supplement. So even though stress tends to go up, this would still show a benefit if it went up less with theanine. Here is the difference in stress levels. If Δ stress is negative, that means stress went down. Here are the start vs. end stress levels, ignoring time. The dotted line shows equal stress levels, so anything below that line means stress went down. And finally, here are my percentage predictions of whether what I had taken was actually theanine:

So… nothing jumps out so far. So I did the analysis in my pre-registered plan above. In the process, I realized I wanted to show some extra stuff. It’s all simple and, I think, unobjectionable. But if you’re the kind of paranoid person who only trusts pre-registered things, I love and respect you, and I will mark those with “✔️”.

The first thing we’ll look at is the final stress levels, one hour after taking theanine or vitamin D. First up, regular-old frequentist statistics. If the difference is less than zero, that would suggest theanine was better. It looks like there might be a small difference, but it’s nowhere near statistically significant.

Next up, Bayes! In this analysis, there are latent variables for the mean and standard deviation of end stress (after one hour) under theanine, and likewise under vitamin D. Following Luis’s analysis, these each have a Gaussian prior with a mean and standard deviation based on the overall mean in the data. The results are extremely similar to the frequentist analysis. This says there’s an 80% chance theanine is better.

Next up, let’s look at the difference in stress levels, defined as Δ = (end - start). Since this measures an increase in stress, we’d like it to be as small as possible. So again, if the difference is negative, that would suggest theanine is better. Here are the good-old frequentist statistics.
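To make the “latent variables with Gaussian priors” recipe concrete, here is a minimal sketch of that style of analysis. Everything in it is hypothetical: the numbers are placeholder data, not the real measurements, and it simplifies the model by treating the observation noise as known, so the posterior over each group mean is available in closed form (a conjugate normal-normal update) rather than fitting the standard deviation as a latent variable.

```python
import random

random.seed(0)

# Hypothetical end-stress measurements (placeholder values, NOT the real data).
theanine = [2.0, 3.0, 2.5, 1.5, 3.5, 2.0, 2.5, 3.0]
vitamin_d = [3.0, 2.5, 3.5, 2.0, 4.0, 3.0, 2.5, 3.5]

def posterior_params(data, prior_mean, prior_sd, noise_sd):
    """Conjugate normal-normal update for a group mean,
    treating the observation noise level as known."""
    n = len(data)
    prior_prec = 1.0 / prior_sd**2
    like_prec = n / noise_sd**2
    post_var = 1.0 / (prior_prec + like_prec)
    post_mean = post_var * (prior_prec * prior_mean
                            + like_prec * (sum(data) / n))
    return post_mean, post_var**0.5

# Weak prior centered on the overall mean, mirroring the analyses above.
pooled = theanine + vitamin_d
overall = sum(pooled) / len(pooled)
m_t, s_t = posterior_params(theanine, overall, prior_sd=2.0, noise_sd=1.0)
m_d, s_d = posterior_params(vitamin_d, overall, prior_sd=2.0, noise_sd=1.0)

# P(theanine mean < vitamin D mean), by sampling both posteriors.
draws = 100_000
wins = sum(random.gauss(m_t, s_t) < random.gauss(m_d, s_d)
           for _ in range(draws))
print(f"P(theanine better) = {wins / draws:.2f}")
```

With real data you would also put a prior on the noise level and sample the joint posterior (e.g. with a probabilistic programming library), but the comparison step, the probability that one latent mean is below the other, works the same way.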
And here’s the Bayesian analysis. It’s just like the first one, except we have latent variables for the difference in stress levels (end - start). If the difference of that difference was less than zero, that would again suggest theanine was better.

In retrospect, this percentage prediction analysis is crazy, and I suggest you ignore it. The issue is that even though Δ stress is usually positive (coffee bad), it’s near zero and can be negative. Computing (T-D)/D when D can be negative is stupid, and I think it makes the whole calculation meaningless. I regret pre-registering this. The absolute difference is fine. It’s very close (almost suspiciously close) to zero.

Finally, let’s look at my percentage prediction that what I took was theanine. It really felt like I could detect a difference. But could I? Here we’d hope that I’d give a higher prediction that I’d taken theanine when I’d actually taken theanine. So a positive difference would suggest theanine is better, or at least different. And here’s the corresponding Bayesian analysis. This is just like the first two, except with latent variables for my percentage prediction under theanine and under vitamin D. Taking a percentage difference of a quantity that is itself a percentage difference is really weird, but fine.

This is the most annoying possible outcome. A clear effect would have made me happy. Clear evidence of no effect would also have made me happy. Instead, some analyses say there might be a small effect, and others suggest nothing. Ugh. But I’ll say this: If there is any effect, it’s small. I know many people say theanine is life-changing, and I know why: It’s insanely easy to fool yourself. Even after running a previous 16-month trial and finding no effect, I still often felt like I could feel the effects in this experiment. I still thought I might open up all the envelopes and find that I had been under-confident in my guesses. Instead, I barely did better than chance. So I maintain my previous rule.
If you claim that theanine has huge effects for you, blind experiment or GTFO.

Edit: Data here.

Luis’s protocol, mentioned above: Each morning, take 200 mg theanine or placebo (blinded) along with a small iced coffee. Wait 90 minutes. Record anxiety on a subjective scale of 0-10.

The pre-registered analyses, mentioned above:
- I’ll plot the data.
- I’ll repeat Luis’s Bayesian analysis, which looks at end stress levels only.
- I’ll repeat that again, but looking at the change in stress levels.
- I’ll repeat that again, but looking at my percentage prediction that what I actually took was theanine vs. placebo.
- I’ll compute regular-old confidence intervals and p-values for end stress, change in stress, and my percentage prediction that what I actually took was theanine vs. placebo.

Intermission

Gregory Gundersen 8 months ago

De Moivre–Laplace Theorem

As I understand it, the de Moivre–Laplace theorem is the earliest version of the central limit theorem (CLT). In his book The Doctrine of Chances (De Moivre, 1738), Abraham de Moivre proved that the probability mass function of the binomial distribution asymptotically approximates the probability density function of a particular normal distribution as its parameter $n$ grows arbitrarily large. Today, we know that the CLT generalizes this result, and we might say this is a special case of the CLT for the binomial distribution. To introduce notation, we say that $X_n$ is a binomial random variable with parameters $n$ and $p$ if

$$\mathbb{P}(X_n = k) = \binom{n}{k} p^k q^{n-k}, \qquad p \in [0, 1], \;\; q := 1 - p, \;\; n \in \mathbb{N}. \tag{1}$$

Typically, we view $X_n$ as the sum of $n$ Bernoulli random variables, each with parameter $p$. Intuitively, if we flip $n$ coins each with bias $p$, Equation 1 gives the probability of $k$ successes. This is clearly related to the CLT, which loosely states that the properly normalized sum of random variables asymptotically approaches the normal distribution. If we let $Y_i$ denote these Bernoulli random variables, we can express this idea as

$$\overbrace{Y_1 + Y_2 + \dots + Y_n}^{X_n} \;\simeq\; \mathcal{N}(np, npq), \tag{2}$$

where $\simeq$ denotes asymptotic equivalence as $n \rightarrow \infty$.
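As a quick numerical illustration of the theorem (my sketch, not from the post): compare the exact binomial PMF with the normal density from Equation 2 at a point near the mean, working in log space to avoid overflowing the factorials.

```python
import math

def binom_log_pmf(n, k, p):
    """Exact log of the binomial PMF, via log-gamma (avoids huge integers)."""
    q = 1.0 - p
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(q))

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

n, p = 1000, 0.3
k = 310  # a point near the mean np = 300
exact = math.exp(binom_log_pmf(n, k, p))
approx = normal_pdf(k, n * p, n * p * (1 - p))
print(exact, approx)  # the two agree to within about 1% at this n
```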
This is probably the most intuitive form of the CLT, because if we simply plot the probability mass function (PMF) of the binomial distribution for increasing values of $n$, we get a discrete distribution that looks a lot like the normal distribution even for relatively small $n$ (Figure 1). In contrast, the CLT feels much less obvious if I claim (correctly) that the properly normalized sum of skew normal random variables is also normally distributed! A modern version of de Moivre’s proof is tedious, but it’s not actually that hard to follow. This post is simply my notes on that proof.

To start, let’s rewrite the binomial coefficient without the factorial using Stirling’s approximation:

$$n! \simeq \sqrt{2\pi n} \left(\frac{n}{e}\right)^n. \tag{3}$$

As a historical aside, note that while Stirling is credited with this approximation, it was actually de Moivre who discovered an early version of it while working on these ideas. So de Moivre has been robbed twice: once for this approximation, and once for the normal distribution sometimes being called the “Gaussian” rather than the “de Moivrian”. Anyway, using Stirling’s approximation, we can rewrite the binomial coefficient as

$$\begin{aligned} \binom{n}{k} &\simeq \sqrt{\frac{2\pi n}{(2\pi k)(2\pi (n-k))}} \left(\frac{n}{e}\right)^n \left(\frac{k}{e}\right)^{-k} \left(\frac{n-k}{e}\right)^{k-n} \\ &= \sqrt{\frac{n}{2\pi k (n-k)}} \left(\frac{n^n}{k^k (n-k)^{n-k}}\right). \end{aligned} \tag{4}$$

If we multiply this term by the “raw probabilities” $p^k q^{n-k}$ and group the terms raised to the powers $k$ and $n-k$, we get:

$$\begin{aligned} \binom{n}{k} p^k q^{n-k} &\simeq \sqrt{\frac{n}{2\pi k (n-k)}} \left(\frac{n^n}{k^k (n-k)^{n-k}}\right) p^k q^{n-k} \\ &= \sqrt{\frac{n}{2\pi k (n-k)}} \left(\frac{np}{k}\right)^k \left(\frac{nq}{n-k}\right)^{n-k}. \end{aligned} \tag{5}$$

My understanding of the motivation for the next two steps is that we want to “push” $n$ into the denominator, which is often nice in asymptotics because it makes terms vanish as $n$ gets larger. Let’s tackle the normalizing term (the square root) and the probabilities separately.

First, the square root. Note that by the law of large numbers, as $n$ gets very large, $k/n$ arbitrarily approaches the true probability of success $p$. So let’s rewrite the square root in terms of $k/n$ and then write $k/n$ in terms of $p$:

$$\sqrt{\frac{n}{2\pi k (n-k)}} = \sqrt{\frac{1}{2\pi \frac{k}{n} n \left(1 - \frac{k}{n}\right)}} \simeq \frac{1}{\sqrt{2\pi npq}}. \tag{6}$$

If you were already familiar with the normal distribution, this term should look suspiciously like the normalizing constant!

Second, the probabilities. The next step is a fairly standard trick, which is to convert a product into a sum by taking the exp-log of the product.
Looking only at the terms raised to $k$ and $n-k$ in Equation 5, we get:

$$\begin{aligned} \left(\frac{np}{k}\right)^k \left(\frac{nq}{n-k}\right)^{n-k} &= \exp\left\{ \log \left(\frac{np}{k}\right)^k + \log \left(\frac{nq}{n-k}\right)^{n-k} \right\} \\ &= \exp\left\{ -k \log \left(\frac{k}{np}\right) + (k-n) \log \left(\frac{n-k}{nq}\right) \right\}. \end{aligned} \tag{7}$$

The next trick is to express $k$ in terms of a standardized binomial random variable $z$. Notice that $X_n$ is the sum of $n$ independent Bernoulli random variables. By the linearity of expectation and the linearity of variance under independence, we have:

$$\begin{aligned} \mathbb{E}[X_n] &= \sum_{i=1}^n \mathbb{E}[Y_i] = np, \\ \mathbb{V}[X_n] &= \sum_{i=1}^n \mathbb{V}[Y_i] = npq. \end{aligned} \tag{8}$$

Since the mean of $X_n$ is $np$ and its variance is $npq$, a standardized binomial random variable is

$$z := \frac{k - \mathbb{E}[k]}{\sqrt{\mathbb{V}[k]}} = \frac{k - np}{\sqrt{npq}}, \tag{9}$$

and we can solve this for $k$ as

$$k = np + z \sqrt{npq}. \tag{10}$$
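As an aside, Stirling's approximation (Equation 3) is easy to check numerically. Note that it is a relative approximation: the right thing to examine is the ratio $n! / \big(\sqrt{2\pi n}(n/e)^n\big)$, which tends to 1 even though the absolute error grows. A small sketch of my own:

```python
import math

def stirling_ratio(n):
    """n! divided by Stirling's approximation, computed in log space
    so that large n doesn't overflow."""
    log_fact = math.lgamma(n + 1)
    log_stirling = 0.5 * math.log(2 * math.pi * n) + n * (math.log(n) - 1)
    return math.exp(log_fact - log_stirling)

for n in (1, 10, 100, 1000):
    print(n, stirling_ratio(n))
# The ratio approaches 1 from above, roughly like 1 + 1/(12n).
```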
Putting this definition of $k$ into the formula above (the point here is to express $k$ in terms of $n$, which is the term we want to pay attention to as it increases), we get:

$$\begin{aligned} &\exp\left\{ -k \log \left(\frac{k}{np}\right) + (k-n) \log \left(\frac{n-k}{nq}\right) \right\} \\ &= \exp\left\{ -k \log \left(\frac{np + z\sqrt{npq}}{np}\right) + (k-n) \log \left(\frac{n - np - z\sqrt{npq}}{nq}\right) \right\} \\ &= \exp\left\{ -k \log \left(1 + z\sqrt{\frac{q}{np}}\right) + (k-n) \log \left(1 - z\sqrt{\frac{p}{nq}}\right) \right\}. \end{aligned} \tag{11}$$

In my mind, the final step is the least obvious, but it’s lovely when you see it. Recall that the Maclaurin series of $\log(1+x)$ is

$$\log(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \dots \tag{12}$$

This is a fairly standard result, and it’s worth just writing out yourself if you’ve never done it. Anyway, we can plug these two definitions of $x$,

$$x := z\sqrt{\frac{q}{np}}, \qquad x := -z\sqrt{\frac{p}{nq}}, \tag{13}$$

into Equation 12 above, and use that to expand the logs in Equation 11 into infinite sums. Why are we doing this? The key idea is that nearly every term in each sum will be a fraction with $n$ in the denominator.
So as $n$ grows larger, these terms become arbitrarily small. In the limit, they vanish. All that will be left is the normal distribution’s kernel, $\exp\{-\frac{1}{2} z^2\}$. Let’s do this. First, let’s just look at one of the log terms. We can write the left one as:

$$\begin{aligned} &-k \log \left(1 + z\sqrt{\frac{q}{np}}\right) \\ &= -(np + z\sqrt{npq}) \left[ z \left(\frac{q}{np}\right)^{1/2} - \frac{1}{2} z^2 \frac{q}{np} + \frac{1}{3} z^3 \left(\frac{q}{np}\right)^{3/2} - \dots \right]. \end{aligned} \tag{14}$$

The key thing to see is that for most terms in the sum, even after we multiply by $np$ or $z\sqrt{npq}$, we still have $n$ in the denominator. And these terms vanish, since for any constant $c$, the ratio $c/n$ goes to zero as $n \rightarrow \infty$. So multiplying out the terms in Equation 14 and dropping the vanishing ones, we get

$$\begin{aligned} &\left[ -z\sqrt{npq} + \frac{1}{2} z^2 q - \frac{1}{3} z^3 \frac{q^{3/2}}{\sqrt{np}} + \dots \right] + \left[ -z^2 q + \frac{1}{2} z^3 \frac{q^{3/2}}{\sqrt{np}} - \frac{1}{3} z^4 \frac{q^2}{np} + \dots \right] \\ &\simeq -z\sqrt{npq} + \frac{1}{2} z^2 q - z^2 q \\ &= -z\sqrt{npq} - \frac{1}{2} z^2 q. \end{aligned} \tag{15}$$

That’s the basic idea.
If we do the expansion for the other term in Equation 11, we’ll see that it’s equal to:

$$\begin{aligned} (k-n) \log \left(1 - z\sqrt{\frac{p}{nq}}\right) &= z\sqrt{npq} + \frac{1}{2} z^2 p - z^2 p + \dots \\ &\simeq z\sqrt{npq} - \frac{1}{2} z^2 p. \end{aligned} \tag{16}$$

Putting these two terms together, we can see that the exponent term is equal to:

$$\begin{aligned} &\exp\left\{ -k \log \left(1 + z\sqrt{\frac{q}{np}}\right) + (k-n) \log \left(1 - z\sqrt{\frac{p}{nq}}\right) \right\} \\ &\simeq \exp\left\{ -z\sqrt{npq} - \frac{1}{2} z^2 q + z\sqrt{npq} - \frac{1}{2} z^2 p \right\} \\ &= \exp\left\{ -\frac{1}{2} z^2 p - \frac{1}{2} z^2 q \right\} \\ &= \exp\left\{ -\frac{1}{2} z^2 (p + q) \right\} \\ &= \exp\left\{ -\frac{1}{2} z^2 \right\}. \end{aligned} \tag{17}$$

And this is the normal distribution’s kernel! Putting this together with the normalizing term in Equation 6, and then using the definition of the standardized variable $z$ in Equation 9, we get:

$$\binom{n}{k} p^k q^{n-k} \;\simeq\; \frac{1}{\sqrt{2\pi npq}} \exp\left\{ -\frac{1}{2} \left(\frac{k - np}{\sqrt{npq}}\right)^2 \right\}. \tag{18}$$
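The limit computed in Equations 14-17 can be verified numerically (a sketch of my own, not from the post): fix $z$ and $p$, plug $k = np + z\sqrt{npq}$ into the exponent from Equation 11, and watch it approach $-z^2/2$ as $n$ grows.

```python
import math

def exponent(n, p, z):
    """-k log(1 + z sqrt(q/np)) + (k - n) log(1 - z sqrt(p/nq)),
    with k = np + z*sqrt(npq) as in Equation 10."""
    q = 1.0 - p
    k = n * p + z * math.sqrt(n * p * q)
    return (-k * math.log(1 + z * math.sqrt(q / (n * p)))
            + (k - n) * math.log(1 - z * math.sqrt(p / (n * q))))

z, p = 1.5, 0.3
for n in (10**2, 10**4, 10**6):
    print(n, exponent(n, p, z))  # approaches -z**2 / 2 = -1.125
```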
And we’re done! This is quite elegant, because we have expressed this asymptotic distribution in terms of the mean and variance of $X_n$. This is remarkable! I still remember the first time I saw this derived and realized precisely why the normal distribution is so pervasive. The normal distribution is everywhere because if you take a bunch of random noise and smash it together, the result is most likely normally distributed! Note that the more general CLT does not require that the random variables in the sum be Bernoulli distributed. For example, if $X_n$ is the sum of $n$ independent skew normal random variables, $X_n$ itself is still asymptotically normally distributed! See Figure 2 for a numerical experiment demonstrating this. The de Moivre–Laplace theorem was the first hint that this more general result, the central limit theorem, was actually true.
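In the spirit of Figure 2, here is a small simulation sketch (my addition): it draws skew normal variates via the standard representation $\delta|Z_1| + \sqrt{1-\delta^2}\,Z_2$, then checks that sums of them have far less skewness than the individual draws, as the CLT predicts.

```python
import math
import random

random.seed(42)

def skew_normal(alpha):
    """One skew normal draw with shape alpha, via the |Z1|, Z2 representation."""
    delta = alpha / math.sqrt(1 + alpha * alpha)
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return delta * abs(z1) + math.sqrt(1 - delta * delta) * z2

def sample_skewness(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

reps = 10_000
singles = [skew_normal(5.0) for _ in range(reps)]
sums = [sum(skew_normal(5.0) for _ in range(50)) for _ in range(reps)]
print(sample_skewness(singles))  # clearly positive (about 0.85 for alpha = 5)
print(sample_skewness(sums))     # near zero: the sum already looks normal
```

The theoretical skewness of a sum of 50 i.i.d. draws is the single-draw skewness divided by $\sqrt{50}$, so the second number should be roughly seven times smaller, and it keeps shrinking as the number of summands grows.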
