qouteall notes 2 months ago

WebAssembly Limitations

Background: This article focuses on in-browser Wasm.

The data that a Wasm program works on: The linear memory doesn't hold these things: Normally a program runs with a stack. For native programs, the stack holds: In Wasm, the main stack is managed by the Wasm runtime. The main stack is not in linear memory, and cannot be read/written by address. It has benefits: But it also has downsides: A variable whose address is taken must be in linear memory, not on the Wasm execution stack (unless the compiler can optimize the pointer away). The common solution is to have a shadow stack that's in linear memory. That stack is managed by Wasm code. (Sometimes the shadow stack is called the aux stack.) To summarize the 2 different stacks: There is a stack switching proposal that aims to allow Wasm to do stack switching. This makes it easier to implement lightweight threads (virtual threads, goroutines, etc.), without transforming the code and adding many branches. Using a shadow stack involves the issue of reentrancy, explained below.

The Wasm linear memory can be seen as a large array of bytes. An address in linear memory is an index into the array. An instruction can grow a linear memory. However, there is no way to shrink a linear memory. Wasm applications (that don't use Wasm GC) implement their own allocator in Wasm code. The memory regions freed in that allocator can be reused in future allocations. However, the freed memory cannot be returned to the OS. Mobile platforms (iOS, Android, etc.) often kill background processes that have large memory usage, so not returning memory to the OS is an important issue. See also: Wasm needs a better memory management story. Due to this limitation, a Wasm application consumes as much physical memory as its peak memory usage. Possible workarounds for reducing peak memory usage: There is a memory control proposal that addresses this issue.

When compiling non-GC languages (e.g. C/C++/Rust/Zig) to Wasm, they use the linear memory and implement the allocator in Wasm code.
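A minimal sketch of the grow-but-never-shrink behavior described above, simulated in TypeScript (the `LinearMemory` and `BumpAllocator` names are mine, not part of any real runtime):

```typescript
// Sketch (not a real Wasm runtime): a linear memory is a growable byte
// array addressed by index. It grows in 64 KiB pages but never shrinks,
// so freed regions can only be reused by the in-Wasm allocator, not
// returned to the OS.
const PAGE_SIZE = 65536;

class LinearMemory {
  buffer: Uint8Array;
  constructor(initialPages: number) {
    this.buffer = new Uint8Array(initialPages * PAGE_SIZE);
  }
  // Models the grow instruction: returns the old size in pages.
  grow(deltaPages: number): number {
    const oldPages = this.buffer.length / PAGE_SIZE;
    const next = new Uint8Array((oldPages + deltaPages) * PAGE_SIZE);
    next.set(this.buffer);
    this.buffer = next;
    return oldPages;
  }
  // Deliberately no shrink(): Wasm has no instruction for it.
}

// A toy bump allocator living *inside* the linear memory, as a compiled
// C/Rust program's malloc would. Freed space is reusable via reset(),
// but the LinearMemory itself stays at its peak size.
class BumpAllocator {
  private top = 0;
  constructor(private mem: LinearMemory) {}
  alloc(size: number): number {
    while (this.top + size > this.mem.buffer.length) this.mem.grow(1);
    const addr = this.top;
    this.top += size;
    return addr; // a "pointer" is just an index into linear memory
  }
  reset() { this.top = 0; } // reuse, but physical memory is not released
}
```

The key point the sketch demonstrates: after `reset()`, the allocator can reuse the space, but `buffer.length` never decreases.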
For GC languages (e.g. Java/C#/Python/Golang), they need to make GC work in Wasm. There are two solutions: The first solution, manually implementing GC, encounters difficulties: What about using Wasm's built-in GC functionality? It requires mapping the data structures to Wasm GC data structures. Wasm's GC data structures allow Java-like classes (with object headers), Java-like prefix subtyping, and Java-like arrays. The important memory management features that it doesn't support: It doesn't support some memory layout optimizations: See also: C# Wasm GC issue , Golang Wasm GC issue

For each web tab, there is an event loop where JS code runs. There is also an event queue 4 . The pseudocode of a simplified event loop (of the main thread of each tab): New events can be added to the event queue in many ways: Important things related to the event loop: There are web workers that can run in parallel. Each web worker also runs in an event loop (each web worker is single-threaded), but no rendering is involved. Pseudocode: The web threads (main thread and web workers) don't share mutable data (with exceptions): This design avoids data races on JS things and DOM things. WebAssembly multithreading relies on web workers and shared memory.

The Spectre vulnerability allows JS code running in the browser to read browser memory. Exploiting it requires accurately measuring memory access latency to test whether a region of memory is in cache. Modern browsers reduced the precision of performance.now() to make it unusable for the exploit. But there is another way of accurately measuring (relative) latency: a multi-threaded counter timer. One thread (web worker) keeps incrementing a counter in a SharedArrayBuffer. Another thread can read that counter, treating it as "time". Subtracting two "times" gives accurate relative latency. Spectre vulnerability explanation below. The solution to that security issue is cross-origin isolation . Cross-origin isolation makes the browser use different processes for different websites.
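The simplified event loop described above can be sketched as follows (including the two-queue refinement from footnote 4; all names here are mine, not spec terms):

```typescript
// Toy model of a main-thread event loop: each iteration takes one task
// from the callback (macrotask) queue, runs it to completion, drains
// the microtask queue (resolved promises), then "renders". Rendering is
// blocked while JS/Wasm runs, which is why long-running code freezes
// the page and canvas changes appear only between iterations.
type Task = () => void;
const taskQueue: Task[] = [];
const microtaskQueue: Task[] = [];
const log: string[] = [];

function eventLoopTick(): void {
  const task = taskQueue.shift();
  if (task) task(); // runs to completion; the page is frozen meanwhile
  // High-priority microtasks run before the next task and before rendering.
  while (microtaskQueue.length) microtaskQueue.shift()!();
  log.push("render"); // drawn canvas content becomes visible only here
}

taskQueue.push(() => {
  log.push("task1");
  microtaskQueue.push(() => log.push("microtask from task1"));
  taskQueue.push(() => log.push("task2")); // queued for a later iteration
});

eventLoopTick();
eventLoopTick();
```

Running the two ticks produces `task1, microtask from task1, render, task2, render`: the microtask beats both the next task and rendering.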
One website exploiting the Spectre vulnerability can only read the memory in the browser process of its own website, not other websites. Cross-origin isolation can be enabled by the HTML loading response having these headers: Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy (their required values are described below).

The threads proposal adds atomic wait/notify instructions for suspending a thread, which can be used to implement locks (and condition variables, etc.). See also. However, the main thread cannot be suspended by these instructions. This was due to some concerns about web page responsiveness. Related 1 Related 2 Related 3. This restriction makes porting native multi-threaded code to Wasm harder. For example, locking in a web worker can use normal locking, but locking in the main thread must use a spin-lock. Spin-locking for a long time costs performance. The main thread can be blocked using JS Promise integration . That blocking will allow other code (JS code and Wasm code) to execute while blocking. This is called reentrancy . Wasm applications often use a shadow stack . It's a stack that's in linear memory, managed by the Wasm app rather than the Wasm runtime. The shadow stack must be properly switched when Wasm code suspends and resumes using JS Promise integration. Otherwise, the shadow stack parts of different executions can get mixed up. Other things may also break under reentrancy and need to be taken care of. Also, as previously mentioned, if the canvas drawing code suspends (using JS Promise integration), the half-drawn canvas will be presented to the web page. This can be worked around by using an offscreen canvas , drawn in a web worker.

Multi-threading on the Web relies on web workers. Currently there is no way to directly launch a Wasm thread in the browser. Launching a multi-threaded Wasm application is done by passing a shared WebAssembly.Memory (which contains a SharedArrayBuffer) to another web worker. That web worker needs to separately create a new Wasm instance , using the same module (and memory). The Wasm globals are thread-local (not actually global). Mutating a mutable Wasm global in one thread doesn't affect other threads.
Mutable global variables need to be placed in linear memory. Another important limitation: Wasm tables cannot be shared . That creates trouble when loading new Wasm code while running (dynamic linking). To make existing code call a new function, you need an indirect call via a function reference in a table. However, tables cannot be shared across Wasm instances in different web workers. The current workaround is to notify the web workers to make them proactively load the new code and put the new function references into their tables. One simple way is to send a message to the web worker. But that doesn't work when the web worker's Wasm code is still running. For that case, some other mechanisms (that cost performance) need to be used.

While load-time dynamic linking works without any complications, runtime dynamic linking via dlopen / dlsym can require some extra consideration. The reason for this is that keeping the indirection function pointer table in sync between threads has to be done by emscripten library code. Each time a new library is loaded or a new symbol is requested via dlsym , table slots can be added and these changes need to be mirrored on every thread in the process. Changes to the table are protected by a mutex, and before any thread returns from dlopen or dlsym it will wait until all other threads are in sync. In order to make this synchronization as seamless as possible, we hook into the low level primitives of emscripten_futex_wait and emscripten_yield . — Dynamic Linking — Emscripten

There is a shared-everything threads proposal that aims to fix that. If all web workers utilize the browser's event loop and don't block for long in each execution, then they can cooperatively load new Wasm code by processing web worker messages, without much delay.

Numbers (i32, i64, f32, f64) can be directly passed between JS and Wasm (i64 maps to BigInt in JS; the other 3 map to Number). Passing a JS string to Wasm requires: Similarly, passing a string in Wasm linear memory to JS is also not easy.
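A minimal sketch of that string hand-off, with the Wasm side simulated in TypeScript (`wasmAlloc`/`wasmDealloc` are hypothetical stand-ins for allocator functions a real module would export):

```typescript
// Passing a JS string "into Wasm": transcode WTF-16 -> UTF-8, allocate
// in linear memory, copy, then hand over (pointer, length). The reverse
// direction copies and transcodes again. Both directions copy.
const memory = new Uint8Array(65536); // simulated linear memory
let heapTop = 0;
function wasmAlloc(size: number): number { const p = heapTop; heapTop += size; return p; }
function wasmDealloc(_ptr: number, _len: number): void { /* the app's free() */ }

function passStringToWasm(s: string): { ptr: number; len: number } {
  const utf8 = new TextEncoder().encode(s); // 1. transcode
  const ptr = wasmAlloc(utf8.length);       // 2. allocate in linear memory
  memory.set(utf8, ptr);                    // 3. copy
  return { ptr, len: utf8.length };         // 4. pass (address, length)
}

function readStringFromWasm(ptr: number, len: number): string {
  // Reading a string back out of linear memory also copies + transcodes.
  return new TextDecoder().decode(memory.subarray(ptr, ptr + len));
}

const { ptr, len } = passStringToWasm("héllo");
const roundTripped = readStringFromWasm(ptr, len);
wasmDealloc(ptr, len); // 5. someone must remember to free the allocation
```

Note that `"héllo"` is 5 characters but 6 UTF-8 bytes — the length that crosses the boundary is the byte length, one of the easy mistakes in this hand-off.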
Passing strings between Wasm and JS can be a performance bottleneck. If your application involves frequent Wasm-JS data passing, then replacing JS with Wasm may actually reduce performance. Modern Wasm/JS runtimes (including V8) can JIT and inline the cross-calls between Wasm and JS. But the copying cost still cannot be optimized out. The Wasm Component Model aims to solve this. It allows passing higher-level types such as string, record (struct), list, and enum in interfaces. But different components cannot share memory, and the passed data needs to be copied. There are Wasm-JS string builtins that aim to reduce the cost of string passing between Wasm and JS.

Wasm code cannot directly call Web APIs. Web APIs must be called via JS glue code. Although all of the Web's JS APIs have Web IDL specifications, they involve GC-ed objects, iterators and async iterators. These GC-related things and async-related things cannot easily adapt to Wasm code using linear memory. It's hard to design a specification for turning Web IDL interfaces into Wasm interfaces. There was a Web IDL Bindings proposal, but it was superseded by the Component Model proposal. Currently Wasm cannot be run in the browser without JS code that bootstraps Wasm.

The original version of Wasm only supports 32-bit addresses and up to 4GiB of linear memory. In Wasm, a linear memory has a finite size. Accessing an address out of bounds needs to trigger a trap that aborts execution. Normally, to implement that range checking, the runtime needs to insert a branch for each linear memory access. But Wasm runtimes have an optimization: map the 4GiB linear memory to a virtual memory region. The out-of-range pages are not allocated from the OS, so accessing them causes an error from the OS. The Wasm runtime can use signal handling to handle these errors. No range-checking branch is needed. That optimization doesn't work when supporting 64-bit addresses. There is not enough virtual address space to hold the Wasm linear memory.
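The per-access check that a runtime must emit when the virtual-memory trick is unavailable can be sketched as (a simulation, not real runtime code):

```typescript
// Every load/store gets an explicit bounds-check branch. With the 4 GiB
// virtual-address-space reservation trick, this branch disappears: the
// out-of-range access hits an unmapped page and the OS signals a fault,
// which the runtime turns into a trap.
const memSize = 1024;
const mem = new Uint8Array(memSize);

function load8(addr: number): number {
  // This branch is the per-access cost discussed in the text.
  if (addr >= memSize || addr < 0) {
    throw new RangeError("wasm trap: out of bounds memory access");
  }
  return mem[addr];
}

mem[7] = 42;
```

An in-bounds load returns the byte; an out-of-bounds load traps instead of reading garbage.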
So the range-checking branches still need to be inserted for every linear memory access. This costs performance. See also: Is Memory64 actually worth using?

Generally, WebAssembly runs slower than native applications compiled from the same source code, because of many factors:

Firstly, the file needs to have DWARF debug information in a custom section. There is a C/C++ DevTools Support (DWARF) plugin ( Source code ). That plugin is designed to work with C/C++. When using it on Rust, breakpoints and inspecting integer local variables work, but other functionalities (inspecting strings, inspecting globals, evaluating expressions, etc.) are not supported. VSCode can debug Wasm running in Chrome, using the vscode-js-debug plugin . Documentation , Documentation . It allows inspecting integer local variables. But the local variable view doesn't show string content. You can only see string content by inspecting linear memory. The debug console expression evaluation doesn't allow calling functions. (It also requires the VSCode WebAssembly DWARF Debugging extension. Currently (2025 Sept) that extension doesn't exist in Cursor.) Chromium debugging API .

Background: Spectre vulnerability (Variant 1) core exploit JS code ( see also ): The |0 is for converting a value to a 32-bit integer, helping the JS runtime optimize it into integer operations (JS is dynamic; without that, the JITed code may do other things). Another part is there to prevent these read operations from being optimized out. The same thing can also be done via equivalent Wasm code. Related: Another vulnerability related to cache side channels: GoFetch . It exploits Apple processors' cache prefetching functionality.

See also: WebAssembly Troubles part 2: Why Do We Need the Relooper Algorithm, Again? This may reduce the performance of compiling to Wasm and JIT optimization. This issue doesn't apply to application developers.

WebAssembly is not always faster than JS, depending on how much optimization effort is put in, and the browser's limitations.
But WebAssembly has higher performance potential than JS. JS has a lot of flexibility. Flexibility costs performance. JS runtimes often use runtime statistics to find unused flexibility and optimize accordingly. But statistics cannot be fully certain, so the JS runtime still has to "prepare" for flexibility. The runtime statistics and "preparing for flexibility" all cost performance, in a way that cannot be optimized away without changing the code format. ↩

call_ref calls a function reference on the stack. call_indirect calls a function reference in a table at an index. return_call, return_call_indirect are for tail calls. ↩

The safepoint mechanism allows a thread to cooperatively pause at specific points. Scanning a running thread's stack is not reliable, due to memory ordering issues and race conditions, and some pointers may be in registers, not on the stack. If the thread is suspended using OS functionality, some local variables may be in registers, and it's hard to tell whether data in a register is a pointer or other data (treating an integer as a pointer may cause memory safety issues or memory leaks). If a thread is cooperatively paused at specific places, the references can be reliably scanned. One way to implement safepoints is to have a global safepoint flag. The code frequently reads the safepoint flag and pauses if the flag is true. There exist optimizations such as using the OS page fault signal handler. ↩

That's a simplification. Actually there are two event queues in the main thread of each tab. One is the callback queue for low-priority events. Another is the microtask queue for high-priority events. The high-priority ones execute first. ↩

WebAssembly is an execution model and a code format. It's designed with performance concerns: it can achieve higher performance than JS 1 . It's designed with safety concerns: its execution is sandboxed. It can be run in the browser. It's close to native assembly (e.g. x86, ARM) but abstracted in a cross-platform way, so that many C/C++/Rust/etc. applications can be compiled to Wasm (but with limitations).
Although its name has "Web", it is not just for the Web. It can be used outside of the browser. Although its name has "Assembly", it has features (e.g. GC ) that are in a higher abstraction layer than native assembly, similar to JVM bytecode. In browsers, the same engine runs both JS and Wasm. Chromium's V8 executes both JS and Wasm. Wasm GC uses the same GC as JS.

Runtime-managed stack. It has local variables, function arguments, return code addresses, etc. It's managed by the runtime. It's not in linear memory. Linear memory. A linear memory is an array of bytes. It can be read/written by address (an address can be seen as an index into the array). Wasm supports having multiple linear memories. A linear memory's size can grow. But currently a linear memory's size cannot shrink. A linear memory can be shared by multiple Wasm instances; see the multi-threading section below. Table. Each table is a (growable) array that can hold: Function references. Extern value references. An extern value can be a JS value or other things, depending on the environment. Exception references. GC value references. Heap. Holds GC values. Explained later. Globals. A global can hold a number (i32, i64, f32, f64) or a reference (including function references, GC value references, extern value references, etc.). The globals are not in linear memory.

Linear memory doesn't hold the stack. The stack is managed by the runtime and cannot be read/written by address. The linear memory doesn't hold function references. Unlike function pointers in C, Wasm function references cannot be converted to and from integers. This design can improve safety. A function reference can be on the stack, in a table, or in a global, and can be called by special instructions 2 . A function pointer becomes an integer index corresponding to a function reference in a table.

Local variables and call arguments. (Not all of them are on the stack; some are in registers.) Return code address. It's the machine code address to jump to when the function returns.
(Functions can be inlined and machine code can be optimized, so this doesn't always correspond to code.) Other things. (e.g. C#/Golang metadata)

It avoids security issues related to control flow hijacking. A native application's stack is in memory, so an out-of-bounds write can change the return code address on the stack, causing it to execute wrong code. There are protections such as data execution prevention (DEP), stack canaries and address space layout randomization (ASLR). These are not needed in Wasm. See also. It allows the runtime to optimize the stack layout without changing program behavior.

Some local variables have their address taken. They need to be in linear memory. For example: GC needs to scan the references (pointers) on the stack. If the Wasm app uses application-managed GC (not Wasm built-in GC, for reasons explained below), then the on-stack references (pointers) need to be "spilled" to linear memory. Stack switching cannot be done. Golang uses stack switching for goroutine scheduling (outside Wasm). Currently Golang's performance in Wasm is poor , because it tries to emulate goroutine scheduling in single-threaded Wasm, and thus needs to add many dynamic jumps to the code. Dynamic stack resizing cannot be done. Golang does dynamic stack resizing so that new goroutines can be initialized with small stacks, reducing memory usage.

The main execution stack holds local variables, call arguments, return code addresses, and possibly operands (in the Wasm stack machine). It's managed by the Wasm runtime and is not in linear memory. It cannot be freely manipulated by Wasm code. The shadow stack is in linear memory. It holds the local variables that need to be in linear memory. It's managed by Wasm code, not the Wasm runtime.

Store large data in JS . If an ArrayBuffer is GC-ed, its physical memory can be returned to the OS. Only fetch a small chunk of data from the server at a time. Avoid fetching all data and then processing in batch; do stream processing.
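The function-table indirection mentioned above — function "pointers" become integer indices into a table of opaque references — can be sketched as:

```typescript
// Toy model of call_indirect: a Wasm function pointer is an index into
// a table of function references, never a raw memory address, so it
// cannot be forged from an arbitrary integer.
type FuncRef = (...args: number[]) => number;
const table: FuncRef[] = []; // the (growable) function table

function addToTable(f: FuncRef): number {
  table.push(f);
  return table.length - 1; // this index is what a C function pointer compiles to
}

function callIndirect(index: number, ...args: number[]): number {
  const f = table[index];
  if (!f) throw new Error("wasm trap: uninitialized table element");
  return f(...args);
}

const addIdx = addToTable((a, b) => a + b);
const mulIdx = addToTable((a, b) => a * b);
```

Calling through a bad index traps instead of jumping to arbitrary code — this is the control-flow-hijacking protection discussed above.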
Use the origin-private file system to hold large data, and only load a small chunk into linear memory at a time. Still use linear memory to hold data.

Implement GC in Wasm code. Use Wasm's built-in GC functionality.

GC requires scanning GC roots (pointers). Some GC roots are on the stack. But the Wasm main stack is not in linear memory and cannot be read by address. One solution is to "spill" the pointers to the shadow stack in linear memory. Having the shadow stack increases binary size and costs runtime performance. Multi-threaded GC often needs to pause execution to scan the stack correctly. In native applications, this is often done using the safepoint mechanism 3 . It also increases binary size and costs runtime performance. Multi-threaded GC often uses store barriers or load barriers to ensure scanning correctness. They also increase binary size and cost runtime performance.

Cannot collect a cycle where a JS object and an in-Wasm object reference each other. GC values cannot be shared across threads. No weak references. No finalizers (the code that runs when an object is collected by GC). No interior pointers. (Golang supports interior pointers.) No arrays of struct type. No per-object locking (in Java and C#, every object can be a lock). Cannot use fat pointers to avoid object headers. (Golang does this.) Cannot add custom fields at the head of an array object. (C# supports this.) No compact sum type memory layout.

Each time the browser calls JS/Wasm code (e.g. event handling), it adds an event to the queue. If JS code awaits an unresolved promise, the event handling finishes. When that promise resolves, a new event is added to the queue. Web page rendering is blocked while JS/Wasm code executes. Having JS/Wasm code keep running for a long time will "freeze" the web page. When JS code draws on a canvas, the things drawn will only be presented once the current iteration of the event loop finishes.
If the canvas drawing code is async and awaits an unresolved promise during drawing, a half-drawn canvas will be presented. In React, when a component first mounts, the effect callback in useEffect will run in the next iteration of the event loop (React schedules it as a new task). But the effect in useLayoutEffect will run in the current iteration of the event loop.

Usually, JS values sent to another web worker are deep-copied (except that some immutable values won't be copied). Sending an ArrayBuffer across threads (as a transferable) detaches it from its binary data; only one thread can access its binary data. Immutable things can be sent to another web worker without copying or detaching.

Cross-Origin-Opener-Policy must be same-origin. See also. Cross-Origin-Embedder-Policy must be require-corp or credentialless. See also. If it's require-corp, all resources loaded from other websites (origins) must have a response header containing Cross-Origin-Resource-Policy (handled differently in CORS mode). If it's credentialless, requests sent to other websites won't contain credentials like cookies.

Transcode (e.g. passing to Rust needs converting WTF-16 to UTF-8), allocate memory in the Wasm linear memory, copy the transcoded string into the Wasm linear memory, pass the address and length into Wasm code, and the Wasm code needs to take care of deallocating the string.

The previously mentioned linear memory bounds check. JIT (just-in-time compilation) cost. Native C/C++/Rust applications can be AOT-compiled (ahead-of-time). V8 firstly uses a quick simple compiler to compile Wasm into machine code quickly to improve startup speed (but the generated machine code runs slower), then uses a slower high-optimization compiler to generate optimized machine code for the few hot parts of the Wasm code. See also. That optimization is profile-guided (targeting the few hot parts, using statistical results to guide optimization). The profiling, optimization and code-switching all cost performance. Multi-threading cannot use release-acquire memory ordering, which can improve the performance of some atomic operations. See also. Multi-threading requires launching web workers, which is a slow operation.
Limited access to hardware functionality, such as some special SIMD instructions. But Wasm already supports many common SIMD instructions. Cannot access some OS functionalities.

The CPU has a cache for accelerating memory access. Some parts of memory are put into the cache. Accessing that memory can be done by accessing the cache, which is faster. The cache size is limited. Accessing new memory can evict existing data from the cache and put the newly accessed data into the cache. Whether a piece of memory is in cache can be tested by memory access latency. The CPU does speculative execution and branch prediction. The CPU tries to execute as many instructions as possible in parallel. When the CPU sees a branch, it tries to predict the branch and speculatively execute code in the branch. If the CPU later finds the branch prediction to be wrong, the effects of speculative execution (e.g. written registers, written memory) will be rolled back. However, a memory access leaves a side effect on the cache, and that side effect won't be cancelled by the rollback. The branch predictor relies on statistical data, so it can be "trained". If one branch keeps going to the first path many times, the branch predictor will predict that it will always go to the first path.

The attacker firstly executes that code many times with an in-bounds index to "train" the branch predictor. Then the attacker accesses many other memory locations to invalidate the cache. Then the attacker executes that code using a specific out-of-bounds index: The CPU speculatively reads the out-of-bounds element. That result is a secret in the browser process's memory. Then the CPU speculatively reads the probe array, using an index that's computed from that secret. One specific memory region of the probe array will be loaded into cache. Accessing that region is faster. The CPU finds that the branch prediction was wrong and rolls back, but doesn't roll back the side effect on the cache. The attacker measures memory read latency over the probe array. The place that is faster to access corresponds to the value of the secret.
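The information flow of those steps can be sketched in a toy model (no real speculation or timing — the "cache" is a set, and "in cache" stands in for "fast access"):

```typescript
// Toy model of the Spectre probe step: "speculative" execution leaves a
// trace in a simulated cache; rollback does not undo it; the attacker
// recovers the secret by checking which probe-array line is cached.
const CACHE_LINE = 64;
const secretMemory = new Uint8Array([113]); // stands in for out-of-bounds data
const cache = new Set<number>();            // cached probe-array offsets

function speculativeOutOfBoundsRead(index: number): void {
  const secret = secretMemory[index];  // speculative out-of-bounds read
  cache.add(secret * CACHE_LINE);      // secret-dependent probe read -> cache side effect
  // ...the CPU detects the misprediction and rolls back registers,
  // but the cache line stays resident.
}

function recoverSecret(): number {
  // A real attack times each access; here "in cache" models "fast".
  for (let guess = 0; guess < 256; guess++) {
    if (cache.has(guess * CACHE_LINE)) return guess;
  }
  return -1;
}

speculativeOutOfBoundsRead(0);
```

The point of the sketch: the architectural state is rolled back, but the secret still leaks through which cache line became resident.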
To accurately measure memory access latency, performance.now() is not accurate enough. It needs a multi-threaded counter timer: One thread (web worker) keeps increasing a shared counter in a loop. The attacking thread reads that counter to get a "time". The cross-thread counter sharing requires SharedArrayBuffer. Although it cannot measure time in standard units (e.g. nanoseconds), it's accurate enough to distinguish the latency difference between fast cache access and slow RAM access.

qouteall notes 2 months ago

Higher-Level Design Patterns

Higher-level software design patterns: These patterns and ideas are often deeply connected and used together.

It involves two different aspects: The benefit of turning computation (logic and action) into data: The benefit of turning execution state into explicit data: The distinction between computation and execution state is blurry. A closure can capture data. An execution state can be seen as a continuation , which is also a computation.

Algebraic effects : An effect handler executes some code in a scope. Some code is executed under an effect handler. When it performs an effect, the control flow jumps to the effect handler, and the execution state ( delimited continuation ) up to the effect handler's scope is also saved. The effect handler can then resume using the execution state. A simple introduction to Algebraic effects. A delimited continuation is execution state turned into data. It's delimited because the execution state only includes the stack frames within the effect-handling scope. A continuation (without "delimited") contains the execution state of the whole program (assuming the program is single-threaded). Delimited continuations are "local". Continuations are "global". The "local" one is more fine-grained and useful.

Continuation-passing style (CPS) is a way of representing programs. In CPS, each function accepts a continuation. Returning becomes calling the continuation. Calling the continuation continues execution. The output of the continuation is the "final output of the whole program" (if IO or mutable state is involved, the "final output of the whole program" can be empty).

See also: Configuration complexity clock. When you try to handle many different business requests, one solution is to create a flexible rules engine. Configuring the rules engine can handle all of the requests. However, a tradeoff then becomes salient: DSLs are useful when they are high in abstraction level, and new requirements mostly follow the abstraction. System calls are expensive .
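The CPS idea above can be sketched in a few lines (function names are mine, for illustration):

```typescript
// Continuation-passing style: every function takes an extra callback k
// (the continuation); "return x" becomes "k(x)". The outermost
// continuation represents "the rest of the whole program".
type Cont<T> = (value: T) => void;

function addCps(a: number, b: number, k: Cont<number>): void {
  k(a + b); // returning = calling the continuation
}

function squareCps(x: number, k: Cont<number>): void {
  k(x * x);
}

// Direct style would be: square(add(1, 2)).
// In CPS, the "rest of the program" is threaded explicitly:
let finalOutput = 0;
addCps(1, 2, (sum) =>
  squareCps(sum, (sq) => {
    finalOutput = sq; // the outermost continuation receives the final result
  })
);
```

Because the continuation is an ordinary value, it can be stored, passed around, or invoked later — which is exactly what makes execution state "data".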
Replacing system calls with data can improve performance: Mutation can be represented as data. Data can be interpreted as mutation. The benefits: About rollback: Sometimes, to improve user experience, we need to replay conflicted changes instead of rolling them back. It's more complex. In some places, we specify a new state and need to compute the mutation (diff). Examples:

Mutate-by-recreate: Keep data immutable. Change it by recreating the whole data. In multi-threading, for read-heavy data, it's often beneficial to make the data structure immutable but keep one mutable atomic reference to it. Updating recreates the whole data structure and atomically changes the reference. This is called read-copy-update (RCU) or copy-on-write (COW). In pure functional languages (e.g. Haskell), there is no direct way of mutating things. Mutation can only be simulated by recreating.

Bitemporal modelling : Store two pieces of records. One records the data and the time it was updated in the database. Another records the data and the time it reflects in reality. (Sometimes reality changes but the database isn't edited immediately. Sometimes the database contains wrong information that's corrected later.)

Conflict-free replicated data type (CRDT) : The mutations can be combined, and the result doesn't depend on the order of combining. It allows a distributed system to reach eventual consistency without immediate communication. In a CRDT, the operator ∗ for combining mutations: Examples of CRDTs:

For example, in a multiplayer game, there is a door. The door's state can be open or closed (a boolean). Each operation is a tuple (new state, timestamp). Combination is max-by-timestamp (of two operations, pick the higher-timestamp one). Considering that multiple players can do operations at exactly the same timestamp, we add player ID as a tie-breaker . The operation now becomes (new state, timestamp, player ID). Combination maxes by the tuple of (timestamp, player ID). If timestamps are equal, the larger player ID wins. Note that typical multiplayer game implementations don't use CRDT. The server holds the source-of-truth game state.
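The door CRDT above can be sketched directly, with assertions that the merge has the properties a CRDT combine operator needs (commutative, associative, idempotent):

```typescript
// Last-writer-wins door state: total order on (timestamp, playerId).
// Replicas converge because merge is commutative, associative, and
// idempotent, so combining operations in any order gives the same winner.
interface DoorOp {
  open: boolean;
  timestamp: number;
  playerId: number; // tie-breaker: larger wins on equal timestamps
}

function merge(a: DoorOp, b: DoorOp): DoorOp {
  if (a.timestamp !== b.timestamp) return a.timestamp > b.timestamp ? a : b;
  return a.playerId >= b.playerId ? a : b;
}

const x: DoorOp = { open: true,  timestamp: 5, playerId: 1 };
const y: DoorOp = { open: false, timestamp: 5, playerId: 2 }; // same tick, higher id
const z: DoorOp = { open: true,  timestamp: 3, playerId: 9 }; // older
```

Here `y` wins against `x` purely on the player-ID tie-breaker, and `z` loses to both despite its high player ID, because timestamp is compared first.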
Clients send actions to the server. The server validates actions, changes the game state and broadcasts it to all clients.

Drawing solid triangles to a framebuffer can also be seen as a CRDT. The whole framebuffer can be seen as an operation. Each pixel in the framebuffer has a depth value. Combining two framebuffers takes the lowest-depth one of the two pixels in the same position. (Two framebuffers may have the same depth on the same pixel with different colors. We can use a unique triangle ID as a tie-breaker.) Note that actual rasterization on a GPU works by having one centralized framebuffer, not by using a CRDT.

In a collaborative text editing system, each character has an ID. It supports two kinds of operations: There is a "root character" at the beginning of the document. It's invisible and cannot be deleted. For two insertions after the same character, the tie-breaker is (timestamp, user ID). Higher-timestamp ones appear first. For the same timestamp, higher user ID ones appear first. It forms a tree. Each character is a node, containing a visibility boolean flag. Each operation is an edge pointing to the new character. The document is formed by traversing the tree in depth-first order (edges ordered by tie-breaker) while hiding invisible characters. 3 4

Only compute some parts of the data, and keep the information of the remaining computation for future use. Deferred (async) computation vs immediate computation: A computation, an optimization, or a safety check can be done in: Most computations that are done at compile time can be done at runtime (with extra performance cost). But if you want to avoid the performance cost by doing it at compile time, it becomes harder. Rust and C++ have Statics-Dynamics Biformity ( see also ): most runtime computation methods cannot be easily used at compile time. Using compile-time mechanisms often requires data to be encoded in types, which then requires type gymnastics.
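The framebuffer-as-CRDT idea above can be sketched per pixel (a toy model; real GPU rasterization uses one centralized depth buffer, as noted):

```typescript
// Per-pixel merge: keep the fragment with the lowest depth; break depth
// ties deterministically by triangle id so the result is order-independent.
interface Fragment { color: string; depth: number; triangleId: number }

function mergePixel(a: Fragment, b: Fragment): Fragment {
  if (a.depth !== b.depth) return a.depth < b.depth ? a : b;
  return a.triangleId <= b.triangleId ? a : b; // tie-breaker
}

function mergeFramebuffer(a: Fragment[], b: Fragment[]): Fragment[] {
  // Both framebuffers must be the same size; merge position by position.
  return a.map((frag, i) => mergePixel(frag, b[i]));
}

const far:  Fragment = { color: "red",  depth: 9, triangleId: 1 };
const near: Fragment = { color: "blue", depth: 2, triangleId: 2 };
```

Merging `[far]` with `[near]` in either order yields the blue, nearer fragment — the order-independence that makes it CRDT-like.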
The ways that solve (or partially solve) the biformity between compile-time and runtime computation: Related: Symbolic execution , Program search using superposition.

Views in SQL databases are "fake" tables that represent the result of a query. A view's data is derived from other tables (or views). The generalized concept of a view: a view takes one information model and presents it as another information model, and allows operating on it as that other information model. The generalized view can be understood as: Examples of the generalized view concept: More generally:

Dynamically-typed languages also have "types". The "type" here is the mapping between in-memory data and information . Even in dynamic languages, the data still has a "shape" at runtime . The program only works with specific "shapes" of data . For example, in Python, if a function accepts an array of strings but you pass it one string, it treats each character as a string, which is wrong. Mainstream languages often have relatively simpler and less expressive type systems. Some "shapes" of data are complex and cannot be easily expressed in mainstream languages' type systems (without type erasure). Dynamic languages' benefits: But statically-typed languages and IDEs are improving. More expressive type systems reduce the friction of typing. Types help catch mistakes, help in understanding code, and help IDE functionality. Type inference and IDE completion save the time spent typing out types. That's why mainstream dynamic languages (JS, Python) are embracing type annotations.

A view can be backed by either storage or computation (or a combination of storage and computation). Modern highly-parallel computations are often bottlenecked by IO and synchronization. Adding new computation hardware units is easy. Making the information flow efficiently between these hardware units is hard. When memory IO becomes the bottleneck, re-computing rather than storing can be beneficial.
Most algorithms use the idea of producing an invariant, growing the invariant, and maintaining the invariant: About the transitive rule: if X and Y both follow the invariant, then the result of "merging" X and Y also follows the invariant. "Transitive" is why the invariant can grow without re-checking the whole data. When the invariant is enforced at the language level, it can be " contagious ". For example, invariants in business logic: Invariants in data: The timing of maintaining an invariant: The responsibility of maintaining an invariant: In the second case (application code maintains the invariant), to make it less error-prone, we can encapsulate the data and the invariant-maintaining code , ensuring that any usage of the encapsulated API won't violate the invariant . If some usages of the API can break the invariant and a developer can only know it by reading the implementation, then it's a leaky abstraction. For example, one common invariant to maintain is consistency between derived data and base data (source-of-truth) . There are many solutions: In real-world legacy code, invariants are often not documented . They are implicit in code. A developer not knowing an invariant can easily break it. Type systems also help maintain invariants. But a simple type system can only maintain simple invariants. Complex invariants require complex types to maintain. If it becomes too complex, types may be longer than execution code, and type errors become harder to resolve. It's a tradeoff.

Concentration and fat-tail distributions (80/20) are common in the software world: Many optimizations are based on assuming the high-probability case happens : GoF design patterns. Other GoF design patterns briefly explained:
Computation-data duality. Mutation-data duality. Partial computation and multi-stage computation. Generalized view. Invariant production, growth, and maintenance.

Turn computation (logic and action) into data. Turn execution state into explicit data. Closure (lambda expression, function value). A function along with captured data. It allows reusing a piece of code along with captured data .
It can help abstraction: separate the generation of computation (creating function values) and the execution of computation (executing functions). (It's related to partial computation and multi-stage computation.) Composition. The computation that's turned into data can be more easily composed. Functional programming encourages having simple building blocks and composing them into complex logic. Flexibility. The computation that's turned into data can be changed and rebuilt dynamically. Inspection: Explicit execution state is easier to inspect and display (machine code can be optimized, and it's platform-dependent, so machine code execution position and runtime stack are harder to inspect and manipulate than explicit data). Serialization: Explicit execution state can be serialized and deserialized, thus stored in a database and sent across a network. (Example: Restate) Suspension: Explicit execution state allows temporarily suspending execution and resuming it later. Suspending a thread is harder and less efficient 1 . Modification: Explicit execution state can be modified. It makes cancellation and rollback easier. (Modifying the execution stack and execution state is harder, and it's not supported by many mainstream languages.) Forking: Allows forking control flow, which can be useful in some kinds of simulations.

If the rules engine is high in abstraction level, doing a lot of predefined things under the hood, then it will be unadaptive when a new requirement clashes with predefined behavior. Simple interface = hardcoded defaults = less customizability . If the rules engine is low in abstraction level, then doing things will require more configuration. It's not more convenient than just coding. It becomes a new DSL. The new DSL is often worse than mainstream languages because: The new DSL often has poor debug support. The new DSL often has no IDE support. No existing library ecosystem. Need to reinvent the wheel. The new DSL is often less battle-tested and more buggy.
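The closure idea above — generating a function value that captures data, then executing it later — can be sketched minimally (the `make_adder` name is illustrative):

```rust
// Sketch: separating the *generation* of computation (building a closure
// that captures data) from its *execution* (calling the function value later).
fn make_adder(offset: i64) -> impl Fn(i64) -> i64 {
    move |x| x + offset // `offset` is captured into the function value
}

fn main() {
    // Stage 1: generate the computation as a value.
    let add10 = make_adder(10);
    // Stage 2: execute (and compose) it somewhere else.
    let results: Vec<i64> = vec![1, 2, 3].into_iter().map(|x| add10(x)).collect();
    assert_eq!(results, vec![11, 12, 13]);
}
```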
io_uring: allows submitting many IO tasks by writing into memory, then using one system call to submit them. 2 Graphics APIs: Old OpenGL uses system calls to change state and dispatch draw calls. New graphics APIs like Vulkan, Metal and WebGPU all use command buffers. Operations are turned into data in a command buffer, then one system call submits many commands.

Instead of just doing in-place mutation, we can enqueue a command (or event) to do the mutation later. The command is then processed to do the actual mutation. (It's also moving computation between stages.) Layered filesystem (in Docker). Mutating or adding a file creates a new layer. The unchanged previous layers can be cached and reused. Event sourcing . Derive the latest state from events (log, mutations). Express the latest state as a view of old state + mutations. The idea is adopted by database WAL, data replication, Lambda architecture, etc. Command Query Responsibility Segregation . The system has two facades: the query facade doesn't allow mutation, and the command facade only accepts commands and doesn't return data. Easier to inspect, audit and debug mutations, because mutations are explicit data, not implicit execution history. Easier to audit and replay. Can replay mutations and roll back easily. Can replicate (sync) data changes without sending full data.

Transactional databases allow rolling back an uncommitted transaction. (MySQL InnoDB does in-place mutation on disk but writes undo log and redo log. PostgreSQL MVCC write is append-only on disk.) Editing software often needs to support undo. It's often implemented by storing the previous step's data, while sharing unchanged substructure to optimize. Multiplayer game clients that do server-state prediction (to reduce visible latency) need to roll back when the prediction is invalidated by the server's message. CPUs do branch prediction and speculative execution. If branch prediction fails or there is another failure, it internally rolls back.
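The enqueue-then-apply idea can be sketched as follows (the `Command` type and `apply` function are illustrative, not from any specific library):

```rust
use std::collections::HashMap;

// Sketch of deferred mutation: instead of mutating state while reading it,
// record commands as data, then apply them in a separate stage.
#[derive(Debug)]
enum Command {
    Add { key: String, value: i64 },
    Remove { key: String },
}

fn apply(state: &mut HashMap<String, i64>, commands: Vec<Command>) {
    for cmd in commands {
        match cmd {
            Command::Add { key, value } => { state.insert(key, value); }
            Command::Remove { key } => { state.remove(&key); }
        }
    }
}

fn main() {
    let mut state = HashMap::new();
    state.insert("a".to_string(), 1);

    // Stage 1: inspect state immutably and *record* mutations.
    let mut queue = Vec::new();
    for (k, v) in &state {
        if *v < 10 {
            queue.push(Command::Remove { key: k.clone() });
        }
    }
    queue.push(Command::Add { key: "b".to_string(), value: 2 });

    // Stage 2: apply all recorded mutations at once.
    apply(&mut state, queue);
    assert!(!state.contains_key("a"));
    assert_eq!(state["b"], 2);
}
```

Because the commands are plain data, they can also be logged, audited, replayed, or sent over a network — the benefits listed above.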
( Spectre and Meltdown vulnerabilities are caused by rollback not cancelling side effects in cache that can be measured via access speed.) Git. Compute diffs based on file snapshots. The diff can then be manipulated (e.g. merging, rebasing, cherry-pick). React. Compute a diff from a virtual data structure and apply it to the actual DOM. Sync changes from the virtual data structure to the actual DOM. Kubernetes. You configure what pods/volumes/... should exist. Kubernetes observes the diff between reality and configuration, then does actions (e.g. launch a new pod, destroy a pod) to cover the diff.

It must be commutative: a * b = b * a. The order of combining doesn't matter. It must be associative: a * (b * c) = (a * b) * c. The order of combining doesn't matter. It must be idempotent: a * a = a. Duplicating a mutation won't affect the result. (Idempotence is not needed if you ensure exactly-once delivery.)

Insertion. inserts a new character after the character with id . The is unique globally. Deletion only marks the invisible flag of the character (keeping the tombstone).

In lazy evaluation, the unobserved data is not computed. Deferred mutation. Relates to mutation-data duality. Replace immediately executed code with data (expression tree, DSL, etc.) that will be executed (interpreted) later. Relates to computation-data duality. In multi-stage programming, some data is fixed while some data is unknown. The fixed data can be used for optimization. It can be seen as runtime constant folding. JIT can be seen as treating bytecode as a runtime constant and folding it into interpreter code. Replacing a value with a function or expression tree helps handle the currently-unknown data. Using a future (promise) object to represent a pending computation. In Idris, having a hole and inspecting the type of the hole can help proving. Immediately freeing memory is immediate computation.
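The three merge laws above can be checked on a tiny example, using element maximum as the merge operation (as in a state-based max-register CRDT):

```rust
// Sketch: a merge function for the simplest state-based CRDT, a "max
// register". Taking the maximum is commutative, associative, and idempotent,
// so replicas converge no matter the order or duplication of merges.
fn merge(a: u64, b: u64) -> u64 {
    a.max(b)
}

fn main() {
    let (a, b, c) = (3, 7, 5);
    assert_eq!(merge(a, b), merge(b, a));                     // commutative
    assert_eq!(merge(a, merge(b, c)), merge(merge(a, b), c)); // associative
    assert_eq!(merge(a, a), a);                               // idempotent
}
```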
GC is deferred computation. Stream processing is immediate computation. Batch processing is deferred computation. Most of PyTorch's matrix operations are async. The GPU computes in the background. The tensor object's content may be yet unknown (and the CPU will wait for the GPU when you try to read its content). PostgreSQL and SQLite require a deferred "vacuum" that rearranges storage space. Mobile GPUs often do tiled rendering . After the vertex shader runs, the triangles are not immediately rasterized, but dispatched to tiles (one triangle can go to multiple tiles). Each tile then rasterizes and runs the pixel shader separately. It can reduce memory bandwidth requirements and power consumption.

Pre-compile stage. (Code generation, IDE linting, etc.) Compile stage. (Compile-time computation, macros, dependent type theorem proving, etc.) Runtime stage. (Runtime checks, JIT compilation, etc.) After first run. (Offline profile-guided optimization, etc.) Zig compile-time computation and reflection. See also Dependently-typed languages. (e.g. Idris, Lean) Scala multi-stage programming. See also . It's at runtime, not compile time. But its purpose is similar to macros and code generation. Dynamically compose code at runtime and then get it JITed.

Encapsulating information. Hiding the true underlying information and only exposing derived information. "Faking" information. Bits are views of voltage in circuits. 5 Integers are views of bits. Characters are views of integers. Strings are views of characters. 6 Other complex data structures are views of binary data. (A pointer can be seen as a view of the pointed-to data.) A map (dictionary) can be viewed as a function. Lookup acceleration structures (e.g. database indexes) are also views of underlying data. A cache is a view of underlying data/computation. Lazy evaluation provides a view of the computation result. Virtual memory is a view of physical memory. A file system is a view of data on disk.
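Lazy evaluation as a "view of a computation result" can be sketched with `std::cell::OnceCell` (the `LazySum` type is illustrative): the value is computed on first access and cached afterwards — deferred computation that becomes immediate only when observed.

```rust
use std::cell::OnceCell;

// Sketch: a lazy view of a computation result over some underlying data.
struct LazySum<'a> {
    data: &'a [i64],
    cached: OnceCell<i64>, // empty until first observation
}

impl<'a> LazySum<'a> {
    fn new(data: &'a [i64]) -> Self {
        LazySum { data, cached: OnceCell::new() }
    }

    // Computes the sum on the first call; later calls return the cached value.
    fn get(&self) -> i64 {
        *self.cached.get_or_init(|| self.data.iter().sum())
    }
}
```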
Not-on-disk data can also be viewed as files (Unix everything-is-a-file philosophy). A symbolic link in a file system is a view of another point in the file system. Databases provide generalized views of on-disk/in-memory data. Linux namespaces, hypervisors, sandboxes, etc. provide views of aspects of the system. Proxies, NAT, firewalls, virtualized networking, etc. provide a manipulated view of the network. Transaction isolation in databases provides views of data (e.g. snapshot isolation). Replicated data and redundant data are views of the original data. Multi-tier storage systems. From small-fast ones to large-slow ones: register, cache, memory, disk, cloud storage. The previously mentioned computation-data duality and mutation-data duality can also be seen as viewing. Transposing in PyTorch (by default) doesn't change the underlying matrix data. It only changes how the data is viewed.

The mapping between binary data and information is a view. Information is bits + context . The context is how the bits are mapped to information. A type contains the viewing from binary data to information . Abstraction involves viewing different things as the same thing .

Avoid the shackles of an unexpressive type system. Avoid syntax inconvenience related to type erasure (type erasure in typed languages requires inconvenient things like type conversion). Can quickly iterate by changing one part of the program, before the changes work with other parts of the program (in statically-typed languages, you need to resolve all compile errors in the code that you don't use now). This is a double-edged sword. Broken code that was not tested tends to get missed. Save some time typing types and type definitions.

References in GC languages. A reference may be implemented with a pointer, a colored pointer, or a handle (object ID). The pointer may be changed by a moving GC. But the in-language semantics don't change after moving. ID. All kinds of IDs, like string IDs, integer IDs, UUIDs, etc., can be seen as references to an object.
The ID may still exist after the referenced object is removed; then the ID becomes a "dangling ID". A special kind of ID is a path. For example, a file path points to a file, a URL points to a web resource, a permission path points to a permission, etc. They are the "pointers" to a node in a hierarchical (tree-like) structure. Content-addressable ID. Using the hash of an object as its ID. This is used in Git, Blockchain, IPFS and languages like Unison . Iterator. An iterator can be seen as a pointer pointing to an element in a container. Zipper. A zipper contains two things: 1. a container with a hole 2. the element at the position of the hole. Unlike an iterator, a zipper contains the information of the whole container. It's often used in pure functional languages.

Produce invariant. Create the invariant at the smallest scale, in the simplest case. Grow invariant. Combine or expand small invariants to make them larger. This often utilizes the transitive rule . Do it until the invariant becomes big enough to finish the task. Maintain invariant. Every mutation to a data structure needs to maintain its invariant.

Merge sort. Create sorted sub-sequences at the smallest scale (e.g. two elements). Then merge two sorted sub-sequences into a bigger one, and continue. The invariant of sorted-ness grows up to the whole sequence. Quick sort. Select a pivot. Then partition the sequence into a part that's smaller than the pivot and a part that's larger than the pivot (and a part that equals the pivot). By partitioning, it creates the invariant LeftPartElements < Pivot < RightPartElements. By recursively creating such invariants down to the smallest scale (individual elements), the whole sequence is sorted. Binary search tree.
It creates the invariant LeftSubtreeElements ≤ ParentNode ≤ RightSubtreeElements. When there is only one node, the invariant is produced at the smallest scale. Every insertion then follows that invariant and then grows and maintains that invariant. Dijkstra's algorithm. The visited nodes are the nodes whose shortest path from the source node is known. By using the nodes whose shortest path we know, it "expands" on the graph, learning new nodes' shortest paths from the source. The algorithm iteratively adds new nodes into the invariant, until it expands to the destination node. Dynamic programming. The problem is separated into sub-problems. There is no cyclic dependency between sub-problems. One problem's result can be quickly calculated from sub-problems' results (e.g. max, min).

Querying a hash map can skip data because hash(a) ≠ hash(b) implies a ≠ b. Querying an ordered search tree can skip data because (a < b) ∧ (b < c) implies a < c. Parallelization often utilizes associativity: a * (b * c) = (a * b) * c. For example, a * (b * (c * d)) = (a * b) * (c * d), where a * b and c * d don't depend on each other and can be computed in parallel. Examples: sum, product, max, min, max-by, min-by, list concat, set union, function composition, logical and, logical or. (Associativity with identity is a monoid.)

User names cannot duplicate. Bank account balance should never be negative. No over-spend. Product inventory count should never be negative. No over-sell. One room in a hotel cannot be booked two times with time overlap.
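The associativity-based parallel reduction described above can be sketched with scoped threads (a minimal example; real implementations split into many more chunks):

```rust
use std::thread;

// Sketch: associativity lets us split a fold into independent halves and
// combine the partial results — the basis of parallel reduction.
// (a + b) + (c + d): the two halves don't depend on each other.
fn parallel_sum(data: &[i64]) -> i64 {
    let (left, right) = data.split_at(data.len() / 2);
    thread::scope(|s| {
        let l = s.spawn(|| left.iter().sum::<i64>());
        let r = s.spawn(|| right.iter().sum::<i64>());
        l.join().unwrap() + r.join().unwrap()
    })
}
```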
The product should be shipped after the order is paid. No lost notifications or duplicated notifications. The user can't view or change information that's outside their permission. A user cannot use a functionality if their subscription ends.

The redundant data, derived data and acceleration data structures (index, cache) should stay consistent with the base data (source-of-truth). The client-side data should be consistent with the server-side data. Memory safety invariants. A pointer should point to valid data. Should not use-after-free. Only free once. etc. Should free unused memory. Thread safety invariants. Many operations involve 3 stages: read-compute-write. They malfunction when another thread mutates between reading data and writing data: the previous read result is no longer valid. Some operations involve more stages (many reads and writes). If the partially-modified state is exposed, the invariant is also violated. Some non-thread-safe data structures should not be shared between threads. Some non-thread-safe data structures must be accessed under a lock. The modification of some action should be cancelled after a subsequent action fails. (Ad-hoc transaction, application-managed rollback.)

Immediate invariant maintenance Delayed invariant maintenance (tolerating stale data: cache, batched processing) The database/framework/OS/language etc. is responsible for maintaining the invariant. For example, the database maintains the validity of indexes and materialized views. If they don't have bugs, the invariant won't be violated. The application code is responsible for maintaining the invariant. This is more error-prone .

Make the derived data always compute-on-demand . No longer need to manually maintain the invariant, but it may cost performance. Caching of immutable compute results can make it faster, while still maintaining the semantics of compute-on-demand. Store the derived data as mutable state and manually keep consistency with the source-of-truth. This is the most error-prone solution.
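The earlier idea of encapsulating the invariant-maintaining code can be sketched for exactly this derived-data case (the `SummedVec` type is illustrative): the derived value lives behind an API where every mutation updates it, so no caller can make base and derived data inconsistent.

```rust
// Sketch: derived data (a cached sum) kept consistent with base data by
// encapsulation — every mutating method updates the derived value.
pub struct SummedVec {
    items: Vec<i64>, // base data (source of truth), private
    sum: i64,        // derived data, maintained as an invariant
}

impl SummedVec {
    pub fn new() -> Self {
        SummedVec { items: Vec::new(), sum: 0 }
    }

    pub fn push(&mut self, v: i64) {
        self.items.push(v);
        self.sum += v; // "notify" the derived data on every modification
    }

    pub fn pop(&mut self) -> Option<i64> {
        let v = self.items.pop()?;
        self.sum -= v;
        Some(v)
    }

    pub fn sum(&self) -> i64 {
        self.sum // O(1), instead of recomputing on demand
    }
}
```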
All modifications to base data should "notify" the derived data to update accordingly. Sometimes notifying is calling a function. Sometimes notifying involves networking. A more complex case: the derived data needs to be modified in a way that reflects back to the base data. This violates single source-of-truth . It's even more error-prone. Even more complex: the client side needs to reduce visible latency by predicting server-side data, and wrong predictions need to be corrected by server-side data. It not only violates single source-of-truth but also often requires rollback and replay mechanisms. Relying on other tools (database/framework/OS/language etc.) to maintain the invariant, as previously mentioned.

Most users use few common features of a software. Most complexity (and bugs) come from very few feature requirements. Most CPU time is spent executing a little hot code. Most data accesses target a little hot data (caches require this to be effective). Most branches are biased to one side in execution (branch prediction requires this to be effective). Most developers use few languages, libraries and frameworks. (Matthew effect of ecosystems) Most code and development effort goes to fixing edge cases. Little code and development effort is spent on main-case handling. 7 Most bugs that users see are caused by a few easy-to-trigger bugs. Only a small portion of transistors in hardware are used most of the time. Many transistors are rarely used. (Many transistors are for rarely-used instructions. Hardware defects related to them have a higher probability of evading tests. See also: Silent Data Corruptions at Scale , Cores that don't count )

Branch prediction assumes that it will execute the high-probability branch. If it predicts wrongly, speculative execution rolls back. Caches assume accesses hit hot data. If an access falls outside the hot data, the cache is not hit. Optimistic concurrency control assumes there will be no concurrency conflict. If there is a conflict, it rolls back and retries.
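Optimistic concurrency's read-compute-write-retry loop can be sketched with an atomic compare-exchange (the `optimistic_double` function is illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of optimistic concurrency control: read, compute, then write only
// if nobody changed the value in between; on conflict, retry.
fn optimistic_double(counter: &AtomicU64) {
    loop {
        let observed = counter.load(Ordering::Acquire); // read
        let desired = observed * 2;                     // compute
        match counter.compare_exchange(                 // write if unchanged
            observed,
            desired,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return,
            Err(_) => continue, // another thread won the race: retry
        }
    }
}
```

Under low contention this avoids holding a lock across the whole read-compute-write; under heavy contention the retries dominate, matching the tradeoff described above.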
It requires less waiting and communication than pessimistic concurrency control (locking), unless there is much contention.

Computation-data duality: Factory pattern. Turn object creation code into a factory object. Prototype pattern. Turn object creation code into copying a prototype. Chain of Responsibility pattern. Multiple processor objects process command objects in a pipeline. Command pattern. Turn a command (action) into an object. Interpreter pattern. Interpret data as computation. Iterator pattern / generator. Turn iteration code into a state machine. Strategy pattern. Turn a strategy into an object. Observer pattern. Turn event handling code into an observer. Mutation-data duality: Command pattern. It also involves computation-data duality. The command can represent mutation, action and computation. Partial computation and multi-stage computation: Builder pattern. Turn the process of creating an object into multiple stages. Each stage builds a part. Chain of Responsibility pattern. The processing is separated into a pipeline. It's also in computation-data duality. Generalized view: Adapter pattern. View one interface as another interface. Bridge pattern. View the underlying different implementations as one interface. Composite pattern. View multiple objects as one object. Decorator pattern. Wrap an object, changing its behavior while viewing it as the same object. Facade pattern. View multiple complex interfaces as one simple interface. Proxy pattern. A proxy object provides a view into other things. Template method pattern. View different implementations as the same interface methods. Flyweight pattern. Save memory by sharing common data. View shared data as owned data. Invariant production, growth and maintenance: There is no GoF pattern that tightly corresponds to it. Visitor pattern. Can be replaced by pattern matching. State pattern. Make state a polymorphic object. Memento pattern. Back up the state to allow rollback.
It exists mainly because in OOP data is tied to behavior. The separate memento is just data, decoupled from behavior. The memento pattern doesn't involve turning mutation into data, so it's not mutation-data duality. Singleton pattern. It's similar to a global variable, but can be late-initialized, can be polymorphic, etc. Mediator pattern. One abstraction to centrally manage other abstractions.

It's possible to use separate threads for suspendable computation. However, OS threads are expensive and context switching is expensive. A manually-implemented state machine is faster. ↩ It's possible to use polling to fully avoid system calls after initial setup, but with costs. ↩ There are optimizations. To avoid storing a unique ID for each character, it can store many immutable text blocks and use as character ID. Consecutive insertions and deletions can be merged. The tree keeps growing and needs to be merged. The exact implementation is complex. ↩ Another way of collaborative text editing is Operational Transformation . It uses as cursor. The server can transform a cursor to the latest version of the document: if there are insertions before , the increments accordingly. If there are deletions, it decreases accordingly. This is also called index rebasing. ↩ Also: In hard disks, magnetic fields are viewed as bits. In CDs, the pits and lands are viewed as bits. In SSDs, the electron's position in the floating gate is viewed as bits. In fiber optics, light pulses are viewed as bits. In quantum computers, the quantum state (like the spin of an electron) can be viewed as bits. ...... ↩ Specifically, bytes are viewed as code units, code units are viewed as code points, and code points are viewed as strings. Code points can also be viewed as grapheme clusters. ↩ For beginners, a common misconception is that "if the software shows things on screen, then it's 90% done". In reality, a proof-of-concept is often just 20% done. There are so many corner cases in real usage. Not handling one corner case is a bug.
Most code is for handling corner cases, not common cases. Although each specific corner case has a small probability of triggering, triggering any of the many corner cases is high-probability. ↩

qouteall notes 3 months ago

How to Avoid Fighting Rust Borrow Checker

The 3 important facts in Rust: First, consider the reference 2 shape of your in-memory data. Most fighting with the borrow checker happens in the borrow-check-unfriendly cases . The solutions in borrow-checker-unfriendly cases (elaborated below): The contagious borrow issue is a very common and important source of frustration in Rust, especially for beginners. The previously mentioned two important facts: A simple example: Compile error: (That example is just for illustrating the contagious borrow issue. The is analogous to a complex state that exists in real applications. Same for subsequent examples. Just summing integers doesn't necessarily need a mutable field. Simple integer mutable state can be worked around using .)

This code is totally memory-safe: the only touches the field, and only touches the field. They work on separate data, but the borrow checker thinks they overlap, because of the contagious borrow : you just want to borrow one field, but are forced to borrow the whole object. What if I just inline and ? Then it compiles fine: Why does that compile? Because it does a split borrow : the compiler sees the borrowing of individual fields in one function ( ), and doesn't do a contagious borrow. The deeper cause is that: To summarize, solutions (workarounds) for the contagious borrow issue (elaborated below):

Another solution is to treat mutation as data . To mutate something, append a mutation command into a command queue . Then execute the mutation commands at once. (Note that a command should not indirectly borrow base data.) What if I need the latest state before executing the commands in the queue? Then inspect both the command queue and base data to get the latest state ( LSM trees do similar things). You can often avoid needing to get the latest state during processing by separating it into multiple stages. The previous code rewritten using deferred mutation: Deferred mutation is not "just a workaround for the borrow checker".
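A minimal sketch of the contagious borrow vs. split borrow behavior (the `App` type and its fields are illustrative, not the original post's code):

```rust
// Hypothetical state: a container field plus a separate accumulator field.
struct App {
    items: Vec<i64>,
    total: i64,
}

impl App {
    // A method mutably borrows the *whole* object through `&mut self`.
    fn add_to_total(&mut self, v: i64) {
        self.total += v;
    }
}

fn sum_into_total(app: &mut App) {
    // ERROR if uncommented: the loop immutably borrows all of `app` through
    // the method-free access pattern below replaced with a method call —
    // `app.add_to_total(..)` would need to mutably borrow `app` while
    // `&app.items` is still borrowed:
    // for v in &app.items { app.add_to_total(*v); }

    // Split borrow: accessing the *fields* directly in one function lets the
    // compiler see that `items` and `total` never overlap.
    for v in &app.items {
        app.total += *v;
    }
}

fn main() {
    let mut app = App { items: vec![1, 2, 3], total: 0 };
    sum_into_total(&mut app);
    app.add_to_total(10); // fine here: no other borrow of `app` is live
    assert_eq!(app.total, 16);
}
```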
Treating mutation as data also has other benefits: Other applications of the idea of mutation-as-data: The previous problem occurs partially due to mutable borrow exclusiveness. If all borrows are immutable, then contagious borrow is usually not a problem. The common way of avoiding mutation is mutate-by-recreate : all data is immutable. When you want to mutate something, you create a new version of it. Just like in pure functional languages (e.g. Haskell). Unfortunately, mutate-by-recreate is also contagious : if you recreated a new version of a child, you need to also recreate a new version of the parent that holds the new child, and the parent's parent, and so on. There are abstractions like lenses to make this kind of cascading recreation more convenient. Mutate-by-recreate can be useful for cases like: Persistent data structures : they share unchanged sub-structure (structural sharing) to make mutate-by-recreate faster. Some crates of persistent data structures: rpds , im , pvec . Example of mutating a hash map while looping on a clone of it, using rpds :

Split borrow of fields in a struct : As previously mentioned, if you separately borrow two fields of a struct within one scope (e.g. a function), Rust will do a split borrow. This can solve the contagious borrow issue. Getter and setter functions break split borrow, because borrowing information becomes coarse-grained in the function signature. Contagious borrow can also happen in containers. If you borrow one element of a container, then another element cannot be mutably borrowed. How to split borrow a container:

Looping on a container is very common. Rust provides concise container for-loop syntax . However, it has an implicit iterator that keeps borrowing the whole container . One solution is to manually manage the index (key) in the loop, without using an iterator. For a or slice, you can make the index a mutable local variable, then use a while loop to traverse the array. The previous example rewritten using a manual loop: It calls many times.
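A sketch of the manual-index loop idea: each indexing operation borrows the container only briefly, so mutation inside the loop is allowed (the `remove_negatives` function is illustrative):

```rust
// Sketch: looping by manual index instead of an iterator, so the Vec is only
// borrowed for a moment on each access and can be mutated inside the loop.
fn remove_negatives(v: &mut Vec<i64>) {
    let mut i = 0;
    while i < v.len() {
        // Copy the element out; the short borrow from indexing ends here.
        let x = v[i];
        if x < 0 {
            v.remove(i); // mutation is fine: no outstanding borrow
        } else {
            i += 1;
        }
    }
}
```

With `for x in &*v { ... }` the iterator would keep an immutable borrow of the whole `Vec` alive across the loop body, and `v.remove(i)` would not compile.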
Each time, the resulting borrow is kept for only a short time. After copying the field of the element, it stops borrowing the element, which then indirectly stops borrowing the parent. Note that it requires stopping borrowing the element before doing mutation . That example copies an integer so it can stop borrowing the child. For other large data structures, you also need copying/cloning to stop borrowing the element (reference counting and persistent data structures can reduce the cost of cloning). (Rust doesn't have C-style for loops .) The similar thing can be done in . We can get the minimum key, then iteratively get the next key. This allows looping on without keeping borrowing it. Example of mutating a when looping on it. Note that it requires copying/cloning the key, and stopping borrowing the element before mutating. That way doesn't work for . doesn't preserve order and doesn't allow getting the next key. But that way can work on indexmap 's , which allows getting a key by integer index (it internally uses an array, so removing or adding in the middle is not fast).

Cloning data can avoid keeping borrowing the data. For immutable data, wrapping in ( ) then cloning can work: The previous example rewritten by wrapping the container in then using a for-loop: For mutable data, to make cloning and mutation more efficient, the previously mentioned persistent data structures can be used. If the data is small, deep cloning is usually fine. If it's not in hot code, deep cloning is also usually fine.

Some may argue that "circular reference is a bad thing; look how much trouble circular references create in mathematics": Circular proof: if A then B, if B then A. A circular proof is wrong. It can prove neither A nor B. The set that indirectly includes itself causes Russell's paradox : Let R be the set of all sets that are not members of themselves. R contains R deduces that R should not contain R, and vice versa. Set theory carefully avoids circular reference.
The Halting problem is proved impossible to solve by using circular reference: Assume there exists a function , which takes in a and data, and outputs a boolean telling whether will eventually halt. Then construct a paradox program : Then will cause a paradox. If it returns true, then halts, but by 's definition it should loop forever. Rice's theorem is an extension of the Halting problem: all non-trivial semantic properties of programs are undecidable (including whether it eventually halts). There is something in common between the Halting problem, Russell's paradox and Gödel's incompleteness theorem: they all self-reference and "negate" themselves, causing a paradox.

Circular references being bad in mathematics does NOT mean they are also bad in programming. The circular references in math theories are different from circular references in data. There are many valid cases of circular references in programming (e.g. there are doubly-linked lists running in the Linux kernel that work fine). But circular references do add risks to memory management:

Here are some common use cases of circular reference: In case 1 above: The child references the parent, just for convenience, in OOP code. If you just have a reference to a child object, and you want to use some data in the parent, the child referencing the parent would be convenient. Without it, the parent also needs to be passed as an argument. That convenience in OOP languages will lead to trouble in Rust. It's recommended to pass extra arguments instead of having circular references. Note that due to the previously mentioned contagious borrow issue, you cannot mutably borrow child and parent at the same time (except using interior mutability). The workaround is to 1. do a split borrow on the parent and pass the individual components of the parent (pass more arguments and be more verbose than in other languages) 2. use interior mutability (e.g. , , ). Observer pattern is commonly used in GUIs and other dynamic reactive systems.
If the parent wants to be notified when some event happens on the child, the parent registers a callback on the child, and the child calls the callback when the event happens. However, the callback function object often has to reference the parent (because it needs to use the parent's data). Then it creates a circular reference: parent references child, child references callback, callback references parent , as mentioned previously in case 2. Use reference counting and interior mutability. The classical ones: (single-threaded), (multi-threaded). I also recommend using (elaborated below): , . The back-reference (callback to parent, child to parent) should use to avoid a memory leak. Use an event bus to replace callbacks. Similar to the previous deferred mutation, we turn events into data. Each component listens to a specific "event channel" or "event topic". When something happens, put the event into the event bus, then the event bus notifies components. Use ID/handle to replace borrow (elaborated later). In the previously mentioned case 3 and case 4, circular reference is needed in the data structure. Self-reference means a struct contains an interior pointer to another part of data that it owns. Zero-cost self-reference requires and . A normal Rust mutable borrow allows moving the value out (by , or , etc.). disallows that, as the self-reference pointer can be invalidated by moving. They are complex and hard to use. Workarounds include separating the child and using reference counting so it's no longer self-referential. Data-oriented design: Some may think that using handle/ID is "just a nasty workaround caused by the borrow checker". However, in GC languages, using IDs to refer to objects is also common, as a reference cannot be saved to a database or sent via network. 5 One kind of arena is slotmap : Other map structures, like or , can also be arenas. If no element is ever removed from the arena, then a can also be an arena.
The important things about arena: Some may think "using arena cannot protect you from the equivalent of 'use after free', so it doesn't solve the problem". But arenas can greatly improve the determinism of bugs, making debugging much easier. A randomly occurring memory safety Heisenbug may no longer trigger when you enable a sanitizer, as the sanitizer can change memory layout. Entity component system (ECS) is a way of organizing data that's different from OOP. In OOP, an object's fields are laid out together in memory. But in ECS, each object is separated into components. The same kind of component for different entities is managed together (often laid out together in memory). It can improve cache-friendliness. (Note that performance is affected by many factors and depends on the exact case.) ECS also favors composition over inheritance. Inheritance tends to bind code with specific types that cannot easily compose. (For example, in an OOP game, extends , extends . There is a special enemy that ignores collision and also extends . But one day if you want to add a new player skill that temporarily ignores collision like , you cannot make extend and have to duplicate code. In ECS that can be solved by just combining a special collision component.) The concept of generalized reference : Generalized references are separated into two kinds: strong and weak: Strong generalized reference: The system ensures it always points to a living object . It includes: normal references in GC languages (when not null), Rust borrow and ownership, strong reference counting ( , , when not null), and IDs in a database with a foreign key constraint. Weak generalized reference: The system does NOT ensure it points to a living object . It includes: IDs (no foreign key constraint), handles, weak references in GC languages, weak reference counting ( , ).
The major differences: If you want to design an abstraction that decouples object lifetime from how these objects are referenced, it's recommended to either: As previously mentioned, Rust has mutable borrow exclusiveness : That is also called "mutation xor sharing", as mutation and sharing cannot co-exist. In the multi-threading case, this is natural: multiple threads reading the same immutable data is fine. Once one thread mutates the data, other threads cannot safely read or write it without other synchronization (atomics, locks, etc.). But in the single-threaded case, this restriction is not natural at all. No mainstream language (other than Rust) has this restriction. Mutation xor sharing  is, in some sense, neither necessary nor sufficient. It’s not  necessary  because there are many programs (like every program written in Java) that share data like crazy and yet still work fine. It’s also not  sufficient  in that there are many problems that demand some amount of sharing – which is why Rust has “backdoors” like  ,  , and—the ultimate backdoor of them all— . - The borrow checker within Mutable borrow exclusiveness is still important for the safety of interior pointers, even in single thread: Rust has interior pointers . Interior pointers are pointers that point to data inside another object. A mutation can invalidate the memory layout that an interior pointer points to . For example, you can take a pointer to an element in . If the grows, it may allocate new memory and copy the existing data to the new memory, thus an interior pointer into it can become invalid (breaking the memory layout that the interior pointer points to). Mutable borrow exclusiveness can prevent this issue from happening: Compile error: Another example is about : an interior pointer pointing inside can also be invalidated, because different enum variants have different memory layouts. In one layout the first 8 bytes are an integer; in another layout the first 8 bytes may be a pointer.
Treating an arbitrary integer as a pointer is definitely not memory-safe. Compile error: Note that sometimes mutating can keep interior pointers valid. For example, changing an element in doesn't invalidate interior pointers to elements, because there is no memory layout change. But Rust by default prevents all mutation when an interior pointer exists (unless using interior mutability). Golang also supports interior pointers, but doesn't have such a restriction. For example, an interior pointer into a slice: Because after re-allocating the slice, the old slice still exists in memory (not immediately freed). If there is an interior pointer into the old slice, the old slice won't be freed by GC. The interior pointer will always be memory-safe (but may point to stale data). Golang also doesn't have sum types, so there is no equivalent to the enum memory layout change in the previous Rust example. Also, Golang's doesn't allow taking an interior pointer to a map entry value, but Rust allows it. Rust's interior pointer is more powerful than Golang's. In Java, there is no interior pointer, so no memory safety issue caused by interior pointers. But in Java there is one thing logically similar to an interior pointer: . Mutating a container can cause iterator invalidation: That will get . Java's has an internal version counter that's incremented every time it changes. The iterator code checks for concurrent modification using the version counter. Even without the version check, it would still be memory-safe because array access is range-checked. Note that the container for-loop in Java internally uses an iterator (except for raw arrays). Inserting into or removing from the container while for-looping can also cause iterator invalidation. Note that iterator invalidation is a logic error , no matter whether it's memory-safe or not. In Java, you can remove an element via the iterator, then the iterator updates together with the container and is no longer invalidated. Or use , which avoids managing the iterator.
Mutable borrow exclusiveness is still important in the single-threaded case, because of interior pointers. But if we don't use any interior pointer, then mutable borrow exclusiveness is not necessary for memory safety in the single-threaded case . That's why mainstream languages have no mutable borrow exclusiveness, and still work fine in the single-threaded case. Java, JS and Python have no interior pointers. Golang and C# have interior pointers, but they have GC and restrict interior pointers, so memory safety is still kept without mutable borrow exclusiveness. Rust's mutable borrow exclusiveness creates a lot of trouble in single-threaded cases. But it also has benefits (even in single-threaded cases): Mutable borrow exclusiveness is overly restrictive. It is not necessary for memory safety in single-threaded code when not using interior pointers. There is interior mutability , which allows getting rid of that constraint. Interior mutability allows you to mutate something through an immutable reference to it. (Because of that, an immutable borrow doesn't necessarily mean the pointed-to data is actually immutable. This can cause some confusion.) Ways of interior mutability: They are usually used inside reference counting ( , ). In the previous contagious borrow case, wrapping the parent in can make the code compile. However it doesn't fix the issue. It just turns the compile error into a runtime panic : It will panic with error. still follows the mutable borrow exclusiveness rule, just checked at runtime, not compile time. Borrowing one field inside still borrows the whole . Wrapping the parent in cannot fix contagious borrow, but putting individual children into can work, as it makes borrowing more fine-grained . See also: Dynamic borrow checking causes unexpected crashes after refactorings Rust assumes that, if you have a mutable borrow , you can use it at any time. But holding a reference is different from using the reference .
There are use cases where I have two mutable references to the same object, but only use one at a time. This is the use case that solves. Another problem: The borrow taken from cannot be directly returned. Compile error: Because the borrow obtained from is not a normal borrow; it's actually a . implements , so it can be used similarly to a normal borrow. The "help: consider borrowing here" suggestion won't solve the compile error. Don't blindly follow the compiler's suggestions. One solution is to return , instead of returning . Returning or returning ) can also work, but they are not recommended. and allow freely copying references and freely mutating things, just like in other languages. Finally we "get rid of the shackles of the borrow checker". However, there are traps : has an internal ID. is also an ID. You can only use a via an that has a matching ID. Borrowing the "centralizes" the borrowing of the many s associated with it, ensuring mutable borrow exclusiveness. Using it requires passing a borrow of as an argument everywhere it's used. will fail to borrow if the owner ID doesn't match. Different from , if the owner ID matches, it won't panic just because of nested borrows. Its runtime cost is low. When borrowing, it just checks whether the cell's ID matches the owner's ID. It has a memory cost of one owner ID per cell. One advantage of is that a duplicated borrow will be a compile-time error instead of a runtime panic, which helps catch errors earlier. If I change the previous panic example into : Compile error: It turns a runtime panic into a compile error, which makes problems discoverable earlier. GPUI 's is similar to , where GPUI's corresponds to . It can also work in multithreading, by having . This can allow one lock to protect many pieces of data in different places 7 . Ghost cell and LCell are similar to QCell, but use a closure lifetime as the owner ID. They are zero-cost, but more restrictive (the owner is tied to a closure scope, cannot be dynamically created, and cannot outlive the closure).
Re-entrant lock means one thread can take a lock, then take it again, then unlock twice, without deadlocking. Rust's lock is not re-entrant. (Rust's lock is also responsible for keeping mutable borrow exclusiveness. Allowing re-entrancy could produce two s for the same object.) For example, in Java, this two-layer locking doesn't deadlock: But in Rust the equivalent will deadlock: It prints then deadlocks. In Rust, it's important to be clear about which scope holds the lock . Golang's lock is also not re-entrant. Another important thing is that Rust only unlocks at the end of the scope by default. gives a . implements , so it will be dropped at the end of the scope. This is different from borrows of values whose types don't implement , which end after their last use (unless extended). This is called NLL (non-lexical lifetimes). uses atomic operations to change its reference count. (Cloning and dropping change the reference count, but borrowing doesn't.) However, when many threads frequently change the same atomic counter, performance can degrade. The more threads touching it, the slower it is. Modern CPUs use a cache coherency protocol (e.g. MOESI ). Atomic operations often require the CPU core to hold "exclusive ownership" of the cache line (this may vary between different hardware). Many threads frequently doing so causes cache contention, similar to locking, but in hardware. Example 1 , Example 2 Atomic reference counting is still fast if not contended (mostly only one thread changing the reference count). Atomic reference counting is faster on Apple silicon than on Intel CPUs. 8 These deferred memory reclamation techniques (hazard pointers, epoch-based) are also used in lock-free data structures. If one thread can read an element while another thread removes and frees the same element in parallel, it will not be memory-safe (this issue doesn't exist in GC languages). There is some ambiguity in the word "GC". Some say reference counting is GC, some say it isn't.
No matter what the definition of "GC" is, reference counting is different from tracing GC (in Java/JS/C#/Golang/etc.): bumpalo provides a bump allocator. In a bump allocator, allocation is fast because it usually just increases an integer. It supports quickly freeing the whole arena, but doesn't support freeing individual objects. It's usually faster than normal memory allocators. A normal memory allocator does a lot of bookkeeping work for each allocation and free: each individual memory region can be freed separately, these regions can be reused for later allocations, and this information needs to be recorded and updated. A bump allocator frees memory in a batched and deferred way. As it cannot free individual objects, it may temporarily consume more memory. A bump allocator is suitable for temporary objects, where you are sure that none of these temporary objects will be needed after the work completes. The function signature of allocation (changed for clarity): It takes an immutable borrow of (it has interior mutability). It moves into the bump-allocated memory region. It outputs a mutable borrow, having the same lifetime as the bump allocator. That lifetime ensures memory safety (the borrow of the allocated value cannot outlive the bump allocator). If you want to keep the borrow of the allocated result for a long time, then lifetime annotations are often required . In Rust, lifetime annotation is also contagious : Adding or removing a lifetime for one thing may involve refactoring much of the code that uses it, which can be a huge amount of work. Be careful in planning what lifetime parameters it needs. doesn't implement , so is not . It cannot be shared across threads (even if it could be shared, there would be lifetime constraints that force you to use structured concurrency). It's recommended to have a separate bump allocator in each thread, locally. By using unsafe you can freely manipulate pointers and are not restricted by the borrow checker.
But writing unsafe Rust is harder than just writing C, because you need to carefully avoid breaking the constraints that safe Rust code relies on . A bug in unsafe code can cause issues in safe code. Writing unsafe Rust correctly is hard. Here are some traps in unsafe: Modern compilers try to optimize as much as possible. To optimize as much as possible, the compiler makes as many assumptions as possible. Breaking any of these assumptions can lead to wrong optimizations. That's why it's so complex. See also Unfortunately Rust's syntax ergonomics on raw pointers are currently not good: The current borrow checker does coarse-grained analysis on branches. One branch's output's borrowing is contagious to another branch. Currently, this won't compile ( see also ): Because the first branch 's output value indirectly mutably borrows , the second branch also has to indirectly mutably borrow , which conflicts with another mutable borrow in scope. This will be fixed by the Polonius borrow checker. Currently (2025 Aug) it's available in nightly Rust and can be enabled by an option. Rust favors tree-shaped ownership. Each object is owned by exactly one place. If you send tree-shaped data to another thread, only one thread can access it, so it's thread-safe. No data race. Sending an immutable borrow to another thread is also fine as long as the shared data is actually immutable. But there are exceptions. One exception is interior mutability. Because of interior mutability, the data pointed to by an immutable borrow may no longer actually be immutable. So Rust prevents sharing and by making and not . If is then is . If is not then is not . This prevents and from being shared across threads. also has an internal shared mutable reference counter. It's not atomic, so it cannot be passed between threads. It's neither nor . But can be shared because the lock protects the data. Also, immutable references to atomic cells like can be shared because of atomicity. and are , so and are .
There are things that are but not , like . If something is already locked, sharing its reference with other threads temporarily is fine. But moving a to another thread is not fine because locking is tied to the thread. Tokio is a popular async runtime. In Tokio, submitting a task requires the future to be and . means it's standalone (self-owned). It doesn't borrow temporary things. It can borrow global values (global values live for as long as the program runs). It cannot borrow a value that only temporarily exists. The spawned future may be kept for a long time. It's not determined whether the future will only live temporarily within a scope. So the future needs to be . tokio_scoped allows submitting a future that's not , but it must finish within a scope. If the future needs to share data with the outside, pass into it (not ). Note that the "static" in C/C++/Java/C# often means a global variable. But in Rust its meaning is different. means that the future can be sent across threads. Tokio uses work-stealing, which means that one thread's task can be stolen by other threads that currently have no work. is not needed if the async runtime doesn't move futures between threads. Rust converts an async function into a state machine, which is the future. In an async function, the local variables that are used across points will become fields of the future. If the future is required to be then these local variables also need to be . In C and GC languages: But in Rust it's different. Normally a mutable borrow can only be moved and cannot be copied. But reborrow is a feature that sometimes allows you to use a mutable borrow multiple times. Reborrow is very common in real-world Rust code. Reborrow is not explicitly documented . See also That works. Rust will implicitly treat the first as so that is not moved into and becomes usable again. But extracting the second into a local variable early makes it not compile: requires the future to be standalone and not borrow other things ( ).
Passing an (that's used later) into a moved closure makes the closure borrow the . Data that contains a borrow is not standalone (not ). Compile error Manually cloning the and putting it into a local variable works. It makes the cloned version move into the future: Note that inlining the local variable makes it not compile: There is a proposal on improving the syntax ergonomics of this. Rust prefers data-oriented design. Rust doesn't fit OOP. Rust dislikes sharing mutable data. Rust dislikes circular reference. Rust is unfriendly to the observer pattern . Getters and setters can easily cause the contagious borrow issue. Sharing and mutation have many limitations. Rust is less flexible and does not suit quick iteration (unless you are a Rust expert). Rust's constraints apply to both humans and AI . In a large C/C++ codebase, both humans and AI can accidentally break memory safety and thread safety in non-obvious ways . Rust can protect against that. Popular open source projects are often flooded with AI-generated PRs. Rust makes reviewing PRs easier: as long as CI passes and it doesn't use , it won't break memory and thread safety. Note that Rust doesn't protect against many kinds of logic errors. The native Rust ownership relations form a tree. Reference counting ( , ) allows shared ownership. ↩ Note that "reference" here means reference in the general OOP context (where there is no distinction between owning and non-owning references; think of references in Java/C#/JS/Python). This is different from the Rust reference. I will use "borrow" for Rust references in this article. ↩ Specifically, Gödel encodes symbols, statements and proofs into integers, called Gödel numbers. There exist many ways of encoding symbols/statements/proofs as data, and the exact way is not important. For simplicity, I will treat them all as data, and ignore the conversion between data and symbols/statements/proofs. ↩ Here is symbol substitution.
replacing the free variable with , while avoiding giving two different variables the same name, by renaming when necessary. It's also similar to the Y combinator: . In that case , , , is a fixed point of : . , ↩ Having both IDs and object references introduces friction: translating between ID and object reference. Some ORMs will malfunction if there exist two objects with the same primary key. ↩ Each slotmap ensures key uniqueness, but if you mix keys of different slotmaps, keys from different slotmaps may duplicate. Using the wrong key may successfully get an element that is logically wrong. ↩ Sometimes, fine-grained locks are slower because of more lock/unlock operations. But sometimes fine-grained locks are faster because they allow higher parallelism. Sometimes fine-grained locks can cause a deadlock where a coarse-grained lock won't. It depends on the exact case. ↩ See also . That was in 2020. Unsure whether it has changed since. One possible reason is that ARM allows weaker memory ordering than x86. Also, Swift and Objective-C use reference counting almost everywhere, so possibly Apple paid more effort to optimizing atomic reference counting. ↩ Tracing GC is faster for short-lived programs (such as some CLI programs and serverless functions), because there's no need to free memory for individual objects on exit. Example: My JavaScript is Faster than Your Rust . The same optimization is also achievable in Rust, but requires extra work (e.g. , bump allocator). ↩ It lags because it needs to do many counter decrements and deallocations for each individual object. This can be worked around by sending the to another thread and dropping it in that thread. Also, for deep structures, dropping may overflow the stack. ↩ Contended atomic operations (many threads touching one atomic value at the same time) are much slower than when not contended. Its cost also includes memory block allocation and freeing. ↩ GC frequency is roughly proportional to allocation speed divided by free memory.
In generational GC, a minor GC only scans the young generation, whose cost is roughly proportional to the count of living young-generation objects. But it still needs to occasionally do a full GC. ↩ When running in tools like Miri , pointer provenance is tracked at runtime. ↩ will not execute if returns true. will not execute if returns false. ↩ Heisenbugs may only trigger in release builds: not in debug builds, not when sanitizers are on, not when logging is on, not when a debugger is attached. That's because optimizations, sanitizers, debuggers and logging can change timing and memory layout, which can make a memory safety or thread safety bug no longer trigger. Debugging a Heisenbug in a large codebase may take weeks or even months. Note that not all memory/thread safety bugs are Heisenbugs. Many are still easy to trigger. ↩ Golang is not memory-safe under data race. ↩ Tree-shaped ownership . In Rust's ownership system, one object can own many children or none, but must be owned by exactly one parent . Ownership relations form a tree. 1 Mutable borrow exclusiveness . If there exists one mutable borrow of an object, then no other borrow of that object can exist. A mutable borrow is exclusive. Borrow is contagious . If you borrow a child, you indirectly borrow the parent (and the parent's parent, and so on). Mutably borrowing one wheel of a car makes you borrow the whole car, preventing another wheel from being borrowed. It can be avoided by split borrow , which only works within one scope. If the references are tree-shaped, then it's simple and natural in Rust. If the reference shape has sharing , things become a little complicated. Sharing means there are two or more references to the same object. If the shared object is immutable: If the sharing is scoped (only temporarily shared), then you can use an immutable borrow. You may need lifetime annotations.
If the sharing is not scoped (it may be shared for a long time, not bounded within a scope), you need to use reference counting ( in the single-threaded case, in the possibly-multithreaded case) If the shared object is mutable, then it's a borrow-check-unfriendly case . Solutions are elaborated below. Contagious borrow can cause unwanted sharing (elaborated below). If the reference shape has a cycle , then it's also a borrow-check-unfriendly case . Solutions are elaborated below. Data-oriented design. Avoid unnecessary getters and setters . Do a split borrow in the outer scope and pass the borrow of each component separately. Use ID/handle to replace borrow . Use an arena to hold data. Defer mutation . Turn mutations into commands and execute them later. Avoid in-place mutation . Mutate-by-recreate. Use to share immutable data. Use persistent data structures . For circular reference: For graph data structures, use ID/handle and an arena. For callbacks, use an event bus or (use to cut the cycle) Borrow as temporarily as possible. For example, replace a container for-loop with a raw index loop. (only use when really necessary) and raw pointers (only use when really necessary) Mutable borrow exclusiveness . If you mutably borrow one object, others cannot borrow it. Borrow is contagious . If you borrow a child, you indirectly borrow the parent (and the parent's parent, and so on), which contagiously borrows other children of the same parent. Mutably borrowing one wheel of a car makes you borrow the whole car, including all 4 wheels; then the wheels that you don't use cannot be borrowed. This doesn't happen under split borrow. In , although the method body just borrows a field, the return value indirectly borrows the whole , not just one field. In , the function body only mutably borrows a field, but the argument borrows the whole , not just one field. Inside the loop, one immutable borrow of the whole and one mutable borrow of the whole overlap in lifetime.
Borrow checker works locally : when seeing a function call, it only checks the function signature , instead of checking the code inside the function. (The benefit is to make borrow checking faster and simpler. Doing whole-program analysis is hard and slow, and doesn't work with things like dynamic linking.) Information is lost in the function signature : the borrowing information becomes coarse-grained and is simplified in the function signature. The type system does not allow expressing borrowing of only one field, and can only express borrowing the whole object. There are proposed solutions: view types . Remove unnecessary getters and setters . Simply make fields public. This makes split borrow possible. If you want encapsulation, it's recommended to use ID/handle to replace borrows of mutable data (elaborated below). A getter that returns a cloned/copied value is fine. If the data is immutable, a getter is also fine. Defer mutation Avoid in-place mutation Do a split borrow in the outer scope. Or just get rid of the struct and pass fields as separate arguments. (This is inconvenient.) Manually manage the index (or key) in container for-loops. Borrow as temporarily as possible. Just clone the data (can be a shallow clone). Use interior mutability (cells and locks). In the process of creating new commands, it only does immutable borrows of the base data, and only one mutable borrow of the command queue at a time. When executing the commands, it only does one mutable borrow of the base data, and one borrow of the command queue at a time. The mutations can be serialized, and sent via network or saved to disk. The mutations can be inspected for debugging and logging. You can post-process the command list, such as sorting and filtering. Easier parallelism . The process of generating mutation commands does not mutate the base data, so it can be parallelized. If the data is sharded, the execution of mutation commands can be dispatched to shards executing in parallel.
Transactional databases often use a write-ahead log (WAL) to help with the atomicity of transactions. The database writes all mutations into the WAL. Then after some time the mutations in the WAL are merged into the base data on disk. Event sourcing. Derive the latest state from events and a previous checkpoint. Distributed systems often use a consensus protocol (like Raft) to replicate a log (events, mutations). The mutable data is derived from the log and a previous checkpoint. The idea of turning operations into data is also adopted by io_uring and modern graphics APIs (Vulkan, Metal, WebGPU). The idea of turning mutation into insertion is also adopted by ClickHouse. In ClickHouse, direct mutation is not performant. Mutate-by-insert is faster, but querying requires aggregating both the old data and the new mutations. Safely sharing data in multithreading (read-copy-update (RCU), copy-on-write (COW)) Take snapshots and roll back efficiently For and slices, use Gödel's incompleteness theorem . Firstly encode symbols, statements and proofs into data 3 . The statements that contain free variables (e.g. x is a free variable in "x is an even number") can also be encoded (they can represent "functions" and even "higher-order functions"). allows determining whether a proof successfully proves a theory. Then is defined as whether there exists a that satisfies . Unprovable is the inverse of provable: Let .
Then let 4 , which creates a self-referential statement: means is not provable. If is true, then is not provable, then is false, which is a paradox. In C/C++, if two objects point to each other, when one object destructs, the other object's reference should be cleared, or there is a risk of use-after-free. When using reference counting, loops should be cut by weak references, or memory will leak. In GC languages, circular reference has a memory leak risk. If all children reference parents, and parents reference children, then referencing any child keeps the whole structure alive. Case 1: The parent references a child. The child references its parent, just for convenience. (Referencing the parent is not necessary; the parent can be passed as an argument) Case 2: The parent registers a callback on a child. When something happens on the child, the callback is called, and the parent does something. In that case, the parent references the child, the child references the callback, and the callback references the parent. Case 3: In a tree structure, the child referencing the parent allows getting the path from one node to the root node. Without it, you cannot get the path from just one node reference, and need to store variable-length path information. Case 4: The data is inherently a graph structure that can contain cycles. Case 5: The data requires self-reference. Use reference counting and interior mutability (previously mentioned).
This is recommended when there are many different types of components and you want to add new types easily (like in GUI). Use ID/handle to replace borrow (elaborated later). This is recommended when you want a more compact memory layout, and you rarely need to add new types into the data (suitable for data-intensive cases; can obtain better performance due to cache-friendliness). Try to pack data into contiguous arrays (instead of objects laid out sparsely, managed by the allocator). Use a handle (e.g. array index) or ID to replace references. Decouple object ID from memory address . An ID can be saved to disk and sent via network, but a pointer cannot (the same address cannot be used in another process or after a process restart, because there may be other data at the same address). The different fields of the same object don't necessarily need to be together in memory. One field of many objects can be put together (parallel array). Manage memory based on arenas . A slotmap is basically an array of elements, but each element has a version integer. Each handle (key) has an index and a version. It's -able data that's not restricted by the borrow checker. When accessing the slotmap, it firstly does a bounds check, then checks the version. After removing an element, the version increments. The previous handle cannot get the new element at the same index, because of version mismatch. Although memory safe, it still has the equivalent of "use-after-free": using a handle of an already-removed object cannot get an element from the slotmap 6 . Each get-element operation may fail. Note that a slotmap is not efficient when there is much unused empty space between elements. Slotmap offers two other variants for the sparse case. The borrow checker no longer ensures the ID/handle points to a living object . Each data access to the arena may fail. There is an equivalent of "use after free". Arenas still suffer from the contagious borrow issue . We need to borrow things as briefly as possible.
But when the arena contains a container and we want to for-loop over it, to avoid the cost of copying the container, borrowing for a long time is still necessary. If we want to change the arena while for-looping over that container, the contagious borrow issue appears. The previously mentioned deferred mutation can help . The reference in GC languages is a generalized reference. A pointer is a generalized reference. Borrowing in Rust is a generalized reference. Ownership in Rust is also considered a generalized reference. Smart pointers ( , , , in Rust, , , in C++, etc.) are generalized references. IDs are generalized references. (This includes all kinds of IDs, including handles , UUIDs, string IDs (URL, file path, username, etc.), integer IDs, primary keys, and all kinds of identification information). Strong generalized reference: The system ensures it always points to a living object . It includes: normal references in GC languages (when not null), Rust borrow and ownership, strong reference counting ( , , when not null), and IDs in a database with a foreign key constraint. Weak generalized reference: The system does NOT ensure it points to a living object . It includes: IDs (no foreign key constraint), handles, weak references in GC languages, weak reference counting ( , ). For weak generalized references, every data access may fail and requires error handling . (Just panicking is also a kind of error handling.) For strong generalized references, the lifetime of the referenced object is tightly coupled with the existence of the reference : In Rust, the coupling comes from the borrow checker. The borrow is limited by lifetime and other constraints. In GC languages, the coupling comes from GC. The existence of a strong reference keeps the object alive. Note that in GC languages there are live-but-unusable objects (e.g. Java is unusable after closing). In reference counting, the coupling of course comes from runtime reference counting. The foreign key constraint of an ID is enforced by the database.
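As a small illustration of that coupling for strong generalized references, reference counting keeps the object alive exactly as long as any strong handle exists (a minimal sketch using `Rc`):

```rust
use std::rc::Rc;

// Strong generalized references couple object lifetime to the
// existence of the references: the String below lives as long as
// any strong Rc handle to it lives.
fn main() {
    let a = Rc::new(String::from("data"));
    let b = Rc::clone(&a);
    assert_eq!(Rc::strong_count(&a), 2);

    drop(a);                 // one strong reference gone...
    assert_eq!(*b, "data");  // ...but `b` still keeps the object alive
    assert_eq!(Rc::strong_count(&b), 1);
} // the String is freed here, when the last strong handle drops
```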
For weak generalized references, the lifetime of the object is decoupled from references to it . Use weak generalized references, such as IDs and handles. The object can be freed without having to consider how its IDs are held. Use strong generalized references, but add a new usability state that's decoupled from the object lifetime . This is common in GC languages. Examples: In JS, if you send an to another web worker, the object can still be referenced and kept alive, but the binary content is no longer accessible from that object. In Java, the IO-related objects (e.g. ) can no longer be used after closing, even though these objects are still referenced and still alive. A mutable borrow to one object cannot co-exist with any other borrow to the same object. Two mutable borrows cannot co-exist. One mutable and one immutable also cannot co-exist. Multiple immutable borrows of one object can co-exist. Make the borrow more universal. In Rust, map keys and values can be borrowed. But in Golang you cannot take an interior pointer to a map key or value. This makes abstractions that work with borrows more general. Mutable borrows are exclusive, so Rust can emit the attribute to LLVM. means the pointed data cannot be accessed by other code, which means: It allows aggressively merging reads. Before the next write to it, it can be temporarily treated as constant. It allows aggressively merging writes. If there are two memory writes to it, the compiler can remove the first write and keep only the last write. It allows removing reads after a write, using the previous write as the read result. It allows aggressively reordering its reads/writes between other computation and other memory accesses. The above gives the compiler a lot of freedom in transforming code, which enables many other optimizations. Without , the optimizer must consider all possible reads/writes to the same value to do the above transformations. In many cases, the compiler doesn't have enough information, so much fewer optimizations can be done. .
It's suitable for simple copyable types like integers. In the previous contagious borrow example, if the is replaced with then mutating it doesn't need a mutable borrow of the parent, thus avoiding the issue. only supports replacing the whole at once, and doesn't support getting a mutable borrow. , suitable for data structures that do incremental mutation, in single-threaded cases. It has internal counters tracking how many immutable borrows and mutable borrows currently exist. If it detects a violation of mutable borrow exclusiveness, or will panic. It can cause a crash if there is nested borrowing that involves mutation. , for locking in multi-threaded cases. Its functionality is similar to . Note that unnecessary locking can cost performance and has a risk of deadlock. It's not recommended to overuse just because it can satisfy the borrow checker. . Elaborated below. Atomic types such as Lazily-initialized Contagious borrowing . As previously mentioned, wrapping the parent in doesn't solve contagious borrowing. also won't. Violating mutable borrow exclusiveness in the same thread is a panic (or error) with and a deadlock with . Rust locks are not re-entrant, explained below . Need to cut cycles using , or it will leak memory. Performance . and have relatively small performance costs. But for , unnecessary locking can hurt performance. can also have performance issues, explained below . It's still OK to use them when not in a performance bottleneck. Their syntactic ergonomics are not good. The code will have a lot of "noise" like . Avoid sharing the same reference count. Copying data is sometimes better. trc and hybrid_rc . They use a per-thread non-atomic counter, and another shared atomic counter for how many threads use it. This makes atomic operations less frequent, giving higher performance. For scenarios of frequent short-term reads: arc_swap . It uses hazard pointers and other mechanisms to improve performance. aarc and crossbeam_epoch . They use epoch-based memory reclamation .
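The runtime borrow counting of `RefCell` described above can be observed directly; a minimal sketch (using `try_borrow_mut` to get an error instead of a panic):

```rust
use std::cell::RefCell;

// RefCell moves borrow checking to runtime: it counts live borrows and
// panics (or errors, via the try_borrow* methods) when mutable borrow
// exclusiveness is violated.
fn main() {
    let cell = RefCell::new(vec![1, 2, 3]);

    {
        let read = cell.borrow();                // an immutable borrow is live...
        assert!(cell.try_borrow_mut().is_err()); // ...so a mutable one fails
        assert_eq!(read.len(), 3);
    } // `read` dropped here, the borrow counter is decremented

    cell.borrow_mut().push(4); // no live borrows remain, so this succeeds
    assert_eq!(cell.borrow().len(), 4);
}
```

Using `borrow()`/`borrow_mut()` directly in the inner scope would panic instead of returning an error.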
Every struct that holds a bump-allocated borrow needs to also have a lifetime annotation of the bump allocator. Every function that uses it also needs lifetime annotations. Rust has lifetime elision , which allows you to omit lifetimes in function signatures in some cases. However, it doesn't work in all cases. Don't violate mutable borrow exclusiveness. A cannot co-exist with any other borrow that overlaps it. The overlap here also includes interior pointers. A to an object cannot co-exist with any other borrow into any part of that object. Violating that rule causes undefined behavior and can cause wrong optimization. Rust adds the attribute for mutable borrows into LLVM IR. LLVM will heavily optimize based on . See also The above rule doesn't apply to raw pointers . Converting a to and then mutating the pointed data is undefined behavior. For that use case, wrap it in . It's very easy to accidentally violate that rule when using borrows in unsafe code. It's recommended to always use raw pointers and avoid using borrows (including slice borrows) in unsafe code. Related1 , Related2 Pointer provenance . Two pointers created from two different provenances are considered to never alias, even if their addresses are equal; relying on such aliasing is undefined behavior. Converting an integer to a pointer gets a pointer with no provenance, and using that pointer is undefined behavior, unless in these two cases: The integer was converted from a pointer using and then the integer converts to a pointer using The integer is converted to a pointer using ( is another pointer). The result has the same provenance as . Adding an integer to a pointer doesn't change provenance. The provenance is tracked by the compiler at compile time. In actual execution, a pointer is still an integer address that doesn't carry provenance information 13 . Using uninitialized memory is undefined behavior. will drop the original object in place of . If is uninitialized, then it will drop an uninitialized object, which is undefined behavior. Use Related Handle panic unwinding.
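For the uninitialized-memory rule above, `std::mem::MaybeUninit` is the standard tool: it lets you hold uninitialized memory and write into it without dropping a nonexistent old value. A minimal sketch:

```rust
use std::mem::MaybeUninit;

// MaybeUninit lets us hold uninitialized memory without instantly
// causing undefined behavior.
fn main() {
    let mut slot: MaybeUninit<String> = MaybeUninit::uninit();

    // `write` stores a value WITHOUT dropping the old (uninitialized)
    // contents; assigning through a &mut String here instead would try
    // to drop garbage, which is undefined behavior.
    slot.write(String::from("initialized"));

    // Only after initialization is `assume_init` sound.
    let s = unsafe { slot.assume_init() };
    assert_eq!(s, "initialized");
}
```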
Reading/writing mutable data that's shared between threads needs to use atomics, or volatile access ( , ), or other synchronization (like locking). If not, the optimizer may wrongly merge and reorder reads/writes. Note that volatile accesses themselves don't establish memory ordering (unlike Java ). If is a raw pointer, you cannot write (like in C/C++), and can only write Raw pointers cannot be method receivers (self). There is no "raw pointer to slice". You need to do pointer arithmetic and dereference manually. Bounds checking is also manual. means it's standalone (self-owned). It doesn't borrow temporary things. It can borrow global values (global values live as long as the program runs). It cannot borrow a value that only temporarily exists. The spawned future may be kept for a long time. It's not determined whether the future will only temporarily live within a scope. So the future needs to be . tokio_scoped allows submitting a future that's not , but it must finish within a scope. If the future needs to share data with the outside, pass an (not ). Note that "static" in C/C++/Java/C# often means global variable. But in Rust its meaning is different. means that the future can be sent across threads. Tokio uses work-stealing, which means that one thread's task can be stolen by other threads that currently have no work. is not needed if the async runtime doesn't move futures between threads. The normal sleep and normal locking should not be used with an async runtime, because they block using OS functionality without telling the async runtime, so they will block the async runtime's scheduling thread. In Tokio, use and . Cancellation safety. See also If a variable is used only once, you can inline that variable. This will only change execution order (except in short-circuit 14 ). Extracting a variable will only change execution order (except when the variable is used twice or in short-circuit).
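The `'static + Send` requirement on spawned tasks discussed above can be illustrated without an async runtime: `std::thread::spawn` has the same bounds, for the same reason (the closure may outlive the current scope and run on another thread). A sketch using `Arc<Mutex<..>>` to share data with the outside:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Like tokio::spawn, thread::spawn requires the closure to be
// 'static + Send: it cannot borrow local temporaries, so shared
// data is moved in as an owned Arc handle.
fn main() {
    let counter = Arc::new(Mutex::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter); // owned, 'static handle
            thread::spawn(move || {
                *counter.lock().unwrap() += 1;
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*counter.lock().unwrap(), 4);
}
```

Passing `&counter` into the closure instead would be rejected: the borrow is not `'static`.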
A temporary value drops immediately after evaluation, except when there is a borrow to it; then its lifetime extends with the borrow. This is called temporary lifetime extension. There are implicit ways of creating borrows. can implicitly borrow, can implicitly borrow , or can also borrow , which extends the lifetime. Sometimes temporary lifetime extension doesn't work, such as A value that's put into a local variable: If its type implements , then it will drop at the end of the scope (one example is ). If its type doesn't implement , then its borrows can end after its last use. This is called NLL (non-lexical lifetimes). Putting a temporary value into a local variable usually makes it live longer. Inlining a local variable usually makes it live shorter. Borrowing that crosses function boundaries is contagious. Just borrowing a wheel of a car can indirectly borrow the whole car. Mutate-by-recreate is contagious. Recreating a child requires also recreating the parent that holds the new child, and the parent's parent, and so on. Lifetime annotation is contagious. If some type has a lifetime parameter, then every type that holds it must also have a lifetime parameter. Every function that uses them also needs lifetime parameters, except when lifetime elision works. Refactoring that adds/removes lifetime parameters can be huge work. In the current borrow checker, one branch's output's borrowing is contagious to the whole branching scope. is contagious. An function can call a normal function. A normal function cannot easily call an function (but it's possible to call by blocking). Being not / is contagious. A struct that indirectly owns non- data is not . A struct that indirectly owns non- data is not . Error passing is contagious. If panic is not acceptable, then all functions that indirectly call a fallible function must return . Related: NaN is contagious in floating point computation. "Rust doesn't ensure safety of code, so using defeats the purpose of using Rust". No.
If you keep the amount of small, then when a memory/thread safety issue happens, you can inspect that small amount of code. In C/C++ you need to inspect all related code. It's still not recommended to use many in Rust. "Using arenas still faces the equivalent of 'use after free', so arenas don't solve the problem". No. Arenas can make these bugs much more deterministic than use-after-free in C/C++, preventing memory-safety Heisenbugs 15 , making debugging much easier. "The Rust borrow checker rejects your code because your code is wrong." No. Rust can reject valid safe code. "Doubly-linked lists are useless." No. They can be useful in many cases. The Linux kernel uses them. But often trees and hash maps can replace manually-implemented doubly-linked lists. "Circular reference is bad and should be avoided." No. Circular reference can be useful in many cases. Circular references do come with risks. "Rust guarantees high performance." No. If one evades the borrow checker by using everywhere, the program will likely be slower than one in a normal GC language (and has more risk of deadlocking). But it's easier to achieve high performance in Rust. In many other languages, achieving high performance often requires bypassing (hacking) a lot of language functionality. "Rust guarantees security." No. Not all security issues are memory/thread safety issues. According to the Common Weakness Enumeration 2024 , many real-world vulnerabilities are XSS, SQL injection, directory traversal, command injection, missing authentication, etc., which are not memory/thread safety issues. "Rust doesn't help other than memory/thread safety." No. Algebraic data types (e.g. , ) help avoid creating illegal data at the source. Using ADT data requires pattern matching all cases, which avoids forgetting to handle one case (except when using escape hatches like ). Mutable borrow exclusiveness prevents iterator invalidation. And it reduces bugs caused by accidental mutation. Explicit avoids accidentally copying containers like in C++.
Managing dependencies is much easier in Rust than in C/C++. "Using immutable data structures is just a workaround forced by Rust." No. Immutable data structures can prevent many bugs caused by accidental mutation. If used correctly, they can reduce complexity. Persistent data structures are also efficient for things like rollback. "Memory safety can only be achieved by Rust." No. Most GC languages are memory-safe. 16 Memory safety of existing C/C++ applications can be achieved via Fil-C . The native Rust ownership relations form a tree. Reference counting ( , ) allows shared ownership. ↩ Note that "reference" here means reference in the general OOP context (where there is no distinction between ownership and non-owning reference; think of references in Java/C#/JS/Python). This is different from the Rust reference. I will use "borrow" for Rust references in this article. ↩ Specifically, Gödel encodes symbols, statements and proofs into integers, called Gödel numbers. There exist many ways of encoding symbols/statements/proofs as data, and which exact way is used is not important. For simplicity, I will treat them all as data, and ignore the conversion between data and symbols/statements/proofs. ↩ Here is symbol substitution, replacing the free variable with , while avoiding making two different variables share the same name, by renaming when necessary. It's also similar to the Y combinator: . In that case , , , is a fixed point of : . , ↩ Having both IDs and object references introduces friction: translating between ID and object reference. Some ORMs will malfunction if there exist two objects with the same primary key. ↩ Each slotmap ensures key uniqueness, but if you mix keys of different slotmaps, keys from different slotmaps may duplicate. Using the wrong key may successfully get an element that is logically wrong. ↩ Sometimes, having fine-grained locks is slower because of more lock/unlock operations.
But sometimes having fine-grained locks is faster because it allows higher parallelism. Sometimes fine-grained locks can cause a deadlock where a coarse-grained lock won't. It depends on the exact case. ↩ See also . That was in 2020. Unsure whether it has changed now. One possible reason is that ARM allows weaker memory ordering than x86. Also, Swift and Objective-C use reference counting almost everywhere, so possibly Apple paid more effort to optimizing atomic reference counting. ↩ Tracing GC is faster for short-lived programs (such as some CLI programs and serverless functions), because there's no need to free memory for individual objects on exit. Example: My JavaScript is Faster than Your Rust . The same optimization is also achievable in Rust, but requires extra work (e.g. , bump allocator). ↩ It lags because it needs to do many counter decrements and deallocations for each individual object. This can be worked around by sending the to another thread and dropping it in that thread. Also, for deep structures, dropping may overflow the stack. ↩ Contended atomic operations (many threads touching one atomic value at the same time) are much slower than uncontended ones. The cost also includes memory block allocation and freeing. ↩ GC frequency is roughly proportional to allocation speed divided by free memory. In generational GC, a minor GC only scans the young generation, whose cost is roughly proportional to the count of living young-generation objects. But it still needs to occasionally do a full GC. ↩ When running in tools like Miri , pointer provenance will be tracked at runtime. ↩ will not execute if returns true. will not execute if returns false. ↩ The Heisenbugs may only trigger in release builds, not in debug builds, not when sanitizers are on, not when logging is on, not when a debugger is attached. That's because optimizations, sanitizers, debuggers and logging can change timing and memory layout, which can make a memory safety or thread safety bug no longer trigger.
Debugging a Heisenbug in a large codebase may take weeks or even months. Note that not all memory/thread safety bugs are Heisenbugs. Many are still easy to trigger. ↩ Golang is not memory-safe under data races. ↩

qouteall notes 3 months ago

Traps to Developers

A summarization of some traps to developers. These traps are unintuitive things that are easily misunderstood and cause bugs. This article spans a wide range of knowledge. If you find a mistake or have a suggestion, please leave a comment in the GitHub discussion . is by default. Inside flexbox or grid, often makes the min width determined by content. It has higher priority than many other CSS attributes, including , and . It's recommended to set . See also Horizontal and vertical are different in CSS: Block formatting context (BFC): Stacking context . In these cases, it will start a new stacking context: A stacking context can cause these behaviors: On mobile browsers, the top address bar and bottom navigation bar can go out of the screen when you scroll down. corresponds to the height when the top bar and bottom bar are off screen, which is larger than the height when the two bars are on screen. The modern solution is . makes the width-that-excludes-scrollbar be , which can make the total width (including scrollbar) horizontally overflow. can avoid that issue. is not based on its parent. It's based on its nearest positioned ancestor (the nearest ancestor that has set to , or that creates a stacking context). does not consider ambient things . If the parent's is or , then the child's has no effect. If the parent's width/height is not pre-determined, then percent width/height (e.g. , ) doesn't work. (This avoids a circular dependency where parent height is determined by content height, but content height is determined by parent height.) ignores and Whitespace collapse. See also aligns text and inline things, but doesn't align block elements (e.g. normal divs). By default, and don't include padding and border. with can still overflow the parent. makes the width/height include border and padding. Cumulative Layout Shift . It's recommended to specify the and attributes on to avoid layout shift due to image loading delay.
File download requests are not shown in the Chrome dev tools, because they only show networking in the current tab, and a file download is treated as happening in another tab. To inspect file download requests, use . JS-in-HTML may interfere with HTML parsing. For example makes the browser treat the first as an ending tag. See also NaN. Floating point NaN is not equal to any number, including itself. NaN == NaN is always false (even if the bits are the same). NaN != NaN is always true. Computing on NaN usually gives NaN (it can "contaminate" computation). There are +Inf and -Inf. They are not NaN. There is a negative zero -0.0 which is different from normal zero. The negative zero equals zero when using floating point comparison. Normal zero is treated as "positive zero". The two zeros behave differently in some computations (e.g. , , , is NaN) The JSON standard doesn't allow NaN or Inf: Directly comparing floating point numbers for equality may fail due to precision loss. Compare equality with things like JS uses floating point for all numbers. The max "safe" integer is 2^53 − 1. The "safe" here means every integer in range can be accurately represented. Outside of the safe range, most integers will be inaccurate. For large integers it's recommended to use . If a JSON contains an integer larger than that, and JS deserializes it using , the number in the result will likely be inaccurate. The workaround is to use other ways of deserializing JSON, or to use strings for large integers. (Putting a millisecond timestamp integer in JSON is fine, as the millisecond timestamp only exceeds the limit in year 287396. But nanosecond timestamps suffer from that issue.) The associativity law and distributive law don't strictly hold because of precision loss. Parallelizing matrix multiplication and sums dynamically using these laws can be non-deterministic. See also: Defeating Nondeterminism in LLM Inference Division is much slower than multiplication (unless using approximation).
Dividing many numbers by one number can be optimized by first computing the reciprocal and then multiplying by the reciprocal. These things can make different hardware have different floating point computation results: Floating point accuracy is low for values with very large absolute value or values very close to zero. It's recommended to avoid temporary results with very large absolute value or very close to zero. Iteration can cause error accumulation. For example, if something needs to rotate 1 degree every frame, don't cache the matrix and multiply by a 1-degree rotation matrix every frame. Compute the angle based on time, then re-calculate the rotation matrix from the angle. Some routers and firewalls silently kill idle TCP connections without telling the application. Some code (like HTTP client libraries, database clients) keeps a pool of TCP connections for reuse, which can be silently invalidated (using these TCP connections will get RST). To solve it, configure system TCP keepalive. See also The result of is not reliable. See also . Sometimes tcptraceroute is useful. TCP slow start can increase latency. It can be fixed by disabling . See also TCP "sticky packets" (TCP is a byte stream; message boundaries are not preserved). Nagle's algorithm delays packet sending. It will increase latency. It can be fixed by enabling . See also If you put your backend behind Nginx, you need to configure connection reuse, otherwise under high concurrency the connection between Nginx and the backend may fail, due to not having enough internal ports. Nginx delays SSE. The HTTP protocol does not explicitly forbid GET and DELETE requests from having a body. Some places do use a body in GET and DELETE requests. But many libraries and HTTP servers do not support them. One IP can host multiple websites, distinguished by domain name. The HTTP header and the SNI in the TLS handshake carry the domain name, which is important. Some websites cannot be accessed via IP address. CORS (cross-origin resource sharing).
For requests to another website (origin), the browser will prevent JS from reading the response, unless the server's response contains the header and it matches the client website. This requires configuring the backend. If you want to pass cookies to another website, it involves more configuration. Generally, if your frontend and backend are on the same website (same domain name and port), then there is no CORS issue. Reverse path filtering . When routing is asymmetric, packets from A to B use a different interface than packets from B to A; then reverse path filtering rejects valid packets. In old versions of Linux, if is enabled, it aggressively recycles connections based on the TCP timestamp. NAT and load balancers can make TCP timestamps not monotonic, so that feature can drop normal connections. Strictly speaking, they use the WTF-16 encoding, which is similar to UTF-16 but allows invalid surrogate pairs. Also, Java has an optimization that uses the Latin-1 encoding (1 byte per code point) for in-memory strings when possible. But the API of still works on WTF-16 code units. Similar things may happen in C# and JS. ↩ Directly treating existing binary data as a struct is undefined behavior because the object lifetime hasn't started. But using to initialize a struct is fine. ↩ is by default. Inside flexbox or grid, often makes the min width determined by content. It has higher priority than many other CSS attributes, including , and . It's recommended to set . See also Horizontal and vertical are different in CSS: Normally tries to fill the available space in the parent. But normally tries to just expand to fit content. For inline elements, inline-block elements and float elements, does not try to expand. centers horizontally. But normally becomes , which does not center vertically. In a flexbox with , can center vertically. Margin collapse happens vertically but not horizontally. The above flips when the layout direction flips (e.g. ) Block formatting context (BFC): creates a BFC.
(There are other ways to create a BFC, like , , , , but with side effects) Margin collapse. Two vertically touching siblings can overlap margins. A child's margin can "leak" outside of its parent. Margin collapse can be avoided by a BFC. Margin collapse also doesn't happen when or is specified. If a parent only contains floating children, the parent's height will collapse to 0. It can be fixed by a BFC. Stacking context . In these cases, it will start a new stacking context: The attributes that give special rendering effects ( , , , , etc.) will create a new stacking context or creates a stacking context Specifies and is or Specifies and the element is inside flexbox or grid doesn't work across stacking contexts. It only works within a stacking context. A stacking context can affect the coordinates of or . (The underlying logic is complex; see also ) doesn't work across stacking contexts. will still be clipped by the stacking context will position based on the stacking context On mobile browsers, the top address bar and bottom navigation bar can go out of the screen when you scroll down. corresponds to the height when the top bar and bottom bar are off screen, which is larger than the height when the two bars are on screen. The modern solution is . makes the width-that-excludes-scrollbar be , which can make the total width (including scrollbar) horizontally overflow. can avoid that issue. is not based on its parent. It's based on its nearest positioned ancestor (the nearest ancestor that has set to , or that creates a stacking context). does not consider ambient things . If the parent's is or , then the child's has no effect. If the parent's width/height is not pre-determined, then percent width/height (e.g. , ) doesn't work. (This avoids a circular dependency where parent height is determined by content height, but content height is determined by parent height.) ignores and Whitespace collapse. See also By default, newlines in HTML are treated as spaces. Multiple spaces together collapse into one.
can avoid collapsing whitespace but has weird behavior at the beginning and end of content. Often the spaces at the beginning and end of content are ignored, but this doesn't happen in . Any space or line break between two elements will be rendered as spacing. This doesn't happen in flexbox or grid. aligns text and inline things, but doesn't align block elements (e.g. normal divs). By default, and don't include padding and border. with can still overflow the parent. makes the width/height include border and padding. Cumulative Layout Shift . It's recommended to specify the and attributes on to avoid layout shift due to image loading delay. File download requests are not shown in the Chrome dev tools, because they only show networking in the current tab, and a file download is treated as happening in another tab. To inspect file download requests, use . JS-in-HTML may interfere with HTML parsing. For example makes the browser treat the first as an ending tag. See also Two concepts: code point, grapheme cluster: A grapheme cluster is the "unit of character" in a GUI. For visible ASCII characters, a character is one code point and one grapheme cluster. An emoji is a grapheme cluster, but it may consist of many code points. In UTF-8, a code point can be 1, 2, 3 or 4 bytes. The byte count does not necessarily equal the code point count. In UTF-16, each code unit is 2 bytes. A code point can be 1 code unit (2 bytes) or 2 code units (4 bytes, a surrogate pair). JSON string escapes use surrogate pairs. in JSON has only one code point. Different in-memory string behaviors in different languages: Rust uses UTF-8 for in-memory strings. gives the byte count. Rust does not allow directly indexing a (but allows subslicing). gives the code point count. Rust is strict about UTF-8 code point validity (for example, Rust doesn't allow a subslice to cut on an invalid code point boundary). Java, C# and JS's string encoding is similar to UTF-16 1 . String length is the code unit count. Indexing works on code units.
Each code unit is 2 bytes. One code point can be 1 code unit or 2 code units. In Python, gives the code point count. Indexing gives a string that contains one code point. A Golang string has no constraint on encoding and is similar to a byte array. String length and indexing work the same as on a byte array. But the most commonly used encoding is UTF-8. See also In C++, has no constraint on encoding and is similar to a byte array. String length and indexing are based on bytes. No language mentioned above does string length and indexing based on grapheme clusters. In SQL, limits 100 code points (not bytes). Some text files have a byte order mark (BOM) at the beginning. For example, FE FF means the file is in big-endian UTF-16. EF BB BF means UTF-8. It's mainly used on Windows. Some non-Windows software does not handle the BOM. When converting binary data to a string, often the invalid places are replaced by � (U+FFFD) Confusable characters . Normalization. For example é can be U+00E9 (one code point) or U+0065 U+0301 (two code points). String comparison works on binary data and doesn't consider normalization. Zero-width characters , Invisible characters Line breaks. Windows often uses CRLF for line breaks. Linux and macOS often use LF for line breaks. Locale ( elaborated below ). NaN. Floating point NaN is not equal to any number, including itself. NaN == NaN is always false (even if the bits are the same). NaN != NaN is always true. Computing on NaN usually gives NaN (it can "contaminate" computation). There are +Inf and -Inf. They are not NaN. There is a negative zero -0.0 which is different from normal zero. The negative zero equals zero when using floating point comparison. Normal zero is treated as "positive zero". The two zeros behave differently in some computations (e.g. , , , is NaN) The JSON standard doesn't allow NaN or Inf: JS turns NaN and Inf into null. Python will directly write , into the result, which is not compliant with the JSON standard. will raise if has NaN or Inf.
Golang will give an error if has NaN or Inf. Directly comparing floating point for equality may fail due to precision loss. Compare equality by things like JS uses floating point for all numbers. The max "safe" integer is 2^53 − 1. The "safe" here means every integer in range can be accurately represented. Outside of the safe range, most integers will be inaccurate. For large integers it's recommended to use . If a JSON contains an integer larger than that, and JS deserializes it using , the number in the result will likely be inaccurate. The workaround is to use other ways of deserializing JSON or use a string for the large integer. (Putting a millisecond timestamp integer in JSON is fine, as a millisecond timestamp only exceeds the limit in year 287396. But a nanosecond timestamp suffers from that issue.) The associativity and distributivity laws don't strictly hold because of precision loss. Parallelizing matrix multiplication and sum dynamically using these laws can be non-deterministic. See also: Defeating Nondeterminism in LLM Inference Division is much slower than multiplication (unless using approximation). Dividing many numbers by one number can be optimized by first computing the reciprocal and then multiplying by the reciprocal. These things can make different hardware have different floating point computation results: Hardware FMA (fused multiply-add) support. (in some places ). Most modern hardware makes the intermediary result in FMA have higher precision. Some old hardware or embedded processors don't do that and treat it as a normal multiply and add. Floating point has a subnormal range to make very-close-to-zero numbers more accurate. Most modern hardware can handle them, but some old hardware and embedded processors treat subnormals as zero. Rounding mode. The standard allows different rounding modes like round-to-nearest-ties-to-even (RNTE) or round-toward-zero (RTZ). On X86 and ARM, the rounding mode is thread-local mutable state that can be set by special instructions.
It's not recommended to touch the rounding mode as it can affect other code. In GPUs, there is no mutable state for rounding mode. Rasterization often uses the RNTE rounding mode. In CUDA, different rounding modes are associated with different instructions. Math functions (e.g. sin, log) may be less accurate on some embedded or old hardware. X86 has a legacy FPU which has 80-bit floating point registers and per-core rounding mode state. It's recommended not to use them. Floating point accuracy is low for values with very large absolute value or values very close to zero. It's recommended to avoid temporary results with very large absolute values or very close to zero. Iteration can cause error accumulation. For example, if something needs to rotate 1 degree every frame, don't cache the matrix and multiply by a 1-degree rotation matrix every frame. Compute the angle based on time, then re-calculate the rotation matrix from the angle. Leap second . Unix timestamp is "transparent" to leap seconds, which means converting between Unix timestamp and UTC time ignores leap seconds. A common solution is leap smear: make the time measured in Unix timestamp stretch or squeeze near a leap second. Time zone. UTC and Unix timestamp are globally uniform. But human-readable time is time-zone-dependent. It's recommended to store timestamps in the database and convert to human-readable time in the UI, instead of storing human-readable time in the database. Daylight Saving Time (DST): In some regions people adjust the clock forward by one hour in warm seasons. Time may "go backward" due to NTP sync. It's recommended to configure the server's time zone as UTC. Different nodes having different time zones will cause trouble in a distributed system. After changing the system time zone, the database may need to be reconfigured or restarted. There are two clocks: the hardware clock and the system clock. The hardware clock itself doesn't care about time zone. Linux treats it as UTC by default. Windows treats it as local time by default.
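The store-UTC-convert-in-UI advice above can be sketched in Python (the timestamp value is arbitrary, not from the original notes): a Unix timestamp identifies one instant, and only the human-readable rendering is time-zone-dependent:

```python
from datetime import datetime, timedelta, timezone

ts = 1700000000  # a Unix timestamp: a time-zone-independent instant

utc = datetime.fromtimestamp(ts, tz=timezone.utc)
tokyo = datetime.fromtimestamp(ts, tz=timezone(timedelta(hours=9)))

# Same instant, two different human-readable wall times:
assert utc.isoformat() == "2023-11-14T22:13:20+00:00"
assert tokyo.isoformat() == "2023-11-15T07:13:20+09:00"
assert utc == tokyo                 # aware datetimes compare by instant
assert int(utc.timestamp()) == ts   # round-trips back to the stored value
```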
compares object references. Should use to compare object content. Forgetting to override and . It will use object identity equality by default for map keys and in sets. Mutating the content of a map key object (or set element object) makes the container malfunction. A method that returns may sometimes return a mutable , but sometimes return which is immutable. Trying to modify throws . A method that returns may return (this is not recommended, but it does exist in real codebases). Null is ambiguous. If on a map returns null, either the value is missing or the value exists but is null (can distinguish by ). Null fields and missing fields in JSON are both mapped to null in a Java object. See also Implicitly converting etc. to etc. can cause . A return in a block swallows any exception thrown in the or block. The method will return the value from . Interrupt. Some libraries ignore interrupts. If a thread is interrupted and then loads a class, and class initialization does IO, the class may fail to load. A thread pool does not log exceptions of tasks sent by by default. You can only get the exception from the future returned by . Don't discard the future. And a task silently stops if an exception is thrown. A literal number starting with 0 will be treated as an octal number. ( is 83) When debugging, the debugger will call on local variables. Some classes' has side effects, which causes the code to run differently under a debugger. This can be disabled in the IDE. Before Java 24, a virtual thread can be "pinned" when blocking on a lock, which may cause deadlock. It's recommended to upgrade to Java 24 if you use virtual threads. It's not recommended to override . If it runs too slowly, it can block GC. Exceptions out of are not logged. A dead object can resurrect itself in , and if a resurrected object becomes dead again, won't be called again. Use for GC-directed disposal. reuses the memory region if capacity allows. Appending to a subslice can overwrite the parent if they share a memory region.
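The equals/hashCode pitfalls above have a direct analogue in Python, where the default `__eq__`/`__hash__` are identity-based. This sketch (class names invented, not from the original notes) shows both the failure and a fix:

```python
from dataclasses import dataclass

class Point:                      # no __eq__/__hash__: identity semantics,
    def __init__(self, x, y):     # like a Java class without equals()/hashCode()
        self.x, self.y = x, y

a, b = Point(1, 2), Point(1, 2)
assert a != b                     # same content, but compared by identity
assert len({a, b}) == 2           # both enter a set as distinct keys

@dataclass(frozen=True)           # generates field-based __eq__ and __hash__;
class FrozenPoint:                # frozen also prevents mutating a key in place
    x: int
    y: int

assert FrozenPoint(1, 2) == FrozenPoint(1, 2)
assert len({FrozenPoint(1, 2), FrozenPoint(1, 2)}) == 1
```

Making key types immutable (`frozen=True`) also rules out the mutate-a-map-key bug mentioned above.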
executes when the function returns, not when the lexical scope exits. captures the mutable variable. About : There are nil slices and empty slices (the two are different). But there is no nil string, only the empty string. A nil map can be read like an empty map, but a nil map cannot be written. Interfaces have weird behavior. An interface value is a fat pointer containing type info and a data pointer. If the data pointer is null but the type info is not null, then it will not equal . Before Go 1.22, the loop variable capture issue . Dead wait. Understanding Real-World Concurrency Bugs in Go Different kinds of timeout. The complete guide to Go net/http timeouts Having an interior pointer to an object keeps the whole object alive. This may cause memory leaks. Storing a pointer to an element inside and then growing the vector may re-allocate the content, making the element pointer no longer valid. created from a literal string may be temporary. Taking from a temporary string is wrong. Iterator invalidation. Modifying a container while looping on it. doesn't remove but just rearranges elements. actually removes. A literal number starting with 0 will be treated as an octal number. ( is 83) Destructing a deep tree structure can overflow the stack. The solution is to replace recursion with a loop in the destructor. Undefined behaviors. Compiler optimizations aim to keep defined behavior the same, but can freely change undefined behavior. Relying on undefined behavior can make a program break under optimization. See also Accessing uninitialized memory is undefined behavior. Converting a to a struct pointer can be seen as accessing uninitialized memory, because the object lifetime hasn't started. It's recommended to put the struct elsewhere and use to initialize it. Accessing invalid memory (e.g. a null pointer) is undefined behavior. Signed integer overflow/underflow is undefined behavior. (Unsigned integer arithmetic wraps around and is defined, but unintentionally wrapping below 0 is still a bug.) Aliasing. Aliasing means multiple pointers point to the same place in memory.
Strict aliasing rule: If there are two pointers with type and , the compiler assumes the two pointers never alias; accessing the same memory through both is undefined behavior. Except in two cases: 1. and have a subtyping relation 2. converting a pointer to a byte pointer ( , or ) (the reverse does not apply). Pointer provenance. Two pointers from two different provenances are treated as never aliasing; even if their addresses are equal, using one to access the other's object is undefined behavior. See also Alignment. For example, a 64-bit integer's address needs to be divisible by 8. On ARM, accessing memory in an unaligned way can cause a crash. Unaligned memory access is undefined behavior. Directly treating a part of a byte buffer as a struct is undefined behavior. Not only due to alignment, but also because the object lifetime hasn't started 2 . Alignment can cause padding in structs that wastes space. Some SIMD instructions only work with aligned data. For example, AVX instructions usually require 32-byte alignment. A default argument is a stored value that will not be re-created on every call. Be careful about indentation when copying and pasting Python code. Null is special. doesn't work. works. Null does not equal itself, similar to NaN. A unique index allows duplicate nulls (except in Microsoft SQL Server). may treat nulls as the same (this is database-specific). and ignore rows where is null. Date implicit conversion can be timezone-dependent. A complex join with distinct may be slower than a nested query. See also In MySQL (InnoDB), if a string field doesn't have then it will error if you try to insert text containing a 4-byte UTF-8 code point. MySQL (InnoDB) defaults to case-insensitive. MySQL (InnoDB) can do implicit conversion by default. gives 124. MySQL (InnoDB) gap locks may cause deadlock. In MySQL (InnoDB) you can select a field and group by another field. It gives a nondeterministic result. In SQLite the field type doesn't matter unless the table is . SQLite by default does not vacuum. The file size only increases and won't shrink.
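The SQL NULL rules above can be checked with Python's built-in sqlite3 (a sketch; the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(None,), (1,), (None,)])

# Comparing with "= NULL" never matches: NULL = NULL is unknown, not true.
assert conn.execute("SELECT COUNT(*) FROM t WHERE x = NULL").fetchone()[0] == 0
# "IS NULL" is the correct test:
assert conn.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0] == 2
# COUNT(x) ignores rows where x is null; COUNT(*) doesn't:
assert conn.execute("SELECT COUNT(x) FROM t").fetchone()[0] == 1
conn.close()
```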
To make it shrink you need to either manually or enable . Foreign keys may cause implicit locking, which may cause deadlock. Locking may break repeatable read isolation (it's database-specific). A distributed SQL database may not support locking or may have weird locking behaviors. It's database-specific. If the backend has an N+1 query issue, the slowness may not show up in the slow query log, because the backend does many small queries serially and each individual query is fast. Long-running transactions can cause problems (e.g. locking). It's recommended to make all transactions finish quickly. If a string column is used in an index or primary key, it will have a length limit. MySQL applies the limitation when changing the table schema. PostgreSQL applies the limitation by erroring when inserting or updating data. Whole-table locks that can make the service temporarily unusable: In MySQL (InnoDB) 8.0+, adding a unique index or foreign key is mostly concurrent (it only locks briefly) and won't block operations. But in older versions it may take a whole-table lock. used without causes a whole-table read lock. In PostgreSQL, or causes a whole-table read lock. To avoid that, use to add a unique index. For a foreign key, use then . About ranges: If you store non-overlapping ranges, querying for the range containing a point by is inefficient (even when having a composite index of ). An efficient way: (only requires an index on the column). For overlappable ranges, a normal B-tree index is not sufficient for efficient querying. It's recommended to use a spatial index in MySQL and GiST in PostgreSQL. : itself cannot replace locks. itself doesn't provide atomicity. You don't need for data protected by a lock. Locking can already establish memory order and prevent some wrong optimizations. In C/C++, only avoids some wrong optimizations, and won't automatically add memory barrier instructions for access.
In Java, accesses have sequentially-consistent ordering (the JVM will use memory barrier instructions if needed) In C#, accesses to the same value have release-acquire ordering (the CLR will use memory barrier instructions if needed) can avoid wrong optimizations related to reordering and merging memory reads/writes. (The compiler can merge reads by caching a value in a register. The compiler can merge writes by only writing to a register and delaying the write to memory. A read after a write can be optimized out.) Time-of-check to time-of-use ( TOCTOU ). In a SQL database, for special uniqueness constraints that don't fit a simple unique index (e.g. unique across two tables, conditional unique, unique within a time range), if the constraint is enforced by the application, then: In MySQL (InnoDB), at the repeatable read level, if the application checks using then inserts, and the unique-checked column has an index, it works due to gap locks. (Note that gap locks may cause deadlock under high concurrency; ensure deadlock detection is on and use retrying). In PostgreSQL, at the repeatable read level, if the application checks using then inserts, it's not sufficient to enforce the constraint under concurrency (due to write skew). Some solutions: Use the serializable level Don't rely on the application to enforce the constraint: For conditional unique, use a partial unique index. For the uniqueness-across-two-tables case, insert redundant data into one extra table with a unique index. For the time-range-exclusiveness case, use a range type and an exclusion constraint. Atomic reference counting ( , ) can be slow when many threads frequently change the same counter. See also About read-write locks: trying to acquire the write lock while holding the read lock can deadlock. The correct way is to first release the read lock, then acquire the write lock, and the conditions that were checked under the read lock need to be re-checked. Reentrant lock: Reentrant means one thread can lock twice (and unlock twice) without deadlocking. Java's and are reentrant.
Non-reentrant means that if one thread locks twice, it will deadlock. Rust's and Golang's mutexes are not reentrant. False sharing of the same cache line costs performance. Forgetting to check for null/None/nil. Modifying a container while for-looping on it. Single-thread "data race". Unintended sharing of mutable data. For example in Python does not create a proper 2D array. For non-negative integers, may overflow. A safer way is . Short circuit. will not run if returns true. will not run when returns false. When using a profiler: the profiler may by default only include CPU time, which excludes waiting time. If your app spends 90% of its time waiting on the database, the flamegraph may not include that 90%, which is misleading. If the current directory is moved, still shows the original path. shows the real path. makes both stdout and stderr go to a file. But only makes stdout go to the file and doesn't redirect stderr. File names are case sensitive (unlike Windows). There is a capability system for executables, apart from the file permission system. Use to see capabilities. Unset variables. If is unset, becomes . Using can make bash error when encountering an unset variable. If you want a script to add variables and aliases to the current shell, it should be executed using , instead of directly executing it. But the effect of is not permanent and doesn't apply after re-login. It can be made permanent by putting it into . Bash caches the mapping between command names and file paths. If you move a file in , then using that command gives ENOENT. Refresh the cache using Using a variable unquoted will make its line breaks be treated as spaces. can make the script exit immediately when a sub-command fails, but it doesn't work inside a function whose result is condition-checked (e.g. the left side of , , the condition of ). See also K8s used with a debugger. A breakpoint debugger usually blocks the whole application, making it unable to respond to health check requests, so it can be killed by K8s . Modifying state in rendering code.
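The "unintended sharing of mutable data" bullet above, concretely in Python, where list replication with `*` copies references rather than creating fresh lists (a sketch, not from the original notes):

```python
# [[0] * 3] * 2 replicates a reference to the SAME inner list:
grid = [[0] * 3] * 2
grid[0][0] = 1
assert grid == [[1, 0, 0], [1, 0, 0]]  # both "rows" changed

# A comprehension creates a fresh inner list per row:
grid2 = [[0] * 3 for _ in range(2)]
grid2[0][0] = 1
assert grid2 == [[1, 0, 0], [0, 0, 0]]
```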
React compares equality using reference equality, not content equality. Objects and arrays that are newly created during rendering are treated as always-new. Use to fix. Closure functions that are created during rendering are also always-new. Use to fix. If an always-new thing is put into a dependency array, the effect will run on every render. See also Cloudflare incident 2025 Sept-12 . Don't forget to include dependencies in the dependency array. And the dependencies also need to be memoized. When using an effect to manage , if the effect has a dependency value, it will remove and re-add the timer when the dependency changes, which can mess up the timing. State objects themselves should be immutable. Don't directly set fields of state objects. Always recreate the whole object. Forgetting to include a value in the dependency array. Forgetting cleanup in . Closure trap. A closure can capture a state. If the state changes, the closure still captures the old state. One solution is to make the closure not capture state and access state within . Another solution is to put the state in (note that changing the value inside a ref doesn't trigger re-rendering; you need to change state or props to trigger re-rendering) first runs after the component's DOM is present in the web page. Doing initialization in may cause visual flicker. Use for early initialization. When using a ref to get a DOM object, it won't be accessible during the first rendering (component function call). It can be accessed in . Rebase can rewrite history. After rebasing a local branch, a normal push will give a weird result (because history is rewritten). Rebase should be used with force push. If the remote branch's history is rewritten, pulling should use . Force pushing with can sometimes avoid overwriting other developers' commits. But if you fetch and then don't pull, it cannot protect. Reverting a merge doesn't fully cancel the side effects of the merge. If you merge B into A and then revert, merging B into A again has no effect. One solution is to revert the revert of the merge.
(A cleaner way to cancel a merge, instead of reverting the merge, is to back up the branch, then hard reset to the commit before the merge, then cherry-pick the commits after the merge, then force push.) On GitHub, if you accidentally committed a secret (e.g. an API key) and pushed it to a public repo, even if you overwrite it using force push, GitHub will still record that secret. See also Example activity tab On GitHub, if there is a private repo A and you forked it as B (also private), then when A becomes public, the private repo B's content is also publicly accessible, even after deleting B. See also . GitHub by default allows deleting a release tag and adding a new tag with the same name, pointing to another commit. It's not recommended to do that. Many build systems cache based on release tags, which breaks under that. It can be disabled in rulesets configuration. does not drop the stash if there is a conflict. On Windows, Git often auto-converts cloned text files to CRLF line endings. But in WSL much software (e.g. bash) doesn't work with CRLF files. Using can make git clone as LF. MacOS automatically adds files into every folder. It's recommended to add into . Some routers and firewalls silently kill idle TCP connections without telling the application. Some code (like HTTP client libraries and database clients) keeps a pool of TCP connections for reuse, which can be silently invalidated (using these TCP connections will get RST). To solve it, configure system TCP keepalive. See also Note that HTTP/1.0 Keep-Alive is different from TCP keepalive. The result of is not reliable. See also . Sometimes tcptraceroute is useful. TCP slow start can increase latency. Can be fixed by disabling . See also TCP sticky packets. Nagle's algorithm delays packet sending. It will increase latency. Can be fixed by enabling . See also If you put your backend behind Nginx, you need to configure connection reuse, otherwise under high concurrency the connections between nginx and the backend may fail, due to not having enough internal ports.
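The TCP keepalive and Nagle points above map to standard socket options. This Python sketch (not from the original notes) shows where they are set on a client socket; the probe intervals themselves are tuned at the OS level (e.g. sysctls on Linux):

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Enable OS-level keepalive probes so silently-dead pooled connections
# get detected, instead of surfacing as a surprise RST much later:
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Disable Nagle's algorithm so small writes aren't delayed:
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

assert s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```

Most HTTP client and database libraries expose these as pool/connection options; the socket calls are what those options boil down to.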
Nginx delays SSE. The HTTP protocol does not explicitly forbid GET and DELETE requests from having a body. Some places do use a body in GET and DELETE requests. But many libraries and HTTP servers do not support them. One IP can host multiple websites, distinguished by domain name. The HTTP header and the SNI in the TLS handshake carry the domain name, which are important. Some websites cannot be accessed via IP address. CORS (cross-origin resource sharing). For requests to another website (origin), the browser will prevent JS from getting the response, unless the server's response contains the header and it matches the client website. This requires configuring the backend. If you want to pass cookies to another website it involves more configuration. Generally, if your frontend and backend are on the same website (same domain name and port) then there is no CORS issue. Reverse path filtering . When routing is asymmetric, packets from A to B use a different interface than packets from B to A, and reverse path filtering rejects valid packets. In old versions of Linux, if is enabled, it aggressively recycles connections based on TCP timestamps. NAT and load balancers can make TCP timestamps non-monotonic, so that feature can drop normal connections. Upper case and lower case can be different in other natural languages. In Turkish (tr-TR) the lowercase of is and the upper case of is . The (word char) in regular expressions can be locale-dependent. Letter ordering can be different in other natural languages. Regular expressions may malfunction in other locales. Text notation of floating-point numbers is locale-dependent. in the US corresponds to in Germany. CSV normally uses as the separator, but uses as the separator in the German locale. Han unification . Some characters in different languages with slightly different appearance use the same code point. Usually a font will contain variants for different languages that render these characters differently.
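The locale-dependent casing above can be seen even with Python's built-ins, which apply the default (non-Turkish) Unicode case rules (a sketch, not from the original notes):

```python
# Turkish dotless ı (U+0131) uppercases to plain "I" under the default
# rules, and "I".lower() gives dotted "i" back -- so a lower/upper
# round-trip corrupts Turkish text unless locale-aware casing is used:
assert "\u0131".upper() == "I"
assert "I".lower() == "i"

# German ß has no single-character uppercase form:
assert "ß".upper() == "SS"
# casefold() is the right tool for caseless comparison:
assert "straße".casefold() == "strasse"
```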
Regular expressions cannot parse syntax that allows infinite nesting (because a regular expression engine uses a finite state machine; infinite nesting requires infinite states to parse). HTML allows infinite nesting. But it's OK to use regex to parse the HTML of a specific website. Regular expression behavior can be locale-dependent (depending on the regular expression engine). There are many different "dialects" of regular expressions. Don't assume a regular expression that works in JS will work in Java. A separate regular expression validation can be out-of-sync with the actual data format. The CrowdStrike incident was caused by a wrong separate regular expression validation. It's recommended to avoid separate regular expression validation. Reuse parsing code for validation . See also: Parse, don't validate Backtracking performance issues. See also: Cloudflare incident 2019 July-2 , Stack Exchange incident 2016 July-20 YAML: YAML is space-sensitive, unlike JSON. is wrong. is correct. YAML doesn't allow using tabs for indentation. The Norway country code becomes false if unquoted . A Git commit hash may become a number if unquoted . The yaml document from hell When using Microsoft Excel to open a CSV file, Excel will do a lot of conversions, such as date conversion (e.g. turning and into ) and Excel won't show you the original string. The gene SEPT1 was renamed due to this Excel issue . Excel will also make large numbers inaccurate (e.g. turning into ) and won't show you the original accurate number, because Excel internally uses floating point for numbers. It's recommended to configure a billing limit when using cloud services, especially serverless. See also: ServerlessHorrors Big endian and little endian in binary files and net packets. Strictly speaking, they use the WTF-16 encoding, which is similar to UTF-16 but allows invalid surrogate pairs. Also, Java has an optimization that uses Latin-1 encoding (1 byte per code point) for in-memory strings if possible.
But the API of still works on WTF-16 code units. Similar things may happen in C# and JS. ↩ Directly treating existing binary data as a struct is undefined behavior because the object lifetime hasn't started. But using to initialize a struct is fine. ↩

qouteall notes 4 months ago

About Code Reuse, Polymorphism and Abstraction

Extracting a function is regularization, while inlining a function is de-regularization. Extracting a function turns duplicated code into a shared function, and inlining turns a shared function into duplicated code. Why we sometimes specialize instead of generalizing: About leaky abstraction: Abstractions aim to hide details and make things simpler. But some abstractions are leaky : to use one correctly you need to understand the details that it tries to hide. The more leaky an abstraction is, the less useful it is. If a new requirement follows the regularity that the abstraction uses, then the abstraction is good and makes things simpler. But when the new requirement breaks the regularity, the abstraction hinders the developer. The developer is left with two choices: Every abstraction makes some things easier AND makes other things harder. It's a tradeoff. every game engine has things they make easier and things they make harder. working exclusively with one tool for a long time makes your brain stop even considering designs that fall outside the scope of that tool. it can make it feel like the tool doesnt have limits - Tyler Glaiel, Link The real world is complex. Building software requires making decisions on a lot of details. If some tool has a simple interface, it must have hardcoded a lot of detail decisions inside. If the interface exposes these detail decisions, the interface won't be simple. This also applies to AI coding. When you write a vague prompt and the LLM generates a whole application/feature for you, the generated code contains many opinionated detail decisions made by the LLM, not you (of course you can then prompt the LLM to change a detail). These decisions are important and should be made early (when using AI-assisted coding, these decisions should be clearly specified). Reducing complexity requires making things as unrelated as possible. One thing is less complex when fewer things relate to it. Reduce the responsibility of any individual module.
Separation of concern. In the context of programming, orthogonality means unrelatedness : Sometimes splitting a complex operation into multiple stages makes it more orthogonal. Merging multiple steps into one step increases complexity. The reality is usually less perfect than the theory. Often two things are mostly orthogonal but have some non-orthogonal edge cases. If the edge cases are few and not complex, adding the special-case handling is OK. However, if there are many special cases, or some special cases are complex, then the two modules are very non-orthogonal and should be re-designed. Sometimes the interface allows passing two orthogonal options, but it actually does not support some combinations of options. This is fake orthogonality (it seems orthogonal in the interface but actually isn't). Sum types are useful for avoiding invalid combinations of data, reducing fake orthogonality. They can help correctness by stopping invalid combinations of data from being created. Another case is that the software provides orthogonality in the interface, and actually supports all combinations of options (including many useless option combinations), but the implementation is non-orthogonal; then the implementation will face combinatorial explosion . Limiting the supported combinations in the interface is better. If you consider it as a library, you can use Windows linker functionality X in combination with Unix linker functionality Y, but there was no precedent for what the linker should behave in such a case. Even worse, in many situations, it was not obvious what would be the “right” behavior. We spent a lot of time discussing to define the semantics that would make sense for all possible feature combinations, and we carefully wrote complex code to support all targets simultaneously. However, in hindsight, this was probably not a good way to spend time because no one really wanted to use such hypothetical feature combinations. lld v1 probably didn't have any real users.
- My story on “worse is better” The user name is used as the id of the user. But a new requirement comes: the user must be able to change the user name. (Using a name as an id is usually a bad design, unless the tool is for programmers.) In a game, if an entity dies, that entity is deleted. But a new requirement comes: a dead entity can be resurrected by a new magic. To implement that, you need to change real deletion to soft deletion. For example, add a boolean flag of whether it's living, and check that flag in every piece of entity behavior logic. An app supports one language. And the event log is recorded using simple strings. But a new requirement comes: make the app support multiple languages. The user can switch language at any time and see the event log in their language. To implement that, you cannot store the text as a string. The log should be stored as a data structure of the relevant information, and turned into text when shown in the UI. (A "dumber" way is to store the strings for every supported language.) A todo list app needs to support undo and redo. In a singleplayer game, all game logic runs locally. All game data is in memory and is loaded/saved from a file. But a new requirement comes: make it multiplayer. In a singleplayer game, the in-memory data can be the source of truth, but in multiplayer the server is the source of truth. Every non-client operation now requires packet sending and receiving. What's more, to reduce visible latency, the client-side game must guess future game state and correct the guess from server packets (add a rollback mechanism). It can become complex. In a todo list app, all data is loaded from the server. All edits also go through the server. But a new requirement comes: make the app work offline and sync when it connects to the internet. In a GUI, previously there is a long-running task that changes GUI state, and the user cannot operate the GUI while the task is running. Now, to improve user experience, you need to allow operating the GUI while the task is running.
Both the background task and the user can now change the mutable state. User interfaces are hard - why? Two previously separated UI components now need to share mutable state. The complexity that lives in the GUI | RoyalSloth The previous data processing removes some information. A new requirement needs to keep that information. (Example TODO) There are some fixed workflows (hardcoded in code). A new requirement comes: allow the user to configure and customize the workflow. The new flexible system allows many more ways of configuring and introduces many corner cases. (Developing specially for each enterprise customer may actually be easier than creating a configurable flexible "rules engine". The custom "rules engine" will be more complex and harder to debug than just code. You can still share common code when developing separately. The Configuration Complexity Clock ) Special cases in the permission system. Allow non-logged-in users to access some functionality. Add bots as a new kind of user with special permissions. Make the permission for modifying a specific field fine-grained. Two systems A and B need to work together, but A's and B's APIs both change across versions. However every version of A must work with every version of B. Keep adding A/B test feature flags. There will be many combinations of feature flags. It's possible that some combinations will trigger bugs. Generalization introduces new concepts and adds cognitive load . Sometimes, not adding these is better, depending on how useful the abstraction is. A new requirement can break the assumption or regularity that the generalization is based on. New exceptions break generalization . De-regularize the abstraction and do the change accordingly. (And create new abstractions that follow the new regularity. This is refactoring.) Add special-case handling within the current abstraction. The exceptions can make previously unrelated things related again (breaking orthogonality), increasing (accidental) complexity .
It will often involve new boolean flags that control internal behavior, weird data relaying, new state sharing, new concurrency handling, etc. Data modelling: Which data to store? Which data to compute on demand? How and when is an ID allocated? What lookup acceleration structures or redundant data do we have? How do we migrate the schema? Is there any ambiguity in the data model? (two different things correspond to the same data) Constraints: What can change and what cannot change? What can duplicate (overlap) and what cannot? Does this ID always point to a valid object? What constraints does the business logic require? Does this allow concurrency? Will concurrency break the constraints? Dataflow: Which data is the source of truth? Which data is derived from the source of truth? How is a change of the source of truth broadcast to derived data? How is the cache invalidated? How is the lookup acceleration structure kept consistent with the source of truth? What data should we expose to the client side? How and when do we validate external data? Separation of responsibility (concern) and encapsulation: What module is responsible for updating this data? Should this data be encapsulated or should other modules access it? What module is responsible for keeping that derived data consistent with the source of truth? What module is responsible for keeping that constraint? Tradeoffs: What tradeoff do we make to simplify it? What tradeoff do we make to optimize performance? What tradeoff do we make to maintain compatibility? What work must be done immediately? What work can be deferred? What data can be stale? What data cannot be stale? Two different pieces of data can be combined in a valid way. Two different pieces of logic can work together without interfering with each other. No need to do special-case handling of combinations. No combinatorial explosion.
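The "valid combinations only" idea above is what sum types buy: model each state as its own type so invalid field combinations cannot be constructed. A Python sketch (all names invented for illustration):

```python
from dataclasses import dataclass
from typing import Union

# Instead of one class with a pile of optional fields (where e.g. error
# and data could both be set at once), each state is its own type:

@dataclass
class Loading:
    progress: float

@dataclass
class Loaded:
    data: bytes

@dataclass
class Failed:
    error: str

Resource = Union[Loading, Loaded, Failed]

def describe(r: Resource) -> str:
    if isinstance(r, Loading):
        return f"loading {r.progress:.0%}"
    if isinstance(r, Loaded):
        return f"{len(r.data)} bytes"
    return f"failed: {r.error}"

assert describe(Loading(0.5)) == "loading 50%"
assert describe(Failed("timeout")) == "failed: timeout"
```

A `Loaded` value simply has no `error` field to get out of sync, so the invalid combination is unrepresentable rather than merely checked.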
(Using a name as an ID is usually a bad design, unless the tool is for programmers.) In a game, if an entity dies, that entity is deleted. But a new requirement comes: a dead entity can be resurrected by a new magic. To implement that, you need to change real deletion to soft deletion. For example, add a boolean flag for whether it's living, and check that flag in every piece of entity behavior logic. An app supports one language, and the event log is recorded using simple strings. But a new requirement comes: make the app support multiple languages. The user can switch language at any time and see the event log in their language. To implement that, you cannot store the text as strings. The log should be stored as a data structure holding the relevant information, and turned into text when shown in the UI. (A "dumber" way is to store the strings for every supported language.) A todo list app needs to support undo and redo. In a singleplayer game, all game logic runs locally. All game data is in memory and is loaded/saved from files. But a new requirement comes: make it multiplayer. In a singleplayer game, the in-memory data can be the source of truth, but in multiplayer the server is the source of truth. Every non-client operation now requires packet sending and receiving. What's more, to reduce visible latency, the client-side game must guess the future game state and correct the guess from server packets (add a rollback mechanism). It can become complex. In a todo list app, all data is loaded from the server. All edits also go through the server. But a new requirement comes: make the app work offline and sync when it connects to the internet. In a GUI, previously there is a long-running task that changes GUI state, and the user cannot operate the GUI while the task is running. Now, to improve user experience, you need to allow operating the GUI while the task is running. Both the background task and the user can now change the mutable state. User interfaces are hard - why?
Two previously separated UI components now need to share mutable state. The complexity that lives in the GUI | RoyalSloth The previous data processing removes some information; a new requirement needs to keep that information. (Example TODO) There are some fixed workflows (hardcoded in code). A new requirement comes: allow the user to configure and customize the workflow. The new flexible system allows many more ways of configuring and introduces many corner cases. (Developing specially for each enterprise customer may actually be easier than creating a configurable, flexible "rules engine". The custom "rules engine" will be more complex and harder to debug than just code. You can still share common code when developing separately. The Configuration Complexity Clock ) Special cases in a permission system: allow non-logged-in users to access some functionality; add bots as a new kind of user with special permissions; make the permission to modify a specific field fine-grained. Two systems A and B need to work together, but A and B's APIs both change across versions, yet every version of A must work with every version of B. Keep adding A/B test feature flags. There will be many combinations of feature flags, and it's possible that some combinations will trigger bugs. There is a data visualization UI. Originally, it first loads all data from the server, then renders. But when the data size becomes huge, loading becomes slow and you need to break the data into parts, dynamically load parts and visualize the loaded parts. A game has a loading screen when switching scenes. A new requirement comes: make the loading seamless and remove the loading screen. An app loads all data from the database and then computes things in the programming language. One day the data becomes so big that it cannot be held in memory. You need to either load partial data into memory, compute separately and then merge the results, or rewrite the logic into SQL and let the database compute it.
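The multilingual event log example above can be sketched in a few lines. This is a minimal sketch, not from any real app: the event kinds, the translation table, and the `render` function are all hypothetical names, and a real app would load translations from resource files.

```python
# Sketch of the multilingual log idea: store events as structured data
# and only turn them into text at display time, so switching languages
# re-renders the whole history. All names here are hypothetical.
from dataclasses import dataclass

# Translation templates per language (a real app would load these from files).
TEMPLATES = {
    "en": {"task_added": "Task '{title}' was added",
           "task_done": "Task '{title}' was completed"},
    "de": {"task_added": "Aufgabe '{title}' wurde hinzugefügt",
           "task_done": "Aufgabe '{title}' wurde erledigt"},
}

@dataclass
class LogEvent:
    kind: str     # event type, e.g. "task_added"
    params: dict  # structured data needed to render the text later

def render(event: LogEvent, lang: str) -> str:
    # Text is produced at display time, not at logging time.
    return TEMPLATES[lang][event.kind].format(**event.params)

log = [LogEvent("task_added", {"title": "buy milk"}),
       LogEvent("task_done", {"title": "buy milk"})]

print([render(e, "en") for e in log])
print([render(e, "de") for e in log])
```

Note that the "dumber" alternative mentioned above (storing strings for every supported language at logging time) breaks as soon as a new language is added, while the structured form does not.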

qouteall notes 4 months ago

Term Ambiguity

A lot of debate happens because the same word has different meanings to different people. Some ambiguities related to programming: Encryption. Calculating a hash code or signing is/isn't encryption. Linear regression is / isn't machine learning. AGI. Average-human-level / super-human AI. Pass-by-value. In some places, passing a reference is technically called "pass-by-value". In some places, pass-by-value means passing object content instead of an object reference. Compile. Turn source code into machine code / turn one kind of code into another kind of code (IR) / optimize code without changing the code format (React compiler). Render. Generate an image / generate HTML / generate video / generate other things. Parse. Parsing contains / doesn't contain validation. Garbage collection. In some places, it means only tracing garbage collection. In some places, it also includes reference counting. GC includes epoch-based memory reclamation. In distributed systems, "availability" means can process read requests / can process both read and write requests. Let's Consign CAP to the Cabinet of Curiosities - Marc's Blog (brooker.co.za) Negative feedback loop. In some places, it means a self-regulating process (like a thermostat). In some places, it means a self-reinforcing negative effect (such as a self-reinforcing asset price drop in a financial crisis). Forward and backward in time. Sometimes "forward" is future-oriented, analogous to walking. Sometimes "forward" is past-oriented, when talking about history. MVC. There are two kinds of MVC. One is for client GUI applications, where the controller is the mediator between view and model. One is for server-side web applications, where the model accesses the database, the view generates HTML, and the controller calls the previous two and handles RESTful APIs. MVC Isn’t MVC — Collin Donnell Synchronization. In some places, specifying memory ordering and accessing a Java volatile are called "synchronization".
In some places these are not called synchronization. In English, "synchronized" can mean "happening at the same time", which contradicts the fact that a synchronous call means the caller waits for the service to finish. "Asynchronous" can mean "not happening at the same time", which contradicts the fact that a caller calling an asynchronous interface can run at the same time as the called service. "Low-level". Normally "low-level" means entry-level, junior-level. But in programming "low-level" can mean very deep things involving OS and hardware internals, which require high-level skill. We can use "deep-level" or "infrastructure-level" instead of "low-level" to avoid misunderstanding. Predict. Normally "predict" means figuring out what happens in the future. But in AI, "predict" means estimating something, not necessarily something in the future. For example: "predict masked token", "predict noise". KB, MB, GB. Most commonly, 1 KB = 1024 bytes, 1 MB = 1024 KB, 1 GB = 1024 MB. (Formally these should be written as KiB, MiB, GiB.) In disk manufacturers' descriptions, 1 KB = 1000 bytes, 1 MB = 1000 KB, 1 GB = 1000 MB. In networking speeds, 1 Kbps = 1000 bits per second, 1 Mbps = 1000 Kbps, 1 Gbps = 1000 Mbps. Verbal. Sometimes means spoken words. Sometimes includes both written text and spoken words. "Last" can mean "previous" or "final". Immutable. There are different kinds of "immutable": The referenced object is immutable, and the reference is also immutable. The referenced object is immutable, but the reference itself is mutable. The referenced object is mutable, but the reference itself is immutable. Character. A character in a GUI is a grapheme cluster. Sometimes it means a code point. In C, a char is one byte. In Java, a char is two bytes. Artificial neural networks are "black boxes". All the matrix computations and weights involved in inference and training are white-box. The "black box" here means that the mechanism of why it produces a specific output is not clear.
Although humans can view the weight numbers, it's hard to understand how these weights correspond to "thinking" and "decision making". RAG (retrieval augmented generation). Sometimes it must involve a vector database. Sometimes it involves all kinds of information retrieval methods. Oxymoron naming: Serverless servers. Constant variable. (a constant, but treated as a special kind of variable in the compiler) Unnamed namespaces. (C++ namespaces without a name) Safe unsafe Rust code. (unsafe Rust code wrapped in a safe way) Asynchronous synchronization. (non-blocking replication) Lock-free deadlock. (deadlock can happen in message-passing systems, without any explicit lock) Transparent opacity. (an opacity value that's less than 1.0, making it semi-transparent) Static animation. (the animation data is hard-coded) Unlogged log-in. (the log-in event is not written to a record)
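The "immutable" ambiguity listed above can be illustrated in Python, where a name binding plays the role of the reference. This is an illustrative sketch only; other languages (Java `final`, Rust `mut`, C++ `const`) draw the same distinctions with dedicated keywords.

```python
# Illustrating the kinds of "immutable": the reference (here, a name
# binding) and the referenced object can each be mutable or immutable.

# Object mutable, reference kept fixed by convention:
nums = [1, 2, 3]
nums.append(4)        # mutating the referenced object is allowed
assert nums == [1, 2, 3, 4]

# Object immutable, reference mutable:
point = (0, 0)        # tuples cannot be mutated in place...
point = (1, 1)        # ...but the name can be rebound to a new object
assert point == (1, 1)

# Fully immutable needs both: an immutable object AND no rebinding.
# Python cannot forbid rebinding a name; languages like Java (final)
# or Rust (non-mut bindings) can.
try:
    point[0] = 5      # in-place mutation of a tuple raises TypeError
except TypeError:
    mutation_failed = True
assert mutation_failed
```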

qouteall notes 4 months ago

Pitfalls in API usability

Here API means the generalized concept of "API". In that broader sense, all programming revolves around using "APIs" (and creating "APIs"), and API usability is important to developer productivity. An instruction set is the "API" of a CPU; machine code invokes the "API" of CPUs. Source code invokes the "API" of programming languages. Functions and types are APIs. Networking protocols (IP, TCP, UDP, HTTP, etc.) are the "API" of the internet. RESTful APIs. Data formats and configuration formats are also "APIs". All the contracts and protocols between different parts of software/hardware are "APIs" in the broader sense. Missing documentation details about the exact format of input/output data, or missing examples. The documentation writer, under the curse of knowledge, may assume the user knows, but most users don't. Doesn't provide example usages. Examples are valuable because a working example cannot omit details. Without detailed documentation, developers usually test the API manually to figure out details. Tweaking (tinkering with) a working example makes learning more proactive and efficient. The documentation lacks clarifications. Many words are ambiguous. For example "immutable" can mean 1. the reference is immutable, the referenced object is mutable 2. the reference is mutable, the referenced object is immutable 3. the reference and referenced object are both immutable 4. it's just read-only, and the referenced object can be mutated in other ways ... It is very hard to do manual testing. No simple REPL. Cannot easily set up virtual environments. Cannot easily take and load snapshots. Cannot call from simple commands. Cannot easily undo mistakes made in testing. Cannot easily use curl to test a RESTful API. Lacking debugging and visualization tools. Doesn't allow easily checking internal state or intermediate data. Doesn't allow easy profiling.
An example is using an efficient binary data format instead of a text format, but lacking tools to inspect and edit the binary data (one main advantage of text-based data formats is that they are easy to inspect and edit without format-specific tools). Behavior is unintuitive, causing developers to easily misunderstand it. This can also happen if the behavior deviates from similar APIs of mainstream tools, when most developers are familiar with the mainstream tools. One example is that YAML requires a space after the colon (unlike JSON). Another example is CSS layout. Missing documentation about the traps (wrong ways of using the API). When the API is used wrongly, it silently does nothing (fail-silent) or does unexpected things (undefined behavior), without giving an error. An example is memory management in memory-unsafe languages (already improved by tools such as valgrind). Another example is that a misspelled field name in a JSON config file makes the config ineffective, without any error message, because JSON parsers usually ignore unknown fields. No safety net preventing wrong usage of the API. The common example is memory management in memory-unsafe languages (C/C++). Another common example is data races. Abstraction leakage. You only know how to use it correctly if you understand the implementation details; the abstraction fails to hide complexity. The API changed incompatibly between versions. The online materials may be outdated, and LLMs are trained on outdated material. Doesn't explicitly tell that some configuration is unused or not effective. (Example: for two sets of configuration where one overrides the other, changing the overridden one has no effect.) Error messages are silently put somewhere else (can only be checked using a special command, a special function call, or a special log file). Beginners usually don't know where to see the error message. The error message is vague and doesn't tell which thing is wrong.
Example: only providing an error code that corresponds to many kinds of errors. Sometimes this is caused by not retaining enough runtime metadata: it cannot output a useful error message because the relevant information is missing at runtime. Doesn't report errors early; only reports an error when some functionality is used. This may make some configuration bugs go unnoticed until some condition is met. Doesn't report errors at the correct stage of computing. A wrong configuration in stage 1 may not give an error in stage 1, but gives an error in stage 2 when stage 2 processes invalid data from stage 1, which makes the error message more obscure because the context from stage 1 is lost. The tool does too much "magic" under the hood. The API seems simple but is actually complex. The "magic" sometimes makes things more convenient, but sometimes causes unwanted behavior. Trying to use heuristics to "fix errors". This hides the true error and leaves it unfixed (making the app eventually accumulate many unnoticed errors), and the heuristics cannot fully fix the error and malfunction in some edge cases. Another example is layout in CSS. Most layout-related attributes in CSS are very versatile, and each attribute usually has many side effects. CSS aims to make layout work with very few attributes, but the result is a complex system that's hard to understand. A convenience feature causes a security vulnerability. (e.g. some JSON libraries store the class name in JSON to support polymorphic objects, but trusting a class name from the user is insecure.) Too many downstream errors hiding the root error. An example is log spam in a log file, where only the first error is meaningful and all subsequent spam errors are side effects of the first. In C++, if you use an STL container wrongly, there may be a spam of compiler errors pointing into STL code, hiding the root error. The API becomes complex to accommodate special custom usages, making common simple usage harder and more complex.
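The misspelled-config-field failure mode described above can be reproduced and guarded against with an explicit key check. The config schema below (`timeout_seconds`, `retries`) is made up for illustration; the point is that `json.loads` itself accepts unknown keys without complaint.

```python
import json

KNOWN_KEYS = {"timeout_seconds", "retries"}  # hypothetical config schema

def load_config(text: str) -> dict:
    raw = json.loads(text)
    # json.loads happily accepts unknown keys, so a typo like
    # "timout_seconds" would be silently ignored without this check.
    unknown = set(raw) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return raw

print(load_config('{"timeout_seconds": 30}'))  # accepted
try:
    load_config('{"timout_seconds": 30}')      # typo now fails loudly
except ValueError as e:
    print(e)
```

Rejecting unknown keys turns a fail-silent bug into an error at load time, the earliest possible stage.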
The API is too simple to accommodate special custom usage. Doing special custom usage requires complex and error-prone hacking (relying on internal implementation instead of the public API). Provides two sets of APIs (such as an old-version API and a new-version API, or a simple-but-slow API and a complex-but-fast API), but the two sets have complex interactions under the hood, and using both causes weird behaviors. Lack of isolation and orthogonality: changing one thing affects another thing that's seemingly unrelated. An example is layout in CSS. Having strict constraints that make prototyping hard. In Rust, changing a data structure may involve huge refactoring (adding or removing lifetime parameters in every usage, replacing a reference with Arc, etc. See also ). These constraints can help correctness and make reviewing PRs easier, but they hinder prototyping and iteration. It's a tradeoff. The default API usage makes it easy to use inefficiently. Example: directly passing a regular expression string as an argument causes the regular expression to be parsed on every call (can be mitigated by underlying caching). Sacrificing usability for seeming correctness. An example is Windows's file system, where you cannot move or delete a file that's being used. This seemingly helps correctness, but it makes software upgrades harder: in Windows, software upgrading is error-prone when other software is reading its files, and you can only safely upgrade via rebooting. Also, foreign keys help correctness but make backup loading and schema migration harder. The API was designed without caring about performance, and cannot be optimized in the future without breaking compatibility. The API overly focuses on security, making simple things harder to do. The feedback loop is long. Example: after changing the code, the developer has to wait for a slow CI/CD pipeline to see the effect on the website.
The long feedback loop makes working inefficient, consumes more patience, and makes the developer retain less temporal memory. A good example of a short feedback loop is hot reloading. An LLM hallucinates an important nuanced assumption, causing the developer to misunderstand the API's assumption, then waste a lot of time debugging without questioning it. Order-dependent setup and fragility to mis-ordering: getting one order wrong causes it to break. This is especially hard to deal with in concurrent or distributed systems, where the order is influenced by random factors, causing unreproducible random errors. Duplicated configuration. When a configuration is duplicated 3 times, changing it requires changing all 3 places. Multi-source configuration. For example, one option can be changed globally, changed locally, inherited from a parent, changed by type, etc. One example is CSS. Although it seems convenient, when one configuration is wrong, it's hard to track where the wrong config value comes from. Overly flexible config files. A config file is a plain text file that does not support rich features provided by a normal programming language, such as variables, conditions and repetition. Trying to make the config file more flexible and expressive eventually turns it into a DSL that's hard to use (existing debugging and logging tools cannot be used on it, existing libraries cannot be used with it, and it usually lacks IDE support). Having to maintain consistency between the data managed by a library and the data managed by your code. Each one can update the other (no single source of truth). If the two pieces of data are not kept in sync, weird issues will happen. The library provides the functionality except for an important detail, so you cannot use the library and have to re-implement it. (Example: fine-grained text layout control is hard to do in HTML/CSS, so a lot of web apps are forced to do in-canvas rendering for all text.)
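The regex example above (parsing the pattern on every call versus once) can be sketched as follows. Python's `re` module does cache recently used patterns internally, which mitigates the cost, but explicit precompilation makes the one-time parsing cost visible and independent of cache limits.

```python
import re

def is_hex_naive(s: str) -> bool:
    # The pattern string is handed to the re module on every call;
    # only the module's internal cache avoids re-parsing it each time.
    return re.fullmatch(r"[0-9a-fA-F]+", s) is not None

# Compile once at module load: the parsing cost is paid a single time.
HEX_RE = re.compile(r"[0-9a-fA-F]+")

def is_hex(s: str) -> bool:
    return HEX_RE.fullmatch(s) is not None

assert is_hex_naive("deadBEEF") and is_hex("deadBEEF")
assert not is_hex("xyz")
```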

qouteall notes 5 months ago

Some Statistics Knowledge

What's the essence of probability? There are two views: Probability is related to sampling assumptions. Example: Bertrand Paradox: there are many ways to randomly select a chord on a circle, with different probability densities of chords. A distribution tells how likely a random variable is to take each value. Independent means that two random variables don't affect each other: knowing one doesn't affect the distribution of the other. But there are dependent random variables where, when you know one, the distribution of the other changes. $P(X=x)$ means the probability that random variable $X$ takes value $x$. It can also be written as $P_X(x)$ or $P(X)$. Sometimes the probability density function $f$ is used to represent a distribution. A joint distribution tells how likely each combination of multiple variables is. For a joint distribution of X and Y, each outcome is a pair, denoted $(X, Y)$. If X and Y are independent, then $P(X=x, Y=y) = P((X,Y)=(x,y)) = P(X=x) \cdot P(Y=y)$. For a joint distribution of $(X, Y)$, if we only care about X, the distribution of X alone is called the marginal distribution. You can only add probabilities when two events are mutually exclusive. You can only multiply probabilities when two events are independent, or when multiplying a conditional probability by the condition's probability. $P(E \mid C)$ means the probability of $E$ happening given that $C$ happens. $P(E \cap C)$ (E and C both happen) satisfies $P(E \cap C) = P(E \mid C) \cdot P(C)$. If E and C are independent, then $P(E \cap C) = P(E)P(C)$, so $P(E \mid C) = P(E)$. For example, there is a medical testing method for a disease.
The test result can be positive (indicating having the disease) or negative. But the test is not always accurate. There are two random variables: whether the test result is positive, and whether the person actually has the disease. This is a joint distribution. The 4 cases $a, b, c, d$ are the four probabilities, with $a + b + c + d = 1$. For that distribution, there are two marginal distributions. If we only care about whether the person actually has the disease and ignore the test result, we get one marginal distribution. Similarly, there is also a marginal distribution of whether the test result is positive. The false negative rate is $P(\text{test is negative} \mid \text{actually has disease})$: the rate of negative tests among people who actually have the disease. The false positive rate is $P(\text{test is positive} \mid \text{actually doesn't have disease})$. Some people may intuitively think the false negative rate means $P(\text{test result is false} \mid \text{test is negative})$, which equals $P(\text{actually has disease} \mid \text{test is negative}) = \frac{b}{b+d}$, but that's not the official definition of false negative rate. Bayes' theorem allows "reversing" $P(A \mid B)$ into $P(B \mid A)$: $P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}$. The theoretical mean is the weighted average of all possible cases using theoretical probabilities. $E[X]$ denotes the theoretical mean of random variable $X$, also called the expected value of $X$. It's also often denoted as $\mu$.
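The medical test example can be made concrete with assumed numbers. The 1% prevalence, 5% false positive rate, and 10% false negative rate below are invented purely for illustration; the cell labels follow the convention where $b$ is "has disease, test negative" and $d$ is "healthy, test negative".

```python
# Joint distribution of (actually has disease, test is positive),
# with invented rates: 1% prevalence, 10% false negative, 5% false positive.
p_disease = 0.01
p_pos_given_disease = 0.90       # 1 - false negative rate
p_pos_given_healthy = 0.05       # false positive rate

# The four joint probabilities; they must sum to 1.
a = p_disease * p_pos_given_disease              # sick, test positive
b = p_disease * (1 - p_pos_given_disease)        # sick, test negative
c = (1 - p_disease) * p_pos_given_healthy        # healthy, test positive
d = (1 - p_disease) * (1 - p_pos_given_healthy)  # healthy, test negative
assert abs(a + b + c + d - 1) < 1e-12

# Bayes: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_disease_given_pos = a / (a + c)
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ≈ 0.154
```

With a rare disease, even a fairly accurate test leaves most positive results false, which is exactly the "reversal" that Bayes' theorem quantifies.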
For the discrete case, $E[X]$ is calculated by summing all theoretically possible values multiplied by their theoretical probabilities: $E[X] = \sum_x x \cdot P(X=x)$. The mean for the continuous case: $E[X] = \int x f(x) \, dx$. Some rules related to the mean: $E[kX] = kE[X]$, $E[X + k] = E[X] + k$, and $E[X+Y] = E[X] + E[Y]$. (The constant $k$ doesn't necessarily need to be globally constant. It just needs to be a value that's not affected by the random outcome: "constant in context".) Another important rule is that if $X$ and $Y$ are independent, then $E[XY] = E[X] \cdot E[Y]$, because when $X$ and $Y$ are independent, $P(X=x_i, Y=y_j) = P(X=x_i) \cdot P(Y=y_j)$. Note that $E[X+Y]=E[X]+E[Y]$ always works regardless of independence, but $E[XY]=E[X]E[Y]$ requires independence. For a sum, a common factor that doesn't depend on the sum index can be extracted out, so $\sum_i \sum_j f(i) \cdot g(j) = \sum_i \left( f(i) \sum_j g(j) \right) = \left(\sum_i f(i)\right)\left(\sum_j g(j)\right)$. Then: $E[XY] = \sum_i \sum_j x_i y_j P(X=x_i) P(Y=y_j) = \left(\sum_i x_i P(X=x_i)\right)\left(\sum_j y_j P(Y=y_j)\right) = E[X]E[Y]$. (That's for the discrete case. The continuous case is similar.) If we have $n$ samples of $X$, denoted $X_1, X_2, \ldots, X_n$, each sample is a random variable, each sample is independent of the others, and each sample is taken from the same distribution (independently and identically distributed, i.i.d.), then we can estimate the theoretical mean by calculating the average. The estimated mean is denoted as $\hat{\mu}$ (mu hat): $\hat{\mu} = \frac{1}{n}\sum_i X_i$. The hat $\hat{}$ means it's an empirical value calculated from samples, not the theoretical value. Some important clarifications: the mean of the estimated mean equals the theoretical mean, $E[\hat{\mu}] = \mu$.
Note that if the samples are not independent of each other, or they are taken from different distributions, then the estimation will possibly be biased. The theoretical variance, $\text{Var}[X]$, also denoted as $\sigma^2$, measures how "spread out" the samples are: $\text{Var}[X] = E[(X - E[X])^2]$. If $k$ is a constant: $\text{Var}[kX] = k^2 \text{Var}[X]$ and $\text{Var}[X + k] = \text{Var}[X]$. The standard deviation (stdev) $\sigma$ is the square root of the variance. Multiplying a random variable by a constant also multiplies the standard deviation. The covariance $\text{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])]$ measures the "joint variability" of two random variables $X$ and $Y$. Some rules related to variance: $\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y] + 2\,\text{Cov}[X, Y]$. If $X$ and $Y$ are independent, then as previously mentioned $E[XY] = E[X] \cdot E[Y]$, so $\text{Cov}[X, Y] = 0$ and $\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y]$. The mean is sometimes also called location. The variance is sometimes called dispersion. If we have some i.i.d. samples but don't know the theoretical variance, how do we estimate the variance? If we know the theoretical mean $\mu$, then it's simple: $\hat{\sigma}^2 = \frac{1}{n}\sum_i (X_i - \mu)^2$. However, the theoretical mean is different from the estimated mean. If we don't know the theoretical mean and use the estimated mean instead, the estimate will be biased, and we need to divide by $n-1$ instead of $n$ to avoid bias: $\hat{\sigma}^2 = \frac{1}{n-1}\sum_i (X_i - \hat{\mu})^2$. This is called Bessel's correction. Note that the more i.i.d. samples you have, the smaller the bias, so with many i.i.d. samples the bias doesn't matter in practice. Originally, $n$ samples have $n$ degrees of freedom. If we keep the estimated mean fixed, they only have $n-1$ degrees of freedom. That's an intuitive explanation of the correction. The exact deduction of the correction is trickier. Firstly, the estimated mean itself also has variance, as each sample is independent of the other samples.
As previously mentioned, if $X$ and $Y$ are independent, adding the variables also adds the variances: $\text{Var}[X + Y] = \text{Var}[X] + \text{Var}[Y]$. So: $\text{Var}[\hat{\mu}] = \text{Var}\left[\frac{1}{n}\sum_i X_i\right] = \frac{1}{n^2}\sum_i \text{Var}[X_i] = \frac{\sigma^2}{n}$. As previously mentioned $E[\hat{\mu}] = \mu$, so $\text{Var}[\hat{\mu}] = E[(\hat{\mu} - E[\hat{\mu}])^2] = E[(\hat{\mu} - \mu)^2] = \frac{\sigma^2}{n}$. This will be used later. A trick is to rewrite $X_i - \hat{\mu}$ as $(X_i - \mu) - (\hat{\mu} - \mu)$ and then expand: $\sum_i (X_i - \hat{\mu})^2 = \sum_i (X_i - \mu)^2 - 2(\hat{\mu} - \mu)\sum_i (X_i - \mu) + n(\hat{\mu} - \mu)^2$. Then take the mean of both sides. There are now three terms. The first one's mean equals $n\sigma^2$: $E\left[\sum_i (X_i - \mu)^2\right] = n\sigma^2$. Since $\sum_i (X_i - \mu) = n(\hat{\mu} - \mu)$, the second one becomes $-2n(\hat{\mu} - \mu)^2$. The three terms combine into $E\left[\sum_i (X_i - \hat{\mu})^2\right] = n\sigma^2 - 2nE[(\hat{\mu} - \mu)^2] + nE[(\hat{\mu} - \mu)^2] = n\sigma^2 - nE[(\hat{\mu} - \mu)^2]$. $E[(\hat{\mu} - \mu)^2]$ is also $\text{Var}[\hat{\mu}]$. As previously mentioned, it equals $\frac{\sigma^2}{n}$, so $E\left[\sum_i (X_i - \hat{\mu})^2\right] = n\sigma^2 - \sigma^2 = (n-1)\sigma^2$, which is why dividing by $n-1$ gives an unbiased estimate. For a random variable $X$, if we know its mean $\mu$ and standard deviation $\sigma$, then we can "standardize" it so that its mean becomes 0 and its standard deviation becomes 1: $Z = \frac{X - \mu}{\sigma}$. That's called the Z-score or standard score. Often the theoretical mean and theoretical standard deviation are unknown, so the Z-score is computed using the sample mean and sample stdev: $Z_i = \frac{X_i - \hat{\mu}}{\hat{\sigma}}$. In deep learning, normalization uses the Z-score. Note that in layer normalization and batch normalization, the variance usually divides by $n$ instead of $n-1$. Computing the Z-score for a vector can also be seen as a projection: subtracting the mean gives $\boldsymbol y = \boldsymbol x - \hat{\mu}\boldsymbol 1$, with $\boldsymbol\sigma^2 = \frac{1}{n}|\boldsymbol y|^2$ and $\boldsymbol\sigma = \frac{1}{\sqrt{n}}|\boldsymbol y|$ (or dividing by $n-1$ and $\sqrt{n-1}$).
Dividing by the standard deviation can be seen as projecting onto the unit sphere then multiplying by $\sqrt{n}$ (or $\sqrt{n-1}$). So computing the Z-score can be seen as first projecting onto a hyperplane orthogonal to $\boldsymbol 1$, then projecting onto the unit sphere and multiplying by $\sqrt{n}$ (or $\sqrt{n-1}$). Skewness measures which side has more extreme values: $\text{Skew}[X] = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$. A large positive skew means there is a fat tail on the positive side (may have extreme positive values). A large negative skew means a fat tail on the negative side (may have extreme negative values). If the two sides are symmetric, the skew is 0, regardless of how fat the tails are. Gaussian distributions are symmetric, so they have zero skew. Note that an asymmetric distribution can also have 0 skewness. There is a concept called moments that unifies mean, variance, skewness and kurtosis: the $k$-th central moment is $\mu_k = E[(X - \mu)^k]$. There is an unbiased way to estimate the third central moment $\mu_3$. The deduction of the unbiased third central moment estimator is similar to Bessel's correction, but trickier. A common way of estimating skewness from i.i.d. samples is to take the unbiased third central moment estimator and divide it by the cube of the unbiased estimator of the standard deviation. But the result is still biased, as $E[\frac{X}{Y}]$ doesn't necessarily equal $\frac{E[X]}{E[Y]}$. Unfortunately, there is no completely unbiased way to estimate skewness from i.i.d. samples (unless you have other assumptions about the underlying distribution). The bias gets smaller with more i.i.d. samples. Larger kurtosis means a fatter tail: $\text{Kurt}[X] = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]$. The more extreme values a distribution has, the higher its kurtosis. Gaussian distributions have a kurtosis of 3. Excess kurtosis is the kurtosis minus 3.
A common way of estimating excess kurtosis from i.i.d. samples is to take the unbiased estimator of the fourth cumulant ($E[(X - E[X])^4] - 3\text{Var}[X]^2$) and divide it by the square of the unbiased estimator of the variance. It's still biased. If we have some independent samples of $X$, we can estimate the mean $E[X]$ by calculating the average $\hat{E}[X] = \frac{1}{n}\sum_i X_i$. The variance of the calculated average is $\frac{1}{n}\text{Var}[X]$, which reduces with more samples. However, if the variance of $X$ is large and the number of samples is small, the average will have a large variance, and the estimated mean will be inaccurate. We can make the estimation more accurate by using a control variate: suppose we have another random variable $Y$, correlated with $X$, whose mean $E[Y]$ is known. Then we can estimate $E[X]$ using $\hat{E}[X + \lambda(Y - E[Y])]$, where $\lambda$ is a constant. By choosing the right $\lambda$, the estimator can have lower variance than just calculating the average of $X$. The $Y$ here is called a control variate. Some previous knowledge: $E[\hat{E}[A]] = E[A]$, $\text{Var}[\hat{E}[A]] = \frac{1}{n}\text{Var}[A]$.
The mean of that estimator is $E[X]$, meaning the estimator is unbiased: $E[\hat{E}[X + \lambda(Y - E[Y])]] = E[X] + \lambda(E[Y] - E[Y]) = E[X]$. Then calculate the variance of the estimator (subtracting the constant $\lambda E[Y]$ doesn't change the variance): $\text{Var}[\hat{E}[X + \lambda(Y - E[Y])]] = \frac{1}{n}\text{Var}[X + \lambda Y] = \frac{1}{n}(\text{Var}[X] + \text{Var}[\lambda Y] + 2\,\text{cov}[X, \lambda Y]) = \frac{1}{n}(\text{Var}[X] + \lambda^2\text{Var}[Y] + 2\lambda\,\text{cov}[X, Y])$. We want to minimize the variance of the estimator by choosing $\lambda$: find the $\lambda$ that minimizes $\text{Var}[Y]\lambda^2 + 2\,\text{cov}[X, Y]\lambda$. Quadratic function knowledge tells us $ax^2 + bx + c \ \ (a > 0)$ is minimized at $x = \frac{-b}{2a}$, so the optimal lambda is $\lambda = -\frac{\text{cov}[X, Y]}{\text{Var}[Y]}$. And by using that optimal $\lambda$, the variance of the estimator is $\frac{1}{n}\left(\text{Var}[X] - \frac{\text{cov}[X, Y]^2}{\text{Var}[Y]}\right)$. If X and Y are correlated, then $\frac{\text{cov}[X, Y]^2}{\text{Var}[Y]} > 0$, so the new estimator has smaller variance and is more accurate than the simple one. The larger the correlation, the better it can be. Information entropy measures the expected amount of information in a sample from a distribution. If we want to measure the amount of information of a specific event $E$, denoted $I(E)$, there are 3 axioms: an event with probability 1 carries no information; a less probable event carries more information; and the information of two independent events adds, $I(E_1 \cap E_2) = I(E_1) + I(E_2)$. According to the three axioms, the definition of $I$ (self-information) is $I(E) = \log_b \frac{1}{P(E)}$. The base $b$ determines the unit. We often use bits as the unit of the amount of information.
An event with 50% probability carries 1 bit of information when the base is 2:

$$I(E) = \log_2 \frac{1}{P(E)}, \qquad \log_2 \frac{1}{0.5} = 1$$

Then, for a distribution, the expected amount of information of one sample is the expected value of $I(E)$. That defines information entropy $H$. In the discrete case:

$$H(X) = \sum_x P(x) \log \frac{1}{P(x)}$$

If there exists $x$ where $P(x) = 0$, it can be ignored in the entropy calculation, since $\lim_{x \to 0} x \log x = 0$. Information entropy in the discrete case is never negative.

In the continuous case, where $f$ is the probability density function, this is called differential entropy:

$$H(X) = \int_{\mathbb{X}} f(x) \log \frac{1}{f(x)} dx$$

($\mathbb{X}$ means the set of $x$ where $f(x) \neq 0$, also called the support of $f$.) In the continuous case the base is often $e$ rather than 2. Here $\log$ by default means $\log_e$.

In the discrete case, $0 \leq P(x) \leq 1$, so $\log \frac{1}{P(x)} \geq 0$ and the entropy can never be negative. But in the continuous case, the probability density function can take values larger than 1, so differential entropy may be negative.

If X and Y are independent, then $H((X,Y))=E[I((X,Y))]=E[I(X)+I(Y)]=E[I(X)]+E[I(Y)]=H(X)+H(Y)$. If one fair coin toss has 1 bit of entropy, then n independent tosses have n bits of entropy.

If I split one case into two cases, entropy increases. If I merge two cases into one case, entropy decreases.
This is because (for $p_1 \neq 0, p_2 \neq 0$):

$$p_1\log \frac{1}{p_1} + p_2\log \frac{1}{p_2} > (p_1+p_2) \log \frac{1}{p_1+p_2}$$

which follows from $f(x)=\log \frac{1}{x}$ being strictly decreasing: since $p_i < p_1+p_2$, each term satisfies $\log \frac{1}{p_i} > \log \frac{1}{p_1+p_2}$, so $\frac{p_1}{p_1+p_2}\log\frac{1}{p_1}+\frac{p_2}{p_1+p_2}\log\frac{1}{p_2}>\log\frac{1}{p_1+p_2}$; multiplying both sides by $p_1+p_2$ gives the result above.

The information entropy is the theoretical minimum of the amount of information required to encode a sample. For example, to encode the result of a fair coin toss, we use 1 bit: 0 for heads and 1 for tails (the reverse also works). If the coin is biased toward heads, we can compress: use 0 for two consecutive heads, 10 for one head, 11 for one tail, which requires fewer bits per sample on average. That scheme may not be optimal, but even the optimal lossless compression cannot do better than the information entropy.

In the continuous case, if $k$ is a positive constant, scaling adds $\log k$ to the entropy: $H(kX) = H(X) + \log k$. Entropy is invariant to offsets of the random variable: $H(X+k)=H(X)$.

A joint distribution of X and Y is a distribution where each outcome is a pair of X and Y. Its entropy is called joint information entropy. Here I will use $H((X,Y))$ to denote joint entropy (to avoid confusion with cross entropy).
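The definition and the split/merge claim are easy to check numerically (an illustrative sketch in bits, using only Python's standard library):

```python
import math

def entropy_bits(probs):
    # H = sum p * log2(1/p); terms with p = 0 are skipped (their limit is 0).
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
print(entropy_bits(fair_coin))       # 1.0 bit

# Splitting the second case into two equal halves increases entropy.
split = [0.5, 0.25, 0.25]
print(entropy_bits(split))           # 1.5 > 1.0; merging goes the other way
```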
If I fix the value of Y as $y$ and look at the conditional distribution of X, its entropy is:

$$H(X \mid Y=y) = \sum_x P(x \mid y) \log \frac{1}{P(x \mid y)}$$

Taking the mean of that over different values of Y, we get conditional entropy:

$$H(X \mid Y) = \sum_y P(y)\, H(X \mid Y=y) = \sum_{x,y} P((x,y)) \log \frac{1}{P(x \mid y)}$$

(the second form applies the conditional probability rule $P((X,Y)) = P(X \vert Y) P(Y)$).

Similarly to that probability rule, entropy satisfies $H((X,Y))=H(X \vert Y)+H(Y)$. The exact deduction:

$$H((X,Y)) = \sum_{x,y} P((x,y)) \log \frac{1}{P(x \mid y) P(y)} = \sum_{x,y} P((x,y)) \log \frac{1}{P(x \mid y)} + \sum_{y} P(y) \log \frac{1}{P(y)} = H(X \mid Y) + H(Y)$$

If $X$ and $Y$ are not independent, then the joint entropy is smaller than if they were independent: $H((X, Y)) < H(X) + H(Y)$. Intuitively, if X and Y are not independent, knowing X already gives some information about Y. This can be deduced via mutual information, explained below.

Here $I_A(x)$ denotes the amount of information of the value (event) $x$ in distribution $A$. The difference of information of the same value in two distributions $A$ and $B$ is:

$$I_B(x) - I_A(x) = \log\frac{1}{P_B(x)} - \log\frac{1}{P_A(x)} = \log \frac{P_A(x)}{P_B(x)}$$

The KL divergence from $A$ to $B$ is the expected value of that difference under $A$'s probabilities. Here $E_A$ means the expected value calculated using $A$'s probabilities:

$$D_{KL}(A \parallel B) = E_A\left[\log \frac{P_A(x)}{P_B(x)}\right] = \sum_x P_A(x) \log \frac{P_A(x)}{P_B(x)}$$

You can think of KL divergence as the average extra information needed to encode a sample from $A$ using a code optimized for $B$. KL divergence is also called relative entropy.

KL divergence is asymmetric: $D_{KL}(A\parallel B)$ is generally different from $D_{KL}(B\parallel A)$. Often the first distribution is the real underlying distribution, and the second is an approximation or a model's output. If A and B are the same, the KL divergence between them is zero; otherwise, KL divergence is positive. KL divergence can never be negative, as will be explained later.

$P_B(x)$ appears in the denominator.
If there exists $x$ with $P_B(x) = 0$ and $P_A(x) \neq 0$, then the KL divergence is infinite. It can be read as "the model expects something to never happen, but it actually can happen". If there is no such case, we say that A is absolutely continuous with respect to B, written $A \ll B$. This requires B's possible outcomes to include all of A's possible outcomes.

Another concept is cross entropy. The cross entropy from A to B, denoted $H(A, B)$, is the entropy of A plus the KL divergence from A to B:

$$H(A, B) = H(A) + D_{KL}(A \parallel B) = E_A\left[\log \frac{1}{P_B(x)}\right]$$

Information entropy $H(X)$ can also be expressed as the cross entropy of a distribution with itself, $H(X, X)$, similar to the relation between variance and covariance. (In some places $H(A,B)$ denotes joint entropy. I use $H((A,B))$ for joint entropy to avoid ambiguity.) Cross entropy is also asymmetric.

In deep learning, cross entropy is often used as a loss function. If each piece of training data's target distribution has a fixed entropy $H(A)$, minimizing cross entropy is the same as minimizing KL divergence.

Jensen's inequality states that for a concave function $f$:

$$E[f(X)] \leq f(E[X])$$

The reverse holds for convex functions.

Here is a visual example showing Jensen's inequality. Suppose I have a discrete distribution with 5 cases $X_1,X_2,X_3,X_4,X_5$ (these are the possible outcomes of the distribution, not samples), corresponding to the X coordinates of the red dots. The probabilities of the 5 cases are $p_1, p_2, p_3, p_4, p_5$, which sum to 1. Then $E[X] = p_1 X_1 + p_2 X_2 + p_3 X_3 + p_4 X_4 + p_5 X_5$.
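The relation $H(A,B) = H(A) + D_{KL}(A \parallel B)$ and the asymmetry of both quantities can be verified for two small discrete distributions (an illustrative sketch; the probabilities are arbitrary):

```python
import math

def entropy(p):
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0)

def kl(p, q):
    # D_KL(P || Q); requires q_i > 0 wherever p_i > 0 (P absolutely continuous w.r.t. Q).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    return sum(pi * math.log(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

A = [0.7, 0.2, 0.1]
B = [0.5, 0.3, 0.2]

print(cross_entropy(A, B), entropy(A) + kl(A, B))   # the two values agree
print(kl(A, B), kl(B, A))                           # both positive, not equal
```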
$E[f(X)] = p_1 f(X_1) + p_2 f(X_2) + p_3 f(X_3) + p_4 f(X_4) + p_5 f(X_5)$. Then $(E[X], E[f(X)])$ can be seen as an interpolation between the five points $(X_1, f(X_1)), ..., (X_5, f(X_5))$, using weights $p_1, ..., p_5$. The possible region of the interpolated point is the green convex polygon. For each point $(E[X], E[f(X)])$ in the green polygon, the point on the function curve with the same X coordinate, $(E[X], f(E[X]))$, is above it. So $E[f(X)] \leq f(E[X])$.

The same applies when you add more cases to the discrete distribution: the convex polygon gains more vertices but stays below the function curve. The same applies to continuous distributions, where there are infinitely many cases.

Jensen's inequality tells us that KL divergence is non-negative. The trick is to work with $-D_{KL}$, which puts $P_A$ in the denominator so that it cancels later:

$$-D_{KL}(A \parallel B) = E_A\left[\log \frac{P_B(x)}{P_A(x)}\right]$$

The logarithm function is concave, so Jensen's inequality gives:

$$E_A\left[\log \frac{P_B(x)}{P_A(x)}\right] \leq \log E_A\left[\frac{P_B(x)}{P_A(x)}\right]$$

Multiplying by $-1$ and flipping:

$$D_{KL}(A \parallel B) \geq -\log E_A\left[\frac{P_B(x)}{P_A(x)}\right]$$

The right side equals 0 because:

$$E_A\left[\frac{P_B(x)}{P_A(x)}\right] = \sum_x P_A(x) \frac{P_B(x)}{P_A(x)} = \sum_x P_B(x) = 1$$

Then how do we estimate the KL divergence $D_{KL}(A \parallel B)$ from samples?
Reference: Approximating KL Divergence

As KL divergence is $E_A\left[\log \frac{P_A(x)}{P_B(x)}\right]$, the simple way is to draw samples $x \sim A$ and average $\log \frac{P_A(x)}{P_B(x)}$. However, that estimate may be negative in some cases, while the true KL divergence can never be negative. This can cause issues.

A better way to estimate KL divergence is to average:

$$\log \frac{P_A(x)}{P_B(x)} + \left(\frac{P_B(x)}{P_A(x)} - 1\right)$$

($P_A(x) = 0$ is impossible because $x$ is sampled from A.) Each term is non-negative, and the estimator is unbiased. The $\frac{P_B(x)}{P_A(x)}-1$ is a control variate, negatively correlated with $\log \frac{P_A(x)}{P_B(x)}$.

Recall control variates: to estimate $E[X]$ from samples more accurately, we find another variable $Y$ that is correlated with $X$ and whose theoretical mean $E[Y]$ we know, then use $\hat E[X+\lambda Y] - \lambda E[Y]$ to estimate $E[X]$. The parameter $\lambda$ is normally chosen to minimize variance.

The mean of this control variate is zero, because:

$$E_{x \sim A}\left[\frac{P_B(x)}{P_A(x)}-1\right]=\sum_x P_A(x) \left(\frac{P_B(x)}{P_A(x)}-1\right)=\sum_x (P_B(x) - P_A(x)) =\sum_x P_B(x) - \sum_x P_A(x)=0$$

Here $\lambda=1$ is not chosen by minimizing variance, but by making the estimator non-negative.
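A Monte Carlo sketch comparing the two estimators (not from the original text; the two Gaussians are an arbitrary illustrative choice, and the true KL divergence between $N(0,1)$ and $N(0.5,1)$ is $\mu^2/2 = 0.125$):

```python
import numpy as np

rng = np.random.default_rng(0)

# A = N(0, 1), B = N(0.5, 1); true D_KL(A || B) = 0.5**2 / 2 = 0.125.
def log_pdf(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

x = rng.normal(0.0, 1.0, 200_000)                 # samples from A
log_ratio = log_pdf(x, 0.0) - log_pdf(x, 0.5)     # log(P_A / P_B) per sample

naive = log_ratio                                 # can be negative per sample
k = np.exp(-log_ratio)                            # k = P_B / P_A
better = k - 1 - np.log(k)                        # >= 0 per sample (up to rounding)

print(naive.mean(), better.mean())                # both near 0.125 (unbiased)
print(naive.var(), better.var())                  # the second has much lower variance
```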
If I define $k=\frac{P_B(x)}{P_A(x)}$, then $\log \frac{P_A(x)}{P_B(x)} + \lambda\left(\frac{P_B(x)}{P_A(x)} - 1\right) = -\log k + \lambda(k-1)$. We want this to be non-negative for all $k>0$: $-\log k + \lambda(k-1) \geq 0$ means the line $y=\lambda (k-1)$ must stay above $y=\log k$. The only solution is $\lambda=1$, where the line is the tangent line of $\log k$ at $k=1$.

If X and Y are independent, then $H((X,Y))=H(X)+H(Y)$. But if X and Y are not independent, knowing X reduces the uncertainty of Y, so $H((X,Y))<H(X)+H(Y)$. Mutual information $I(X;Y)$ measures how "related" X and Y are.

For a joint distribution, if we only care about X, the distribution of X alone is a marginal distribution; likewise for Y. Now consider a "fake" joint distribution built as if X and Y were independent. Denote that "fake" joint distribution as $Z$, so $P(Z=(x,y))=P(X=x)P(Y=y)$. It's called the "outer product of the marginal distributions", because its probability matrix is the outer product of the two marginal distributions, so it's denoted $X \otimes Y$.

Mutual information can then be expressed as the KL divergence from the real joint distribution $(X, Y)$ to that "fake" joint distribution $X \otimes Y$:

$$I(X;Y) = D_{KL}\left((X,Y) \parallel X \otimes Y\right)$$

KL divergence is zero when the two distributions are the same, and positive when they are not.
So $H((X,Y))=H(X)+H(Y)-I(X;Y)$, and if X and Y are not independent then $I(X;Y)>0$, so $H((X,Y))<H(X)+H(Y)$.

Mutual information is symmetric: $I(X;Y)=I(Y;X)$.

Since $H((X,Y)) = H(X \vert Y) + H(Y)$, we get $I(X;Y) = H(X) + H(Y) - H((X,Y)) = H(X) - H(X \vert Y)$. If knowing Y completely determines X, then knowing Y collapses the distribution of X to one case with 100% probability, so $H(X \vert Y) = 0$ and $I(X;Y)=H(X)$.

Some places use the correlation coefficient $\frac{\text{Cov}[X,Y]}{\sqrt{\text{Var}[X]\text{Var}[Y]}}$ to measure how related two variables are, but it is inaccurate in non-linear cases. Mutual information is a more accurate measure of dependence.

Information Bottleneck theory says that training a neural network learns an intermediary representation that keeps high mutual information with the prediction target while discarding mutual information with the parts of the input that are irrelevant to the target.

If we have two independent random variables X and Y, and consider the distribution of the sum $Z=X+Y$, then:

$$P(Z=z) = \sum_{x+y=z} P(X=x)P(Y=y)$$

For each z, it sums over the different x and y satisfying the constraint $z=x+y$. The constraint allows determining $y$ from $x$ and $z$ ($y=z-x$), so it can be rewritten as:

$$P(Z=z) = \sum_{x} P(X=x)P(Y=z-x)$$

In the continuous case, the probability density function of the sum, $f_Z$, is the convolution of $f_X$ and $f_Y$:

$$f_Z(z) = \int f_X(x) f_Y(z-x)\, dx = (f_X * f_Y)(z)$$

The convolution operator $*$ is commutative and associative. Convolution can also work in 2D or more dimensions.
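For the discrete sum rule, two fair dice are the classic example (an illustrative sketch assuming numpy): `np.convolve` computes exactly the sum $\sum_x P(X=x)P(Y=z-x)$ over all offsets.

```python
import numpy as np

# Distribution of one fair six-sided die: P(X = i) = 1/6 for i = 1..6.
die = np.full(6, 1.0 / 6.0)

# P(Z = z) = sum_x P(X = x) P(Y = z - x) is a discrete convolution.
two_dice = np.convolve(die, die)   # 11 outcomes for the sums 2..12

print(two_dice)                    # peaks at a sum of 7, with probability 6/36
```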
If $X=(x_1,x_2)$ and $Y=(y_1,y_2)$ are 2D random variables (two joint distributions), then the distribution of $Z=X+Y=(z_1,z_2)$ is the 2D convolution of X's and Y's distributions:

$$P(Z=(z_1,z_2)) = \sum_{x_1,x_2} P(X=(x_1,x_2))\, P(Y=(z_1-x_1, z_2-x_2))$$

Convolution also works in cases where the values are not probabilities. Convolutional neural networks use a discrete version of convolution on matrices.

Normally when talking about probability we mean the probability of an outcome under a modelled distribution: $P(\text{outcome} \mid \text{modelled distribution})$. But sometimes we have some concrete samples from a distribution and want to know which model fits best, so we talk about how plausible a model is given the samples: $P(\text{modelled distribution} \mid \text{outcome})$.

If I have some samples, then some parameter values make the samples more likely to come from the modelled distribution, and some make them less likely. For example, if I model a coin flip using a parameter $\theta$, and I observe 10 coin flips with 9 heads and 1 tail, then $\theta=0.9$ is more plausible than $\theta=0.5$. That's straightforward for a simple model, but for more complex models we need a way to measure it: likelihood.

Likelihood $L(\theta \vert x_1,x_2,...,x_n)$ measures how plausible the parameter $\theta$ is given the observed samples; numerically, it is the probability of observing those samples under the model with parameter $\theta$.

For example, if I model a coin flip distribution using a parameter $\theta$, where the probability of heads is $\theta$ and tails is $1-\theta$, and I observe 10 coin flips with 9 heads and 1 tail, then the likelihood of $\theta$ is:

$$L(\theta \mid \text{9 heads, 1 tail}) = \theta^9 (1-\theta)$$

The more plausible a parameter is, the higher its likelihood.
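For the coin example, the likelihood of observing 9 heads and 1 tail in a fixed order is $\theta^9(1-\theta)$; a simple grid scan (an illustrative sketch assuming numpy) shows it is maximized at $\theta = 0.9$, the observed frequency:

```python
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)        # grid of candidate parameters
likelihood = thetas ** 9 * (1 - thetas)     # L(theta | 9 heads, 1 tail)

best = thetas[np.argmax(likelihood)]
print(best)   # close to 0.9: the maximum-likelihood estimate matches the frequency
```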
If $\theta$ equals the true underlying parameter, the expected log-likelihood takes its maximum (this follows from the non-negativity of KL divergence). Taking the logarithm turns multiplication into addition, which is easier to analyze. The log-likelihood function:

$$\log L(\theta \mid x_1, ..., x_n) = \sum_i \log f(x_i \mid \theta)$$

The score function is the derivative of log-likelihood with respect to the parameter, for one sample:

$$s(\theta; x) = \frac{\partial \log f(x \mid \theta)}{\partial \theta}$$

If $\theta$ equals the true underlying parameter, then the mean of the log-likelihood $E_x[\log L(\theta \vert x)]$ takes its maximum. A differentiable function has zero derivative at an interior maximum, so when $\theta$ is the true parameter, the mean of the score function is zero: $E_x[s(\theta;x)] = \frac{\partial E_x[\log f(x \vert \theta)]}{\partial \theta} = 0$.

The Fisher information $\mathcal{I}(\theta)$ is the mean of the square of the score:

$$\mathcal{I}(\theta) = E_x\left[s(\theta; x)^2\right]$$

(The mean is calculated over different outcomes, not different parameters.) Fisher information is computed under the assumption that $\theta$ is the true underlying parameter, so $E_x[s(\theta;x)]=0$ and Fisher information is the variance of the score: $\mathcal{I}(\theta)=\text{Var}_x[s(\theta;x)]$.

Fisher information $\mathcal{I}(\theta)$ also measures the curvature of the expected log-likelihood, in parameter space, around $\theta$. It measures how much information one sample can tell us about the underlying parameter.

When the parameter is an offset and the offset is infinitely small, the score function is called the linear score. Let the infinitely small offset be $\theta$.
The offset probability density is $f_2(x \vert \theta) = f(x+\theta)$, and the linear score is:

$$s(x) = \frac{\partial \log f(x+\theta)}{\partial \theta}\bigg|_{\theta \to 0} = \frac{d \log f(x)}{dx} = \frac{f'(x)}{f(x)}$$

In places that use the score function (and Fisher information) without specifying the parameter, they usually mean this linear score function.

Recall that making a probability distribution more "spread out" increases its entropy. Without constraints, maximizing the entropy of a real-number distribution would mean being "infinitely spread out over all real numbers", which is not well-defined. But with constraints, maximizing entropy gives some common and important distributions. There are other max-entropy distributions; see Wikipedia.

We can rediscover these max-entropy distributions using Lagrange multipliers and functional derivatives.

To find the distribution with maximum entropy under a variance constraint, we can use Lagrange multipliers. If we want to find the maximum or minimum of $f(x)$ under the constraint $g(x)=0$, we define the Lagrangian function $\mathcal{L}$:

$$\mathcal{L}(x, \lambda) = f(x) + \lambda g(x)$$

Its two partial derivatives have special properties: setting $\frac{\partial \mathcal{L}}{\partial \lambda}=0$ recovers the constraint $g(x)=0$, and setting $\frac{\partial \mathcal{L}}{\partial x}=0$ forces the gradient of $f$ to be a multiple of the gradient of $g$. Solving $\frac{\partial \mathcal{L}(x,\lambda)}{\partial x}=0$ and $\frac{\partial \mathcal{L}(x,\lambda)}{\partial \lambda}=0$ together finds the maximum or minimum under the constraint. Similarly, if there are many constraints, there are multiple $\lambda$s. The same applies to functions with multiple arguments.

The argument $x$ can be a number or even a function, which involves the functional derivative. A functional is a function that takes a function as input and outputs a value (a higher-order function). The functional derivative (also called variational derivative) is the derivative of a functional with respect to its argument function.
To compute a functional derivative, we add a small "perturbation" to the function: $f(x)$ becomes $f(x)+ \epsilon \cdot \eta(x)$, where $\epsilon$ is an infinitely small value approaching zero, and $\eta(x)$ is a test function (any function satisfying certain regularity properties). The definition of the functional derivative $\frac{\partial G}{\partial f}$:

$$\lim_{\epsilon \to 0} \frac{\partial G(f+\epsilon \eta)}{\partial \epsilon} = \int \frac{\partial G}{\partial f}(x)\, \eta(x)\, dx$$

Note that the functional derivative sits inside an integration.

For example, take the functional $G(f) = \int x f(x) dx$. To compute the functional derivative $\frac{\partial G(f)}{\partial f}$, we first compute $\frac{\partial G(f+\epsilon \eta)}{\partial \epsilon}$ and then try to put it into the form $\int \frac{\partial G}{\partial f} \cdot \eta(x)\, dx$:

$$\frac{\partial}{\partial \epsilon} \int x \left(f(x)+\epsilon\eta(x)\right) dx = \int x\, \eta(x)\, dx$$

By pattern matching with the definition, we get $\frac{\partial G}{\partial f}=x$. The same computation for $G(f)=\int x^2f(x)dx$ gives $\frac{\partial G}{\partial f}=x^2$.

Now calculate the functional derivative for $G(f) = \int (-f(x) \log f(x))\, dx$:

$$\frac{\partial G(f+\epsilon\eta)}{\partial \epsilon} = \int \left(-\eta(x)\log(f(x)+\epsilon\eta(x)) - \eta(x)\right) dx$$

As $\log$ is continuous and $\epsilon \eta(x)$ is infinitely small, $\log(f(x)+\epsilon \eta(x)) \to \log (f(x))$, so:

$$\frac{\partial G(f+\epsilon\eta)}{\partial \epsilon}\bigg|_{\epsilon \to 0} = \int \left(-\log f(x) - 1\right) \eta(x)\, dx, \qquad \frac{\partial G}{\partial f} = -\log f(x) - 1$$

Now constrain the range of values, $a \leq X \leq b$, and maximize the entropy using functional derivatives. We have the constraint $\int_a^b f(x)dx=1$, i.e. $\int_a^b f(x)dx-1=0$, giving the Lagrangian:

$$\mathcal{L} = \int_a^b f(x)\log\frac{1}{f(x)}\, dx + \lambda_1\left(\int_a^b f(x)\, dx - 1\right)$$
Compute the derivatives and solve $\frac{\partial \mathcal{L}}{\partial f}=0$:

$$-\log f(x) - 1 + \lambda_1 = 0 \implies f(x) = e^{\lambda_1 - 1} \ \text{(a constant)}$$

Then solving $\frac{\partial \mathcal{L}}{\partial \lambda_1}=0$ recovers the constraint $\int_a^b f(x)dx = 1$. The result is the uniform distribution: $f(x) = \frac 1 {b-a} \ \ (a \leq x \leq b)$.

The normal distribution, also called the Gaussian distribution, is important in statistics. It's the distribution with maximum entropy when we constrain the variance $\sigma^2$ to be a finite value. It has two parameters: the mean $\mu$ and the standard deviation $\sigma$; $N(\mu, \sigma^2)$ denotes a normal distribution. Changing $\mu$ moves the PDF along the X axis. Changing $\sigma$ scales the PDF along the X axis.

We can rediscover the normal distribution by maximizing entropy under a variance constraint. For a distribution's probability density function $f$, we want to maximize its entropy $H(f)=\int f(x) \log\frac{1}{f(x)}dx$ under the constraints. To simplify the deduction, assume the mean is 0, so the constraints are:

$$\int_{-\infty}^{\infty} f(x)\, dx = 1, \qquad \int_{-\infty}^{\infty} x^2 f(x)\, dx = \sigma^2$$

The Lagrangian function:

$$\mathcal{L} = \int_{-\infty}^{\infty} f(x)\log\frac{1}{f(x)}dx + \lambda_1\left(\int_{-\infty}^{\infty} f(x)dx - 1\right) + \lambda_2\left(\int_{-\infty}^{\infty} x^2 f(x)dx - \sigma^2\right)$$

$$=\int_{-\infty}^{\infty} \left(-f(x)\log f(x) + \lambda_1 f(x) + \lambda_2 x^2 f(x)\right) dx - \lambda_1 - \lambda_2\sigma^2$$

Then compute the functional derivative $\frac{\partial \mathcal{L}}{\partial f}$ and solve $\frac{\partial \mathcal{L}}{\partial f}=0$:

$$-\log f(x) - 1 + \lambda_1 + \lambda_2 x^2 = 0 \implies f(x) = e^{-1+\lambda_1} e^{\lambda_2 x^2}$$

We get the rough form of the normal distribution's probability density function.
Then solve $\frac{\partial \mathcal{L}}{\partial \lambda_1}=0$, i.e. $\int_{-\infty}^{\infty} e^{-1+\lambda_1} e^{\lambda_2 x^2}\, dx = 1$. That integration must converge, so $\lambda_2<0$.

A subproblem: solve $\int_{-\infty}^{\infty} e^{-k x^2}dx$ ($k>0$). The trick is to first compute its square $\left(\int_{-\infty}^{\infty} e^{-k x^2}dx\right)^2$, turning the integration into a two-dimensional one, and then substitute polar coordinates $x=r \cos \theta, \ y = r \sin \theta, \ x^2+y^2=r^2, \ dx\ dy = r \ dr \ d\theta$:

$$\left(\int_{-\infty}^{\infty} e^{-k x^2}dx\right)^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-k(x^2+y^2)}\, dx\, dy = \int_0^{2\pi}\int_0^\infty e^{-kr^2}\, r \, dr\, d\theta$$

Then substitute $u=-kr^2, \ du = -2kr\ dr, \ dr = -\frac{1}{2kr}du$:

$$\int_0^{2\pi}\int_0^\infty e^{-kr^2} r\, dr\, d\theta = 2\pi \left(-\frac{1}{2k}\right)\int_0^{-\infty} e^{u}\, du = \frac{2\pi}{2k} = \frac{\pi}{k}$$

So $\int_{-\infty}^{\infty} e^{-kx^2}dx=\sqrt{\frac{\pi}{k}}$.

Putting $k = -\lambda_2$ back into the constraint:

$$e^{-1+\lambda_1}\sqrt{\frac{\pi}{-\lambda_2}} = 1 \implies e^{-1+\lambda_1} = \sqrt{\frac{-\lambda_2}{\pi}}$$

Then solve $\frac{\partial \mathcal{L}}{\partial \lambda_2}=0$, which recovers $\int_{-\infty}^{\infty} x^2 f(x)\, dx = \sigma^2$. It requires another trick.
For the previous result $\int_{-\infty}^{\infty} e^{-kx^2}dx=\sqrt{\pi}\, k^{-\frac{1}{2}}$, take the derivative with respect to $k$ on both sides:

$$\int_{-\infty}^{\infty} (-x^2)\, e^{-kx^2}dx = -\frac{1}{2}\sqrt{\pi}\, k^{-\frac{3}{2}}$$

So $\int_{-\infty}^{\infty} x^2e^{-kx^2}dx = \frac{1}{2}\sqrt{\frac{\pi}{k^3}}$. The constraint $\frac{\partial \mathcal{L}}{\partial \lambda_2}=0$ becomes:

$$e^{-1+\lambda_1} \cdot \frac{1}{2}\sqrt{\frac{\pi}{(-\lambda_2)^3}} = \sigma^2$$

By using $e^{-1+\lambda_1} = \sqrt{\frac{-\lambda_2}{\pi}}$, we get:

$$\sqrt{\frac{-\lambda_2}{\pi}} \cdot \frac{1}{2}\sqrt{\frac{\pi}{(-\lambda_2)^3}} = \sigma^2 \implies \frac{1}{\sqrt{\lambda_2^2}} = 2\sigma^2$$

Previously we found that $\lambda_2<0$, so $\lambda_2=-\frac{1}{2\sigma^2}$. Then $e^{-1+\lambda_1}=\sqrt{\frac{1}{2\pi\sigma^2}}$.

We have finally deduced the normal distribution's probability density function (when the mean is 0):

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}x^2}$$

When the mean is not 0, substituting $x-\mu$ for $x$ gives the general normal distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2} = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

Entropy of normal distribution

We can then calculate the entropy of the normal distribution:

$$H = \int f(x)\log\frac{1}{f(x)}dx = \int f(x) \left(\frac{1}{2}\log(2\pi\sigma^2)+\frac{(x-\mu)^2}{2\sigma^2}\right)dx=\frac{1}{2}\log(2\pi\sigma^2)\underbrace{\int f(x)dx} _ {=1}+ \frac{1}{2\sigma^2}\underbrace{\int f(x)(x-\mu)^2\, dx} _ {=\sigma^2}$$

$$=\frac{1}{2}\log(2\pi\sigma^2)+\frac{1}{2}=\frac{1}{2}\log(2\pi e \sigma^2)$$

If X follows
a normal distribution, and Y is any distribution with the same mean and variance, then the cross entropy $H(Y,X)$ has the same value, $\frac{1}{2}\log(2\pi e \sigma^2)$, regardless of the exact probability density function of Y. The deduction is similar to the above:

$$H(Y,X) = \int f_Y(x)\log\frac{1}{f_X(x)}dx = \int f_Y(x) \left(\frac{1}{2}\log(2\pi\sigma^2)+\frac{(x-\mu)^2}{2\sigma^2}\right)dx=\frac{1}{2}\log(2\pi\sigma^2)\underbrace{\int f_Y(x)dx} _ {=1}+ \frac{1}{2\sigma^2}\underbrace{\int f_Y(x)(x-\mu)^2\, dx} _ {=\sigma^2}$$

$$=\frac{1}{2}\log(2\pi\sigma^2)+\frac{1}{2}=\frac{1}{2}\log(2\pi e \sigma^2)$$

Central limit theorem

We have a random variable $X$ with mean 0 and (finite) variance $\sigma^2$. If we add up $n$ independent samples of $X$, $X_1+X_2+...+X_n$, the variance of the sum is $n\sigma^2$. To keep the variance constant, we can divide by $\sqrt n$, getting $S_n = \frac{X_1+X_2+...+X_n}{\sqrt n}$. Here $S_n$ is called the standardized sum, because its variance does not change with the sample count.

The central limit theorem says that the standardized sum approaches the normal distribution as $n$ increases. No matter what the original distribution of $X$ is (as long as its variance is finite), the standardized sum will approach the normal distribution.
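A numerical illustration (not from the original text; the uniform distribution and the value $n=30$ are arbitrary choices, assuming numpy): the excess kurtosis of a uniform variable is $-1.2$, while the normal distribution has 0, and the standardized sum moves toward 0 as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def excess_kurtosis(samples):
    c = samples - samples.mean()
    return (c ** 4).mean() / (c ** 2).mean() ** 2 - 3.0

def standardized_sum(n, m):
    # m samples of S_n = (X_1 + ... + X_n) / sqrt(n), with X ~ Uniform(-0.5, 0.5).
    x = rng.uniform(-0.5, 0.5, size=(m, n))
    return x.sum(axis=1) / np.sqrt(n)

k1 = excess_kurtosis(standardized_sum(1, 200_000))     # about -1.2 (uniform itself)
k30 = excess_kurtosis(standardized_sum(30, 200_000))   # much closer to 0 (normal-like)
print(k1, k30)
```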
The information of the original distribution of $X$ gets "washed out" in the process. This "washing out of information" is an increase of entropy. As $n$ increases, the entropy of the standardized sum always increases (except when X already follows the normal distribution, in which case the entropy stays at the maximum): $H(S_{n+1}) > H(S_n)$ if $X$ is not normally distributed.

The normal distribution has the maximum entropy under a variance constraint. As the entropy of the standardized sum increases, it approaches that maximum and the distribution approaches the normal distribution. This is similar to the second law of thermodynamics. This result is called the entropic central limit theorem. Proving it is hard and requires a lot of prerequisite knowledge. See also: Solution of Shannon's problem on the monotonicity of entropy, Generalized Entropy Power Inequalities and Monotonicity Properties of Information.

In the real world, many things follow the normal distribution: the height of people, the weight of people, error in manufacturing, error in measurement, etc. The height of people is affected by many complex factors (nutrition, health, genetic factors, exercise, environmental factors, etc.). The combination of these complex factors certainly cannot be simplified to a standardized sum of i.i.d. zero-mean samples $\frac{X_1+X_2+...+X_n}{\sqrt n}$: some factors have large effects and some small, and the factors are not necessarily independent. But the height of people still roughly follows a normal distribution. This can be semi-explained by the second law of thermodynamics: the complex interactions of many factors increase the entropy of the height, while at the same time many factors constrain the variance of height.

Why is there a variance constraint? In some cases variance corresponds to instability. A human that is 100 meters tall is impossible because it's physically unstable.
Similarly, a human that's 1 cm tall could not maintain normal biological function. Unstable things tend to collapse and vanish (survivorship bias), and stable things remain. That's how variance constraints arise in nature. In other cases, variance corresponds to energy, and the variance is constrained by conservation of energy.

Although the normal distribution is common, not all distributions are normal. Many things follow fat-tailed distributions. Also note that the central limit theorem holds as $n$ approaches infinity. Even if a distribution's standardized sum approaches the normal distribution, the speed of convergence matters: some distributions converge to normal quickly, some slowly. Some fat-tailed distributions have finite variance, but their standardized sums converge to the normal distribution very slowly.

Below, a bold letter (like $\boldsymbol x$) means a column vector:

$$\boldsymbol x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

Linear transform: for a (column) vector $\boldsymbol{x}$, multiplying a matrix $A$ on it, $A\boldsymbol x$, is a linear transformation. A linear transformation can contain rotation, scaling and shearing. For a row vector it's $\boldsymbol xA$. Two linear transformations can be combined into one, corresponding to matrix multiplication.

Affine transform: for a (column) vector $\boldsymbol x$, multiply a matrix on it and then add an offset: $A\boldsymbol x + \boldsymbol b$. It can translate on top of the result of a linear transform. Two affine transformations can also be combined into one: if $\boldsymbol y=A\boldsymbol x+\boldsymbol b$ and $\boldsymbol z=C\boldsymbol y+\boldsymbol d$, then $\boldsymbol z=(CA)\boldsymbol x +(C\boldsymbol b + \boldsymbol d)$. (In some places an affine transformation is called a "linear transformation".)
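The composition rule for affine transforms can be checked directly (an illustrative sketch assuming numpy; the matrices are arbitrary random examples):

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(3, 3)), rng.normal(size=3)
C, d = rng.normal(size=(3, 3)), rng.normal(size=3)
x = rng.normal(size=3)

# Apply y = Ax + b, then z = Cy + d ...
z_two_steps = C @ (A @ x + b) + d
# ... equals the single combined affine transform z = (CA)x + (Cb + d).
z_combined = (C @ A) @ x + (C @ b + d)

print(np.allclose(z_two_steps, z_combined))   # True
```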
The normal distribution has linear properties: applying an affine transformation $\boldsymbol y = A\boldsymbol x + \boldsymbol b$ to a vector of independent normal variables gives a multivariate normal distribution. Note that the elements of $\boldsymbol y$ are no longer necessarily independent. What if I apply two or many affine transformations? Two affine transformations can be combined into one, so the result is still a multivariate normal distribution. To describe a multivariate normal distribution, an important concept is the covariance matrix. Recall covariance: $\text{Cov}[X,Y]=E[(X-E[X])(Y-E[Y])]$. Some rules about covariance: $\text{Cov}[X,Y]=\text{Cov}[Y,X]$, $\text{Cov}[kX,Y]=k\,\text{Cov}[X,Y]$, $\text{Cov}[X+Z,Y]=\text{Cov}[X,Y]+\text{Cov}[Z,Y]$, and adding a constant changes nothing: $\text{Cov}[X+a,Y]=\text{Cov}[X,Y]$. Covariance matrix: $\text{Cov}(\boldsymbol x, \boldsymbol y) = E[(\boldsymbol x - E[\boldsymbol x])(\boldsymbol y - E[\boldsymbol y])^T]$. Here $E[\boldsymbol x]$ takes the mean of each element in $\boldsymbol x$ and outputs a vector; it's element-wise: $E[\boldsymbol x]_i = E[\boldsymbol x_i]$. Similar for a matrix. The covariance matrix written out:

$$\begin{pmatrix} \text{Cov}[x_1,y_1] & \text{Cov}[x_1,y_2] & \cdots & \text{Cov}[x_1,y_n] \\ \text{Cov}[x_2,y_1] & \text{Cov}[x_2,y_2] & \cdots & \text{Cov}[x_2,y_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}[x_n,y_1] & \text{Cov}[x_n,y_2] & \cdots & \text{Cov}[x_n,y_n] \end{pmatrix}$$

Recall that multiplying by a constant and addition can be "moved out of $E[\,]$": $E[kX] = k E[X]$, $E[X+Y]=E[X]+E[Y]$. If $A$ is a matrix that contains random variables and $B$ is a matrix that's not random, then $E[A\cdot B] = E[A]\cdot B$ and $E[B\cdot A] = B\cdot E[A]$, because multiplying by a matrix comes down to multiplying by constants and adding up, all of which can "move out of $E[\,]$". A vector can be seen as a special kind of matrix.
So applying this to the covariance matrix: $\text{Cov}(A\boldsymbol x, \boldsymbol y) = A\cdot\text{Cov}(\boldsymbol x, \boldsymbol y)$. Similarly, $\text{Cov}(\boldsymbol x, B \cdot \boldsymbol y) = \text{Cov}(\boldsymbol x, \boldsymbol y) \cdot B^T$. If $\boldsymbol x$ follows a multivariate normal distribution, it can be described by the mean vector $\boldsymbol \mu$ (the mean of each element of $\boldsymbol x$) and the covariance matrix $\text{Cov}(\boldsymbol x,\boldsymbol x)$. Initially, suppose I have some independent normal variables $x_1, x_2, ..., x_n$ with means $\mu_1, ..., \mu_n$ and variances $\sigma_1^2, ..., \sigma_n^2$. If we treat them as a multivariate normal distribution, the mean vector is $\boldsymbol \mu_x = (\mu_1, ..., \mu_n)$, and the covariance matrix is diagonal because they are independent:

$$\begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}$$

Then if we apply an affine transformation $\boldsymbol y = A \boldsymbol x + \boldsymbol b$, we get $\boldsymbol \mu_y = A \boldsymbol \mu_x + \boldsymbol b$ and $\text{Cov}(\boldsymbol y,\boldsymbol y) = \text{Cov}(A \boldsymbol x + \boldsymbol b,A \boldsymbol x + \boldsymbol b) = \text{Cov}(A \boldsymbol x, A \boldsymbol x) = A\, \text{Cov}(\boldsymbol x,\boldsymbol x)\, A^T$. The industry standard of 3D modelling is to model a 3D object as many triangles, called a mesh. It only models the visible surface of the object, using many triangles to approximate curved surfaces. Gaussian splatting provides an alternative method of 3D modelling.
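Going back to the covariance rule above, $\text{Cov}(\boldsymbol y,\boldsymbol y) = A\, \text{Cov}(\boldsymbol x,\boldsymbol x)\, A^T$ can be verified empirically. A minimal numpy sketch (the variances, matrix and offset are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# independent normals with different variances -> diagonal covariance matrix
sigmas = np.array([1.0, 2.0, 0.5])
x = rng.normal(loc=[1.0, -2.0, 0.0], scale=sigmas, size=(200_000, 3))

A = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0],
              [0.0, 1.0, 1.0]])
b = np.array([10.0, -5.0, 2.0])
y = x @ A.T + b  # affine transform applied to each sample

emp = np.cov(y, rowvar=False)            # empirical covariance of y
theory = A @ np.diag(sigmas**2) @ A.T    # A Cov(x,x) A^T; the offset b drops out
print(np.max(np.abs(emp - theory)))      # small sampling error
```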
The 3D scene is modelled by a lot of multivariate (3D) Gaussian distributions, each called a gaussian. When rendering, each 3D Gaussian distribution is projected onto a plane (the screen) and approximately becomes a 2D Gaussian distribution, where probability density now corresponds to color opacity. Note that the projection is a perspective projection (near things look big and far things look small). Perspective projection is not linear, so after perspective projection the 3D Gaussian distribution is no longer strictly a 2D Gaussian distribution, but it can be approximated by one. Triangle meshes are often modelled by people, but a Gaussian splatting scene is usually trained from photos of a scene taken from different perspectives. A gaussian's color can be fixed or can change based on the view direction. Gaussian splatting also works in 4D by adding a time dimension. In a diffusion model, we add Gaussian noise to an image (or other things). The diffusion model takes the noisy input, and we train it to output the noise that was added. There are many steps of adding noise, and the model should output the noise added in each step. Tweedie's formula shows that estimating the added noise is the same as computing the score (the gradient of the log density) of the image distribution. To simplify, here we only consider one dimension and one noise step (the same applies to many dimensions and many noise steps). If the original value is $x_0$ and we add a noise $\epsilon \sim N(0, \sigma^2)$, the noise-added value is $x_1 = x_0 + \epsilon$, so $x_1 \sim N(x_0, \sigma^2)$. The diffusion model only knows $x_1$ and doesn't know $x_0$. The diffusion model needs to estimate $\epsilon$ from $x_1$.
(I use $p_{1 \vert 0}(x_1 \vert x_0)$ instead of the shorter $p(x_1 \vert x_0)$ to reduce confusion between different distributions.) $p_{1 \vert 0}(x_1 \vert x_0)$ is a normal density:

$$p_{1 \vert 0}(x_1 \vert x_0) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac 1 2 \left(\frac{x_1 - x_0}{\sigma}\right)^2}$$

Take the log:

$$\log p_{1 \vert 0}(x_1 \vert x_0) = \log \frac{1}{\sigma\sqrt{2\pi}} - \frac 1 2 \left(\frac{x_1 - x_0}{\sigma}\right)^2$$

The score function under the condition:

$$\frac{\partial \log p_{1 \vert 0}(x_1 \vert x_0)}{\partial x_1} = - \frac{x_1-x_0}{\sigma^2}$$

Bayes' rule: $p_{0 \vert 1}(x_0 \vert x_1) = \frac{p_{1 \vert 0}(x_1 \vert x_0)\, p_0(x_0)}{p_1(x_1)}$. Take the log, then the partial derivative with respect to $x_1$ (note that $\frac{\partial \log p_0(x_0)}{\partial x_1} = 0$ because $p_0(x_0)$ doesn't depend on $x_1$):

$$\frac{\partial \log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} = \frac{\partial \log p_{1 \vert 0}(x_1 \vert x_0)}{\partial x_1} - \frac{\partial \log p_1(x_1)}{\partial x_1}$$

Using the previous result $\frac{\partial \log p_{1 \vert 0}(x_1 \vert x_0)}{\partial x_1} = - \frac{x_1-x_0}{\sigma^2}$:

$$\frac{\partial \log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} = - \frac{x_1-x_0}{\sigma^2} - \frac{\partial \log p_1(x_1)}{\partial x_1}$$

Now suppose we already know the noise-added value $x_1$ but don't know $x_0$, so $x_0$ is uncertain. We want to compute the expectation of $x_0$ under the condition that $x_1$ is known.
Rearranging the score relation gives $x_0 = x_1 + \sigma^2 \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} + \sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1}$. Taking the conditional expectation over $x_0$ given $x_1$:

$$E_{x_0}[x_0 \vert x_1] = x_1 + E_{x_0}\left[\sigma^2 \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1}\biggr\vert x_1\right] + E_{x_0}\left[ \sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1} \biggr\vert x_1\right]$$

Within it, $E_{x_0}\left[ \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} \biggr\vert x_1 \right]=0$, because

$$E_{x_0}\left[ \frac{\partial\log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} \biggr\vert x_1 \right] = \int p_{0 \vert 1}(x_0 \vert x_1) \cdot \frac{\partial \log p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} dx_0 = \int p_{0 \vert 1}(x_0 \vert x_1) \cdot \frac 1 {p_{0 \vert 1}(x_0 \vert x_1)} \cdot \frac{\partial p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} dx_0 = \int \frac{\partial p_{0 \vert 1}(x_0 \vert x_1)}{\partial x_1} dx_0 = \frac{\partial \int p_{0 \vert 1}(x_0 \vert x_1) dx_0}{\partial x_1} = \frac{\partial 1}{\partial x_1}=0$$

And $E_{x_0}\left[ \sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1} \biggr\vert x_1\right] = \sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1}$ because it's unrelated to the random $x_0$. So

$$E_{x_0}[x_0 \vert x_1] = x_1 + \sigma^2\frac{\partial \log p_1(x_1)}{\partial x_1}$$

That's Tweedie's formula (for the 1D case).
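Tweedie's formula can be checked numerically on a toy prior. In this sketch (my own setup), $x_0$ is $\pm 1$ with equal probability and $\sigma = 1$, so the posterior mean $E[x_0 \vert x_1]$ can also be computed directly via Bayes' rule (it works out to $\tanh(x_1/\sigma^2)$), and the two should agree:

```python
import math

sigma = 1.0

def p1(x1):
    # marginal density of x1 when x0 is +1 or -1 with equal probability
    n = lambda m: math.exp(-(x1 - m)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)
    return 0.5 * n(-1.0) + 0.5 * n(1.0)

def tweedie(x1, h=1e-5):
    # E[x0 | x1] = x1 + sigma^2 * d/dx1 log p1(x1); score via finite difference
    score = (math.log(p1(x1 + h)) - math.log(p1(x1 - h))) / (2 * h)
    return x1 + sigma**2 * score

def posterior_mean(x1):
    # direct Bayes computation for this two-point prior
    return math.tanh(x1 / sigma**2)

for v in [-2.0, -0.3, 0.0, 1.5]:
    print(v, tweedie(v), posterior_mean(v))  # the two columns agree
```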
It can be generalized to many dimensions, where $x_0, x_1$ are vectors and the distributions $p_0, p_1, p_{0 \vert 1}, p_{1 \vert 0}$ are joint distributions whose dimensions are not necessarily independent. The Gaussian noise added to different dimensions is still independent. The diffusion model is trained to estimate the added noise, which is the same as estimating the score. If we have the constraint $X \geq 0$ and fix the mean $E[X]$ to a specific value $\mu$, then maximizing entropy gives the exponential distribution. It can also be rediscovered from a Lagrange multiplier:

$$\mathcal{L}(f, \lambda_1, \lambda_2) = \int_0^{\infty} f(x) \log \frac 1 {f(x)} dx + \lambda_1 \left( \int_0^{\infty} f(x)dx - 1 \right) + \lambda_2 \left( \int_0^{\infty} f(x)\,x\, dx - \mu \right)$$

$$=\int_{0}^{\infty} (-f(x)\log f(x) + \lambda_1 f(x) + \lambda_2 x f(x) ) dx - \lambda_1 - \lambda_2\mu$$

$$\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) - 1 + \lambda_1 + \lambda_2 x \quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_1}=\int_0^{\infty}f(x)dx-1 \quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_2}=\int_0^{\infty} xf(x)dx-\mu$$

Then solve $\frac{\partial \mathcal{L}}{\partial f}=0$, which gives $f(x) = e^{-1+\lambda_1+\lambda_2 x}$. Then solve $\frac{\partial \mathcal{L}}{\partial \lambda_1}=0$, i.e. $\int_0^\infty e^{-1+\lambda_1+\lambda_2 x} dx = 1$. To make that integral finite, $\lambda_2 < 0$.
Let $u = \lambda_2 x$, $du = \lambda_2 dx$, $dx=\frac 1 {\lambda_2} du$. Then solve $\frac{\partial \mathcal{L}}{\partial \lambda_2}=0$. Now we have two equations in $\lambda_1$ and $\lambda_2$. Solving them gives $\lambda_2 = - \frac 1 {\mu}$, $e^{1-\lambda_1} = \mu$. Then $f(x) = \frac 1 \mu e^{-\frac x \mu}$. In the common definition of the exponential distribution, $\lambda = \frac 1 \mu$ and $f(x) = \lambda e^{-\lambda x}$. Its tail function:

$$P(X > x) = \int_x^\infty \lambda e^{-\lambda y} dy = \Big[ -e^{-\lambda y} \Big]_{y=x}^{y=\infty} = e^{-\lambda x}$$

If some event happens at a fixed rate $\lambda$, the exponential distribution measures how long we need to wait for the next event, if how long we still need to wait is irrelevant to how long we have already waited (memorylessness). The exponential distribution can measure, for example, the waiting time for the next decay of a radioactive atom. How to understand memorylessness? For example, a kind of radioactive atom decays once per 5 minutes on average. If the time unit is minutes, then $\lambda = \frac 1 5$. For a specific atom, if we wait for it to decay, the time we need to wait is on average 5 minutes. However, if we have already waited for 3 minutes and it still hasn't decayed, the expected time we still need to wait is still 5 minutes. If we have waited for 100 minutes and it still hasn't decayed, the expected time we still need to wait is still 5 minutes, because the atom doesn't "remember" how long we have waited. Memorylessness means the probability that we still need to wait $\text{needToWait}$ amount of time is irrelevant to how long we have already waited: $P(X > \text{alreadyWaited} + \text{needToWait} \mid X > \text{alreadyWaited}) = P(X > \text{needToWait})$. (We can also rediscover the exponential distribution from just memorylessness.) Memorylessness is related to its maximum entropy property. Maximizing entropy under constraints means maximizing uncertainty and minimizing information other than the constraints.
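The radioactive-atom example above can be simulated directly. A minimal sketch (the seed and sample count are arbitrary): the average remaining wait among atoms that already survived 3 minutes is still about 5 minutes, not 2.

```python
import random, statistics

random.seed(1)
lam = 1 / 5  # one decay per 5 minutes on average
waits = [random.expovariate(lam) for _ in range(200_000)]

# unconditional mean wait: about 5 minutes
print(statistics.fmean(waits))

# among atoms that survived 3 minutes, the extra wait is still about 5 minutes
survivors = [w - 3 for w in waits if w > 3]
print(statistics.fmean(survivors))
```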
The only two constraints are $X\geq 0$ (the wait time is positive) and $E[X]=\frac 1 \lambda$ (the average rate of the event). Other than these two constraints, there is no extra information: no information says waiting reduces the remaining wait time, and no information says waiting increases it. So the most unbiased choice is that waiting has no effect on the time we still need to wait. If the radioactive atom had some "internal memory" that changes over time and controls how likely it is to decay, then the waiting time distribution would encode extra information beyond the two constraints, making it no longer max-entropy. 80/20 rule: for example, 80% of wealth is held by the richest 20% (the real numbers may differ). It has a fractal property: even within the richest 20%, 80% of that wealth is held by the richest 20% within them. From this fractal-like property, we can naturally derive the Pareto distribution. Suppose the total people count is $N$ and the total wealth is $W$. Then $0.2N$ people have $0.8W$ wealth. Applying the same rule within those $0.2N$ people: $0.2 \cdot 0.2 N$ people have $0.8 \cdot 0.8W$ wealth. Applying it again, $0.2^3 N$ people have $0.8^3 W$ wealth. Generalizing, $0.2^k N$ people have $0.8^k W$ wealth ($k$ can be generalized to a continuous real number). If the wealth variable is $X$ (assume $X > 0$), its probability density function is $f(x)$, and the proportion of people corresponds to probability, then the richest $0.2^k$ proportion of people have $0.8^k W$ wealth, where $t$ is the wealth threshold (minimum wealth) of that group. Note that $f(x)$ represents a probability density function (PDF), which corresponds to the density of the proportion of people.
$N\cdot f(x)$ is the density of people count over wealth. Multiplying it by the wealth $x$ and integrating gives the total wealth in a range: the people with wealth above $t$ hold $\int_t^\infty N f(x)\, x\, dx$. The two conditions are $\int_t^\infty f(x)dx = 0.2^k$ and $\frac N W \int_t^\infty x f(x)\, dx = 0.8^k$. We can rediscover the Pareto distribution from these. The first thing to do is to extract and eliminate $k$:

$$\frac N W \int_t^\infty x f(x)\, dx = \left(\int_t^\infty f(x)dx\right)^{\beta}, \quad \beta = \frac{\log 0.8}{\log 0.2}$$

Then we can take the derivative with respect to $t$ on both sides:

$$-\frac N W t f(t) = -\beta \left(\int_t^\infty f(x)dx\right)^{\beta - 1} f(t)$$

$f(t) \neq 0$. Divide both sides by $-f(t)$:

$$\frac N W t = \beta \left(\int_t^\infty f(x)dx\right)^{\beta - 1}$$

Take the derivative with respect to $t$ again:

$$\frac N W = -\beta(\beta-1)\left(\int_t^\infty f(x)dx\right)^{\beta - 2} f(t)$$

Now $t$ is an argument and can be renamed to $x$. Combining the last two equations to eliminate the integral and doing some adjustments gives $f(x) \propto x^{\frac{2-\beta}{\beta-1}} = x^{-\alpha-1}$, where $\alpha = \frac 1 {1-\beta} = \frac{\log 5}{\log 4}$. Now we get the PDF. We still need to make the total probability integrate to 1 for it to be a valid distribution, but there is no extra unknown parameter in the PDF to change. The solution is to crop the range of $X$: if we set the minimum wealth in the distribution to be $m$ (without constraining the maximum wealth), creating the constraint $X \geq m$, then using the previous result we can normalize the density. Now we have rediscovered (a special case of) the Pareto distribution from just the fractal 80/20 rule. We can generalize it further to other cases like the 90/10 rule, the 80/10 rule, etc., and get the Pareto (Type I) distribution. It has two parameters: the shape parameter $\alpha$ (for the 80/20 rule, $\alpha = -\frac {\log 0.2} {\log 0.8-\log 0.2} = \frac{\log 5}{\log 4} \approx 1.161$) and the minimum value $m$. Note that in the real world one's wealth can be negative (more debts than assets). The Pareto distribution is just an approximation, and $m$ is the threshold above which the Pareto distribution starts to be a good approximation. If $\alpha \leq 1$ its theoretical mean is infinite. Of course, with finite samples the sample mean will be finite, but if the theoretical mean is infinite, the more samples we have, the larger the sample mean tends to be, and the trend won't stop. If $\alpha \leq 2$ its theoretical variance is infinite.
Recall that the central limit theorem requires finite variance. The standardized sum of values taken from a Pareto distribution with $\alpha \leq 2$ does not follow the central limit theorem, because it has infinite variance. The Pareto distribution is often described using its tail function (rather than the probability density function): $P(X > x) = \left(\frac m x\right)^\alpha$ for $x \geq m$. There are additive values, like length, mass, and money. For additive values, we often compute the arithmetic average $\frac 1 n (x_1 + x_2 + ... + x_n)$. There are also multiplicative values, like asset return rates and growth ratios. For multiplicative values, we often compute the geometric average $(x_1 \cdot x_2 \cdot ... \cdot x_n)^{\frac 1 n}$. For example, if an asset grows by 20% in the first year, drops 10% in the second year and grows 1% in the third year, then the average growth ratio per year is $(1.2 \cdot 0.9 \cdot 1.01)^{\frac 1 3}$. Logarithms turn multiplication into addition, and powers into multiplication. If $y = \log x$, then the log of the geometric average of $x$ is the arithmetic average of $y$: $\log (x_1 \cdots x_n)^{\frac 1 n} = \frac 1 n (y_1 + ... + y_n)$. The Pareto distribution maximizes entropy under a geometric mean constraint on $E[\log X]$.
If we have the constraints $X \geq m > 0$ and $E[\log X] = g$, using a Lagrange multiplier to maximize entropy:

$$\mathcal{L}(f, \lambda_1, \lambda_2) = \int_m^{\infty} f(x) \log \frac 1 {f(x)} dx + \lambda_1 \left( \int_m^{\infty} f(x)dx - 1 \right) + \lambda_2 \left( \int_m^{\infty} f(x)\log x\, dx - g \right)$$

$$\mathcal{L}(f, \lambda_1, \lambda_2) = \int_m^{\infty} (\ -f(x)\log f(x) + \lambda_1 f(x) + \lambda_2 f(x) \log x \ ) dx -\lambda_1 - g \lambda_2$$

$$\frac{\partial \mathcal{L}}{\partial f} = -\log f(x) - 1 + \lambda_1 + \lambda_2 \log x \quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_1} = \int_m^{\infty} f(x) dx -1 \quad\quad \frac{\partial \mathcal{L}}{\partial \lambda_2} = \int_m^{\infty} f(x) \log x \ dx-g$$

Solve $\frac{\partial \mathcal{L}}{\partial f}=0$: $f(x) = e^{-1+\lambda_1} x^{\lambda_2}$. Solve $\frac{\partial \mathcal{L}}{\partial \lambda_1}=0$: $\int_m^\infty e^{-1+\lambda_1} x^{\lambda_2} dx = 1$. To make $\int_m^{\infty} x^{\lambda_2}dx$ finite, $\lambda_2 < -1$.
$$\int_m^\infty e^{-1+\lambda_1} x^{\lambda_2} dx = e^{-1+\lambda_1} \left[ \frac{x^{\lambda_2+1}}{\lambda_2+1} \right]_{x=m}^{x=\infty} = -e^{-1+\lambda_1} \frac{m^{\lambda_2+1}}{\lambda_2+1} = 1$$

$$\frac{m^{\lambda_2+1}}{\lambda_2+1} = -e^{1-\lambda_1} \tag{1} \quad\quad e^{-1+\lambda_1}=-\frac{\lambda_2+1}{m^{\lambda_2+1}}$$

Solve $\frac{\partial \mathcal{L}}{\partial \lambda_2}=0$: $\int_m^\infty e^{-1+\lambda_1} x^{\lambda_2} \log x\, dx = g$. If we temporarily ignore $e^{-1+\lambda_1}$ and compute $\int_m^{\infty} x^{\lambda_2} \log x \ dx$: let $u=\log x$, $x=e^u$, $dx = e^u du$, then

$$\int_m^{\infty} x^{\lambda_2} \log x \, dx = \int_{\log m}^\infty u\, e^{(\lambda_2+1)u} du = \left[ \frac{u\, e^{(\lambda_2+1)u}}{\lambda_2+1} - \frac{e^{(\lambda_2+1)u}}{(\lambda_2+1)^2} \right]_{u=\log m}^{u=\infty} = \frac{m^{\lambda_2+1}}{(\lambda_2+1)^2} - \frac{m^{\lambda_2+1}\log m}{\lambda_2+1}$$

Then, by using (1) $e^{-1+\lambda_1}=-\frac{\lambda_2+1}{m^{\lambda_2+1}}$:

$$g = e^{-1+\lambda_1}\left( \frac{m^{\lambda_2+1}}{(\lambda_2+1)^2} - \frac{m^{\lambda_2+1}\log m}{\lambda_2+1} \right) = -\frac 1 {\lambda_2+1} + \log m$$

so $\lambda_2+1 = \frac 1 {\log m - g}$. Let $\alpha = -\frac 1 {\log m - g}$; the density becomes $f(x) = \alpha m^\alpha x^{-\alpha-1}$. Now we have rediscovered the Pareto (Type I) distribution by maximizing entropy. In the process we had $\lambda_2 < -1$; from $\lambda_2+1 = \frac 1 {\log m - g}$ we know $\log m - g < 0$, i.e. $m < e^g$. For example, if wealth follows a Pareto distribution, how do we compute the wealth share of the top 1%? More generally, how do we compute the share of the top $p$ proportion? We first need to compute the threshold value $t$ of the top $p$: $P(X > t) = \left(\frac m t\right)^\alpha = p$, so $t = m p^{-1/\alpha}$. Then compute the share:

$$\text{share} = \frac{\int_t^\infty x\, \alpha m^\alpha x^{-\alpha-1} dx}{\int_m^\infty x\, \alpha m^\alpha x^{-\alpha-1} dx} = \left(\frac t m\right)^{-\alpha+1} = p^{\frac{\alpha-1}{\alpha}}$$

To make those integrals finite, we need $-\alpha+1< 0$, i.e. $\alpha > 1$. The share proportion is irrelevant to $m$. Some concrete numbers: with $\alpha = \frac{\log 5}{\log 4} \approx 1.161$, the top 20% holds 80% of the total, and the top 1% holds about 53%. A distribution is a power law distribution if its tail function $P(X>x)$ is roughly proportional to $x^{-\alpha}$, where $\alpha$ is called the exponent.
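The share formula above, $\text{share} = p^{(\alpha-1)/\alpha}$, can be packaged into a few lines (the function name `top_share` is mine). With the fractal-80/20 value of $\alpha$, plugging in $p = 0.2$ recovers exactly 0.8:

```python
import math

def top_share(p, alpha):
    # share of the total held by the top-p proportion under Pareto(alpha);
    # independent of the minimum value m, valid for alpha > 1
    return p ** ((alpha - 1) / alpha)

alpha_8020 = math.log(5) / math.log(4)  # the "80/20" shape parameter, ~1.161
print(top_share(0.20, alpha_8020))      # 0.8, recovering the 80/20 rule
print(top_share(0.01, alpha_8020))      # top 1% share, about 0.53
```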
The "roughly" here means that it can have small deviations that become negligible when $x$ is large enough. Rigorously speaking, $P(X>x) \propto L(x) x^{-\alpha}$ where $L$ is a slowly varying function, which requires $\lim_{x \to \infty} \frac{L(rx)}{L(x)}=1$ for every positive $r$. Note that in some places the power law is written as $P(X>x) \propto L(x) x^{-(\alpha-1)}$; in those places $\alpha$ is 1 larger than the $\alpha$ in the Pareto distribution. The same $\alpha$ can have different meanings in different places. Here I use the $\alpha$ that's consistent with the $\alpha$ in the Pareto distribution. The lower the exponent $\alpha$, the more right-skewed the distribution is, and the more extreme values it has. The paper Power laws, Pareto distributions and Zipf's law gives power law parameter estimations for real-world quantities. The book The Black Swan also provides some estimates of power law parameters in the real world. Note that these estimates are not accurate, because they are sensitive to rare extreme samples. Note that there are things whose estimated $\alpha < 1$: the intensity of solar flares, the intensity of wars, the frequency of family names. Recall that in the Pareto (Type I) distribution, if $\alpha \leq 1$ then the theoretical mean is infinite; the sample mean tends to grow higher and higher as we collect more samples, and the trend won't stop. If the intensity of war does follow a power law and the real $\alpha < 1$, then much larger wars exist in the future. Note that most of these things have estimated $\alpha < 2$. In the Pareto (Type I) distribution, if $\alpha \leq 2$ then the theoretical variance is infinite. Not having a finite variance means they do not follow the central limit theorem and should not be modelled using a Gaussian distribution.
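On parameter estimation: for a pure Pareto (Type I) distribution with known minimum $m$, the maximum-likelihood estimate of the shape is $\hat\alpha = n / \sum_i \log(x_i/m)$. A small sketch (the parameter values and seed are arbitrary; real data is messier than this, which is part of why real-world estimates are unreliable):

```python
import math, random

def sample_pareto(alpha, m, n, rng):
    # inverse-CDF sampling: Q(p) = m * (1 - p) ** (-1 / alpha)
    return [m * (1 - rng.random()) ** (-1 / alpha) for _ in range(n)]

def mle_alpha(xs, m):
    # maximum-likelihood estimate of the shape parameter, with m assumed known
    return len(xs) / sum(math.log(x / m) for x in xs)

rng = random.Random(0)
xs = sample_pareto(alpha=1.16, m=3.0, n=50_000, rng=rng)
print(mle_alpha(xs, m=3.0))  # close to the true 1.16
```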
There are other distributions that can have extreme values. They all have less extreme values than power law distributions, but more extreme values than the normal distribution and the exponential distribution. If $T$ follows an exponential distribution, then $a^T$ follows a Pareto (Type I) distribution if $a>1$. If $T$ follows an exponential distribution, its probability density is $f_T(t) = \lambda e^{-\lambda t}$ ($T\geq 0$), and its cumulative distribution function is $F_T(t) = P(T<t) = 1-e^{-\lambda t}$. If $Y=a^T$, $a>1$, then

$$P(Y > y) = P(a^T > y) = P\left(T > \log_a y\right) = e^{-\lambda \log_a y} = y^{-\frac{\lambda}{\log a}}$$

Because $T\geq 0$, $Y \geq a^0=1$. Now $Y$'s tail function is in the same form as the Pareto (Type I) distribution, with $\alpha=\frac{\lambda}{\log a}$, $m =1$. If the lifetime of something follows a power law distribution, then it has the Lindy effect: the longer it has existed, the longer it will likely continue to exist. Suppose the lifetime $T$ follows a Pareto distribution, and something is still alive at time $t$; compute the expected lifetime under that condition. (The mean is a weighted average. The conditional mean is also a weighted average, but under a condition; because the total integrated weight is no longer 1, we need to divide by it.)

$$E[T \mid T > t] = \frac{\int_t^\infty x\, \alpha m^\alpha x^{-\alpha-1} dx}{P(T > t)} = \frac{\frac{\alpha m^\alpha}{\alpha-1} t^{1-\alpha}}{m^\alpha t^{-\alpha}} = \frac{\alpha}{\alpha-1} t$$

(For that integration to be finite, $-\alpha+1<0$, i.e. $\alpha>1$.) The expected lifetime is $\frac{\alpha}{\alpha-1} t$ under the condition that it has already lived to time $t$. The expected remaining lifetime is $\frac{\alpha}{\alpha-1} t-t= \frac{1}{\alpha-1}t$: it increases proportionally to $t$. The Lindy effect often doesn't apply to physical things. It often applies to information, like technology, culture, art, and social norms.
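As a sanity check of the conditional-mean formula above, a small Monte Carlo sketch (my own setup: $\alpha = 3$, $m = 1$, so the expected lifetime given survival to $t$ should be $\frac{\alpha}{\alpha-1} t = 1.5t$; $\alpha > 2$ is chosen so the sample mean is stable):

```python
import random, statistics

def sample_pareto(alpha, m, n, rng):
    # inverse-CDF sampling: Q(p) = m * (1 - p) ** (-1 / alpha)
    return [m * (1 - rng.random()) ** (-1 / alpha) for _ in range(n)]

rng = random.Random(3)
lifetimes = sample_pareto(alpha=3.0, m=1.0, n=500_000, rng=rng)

for t in [2.0, 4.0]:
    survivors = [x for x in lifetimes if x > t]
    # theory: E[T | T > t] = alpha/(alpha-1) * t = 1.5 t
    print(t, statistics.fmean(survivors))
```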
If some numbers span multiple orders of magnitude, Benford's law says that about 30% of the numbers have leading digit 1, about 18% have leading digit 2, and so on. Digit $d$'s proportion is $\log_{10} \left(1 + \frac 1 d \right)$. The Pareto distribution is a distribution that spans many orders of magnitude. Let's compute the distribution of the first digit if the number follows a Pareto distribution. If $x$ starts with digit $d$ then $d \cdot 10^k \leq x < (d+1)\, 10^k$ for some $k=0, 1, 2, ...$. The Pareto distribution has a lower bound $m$. If we make $m$ randomly distributed, analytically computing the probability of each starting digit becomes hard due to edge cases. In this case, doing a Monte Carlo simulation is easier. How do we randomly sample numbers from a Pareto distribution? First, we know the cumulative distribution function $F(x) = P(X<x) = 1-P(X>x) = 1- m^\alpha x^{-\alpha}$. We can then get the quantile function, which is the inverse of $F$: if $F(x)=p$ then $Q(p) = x = m(1-p)^{-1/\alpha}$. Now if we randomly sample $p$ uniformly between 0 and 1, $Q(p)$ will follow the Pareto distribution. Given $x$, how do we calculate its first digit? If $10\leq x<100$ ($1 \leq \log_{10} x < 2$) then the first digit is $\lfloor {\frac x {10}} \rfloor$. If $100 \leq x < 1000$ ($2 \leq \log_{10}x < 3$) then the first digit is $\lfloor {\frac x {100}} \rfloor$.
Generalizing, the first digit is $d = \lfloor x / 10^{\lfloor \log_{10} x \rfloor} \rfloor$. Because the Pareto distribution has a lot of extreme values, directly materializing the samples will likely exceed the floating-point range and give infinities. So we need to work in log scale: only calculate using $\log x$ and avoid using $x$ directly. Sampling in log scale: $\log_{10} Q(p) = \log_{10} m - \frac{\log_{10}(1-p)}{\alpha}$. Calculating the first digit in log scale: $d = \lfloor 10^{\operatorname{frac}(\log_{10} x)} \rfloor$, where $\operatorname{frac}$ takes the fractional part. When $\alpha$ approaches $0$ the result accurately follows Benford's law; the larger $\alpha$, the larger the deviation from Benford's law. If we fix the minimum value $m$ as a specific number, like $3$, then when $\alpha$ is not very close to $0$ the result significantly deviates from Benford's law. However, if we make $m$ a random value between 1 and 10, the result will be close to Benford's law. We have a null hypothesis $H_0$, like "the coin is fair", and an alternative hypothesis $H_1$, like "the coin is unfair". We need to test how likely it is that $H_1$ is true using data. The p-value is the probability of getting a result that's as extreme or more extreme than the observed data, assuming the null hypothesis $H_0$ is true. If the p-value is small, then the alternative hypothesis is likely true. If I do ten coin flips and get 9 heads and 1 tail, the p-value is the probability that a fair coin still gives a result that extreme. The "extreme" here is two-sided, so the p-value is $P(\text{9 heads 1 tail}) + P(\text{10 heads 0 tails}) + P(\text{1 head 9 tails}) + P(\text{0 heads 10 tails})$, assuming the coin flips are fair. Can we swap the null hypothesis and the alternative hypothesis? For two conflicting hypotheses, which one should be the null hypothesis? The key is the burden of proof.
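Returning to the Benford experiment: the log-scale Monte Carlo described above (with $m$ uniform between 1 and 10) can be sketched as follows. The value $\alpha = 0.1$, the trial count and the seed are arbitrary choices of mine; with small $\alpha$ the digit frequencies come out close to $\log_{10}(1+1/d)$:

```python
import math, random
from collections import Counter

def benford_counts(alpha, trials, rng):
    # Monte Carlo entirely in log scale: we never materialize x itself
    counts = Counter()
    for _ in range(trials):
        m = rng.uniform(1, 10)  # random minimum value between 1 and 10
        p = rng.random()
        # inverse-CDF sample in log scale: log10 Q(p) = log10(m) - log10(1-p)/alpha
        log_x = math.log10(m) - math.log10(1 - p) / alpha
        frac = log_x % 1.0            # fractional part of log10 decides the first digit
        counts[int(10 ** frac)] += 1  # 10**frac is in [1, 10)
    return counts

rng = random.Random(0)
counts = benford_counts(alpha=0.1, trials=100_000, rng=rng)
for d in range(1, 10):
    print(d, counts[d] / 100_000, math.log10(1 + 1 / d))
```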
The null hypothesis is the default that most people tend to agree with and does not need proving. The alternative hypothesis is the special claim that you must support using the data. The lower the p-value, the higher your confidence that the alternative hypothesis is true. But due to randomness you cannot be 100% sure. If you are doing an A/B test, you keep collecting data, and you make a conclusion as soon as there is statistical significance (like a p-value lower than 0.05), this is not statistically sound: a random fluctuation in the process could lead to a false positive result. A more rigorous approach is to determine the required sample size before the A/B test. And the less data you have, the stricter the hypothesis test should be (lower p-value threshold). According to the O'Brien-Fleming boundary, the p-value threshold should be 0.001 when you have 25% of the data, 0.005 at 50%, 0.015 at 75% and 0.045 at 100%. If I have some samples and I calculate values like the mean, variance or median, the calculated value is called a statistic. Statistics themselves are also random. If you are sure that "in 95% probability the real median is between 8.1 and 8.2", then $[8.1,8.2]$ is a confidence interval with 95% confidence level. A confidence interval measures how uncertain a statistic is. One way of computing a confidence interval is called the bootstrap. It doesn't require you to assume that the statistic is normally distributed, but it does require the samples to be i.i.d. It works by resampling the data to create many replacements of the data, calculating the statistic of each replacement, and then reading the confidence interval off those statistics.
For example, if the original samples are $[1.0, 2.0, 3.0, 4.0, 5.0]$, resampling means randomly selecting one element from the original data, repeated 5 times, giving things like $[4.0, 2.0, 4.0, 5.0, 2.0]$ or $[3.0, 2.0, 4.0, 4.0, 5.0]$ (they are likely to contain duplicates). Then compute the statistic for each resample. If the confidence level is 95%, then the confidence interval's lower bound is the 2.5% percentile of these statistics, and the upper bound is the 97.5% percentile. When we train a model (including deep learning and linear regression) we want it to also work on new data that's not in the training set. But the training itself only changes the model parameters to fit the training data. Overfitting means the training makes the model "memorize" the training data without discovering the underlying rule in the real world that generates the training data. Reducing overfitting is a hard topic. Ways to reduce overfitting: Regularization. Force the model to be "simpler"; force the model to compress data. Weight sharing is also regularization (CNN is weight sharing compared to MLP). Add inductive bias to limit the possibilities of the model. (The old way of regularization is to simply reduce the parameter count, but in deep learning there is the deep double descent effect, where more parameters can be better.) Make the model more expressive. If the model is not expressive enough to capture the real underlying rule that generates the training data, it's simply unable to generalize. An example is that RNNs are less expressive than Transformers due to their fixed-size state. Make the training data more comprehensive. Reinforcement learning, if done properly, can provide more comprehensive training data than supervised learning, because of the randomness in interacting with the environment.
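The percentile-bootstrap procedure described above fits in a few lines. A minimal sketch (the data, seed, resample count and choice of the median as the statistic are all arbitrary):

```python
import random, statistics

def bootstrap_ci(data, stat, level=0.95, resamples=10_000, rng=None):
    # percentile bootstrap: resample with replacement, compute the statistic
    # of each resample, then read off the percentiles
    rng = rng or random.Random()
    stats = sorted(stat([rng.choice(data) for _ in data]) for _ in range(resamples))
    lo = stats[int((1 - level) / 2 * resamples)]
    hi = stats[int((1 + level) / 2 * resamples) - 1]
    return lo, hi

rng = random.Random(0)
data = [rng.gauss(8.15, 0.5) for _ in range(200)]
lo, hi = bootstrap_ci(data, statistics.median, rng=rng)
print(lo, hi)  # a 95% confidence interval around the sample median
```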
How to test how overfit a model is?

Frequentist: probability is an objective thing; we learn probabilities from the results of repeating a random event many times under the same conditions. Bayesian: probability is a subjective thing; it means how likely you think something is to happen, based on your initial assumptions and the evidence you see. Probability is relative to the information you have.

A discrete distribution can be: a table giving the probability of each possible outcome; a function whose input is an outcome and whose output is its probability; a vector (an array) whose i-th element is the probability of the i-th outcome; or a histogram where each pillar is an outcome and the pillar's height is its probability.

A continuous distribution can be described by a probability density function (PDF) $f$. A continuous distribution has infinitely many outcomes, and the probability of any specific outcome is zero (usually). We care about the probability of a range: $P(a<X<b)=\int_a^b f(x)\,dx$. The integral over the whole range must be 1: $\int_{-\infty}^{\infty} f(x)\,dx = 1$. The value of a PDF can be larger than 1.

A distribution can also be described by its cumulative distribution function (CDF), $F(x) = P(X \leq x)$, which is the integral of the PDF: $F(x) = \int_{-\infty}^{x} f(t)\,dt$. It starts at 0, monotonically increases, and approaches 1.

The quantile function $Q$ is the inverse of the CDF: $Q(p)=x$ means $F(x)=p$, i.e. $P(X \leq x)=p$. The top-25% cutoff is $Q(0.75)$; the bottom-25% cutoff is $Q(0.25)$.
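The CDF/quantile inverse relationship can be tried with Python's standard-library `statistics.NormalDist` (a small sketch for the standard normal):

```python
from statistics import NormalDist

d = NormalDist(mu=0, sigma=1)  # standard normal

# CDF: F(x) = P(X <= x)
p = d.cdf(0.0)            # 0.5 by symmetry

# Quantile function Q is the inverse of the CDF
x75 = d.inv_cdf(0.75)     # the top-25% cutoff, about 0.674
x25 = d.inv_cdf(0.25)     # the bottom-25% cutoff, about -0.674
```

The round trip `d.inv_cdf(d.cdf(x))` recovers `x`, which is exactly the inverse relationship $Q(F(x)) = x$ described above.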
The prior is what I assume the distribution is before seeing new information. After seeing new information and updating my understanding, the new assumed distribution is the posterior.

Means of random variables add up: $E[X+Y] = E[X] + E[Y]$, and $E[\sum_i X_i] = \sum_i E[X_i]$. Multiplying a random variable by a constant $k$ multiplies its mean: $E[kX] = k \cdot E[X]$. A constant's mean is that constant: $E[k] = k$.

The theoretical mean is the weighted average using theoretical probabilities; the estimated mean (empirical mean, sample mean) is the unweighted average over samples. The theoretical mean is an exact value, determined by the theoretical distribution; the estimated mean is an inexact random variable, because it is calculated from random samples.

Layer normalization works on a single vector. It treats each element of the vector as a different sample from the same distribution, and replaces each element with its Z-score (using the sample mean and sample standard deviation). Batch normalization works on a batch of vectors. It treats the elements at the same index across the vectors in a batch as different samples from the same distribution, and then computes Z-scores (using the sample mean and sample standard deviation).

Take the input $\boldsymbol{x} = (x_1, x_2, ..., x_n)$ and the vector of ones $\boldsymbol{1} = (1, 1, ..., 1)$.

Computing the sample mean can be seen as scaling by $\frac{1}{n}$ then taking the dot product with the vector of ones: $\hat\mu = \frac{1}{n}\, \boldsymbol{x} \cdot \boldsymbol{1}$.

Subtracting the sample mean can be seen as subtracting $\hat\mu \cdot \boldsymbol{1}$; call the result $\boldsymbol{y}$: $\boldsymbol{y} = \boldsymbol{x} - \hat\mu \cdot \boldsymbol{1} = \boldsymbol{x} - \frac{1}{n}(\boldsymbol{x} \cdot \boldsymbol{1}) \cdot \boldsymbol{1}$.

Recall projection: projecting vector $\boldsymbol{a}$ onto $\boldsymbol{b}$ gives $(\frac{\boldsymbol{a} \cdot \boldsymbol{b}}{\boldsymbol{b} \cdot \boldsymbol{b}}) \cdot \boldsymbol{b}$. Since $(\boldsymbol{1})^2 = n$, the term $\frac{1}{n}(\boldsymbol{x} \cdot \boldsymbol{1}) \cdot \boldsymbol{1}$ is the projection of $\boldsymbol{x}$ onto $\boldsymbol{1}$. Subtracting it removes the component of $\boldsymbol{x}$ in the direction of $\boldsymbol{1}$, so $\boldsymbol{y}$ is orthogonal to $\boldsymbol{1}$: it lies in a hyperplane orthogonal to $\boldsymbol{1}$.

The standard deviation can be seen as the length of $\boldsymbol{y}$ divided by $\sqrt{n}$ (or $\sqrt{n-1}$): $\sigma^2 = \frac{1}{n}(\boldsymbol{y})^2$, so $\sigma = \frac{1}{\sqrt{n}}\,|\boldsymbol{y}|$. Dividing by the standard deviation can be seen as projecting onto the unit sphere and then multiplying by $\sqrt{n}$ (or $\sqrt{n-1}$).

So computing Z-scores can be seen as first projecting onto the hyperplane orthogonal to $\boldsymbol{1}$, then projecting onto the unit sphere and multiplying by $\sqrt{n}$ (or $\sqrt{n-1}$).

The n-th moment: $E[X^n]$.
The mean is the first moment. The n-th central moment: $E[(X-\mu)^n]$; variance is the second central moment. The n-th standardized central moment: $E[(\frac{X-\mu}{\sigma})^n]$; skewness is the third standardized central moment, and kurtosis is the fourth.

Setup: we have a random variable $Y$ that is correlated with $X$, and we know the true mean of $Y$, $E[Y]$.

Information entropy measures how uncertain a distribution is, and how much information a sample from that distribution carries.

If an event always happens, it carries zero information: $I(E) = 0$ if $P(E) = 1$. The rarer an event is, the more information (more surprise) it carries: $I(E)$ increases as $P(E)$ decreases. The information of two independent events happening together is the sum of the information of each event. Writing $(X, Y)$ for the combination of $X$ and $Y$, that means $I((X, Y)) = I(X) + I(Y)$ if $P((X, Y)) = P(X) \cdot P(Y)$. This implies the use of a logarithm.

A fair coin toss with two outcomes has 1 bit of information entropy: $0.5 \log_2(\frac{1}{0.5}) + 0.5 \log_2(\frac{1}{0.5}) = 1$ bit. If the coin is biased, for example 90% heads and 10% tails, its entropy is $0.9 \log_2(\frac{1}{0.9}) + 0.1 \log_2(\frac{1}{0.1}) \approx 0.47$ bits.
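The coin-entropy numbers above can be checked numerically (a minimal sketch; `entropy_bits` is an illustrative helper, not a library function):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: sum of p * log2(1/p) over outcomes with p > 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

fair = entropy_bits([0.5, 0.5])    # 1.0 bit
biased = entropy_bits([0.9, 0.1])  # ~0.47 bits
```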
If it is even more biased, with 99.99% heads and 0.01% tails, its entropy is $0.9999 \log_2(\frac{1}{0.9999}) + 0.0001 \log_2(\frac{1}{0.0001}) \approx 0.0015$ bits. If a coin toss is fair but has a 0.01% chance of landing standing on its edge, giving 3 outcomes with probabilities 0.0001, 0.49995, and 0.49995, its entropy is $0.0001 \log_2(\frac{1}{0.0001}) + 2 \cdot 0.49995 \log_2(\frac{1}{0.49995}) \approx 1.0014$ bits. (The standing-up event itself carries about 13.3 bits of information, but its probability is so low that it contributes little to the entropy.)

KL divergence is the "distance" between two distributions: if I "expect" distribution B but the distribution is actually A, how much "surprise" do I get on average. If I design a lossless compression algorithm optimized for B but use it to compress data from A, the compression will be suboptimal and contain redundant information; KL divergence measures how much redundant information it has on average.

Setup: we have two distributions, $A$ (the target distribution) and $B$ (the output of our model). We have $n$ samples from $A$: $x_1, x_2, ..., x_n$, and we know the probability of each sample under each distribution: $P_A(x_i)$ and $P_B(x_i)$.

Mutual information $I(X;Y)$ is zero if the joint distribution $(X, Y)$ is the same as $X \otimes Y$, which means X and Y are independent.
Mutual information $I(X;Y)$ is positive if X and Y are not independent. Mutual information is never negative, because KL divergence is never negative.

Minimize $I(\text{Input}\,;\,\text{IntermediaryRepresentation})$: try to compress the intermediary representation and remove unnecessary information about the input. Maximize $I(\text{IntermediaryRepresentation}\,;\,\text{Output})$: try to keep as much of the information in the intermediary representation that is relevant to the output as possible.

In the continuous case, convolution takes two probability density functions and gives a new probability density function. In the discrete case, convolution can take two functions (each mapping an outcome to its probability) and give a new function, or take two vectors (whose i-th elements are the probabilities of the i-th outcome) and give a new vector.

Likelihood: how likely it is that we get samples $x_1, x_2, ..., x_n$ from the modeled distribution using parameter $\theta$. Equivalently, how likely a parameter $\theta$ is to be the real underlying parameter, given independent samples $x_1, x_2, ..., x_n$.

If I assume the coin flip is fair, $\theta = 0.5$, the likelihood is about 0.000977. If I assume $\theta = 0.9$, the likelihood is about 0.387, which is larger. If I assume $\theta = 0.999$, the likelihood is about 0.00099, which is smaller than when assuming $\theta = 0.9$.
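Likelihood as a function of $\theta$ for fixed data can be sketched as follows (a minimal illustration; the flip sequence is a hypothetical 9-heads-1-tail observation chosen to mirror the comparison above):

```python
def likelihood(theta, flips):
    """Probability of observing this exact flip sequence,
    assuming independent tosses with P(heads) = theta."""
    p = 1.0
    for f in flips:
        p *= theta if f == "H" else 1 - theta
    return p

# Hypothetical observed data: 9 heads, then 1 tail.
flips = "HHHHHHHHHT"

# Compare candidate parameters; the maximum-likelihood choice here is 0.9,
# matching the fraction of heads in the data.
candidates = [0.5, 0.9, 0.999]
best = max(candidates, key=lambda t: likelihood(t, flips))
```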
It's a valid probability density function: $\int_{-\infty}^{\infty} f(x)\,dx = 1$ and $f(x) \geq 0$. The mean: $\int_{-\infty}^{\infty} x f(x)\,dx = \mu$. The variance constraint: $\int_{-\infty}^{\infty} f(x)(x-\mu)^2\,dx = \sigma^2$.

Moving a probability density function along the X axis doesn't change its entropy, so we can fix the mean at 0 (and substitute $x - \mu$ for $x$ after finishing the deduction). The term $\log\frac{1}{f(x)}$ already implicitly requires $f(x) > 0$. It turns out the mean constraint $\int_{-\infty}^{\infty} x f(x)\,dx = 0$ is not needed to deduce the result, so we can omit it from the Lagrange multipliers. (Including it is also fine, just more complex.)

If you multiply by a constant, the result still follows a normal distribution: $X \sim N \Rightarrow kX \sim N$. If you add a constant, the result still follows a normal distribution: $X \sim N \Rightarrow (X + k) \sim N$. If you add two independent normal random variables, the sum still follows a normal distribution: $X \sim N,\, Y \sim N \Rightarrow (X + Y) \sim N$. A linear combination of many independent normal variables also follows a normal distribution: $X_1 \sim N, X_2 \sim N, ..., X_n \sim N \Rightarrow (k_1 X_1 + k_2 X_2 + ... + k_n X_n) \sim N$.

We have a (row) vector $\boldsymbol{x} = (x_1, x_2, ..., x_n)$ of independent random variables, each following a normal distribution (not necessarily the same one). If we apply an affine transformation to that vector, multiplying by a matrix $A$ and adding an offset $\boldsymbol{b}$, $\boldsymbol{y} = A\boldsymbol{x} + \boldsymbol{b}$, then each element of $\boldsymbol{y}$ is a linear combination of normal variables, $y_i = x_1 A_{i,1} + x_2 A_{i,2} + ... + x_n A_{i,n} + b_i$, so each element of $\boldsymbol{y}$ also follows a normal distribution. Now $\boldsymbol{y}$ follows a multivariate normal distribution.

Covariance is symmetric: $\text{Cov}[X,Y] = \text{Cov}[Y,X]$. If X and Y are independent, $\text{Cov}[X,Y] = 0$. Adding a constant: $\text{Cov}[X+k, Y] = \text{Cov}[X,Y]$; covariance, like variance, is invariant to translation. Multiplying by a constant: $\text{Cov}[k \cdot X, Y] = k \cdot \text{Cov}[X,Y]$. Addition: $\text{Cov}[X+Y, Z] = \text{Cov}[X,Z] + \text{Cov}[Y,Z]$.

$p_0(x_0)$ is the probability density of the original clean value (for image generation, it corresponds to the probability distribution of images that we want to generate). $p_1(x_1)$ is the probability density of the noise-added value. $p_{1 \vert 0}(x_1 \vert x_0)$ is the probability density of the noise-added value, given clean training data $x_0$; given $x_0$, it is a normal distribution.
It can also be seen as a function taking two arguments $x_0, x_1$. $p_{0 \vert 1}(x_0 \vert x_1)$ is the probability density of the original clean value given the noise-added value; it can also be seen as a function taking two arguments $x_0, x_1$.

Examples: the lifetime of machine components; the time until a radioactive atom decays; the length of phone calls; the time interval between two packets arriving at a router.

Log-normal distribution: if $\log X$ is normally distributed, then $X$ follows a log-normal distribution. Put another way, if $Y$ is normally distributed, then $e^Y$ follows a log-normal distribution. Stretched exponential distribution: $P(X>x)$ is roughly proportional to $e^{-kx^\beta}$ (with $\beta < 1$). Power law with exponential cutoff: $P(X>x)$ is roughly proportional to $x^{-\alpha} e^{-\lambda x}$.

Separate the data into a training set and a test set.
Only train using the training set and check model performance on the test set.

Test sensitivity to random fluctuation: add randomness to the parameters, input, hyperparameters, etc., then observe model performance. An overfit model is more vulnerable to random perturbation, because memorization is more "fragile" than the real underlying rule.

Common statistical fallacies: survivorship bias and selection bias; Simpson's paradox and the base rate fallacy; confusing correlation with causality; trying too many different hypotheses (spurious correlations); collecting data until significance; wrongly removing outliers.
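The train/test-split check described above can be sketched end to end (a toy illustration; the dataset, the "memorizer", and the fitted line are all hypothetical choices of mine, not from the text):

```python
import random
import statistics

random.seed(1)

# Hypothetical dataset generated by y = 2x + noise.
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(40)]
random.shuffle(data)
train, test = data[:30], data[30:]

def mse(model, pairs):
    return statistics.mean((model(x) - y) ** 2 for x, y in pairs)

# Overfit "model": memorizes training targets exactly,
# falls back to the training mean for unseen inputs.
memory = dict(train)
train_mean = statistics.mean(y for _, y in train)

def memorizer(x):
    return memory.get(x, train_mean)

# Simple model: least-squares line fitted on the training set only.
xs = [x for x, _ in train]
mx = statistics.mean(xs)
my = statistics.mean(y for _, y in train)
slope = sum((a - mx) * (b - my) for a, b in train) / sum((a - mx) ** 2 for a in xs)

def line(x):
    return my + slope * (x - mx)

# The memorizer has zero training error but a large train/test gap;
# the fitted line performs similarly on both sets.
gap_mem = mse(memorizer, test) - mse(memorizer, train)
gap_line = mse(line, test) - mse(line, train)
```

A large gap between training and test error is the classic symptom of overfitting that the train/test split is designed to expose.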

qouteall notes 10 months ago

Notes about Cognitive Biases

A summarization of the main cognitive biases, relating to financial trading, combined with my own understanding.

Diminishing marginal utility: for example, someone hungry eats 3 pieces of bread; the first piece, eaten while hungry, has more utility than the second, and so on. The more of something you have, the less utility one more of it has.

Corresponding to diminishing marginal utility, the happiness of gaining $200 is less than twice the happiness of gaining $100: the perceived value of gains is concave. The same applies to pain: the pain of losing $100 twice is greater than the pain of losing $200 at once. Weber-Fechner law: human sensory perception is roughly logarithmic in the actual stimulus.

The "gain/loss" is relative to the expectation (frame of reference). Different people have different expectations in different scenarios, so expectation management is important: if the outcome is good but doesn't meet a high expectation, it still causes disappointment, and vice versa.

The expectation can gradually change; people gradually get used to the new norm. This lets people endure bad environments, and also keeps them from staying satisfied after achievements. Shifting baseline syndrome (boiling frog syndrome): if reality keeps changing slowly, the expectation also keeps nudging, eventually moving a lot without being noticed. This is also common in long-term psychological manipulation.

Relative deprivation: when people expect to have something they don't have, they feel as if they have lost it, even though they never actually had it.

Door-in-the-face effect: first make a large request that will likely be rejected, then make a modest request. The initial large request shifts the expectation, making the subsequent modest request easier to accept.

... If you say, "This room is too dark; we need to open a window here," everyone would definitely refuse.
However, if you propose removing the roof, they would come to a compromise and agree to open a window.

Protective pessimism: being pessimistic reduces the risk of disappointment. Procrastination is also related to protective pessimism: if you believe the outcome will be bad, then reducing the cost (the time and effort put in) feels "beneficial".

In real life, some risks are hard to reverse or are irreversible, so avoiding risk matters more than gaining. In investment, losing 10% requires gaining 11.1% to recover, and losing 50% requires gaining 100% to recover. Staying in the game is important, as it keeps one exposed to future opportunities. So losses have a larger mental impact than gains of the same size: the pain of losing $100 is bigger than the happiness of gaining $100. Unfortunately, loss aversion makes being unhappy easier and being happy harder.

Relative deprivation is also a kind of loss that people tend to avoid. For example, when the people around one get rich by investing in a bubble asset, one may also invest in the bubble asset to avoid the "relative loss" between oneself and others.

We prefer a certain gain over a risky gain. A bird in the hand is worth two in the bush. Given a 100% chance to gain $450 or a 50% chance to gain $1000, people tend to choose the former. Professions facing uncertain gains, like academic research, where it is common to work on a problem for years without getting any meaningful result, are not suitable for most people.

We prefer having hope over accepting failure. Given a 100% chance to lose $500 or a 50% chance to lose $1100, most people choose the latter: the second option has "hope", while the first means accepting failure. In this case, "no loss" is usually taken as the expectation. What if the expectation is "already losing $500"? Then the two choices become: 1. no change; 2. 50% gain $500 and 50% lose $600. In this case, people tend to choose the first, which has lower risk.
The expectation point is very important.

Telescoping effect. Vierordt's law: shorter time intervals tend to be overestimated, and longer time intervals tend to be underestimated. Oddball effect: time filled with novel and unexpected experience feels longer.

It seems that we feel the length of time via the amount of memory. Novel and unexpected experiences produce more memory; forgetting "compresses" time. As people get older, novel experiences become rarer, so time feels faster. Memories of feeling at risk carry higher "weight" (risk aversion), so time feels slower when feeling risk; in contrast, happy time feels like it goes faster. Reference: Time perception - Wikipedia

Hedonic treadmill: after some time of happiness, the expectation rises and happiness fades. What people have gained gradually gets taken for granted, and they always pursue more.

Do not spoil what you have by desiring what you have not; remember that what you now have was once among the things you only hoped for.

If happiness can be predicted, some of it moves earlier. For example, one is originally happy when eating delicious chocolate; then one becomes happy just after buying the chocolate, before eating it, and the happiness of actually eating it shrinks. Later the happiness can move even earlier, to the moment of deciding to buy chocolate. This effect is also called second-order conditioning.

Material consumption can give short-term satisfaction, but cannot give long-term well-being (the paradox of materialism). Long-term well-being is better achieved by sustainable consumption with temperance.

Means-end inversion: one originally wants money (the means) to improve life quality (the end). However, the process of making money can sacrifice life quality. Examples: investing all of one's money and leaving little for consumption, or choosing a high-paying job with no work-life balance (golden handcuffs).

We have already walked too far; we have forgotten why we embarked.
A man on a thousand mile walk has to forget his goal and say to himself every morning, "Today I'm going to cover twenty-five miles and then rest up and sleep." - Leo Tolstoy, War and Peace

People tend to maintain their ego through self-serving bias. People tend to be overconfident about themselves. The overconfidence is sometimes useful.

Looking back, people find past events (including Black Swan events) reasonable and predictable, although they didn't predict those events beforehand. In a complex world, one event can have two contradicting interpretations. People make excuses for their prediction failures.

People tend to justify previous behavior, even if that behavior was chosen randomly or made under external factors that no longer exist. Self-justification shows self-control and consistency, facilitating social collaboration. This is related to Stockholm syndrome: after experiencing pain, people tend to justify their previous pain. Ben Franklin effect: people like someone more after doing that person a favor.

Endowment effect: we place more value on the things we own (including ideas). Investors tend to be biased toward positive information about the stock they own. Disagreeing with an idea tends to be treated as an insult.

Foot-in-the-door effect: someone who has agreed to a small request tends to subsequently agree to a larger one. Saying becomes believing.

People want to project an image of high capability (to both others and themselves), but a failure can debunk that image. Self-handicapping is one way of protecting the image; it is an extension of protective pessimism. When one succeeds despite self-handicapping, it shows great capability. But if one fails, self-handicapping only protects the image towards oneself, not from others: people usually judge by results and see failed self-handicapping as low capability. Setting unrealistically high goals is sometimes a form of self-handicapping, but not always.
Self-handicapping is also a way of reducing responsibility. This is common in large corporations and governments: intentionally creating reasons for failure to reduce responsibility.

People tend to fight whatever opposes their desires. Providing external reward may reduce internal motivation (the overjustification effect). Being helped doesn't always elicit gratitude: the one being helped may feel inferior in social status, so helping can cause hatred, especially when the help cannot be reciprocated.

People tend to avoid thinking about inevitable death because it is unpleasant, and may subconsciously feel as if they will live forever. Stoicism proposes thinking about death all the time (memento mori). Thinking about death can keep one from procrastinating on important things, make one value the present, and reduce worry over small problems. But Stoicism does NOT propose indulgence and overdrawing the future.

Confirmation bias: people tend to seek and accept evidence that confirms their beliefs, and are reluctant to accept contradictory evidence. Confirmation bias can even make one subconsciously ignore information. Motivated reasoning: when people do not want to accept contradictory evidence, they may make up and believe non-falsifiable explanations that reconcile the evidence with the original belief.

With confirmation bias, more information increases confidence, but doesn't lead to better understanding.

If you don't have an opinion, resist the pressure to have one. - N. N. Taleb, Link

Information cocoon (echo chamber): people tend to actively choose the information sources they like, and make friends with those who hold similar beliefs.

Another thing I think should be avoided is extremely intense ideology, because it cabbages up one’s mind. ...
I have what I call an iron prescription that helps me keep sane when I naturally drift toward preferring one ideology over another. And that is I say “I’m not entitled to have an opinion on this subject unless I can state the arguments against my position better than the people do who are supporting it. I think that only when I reach that stage am I qualified to speak.” - Charlie Munger

Belief bias: if a conclusion confirms people's existing beliefs, they tend to believe it regardless of the correctness of the reasoning, and vice versa.

Bullshit asymmetry principle: refuting misinformation is much harder than producing it. To produce misinformation and make it spread, you just need to make it fit people's existing beliefs; to refute it, you need to find sound evidence. This is also a reversal of the burden of proof.

The good side of stubbornness is maintaining diversity of ideas in a society, which helps innovation and the overcoming of unknown risks. People tend to justify the groups they belong to (group justification), and the society they live in (system justification).

People love to correct others and persuade others. Some ideas are memes that drive people to spread them, and correcting others provides the satisfaction of superiority. However, due to belief stability, it is hard to persuade or teach others, and people dislike being persuaded or taught. This effect is common on internet social media.

The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it. - Terry Pratchett

Cunningham's Law: the best way to get the right answer on the internet is not to ask a question; it's to post the wrong answer.

Commitment can be a good thing: many goals require continual time, effort, and resources to achieve. However, some investments turn out to be bad and should be given up to avoid further loss. All the previous investment becomes sunk cost.
People are reluctant to give up because they have already invested a lot. Executing a stop-loss signals failure, and we prefer having hope over accepting failure.

Opportunity cost: if you allocate a resource (time, money) to one thing, that resource cannot be used on other things that might be better. Opportunity cost is not obvious.

The difference between "good persistence" and "bad obstinacy":

The persistent are like boats whose engines can't be throttled back. The obstinate are like boats whose rudders can't be turned. ... The persistent are much more attached to points high in the decision tree than to minor ones lower down, while the obstinate spray "don't give up" indiscriminately over the whole tree. - Paul Graham, The Right Kind of Stubborn

An environment that doesn't tolerate failure makes people avoid correcting mistakes and stay obstinate on the wrong path (especially in authoritarian environments, where loyalty and execution attitude override honesty).

When you’re in the midst of building a product, you will often randomly stumble across an insight that completely invalidates your original thesis. In many cases, there will be no solution. And now you’re forced to pivot or start over completely. If you’ve only worked at a big company, you will be instinctually compelled to keep going because of how pivoting would reflect on stakeholders. This behavior is essentially ingrained in your subconscious - from years of constantly worrying about how things could jeopardize your performance review, and effectively your compensation. This is why so many dud products at BigCos will survive with anemic adoption. Instead, it’s important to build an almost academic culture of intellectual honesty - so that being wrong is met with a quick (and stoic) acceptance by everyone. There is nothing worse than a team that continues to chase a mirage. - Nikita Bier, Link

Drip pricing: only show the extra price (e.g. a service fee) once the customer has already decided to buy.
A customer who has already spent effort deciding tends to stick with the decision.

Ignoring negative information or warning signs to avoid psychological discomfort. Robert Trivers proposes that we deceive ourselves to better deceive others. Saying becomes believing: telling a lie enough times may make one truly believe it.

We can learn from the world in an information-efficient way: learning quickly from very little information. [1] The flip side of information-efficient learning is hasty generalization: we tend to generalize quickly from very few examples, rather than using logical reasoning and statistical evidence, and thus easily get fooled by randomness.

Reality is complex, so we simplify things to make them easier to understand and remember; but the simplification can go wrong. There is too much information in the world, so we have heuristics for filtering it. To simplify, we tend to make up reasons for why things happen: a "reasonable" thing is simpler and easier to memorize than raw complex facts. This process is also compression. [2]

People tend to see false patterns in random things; this effect is called apophenia. Relatedly, most people cannot actually behave randomly even if they try. An example: Aaronson Oracle.

Suppose there are two lights, the first flashing with 70% probability and the second with 30%. When asked to predict which light flashes next, people tend to look for patterns even though the flashing is purely random, achieving a correct rate of about 58%. People tend to do frequency matching: their predictions also contain 70% first light and 30% second light. But in that lab-experiment environment the flashing is purely random and the probabilities stay the same, so the optimal strategy is not to try to predict patterns but to always choose the first light, which has the larger probability, for a 70% correct rate.
Reference: The Left Hemisphere’s Role in Hypothesis Formation Although the strategy of always choosing the highest-probability option is optimal in that lab environment, it's not a good strategy in the complex, changing real world: When statistical analysis shows that A correlates with B, the possible causes are: Examples where the correlation of A and B is actually driven by another factor C: Among my favorite examples of misunderstood fitness markers is a friend of a friend who had heard that grip strength was correlated with health. He bought one of those grip squeeze things, and went crazy with it, eventually developing tendonitis. - Paul Kedrosky, Link Narrative fallacy is introduced in The Black Swan: We like stories, we like to summarize, and we like to simplify, i.e., to reduce the dimension of matters. The fallacy is associated with our vulnerability to overinterpretation and our predilection for compact stories over raw truths. It severely distorts our mental representation of the world; it is particularly acute when it comes to the rare events. - The Black Swan Narrative fallacy includes: People tend to judge things by first impression. This makes people generate a belief from only one observation, which is information-efficient, but can also be biased. Nominal fallacy: understanding a thing just by its name. Examples: People like to judge a decision by its immediate result. However, the real world is full of randomness. A good plan may yield a bad result and a bad plan may yield a good result. And the short-term result can differ from the long-term result. There is no perfect strategy that will guarantee success. Overemphasizing short-term outcomes leads to abandoning good strategies prematurely. The quicker the feedback arrives, the quicker people can learn (this also applies to reinforcement learning AI).
But if the feedback is delayed by 6 months, it's hard to learn from it, and people may make wrong hasty generalizations from random coincidences before the real feedback comes, thus getting fooled by randomness. When feedback comes early, its correlation with previous behavior is high, giving a high signal-to-noise ratio. If feedback comes late, many previous behaviors may correlate with it, so the feedback has a low signal-to-noise ratio. Reducing cost by removing safety measures usually does not cause any visible accidents in the short run, but the benefit of reduced costs is immediately visible. When an accident actually happens because of the removed safety measures, it may be years later. People crave quick feedback. Successful video games and gambling mechanisms utilize this by providing immediate responses to actions. What's more, for most people, concrete visual and audio feedback is more appealing than abstract feedback (the feedback from working with words and math symbols). The previously mentioned reverse psychology is also related to learning. Being forced to learn makes one dislike learning it. Self-directed learning lets one focus on what they are interested in, and is thus more effective. To summarize, most people naturally prefer the learning that: It's also hard to learn if the effect of a decision is applied to other people, especially for decision-makers: It is so easy to be wrong - and to persist in being wrong - when the costs of being wrong are paid by others. - Thomas Sowell People tend to simplify causal relationships and ignore complex nuance. If X is one factor that causes Y, people tend to treat X as the only reason that caused Y, oversimplifying the causal relationship. Usually, a superficial effect is seen as the reason, instead of the underlying root cause. Examples of causal oversimplification: Oversimplification: "Poor people are poor because they are lazy."
Other related factors: Education access, systemic discrimination, health disparities, job market conditions, the Matthew effect, etc. Oversimplification: "Immigrants are the cause of unemployment." Other related factors: Manufacturing relocation, automation technologies, economic cycles, skill mismatches, overall labor market conditions, etc. Oversimplification: "The Great Depression happened because of the stock market crash of 1929." Other related factors: Excessive financial risk-taking, lack of regulatory oversight, production overcapacity, wealth inequality, and international economic imbalances, etc. Oversimplification: "That company succeeded because of the CEO." Other related factors: Market conditions, employee contributions, incentive structures, company culture, government-business relationships, competitive landscape, and the cumulative impact of past leadership decisions, etc. For every complex problem there is an answer that is clear, simple, and wrong. - H. L. Mencken People often dream of a "silver bullet" that simply magically works: We tend to simplify things. One way of simplifying is to ignore the grey zone and complex nuance, reducing things to two simple extremes. Examples of binary thinking: People's evaluations are anchored on expectations, and failing to meet an expectation can make people's belief swing to the other extreme. Technology Hype Cycle: The Internet has indeed changed the world, but the dot-com bubble still burst. The power of the Internet required time to unleash, and people placed too much expectation on it too early. Neglect of probability: either neglecting a risk entirely or overreacting to it. Strawman argument is a debating technique: refuting a distorted version of the opponent's idea. It often utilizes binary thinking: refuting a more extreme version of the opponent's idea. Examples: Halo effect: Liking one aspect of a thing causes liking all aspects of that thing and its related things.
(As the Chinese idiom 爱屋及乌 puts it: love for a house extends to the crow on its roof.) Horn effect is the inverse of the halo effect: if people dislike one aspect of a thing, they tend to dislike all aspects of that thing and its related things. People tend to judge words by the political stance of the person who said them. Disagreements on ideas tend to become insults against people. Halo effect and horn effect are related to binary thinking. People prefer a definite answer over ambiguity or uncertainty (such as "I don't know", "it depends on the exact case", "needs more investigation"), even if the answer is inaccurate or made up. This is related to narrative fallacy: people like to make up reasons explaining why things happen. One day in December 2003, when Saddam Hussein was captured, Bloomberg News flashed the following headline at 13:01: U.S. TREASURIES RISE; HUSSEIN CAPTURE MAY NOT CURB TERRORISM. ...... As these U.S. Treasury bonds fell in price (they fluctuate all day long, so there was nothing special about that) ...... they issued the next bulletin: U.S. TREASURIES FALL; HUSSEIN CAPTURE BOOSTS ALLURE OF RISKY ASSETS. - The Black Swan People dislike an uncertain future and keep predicting the future, while ignoring their terrible past prediction record (hindsight bias). People like to wrongly apply a theory to the real world, because applying the theory can give results. Example: assuming that an unknown distribution is Gaussian even when it's not. Zeigarnik effect: People focus on uncompleted things more than completed things. When some desire is not fulfilled (gambling not winning, PvP game not winning, browsing social media not seeing wanted content, etc.), the desire becomes more significant. This effect can keep one from wanting to sleep. Need for closure is also related to curiosity. People may idealize the things that they are not familiar with: People tend to idealize the distant past and forget the past misery.
This helps people get out of trauma, and at the same time idealizes past things: People may think that they deeply understand something, until they write it down. When writing it down, the "gaps" in the idea are revealed. Pure thinking is usually vague and incomplete, but people overestimate the rationality of their pure thinking. The reason I've spent so long establishing this rather obvious point [that writing helps you refine your thinking] is that it leads to another that many people will find shocking. If writing down your ideas always makes them more precise and more complete, then no one who hasn't written about a topic has fully formed ideas about it. And someone who never writes has no fully formed ideas about anything nontrivial. It feels to them as if they do, especially if they're not in the habit of critically examining their own thinking. Ideas can feel complete. It's only when you try to put them into words that you discover they're not. So if you never subject your ideas to that test, you'll not only never have fully formed ideas, but also never realize it. - Paul Graham, Link Even so, writing the idea down may still not be enough, because natural language is vague, and vagueness can hide practical details. The issues hidden by the vagueness in language will be revealed in real practice (e.g. turning a software requirement into code). Having ideas is easy and cheap. If you search the internet carefully you are likely to find ideas similar to yours. What matters is validating and executing the idea. According to predictive processing theory, the brain predicts (hallucinates) most of perception (what you see, hear, touch, etc.). The sensory signals just correct that prediction (hallucination). Body transfer illusion (fake hand experiment) Prior belief (confirmation bias) can often greatly affect perception. This not only affects the recognition of objects, but also affects the reading of text.
Under confirmation bias, when reading text, one may skip important words subconsciously. Free energy principle: The brain tries to minimize free energy. Free energy = Surprise + Change of Belief The ways of reducing free energy: Survivorship bias means considering only the "surviving", observed samples and not the "silent", "dead", unobserved samples, neglecting the selection mechanism behind the samples. A popular image of survivorship bias: The planes that get hit in critical places never come back, and thus don't get included in the statistics of bullet holes, forming the regions without bullet holes in that image. Other examples of survivorship bias: A more general version of survivorship bias is selection bias: when the sampling is not uniform and contains a selection mechanism (not necessarily a 100% accurate selection), the result will be biased. The opinions on social media do not necessarily represent most people's views. There are several selection mechanisms: 1. not all people use the same social media platform; 2. people using social media may not post their opinions; 3. not all posted opinions will be seen by you, due to algorithmic recommendation. Some physicists propose the Anthropic Principle: the physical laws allow life because the existence of life "selects" the physical laws. The apparent specialness of the physical laws comes from survivorship bias. Availability bias: When thinking, the immediate examples that come to mind play a big role. Example: If you recently saw a car crash, you tend to think that traveling by car is riskier than traveling by plane. However, if you recently watched a movie about a plane crash, you might feel that planes are more dangerous. Nothing in life is as important as you think it is when you are thinking about it. - Daniel Kahneman Vividness bias: People tend to believe vivid things and stories more than abstract statistical evidence. This is related to anecdotal fallacy and narrative fallacy.
The Italian Toddler: In the late 1970s, a toddler fell into a well in Italy. The rescue team could not pull him out of the hole and the child stayed at the bottom of the well, helplessly crying. ...... the whole of Italy was concerned with his fate ...... The child's cries produced acute pains of guilt in the powerless rescuers and reporters. His picture was prominently displayed on magazines and newspapers ..... Meanwhile, the civil war was raging in Lebanon ...... Five miles away, people were dying from the war, citizens were threatened with car bombs, but the fate of the Italian child ranked high among the interests of the population in the Christian quarter of Beirut. - The Black Swan Enforcing safety measures is usually unappreciated, because people only see the visible cost and friction caused by safety measures (concrete), and do not see the consequences of not applying safety measures in a parallel universe (abstract), until an incident really happens (concrete). People are more likely to pay for terrorism insurance than for plain insurance that covers terrorism along with other things. If people are given some choices, they tend to choose one of the provided choices and ignore the fact that other choices exist. This is also a framing effect. People tend to attribute a product to one public figure, or attribute a company to its CEO, because that's the name they know, and because of the tendency toward causal simplification. People often think the quality of new movies/games/novels is declining, worse than those produced in a past "golden age". However, this is mainly because people remember only the good ones and neglect the bad ones filtered out by time. Interestingly, LLMs also seem to have availability bias: information mentioned earlier in the context can guide or mislead subsequent output. The knowledge that's "implicit" in the LLM may be suppressed by context.
When reviewing a document, most reviewers tend to nitpick the easiest-to-understand places, like diagrams or summaries, while not reading the subsequent text that explains the nuances. When judging other people's decisions, people often see only the visible downsides and don't see that the decision is a tradeoff that avoids larger downsides. Agenda-setting theory: what the media pay attention to can influence people's attention, and then their opinions. Saliency bias: We pay attention to the salient things that grab attention. The things that we don't pay attention to are ignored. Attention is a core mechanism of how the brain works 5 . People tend to believe stories, anecdotes or individual examples, even if these examples are made up or are just statistical outliers. On the contrary, people are less likely to believe abstract statistical evidence. People prefer familiar things. One reason is availability bias. Another reason is that people self-justify their previous attention and dedication. This is highly related to availability bias. When making decisions, people tend to focus on what they already know, and ignore the aspects that they do not know or are not familiar with. We have already considered what we already know, so we should focus on what we don't know in decision making. This is related to risk compensation: People tend to take more risk in familiar situations. Imprinting: At a young age, people are more likely to embrace new things. At an older age, people are more likely to prefer familiar things and avoid taking risks in unfamiliar things (baby duck syndrome). Most people tend to treat LLM chatbots as similar to humans, because the most familiar form of intelligence is human. However, LLMs differ from humans in many fundamental ways: One similarity: both humans and LLMs can "hallucinate" in a consistent way. When humans forget something, they tend to make up consistent information to fill the hole in memory.
LLMs' hallucinations are also seemingly plausible, not just random. Noticing something more frequently after learning about it leads to overestimating its prevalence or importance. Sometimes one talks about something, then sees its ad on social media, and suspects that their phone and social media app are recording voice for ad recommendation. That possibility exists, but the perception of it is exaggerated by frequency illusion. People tend to judge things by comparing them with examples (stereotypes) that come to mind, and tend to think that one sample is representative of the whole group. Representativeness bias can sometimes be misleading: Say you had the choice between two surgeons of similar rank in the same department in some hospital. The first is highly refined in appearance; he wears silver-rimmed glasses, has a thin build, delicate hands, measured speech, and elegant gestures. ... The second one looks like a butcher; he is overweight, with large hands, uncouth speech, and an unkempt appearance. His shirt is dangling from the back. ... Now if I had to pick, I would overcome my sucker-proneness and take the butcher any minute. Even more: I would seek the butcher as a third option if my choice was between two doctors who looked like doctors. Why? Simply the one who doesn’t look the part, conditional on having made a (sort of) successful career in his profession, had to have much to overcome in terms of perception. And if we are lucky enough to have people who do not look the part, it is thanks to the presence of some skin in the game, the contact with reality that filters out incompetence, as reality is blind to looks. - Skin in the game Note that the above quote should NOT be simplified to mean that "the unprofessional-looking ones are always better". It depends on the exact case. When an event has occurred frequently, people tend to believe that it will occur less frequently in the future.
One related topic is the law of large numbers: if there are enough samples of a random event, the average of the results will converge. The law of large numbers focuses on the total average and does not consider the exact order. The law of large numbers works by diluting unevenness rather than correcting it. For example, a fair coin toss will converge to 1/2 heads and 1/2 tails. Even if past events contain 90% heads and 10% tails, this does not mean that future events will contain more tails to "correct" the past unevenness. The large amount of future samples will dilute the finite amount of uneven past samples, eventually approaching 50% heads. Actually, the gambler's fallacy can be correct in a system with a negative feedback loop, where the short-term distribution is changed by past samples. Such feedback loops are common in nature, such as the predator-prey population relation. They also appear in markets with cycles. (Note that in financial markets, some cycles are much longer than expected, forming trends.) In a PvP game with an Elo-score-based matching mechanism, losing makes you more likely to win in the short term. One related concept is regression to the mean, meaning that if one sample is significantly higher than average, the next sample is likely to be lower than the last sample, and vice versa. Example: if a student's score follows a normal distribution with average 80, then after scoring 90, the student will likely get a score below 90 in the next exam. The difference between gambler's fallacy and regression to the mean: Regression fallacy: after doing something, when regression to the mean happens, people tend to think that what they did caused the effect (hasty generalization). Example: a kid gets a bad score; the parent criticizes; the kid then gets a better score. It seems that criticizing improved the score, although this is just regression to the mean, which happens naturally.
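The "dilution, not correction" behavior of the law of large numbers described above can be checked with a short simulation. This is a minimal sketch under assumed numbers (an initial uneven history of 90 heads out of 100, followed by a million fair flips):

```python
import random

random.seed(0)

# Start from an uneven history: 90 heads out of 100 flips (90% heads).
heads, total = 90, 100

# A fair coin has no memory: future flips stay 50/50 regardless of history.
for _ in range(1_000_000):
    heads += random.random() < 0.5
    total += 1

# The overall ratio approaches 0.50: the uneven start is diluted by the
# mass of new fair flips, not "corrected" by extra tails.
print(f"overall heads ratio: {heads / total:.2f}")  # ≈ 0.50
```

The 40-head surplus from the start never goes away; it just becomes negligible relative to the million later flips.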
People tend to think that more specific and reasonable cases are more likely than abstract and general cases. Consider two scenarios: Although B is more specific than A, and thus has a lower probability than A, people tend to think B is more likely than A. B implies a causal relationship, and thus looks more reasonable. People tend to think that a story with more details is more probable, treating plausibility as probability. A story with more details is not necessarily more probable, as the details can be made up. Making a story more reasonable allows better information compression, making it easier to remember and recall. People often assume that others know what they know. So people often omit important details when explaining things, causing problems in communication and teaching. When learning a new domain of knowledge, it's beneficial to ask "stupid questions". These "stupid questions" are actually fundamental questions, but they are seen as stupid by experts, who have forgotten what it's like not to know the fundamentals, under the curse of knowledge. One benefit of AI is that you can ask "stupid questions" without being humiliated (but be wary of hallucinations). Simplicity is often confused with familiarity. If one is very familiar with a complex thing, they tend to think that thing is simple. Normalcy bias: Thinking that the past trend will always continue. This is partially due to confirmation bias. Although the market has trends, and a trend may last much longer than expected, no trend continues forever. Anything that is physically constrained cannot grow forever. Most people are late trend followers in investment: not believing in a trend in the beginning, then firmly believing in the trend in its late stage. This is dangerous, because the market has cycles, and some macro-scale cycles can span years or even decades. The experience gained in the surge part of the cycle is harmful in the decline part, and vice versa.
Overemphasizing recent events while ignoring long-term trends. This is related to Amara's law: we tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run. This is also related to availability bias, where just-seen events are more obvious and easier to recall than old events and non-obvious underlying trends. Normalcy bias means under-reacting to new events, while recency bias means over-reacting to them, which is the opposite. The two are not actually in conflict: which one takes effect initially depends on the actual situation and existing beliefs ( confirmation bias ). When a person does not believe in a trend but the trend continues for a long time, binary thinking may make them turn their belief 180 degrees and deeply believe in the trend. People tend to make decisions based on how information is presented (framed) rather than on objective facts. There are many ways to frame one fact. For example, one from the positive aspect, one from the negative aspect: A content creator can emphasize one aspect and downplay another, and use different wording or art styles to convey different opinions. People reading the information can easily be influenced by the framing subconsciously. The name of a thing affects how people perceive it. Examples: A loaded question is a question that contains an assumption (framing). Following that assumption can lead to a biased answer. Example: "Do you support the attempt by the US to bring freedom and democracy to other places in the world?" The current LLMs are mostly trained to satisfy the user. If you ask an LLM a loaded question that carries a bias, the LLM often follows your bias to please you. Asking the right question requires the right assumptions. Mehrabian's rule: When communicating attitudes and feelings, the impact is 7% verbal (words), 38% vocal (tone of voice), 55% non-verbal (facial expressions, gestures, posture).
Note that this doesn't apply to all kinds of communication. Just looking confident can often make other people believe. This even applies when the talker is AI: A friend sent me MRI brain scan results and I put it through Claude. No other AI would provide a diagnosis, Claude did. Claude found an aggressive tumour. The radiologist report came back clean. I annoyed the radiologists until they re-checked. They did so with 3 radiologists and their own AI. Came back clean, so looks like Claude was wrong. But looks how convincing Claude sounds! We're still early... Anchoring bias: People's judgement may be influenced by reference "anchors", even if the reference anchor is irrelevant to the decision. Anchoring is a kind of framing. A salesman may first show customers an expensive product, then cheaper products, making the cheaper products feel like bargains, utilizing anchoring bias. The Anchoring Bias and its Effect on Judges. Decoy effect: Adding a new, worse option to make another option look relatively better. Lie by omission: A person can tell a lot of truth while omitting the important facts and stressing unimportant facts (wrong framing), intentionally causing misunderstanding, while not lying in the literal sense. A price chart is often drawn with the lowest price at the bottom and the highest price at the top. The offset and scale of the chart are also framing. If a stock has already fallen by 30%, the latest price is at the bottom of the chart, so the stock seems cheap when looking at the chart, but it may actually not be cheap at all, and vice versa. Reversal of burden of proof: One common debating technique is to shift the burden of proof onto the opponent: "My claim is true because you cannot prove it is false." "You are guilty because you cannot prove you are innocent." The PowerPoint (keynote, slide) medium is good for persuading, but bad for communicating information.
The PowerPoint medium encourages authors to omit information instead of writing details. Amazon bans PowerPoint for internal usage. See also: Columbia Space Shuttle Disaster, Military spaghetti powerpoint. Two different talking styles: the charismatic leader type and the intellectual expert type. Note that these are two simplified stereotypes; real cases may differ. "Shooting the messenger" means blaming the one who brings the bad news, even though the messenger has no responsibility for causing it. The same effect happens in other forms: Imagine someone who keeps adding sand to a sand pile without any visible consequence, until suddenly the entire pile crumbles. It would be foolish to blame the collapse on the last grain of sand rather than the structure of the pile, but that is what people do consistently, and that is the policy error. ... As with a crumbling sand pile, it would be foolish to attribute the collapse of a fragile bridge to the last truck that crossed it, and even more foolish to try to predict in advance which truck might bring it down. ... Obama’s mistake illustrates the illusion of local causal chains-that is, confusing catalysts for causes and assuming that one can know which catalyst will produce which effect. - The Black Swan of Cairo; How Suppressing Volatility Makes the World Less Predictable and More Dangerous People tend to value scarce things even when they are not actually valuable, and undervalue good things that are abundant. People tend to value something only after losing it. Health is forgotten until it’s the only thing that matters. - Bryan Johnson, Link The correlation across overall samples may contradict the correlation inside each subgroup. Base rate fallacy: there are more vaccinated COVID-19 patients than unvaccinated COVID-19 patients in the hospital, but that doesn't mean the vaccine is bad: In these cases, the confounding variable corresponds to which subgroup the sample is in.
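The vaccinated-patient example above comes down to base rates. A quick sketch with hypothetical numbers (the 90% vaccination rate and the per-person hospitalization risks are my own illustrative assumptions, not real statistics) shows how an effective vaccine can still produce more vaccinated patients in the hospital:

```python
# Hypothetical numbers: 90% of the population is vaccinated, and the
# vaccine cuts per-person hospitalization risk 5x (0.1% vs 0.5%).
population = 1_000_000
vaccinated = int(population * 0.90)       # 900,000 people
unvaccinated = population - vaccinated    # 100,000 people

hosp_vaccinated = int(vaccinated * 0.001)      # 900 hospitalized
hosp_unvaccinated = int(unvaccinated * 0.005)  # 500 hospitalized

# More vaccinated patients in hospital (900 > 500), even though each
# vaccinated person's risk is 5x lower: the vaccinated base is 9x larger.
print(hosp_vaccinated, hosp_unvaccinated)  # 900 500
```

The counts are dominated by the group sizes (the base rates), not by the per-person risks, which is exactly what stratified analysis controls for.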
Stratified analysis means analyzing each subgroup separately, controlling the confounding variable. When one person is in a small group with similar opinions, they tend to think that the general population has similar opinions. When they encounter a person who disagrees with them, they tend to think the disagreer is in the minority or is defective in some way. This effect is exacerbated by the algorithmic recommendation of social media. We also tend to think other people are similar to us in some ways. We learn from very few examples, and those few examples include ourselves. We don't see things as they are. We see things as we are. We use relations to efficiently query information in memory. The brain is good at looking up relations, in an automatic, unintentional and subconscious way. Being exposed to information makes humans recognize similar concepts more quickly. Examples: Being exposed to information also changes behavior and attitudes. Examples: The main moral of priming research is that our thoughts and our behavior are influenced, much more than we know or want, by the environment of the moment. - Think, fast and slow Note that the famous "age priming" effect (walking more slowly after being reminded of aging-related concepts) failed to replicate. The placebo effect is also possibly related to priming. Spontaneous trait transfer: listeners tend to associate what a talker says with the talker, even when the talker is talking about another person. Flattery subconsciously increases favorability, even when the listener knows it's flattery (this even applies to sycophantic AI). Harsh criticism subconsciously reduces favorability, even when the listener knows the criticism is beneficial. Placebos still work even when one knows they are placebos. When making decisions, humans tend to follow intuition, which is quick and energy-efficient, but less accurate. Thinking, Fast and Slow proposes that the human mind has two systems: Most thinking mainly uses System 1, unnoticed.
Under intense emotion, rationality (System 2) is overridden, making one more likely to make mistakes. Some examples: Being calm can "increase intelligence". When one is under intense emotion, logical argument often has little persuasive effect, and emotional connection is often more effective. People tend to choose the default and easiest option, partly due to laziness, partly due to fear of unknown risks. In software product design, the default options play a big role in how users use and feel about the software. Increasing the cost of some behavior greatly reduces the number of people doing it: Ask for no, don’t ask for yes. Status quo bias: the tendency to maintain the status quo. This is related to risk aversion, as change may cause risk. A related concept is omission bias: People treat the harm of doing something (commission) as greater than the harm of doing nothing (omission). Acting bears more responsibility. In the trolley problem, not doing anything reduces perceived responsibility. If there is an option to postpone some work, the work may eventually never be done. Action bias: in places where taking action is the norm, people prefer to do something rather than nothing, even when acting has no effect or negative effects. When being judged by others, people tend to act to show their value, productivity and impression of control: For high-liquidity assets (e.g. stocks), people tend toward impulsive trading when the market exhibits volatility. But for low-liquidity, harder-to-trade assets (e.g. real estate), people tend to hold through the volatility. Action bias does not contradict the default effect. When one is asked to work and show value, acting is the default behavior, and not acting is riskier, as people tend to question the one who does not look like they are working. It's not the things you buy and sell that make you money; it's the things you hold.
- Howard Marks Also, when under pressure, people tend to act in a hurry before thinking, which increases the chance of making mistakes. Law of least effort: people tend to choose the easiest way to do things, choosing the path of least resistance. 7 Some seemingly easy solutions do not address the root cause, having a negligible or negative effect in the long run. Applying the easy solution gives a false impression that the problem is being addressed, achieving mental comfort. This is related to means-end inversion. To achieve the root goal (end), we work on a sub-goal (means) that helps the root goal. But focusing on an easy but unimportant sub-goal may hurt the root goal by taking resources from hard but important sub-goals. A similar phenomenon occurs commonly in medicine: treatments usually mainly suppress superficial symptoms (e.g. painkillers) instead of curing the root cause of the illness. This is usually due to many other factors. People tend to spend much time making decisions on small things but very little time making decisions on big things (e.g. buying a house with a mortgage, making a big investment): Path dependence means sticking to what worked in the past and avoiding change, even when the paradigm has shifted and past successful decisions are no longer appropriate. I think people's thinking process is too bound by convention or analogy to prior experiences. It's rare that people try to think of something on a first principles basis. They'll say, "We'll do that because it's always been done that way." Or they'll not do it because "Well, nobody's ever done that, so it must not be good." But that's just a ridiculous way to think. You have to build up the reasoning from the ground up - "From the first principles" is the phrase that's used in physics. You look at the fundamentals and construct your reasoning from that, and then you see if you have a conclusion that works or doesn't work, and it may or may not be different from what people have done in the past.
- Elon Musk

Law of the instrument: "If the only tool you have is a hammer, it is tempting to treat everything as if it were a nail." One easy way to make decisions is to simply follow the people around us. This was beneficial in the ancient world: for example, if a tiger comes and some people start fleeing, following them is better than spending time recognizing the tiger. Social proof heuristic: assuming that surrounding people know the situation better, so following them is correct. Following the crowd is also a great way of reducing responsibility: when everyone is guilty, the law cannot punish everyone. The one who acts independently bears more responsibility (omission bias). People often fear acting independently. When many people follow each other, they confirm each other, creating a self-reinforcing feedback loop. This is also a reason for momentum in markets. People tend to be overconfident when people around them are confident, and vice versa. Two kinds of knowing: everyone privately knowing something, and common knowledge (everyone knows that everyone knows). The former may suddenly convert to the latter and unleash big power, especially when people can freely communicate information. In "The Emperor's New Clothes" story, "the king is clothless" is originally not common knowledge, even though everyone knows it. But once the child states the truth publicly, that knowledge becomes common knowledge. Price growth often depends on the "delta" of believers, not the existing believers. Veblen good: a higher price induces more demand, unlike a normal commodity. Measuring people's beliefs by observing the people around you is inaccurate, because the people near you don't necessarily represent all people (a non-representative sample). Herd mentality is in some sense a kind of trend-following strategy. If the trend is some genuinely good new technology, then following is good regardless of being early or late. However, for speculative financial assets, price growth depends on new people and money entering, so most people start following too late and cannot profit from it.
One similar effect, in-group bias: favoring investments or opinions from people within one's own group or those who share similar characteristics. Bystander effect: people are less likely to help a victim in the presence of other people. In an experiment on requesting to jump the queue for a copy machine, providing a non-reason like "because I have to make some copies" increased the acceptance rate almost as much as a real reason did. Mental accounting: treating different parts of money differently, based on their source or intended use. For example, one can separate the budgets for entertainment, housing and food. It's a simple heuristic that can avoid excessive spending: if no part overspends, then there is no overall overspending. Mental accounting is related to sunk cost and loss aversion. If one sub-account is low, people tend to be more frugal with that sub-account, making loss aversion more significant, and the previous waste in that sub-account becomes sunk cost. In investment, mental accounting can take different forms: Separating by time intervals. Setting a profit target for each time interval (by month, season or year) can be detrimental in a market with momentum. If the interval's profit target is met, stopping investing misses large profits from the trend. If the profit target is not met near the end of the interval, the trader tends to become more nervous and more aggressive, which is dangerous. However, setting a stop-loss per time interval may be good. When the current trading strategy does not fit the market, temporarily stopping could get one through the part of the cycle that temporarily doesn't suit the strategy. The stopping period also helps one calm down and become rational. Separating by specific assets (e.g. individual stocks). If the mental accounts are separated by stock, then after losing on one stock, one may insist on winning the loss back from that same stock, even if investing in other stocks is better overall.
Separating by categories of assets. People tend to prefer investing all their money in a medium-risk asset over investing part of their money in a high-risk asset (the barbell strategy), even when the total volatility and expected return are the same, because the invested money sits in a different mental account from the uninvested money, and because of risk aversion. The lipstick effect is related to mental accounting: when income declines, the mental account for luxury spending still exists, just shrunk, so cheaper lipsticks sell more. Mental accounting is one kind of narrow framing bias. Narrow framing bias: focusing too much on one aspect while neglecting other aspects. Zero-risk bias: preferring to eliminate one type of risk entirely rather than reducing overall risk (usually at the expense of increasing exposure to other risks). It's related to risk aversion and availability bias: risk aversion usually focuses on the risk that comes to mind and ignores other kinds of risks. It's also related to binary thinking: thinking that a risk is either completely eliminated or not acted on at all. The rational decision is to make a tradeoff that reduces overall risk. This should NOT be simplified to "avoiding risk is bad". The point is to not make extreme tradeoffs that eliminate one kind of risk while increasing exposure to other kinds of risk. People tend to avoid regret. Regret aversion has two aspects: The world is full of randomness. No decision is guaranteed to be optimal. We should accept that we cannot always make perfect decisions. Validating the strategy in the long run is more important than the results of individual decisions. We tend to regret doing something in the short term, but regret not doing something in the long term. Reference. 'I have led a toothless life', he thought. 'A toothless life. I have never bitten into anything. I was waiting. I was reserving myself for later on - and I have just noticed that my teeth have gone. ...'
- Jean-Paul Sartre

Some decisions are consequential and irreversible or nearly irreversible – one-way doors – and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation. If you walk through and don’t like what you see on the other side, you can’t get back to where you were before. We can call these Type 1 decisions. But most decisions aren’t like that – they are changeable, reversible – they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long. You can reopen the door and go back through. Type 2 decisions can and should be made quickly by high judgment individuals or small groups. As organizations get larger, there seems to be a tendency to use the heavy-weight Type 1 decision-making process on most decisions, including many Type 2 decisions. The end result of this is slowness, unthoughtful risk aversion, failure to experiment sufficiently, and consequently diminished invention.

- Jeff Bezos

The more non-trivial things you do, the more mistakes you will make. No one can avoid mistakes while doing many non-trivial things. However, company KPIs often put a large weight on punishing mistakes (loss aversion in KPI design). This causes veteran employees to learn to do as few non-trivial things as possible, as conservatively as possible. Having safety measures makes people feel safer and take more risks. For example, drivers may drive faster with a seat belt on, and cyclists may ride faster when wearing a helmet. People tend to be overconfident in familiar situations, but that's where accidents are likely to occur: Most accidents (69%) occurred on slopes that were very familiar to the victims. Fewer accidents occurred on slopes that were somewhat familiar (13%) and unfamiliar (18%) to the victim. - Evidence of heuristic traps in recreational avalanche accidents

"Fight or flight" are the two options for dealing with a physical threat (e.g.
a tiger) in the ancient world. But in the modern world, there are non-physical threats and modern risks (e.g. exam failure, losing a job). These modern threats can be dealt with by neither concrete fight nor flight, so they may cause depression, anxiety and immobilization. Related: Cortisol is a hormone correlated with stress. Cortisol has many effects, like making you more vigilant and less relaxed. If the cortisol level stays high for a long time, there will be health issues like weight gain, a weakened immune system, sleep deprivation, digestive issues, etc. From an evolutionary perspective, cortisol makes one more likely to survive physical threats (e.g. a tiger) at the expense of other aspects. Those physical threats are usually quick and short (e.g. either die or flee from the tiger). But modern risks are usually long and chronic (e.g. worrying about an exam for months beforehand, worrying about job loss every day during an economic recession), so the cortisol system is not adaptive to them. Rational activities (System 2 activities) require mental energy (willpower). Without enough mental energy, one is less likely to resist impulsive behaviors or think about hard problems, and may have difficulty in social interactions. 8 These factors affect mental energy: Mental rest is different from bodily rest. Intense thinking while lying in bed still consumes mental energy. Mental rest involves focusing on simple things with low cognitive demand. Before you try to increase your willpower, try to decrease the friction in your environment. - James Clear, Link

In the process of self-justification, people's memory may be distorted. Human memory is actually very unreliable. People usually cannot notice that their memory has been distorted, and insist that their memory is correct. People tend to simplify their memory and fill the gaps using their own beliefs.
This is also an information compression process, simultaneously producing wrong memories and biases. Memorizing is lossy compression. Recall is lossy decompression, where details can be made up in a congruent way. Each recall can reshape the memory according to existing beliefs. (This is similar to quantum effects: if you observe something, you change it.) I have a pet theory that when people introspect about themselves, their brain sometimes just scrambles to generate relevant content. So they feel like they're gaining insight into deeper parts of themselves when they're actually just inventing it on the fly. - Amanda Askell, Link

Information is costly to store, and even more costly to index and query. Sometimes forgetting is just not being able to query a specific memory that is stored in the brain (it may be recalled if some cue enables querying it). The brain's "querying capacity" is limited and can be occupied by distracting things. 10 Taking notes is one way to mitigate the unreliable-memory issue. Every time a message is relayed through a person, some of its information gets lost, and some noise gets added. The person relaying the message adds their own understanding (which can be misleading), and omits information that they think is unimportant (but may actually be important). This issue is very common in big corporations and governments. Good communication requires reducing middlemen. 11 People usually remember "special" things well. This is an information compression mechanism that filters out unimportant details. Peak-end rule: people judge an experience largely based on how they felt at its peak (its most intense point) and at its end. The most efficient way to improve user experience is to improve the experience at the peak and at the end. Serial position effect: people tend to recall the first and last items best, and the middle items worst.
Interestingly, the same effect also applies to LLMs, where it is called "lost in the middle". Cryptomnesia: treating other people's ideas as one's own original ideas, after forgetting the source of the idea. Sleeper effect: after being exposed to persuasion, some people initially don't agree, for some reason. But as time passes, they may forget the reasons why they initially disagreed, and may gradually come to agree. Persuasion that doesn't work immediately may still have long-term effects. People seek information that they are interested in. The seeking of interesting information drives both curiosity and information addiction. As with food, we spent most of our history deprived of information and craving it; now we have way too much of it to function and manage its entropy and toxicity. - N. N. Taleb, Link

Most information in the world is junk. The best way to think about it is it's like with food. There was a time, like centuries ago in many countries, where food was scarce, so people ate whatever they could get, especially if it was full of fat and sugar. And they thought that more food is always good. ... Then we reach a time of abundance in food. We have all this industrialized processed food, which is artificially full of fat and sugar and salt and whatever. We always thought that more food is always good. No, definitely not all this junk food. And the same thing has happened with information. Information was once scarce. So if you could get your hands on a book you would read it, because there was nothing else. And now information is abundant. We are flooded with information, and much of it is junk information, which is artificially full of greed, anger and fear, because of this battle for attention. It's not good for us. We basically need to go on an information diet. Again the first step is to realize that it's not the case that more information is always good for us. We need a limited amount. And we actually need more time to digest the information.
And then we have to be of course also careful about the quality of what we take in, because of the abundance of junk information. The basic misconception I think is this link between information and truth. People think, "OK, if I get a lot of information, this is the raw material of truth, and more information will mean more knowledge". That's not the case. Even in nature, more information is not about the truth. The basic function of information in history, and also in biology, is to connect. Information is connection. And when you look at history you see that, very often, the easiest way to connect people is not with the truth, because truth is a costly and rare kind of information. It's usually easier to connect people with fantasy, with fiction. Why? Because the truth tends to be not just costly; truth tends to be complicated, and it tends to be uncomfortable and sometimes painful. In politics, a politician who would tell people the whole truth about their nation is unlikely to win the elections. Every nation has these skeletons in the cupboard, all these dark sides and dark episodes that people don't want to be confronted with. If you want to connect nations, religions, political parties, you often do it with fiction and fantasies. - Yuval Noah Harari, Link

Information bias: seeking out more information even when more information is no longer useful. With confirmation bias, more information leads to higher confidence, but not better accuracy. This is contrary to statistics, where more samples lead to more accurate results (though still subject to systematic sampling bias). Having no information is better than having wrong information. Wrong information reinforced by confirmation bias can keep you stuck on a wrong path. The popularity of false information increases the value of true information. The "carrot problem": during WWII, the British claimed their pilots' night-vision success came from eating carrots, hiding their new radar technology.
The best way to hide something is to override it with another thing. Browsing social media makes people learn a biased distribution of the world. Such as: ... the primary function of conversation is not to communicate facts but to reinforce social ties. - Gurwinder, Link

The 80/20 rule also applies to social media: 80% of the voice comes from 20% of the users. The dominant narrative on the internet may not represent most people's views. What's more, social media may make people: Social media performs "natural selection" on memes. The recommendation algorithm makes the posts that induce more interactions (likes and arguing) more popular. It selects the memes that are good at getting humans to spread them. Which memes have a higher ability to spread? In the ancient world, when there was no algorithmic recommendation, there was still "natural selection" of memes (stories, cultures), just slower. Memes facilitate their own spread. On the contrary, antimemes resist being spread. Antimemes are more easily forgotten than other information. Some antimemes are worth reviewing periodically. People want attention from others. Some people try to gain attention by posting things on the internet. Attention is a psychological commodity which people value inherently. Producers who go viral produce 183% more posts per day for the subsequent month. - Paying attention by Karthik Srinivasan

Having attention from others is indeed useful: it increases exposure to possible allies, mates and opportunities. Giving randomized feedback (variable-ratio reinforcement) makes people more addicted to the behavior. A random outcome is usually more exciting than a known outcome. This is related to information addiction: randomized things give more information than deterministic things. Unfortunately, just knowing about cognitive biases is not enough to avoid and overcome them. A lot of cognitive biases originate from the biological basis of human cognitive function, which cannot be changed by knowledge alone.
Reference: G.I. Joe Phenomena: Understanding the Limits of Metacognitive Awareness on Debiasing. Note that cognitive biases are not necessarily negative things. They are tradeoffs: sometimes worse, sometimes better. Consider two financial trading strategies: In the long term, strategy 2 greatly outperforms strategy 1, but people prefer strategy 1, for many reasons: It's a common misconception that you need a win rate above 50% to be profitable. With a favorable risk-reward ratio, profit is possible despite a low win rate. Similarly, a 99% win rate doesn't necessarily imply profit in the long term. The skewness is important. Disposition effect: the tendency to sell winning positions too early and hold losing positions too long. The disposition effect works well in oscillating markets. However, markets can also exhibit momentum, where the disposition effect is detrimental. On the contrary, current deep learning technology is information-inefficient, as it requires tons of training data to get good results. Current (2025 Oct) LLMs have limited in-context learning ability, still suffer from context rot, and cannot do continuous learning. ↩ It has implications for AI: attempting (lossy) compression naturally leads to learning, which is the core mechanism of why unsupervised learning works. See also ↩ It's a common view that AI capability can improve exponentially, and that when AI is smart enough to keep improving itself, its intelligence will skyrocket into superintelligence, far above human. But it's highly possible that future AI will still be bottlenecked by 1. energy production 2. compute power 3. getting verification from the real world. These 3 factors limit how fast AI can take effect and self-improve. Especially the third limitation (Go and Lean math proving can be verified purely inside a computer, but other kinds of science, like chemistry and biology, are too complex to be fully simulated in a computer, so getting verification from the real world is a very important bottleneck.
Also, some AI researchers say they are bottlenecked by training speed rather than ideas. Deep learning has chaotic characteristics, so how fast AI experiments can be done is an important bottleneck in AI self-improvement.) There will probably be no dramatic "suddenly winning the AI race forever". ↩ I think there is a third way of reducing free energy: hallucination. Confirmation bias can be seen as a mild version of hallucination. Hallucination makes the brain "filter" some sensory signals and "fill the gap" with prediction. ↩ Related: modern deep learning also relies on an attention mechanism (the transformer). ↩ The weather is a non-linear chaotic system. Global warming can indeed make some regions' winters colder. ↩ Related: in physics, there is the principle of least action, but "action" there means a physical quantity, not the common meaning of "action". ↩ Long-term planning requires larger computation capacity. In reinforcement learning, if the model is small, it cannot learn to do long-term planning. Only when the model is big and has enough computation capacity does it start to sacrifice short-term reward for larger long-term reward. So, in some sense, not being able to control oneself is related to "lacking compute resources". Note that self-control is also affected by many other factors. ↩ Related: using GLP-1 drugs may make it harder to focus and pay attention due to reduced blood sugar levels and other factors. However, GLP-1 can improve brain fog related to inflammation. The overall effect is complex and not yet well understood. ↩ A similar principle also applies to computer databases. Just writing information into a log is easy and fast, but indexing the information to make it queryable is harder. ↩ PDF is harder to edit than other document formats. This can be a good trait when you don't want information relayers to modify your document. (Note that it's not a perfect solution. Others can still ask AI to transcribe screenshots, take text snippets and form new documents.
It only increases the difficulty of modifying.) ↩ In perception, recent time is "stretched": recent events are recalled as earlier than they actually occurred (backward telescoping). Distant past time is "compressed": events in the distant past are recalled as more recent than they actually were (forward telescoping). People overestimate the correctness and rationality of their beliefs. Dunning-Kruger effect: overestimating one's capability when low in capability, and underestimating it when high in capability. (Low-capability people tend to criticize other people's work even though they cannot do the work themselves.) Restraint bias: overestimating one's ability to control emotion, control impulsive behaviors and resist addiction. False uniqueness: we tend to think that we have special talents and special virtues. Hindsight bias: overconfidence in understanding history and in the ability to predict. Bias blind spot: people find it hard to recognize their own biases. An expert in one domain tends to think they are generally intelligent in all domains. (Some intellectual experts don't know they are susceptible to psychological manipulation.) Being confident helps persuade others, increasing social impact. Self-fulfilling prophecy: sometimes having confidence makes people perform better and makes others collaborate more, which then produces good results (note that the power of mere confidence is limited by physical conditions). The Federal Reserve increases the interest rate. Bearish: it tightens the money supply. Bullish: it's a sign of a strong economy. A company reports great profit. Bearish: that great profit was anticipated and priced in; the potential is being exhausted. Bullish: the company is growing fast. A large company buys a startup at a high price. Bearish: the large company is trapped in bureaucracy; it cannot compete with the startup despite having more resources. Bullish: the startup's business will synergize with the large company's; it's a strategic move.
Seeing their prediction as "almost" correct. Distorting the memory and changing the past prediction. Blaming prediction failure on outside factors, e.g. the statistical data being manipulated, conspiracy theories. Blaming it on bad luck, since the Black Swan event is low-probability. (Black Swan events are rare, but you are still likely to encounter multiple Black Swan events in life.) Attributing one's own success to one's own characteristics (capability, virtue, etc.). Attributing one's own failure to external factors (luck, situation, etc.). Attributing other people's success to external factors. Attributing other people's failure to their characteristics. Playing videogames instead of studying before an exam. Procrastination. Reducing the time spent finishing the task. Refusing help. Refusing medical treatment. Drinking alcohol and using drugs. Choosing difficult conditions and methods. Being disallowed to play videogames makes videogames more fun to play. Being forced to learn makes one dislike learning. People tend to gain more interest in information banned by the government. When love is opposed by the parents, the love strengthens. Restricting buying something makes people buy it more eagerly. Same for restricting selling. People feel like they have plenty of time, so they procrastinate. People tend not to value the present because "life is permanent". People focus too much on small problems. People tend to keep their beliefs stable (being stubborn). People tend to avoid conflicting beliefs (cognitive dissonance). People tend to justify their previous behavior. Behavior can shape attitudes. People have a tendency to persuade others of their beliefs (meme spread). "There is [a secret evil group] that controls everything. You don't see evidence of its existence because it's so powerful that it hides all evidence." "The AI doesn't work on your task just because you prompted it wrongly." (without telling how to "prompt correctly".)
"You believe in [an idea] because you get a sense of moral superiority from it." An environmental activist may justify other environmental activists' illegal behaviors, because they are deemed to be in the same group. A middle-class person tends to believe that "the poor are lazy" and "the wealthy work harder". Continuing to watch a bad movie because you paid for it and already spent time watching it. Keeping an unfulfilling relationship because of past commitments. Persistent people keep their original root goal. They are happy to make corrections to the exact methods for achieving the root goal, and can accept the failure of sub-goals. Obstinate people keep both the root goal and the exact method to achieve it. Suggesting that they change the exact method is seen as offending their self-esteem. Not wanting to diagnose a health problem. Being reluctant to check the account after an investment has failed. If one tries to deceive others without internally believing the lie, the brain needs to process two pieces of conflicting information, which takes more effort and is slower. When one knows one is telling a lie, one may be unable to control the nervousness, which can show in ways like heart rate, blushing, body movement, etc. Deceiving oneself before deceiving others avoids these nervousness signals. Seeing a few rude people in one city, then concluding that "people from that city are rude". People who have only lived in one country think that some societal issue is specific to the country they are in. In fact, most societal issues apply to most countries. Illusion of control: a gambler may have the illusion that their behavior can control random outcomes after seeing occasional coincidences. Making different choices can increase exploration and help discover new things. Always making the same decision reduces exploration. In the real world, the distribution may change, and the highest-probability choice may change.
Always choosing the same option can be risky, especially when an opponent can learn your behavior. In the real world, many things have patterns, so pattern-seeking may be useful. In the real world, "good" is often multi-dimensional. Overly optimizing for one aspect often hurts other aspects. Not choosing the seemingly optimal option may have hidden benefits. A caused B. B caused A. Another factor, C, caused both A and B (confounding variable). A self-reinforcing feedback loop: A reinforces B, B reinforces A, and an initial random divergence gets amplified. A selection mechanism that favors the combination of A and B (survivorship bias). More complex interactions. The sampling or analysis is biased. Children wearing larger shoes have better reading skills: both are driven by age. Just wearing larger shoes won't make a kid smarter. Countries with more TVs had longer life expectancy: both are driven by economic conditions. Just buying a TV won't make you live longer. Ice cream sales increase at the same time drowning incidents increase: both are driven by summer. People tend to make known facts seem reasonable, by finding reasons or making up reasons. This can be seen as an information compression mechanism (reasonable facts are easier to remember). People prefer simpler understandings of the world. This is also information compression. It includes causal simplification and binary thinking. People tend to believe concrete things and stories rather than abstract statistics. This is related to the anecdotal fallacy. Knowing that LLMs have a "temperature", so thinking an LLM is a heat-based algorithm. Knowing that LLMs have "tokens", so thinking an LLM is a Web3 crypto thing. "How can you be against the Patriot Act/Safety Bill/Responsible Disclosure? Do you hate your country/want to kill people/want to be irresponsible?" Link Has quick feedback. Has concrete visual and audio feedback, instead of abstract feedback. Is self-directed rather than forced. Oversimplification: "Poor people are poor because they are lazy."
Other related factors: education access, systemic discrimination, health disparities, job market conditions, the Matthew effect, etc. Oversimplification: "Immigrants are the cause of unemployment." Other related factors: manufacturing relocation, automation technologies, economic cycles, skill mismatches, overall labor market conditions, etc. Oversimplification: "The Great Depression happened because of the stock market crash of 1929." Other related factors: excessive financial risk-taking, lack of regulatory oversight, production overcapacity, wealth inequality, international economic imbalances, etc. Oversimplification: "That company succeeded because of the CEO." Other related factors: market conditions, employee contributions, incentive structures, company culture, government-business relationships, competitive landscape, the cumulative impact of past leadership decisions, etc. People hope that a "secret advanced weapon" can reverse a systematic disadvantage in war. This almost never happens in the real world. Hoping that a secret recipe or a secret technology alone can bring success. Coca-Cola succeeds not just because of its "secret recipe": the branding, global production system and logistics network are also important. Modern technologies are complex and have many dependencies. You cannot simply copy "one key technology" and get the same result. Even just imitating an existing technology often requires a whole infrastructure, many talented people and years of work. 3 "That person is a good person." / "That person is a bad person." "You're either with us or against us.", "Anything less than absolute loyalty is absolute disloyalty." "Bitcoin is the future." / "Bitcoin is a scam." "This asset is completely safe." / "This bubble is going to collapse tomorrow." FOMO (fear of missing out) / risk aversion. "No one understands it better than me." / "I don't understand even a tiny bit of it." "It's very easy to do." / "It's impossible." The idol maintains a perfect image.
/ The image collapses, and the true nature is exposed. "We will win quickly." / "We will lose quickly." "I can do it perfectly." / "I cannot do it perfectly, so I will fail." "[X] is the best thing and everyone should use it." / "[X] has this drawback, so it's not only useless but also harmful." "The market is always fully efficient." / "The market is never efficient." Not admitting that tradeoffs exist. A: "We should increase investment in renewable energy." B: "You want to ban oil, gas, and coal, removing millions of jobs and crashing the economy?" A: "The history curriculum should include more perspectives to present a more objective and nuanced view of our nation." B: "So you want to rewrite history to make our children hate their own country?" A: "We should implement stricter gun control." B: "It's useless, because no matter how strict it is, criminals will always find a way to get guns illegally." (perfect solution fallacy) A person falling in love thinks the partner is flawless. Thinking that a beautiful/handsome person is also more intelligent and kind. A person who likes one Apple product thinks that all designs of all Apple products are correct and superior. When one likes one opinion of a political candidate, one tends to ignore the candidate's shortcomings. People may idealize their partner, until living with the partner for some time. "The grass is greener on the other side" (greener grass syndrome): assuming that another career/lifestyle/country (that you are not familiar with) is better than the current one. Long after bearing a child, women tend to forget the pain of childbirth and may want another child. Decades after the collapse of the Soviet Union, some people remember more of its good aspects. Surprise is the difference between perception and prediction. Change of belief is how much belief changes to improve prediction. Passive: change the belief (understanding of the world).
Active: Use action (change the environment, move to another environment, etc.) to make the perception better match the prediction. 4 Most gamblers are initially lucky, because the unlucky ones tend to quit gambling early. Assume that many fund managers randomly pick stocks. After one year, some of the lucky ones have good performance, while the others are overlooked. In the short term, you cannot know whether success comes from just luck. "Taleb's rat health club": Feeding poison to rats increases average health, because the unhealthy ones are more likely to die from the poison. Social media has more negative news than positive news. Bad news travels fast. Successful research results are published while failed attempts are hidden (publication bias, p-hacking). Only special and interesting events appear on the news. The more representative, common but not newsworthy events are overlooked. "Someone smoked their entire life and lived until 97, so smoking is actually not that bad." "Someone never went to college and turned out to be successful, so college is a waste of time and money." "Someone made a fortune trading cryptocurrency, and so can I." "It was the coldest winter on record in my town this year. Global warming can't be real." 6 Anything that is in the world when you're born is normal and ordinary and is just a natural part of the way the world works. Anything that's invented between when you're 15 and 35 is new and exciting and revolutionary and you can probably get a career in it. Anything invented after you're 35 is against the natural order of things. Deep learning is very different from how the human brain works. Jagged intelligence. See also: LLMs are good at many things that are hard for humans. An LLM's knowledge is larger than any individual human's. LLMs are bad at many things that are easy for humans. LLMs don't have intrinsic opinions as humans do. When asked a loaded question, an LLM tends to follow the assumption in the question to please the user.
Directly asking a question tends to get a different answer than asking the same question in context (with other chat history). When tossing a coin, if heads appear frequently, people tend to think tails will appear frequently afterward. (If the coin is fair and tosses are statistically independent, this is false. If the coin is biased, it's also false: frequent heads suggest a bias toward heads, if anything.) When a stock goes down for a long time, people tend to think it is more likely to rise. Gambler's fallacy: if past samples deviate from the mean, assume the distribution of future samples changes to "compensate" for the deviation. This is wrong when the distribution doesn't change. Regression to the mean: if the last sample is far from the mean, the next sample will likely be closer to the mean than the last sample. It compares the next sample with the last sample, not the future mean with the past mean. A: "The company will achieve higher-than-expected earnings next quarter." B: "The company will launch a successful new product, and will achieve higher-than-expected earnings next quarter." Overestimating the short-term effect of a recent event, and underestimating the long-term effect of an old event. "90% of people survive this surgery" / "10% of people die from this surgery". "This ground beef is 80% lean" / "This ground beef is 20% fat". "Save $20 by buying now!" / "You'll lose the $20 discount if you wait". "99.9% effective against germs" / "Fails to kill 0.1% of germs". "Collateral damage" / "Death". "Gun control" / "Gun safety". "Government subsidy" / "Using taxpayer money". "Risk measurement" / "Risk forecast". Blaming the journalist for exposing the bad things in society. Refusing medical treatment, because medical treatment is a reminder of illness and shows weakness. In corporations, the responsibility for solving a problem usually belongs to the one raising the problem, not the one creating the problem. When an online learning material is always available, people have no pressure to learn and often just bookmark it.
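The distinction drawn earlier between the gambler's fallacy and regression to the mean can be checked with a short simulation (a minimal sketch with made-up parameters; the streak length and thresholds are illustrative choices, not from the original text):

```python
import random

random.seed(42)

# Gambler's fallacy check: on a fair coin, after 5 heads in a row,
# the next flip is still heads about 50% of the time (no "compensation").
flips = [random.random() < 0.5 for _ in range(1_000_000)]
streak_next = [flips[i + 5] for i in range(len(flips) - 5)
               if all(flips[i:i + 5])]
p_heads = sum(streak_next) / len(streak_next)
print(f"P(heads | 5 heads in a row) = {p_heads:.3f}")  # close to 0.5

# Regression to the mean: when one sample of a normal variable lands
# far from the mean, the next independent sample is usually closer,
# even though the distribution itself never changes.
samples = [random.gauss(0, 1) for _ in range(100_000)]
pairs = [(a, b) for a, b in zip(samples, samples[1:]) if abs(a) > 2]
closer = sum(abs(b) < abs(a) for a, b in pairs) / len(pairs)
print(f"P(next sample closer to mean | last beyond 2 sigma) = {closer:.3f}")
```

The first number stays near 0.5 (streaks don't get "compensated"), while the second is far above 0.5: extreme samples are usually followed by less extreme ones, with no change in the underlying distribution.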
A thing that's sold in a time-limited or quantity-limited way is deemed to be valuable. Restricting the purchase of something makes people buy it more eagerly, even when they don't need it. Similarly, restricting some information may increase people's perceived value of that information. In the COVID-19 pandemic, a developed country had a higher overall fatality rate than a developing country, but in each age group, the developed country's fatality rate was lower. The developed country had a larger proportion of elderly population. After improving a product, the overall customer satisfaction score may decrease, because the product gets popular and attracts customers that the product doesn't fit, even though the original customers' satisfaction scores increase. You post on the internet something that 90% of people like and 1% of people hate. The people liking the post usually don't direct-message you, but the people hating it often have strong motivation to direct-message you. So your direct messages may contain more haters than likers, even though most people like your post. A reminder of "yellow" makes recognizing "banana" faster. A reminder of "dog" makes recognizing "cat" faster. Being more likely to interpret things as danger signals after watching a horror movie. Red in food packaging increases people's intention to buy it. Being familiar with a brand after being exposed to its ads, even after trying to ignore the ads. Sleeper effect: after being exposed to persuasion, people who don't initially agree may gradually come to agree as time passes. If you praise another person, the listeners tend to subconsciously think that you are also good. If you say something bad about another person, the listeners tend to subconsciously think you are also bad. Often, quickly making a decision before having complete information is better than waiting for a complete investigation. Sometimes multiple decisions can each fulfill the goal. What matters is acting quickly, not which decision is optimal.
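The COVID-19 fatality example and the satisfaction-score example earlier are both instances of Simpson's paradox, which can be made concrete with numbers (the case and death counts below are hypothetical, chosen only to exhibit the reversal):

```python
# Hypothetical numbers (not real data) illustrating Simpson's paradox:
# the developed country has a LOWER fatality rate in every age group,
# yet a HIGHER overall rate, because more of its cases are elderly.
#                       (cases, deaths)
developed  = {"young": (1_000, 1),  "old": (9_000, 450)}
developing = {"young": (9_000, 45), "old": (1_000, 80)}

def rate(cases, deaths):
    return deaths / cases

for group in ("young", "old"):
    d, g = rate(*developed[group]), rate(*developing[group])
    print(f"{group}: developed {d:.1%} vs developing {g:.1%}")
    # developed is better within each group: 0.1% vs 0.5%, 5.0% vs 8.0%

overall_dev = rate(sum(c for c, _ in developed.values()),
                   sum(d for _, d in developed.values()))
overall_ing = rate(sum(c for c, _ in developing.values()),
                   sum(d for _, d in developing.values()))
print(f"overall: developed {overall_dev:.2%} vs developing {overall_ing:.2%}")
# overall the ranking flips: 4.51% vs 1.25%, because 90% of the
# developed country's cases fall in the high-fatality "old" group
```

The direction of the comparison depends entirely on how cases are distributed across groups, which is why per-group and overall statistics can disagree.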
System 1 thinks by intuition and heuristics, which is fast and efficient, but inaccurate and biased. System 2 thinks by rational logical reasoning, which is slower and requires more effort, but is more accurate. When being criticized, the more eagerly you try to prove yourself correct, the more mistakes you may make. A trader experiencing losses tends to trade more irrationally and lose more money. If a software feature requires manually enabling it, far fewer users will know about and use that feature. Just 1 second of extra page load time may reduce user conversion by 30% (source). Each setup procedure will frustrate a portion of users, making them give up installing the software. Why I’m Done Making Desktop Applications. A good product requires minimal configuration to start working. A personal doctor may prescribe useless medications to show they are working. ( Antifragile argues that useless medications are potentially harmful. It's naïve interventionism.) Politicians tend to take political action to show that they are working on an affair. These policies usually help the problem superficially but don't address the root cause, and may exacerbate the problem. One example is subsidizing house buyers, which pushes housing prices higher, instead of building more houses. Financial analysts tend to give a definitive result even when they know there isn't enough sound evidence. Focusing on buying exercise equipment instead of exercising. Paying for a gym without going to the gym. Buying supplements instead of adopting a healthier lifestyle. Focusing on buying courses, books and study equipment instead of actually studying. Continually bookmarking online learning materials instead of reading them. Musicians focusing on buying instruments (gear acquisition syndrome). When writing, focusing on the formatting instead of the content. A manager pushing employees to seemingly work hard instead of improving efficiency.
A parent training a child with harsh punishment, instead of using scientific parenting methods. Bikeshedding effect: during meetings, people spend most of the time talking about trivial matters. Staying in the comfort zone: only learning/practicing familiar things and avoiding unfamiliar things. Avoiding unpleasant information when learning. Only caring about the visible numbers (KPI, OKR), and ignoring the important things behind the numbers, like perverse incentives caused by the KPI, statistical bias, and the validity of interpretations of the numbers. Streetlight effect: only searching in the places that are easy to search, not the places the target is actually in. Hiding the signal of an error instead of diagnosing and solving the error. As a big decision is important, people tend to be nervous when thinking about it. Thinking about big decisions is tiresome, as the future is uncertain and there are many factors to analyze. So people tend to procrastinate to avoid the unpleasant feeling of thinking about big decisions, or simply follow others (herd mentality). Small decisions (e.g. choosing an item in a shop, choosing a restaurant) require less mental effort and cause less nervousness. Thinking about these decisions can give a feeling of control. These decisions usually have quick feedback (humans crave quick feedback). I know something, but I am not sure other people also know it. Other people may also not be sure that I know it. There is no consensus even if everyone thinks the same. This is pluralistic ignorance . It happens when something prevents communicating the information. I know something. I also know that other people know it. I also know that other people know that I know it. This is common knowledge . This is the kind of knowledge that drives herd mentality. Separate by different time intervals. Setting a profit target for each time interval (by month, quarter or year) can be detrimental in a market with momentum.
If the profit target for the time interval is met, stopping investing misses the large profit from the trend. If the profit target is not met near the end of the interval, the trader tends to become more nervous and more aggressive, which is dangerous. However, setting a stop-loss for each time interval may be good. When the current trading strategy does not fit the market, temporarily stopping could get one through the part of the cycle that temporarily doesn't suit the strategy. The stopping period also helps one calm down and become rational. Separate by different specific assets (e.g. stocks). If the mental accounts are separated based on different stocks, after losing from one stock, one may insist on winning the loss back from the same stock, even if investing in other stocks is better overall. Separate by different categories of assets. People tend to prefer investing all their money in a medium-risk asset over investing part of their money in a high-risk asset (barbell strategy), even when the total volatility and expected return are the same, because the invested money is in a different mental account than the not-invested money, and because of risk aversion. Choosing to pay off a small debt completely, rather than paying off the high-interest-rate debt first, because eliminating one debt feels more satisfying and reduces memory pressure. Enforcing extreme lockdown to eliminate the risk of a pandemic, causing more risk from other diseases (because hospitals are locked down) and more risk in basic living (the food supply is constrained due to the extreme lockdown). Wanting to hedge inflation by heavily investing in risky assets, whose risk can be higher than inflation. In a liquidity crisis, cash is more valuable than assets. For the future: people tend to avoid making decisions that may cause regret in the future. This is related to risk aversion: not making the optimal decision is also a kind of risk.
For the past: people tend to avoid regretting their past actions, trying to prove the correctness of those actions, and thus fall into the sunk cost fallacy . Resisting impulsive behavior consumes willpower (e.g. resisting sweet food when on a diet). Paying attention and thinking about hard problems consume willpower. For introverts, social interaction consumes willpower; for extroverts, staying alone consumes willpower. Sleep and mental rest can replenish mental energy. Body conditions (like blood sugar level 9 ) affect mental energy. Exercising self-control can strengthen mental energy, similar to muscular strength. Lingering emotion (e.g. ruminating on past mistakes) costs mental energy. Overestimating the number of perfect partners , who are beautiful/handsome, have high income and express exaggerated love. Believing in a false consensus , a consensus that only exists in an internet community. Overestimating the proportion of bad news, as bad news travels fast on social media, thus facilitating cynicism. Getting used to interesting, easy-to-digest information and becoming less tolerant of not-so-interesting, hard-to-digest information. Getting used to moving attention (distraction) and not getting used to keeping attention. On social media, different posts are usually unrelated, and understanding them requires moving attention (forgetting the previous context). Having less intention of trying things via real-world practice. Watching videos about a new experience is much easier than experiencing it in real life. Induces anger. (Seeing a popular post that you dislike on social media.) Induces the satisfaction of superiority. Expresses existing thoughts, utilizing confirmation bias. Simple and easy to understand. Looks convincing and reasonable, utilizing the narrative fallacy. Exaggerated and polarized, utilizing binary thinking. Provides interesting new information, utilizing information addiction. Antimemes are usually long, complex and nuanced, reflecting real-world complexity and being hard to grasp.
Antimemes usually don't spur much emotion. Antimemes are usually boring and "obvious" (hindsight bias). Information that conflicts with existing beliefs is also an antimeme (confirmation bias). PvP gaming (every round has randomly different opponents and outcomes). Browsing social media (random posts). Gacha games (pulling is random). Strategy 1 has frequent small gains but rare huge losses (suffering from negative Black Swan events). Strategy 2 has frequent small losses but rare huge gains (utilizing positive Black Swan events). The first strategy has a better Sharpe ratio as long as the rare Black Swan doesn't come. The second strategy has a lower Sharpe ratio because of its high volatility (although the volatility has positive skewness). Moral hazard : in some places, money managers can take a share of the profit but are only slightly punished when a huge asset loss happens (no skin in the game). This incentive structure allows them to use Strategy 1 while transferring the tail risk to the asset owner. The previously mentioned cognitive biases: Convex perception. Frequent small gains feel better than a rare huge gain, and frequent small losses feel worse than a rare huge loss. Loss aversion. Loss aversion focuses more on the recent visible loss than on the potential rare large loss. Availability bias and outcome bias. The frequent small losses are more visible than the rare potential big loss. Delayed feedback. The rare loss in Strategy 1 usually comes late. Oddball effect. Time spent experiencing loss feels longer. Investors tend to sell assets that increased in value (making an uncertain profit certain). Investors tend not to sell assets that dropped in value, hoping they will rebound (preferring to keep hope rather than making the loss certain). What's more, increasing the position amortizes the loss rate, which creates an illusion that the loss is reduced.
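The two Black Swan strategies described earlier can be sketched with a simulation (the payoff sizes and the 1% tail probability are made-up numbers for illustration). Before any tail event arrives, Strategy 1's observed returns are all small gains, so it looks riskless and its measured Sharpe ratio is excellent; over the full distribution the picture reverses:

```python
import random
import statistics

random.seed(7)

# Hypothetical payoff distributions (illustrative numbers only):
def strategy1():  # frequent small gains, rare huge loss (negative Black Swan)
    return -200.0 if random.random() < 0.01 else 1.0

def strategy2():  # frequent small losses, rare huge gain (positive Black Swan)
    return 150.0 if random.random() < 0.01 else -1.0

def sharpe(returns):
    # Mean return per unit of volatility (risk-free rate assumed 0).
    return statistics.mean(returns) / statistics.pstdev(returns)

long1 = [strategy1() for _ in range(200_000)]
long2 = [strategy2() for _ in range(200_000)]
print(f"strategy 1 long-run mean per period: {statistics.mean(long1):+.3f}")
print(f"strategy 2 long-run mean per period: {statistics.mean(long2):+.3f}")
print(f"strategy 1 full-distribution Sharpe: {sharpe(long1):+.3f}")
print(f"strategy 2 full-distribution Sharpe: {sharpe(long2):+.3f}")
```

With these numbers, Strategy 1's expected return is 0.99·1 − 0.01·200 ≈ −1.01 per period and Strategy 2's is 0.99·(−1) + 0.01·150 ≈ +0.51, yet any sample window that happens to miss the 1% tail makes Strategy 1 look strictly better, which is exactly what the moral-hazard incentive exploits.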
The Black Swan Social Psychology (by David G. Myers) Thinking, Fast and Slow The Elephant in the Brain On the contrary, current deep learning technology is information-inefficient, as it requires tons of training data to get good results. Current (2025 Oct) LLMs have limited in-context learning ability, still suffer from context rot and cannot do continuous learning. ↩ It has implications in AI: attempting (lossy) compression will naturally lead to learning, which is the core mechanism of why unsupervised learning works. See also ↩ It's a common view that AI capability can improve exponentially, and that when AI is smart enough to keep improving itself, its intelligence will skyrocket into superintelligence, far above human. But it's highly possible that future AI will still be bottlenecked by 1. energy production 2. compute power 3. getting verification from the real world. These 3 factors limit how fast AI can take effect and self-improve. The third limitation matters especially (Go and Lean math proving can be verified purely inside a computer, but other kinds of science, like chemistry and biology, are too complex to be fully simulated in a computer, so getting verification from the real world is a very important bottleneck. Also, some AI researchers say they are bottlenecked by training speed rather than ideas. Deep learning has chaotic characteristics, so how fast AI experiments can be done is an important bottleneck in AI self-improvement.). There will probably be no dramatic "suddenly winning the AI race forever". ↩ I think there is a third way of reducing free energy: hallucination. Confirmation bias can be seen as a mild version of hallucination. Hallucination makes the brain "filter" some sensory signals and "fill the gap" with prediction. ↩ Related: modern deep learning also relies on the attention mechanism (transformer). ↩ The weather is a non-linear chaotic system. Global warming can indeed make some regions' winters colder.
↩ Related: In physics, there is the principle of least action, but the "action" there means a physical quantity, not the common meaning of "action". ↩ Long-term planning requires larger computation capacity. In reinforcement learning, if the model is small, it cannot learn to do long-term planning. Only when the model is big and has enough computation capacity does it start to sacrifice short-term reward for larger long-term reward. So, in some sense, not being able to control oneself is related to "lacking compute resources". Note that self-control is also affected by many other factors. ↩ Related: Using GLP-1 may make it harder to focus and pay attention due to reduced blood sugar level and other factors. However, GLP-1 can improve brain fog related to inflammation. The overall effect is complex and not yet well understood. ↩ A similar principle also applies to computer databases. Just writing information into a log is easy and fast, but indexing the information to make it queryable is harder. ↩ PDF is harder to edit than other document formats. This can be a good trait when you don't want information relayers to modify your document. (Note that it's not a perfect solution. Others can still ask AI to transcribe screenshots, take text snippets and form new documents. It only increases the difficulty of modifying.) ↩
