
The LLM Context Tax: Best Tips for Tax Avoidance

Every token you send to an LLM costs money. Every token increases latency. And past a certain point, every additional token makes your agent dumber. This is the triple penalty of context bloat: higher costs, slower responses, and degraded performance through context rot, where the agent gets lost in its own accumulated noise.

Context engineering matters more than most teams realize. The difference between a $0.50 query and a $5.00 query is often just how thoughtfully you manage context. Here's what I'll cover:

- Stable Prefixes for KV Cache Hits - The single most important optimization for production agents
- Append-Only Context - Why mutating context destroys your cache hit rate
- Store Tool Outputs in the Filesystem - Cursor's approach to avoiding context bloat
- Design Precise Tools - How smart tool design reduces token consumption by 10x
- Clean Your Data First (Maximize Your Deductions) - Strip the garbage before it enters context
- Delegate to Cheaper Subagents (Offshore to Tax Havens) - Route token-heavy operations to smaller models
- Reusable Templates Over Regeneration (Standard Deductions) - Stop regenerating the same code
- The Lost-in-the-Middle Problem - Strategic placement of critical information
- Server-Side Compaction (Depreciation) - Let the API handle context decay automatically
- Output Token Budgeting (Withholding Tax) - The most expensive tokens are the ones you generate
- The 200K Pricing Cliff (The Tax Bracket) - Crossing it doubles your bill overnight
- Parallel Tool Calls (Filing Jointly) - Fewer round trips, less context accumulation
- Application-Level Response Caching (Tax-Exempt Status) - The cheapest token is the one you never send

With Claude Opus 4.6, the math is brutal: there is a 10x price difference between cached and uncached input tokens, and output tokens cost 5x more than uncached inputs. Most agent builders focus on prompt engineering while hemorrhaging money on context inefficiency.
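To make the triple penalty concrete, here's a back-of-envelope sketch of the cost math. The 10x cached discount and the 5x output multiplier follow the ratios above; the absolute dollar-per-million figures and the run shape (tool calls, tokens per call) are illustrative assumptions, not published prices.

```python
# Back-of-envelope cost model for an agent run. The per-million-token
# base price is an illustrative assumption; the 10x cached discount and
# the 5x output multiplier follow the ratios cited in this post.
INPUT_PER_M = 5.00                # $ per million uncached input tokens (assumed)
CACHED_PER_M = INPUT_PER_M / 10   # cached input reads: 10x cheaper
OUTPUT_PER_M = INPUT_PER_M * 5    # output tokens: 5x uncached input

def run_cost(tool_calls, ctx_per_call, out_per_call, cache_hit_rate):
    """Estimate the cost of an agent run in dollars.

    Each round trip re-sends the accumulated context; a fraction of
    those input tokens is served from cache at the discounted rate.
    """
    cost = 0.0
    for step in range(1, tool_calls + 1):
        ctx = step * ctx_per_call            # context grows every step
        cached = ctx * cache_hit_rate
        uncached = ctx - cached
        cost += (cached * CACHED_PER_M
                 + uncached * INPUT_PER_M
                 + out_per_call * OUTPUT_PER_M) / 1_000_000
    return cost

# 50 tool calls, 2K new context tokens per call, 300 output tokens each
naive = run_cost(50, 2_000, 300, cache_hit_rate=0.0)
tuned = run_cost(50, 2_000, 300, cache_hit_rate=0.9)
print(f"no caching: ${naive:.2f}, 90% cache hits: ${tuned:.2f}")
```

Because context re-sending dominates the bill, the cache hit rate swings the total by several multiples; output tokens are a rounding error by comparison in this run shape.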
In most agent workflows, context grows substantially with each step while outputs remain compact. This makes input token optimization critical: a typical agent task might involve 50 tool calls, each accumulating context. The performance penalty is equally severe: research shows sharp degradation in most models past 32K tokens. Your agent isn't just getting expensive. It's getting confused.

Stable Prefixes for KV Cache Hits

This is the single most important metric for production agents: KV cache hit rate. The Manus team considers this the most important optimization for their agent infrastructure, and I agree completely. The principle is simple: LLMs process prompts autoregressively, token by token. If your prompt starts identically to a previous request, the model can reuse cached key-value computations for that prefix.

The killer of cache hit rates? Timestamps. A common mistake is including a timestamp at the beginning of the system prompt. It's a simple mistake, but the impact is massive. The key is granularity: including the date is fine. Including the hour is acceptable, since cache durations are typically 5 minutes (Anthropic default) to 10 minutes (OpenAI default), with longer options available. But never include seconds or milliseconds. A timestamp precise to the second guarantees every single request has a unique prefix. Zero cache hits. Maximum cost.

Move all dynamic content (including timestamps) to the END of your prompt. System instructions, tool definitions, few-shot examples: all of these should come first and remain identical across requests. For distributed systems, ensure consistent request routing. Use session IDs to route requests to the same worker, maximizing the chance of hitting warm caches.

Append-Only Context

Context should be append-only. Any modification to earlier content invalidates the KV cache from that point forward. This seems obvious, but the violations are subtle. The tool definition problem is particularly insidious: if you dynamically add or remove tools based on context, you invalidate the cache for everything after the tool definitions. Manus solved this elegantly: instead of removing tools, they mask token logits during decoding to constrain which actions the model can select. The tool definitions stay constant (cache preserved), but the model is guided toward valid choices through output constraints. For simpler implementations, keep your tool definitions static and handle invalid tool calls gracefully in your orchestration layer.

Deterministic serialization matters too. Python dicts don't guarantee order. If you're serializing tool definitions or context as JSON, use sort_keys=True or a library that guarantees deterministic output. A different key order means different tokens, which means a cache miss.

Store Tool Outputs in the Filesystem

Cursor's approach to context management changed how I think about agent architecture. Instead of stuffing tool outputs into the conversation, write them to files. In their A/B testing, this reduced total agent tokens by 46.9% for runs using MCP tools. The insight: agents don't need complete information upfront. They need the ability to access information on demand. Files are the perfect abstraction for this. We apply this pattern everywhere:

- Shell command outputs: write to files, let the agent tail or grep as needed
- Search results: return file paths, not full document contents
- API responses: store raw responses, let the agent extract what matters
- Intermediate computations: persist to disk, reference by path

When context windows fill up, Cursor triggers a summarization step but exposes chat history as files. The agent can search through past conversations to recover details lost in the lossy compression. Clever.

Design Precise Tools

A vague tool returns everything. A precise tool returns exactly what the agent needs. Consider an email search tool built on the two-phase pattern: search returns metadata, and a separate tool returns full content. The agent decides which items deserve full retrieval.
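The two-phase pattern above can be sketched in a few lines. Everything here is a hypothetical illustration (the `search_emails`/`read_email` names, the in-memory mailbox), not a real API:

```python
# Two-phase tool pattern: phase 1 returns compact metadata, phase 2
# returns full content only for the items the agent asks about.
# The mailbox and tool names are hypothetical illustrations.
MAILBOX = {
    "m1": {"sender": "cfo@example.com", "subject": "Q3 guidance",
           "has_attachment": True, "body": "Full 5,000-token body..."},
    "m2": {"sender": "ir@example.com", "subject": "Earnings call invite",
           "has_attachment": False, "body": "Another long body..."},
}

def search_emails(query=None, sender=None, has_attachment=None, limit=100):
    """Phase 1: return IDs and metadata only, never full bodies."""
    results = []
    for mid, msg in MAILBOX.items():
        if sender and msg["sender"] != sender:
            continue
        if has_attachment is not None and msg["has_attachment"] != has_attachment:
            continue
        if query and query.lower() not in msg["subject"].lower():
            continue
        results.append({"id": mid, "sender": msg["sender"],
                        "subject": msg["subject"]})
    return results[:limit]

def read_email(email_id):
    """Phase 2: full content, fetched only when the agent decides it matters."""
    return MAILBOX[email_id]["body"]

hits = search_emails(has_attachment=True)   # cheap: metadata only
body = read_email(hits[0]["id"])            # expensive: one full body
```

Each filter parameter (sender, has_attachment, query) lets the agent narrow results before any body enters context, which is where the order-of-magnitude savings come from.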
This is exactly how our conversation history tool works at Fintool. The agent passes date ranges or search terms and gets back up to 100-200 results containing only user messages and metadata. It then reads specific conversations by passing the conversation ID. Filter parameters like has_attachment, time_range, and sender let the agent narrow results before reading anything. The same pattern applies everywhere:

- Document search: return titles and snippets, not full documents
- Database queries: return row counts and sample rows, not full result sets
- File listings: return paths and metadata, not contents
- API integrations: return summaries, let the agent drill down

Each parameter you add to a tool is a chance to reduce returned tokens by an order of magnitude.

Clean Your Data First (Maximize Your Deductions)

Garbage tokens are still tokens. Clean your data before it enters context. For HTML content, the gains are dramatic. A typical webpage might be 100KB of HTML but only 5KB of actual content. CSS selectors that extract semantic regions (article, main, section) and discard navigation, ads, and tracking can reduce token counts by 90%+. Markdown uses significantly fewer tokens than HTML, making conversion valuable for any web content entering your pipeline. For financial data specifically:

- Strip SEC filing boilerplate (every 10-K has the same legal disclaimers)
- Collapse repeated table headers across pages
- Remove watermarks and page numbers from extracted text
- Normalize whitespace (multiple spaces, tabs, excessive newlines)
- Convert HTML tables to markdown tables

The principle: remove noise at the earliest possible stage, not after tokenization. Every preprocessing step that runs before the LLM call saves money and improves quality.

Delegate to Cheaper Subagents (Offshore to Tax Havens)

Not every task needs your most expensive model. The Claude Code subagent pattern processes 67% fewer tokens overall due to context isolation.
Instead of stuffing every intermediate search result into a single global context, workers keep only what's relevant inside their own window and return distilled outputs. Tasks perfect for cheaper subagents:

- Data extraction: pull specific fields from documents
- Classification: categorize emails, documents, or intents
- Summarization: compress long documents before the main agent sees them
- Validation: check outputs against criteria
- Formatting: convert between data formats

The orchestrator sees condensed results, not raw context. This prevents hitting context limits and reduces the risk of the main agent getting confused by irrelevant details. Scope subagent tasks tightly: the more iterations a subagent requires, the more context it accumulates and the more tokens it consumes. Design for single-turn completion when possible.

Reusable Templates Over Regeneration (Standard Deductions)

Every time an agent generates code from scratch, you're paying for output tokens. Output tokens cost 5x input tokens with Claude. Stop regenerating the same patterns. Our document generation workflow used to be painfully inefficient:

OLD APPROACH:
User: "Create a DCF model for Apple"
Agent: generates 2,000 lines of Excel formulas from scratch
Cost: ~$0.50 in output tokens alone

NEW APPROACH:
User: "Create a DCF model for Apple"
Agent: loads the DCF template, fills in Apple-specific values
Cost: ~$0.05

The template approach:

- Skill references template: dcf_template.xlsx in /public/skills/dcf/
- Agent reads template once: understands structure and placeholders
- Agent fills parameters: company-specific values, assumptions
- WriteFile with minimal changes: only modified cells, not full regeneration

For code generation, the same principle applies. If your agent frequently generates similar Python scripts, data processing pipelines, or analysis frameworks, create reusable functions:

# Instead of regenerating this every time:
def process_earnings_transcript(path):
    # 50 lines of parsing code...
# Reference a skill with reusable utilities:
from skills.earnings import parse_transcript, extract_guidance

The agent imports and calls rather than regenerates. Fewer output tokens, faster responses, more consistent results.

The Lost-in-the-Middle Problem

LLMs don't process context uniformly. Research shows a consistent U-shaped attention pattern: models attend strongly to the beginning and end of prompts while "losing" information in the middle. Strategic placement matters:

- System instructions: beginning (highest attention)
- Current user request: end (recency bias)
- Critical context: beginning or end, never the middle
- Lower-priority background: middle (acceptable loss)

For retrieval-augmented generation, this means reordering retrieved documents. The most relevant chunks should go at the beginning and end. Lower-ranked chunks fill the middle. Manus uses an elegant hack: they maintain a todo.md file that gets updated throughout task execution. This "recites" current objectives at the end of context, combating the lost-in-the-middle effect across their typical 50-tool-call trajectories. We use a similar architecture at Fintool.

Server-Side Compaction (Depreciation)

As agents run, context grows until it hits the window limit. You used to have two options: build your own summarization pipeline, or implement observation masking (replacing old tool outputs with placeholders). Both require significant engineering. Now you can let the API handle it. Anthropic's server-side compaction automatically summarizes your conversation when it approaches a configurable token threshold. Claude Code uses this internally, and it's the reason you can run 50+ tool call sessions without the agent losing track of what it's doing. The key design decisions:

- Trigger threshold: the default is 150K tokens. Set it lower if you want to stay under the 200K pricing cliff, or higher if you need more raw context before summarizing.
- Custom instructions: you can replace the default summarization prompt entirely.
For financial workflows, something like "Preserve all numerical data, company names, and analytical conclusions" prevents the summary from losing critical details.

Pause after compaction: the API can pause after generating the summary, letting you inject additional context (like preserving the last few messages verbatim) before continuing. This gives you control over what survives the compression.

Compaction also stacks well with prompt caching. Add a cache breakpoint on your system prompt so it stays cached separately. When compaction occurs, only the summary needs to be written as a new cache entry. Your system prompt cache stays warm. The beauty of this approach: context depreciates in value over time, and the API handles the depreciation schedule for you.

Output Token Budgeting (Withholding Tax)

Output tokens are the most expensive tokens. With Claude Sonnet, outputs cost 5x inputs. With Opus, they cost 5x inputs that are already expensive. Yet most developers leave max_tokens unlimited and hope for the best.

# BAD: effectively unlimited output
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=8192,  # the model might use all of this
    messages=[...],
)

# GOOD: task-appropriate limits
TASK_LIMITS = {
    "classification": 50,
    "extraction": 200,
    "short_answer": 500,
    "analysis": 2000,
    "code_generation": 4000,
}

Structured outputs reduce verbosity. JSON responses use fewer tokens than natural language explanations of the same information.

Natural language: "The company's revenue was 94.5 billion dollars, which represents a year-over-year increase of 12.3 percent compared to the previous fiscal year's revenue of 84.2 billion dollars."

Structured: {"revenue": 94.5, "unit": "B", "yoy_change": 12.3}

For agents specifically, consider response chunking.
Instead of generating a 10,000-token analysis in one shot, break it into phases:

- Outline phase: generate structure (500 tokens)
- Section phases: generate each section on demand (1,000 tokens each)
- Review phase: check and refine (500 tokens)

This gives you control points to stop early if the user has what they need, rather than always generating the maximum possible output.

The 200K Pricing Cliff (The Tax Bracket)

With Claude Opus 4.6 and Sonnet 4.5, crossing 200K input tokens triggers premium pricing. Your per-token cost jumps: Opus goes from $5 to $10 per million input tokens (2x), and output jumps from $25 to $37.50 (1.5x). This isn't gradual. It's a cliff. This is the LLM equivalent of a tax bracket, and just like tax planning, the right strategy is to stay under the threshold when you can. For agent workflows that risk crossing 200K, implement a context budget. Track cumulative input tokens across tool calls. When you approach the cliff, trigger aggressive compression: observation masking, summarization of older turns, or pruning low-value context. The cost of a compression step is far less than paying the premium rate for the rest of the conversation.

Parallel Tool Calls (Filing Jointly)

Every sequential tool call is a round trip, and each round trip re-sends the full conversation context. If your agent makes 20 tool calls sequentially, the context gets transmitted and billed 20 times. The Anthropic API supports parallel tool calls: the model can request multiple independent tool calls in a single response, and you execute them simultaneously. This means fewer round trips for the same amount of work. The savings compound: with fewer round trips, you accumulate less intermediate context, which means each subsequent round trip is also cheaper. Design your tools so that independent operations can be identified and batched by the model.

Application-Level Response Caching (Tax-Exempt Status)

The cheapest token is the one you never send to the API. Before any LLM call, check if you've already answered this question.
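That check can be a thin layer in front of the model call. A minimal sketch, with hypothetical names and TTLs chosen to match the volatility of the underlying data:

```python
import time

# Application-level response cache: short-circuit the LLM call entirely
# when a fresh cached answer exists. All names and TTLs are illustrative.
TTL_BY_KIND = {
    "earnings_summary": 90 * 24 * 3600,  # stable once generated
    "price_quote": 60,                   # volatile: expire after a minute
}

_cache = {}

def cached_answer(kind, query, generate, now=time.time):
    """Return a cached response if still fresh, else call `generate` and store it."""
    key = (kind, query)
    hit = _cache.get(key)
    if hit is not None:
        value, stored_at = hit
        if now() - stored_at < TTL_BY_KIND[kind]:
            return value        # short-circuit: zero tokens sent to the API
    value = generate(query)     # the only place the expensive call happens
    _cache[key] = (value, now())
    return value

calls = []
def expensive_llm_call(q):
    """Stand-in for a real model call; records how often it runs."""
    calls.append(q)
    return f"summary of {q}"

cached_answer("earnings_summary", "AAPL Q3", expensive_llm_call)
cached_answer("earnings_summary", "AAPL Q3", expensive_llm_call)
# Only the first request paid for generation; len(calls) is 1 here.
```

The `now` parameter is injectable so expiry is testable; in production the same knob lets you force-refresh after an underlying data change rather than waiting for the TTL.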
At Fintool, we cache aggressively for earnings call summarizations and common queries. When a user asks for Apple's latest earnings summary, we don't regenerate it from scratch for every request. The first request pays the full cost. Every subsequent request is essentially free. This operates above the LLM layer entirely. It's not prompt caching or KV cache. It's your application deciding that a query has a valid cached response and short-circuiting the API call. Good candidates for application-level caching:

- Factual lookups: company financials, earnings summaries, SEC filings
- Common queries: questions that many users ask about the same data
- Deterministic transformations: data formatting, unit conversions
- Stable analysis: any output that won't change until the underlying data changes

The cache invalidation strategy matters. For financial data, earnings call summaries are stable once generated; real-time price data obviously isn't. Match your cache TTL to the volatility of the underlying data. Even partial caching helps: if an agent task involves five tool calls and you can cache two of them, you've cut 40% of your tool-related token costs without touching the LLM.

The Meta Lesson

Context engineering isn't glamorous. It's not the exciting part of building agents. But it's the difference between a demo that impresses and a product that scales with decent gross margins. The best teams building sustainable agent products obsess over token efficiency the same way database engineers obsess over query optimization. Because at scale, every wasted token is money on fire. The context tax is real. But with the right architecture, it's largely avoidable.


The Crumbling Workflow Moat: Aggregation Theory's Final Chapter

For decades, software companies commanded premium pricing not only for their data, but for their interfaces. The specialized keyboards. The Excel integrations. The workflow automations. Users spent years mastering these systems. Companies built processes hardcoded to specific tools. Switching meant massive productivity loss. The interface WAS the product.

I haven't used Google in a year. An LLM chat is my browser. Soon, knowledge workers won't use specialized software interfaces either. The LLM chat will be their interface to everything. This isn't incremental change. This is the completion of Ben Thompson's Aggregation Theory. In this article:

- Why Aggregation Theory left suppliers with one critical asset: their interface
- How vertical software built empires on workflow complexity, not data
- Why LLMs absorb the interface layer entirely
- When interfaces are commoditized, it's API versus API
- Valuation framework: the math is brutal
- Who wins, who loses, and what comes next

Ben Thompson's framework reshaped how we think about internet economics. The value chain was simple: Suppliers → Distributors → Consumers. Pre-internet, high distribution costs created leverage for distributors. TV networks controlled what content got aired. Newspapers decided which stories mattered. Retailers chose which products reached shelves. Then distribution costs collapsed to zero. Transaction costs followed. Power shifted from distributors to a new species: aggregators.

The classic aggregators emerged: Google aggregated websites via search. Facebook aggregated content via social graph. Amazon aggregated merchants via marketplace. Uber and Airbnb aggregated physical supply via mobile apps. Thompson identified the virtuous cycle: better UX → more users → more suppliers → better UX. The aggregator wins by owning the consumer relationship, commoditizing suppliers until they become interchangeable.

[Figure: The Web 2.0 aggregation stack]

But suppliers retained two critical assets.
Their interface and their data. The paradox of Web 2.0 aggregation was structural. Google commoditized discovery. When you search "best Italian restaurant SF," you don't care which site ranks #1. The source is fungible. But you still visit that site. You see their brand. You experience their UX. You navigate their reservation system. This created a hard limit on commoditization:

- Discovery: commoditized (Google owns it)
- Interface: protected (suppliers own it)
- Data: protected (suppliers own it)

The interface layer mattered for four reasons:

- Brand persistence: users saw the New York Times, not just "a news source." Brand equity survived aggregation.
- UX differentiation: suppliers could compete on design, speed, features. A better interface meant higher conversion.
- Switching costs: users developed muscle memory and workflow habits. Learning a new system had real friction.
- Monetization control: suppliers owned their conversion funnels. They controlled the paywall, the checkout, the subscription flow.

Vertical software is the perfect case study. Financial data terminals, legal research platforms, medical databases, real estate analytics, recruiting tools: they all pull from data that's largely commoditized or licensable. Yet they command premium pricing. Why? Because the interface IS the moat.

[Figure: The interface moat in vertical software: same data, different interfaces, premium pricing]

Knowledge workers spent years learning specialized interfaces. The muscle memory is real. They're not paying for data. They're paying to not relearn a workflow they've spent a decade mastering. Companies built models and processes hardcoded to specific plugins. Changing providers means rebuilding workflows, retraining teams, and risking errors during the transition. Switching costs weren't about data. They were about the interface. This is why vertical software traded at 20-30x earnings. The market believed the interface was defensible. But is it today?
LLMs: The Final Aggregator

LLMs don’t just aggregate suppliers. They absorb the interface itself. When LLMs commoditize the interface, what’s left? Just the data. And then it’s API against API. Pure commodity competition.

The three-layer collapse. What changes structurally:

THE VISIBILITY COLLAPSE

- Users never see the supplier’s brand
- Users never experience the supplier’s UX
- Users don’t know where information originated
- The entire web becomes a backend database

Consider a knowledge worker today using specialized vertical software. They open the application. Navigate to the screening tool. Set parameters. Export to Excel. Build a model. Run scenarios. Each step involves interacting with the software’s interface. Each step reinforces the switching cost.

Now consider a knowledge worker with an LLM chat:

- “Show me all software companies with >$1B market cap, P/E under 30, growing revenue >20% YoY.”
- “Build a DCF model for the top 5.”
- “Run sensitivity analysis on discount rate.”

The user never touched any specialized interface. They don’t know (or care) which data provider the LLM queried. The LLM found the cheapest available source with adequate coverage. This is complete commoditization. Not just of discovery, but of the entire supplier experience. When interfaces are commoditized, all that remains is API versus API.

What happens to pricing power when interfaces disappear?

The old model (vertical software):

- $10-25K/seat/year
- Multi-year contracts with annual escalators
- 95%+ retention because switching means retraining
- Gross margins >80%

The new model:

- Data licensing fees (pennies per query)
- No user lock-in (LLM can switch sources instantly)
- Margin compression to commodity levels
- Retention based purely on data quality and coverage

The math is brutal. If a vertical software company’s interface was 60% of their value, and LLMs eliminate interface value entirely, what remains is pure data value.
And if that data isn’t proprietary, if it can be licensed or replicated, there’s nothing left.

VALUE DECOMPOSITION

If you have no proprietary data, you are in big trouble. This is Aggregation Theory applied to its logical conclusion. Look at financial data software. Companies that built empires on interface complexity are watching their moats evaporate. A $20B market cap company with no truly proprietary data should trade at $5-8B once LLMs absorb their interface value. That’s not a bear case. That’s math.

The same logic applies everywhere interfaces created moats:

- Financial data: Terminals that charge $12-24K/year for interfaces over largely commoditized data feeds. When an LLM can query the same data directly, the interface premium evaporates.
- Legal research: Platforms charging premium prices for interfaces over case law that’s largely public domain. The specialized search and citation tools become worthless when an LLM can do it better.
- Medical databases: Clinical decision support tools that charge physicians for point-of-care recommendations. Exactly what LLMs excel at.
- Real estate analytics: Comprehensive databases accessed through specialized workflow tools. LLMs querying the same data through APIs eliminate the workflow lock-in.
- Recruiting: Search and outreach tools charging $10K+/year. When an LLM can query professional networks and draft personalized outreach, the interface value disappears.

The only survivors: companies with truly proprietary data that cannot be replicated or licensed.

From Software to APIs: The New Supplier Stack

If interfaces are irrelevant, what do suppliers need?

The old stack:

- Frontend framework (React, Vue)
- Design system (component library)
- UX research (user testing, A/B tests)
- Brand marketing (differentiation)
- SEO optimization (Google discovery)

The new stack:

- Clean, structured data (markdown, JSON)
- API/MCP endpoints (machine accessibility)
- Data quality monitoring (accuracy, freshness)

That’s it. All software becomes API.
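The repricing arithmetic above ($20B of market cap shrinking to $5-8B) can be made explicit. A minimal sketch, assuming the interface accounts for 60-75% of enterprise value; the exact share for any given company is the open question:

```python
def repriced_value(market_cap: float, interface_share: float) -> float:
    """Value remaining once LLMs absorb the interface layer:
    only the data share of the business survives."""
    return market_cap * (1 - interface_share)

# A $20B vertical-software company with no proprietary data:
low = repriced_value(20e9, 0.75)   # interface was 75% of value -> $5B remains
high = repriced_value(20e9, 0.60)  # interface was 60% of value -> $8B remains
print(f"${low / 1e9:.0f}B - ${high / 1e9:.0f}B")
```

The whole debate reduces to two inputs: how much of the business is interface, and whether the data share is proprietary. If the data can be licensed or scraped, even the residual value is at risk.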
A restaurant today invests in a beautiful website with parallax scrolling, professional food photography, reservation system integration, review management, local SEO. All to make humans want to click “Book Now.” A restaurant in the LLM era needs:

```markdown
# Bella Vista Italian Restaurant
## Location: 123 Main St, San Francisco
## Hours: Mon-Thu 5-10pm, Fri-Sat 5-11pm
## Menu:
- Margherita Pizza: $22
- Spaghetti Carbonara: $24
## Reservation API: POST /book {date, time, party_size}
```

That’s everything an LLM needs. The $50K website becomes a text file and an API endpoint. Vertical software’s beautiful interfaces become:

```
MCP endpoint: /query
Parameters: {filters, fields, format}
Returns: [structured data]
```

No keyboard shortcuts to learn. No plugins to install. No interface to build. Just data, accessible via API.

Traditional REST APIs had structural limitations that preserved switching costs:

- Rigid schemas requiring exact field names
- Extensive documentation humans had to read
- Bespoke integration for every service
- Stateless interactions without conversation context

This created a moat: integration effort. Even if data was commoditized, the cost of switching APIs was non-trivial. Someone had to write new code, test edge cases, handle errors differently. MCP changes this. The Model Context Protocol eliminates integration friction: when switching between data sources requires zero integration work, the only differentiator is data quality, coverage, and price. This is true commodity competition.

SWITCHING COST COLLAPSE

The New Aggregation Framework

Reframing Thompson’s model for the LLM era:

AGGREGATION EVOLUTION

Original Aggregation Theory (2015): Suppliers → [Aggregator] → Consumers. The aggregator (Google/Facebook) achieved zero distribution cost, zero transaction cost, and commoditized suppliers. But suppliers kept their interface and their data.
LLM Aggregation Theory (2025): APIs → [LLM Chat] → Consumers. The LLM achieves zero distribution cost, zero transaction cost, AND zero interface cost. Complete supplier invisibility. What remains is API versus API.

The aggregator layer gets thicker while the supplier layer gets thinner. In Web 2.0, Google was a thin routing layer. It pointed you to suppliers who owned your attention once you clicked. The supplier had the relationship. The supplier had the interface. The supplier converted you. In the LLM era, the chat owns your entire interaction. Suppliers are invisible infrastructure. You don’t know where the information came from. You don’t experience their brand. You never see their interface.

Vertical software in 2020: The product that owned the workflow. Vertical software in 2030: An API that the LLM queries. The moat wasn’t data. It was that knowledge workers lived inside these interfaces 10 hours a day. That interface now lives inside the LLM chat.

Winners and Losers: A Framework

The New Value Matrix

The Winners:

- LLM Chat Interface Owners: Whoever owns the chat interface owns the user relationship. OpenAI with ChatGPT. Anthropic with Claude. Microsoft with Copilot. Google with Gemini. They capture the interface value that vertical software loses. The new aggregators.
- Proprietary Data Owners: Companies with truly unique, non-replicable data. The key test: Can this data be licensed or scraped? If yes, not defensible. If no, you survive.
- MCP-First Startups: Companies building for agents, not humans. No legacy interface to protect. No beautiful UI to maintain. Just clean data served through MCP endpoints that LLMs can query. They can undercut incumbents on price because they have no interface investment to recoup.

The Losers:

- Interface-Moat Businesses: Any vertical software where “workflow” was the value. The interface that justified premium pricing becomes worthless. A $20B company with no proprietary data becomes a $5-8B company.
- Traditional Aggregators (Maybe): Google and Meta commoditized suppliers. Now LLMs could commoditize them. But here’s the nuance: only if they fail to own the LLM chat layer themselves. Google has Gemini and insane distribution. Meta has Llama. The race is on. If they win the chat interface, they stay aggregators. If they lose it, they become the commoditized.
- Content Creators: UGC platforms lose relevance when AI generates personalized content. The creator economy inverts: infinite AI content, zero human creators needed for most use cases.
- The UI/UX Industry: Beautiful interfaces become irrelevant when the LLM chat is the only interface. Hundreds of billions per year in frontend development... for what? Figma (amazing product!) is down by 90%.

The framework for repricing interface businesses is simple. How much of the business is interface versus data? Most vertical software is 60-80% interface, 20-40% data. When LLMs absorb the interface, that value evaporates. Is the data truly proprietary? If it can be licensed, scraped, or replicated, there’s no moat left. Pure commodity competition. This is not a bear case. This is math. The market hasn’t priced this in because LLM capabilities are new (less than 2 years at scale), MCP adoption is early (less than 1 year), enterprise buyers move slowly (3-5 year contracts), and incumbents are in denial. But the repricing is coming, in my opinion.

The arc of internet economics:

- Pre-Internet (1950-1995): Distributors controlled suppliers. High distribution costs created leverage.
- Web 1.0 (1995-2005): Distribution costs collapsed. Content went online but remained siloed.
- Web 2.0 (2005-2023): Transaction costs collapsed. Aggregators emerged. Suppliers were commoditized but kept their interfaces.
- LLM Era (2023+): Interface costs collapse. LLMs complete aggregation. Suppliers become APIs. It’s API versus API, and whoever has no proprietary data loses.

What Thompson got right: Suppliers would be commoditized.
Consumer experience would become paramount. Winner-take-all dynamics would emerge.

What Thompson couldn’t have predicted: The interface itself would be absorbed. Suppliers would become invisible. The aggregator would BE the experience, not just route to it. All software would become API.

In the LLM era, the internet becomes a database. Structured data in, natural language out. No websites, no interfaces, no brands. Just APIs serving data to AI. For someone who spent a decade building beautiful interfaces, this is bittersweet. All those carefully crafted interactions, pixel-perfect layouts, workflow optimizations... obsolete. But this is what progress looks like. The UX of chatting with an LLM is infinitely better than navigating specialized software. And that’s all that matters.

Aggregation Theory told us suppliers would be commoditized. LLMs are finishing the job. The interface moat is dead. What remains is data. And if your data isn’t proprietary, neither is your business.


Lessons from Building AI Agents for Financial Services

I’ve spent the last two years building AI agents for financial services. Along the way, I’ve accumulated a fair number of battle scars and learnings that I want to share. Here’s what I’ll cover:

- The Sandbox Is Not Optional - Why isolated execution environments are essential for multi-step agent workflows
- Context Is the Product - How we normalize heterogeneous financial data into clean, searchable context
- The Parsing Problem - The hidden complexity of extracting structured data from adversarial SEC filings
- Skills Are Everything - Why markdown-based skills are becoming the product, not the model
- The Model Will Eat Your Scaffolding - Designing for obsolescence as models improve
- The S3-First Architecture - Why S3 beats databases for file storage and user data
- The File System Tools - How ReadFile, WriteFile, and Bash enable complex financial workflows
- Temporal Changed Everything - Reliable long-running tasks with proper cancellation handling
- Real-Time Streaming - Building responsive UX with delta updates and interactive agent workflows
- Evaluation Is Not Optional - Domain-specific evals that catch errors before they cost money
- Production Monitoring - The observability stack that keeps financial agents reliable

Why financial services is extremely hard. This domain doesn’t forgive mistakes. Numbers matter. A wrong revenue figure, a misinterpreted guidance statement, an incorrect DCF assumption. Professional investors make million-dollar decisions based on our output. One mistake on a $100M position and you’ve destroyed trust forever.

The users are also demanding. Professional investors are some of the smartest, most time-pressed people you’ll ever work with. They spot bullshit instantly. They need precision, speed, and depth. You can’t hand-wave your way through a valuation model or gloss over nuances in an earnings call. This forces me to develop an almost paranoid attention to detail. Every number gets double-checked.
Every assumption gets validated. Every model gets stress-tested. You start questioning everything the LLM outputs because you know your users will. A single wrong calculation in a DCF model and you lose credibility forever. I sometimes feel that the fear of being wrong has become our best feature.

Over the years of building with LLMs, we’ve made bold infrastructure bets early, and I think we have been right. For instance, when Claude Code launched with its filesystem-first agentic approach, we immediately adopted it. It was not an obvious bet, and it was a massive revamp of our architecture. I was extremely lucky to have Thariq from Anthropic’s Claude Code team jump on a Zoom and open my eyes to the possibilities. At the time, the whole industry, including Fintool, was building elaborate RAG pipelines with vector databases and embeddings. After reflecting on the future of information retrieval with agents, I wrote “the RAG obituary” and Fintool moved fully to agentic search. We even decided to retire our precious embedding pipeline. Sad, but whatever is best for the future! People thought we were crazy. The article got a lot of praise and a lot of negative comments. Now I feel most startups are adopting these best practices. I believe we’re early on several other architectural choices too. I’m sharing them here because the best way to test ideas is to put them out there. Let’s start with the biggest one.

When we first started building Fintool in 2023, I thought sandboxing might be overkill. “We’re just running Python scripts,” I told myself. “What could go wrong?” Haha. Everything. Everything could go wrong. The first time an LLM decided to `rm -rf /` on our server (it was trying to “clean up temporary files”), I became a true believer. Here’s the thing: agents need to run multi-step operations. A professional investor asks for a DCF valuation, and that’s not a single API call.
The agent needs to research the company, gather financial data, build a model in Excel, run sensitivity analysis, generate complex charts, iterate on assumptions. That’s dozens of steps, each potentially modifying files, installing packages, running scripts. You can’t do this without code execution. And executing arbitrary code on your servers is insane. Every chat application needs a sandbox.

Today each user gets their own isolated environment. The agent can do whatever it wants in there. Delete everything? Fine. Install weird packages? Go ahead. It’s your sandbox, knock yourself out. The architecture looks like this: three mount points. Private is read/write for your stuff. Shared is read-only for your organization. Public is read-only for everyone.

The magic is in the credentials. We use AWS ABAC (Attribute-Based Access Control) to generate short-lived credentials scoped to specific S3 prefixes. User A literally cannot access User B’s data. The IAM policy uses `${aws:PrincipalTag/S3Prefix}` to restrict access. The credentials physically won’t allow it. This is also very good for enterprise deployments.

We also do sandbox pre-warming. When a user starts typing, we spin up their sandbox in the background. By the time they hit enter, the sandbox is ready. 600-second timeout, extended by 10 minutes on each tool usage. The sandbox stays warm across conversation turns.

Sandboxes are amazing, but their under-discussed magic is filesystem support. Which brings us to the next lesson, about context. Your agent is only as good as the context it can access. The real work isn’t prompt engineering; it’s turning messy financial data from dozens of sources into clean, structured context the model can actually use. This requires massive domain expertise from the engineering team.

The heterogeneity problem.
Financial data comes in every format imaginable:

- SEC filings: HTML with nested tables, exhibits, signatures
- Earnings transcripts: Speaker-segmented text with Q&A sections
- Press releases: Semi-structured HTML from PRNewswire
- Research reports: PDFs with charts and footnotes
- Market data: Snowflake/databases with structured numerical data
- News: Articles with varying quality and structure
- Alternative data: Satellite imagery, web traffic, credit card panels
- Broker research: Proprietary PDFs with price targets and models
- Fund filings: 13F holdings, proxy statements, activist letters

Each source has different schemas, different update frequencies, different quality levels. The agent needs one thing: clean context it can reason over.

The normalization layer. Everything becomes one of three formats:

- Markdown for narrative content (filings, transcripts, articles)
- CSV/tables for structured data (financials, metrics, comparisons)
- JSON metadata for searchability (tickers, dates, document types, fiscal periods)

Chunking strategy matters. Not all documents chunk the same way:

- 10-K filings: Section by regulatory structure (Item 1, 1A, 7, 8...)
- Earnings transcripts: Chunk by speaker turn (CEO remarks, CFO remarks, Q&A by analyst)
- Press releases: Usually small enough to be one chunk
- News articles: Paragraph-level chunks
- 13F filings: By holder and position changes quarter-over-quarter

The chunking strategy determines what context the agent retrieves. Bad chunks = bad answers.

Tables are special. Financial data is full of tables and CSVs. Revenue breakdowns, segment performance, guidance ranges. LLMs are surprisingly good at reasoning over markdown tables, but they’re terrible at reasoning over HTML `<table>` tags or raw CSV dumps. The normalization layer converts everything to clean markdown tables.

Metadata enables retrieval. The user asks the agent: “What did Apple say about services revenue in their last earnings call?
” To answer this, Fintool needs:

- Ticker resolution (AAPL → correct company)
- Document type filtering (earnings transcript, not 10-K)
- Temporal filtering (most recent, not 2019)
- Section targeting (CFO remarks or revenue discussion, not legal disclaimers)

This is why `meta.json` exists for every document. Without structured metadata, you’re doing keyword search over a haystack. It speeds up the search, big time! Anyone can call an LLM API. Not everyone has normalized decades of financial data into searchable, chunked markdown with proper metadata. The data layer is what makes agents actually work.

The Parsing Problem

Normalizing financial data is 80% of the work. Here’s what nobody tells you. SEC filings are adversarial. They’re not designed for machine reading. They’re designed for legal compliance:

- Tables span multiple pages with repeated headers
- Footnotes reference exhibits that reference other footnotes
- Numbers appear in text, tables, and exhibits, sometimes inconsistently
- XBRL tags exist but are often wrong or incomplete
- Formatting varies wildly between filers (every law firm has their own template)

We tried off-the-shelf PDF/HTML parsers. They failed on:

- Multi-column layouts in proxy statements
- Nested tables in MD&A sections (tables within tables within tables)
- Watermarks and headers bleeding into content
- Scanned exhibits (still common in older filings and attachments)
- Unicode issues (curly quotes, em-dashes, non-breaking spaces)

The Fintool parsing pipeline:

1. Raw filing (HTML/PDF)
2. Document structure detection (headers, sections, exhibits)
3. Table extraction with cell relationship preservation
4. Entity extraction (companies, people, dates, dollar amounts)
5. Cross-reference resolution (Ex. 10.1 → actual exhibit content)
6. Fiscal period normalization (FY2024 → Oct 2023 to Sep 2024 for Apple)
7. Quality scoring (confidence per extracted field)

Table extraction deserves its own work. Financial tables are dense with meaning.
A revenue breakdown table might have:

- Merged header cells spanning multiple columns
- Footnote markers (1), (2), (a), (b) that reference explanations below
- Parentheses for negative numbers: $(1,234) means -1234
- Mixed units in the same table (millions for revenue, percentages for margins)
- Prior period restatements in italics or with asterisks

We score every extracted table on:

- Cell boundary accuracy (did we split/merge correctly?)
- Header detection (is row 1 actually headers, or is there a title row above?)
- Numeric parsing (is “$1,234” parsed as 1234 or left as text?)
- Unit inference (millions? billions? per share? percentage?)

Tables below 90% confidence get flagged for review. Low-confidence extractions don’t enter the agent’s context: garbage in, garbage out.

Fiscal period normalization is critical. “Q1 2024” is ambiguous:

- Calendar Q1 (January-March 2024)
- Apple’s fiscal Q1 (October-December 2023)
- Microsoft’s fiscal Q1 (July-September 2023)
- “Reported in Q1” (filed in Q1, but covers the prior period)

We maintain a fiscal calendar database for 10,000+ companies. Every date reference gets normalized to absolute date ranges. When the agent retrieves “Apple Q1 2024 revenue,” it knows to look for data from October-December 2023. This is invisible to users but essential for correctness. Without it, you’re comparing Apple’s October revenue to Microsoft’s January revenue and calling it “same quarter.”

Here’s the thing nobody tells you about building AI agents: the model is not the product. The skills are now the product. I learned this the hard way. We used to try making the base model “smarter” through prompt engineering. Tweak the system prompt, add examples, write elaborate instructions. It helped a little. But skills were the missing part. In October 2025, Anthropic formalized this with Agent Skills, a specification for extending Claude with modular capability packages.
A skill is a folder containing a `SKILL.md` file with YAML frontmatter (name and description), plus any supporting scripts, references, or data files the agent might need. We’d been building something similar for months before the announcement. The validation felt good, but more importantly, having an industry standard means our skills can eventually be portable.

Without skills, models are surprisingly bad at domain tasks. Ask a frontier model to do a DCF valuation. It knows what DCF is. It can explain the theory. But actually executing one? It will miss critical steps, use wrong discount rates for the industry, forget to add back stock-based compensation, skip sensitivity analysis. The output looks plausible but is subtly wrong in ways that matter.

The breakthrough came when we started thinking about skills as first-class citizens. Like part of the product itself. A skill is a markdown file that tells the agent how to do something specific, like our DCF skill. That’s it. A markdown file. No code changes. No production deployment. Just a file that tells the agent what to do.

Skills are better than code. This matters enormously:

1. Non-engineers can create skills. Our analysts write skills. Our customers write skills. A portfolio manager who’s done 500 DCF valuations can encode their methodology in a skill without writing a single line of Python.
2. No deployment needed. Change a skill file and it takes effect immediately. No CI/CD, no code review, no waiting for release cycles. Domain experts can iterate on their own.
3. Readable and auditable. When something goes wrong, you can read the skill and understand exactly what the agent was supposed to do. Try doing that with a 2,000-line Python module.

We have a copy-on-write shadowing system. Priority: private > shared > public. So if you don’t like how we do DCF valuations, write your own. Drop it in `/private/skills/dcf/SKILL.md`. Your version wins.
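As an illustration of the format, a hypothetical `SKILL.md` might look like the sketch below: YAML frontmatter with a name and description, followed by plain markdown instructions. The specific steps are invented for illustration, not Fintool’s actual DCF skill.

```markdown
---
name: dcf-valuation
description: Build a discounted cash flow valuation for a public company
---

# DCF Valuation

1. Pull the last 5 years of revenue, EBITDA, and free cash flow.
2. Project free cash flow, using analyst guidance where available.
3. Use an industry-appropriate discount rate; add back stock-based compensation.
4. Run sensitivity analysis on discount rate and terminal growth.
5. Output the model as an Excel file with assumptions on a separate tab.
```

Note that the frontmatter (name, description) is exactly what gets loaded into context at discovery time; the body only loads when the skill is used.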
Why we don’t mount all skills to the filesystem. This is important. The naive approach would be to mount every skill file directly into the sandbox. The agent can just `cat` any skill it needs. Simple, right? Wrong. Here’s why we use SQL discovery instead: 1. Lazy loading. We have dozens of skills with extensive documentation like the DCF skill alone has 10+ industry guideline files. Loading all of them into context for every conversation would burn tokens and confuse the model. Instead, we discover skill metadata (name, description) upfront, and only load the full documentation when the agent actually uses that skill. 2. Access control at query time. The SQL query implements our three-tier access model: public skills available to everyone, organization skills for that org’s users, private skills for individual users. The database enforces this. You can’t accidentally expose a customer’s proprietary skill to another customer. 3. Shadowing logic. When a user customizes a skill, their version needs to override the default. SQL makes this trivial—query all three levels, apply priority rules, return the winner. Doing this with filesystem mounts would be a nightmare of symlinks and directory ordering. 4. Metadata-driven filtering. The `fs_files.metadata` column stores parsed YAML frontmatter. We can filter by skill type, check if a skill is main-agent-only, or query any other structured attribute—all without reading the files themselves. The pattern: S3 is the source of truth, a Lambda function syncs changes to PostgreSQL for fast queries, and the agent gets exactly what it needs when it needs it. Skills are essential. I cannot emphasize this enough. If you’re building an AI agent and you don’t have a skills system, you’re going to have a bad time. My biggest argument for skills is that top models (Claude or GPT) are post-trained on using Skills. The model wants to fetch skills. Models just want to learn and what they want to learn is our skills... Until they ate it. 
Here’s the uncomfortable truth: everything I just told you about skills? It’s temporary in my opinion. Models are getting better. Fast. Every few months, there’s a new model that makes half your code obsolete. The elaborate scaffolding you built to handle edge cases? The model just... handles them now. When we started, we needed detailed skills with step-by-step instructions for some simple tasks. “First do X, then do Y, then check Z.” Now? We can often just say for simple task “do an earnings preview” and the model figures it out (kinda of!) This creates a weird tension. You need skills today because current models aren’t smart enough. But you should design your skills knowing that future models will need less hand-holding. That’s why I’m bullish on markdown file versus code for model instructions. It’s easier to update and delete. We send detailed feedback to AI labs. Whenever we build complex scaffolding to work around model limitations, we document exactly what the model struggles with and share it with the lab research team. This helps inform the next generation of models. The goal is to make our own scaffolding obsolete. My prediction: in two years, most of our basic skills will be one-liners. “Generate a 20 tabs DCF.” That’s it. The model will know what that means. But here’s the flip side: as basic tasks get commoditized, we’ll push into more complex territory. Multi-step valuations with segment-by-segment analysis. Automated backtesting of investment strategies. Real-time portfolio monitoring with complex triggers. The frontier keeps moving. So we write skills. We delete them when they become unnecessary. And we build new ones for the harder problems that emerge. And all that are files... in our filesystem. Here’s something that surprised me: S3 for files is a better database than a database. We store user data (watchlists, portfolio, preferences, memories, skills) in S3 as YAML files. S3 is the source of truth. 
A Lambda function syncs changes to PostgreSQL for fast queries. Writes → S3 (source of truth) Lambda trigger PostgreSQL (fs_files table) Reads ← Fast queries - Durability : S3 has 11 9’s. A database doesn’t. - Versioning : S3 versioning gives you audit trails for free - Simplicity : YAML files are human-readable. You can debug with `cat`. - Cost : S3 is cheap. Database storage is not. The pattern: - Writes go to S3 directly - List queries hit the database (fast) - Single-item reads go to S3 (freshest data) The sync architecture. We run two Lambda functions to keep S3 and PostgreSQL in sync: S3 (file upload/delete) fs-sync Lambda → Upsert/delete in fs_files table (real-time) EventBridge (every 3 hours) fs-reconcile Lambda → Full S3 vs DB scan, fix discrepancies Both use upsert with timestamp guards—newer data always wins. The reconcile job catches any events that slipped through (S3 eventual consistency, Lambda cold starts, network blips). User memories live here too. Every user has a `/private/memories/UserMemories.md` file in S3. It’s just markdown—users can edit it directly in the UI. On every conversation, we load it and inject it as context: This is surprisingly powerful. Users write things like “I focus on small-cap value stocks” or “Always compare to industry median, not mean” or “My portfolio is concentrated in tech, so flag concentration risk.” The agent sees this on every conversation and adapts accordingly. No migrations. No schema changes. Just a markdown file that the user controls. Watchlists work the same way. YAML files in S3, synced to PostgreSQL for fast queries. When a user asks about “my watchlist,” we load the relevant tickers and inject them as context. The agent knows what companies matter to this user. The filesystem becomes the user’s personal knowledge base. Skills tell the agent how to do things. Memories tell it what the user cares about. Both are just files. Agents in financial services need to read and write files. A lot of files. 
PDFs, spreadsheets, images, code. Here’s how we handle it. ReadFile handles the complexity: WriteFile creates artifacts that link back to the UI: Bash gives persistent shell access with 180 second timeout and 100K character output limit. Path normalization on everything (LLMs love trying path traversal attacks, it’s hilarious). Bash is more important than you think. There’s a growing conviction in the AI community that filesystems and bash are the optimal abstraction for AI agents. Braintrust recently ran an eval comparing SQL agents, bash agents, and hybrid approaches for querying semi-structured data. The results were interesting: pure SQL hit 100% accuracy but missed edge cases. Pure bash was slower and more expensive but caught verification opportunities. The winner? A hybrid approach where the agent uses bash to explore and verify, SQL for structured queries. This matches our experience. Financial data is messy. You need bash to grep through filing documents, find patterns, explore directory structures. But you also need structured tools for the heavy lifting. The agent needs both—and the judgment to know when to use each. We’ve leaned hard into giving agents full shell access in the sandbox. It’s not just for running Python scripts. It’s for exploration, verification, and the kind of ad-hoc data manipulation that complex tasks require. But complex tasks mean long-running agents. And long-running agents break everything. Subscribe now Before Temporal, our long-running tasks were a disaster. User asks for a comprehensive company analysis. That takes 5 minutes. What if the server restarts? What if the user closes the tab and comes back? What if... anything? We had a homegrown job queue. It was bad. Retries were inconsistent. State management was a nightmare. Then we switched to Temporal and I wanted to cry tears of joy! That’s it. Temporal handles worker crashes, retries, everything. 
If a Heroku dyno restarts mid-conversation (happens all the time lol), Temporal automatically retries on another worker. The user never knows. The cancellation handling is the tricky part. User clicks “stop,” what happens? The activity is already running on a different server. We use heartbeats sent every few seconds. We run two worker types: - Chat workers : User-facing, 25 concurrent activities - Background workers : Async tasks, 10 concurrent activities They scale independently. Chat traffic spikes? Scale chat workers. Next is speed. In finance, people are impatient. They’re not going to wait 30 seconds staring at a loading spinner. They need to see something happening. So we built real-time streaming. The agent works, you see the progress. Agent → SSE Events → Redis Stream → API → Frontend The key insight: delta updates, not full state. Instead of sending “here’s the complete response so far” (expensive), we send “append these 50 characters” (cheap). Streaming rich content with Streamdown. Text streaming is table stakes. The harder problem is streaming rich content: markdown with tables, charts, citations, math equations. We use Streamdown to render markdown as it arrives, with custom plugins for our domain-specific components. Charts render progressively. Citations link to source documents. Math equations display properly with KaTeX. The user sees a complete, interactive response building in real-time. AskUserQuestion: Interactive agent workflows. Sometimes the agent needs user input mid-workflow. “Which valuation method do you prefer?” “Should I use consensus estimates or management guidance?” “Do you want me to include the pipeline assets in the valuation?” We built an `AskUserQuestion` tool that lets the agent pause, present options, and wai When the agent calls this tool, the agentic loop intercepts it, saves state, and presents a UI to the user. The user picks an option (or types a custom answer), and the conversation resumes with their choice. 
This transforms agents from autonomous black boxes into collaborative tools. The agent does the heavy lifting, but the user stays in control of key decisions. Essential for high-stakes financial work where users need to validate assumptions. “Ship fast, fix later” works for most startups. It does not work for financial services. A wrong earnings number can cost someone money. A misinterpreted guidance statement can lead to bad investment decisions. You can’t just “fix it later” when your users are making million-dollar decisions based on your output. We use Braintrust for experiment tracking. Every model change, every prompt change, every skill change gets evaluated against a test set. Generic NLP metrics (BLEU, ROUGE) don’t work for finance. A response can be semantically similar but have completely wrong numbers. Building eval datasets is harder than building the agent. We maintain ~2,000 test cases across categories: Ticker disambiguation. This is deceptively hard: - “Apple” → AAPL, not APLE (Appel Petroleum) - “Meta” → META, not MSTR (which some people call “meta”) - “Delta” → DAL (airline) or is the user talking about delta hedging (options term)? The really nasty cases are ticker changes. Facebook became META in 2021. Google restructured under GOOG/GOOGL. Twitter became X (but kept the legal entity). When a user asks “What happened to Facebook stock in 2023?”, you need to know that FB → META, and that historical data before Oct 2021 lives under the old ticker. We maintain a ticker history table and test cases for every major rename in the last decade. Fiscal period hell. 
This is where most financial agents silently fail: - Apple’s Q1 is October-December (fiscal year ends in September) - Microsoft’s Q2 is October-December (fiscal year ends in June) - Most companies Q1 is January-March (calendar year) “Last quarter” on January 15th means: - Q4 2024 for calendar-year companies - Q1 2025 for Apple (they just reported) - Q2 2025 for Microsoft (they’re mid-quarter) We maintain fiscal calendars for 10,000+ companies. Every period reference gets normalized to absolute date ranges. We have 200+ test cases just for period extraction. Numeric precision. Revenue of $4.2B vs $4,200M vs $4.2 billion vs “four point two billion.” All equivalent. But “4.2” alone is wrong—missing units. Is it millions? Billions? Per share? We test unit inference, magnitude normalization, and currency handling. A response that says “revenue was 4.2” without units fails the eval, even if 4.2B is correct. Adversarial grounding. We inject fake numbers into context and verify the model cites the real source, not the planted one. Example: We include a fake analyst report stating “Apple revenue was $50B” alongside the real 10-K showing $94B. If the agent cites $50B, it fails. If it cites $94B with proper source attribution, it passes. We have 50 test cases specifically for hallucination resistance. Eval-driven development. Every skill has a companion eval. The DCF skill has 40 test cases covering WACC edge cases, terminal value sanity checks, and stock-based compensation add-backs (models forget this constantly). PR blocked if eval score drops >5%. No exceptions. Our production setup looks like this: We auto-file GitHub issues for production errors. Error happens, issue gets created with full context: conversation ID, user info, traceback, links to Braintrust traces and Temporal workflows. Paying customers get `priority:high` label. Model routing by complexity: simple queries use Haiku (cheap), complex analysis uses Sonnet (expensive). 
Enterprise users always get the best model. The biggest lesson isn’t about sandboxes or skills or streaming. It’s this: The model is not your product. The experience around the model is your product. Anyone can call Claude or GPT. The API is the same for everyone. What makes your product different is everything else: the data you have access to, the skills you’ve built, the UX you’ve designed, the reliability you’ve engineered and frankly how well you know the industry which is a function of how much time you spend with your customers. Models will keep getting better. That’s great! It means less scaffolding, less prompt engineering, less complexity. But it also means the model becomes more of a commodity. Your moat is not the model. Your moat is everything you build around it. For us, that’s financial data, domain-specific skills, real-time streaming, and the trust we’ve built with professional investors. What’s yours? Thanks for reading! Subscribe for free to receive new posts and support my work. I’ve spent the last two years building AI agents for financial services. Along the way, I’ve accumulated a fair number of battle scars and learnings that I want to share. 
Here’s what I’ll cover:

- The Sandbox Is Not Optional - Why isolated execution environments are essential for multi-step agent workflows
- Context Is the Product - How we normalize heterogeneous financial data into clean, searchable context
- The Parsing Problem - The hidden complexity of extracting structured data from adversarial SEC filings
- Skills Are Everything - Why markdown-based skills are becoming the product, not the model
- The Model Will Eat Your Scaffolding - Designing for obsolescence as models improve
- The S3-First Architecture - Why S3 beats databases for file storage and user data
- The File System Tools - How ReadFile, WriteFile, and Bash enable complex financial workflows
- Temporal Changed Everything - Reliable long-running tasks with proper cancellation handling
- Real-Time Streaming - Building responsive UX with delta updates and interactive agent workflows
- Evaluation Is Not Optional - Domain-specific evals that catch errors before they cost money
- Production Monitoring - The observability stack that keeps financial agents reliable

Why financial services is extremely hard. This domain doesn’t forgive mistakes. Numbers matter. A wrong revenue figure, a misinterpreted guidance statement, an incorrect DCF assumption. Professional investors make million-dollar decisions based on our output. One mistake on a $100M position and you’ve destroyed trust forever. The users are also demanding. Professional investors are some of the smartest, most time-pressed people you’ll ever work with. They spot bullshit instantly. They need precision, speed, and depth. You can’t hand-wave your way through a valuation model or gloss over nuances in an earnings call. This forces me to develop an almost paranoid attention to detail. Every number gets double-checked. Every assumption gets validated. Every model gets stress-tested. You start questioning everything the LLM outputs because you know your users will.
A single wrong calculation in a DCF model and you lose credibility forever. I sometimes feel that the fear of being wrong becomes our best feature. Over the years building with LLMs, we’ve made bold infrastructure bets early, and I think we have been right. For instance, when Claude Code launched with its filesystem-first agentic approach, we immediately adopted it. It was not an obvious bet, and it was a massive revamp of our architecture. I was extremely lucky to have Thariq from Anthropic’s Claude Code team jump on a Zoom and open my eyes to the possibilities. At the time, the whole industry, including Fintool, was building elaborate RAG pipelines with vector databases and embeddings. After reflecting on the future of information retrieval with agents, I wrote “The RAG Obituary” and Fintool moved fully to agentic search. We even decided to retire our precious embedding pipeline. Sad, but whatever is best for the future! People thought we were crazy. The article got a lot of praise and a lot of negative comments. Now I feel most startups are adopting these best practices. I believe we’re early on several other architectural choices too. I’m sharing them here because the best way to test ideas is to put them out there. Let’s start with the biggest one. The Sandbox Is Not Optional When we first started building Fintool in 2023, I thought sandboxing might be overkill. “We’re just running Python scripts,” I told myself. “What could go wrong?” Haha. Everything. Everything could go wrong. The first time an LLM decided to `rm -rf /` on our server (it was trying to “clean up temporary files”), I became a true believer. Here’s the thing: agents need to run multi-step operations. A professional investor asks for a DCF valuation, and that’s not a single API call. The agent needs to research the company, gather financial data, build a model in Excel, run sensitivity analysis, generate complex charts, iterate on assumptions.
That’s dozens of steps, each potentially modifying files, installing packages, running scripts. You can’t do this without code execution. And executing arbitrary code on your servers is insane. Every chat application needs a sandbox. Today, each user gets their own isolated environment. The agent can do whatever it wants in there. Delete everything? Fine. Install weird packages? Go ahead. It’s your sandbox; knock yourself out. The architecture looks like this: three mount points. Private is read/write for your stuff. Shared is read-only for your organization. Public is read-only for everyone. The magic is in the credentials. We use AWS ABAC (Attribute-Based Access Control) to generate short-lived credentials scoped to specific S3 prefixes. User A literally cannot access User B’s data. The IAM policy uses `${aws:PrincipalTag/S3Prefix}` to restrict access. The credentials physically won’t allow it. This also works well for enterprise deployments. We also do sandbox pre-warming. When a user starts typing, we spin up their sandbox in the background. By the time they hit enter, the sandbox is ready. 600-second timeout, extended by 10 minutes on each tool use. The sandbox stays warm across conversation turns. Sandboxes are amazing, but their under-discussed magic is filesystem support. Which brings us to the next lesson: context. Context Is the Product Your agent is only as good as the context it can access. The real work isn’t prompt engineering; it’s turning messy financial data from dozens of sources into clean, structured context the model can actually use. This requires deep domain expertise from the engineering team. The heterogeneity problem.
Financial data comes in every format imaginable:

- SEC filings: HTML with nested tables, exhibits, signatures
- Earnings transcripts: speaker-segmented text with Q&A sections
- Press releases: semi-structured HTML from PRNewswire
- Research reports: PDFs with charts and footnotes
- Market data: Snowflake/databases with structured numerical data
- News: articles with varying quality and structure
- Alternative data: satellite imagery, web traffic, credit card panels
- Broker research: proprietary PDFs with price targets and models
- Fund filings: 13F holdings, proxy statements, activist letters

Each source has different schemas, different update frequencies, different quality levels. The agent needs one thing: clean context it can reason over. The normalization layer. Everything becomes one of three formats:

- Markdown for narrative content (filings, transcripts, articles)
- CSV/tables for structured data (financials, metrics, comparisons)
- JSON metadata for searchability (tickers, dates, document types, fiscal periods)

Chunking strategy matters. Not all documents chunk the same way:

- 10-K filings: section by regulatory structure (Item 1, 1A, 7, 8...)
- Earnings transcripts: chunk by speaker turn (CEO remarks, CFO remarks, Q&A by analyst)
- Press releases: usually small enough to be one chunk
- News articles: paragraph-level chunks
- 13F filings: by holder and position changes quarter-over-quarter

The chunking strategy determines what context the agent retrieves. Bad chunks = bad answers. Tables are special. Financial data is full of tables and CSVs. Revenue breakdowns, segment performance, guidance ranges. LLMs are surprisingly good at reasoning over markdown tables. But they’re terrible at reasoning over HTML `<table>` tags or raw CSV dumps. The normalization layer converts everything to clean markdown tables. Metadata enables retrieval. The user asks the agent: “What did Apple say about services revenue in their last earnings call?
” To answer this, Fintool needs:

- Ticker resolution (AAPL → correct company)
- Document type filtering (earnings transcript, not 10-K)
- Temporal filtering (most recent, not 2019)
- Section targeting (CFO remarks or revenue discussion, not legal disclaimers)

This is why `meta.json` exists for every document. Without structured metadata, you’re doing keyword search over a haystack. It speeds up the search, big time! Anyone can call an LLM API. Not everyone has normalized decades of financial data into searchable, chunked markdown with proper metadata. The data layer is what makes agents actually work. The Parsing Problem Normalizing financial data is 80% of the work. Here’s what nobody tells you. SEC filings are adversarial. They’re not designed for machine reading. They’re designed for legal compliance:

- Tables span multiple pages with repeated headers
- Footnotes reference exhibits that reference other footnotes
- Numbers appear in text, tables, and exhibits—sometimes inconsistently
- XBRL tags exist but are often wrong or incomplete
- Formatting varies wildly between filers (every law firm has their own template)

We tried off-the-shelf PDF/HTML parsers. They failed on:

- Multi-column layouts in proxy statements
- Nested tables in MD&A sections (tables within tables within tables)
- Watermarks and headers bleeding into content
- Scanned exhibits (still common in older filings and attachments)
- Unicode issues (curly quotes, em-dashes, non-breaking spaces)

The Fintool parsing pipeline:

Raw Filing (HTML/PDF)
↓ Document structure detection (headers, sections, exhibits)
↓ Table extraction with cell relationship preservation
↓ Entity extraction (companies, people, dates, dollar amounts)
↓ Cross-reference resolution (Ex. 10.1 → actual exhibit content)
↓ Fiscal period normalization (FY2024 → Oct 2023 to Sep 2024 for Apple)
↓ Quality scoring (confidence per extracted field)

Table extraction deserves its own work. Financial tables are dense with meaning.
A revenue breakdown table might have:

- Merged header cells spanning multiple columns
- Footnote markers (1), (2), (a), (b) that reference explanations below
- Parentheses for negative numbers: $(1,234) means -1234
- Mixed units in the same table (millions for revenue, percentages for margins)
- Prior period restatements in italics or with asterisks

We score every extracted table on:

- Cell boundary accuracy (did we split/merge correctly?)
- Header detection (is row 1 actually headers, or is there a title row above?)
- Numeric parsing (is “$1,234” parsed as 1234 or left as text?)
- Unit inference (millions? billions? per share? percentage?)

Tables below 90% confidence get flagged for review. Low-confidence extractions don’t enter the agent’s context—garbage in, garbage out. Fiscal period normalization is critical. “Q1 2024” is ambiguous:

- Calendar Q1 (January-March 2024)
- Apple’s fiscal Q1 (October-December 2023)
- Microsoft’s fiscal Q1 (July-September 2023)
- “Reported in Q1” (filed in Q1, but covers the prior period)

We maintain a fiscal calendar database for 10,000+ companies. Every date reference gets normalized to absolute date ranges. When the agent retrieves “Apple Q1 2024 revenue,” it knows to look for data from October-December 2023. This is invisible to users but essential for correctness. Without it, you’re comparing Apple’s October revenue to Microsoft’s January revenue and calling it “same quarter.” Skills Are Everything Here’s the thing nobody tells you about building AI agents: the model is not the product. The skills are now the product. I learned this the hard way. We used to try making the base model “smarter” through prompt engineering. Tweak the system prompt, add examples, write elaborate instructions. It helped a little. But skills were the missing part. In October 2025, Anthropic formalized this with Agent Skills, a specification for extending Claude with modular capability packages.
A skill is a folder containing a `SKILL.md` file with YAML frontmatter (name and description), plus any supporting scripts, references, or data files the agent might need. We’d been building something similar for months before the announcement. The validation felt good, but more importantly, having an industry standard means our skills can eventually be portable. Without skills, models are surprisingly bad at domain tasks. Ask a frontier model to do a DCF valuation. It knows what DCF is. It can explain the theory. But actually executing one? It will miss critical steps, use wrong discount rates for the industry, forget to add back stock-based compensation, skip sensitivity analysis. The output looks plausible but is subtly wrong in ways that matter. The breakthrough came when we started thinking about skills as first-class citizens. Like part of the product itself. A skill is a markdown file that tells the agent how to do something specific. Our DCF skill, for example, is just step-by-step instructions: which discount rates to use by industry, a reminder to add back stock-based compensation, a required sensitivity analysis. That’s it. A markdown file. No code changes. No production deployment. Just a file that tells the agent what to do. Skills are better than code. This matters enormously:

1. Non-engineers can create skills. Our analysts write skills. Our customers write skills. A portfolio manager who’s done 500 DCF valuations can encode their methodology in a skill without writing a single line of Python.

2. No deployment needed. Change a skill file and it takes effect immediately. No CI/CD, no code review, no waiting for release cycles. Domain experts can iterate on their own.

3. Readable and auditable. When something goes wrong, you can read the skill and understand exactly what the agent was supposed to do. Try doing that with a 2,000-line Python module.

We have a copy-on-write shadowing system. Priority: private > shared > public. So if you don’t like how we do DCF valuations, write your own. Drop it in `/private/skills/dcf/SKILL.md`. Your version wins.
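As a rough illustration only, a minimal skill file in the shape described (YAML frontmatter with name and description, then plain instructions) might look like the following. The steps below are a hypothetical sketch of a DCF skill, not Fintool’s actual file:

```markdown
---
name: dcf-valuation
description: Build a discounted cash flow valuation for a public company
---

# DCF Valuation

1. Pull historical revenue, EBIT, and free cash flow from the latest filings.
2. Project free cash flows; state every growth assumption explicitly.
3. Use an industry-appropriate discount rate (see the industry guideline files).
4. Add back stock-based compensation. Do not skip this.
5. Sanity-check the terminal value against comparable multiples.
6. Always finish with a sensitivity analysis on WACC and terminal growth.
```

Everything the model needs is plain text: a domain expert can tighten step 4 without touching code or waiting for a deploy.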
Why we don’t mount all skills to the filesystem. This is important. The naive approach would be to mount every skill file directly into the sandbox. The agent can just `cat` any skill it needs. Simple, right? Wrong. Here’s why we use SQL discovery instead:

1. Lazy loading. We have dozens of skills with extensive documentation; the DCF skill alone has 10+ industry guideline files. Loading all of them into context for every conversation would burn tokens and confuse the model. Instead, we discover skill metadata (name, description) upfront, and only load the full documentation when the agent actually uses that skill.

2. Access control at query time. The SQL query implements our three-tier access model: public skills available to everyone, organization skills for that org’s users, private skills for individual users. The database enforces this. You can’t accidentally expose a customer’s proprietary skill to another customer.

3. Shadowing logic. When a user customizes a skill, their version needs to override the default. SQL makes this trivial—query all three levels, apply priority rules, return the winner. Doing this with filesystem mounts would be a nightmare of symlinks and directory ordering.

4. Metadata-driven filtering. The `fs_files.metadata` column stores parsed YAML frontmatter. We can filter by skill type, check if a skill is main-agent-only, or query any other structured attribute—all without reading the files themselves.

The pattern: S3 is the source of truth, a Lambda function syncs changes to PostgreSQL for fast queries, and the agent gets exactly what it needs when it needs it. Skills are essential. I cannot emphasize this enough. If you’re building an AI agent and you don’t have a skills system, you’re going to have a bad time. My biggest argument for skills is that top models (Claude or GPT) are post-trained on using skills. The model wants to fetch skills. Models just want to learn, and what they want to learn is our skills... until they eat them.
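The SQL discovery and shadowing described above can be sketched as a toy: query every tier, order by priority, and let the highest-priority row win. This uses SQLite and an illustrative schema loosely modeled on the `fs_files` table mentioned in the text; it is an assumption, not Fintool’s actual system:

```python
import sqlite3

# Toy version of SQL-based skill discovery with three-tier shadowing.
# Table shape is illustrative, loosely modeled on the fs_files table
# described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fs_files (
        skill_name TEXT,
        tier       TEXT,  -- 'private', 'shared', or 'public'
        path       TEXT
    )
""")
conn.executemany(
    "INSERT INTO fs_files VALUES (?, ?, ?)",
    [
        ("dcf",   "public",  "/public/skills/dcf/SKILL.md"),
        ("dcf",   "private", "/private/skills/dcf/SKILL.md"),
        ("comps", "public",  "/public/skills/comps/SKILL.md"),
    ],
)

def discover_skills(conn):
    """Return one path per skill, applying private > shared > public priority."""
    rows = conn.execute("""
        SELECT skill_name, tier, path
        FROM fs_files
        ORDER BY skill_name,
                 CASE tier WHEN 'private' THEN 0 WHEN 'shared' THEN 1 ELSE 2 END
    """).fetchall()
    winners = {}
    for name, tier, path in rows:
        winners.setdefault(name, path)  # first (highest-priority) row wins
    return winners

print(discover_skills(conn))
# The user's /private copy of the dcf skill shadows the /public default.
```

The real version would also filter by user and organization ID for the access-control tiers; the priority `CASE` expression is the whole shadowing trick.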
The Model Will Eat Your Scaffolding Here’s the uncomfortable truth: everything I just told you about skills? It’s temporary, in my opinion. Models are getting better. Fast. Every few months, there’s a new model that makes half your code obsolete. The elaborate scaffolding you built to handle edge cases? The model just... handles them now. When we started, we needed detailed skills with step-by-step instructions for some simple tasks. “First do X, then do Y, then check Z.” Now? For a simple task we can often just say “do an earnings preview” and the model figures it out (kind of!). This creates a weird tension. You need skills today because current models aren’t smart enough. But you should design your skills knowing that future models will need less hand-holding. That’s why I’m bullish on markdown files versus code for model instructions. They’re easier to update and delete. We send detailed feedback to AI labs. Whenever we build complex scaffolding to work around model limitations, we document exactly what the model struggles with and share it with the lab’s research team. This helps inform the next generation of models. The goal is to make our own scaffolding obsolete. My prediction: in two years, most of our basic skills will be one-liners. “Generate a 20-tab DCF.” That’s it. The model will know what that means. But here’s the flip side: as basic tasks get commoditized, we’ll push into more complex territory. Multi-step valuations with segment-by-segment analysis. Automated backtesting of investment strategies. Real-time portfolio monitoring with complex triggers. The frontier keeps moving. So we write skills. We delete them when they become unnecessary. And we build new ones for the harder problems that emerge. And all of it is just files... in our filesystem. The S3-First Architecture Here’s something that surprised me: S3 for files is a better database than a database. We store user data (watchlists, portfolios, preferences, memories, skills) in S3 as YAML files.
S3 is the source of truth. A Lambda function syncs changes to PostgreSQL for fast queries.

Writes → S3 (source of truth) → Lambda trigger → PostgreSQL (fs_files table) → fast read queries

Why?

- Durability: S3 has 11 9’s. A database doesn’t.
- Versioning: S3 versioning gives you audit trails for free.
- Simplicity: YAML files are human-readable. You can debug with `cat`.
- Cost: S3 is cheap. Database storage is not.

The pattern:

- Writes go to S3 directly
- List queries hit the database (fast)
- Single-item reads go to S3 (freshest data)

The sync architecture. We run two Lambda functions to keep S3 and PostgreSQL in sync:

- S3 (file upload/delete) → SNS topic → fs-sync Lambda → upsert/delete in the fs_files table (real-time)
- EventBridge (every 3 hours) → fs-reconcile Lambda → full S3 vs. DB scan, fixing discrepancies

Both use upsert with timestamp guards—newer data always wins. The reconcile job catches any events that slipped through (S3 eventual consistency, Lambda cold starts, network blips).

User memories live here too. Every user has a `/private/memories/UserMemories.md` file in S3. It’s just markdown—users can edit it directly in the UI. On every conversation, we load it and inject it as context. This is surprisingly powerful. Users write things like “I focus on small-cap value stocks” or “Always compare to industry median, not mean” or “My portfolio is concentrated in tech, so flag concentration risk.” The agent sees this on every conversation and adapts accordingly. No migrations. No schema changes. Just a markdown file that the user controls. Watchlists work the same way. YAML files in S3, synced to PostgreSQL for fast queries. When a user asks about “my watchlist,” we load the relevant tickers and inject them as context. The agent knows what companies matter to this user. The filesystem becomes the user’s personal knowledge base. Skills tell the agent how to do things. Memories tell it what the user cares about. Both are just files.
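The “upsert with timestamp guards” step is small enough to sketch. A hypothetical version of what the fs-sync Lambda does (illustrative function and field names; in production this would be a single SQL `INSERT ... ON CONFLICT` with the same guard):

```python
# Sketch: apply a sync event only if it is newer than what the mirror
# already holds, so out-of-order or duplicate Lambda deliveries are safe.
def apply_sync_event(mirror, event):
    """Upsert one S3 file record into the fs_files mirror, guarding on timestamp."""
    existing = mirror.get(event["key"])
    if existing is not None and existing["updated_at"] >= event["updated_at"]:
        return False  # stale or duplicate event: newer data already wins
    mirror[event["key"]] = {"updated_at": event["updated_at"], "body": event["body"]}
    return True

mirror = {}
apply_sync_event(mirror, {"key": "a.yaml", "updated_at": 2, "body": "new"})
apply_sync_event(mirror, {"key": "a.yaml", "updated_at": 1, "body": "old"})  # ignored
```

Because the guard is idempotent, the 3-hourly reconcile job can replay everything it finds without ever clobbering fresher data.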
The File System Tools

Agents in financial services need to read and write files. A lot of files. PDFs, spreadsheets, images, code. Here’s how we handle it. ReadFile handles the complexity. WriteFile creates artifacts that link back to the UI. Bash gives persistent shell access with a 180-second timeout and a 100K-character output limit. Path normalization on everything (LLMs love trying path traversal attacks; it’s hilarious). Bash is more important than you think. There’s a growing conviction in the AI community that filesystems and bash are the optimal abstraction for AI agents. Braintrust recently ran an eval comparing SQL agents, bash agents, and hybrid approaches for querying semi-structured data. The results were interesting: pure SQL hit 100% accuracy but missed edge cases. Pure bash was slower and more expensive but caught verification opportunities. The winner? A hybrid approach where the agent uses bash to explore and verify, and SQL for structured queries. This matches our experience. Financial data is messy. You need bash to grep through filing documents, find patterns, explore directory structures. But you also need structured tools for the heavy lifting. The agent needs both—and the judgment to know when to use each. We’ve leaned hard into giving agents full shell access in the sandbox. It’s not just for running Python scripts. It’s for exploration, verification, and the kind of ad-hoc data manipulation that complex tasks require. But complex tasks mean long-running agents. And long-running agents break everything.

Temporal Changed Everything

Before Temporal, our long-running tasks were a disaster. User asks for a comprehensive company analysis. That takes 5 minutes. What if the server restarts? What if the user closes the tab and comes back? What if... anything? We had a homegrown job queue. It was bad. Retries were inconsistent. State management was a nightmare. Then we switched to Temporal and I wanted to cry tears of joy! That’s it.
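The path normalization guard on the file tools is worth making concrete. A minimal sketch (hypothetical helper and root, not our actual implementation) that resolves a model-requested path and refuses anything escaping the sandbox:

```python
from pathlib import Path

SANDBOX_ROOT = Path("/tmp/agent-sandbox")  # illustrative sandbox root

def normalize_path(requested: str) -> Path:
    """Resolve a path requested by the LLM; refuse traversal outside the sandbox
    (e.g. '../../etc/passwd' or absolute paths like '/etc/passwd')."""
    root = SANDBOX_ROOT.resolve()
    # Joining an absolute path replaces the root; resolve() collapses '..'
    # segments, so the containment check below catches both escape tricks.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"path escapes sandbox: {requested}")
    return candidate

normalize_path("reports/q3.md")        # fine
# normalize_path("../../etc/passwd")   # raises PermissionError
```

The order matters: resolve first, then check containment on the resolved path, so `..` segments can’t sneak past a naive string-prefix check.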
Temporal handles worker crashes, retries, everything. If a Heroku dyno restarts mid-conversation (happens all the time lol), Temporal automatically retries on another worker. The user never knows. The cancellation handling is the tricky part. The user clicks “stop”—what happens? The activity is already running on a different server. We use heartbeats sent every few seconds. We run two worker types:

- Chat workers: user-facing, 25 concurrent activities
- Background workers: async tasks, 10 concurrent activities

They scale independently. Chat traffic spikes? Scale chat workers. Next is speed.

Real-Time Streaming

In finance, people are impatient. They’re not going to wait 30 seconds staring at a loading spinner. They need to see something happening. So we built real-time streaming. The agent works, you see the progress.

Agent → SSE events → Redis Stream → API → Frontend

The key insight: delta updates, not full state. Instead of sending “here’s the complete response so far” (expensive), we send “append these 50 characters” (cheap). Streaming rich content with Streamdown. Text streaming is table stakes. The harder problem is streaming rich content: markdown with tables, charts, citations, math equations. We use Streamdown to render markdown as it arrives, with custom plugins for our domain-specific components. Charts render progressively. Citations link to source documents. Math equations display properly with KaTeX. The user sees a complete, interactive response building in real-time. AskUserQuestion: interactive agent workflows. Sometimes the agent needs user input mid-workflow. “Which valuation method do you prefer?” “Should I use consensus estimates or management guidance?” “Do you want me to include the pipeline assets in the valuation?”
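The delta-update idea is simple enough to sketch. A hypothetical emitter that turns successive snapshots of the agent’s response into cheap append events (our real pipeline speaks SSE over Redis; this only shows the diffing):

```python
def delta_events(snapshots):
    """Convert successive full-text snapshots into append-only deltas.
    Assumes each snapshot extends the previous one (append-only output)."""
    sent = ""
    for snapshot in snapshots:
        delta = snapshot[len(sent):]  # only the characters not yet delivered
        if delta:
            yield {"event": "append", "data": delta}
        sent = snapshot

events = list(delta_events([
    "The DCF",
    "The DCF values NVDA",
    "The DCF values NVDA at $190.",
]))
# Each event carries only the new characters, never the whole response.
```

The frontend just concatenates `data` fields in order, so a 50-character append costs 50 characters on the wire instead of the full transcript.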


Model-Market Fit

In June 2007, Marc Andreessen published what became the defining essay on startup strategy. “The Only Thing That Matters” argued that of the three elements of a startup—team, product, and market—market matters most. A great market pulls the product out of the startup. The product doesn’t need to be great; it just has to basically work. Andreessen’s insight has guided a generation of founders. But nineteen years later, something has changed. A new variable has entered the equation. One that determines whether the market can pull anything at all. That variable is the model. For AI startups, there is a prerequisite layer beneath product-market fit: the degree to which current model capabilities can satisfy what a market demands. I call it Model-Market Fit, or MMF. When MMF exists, Andreessen’s framework applies perfectly. The market pulls the product out. When it doesn’t, no amount of brilliant UX, go-to-market strategy, or engineering can make customers adopt a product whose core AI task doesn’t solve their job to be done. The pattern is unmistakable once you see it. A model crosses a capability threshold. Within months, a vertical that had been dormant for years suddenly explodes with activity. For years, legal tech AI was stuck below scale. There were plenty of companies but none broke through. Document review tools that required more human oversight than they saved. Contract analysis that missed critical clauses. Every legal startup before 2023 struggled to cross $100M ARR. I remember this firsthand. I founded Doctrine in 2016, which grew to become the leading AI legal platform in Europe. But it was incredibly hard to raise money because all companies were sub-scale and the market wasn’t hot at all. Investors saw legal AI as a niche with limited upside. The market existed. Law firms desperately wanted automation. But the state-of-the-art models couldn’t handle the core tasks lawyers needed.
BERT and similar transformer models excelled at classification: sorting documents, identifying contract types, flagging potential issues. But legal work requires generation and reasoning: drafting memos that synthesize complex case law, summarizing depositions while preserving nuanced arguments, generating discovery requests tailored to specific fact patterns. Traditional ML could categorize a contract as “employment” or “NDA,” but it couldn’t write a coherent brief explaining why a non-compete clause was unenforceable under California law. Then GPT-4 arrived in March 2023. Within eighteen months, Silicon Valley startups raised hundreds of millions. Doctrine’s business is on fire. Thomson Reuters acquired Casetext for $650 million. Dozens of legal AI startups emerged. The legal AI market minted more unicorns in 12 months than in the previous 10 years combined. The market hadn’t changed. The model capability threshold had been crossed. Similarly, coding assistants existed before Sonnet. GitHub Copilot had millions of users. But there’s a difference between autocomplete that occasionally helps and an AI that genuinely understands your codebase and creates high-quality code for you. I experienced this firsthand. I tried Cursor early on, before Sonnet. It was meh. I installed it, tested it for a few days, deleted it. Did the same thing again a month later. Same result... interesting demo, not a workflow. Then Claude 3.5 Sonnet dropped! Within a week, I couldn’t work without Cursor. Neither could anyone on my team. The product became the workflow. We weren’t “using an AI assistant,” we were pair programming with something that understood our entire codebase. Cursor’s growth went vertical. Not because they shipped some brilliant new feature. Because the underlying model crossed the threshold that made their product actually work. They got Model-Market Fit. The most important thing is MMF.
The startups that won weren’t necessarily first, but they were prepared when the model capability threshold was finally crossed. So far, in coding and legal, none of the incumbents have won. It was always new players. Today’s leading legal startups had spent months understanding exactly how lawyers work: what output formats they need, what compliance requirements exist, how associates actually research cases. The race doesn’t go to the first mover. It goes to the first to product-market fit after model-market fit exists. The corollary is equally important: when MMF doesn’t exist, the market cannot pull. The demand is there. The willingness to pay is there. But the core task doesn’t work. Let’s review some examples. Mathematicians would love an AI that could prove novel theorems. The market is real: research institutions, defense contractors, and tech companies would pay millions for genuine mathematical reasoning. But even the most advanced models can’t do it consistently. They can verify known proofs. They can assist with mechanical steps. They can occasionally produce insights on bounded problems. But originating novel proofs on open problems? The capability threshold remains uncrossed. GPT-5, o1, o3... each generation improves incrementally, but we’re not at the point where you can feed an AI an open conjecture and expect a rigorous proof. Yet. Investment banks and hedge funds desperately want AI that can perform comprehensive financial analysis. The market is massive; a single successful trade or M&A deal can generate hundreds of millions in fees. But AI remains surprisingly bad at the core tasks that matter most. Excel output is still unreliable when dealing with complex financial models. More critically, AI struggles to combine quantitative analysis with qualitative insights from 200-page documents... exactly what analysts spend their days doing.
A human analyst reads through earnings calls, regulatory filings, and industry reports, then synthesizes that qualitative intelligence with spreadsheet models to make investment recommendations. AI can handle pieces of this workflow, but the end-to-end reasoning that justifies million-dollar positions? The capability gap is wide today. This will obviously change soon. But for now, the human remains in the loop not as oversight, but as the primary decision-maker. The difference between verticals with MMF and those without is stark. Compare two benchmarks from Vals.ai:

- LegalBench (legal reasoning tasks): top models hit 87% accuracy. Gemini 3 Pro leads at 87.04%, with multiple models clustered above 85%. This is production-grade performance. Accurate enough that lawyers can trust the output with light review.
- Finance Agent (core financial analyst tasks): top models hit 56.55% accuracy. Even GPT-5.1, the current leader, barely crosses the halfway mark. Claude Sonnet 4.5 with extended thinking sits at 55.32%.

That’s a 30-point gap. Legal has MMF. Finance doesn’t. The benchmarks reveal what intuition suggests: models have crossed the threshold for legal reasoning but remain fundamentally unreliable for financial analysis. You can ship a legal AI product today. A finance AI product that does the actual job of an analyst? Very soon, but not now. The pharmaceutical industry has invested billions in AI-driven drug discovery. The market is enormous because a single successful drug is worth tens of billions. Yet the breakthroughs remain elusive. AI can accelerate certain steps: identifying candidate molecules, predicting protein structures (AlphaFold was transformative here), optimizing clinical trial design. But the end-to-end autonomous discovery that would justify the valuations? It doesn’t exist. The human remains in the loop not because the workflow is designed that way, but because the AI can’t actually do the job.
There’s a reliable signal for missing MMF: examine how “human-in-the-loop” is positioned. When MMF exists, human-in-the-loop is a feature. It maintains quality, builds trust, handles edge cases. The AI does the work; the human provides oversight. When MMF doesn’t exist, human-in-the-loop is a crutch. It hides the fact that the AI can’t perform the core task. The human isn’t augmenting, they’re compensating. Strip away the human, and the product doesn’t work. The test is simple: if all human correction were removed from this workflow, would customers still pay? If the answer is no, there’s no MMF. There’s only a demo. This creates a brutal strategic dilemma. Do you build for current MMF or anticipated MMF? If MMF doesn’t exist today, building a startup around it means betting on model improvements that are on someone else’s roadmap. You don’t control when or whether the capability arrives. You’re burning runway while Anthropic and OpenAI decide your fate. Worse, you might be wrong about what capability is needed. Models might scale differently than you expect. The 80%-to-99% accuracy gap that your vertical requires might be five years away, or it might never close in the way you imagined. Of course, if you believe in Artificial General Intelligence, then you know that models will eventually be able to do pretty much anything. But “eventually” is doing a lot of work in that sentence. The question isn’t whether AI will solve the problem; it’s when, and whether your startup survives long enough to see it (which is a function of your runway). But there’s a counterargument often shared at Y Combinator, and it’s compelling. When MMF unlocks, you need more than just model capability. You need:

- Domain-specific data pipelines
- Regulatory relationships
- Customer trust built over years
- Deep workflow integration
- Understanding of how professionals actually work

Legal startups didn’t just plug in GPT-4. They had already built the scaffolding.
When the model arrived, they were ready to run. There’s also the question of influence. The teams closest to the problem shape how models get evaluated, fine-tuned, and deployed. They’re not passively waiting for capability; they’re defining what capability means in their vertical. The question isn’t whether to be early. It’s how early, and what you’re building while you wait. The dangerous zone is the middle: MMF that’s 24 to 36 months away. Close enough to seem imminent. Far enough to burn through multiple funding rounds waiting. This is where conviction and runway become everything. If you’re betting on MMF that’s 2+ years out, you’d better be in a gigantic market worth the wait. Consider healthcare and financial services. These markets are so massive that even Anthropic and OpenAI are going all-in despite very mixed current results. The potential upside justifies positioning early, even if the models aren’t quite there yet. When you’re targeting trillion-dollar markets, the risk-reward calculation changes entirely. The math is simple: expected value = probability of MMF arriving × market size × your likely share. Product-market fit has famously resisted precise measurement. Andreessen described it qualitatively: “You can always feel when product/market fit isn’t happening... And you can always feel product/market fit when it’s happening.” MMF is similarly intuitive, but we can be more specific. Can the model, given the same inputs a human expert receives, produce output that a customer would pay for without significant human correction? This test has three components:

1. Same inputs: The model gets what the human would get—documents, data, context. No magical preprocessing that a real workflow couldn’t provide.
2. Output a customer would pay for: Not a demo. Not a proof of concept. Production-quality work that solves a real problem.
3. Without significant human correction: The human might review, refine, or approve.
But if they’re rewriting 50% of the output, the model isn’t doing the job. In unregulated verticals, 80% accuracy might be enough. An AI that writes decent first drafts of marketing copy creates value even if humans edit heavily. In regulated verticals—finance, legal, healthcare—80% accuracy is often useless. A contract review tool that misses 20% of critical clauses isn’t augmenting lawyers; it’s creating liability. A medical diagnostic that’s wrong one time in five isn’t a product; it’s a lawsuit, haha! The gap between 80% and 99% accuracy is often infinite in practice. It’s the difference between “promising demo” and “production system.” Many AI startups are stuck in this gap, raising money on demos while waiting for the capability that would make their product actually work. There’s a second capability frontier that most discussions of MMF miss: the ability to work autonomously over extended periods. Current MMF examples (legal document review, coding assistance) are fundamentally short-horizon tasks today. Prompt in, output out, maybe a few tool calls. The model does something useful in seconds or minutes. But the highest-value knowledge work isn’t like that. A financial analyst doesn’t answer one question; they spend days building a model, stress-testing assumptions, and synthesizing information across dozens of sources. A strategy consultant doesn’t produce a single slide; they iterate through weeks of research, interviews, and analysis. A drug discovery researcher doesn’t run one experiment; they design and execute campaigns spanning months. These workflows require something models can’t yet do reliably: sustained autonomous operation. The agentic threshold isn’t just “can the model use tools.” It’s:

- Persistence: Can it maintain goals and context across hours or days?
- Recovery: Can it recognize failures, diagnose problems, and try alternative approaches?
- Coordination: Can it break complex objectives into subtasks and execute them in sequence?
- Judgment: Can it know when to proceed versus when to stop and ask for guidance?

Today’s agents can handle tasks measured in minutes. Tomorrow’s need to handle tasks measured in days. That’s not an incremental improvement—it’s a phase change in capability. This is why finance doesn’t have MMF despite models being “good at reading documents.” Reading a 10-K is a 30-second task. Building an investment thesis is a multi-day workflow requiring the agent to gather data, build models, test scenarios, and synthesize conclusions—all while maintaining coherent reasoning across the entire process. The next wave of MMF unlocks will come from smarter models AND models that can work for days on the same task. Andreessen’s core insight was that market matters more than team or product because a great market pulls the product out of the startup. The market creates the gravitational force. The AI corollary: model capability is the prerequisite for that gravitational pull to begin. No market, however large and hungry, can pull a product that doesn’t work. And in AI, “doesn’t work” is determined by the model, not by your engineering or design. You can build the most beautiful interface, the most elegant workflow, the most sophisticated data pipeline… and if the underlying model can’t perform the core task, none of it matters. MMF → PMF → Success. Skip the first step, and the second becomes impossible. This is both constraint and opportunity. For founders, it means being ruthlessly honest about where capability actually is versus where you hope it will be. For investors, it means evaluating not just market size and team quality, but the gap between current model capability and what the market requires. And for everyone building in AI: the question isn’t just whether the market wants what you’re building. It’s whether the models can deliver it. That’s the only thing that matters.


Are LLMs Plateauing? No. You Are.

“GPT-5 isn’t that impressive.” People claim the jump from GPT-4o to GPT-5 feels incremental. They’re wrong. LLM intelligence hasn’t plateaued. Their perception of intelligence has. Let me explain with a simple example: translation from French to English. GPT-4o was already effectively at 100% accuracy for this task. Near-perfect translations, proper idioms, cultural context. Just nailed it. Now try o1, o3, GPT-5, or whatever comes next. The result? Still 100% accurate. From your perspective, nothing changed. Zero improvement. The model looks identical. They have plateaued. But here’s the thing: most people’s tasks are dead simple.

- “Do this math for me”
- “Explain this concept”
- “Translate this text”
- “Rewrite that email”

These tasks were already saturated by earlier models. They are testing intelligence on problems that have already been solved. Of course they don’t see progress. They are like someone measuring a rocket’s speed with a car speedometer. Once you hit the max reading, everything looks the same. Intelligence is multi-dimensional. It’s a spectrum of capabilities tested against increasingly difficult tasks. Think about how we measure human intelligence:

- A 5-year-old doing addition → smart kid
- A PhD solving differential equations → brilliant mathematician
- A Fields Medalist proving novel theorems → genius

Same concept, wildly different difficulty levels. You wouldn’t judge the mathematician by giving them 2+2. Yet that’s exactly what we’re doing with LLMs. We test them on tasks that earlier models already maxed out, then declare progress has stopped. Raw LLM intelligence is exploding. But it’s happening at the frontier. On tasks that push the absolute limits of reasoning. Take GPT-5-Pro. It demonstrated the ability to produce novel mathematical proofs. Not “solve this known problem.” Not “explain this proof.” Create new mathematics. Example: in an experiment by Sébastien Bubeck, GPT-5-Pro improved a bound in convex optimization from 1/L to 1.5/L.
It reasoned for 17 minutes to generate a correct proof for an open problem. Read that again. An LLM improved a mathematical bound. It generated original research. This isn’t just solving known problems. The AI is creating new knowledge. We’re approaching a world where AI models will tackle the hardest unsolved problems in mathematics. The Millennium Prize Problems. P vs NP. The Riemann Hypothesis. Problems that have stumped humanity’s greatest minds for decades or centuries. This isn’t incremental. This is a model operating at the level of professional mathematicians. And this capability emerged in the latest generation. But if you’re only asking it to “explain gradient descent” or “fix my Python bug,” you’ll never see this intelligence. You’re testing a Formula 1 car in a parking lot. Current frontier models (GPT-5-Pro, Claude 4.5) can already outperform most humans on most intellectual tasks. Not “simple” tasks. Most tasks.

- Legal analysis? Better than most lawyers.
- Medical diagnosis? Better than most doctors.
- Code review? Better than most senior engineers.
- Financial modeling? Better than most analysts.

And they do it in seconds. No fatigue. No ego. No “I need to look that up.” (Also close to no compensation, lol!) Soon, these models will be smarter than most humans combined. The collective intelligence of humanity, accessible in a chat interface. But here’s what’s missing today: the ability to work over time with tools. A human doesn’t rely on raw brain power alone. You use tools:

- Reading text to gather information
- Writing to organize thoughts
- Maintaining todo lists to track objectives
- Asking for feedback to improve
- Using calculators, spreadsheets, databases, software

Your brain isn’t that powerful in isolation. Your intelligence emerges from orchestrating tools toward a goal. LLMs sucked at this. They were brilliant in a single conversation but couldn’t persist, iterate, or coordinate across time. That’s changing.
The breakthrough isn’t smarter models. It’s models that can orchestrate their intelligence over time. Software engineers experienced that firsthand with coding agents. GPT-5-Codex, the model behind OpenAI’s open-source Codex coding agent, can read, edit, and execute code autonomously. For instance, to refactor a 12,000-line legacy Python project, it will:

- Address dependencies
- Add test coverage
- Fix three race conditions
- Run for 7 hours in a sandboxed environment

This isn’t “write me a function.” This is sustained, multi-step reasoning with tool use. Planning, executing, validating, iterating. The model maintained context, managed a todo list, ran tests, read errors, and adapted. Just like a human engineer would. That’s the leap. Not raw intelligence but applied intelligence. It will take over most valuable knowledge-worker jobs. Here’s where it gets real: the AI Productivity Index (APEX), the first benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX addresses a massive inefficiency in AI research: outside of coding, most benchmarks test toy problems that don’t reflect real work. APEX changes that. APEX-v1.0 contains 200 test cases across four domains:

- Investment banking
- Management consulting
- Law
- Primary medical care

How it was built:

1. Source experts with top-tier experience (e.g., Goldman Sachs investment bankers)
2. Experts create prompts reflecting high-value tasks from their day-to-day work
3. Experts create rubrics for evaluating model responses

This isn’t “explain what a stock is.” It’s “analyze this M&A deal structure and identify regulatory risks in cross-border jurisdictions.” The results? Current models can already answer a significant portion of these questions. Not all, but enough to be economically valuable. Take stock research, for instance. A model can read a 10-K filing and answer questions about it perfectly. At my company Fintool we saturated that benchmark in 2024.
But now the challenge is for our AI to do an investor’s job:

- Monitor earnings calls across hundreds of companies
- Extract precise financial metrics and projections
- Generate comprehensive research reports
- Compare performance across competitors
- Track industry trends over time
- Identify investment opportunities autonomously

Same “intelligence,” radically different capability. The raw LLM power is enhanced with tools. When we tested Fintool-v4 against human equity analysts, we found that our agent was 25x faster and 183x cheaper, with 90% accuracy on expert-level tasks.

What Happens Next

The plateau isn’t in the model. It’s in your benchmark. The next wave isn’t smarter models, it’s models that can actually do things. Even if raw intelligence plateaued tomorrow, expanding agentic capabilities alone would trigger massive economic growth. It’s about:

- Models that can maintain todo lists and execute over weeks
- Models that can read documentation, try solutions, fail, and iterate
- Models that can coordinate with other models and humans
- Models that can ask for help when stuck

And when millions of these agents are deployed, the world changes. Not because the models got smarter. Because they got useful. Intelligence without application is just a party trick. Intelligence with tool use is the revolution. It’s accelerating. Exponentially. But the real action is happening at the edge.


LLMs Eat Scaffolding for Breakfast

We just deleted thousands of lines of code. Again. Each time a new LLM model comes out, it’s the same story. LLMs have limitations, so we build scaffolding around them. Each model introduces new capabilities, so old scaffolding must be deleted and new scaffolding added. But as we move closer to superintelligence, less scaffolding is needed. This post is about what it takes to build successfully in AI today. Every line of scaffolding is a confession: the model wasn’t good enough.

- LLMs can’t read PDFs? Let’s build a complex system to convert PDFs to markdown.
- LLMs can’t do math? Let’s build a compute engine to return accurate numbers.
- LLMs can’t handle structured output? Let’s build complex JSON validators and regex parsers.
- LLMs can’t read images? Let’s use a specialized image-to-text model to describe the image to the LLM.
- LLMs can’t read more than 3 pages? Let’s build a complex retrieval pipeline with a search engine to feed the best content to the LLM.
- LLMs can’t reason? Let’s build chain-of-thought logic with forced step-by-step breakdowns, verification loops, and self-consistency checks.

Etc, etc... millions of lines of code to add external capabilities to the model. But look at models today: GPT-5 is solving frontier mathematics, Grok-4 Fast can read 3,000+ pages with its 2M context window, Claude Sonnet 4.5 can ingest images or PDFs, and all models have native reasoning capabilities and support structured outputs. The once-essential scaffolding is now obsolete. Those capabilities are baked into the model. It’s nearly impossible to predict what scaffolding will become obsolete and when. What appears to be essential infrastructure and industry best practice today can turn into legacy technical debt within months. The best way to grasp how fast LLMs are eating scaffolding is to look at their system prompt (the top-level instruction that tells the AI how to behave).
Comparing the prompts used in Codex, OpenAI’s coding agent, between the o3 and GPT-5 models is mind-blowing.

- o3 prompt: 310 lines
- GPT-5 prompt: 104 lines

The new prompt removed 206 lines. A 66% reduction. GPT-5 needs way less handholding. The old prompt had complex instructions on how to behave as a coding agent (personality, preambles, when to plan, how to validate). The new prompt assumes GPT-5 already knows this and only specifies the Codex-specific technical requirements (sandboxing, tool usage, output formatting). The new prompt removed all the detailed guidance about autonomously resolving queries, coding guidelines, git usage. It’s also less prescriptive. Instead of “do this and this,” it says “here are the tools at your disposal.” As we move closer to superintelligence, the models require more freedom and leeway (scary, lol!). Advanced models require simple instructions and tooling. Claude Code, the most sophisticated agent today, relies on a simple filesystem instead of a complex index and uses bash commands (find, read, grep, glob) instead of complex tools. It moves so fast. Each model introduces a new paradigm shift. If you miss a paradigm shift, you’re dead. Having an edge in building AI applications requires deep technical understanding, insatiable curiosity, and low ego. By the way, because everything changes, it’s good to focus on what won’t change. Context window is how much text you can feed the model in a single conversation. Early models could only handle a couple of pages. Now it’s thousands of pages and it’s growing fast. Dario Amodei, the founder of Anthropic, expects 100M+ context windows, while Sam Altman hinted at billions of context tokens. It means the LLM can see more context, so you need less scaffolding like retrieval-augmented generation.
- November 2022: GPT-3.5 with 4K context
- November 2023: GPT-4 Turbo with 128K context
- June 2024: Claude 3.5 Sonnet with 200K context
- June 2025: Gemini 2.5 Pro with 1M context
- September 2025: Grok-4 Fast with 2M context

Models used to stream at 30-40 tokens per second. Today's fastest models like Gemini 2.5 Flash and Grok-4 Fast hit 200+ tokens per second, a 5x improvement. On specialized AI chips (LPUs), providers like Cerebras push open-source models to 2,000 tokens per second. We're approaching real-time LLMs: full responses to complex tasks in under a second. LLMs are becoming exponentially smarter. With every new model, benchmarks get saturated. On the path to AGI, every benchmark will get saturated. Every job that can be done by AI will be done by AI. As with humans, a key factor in intelligence is the ability to use tools to accomplish an objective. That is the current frontier: how well a model can use tools such as reading, writing, and searching to accomplish a task over a long period of time. This is important to grasp. Models will not improve their language translation skills (they are already at 100%), but they will improve how they chain translation tasks over time to accomplish a goal. For example, you can say, "Translate this blog post into every language on Earth," and the model will work for a couple of hours on its own to make it happen. Tool use and long-horizon tasks are the new frontier. The uncomfortable truth: most engineers are maintaining infrastructure that shouldn't exist. Models will make it obsolete, and the survival of AI apps depends on how fast you can adapt to the new paradigm. That's where startups have an edge over big companies. Big corporations are late by at least two paradigms.
Some examples of scaffolding in decline:

- Vector databases: Companies paying thousands per month when they could now just put docs in the prompt or use agentic search instead of RAG (my article on the topic)
- LLM frameworks: These frameworks solved real problems in 2023. In 2025? They're abstraction layers that slow you down. The best practice is now to use the model API directly.
- Prompt engineering teams: Companies hiring "prompt engineers" to craft perfect prompts when current models just need clear instructions with open tools
- Model fine-tuning: Teams spending months fine-tuning models only for the next generation of out-of-the-box models to outperform their fine-tune (cf. my 2024 article on that)
- Custom caching layers: Building Redis-backed semantic caches that add latency and complexity when prompt caching is built into the API

This cycle accelerates with every model release. The best AI teams share four critical skills:

- Deep model awareness: They understand exactly what today's models can and cannot do, building only the minimal scaffolding needed to bridge capability gaps.
- Strategic foresight: They distinguish between infrastructure that solves today's problems and infrastructure that will survive the next model generation.
- Frontier vigilance: They treat model releases like breaking news. Missing a single capability announcement from OpenAI, Anthropic, or Google can render months of work obsolete.
- Ruthless iteration: They celebrate deleting code. When a new model makes their infrastructure redundant, they pivot in days, not months.

It's not easy.
Teams are fighting powerful forces:

- Lack of awareness: Teams don't realize models have improved enough to eliminate scaffolding (this is massive, btw)
- Sunk cost fallacy: "We spent 3 years building this RAG pipeline!"
- Fear of regression: "What if the new approach is simpler but doesn't work as well on certain edge cases?"
- Organizational inertia: Getting approval to delete infrastructure is harder than building it
- Resume-driven development: "RAG pipeline with vector DB and reranking" looks better on a resume than "put files in prompt"

In AI, the best teams build for fast obsolescence and stay at the edge. Software engineering sits on top of a complex stack. More layers, more abstractions, more frameworks. Complexity was sophistication. A simple web form in 2024? React for UI, Redux for state, TypeScript for types, Webpack for bundling, Jest for testing, ESLint for linting, Prettier for formatting, Docker for deployment... AI is inverting this. The best AI code is simple and close to the model. Experienced engineers look at modern AI codebases and think: "This can't be right. Where's the architecture? Where's the abstraction? Where's the framework?" The answer: the model ate it, bro. Get over it. The worst AI codebases are the ones that were best practices 12 months ago. As models improve, the scaffolding becomes technical debt. The sophisticated architecture becomes the liability. The framework becomes the bottleneck. LLMs eat scaffolding for breakfast, and the trend is accelerating. Thanks for reading! Subscribe for free to receive new posts and support my work.


ChatGPT Killed the Web: For the Better?

I haven't used Google in a year. No search results, no blue links. ChatGPT became my default web browser in December 2024, and it has completely replaced the traditional web for me. Soon, no one will use search engines. No one will click on 10 blue links. But there is more: no one will navigate to websites. Hell, no one will even read a website again. The original web was simple. Static HTML pages. You could read about a restaurant—its menu, hours, location. But that was it. Pure consumption. Then came interactivity. Databases. User accounts. Now you could *do* things: reserve a table at that restaurant, leave a review, upload photos. The web became bidirectional. Every click was an action, every form a transaction. Now we're entering a new evolution. You don't navigate and read the restaurant's website. You don't fill out the reservation form. An LLM agent does both for you. Look at websites today. Companies spend millions building elaborate user interfaces—frontend frameworks, component libraries, animations that delight users, complex backends orchestrating data flows. Teams obsess over pixel-perfect designs, A/B test button colors, and optimize conversion funnels. All of this sophisticated web infrastructure exists for one purpose: to present information to humans and let them take actions. But if the information is consumed by an LLM, why does it need any of this? You don't need a website. You need a text file. That's it. That's all an LLM needs to answer any question about a restaurant. No UI, no clean UX, nothing. Here's what nobody's talking about: we don't need thousands of websites anymore. Take a French boeuf bourguignon recipe. Today, there are hundreds of recipe websites, each with its own version:

- AllRecipes with its community ratings
- Serious Eats with detailed techniques
- Food Network with celebrity chef branding
- Marmiton for French speakers
- Countless food blogs with personal stories

Why do all these exist?
They differentiated through:

- Better UI design
- Fewer ads
- Faster load times
- Native-language content
- Unique photography
- Personal narratives before the recipe

But LLMs don't care about any of this. They don't see your beautiful photos. They skip past your childhood story about grandma's kitchen. They ignore your pop-up ads. They just need the recipe. Language barriers? Irrelevant. The LLM translates instantly. French, Italian, Japanese. It doesn't matter. What this means: instead of 10,000 cooking websites, we need maybe... a couple? Or a single, comprehensive markdown repository of recipes. This pattern repeats everywhere:

- Travel guides
- Product reviews
- News sites
- Educational content

The web doesn't need redundancy when machines are the readers. Wait, there is more: LLMs can create content too. Web 2.0's breakthrough was making everyone a writer. YouTube, Instagram, TikTok—billions of people creating content for billions of people to read. But here's the thing: why do you need a million human creators when AI can be all of them? Your favorite cooking influencer? Soon it'll be an AI chef who knows exactly what's in your fridge, your dietary restrictions, and your skill level. No more scrolling through 50 recipe videos to find one that works for you. Your trusted news anchor? An AI that only covers YOUR interests—your stocks, your sports teams, your neighborhood. Not broadcasting to millions, but narrowcasting to one. That fitness instructor you follow? An AI trainer that adapts to your fitness level, your injuries, your equipment. Every video made just for you, in real time. Web 2.0 writing: humans create content → millions read the same thing. Web 3.0 writing: AI creates content → each person reads something unique. The entire creator economy—the crown jewel of Web 2.0—collapses into infinite personalized AI agents. Social media feeds won't be filled with human posts anymore. They'll be generated in real time, specifically for you.
Every scroll, unique. Every video, personalized. Every post, tailored. The paradox: we'll have infinite content variety with zero human creators. Maximum personalization through total artificial generation. Just as 10,000 recipe websites collapse into one markdown file for LLMs to read, millions of content creators collapse into personalized AI agents. The "write" revolution of Web 2.0 is being replaced by AI that writes everything, for everyone, individually. OK, what about taking actions, like booking a restaurant? Web 2.0 gave us APIs—structured endpoints for programmatic interaction:

- `POST /api/reservations`
- Rigid schemas: exact field names, specific formats
- Documentation hell: dozens of pages explaining endpoints
- Integration nightmare: every API different, nothing interoperable

APIs assumed developers would read documentation, write integration code, and handle complex error scenarios. They were built for humans to program against, requiring manual updates whenever the API changed, breaking integrations, and forcing developers to constantly maintain compatibility. MCP isn't just another API. It's designed for LLM agents:

- Dynamic discovery: Agents explore capabilities in real time through tool introspection
- Flexible schemas: Natural language understanding, not rigid fields
- Universal interoperability: One protocol, infinite services
- Context-aware: Maintains conversation state across actions

What makes MCP special technically:

- Three primitives: Tools (functions agents can call), Resources (data agents can read), and Prompts (templates for common tasks)
- Transport agnostic: Works over STDIO for local servers or HTTP/SSE for remote services
- Stateful sessions: Unlike REST APIs, MCP maintains context between calls
- Built-in tool discovery: Agents can query `listTools()` to understand capabilities dynamically—no documentation parsing needed

Traditional APIs are like giving someone a thick manual and saying “follow these exact steps.
” MCP is like having a smart assistant who can figure out what's possible just by looking around. When you walk into that restaurant, the agent doesn't need a 50-page guide—it instantly knows it can check tables, make reservations, or view the menu. And unlike APIs that forget everything between requests (like talking to someone with amnesia!), MCP remembers the whole conversation—so when you say "actually, make it 8pm instead," it knows exactly what reservation you're talking about. The agent handles all the complexity. No documentation needed. No rigid formats. Just natural interaction. Even better: when the restaurant adds new capabilities—like booking the entire venue for private events, adding wine pairings, or offering chef's table experiences—there's no developer work required. The LLM agent automatically discovers the expanded schema and adapts. Traditional APIs would break existing integrations or require manual updates. MCP just works. With markdown for reading and MCP for acting, the entire web infrastructure becomes invisible:

- Read: the LLM ingests markdown → understands everything about your service
- Act: the LLM uses MCP → performs any action a user needs

Websites become obsolete. Users never leave their chat interface. The web started as simple text documents linked together. We spent 30 years adding complexity: animations, interactivity, rich media. Now we're stripping it all away again. But this time, the simplicity isn't for humans. It's for machines. And that changes everything. The web as we know it is disappearing. What replaces it will be invisible, powerful, and fundamentally different from anything we've built before. For someone like me who loves designing beautiful UIs, this is bittersweet. All those carefully crafted interfaces, micro-interactions, and pixel-perfect layouts will be obsolete.
But I'm genuinely excited because it's all about the user experience, and the UX of chatting with (or even calling) your agent is infinitely better than website navigation. I can't wait.


The RAG Obituary: Killed by Agents, Buried by Context Windows

I've been working in AI and search for a decade: first building Doctrine, the largest European legal search engine, and now building Fintool, an AI-powered financial research platform that helps institutional investors analyze companies, screen stocks, and make investment decisions. After three years of building, optimizing, and scaling LLMs with retrieval-augmented generation (RAG) systems, I believe we're witnessing the twilight of RAG-based architectures. As context windows explode and agent-based architectures mature, my controversial opinion is that the RAG infrastructure we spent so much time building and optimizing is on the decline. In late 2022, ChatGPT took the world by storm. People started endless conversations, delegating crucial work, only to realize that the underlying model, GPT-3.5, could only handle 4,096 tokens... roughly six pages of text! The AI world faced a fundamental problem: how do you make an intelligent system work with knowledge bases that are orders of magnitude larger than what it can read at once? The answer became retrieval-augmented generation (RAG), an architectural pattern that would dominate AI for the next three years. GPT-3.5 could handle 4,096 tokens, and the next model, GPT-4, doubled that to 8,192 tokens, about twelve pages. This wasn't just inconvenient; it was architecturally devastating. Consider the numbers: a single SEC 10-K filing contains approximately 51,000 tokens (130+ pages). With 8,192 tokens, you could see less than 16% of a 10-K filing. It's like reading a financial report through a keyhole! RAG emerged as an elegant solution borrowed directly from search engines. Just as Google displays 10 blue links with relevant snippets for your query, RAG retrieves the most pertinent document fragments and feeds them to the LLM for synthesis. The core idea is beautifully simple: if you can't fit everything in context, find the most relevant pieces and use those. It turns LLMs into sophisticated search-result summarizers.
Basically, LLMs can't read the whole book, but they can know who dies at the end; convenient! Long documents need to be chunked into pieces, and that's when the problems start. Those digestible pieces are typically 400-1,000 tokens each, roughly 300-750 words. The problem? It isn't as simple as cutting every 500 words. Consider chunking a typical SEC 10-K annual report. The document has a complex hierarchical structure:

- Item 1: Business Overview (10-15 pages)
- Item 1A: Risk Factors (20-30 pages)
- Item 7: Management's Discussion and Analysis (30-40 pages)
- Item 8: Financial Statements (40-50 pages)

After naive chunking at 500 tokens, critical information gets scattered:

- Revenue recognition policies split across 3 chunks
- A risk factor explanation broken mid-sentence
- Financial table headers separated from their data
- MD&A narrative divorced from the numbers it's discussing

If you search for "revenue growth drivers," you might get a chunk mentioning growth but miss the actual numerical data in a different chunk, or the strategic context from MD&A in yet another chunk!
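The naive fixed-size splitter described above fits in a few lines. This is an illustration of the failure mode, not Fintool's pipeline, and word counts stand in for token counts:

```python
def chunk_fixed(text: str, chunk_size: int = 500) -> list[str]:
    """Naive chunking: split on word count, blind to document structure."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# Stand-in for a 1,200-word filing section.
filing = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_fixed(filing)
# Produces 3 chunks; any table or sentence straddling a 500-word
# boundary is cut in half and lands in two different chunks.
```

Nothing in the splitter knows where a table or a "See Note X" reference begins, which is exactly why revenue policies end up scattered across chunks.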
At Fintool, we've developed sophisticated chunking strategies that go beyond naive text splitting:

- Hierarchical Structure Preservation: We maintain the nested structure from Item 1 (Business) down to sub-sections like geographic segments, creating a tree-like document representation
- Table Integrity: Financial tables are never split—income statements, balance sheets, and cash flow statements remain atomic units with headers and data together
- Cross-Reference Preservation: We maintain links between narrative sections and their corresponding financial data, preserving the "See Note X" relationships
- Temporal Coherence: Year-over-year comparisons and multi-period analyses stay together as single chunks
- Footnote Association: Footnotes remain connected to their referenced items through metadata linking

Each chunk at Fintool is enriched with extensive metadata:

- Filing type (10-K, 10-Q, 8-K)
- Fiscal period and reporting date
- Section hierarchy (Item 7 > Liquidity > Cash Position)
- Table identifiers and types
- Cross-reference mappings
- Company identifiers (CIK, ticker)
- Industry classification codes

This allows for more accurate retrieval, but even intelligent chunking can't solve the fundamental problem: we're still working with fragments instead of complete documents! Once you have the chunks, you need a way to search them. One approach is to embed them. Each chunk is converted into a high-dimensional vector (typically 1,536 dimensions in most embedding models). These vectors live in a space where, theoretically, similar concepts are close together. When a user asks a question, that question also becomes a vector. The system finds the chunks whose vectors are closest to the query vector using cosine similarity. It's elegant in theory; in practice, it's a nightmare of edge cases. Embedding models are trained on general text and struggle with specialized terminology.
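A toy sketch of that retrieval step makes the edge case concrete. The 3-dimensional vectors below are invented for illustration (real embeddings have ~1,536 dimensions), chosen so that both "revenue" chunks sit close together in the space:

```python
import math

def cosine(a, b):
    """Cosine similarity: the angle between two vectors, ignoring magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Invented embeddings: both "revenue" chunks are dominated by the same token.
chunks = {
    "revenue recognition policy (accounting)": [0.9, 0.05, 0.0],
    "revenue grew 12% year-over-year":         [0.8, 0.30, 0.1],
    "lease obligations, Note 12":              [0.1, 0.90, 0.2],
}
query = [0.9, 0.1, 0.0]  # embedding of "revenue growth drivers"

ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
# The accounting-policy chunk outranks the actual growth numbers.
```

With these vectors, a question about revenue *growth* retrieves the revenue *recognition* chunk first, which is precisely the kind of near-miss that plagues embedding search on financial text.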
They find similarities, but they can't distinguish between "revenue recognition" (accounting policy) and "revenue growth" (business performance). Consider this example. Query: "What is the company's litigation exposure?" RAG searches for "litigation" and returns 50 chunks:

- Chunks 1-10: Various mentions of "litigation" in boilerplate risk factors
- Chunks 11-20: Historical cases from 2019 (already settled)
- Chunks 21-30: Forward-looking safe harbor statements
- Chunks 31-40: Duplicate descriptions from different sections
- Chunks 41-50: Generic "we may face litigation" warnings

What RAG reports: $500M in litigation (from the Legal Proceedings section). What's actually there:

- $500M in Legal Proceedings (Item 3)
- $700M in the Contingencies note ("not material individually")
- $1B new class action in Subsequent Events
- $800M indemnification obligations (different section)
- $2B probable losses in footnotes (keyword "probable," not "litigation")

The actual exposure is $5B, 10x what RAG found. Oupsy! By late 2023, most builders realized pure vector search wasn't enough. Enter hybrid search: combine semantic search (embeddings) with traditional keyword search (BM25). This is where things get interesting. BM25 (Best Matching 25) is a probabilistic retrieval model that excels at exact term matching. Unlike embeddings, BM25:

- Rewards exact matches: When you search for "EBITDA," you get documents with "EBITDA," not "operating income" or "earnings"
- Handles rare terms better: Financial jargon like "CECL" (Current Expected Credit Losses) or "ASC 606" gets proper weight
- Normalizes document length: Doesn't penalize longer documents
- Saturates term frequency: Multiple mentions of "revenue" don't overshadow other important terms

At Fintool, we've built a sophisticated hybrid search system:

1. Parallel processing: We run semantic and keyword searches simultaneously
2. Dynamic weighting: The system adjusts weights based on query characteristics. Specific financial metrics? BM25 gets 70% weight. Conceptual questions? Embeddings get 60% weight. Mixed queries? 50/50 split with result analysis
3. Score normalization: Different scoring scales are normalized using min-max scaling for BM25 scores, cosine similarity (already normalized) for embeddings, and z-score normalization for outlier handling

In the end, the embedding search and the keyword search each retrieve chunks, and the search engine combines them using Reciprocal Rank Fusion (RRF). RRF merges the rankings so items that consistently appear near the top across systems float higher, even if no system put them at #1! So now you think it's done, right? But hell no! Here's what nobody talks about: even after all that retrieval work, you're not done. You need to rerank the chunks one more time to get good retrieval, and it's not easy. Rerankers are ML models that take the search results and reorder them by relevance to your specific query, limiting the number of chunks sent to the LLM. Not only are LLMs context-poor, they also struggle when dealing with too much information. It's vital to reduce the number of chunks sent to the LLM for the final answer. The reranking pipeline:

1. Initial retrieval with embeddings + keywords gets you 100-200 chunks
2. A reranker ranks the top 10
3. The top 10 are fed to the LLM to answer the question

Here are the challenges with reranking:

- Latency explosion: Reranking adds between 300 and 2,000ms per query. Ouch.
- Cost multiplication: It adds significant extra cost to every query. For instance, Cohere Rerank 3.5 costs $2.00 per 1,000 search units, making reranking expensive.
- Context limits: Rerankers typically handle few chunks at a time (Cohere Rerank supports only 4,096 tokens), so if you need to rerank more than that, you have to split the work into parallel API calls and merge the results!
- Another model to manage: One more API, one more failure point

Reranking is one more step in an already complex pipeline. What I find difficult with RAG is what I call the "cascading failure problem":

1. Chunking can fail (split tables) or be too slow (especially when you have to ingest and chunk gigabytes of data in real time)
2. Embedding can fail (wrong similarity)
3. BM25 can fail (term mismatch)
4. Hybrid fusion can fail (bad weights)
5. Reranking can fail (wrong priorities)

Each stage compounds the errors of the previous stage. Beyond the complexity of hybrid search itself, there's an infrastructure burden that's rarely discussed. Running production Elasticsearch is not easy. You're looking at maintaining TB+ of indexed data for comprehensive document coverage, which requires 128-256GB of RAM minimum just to get decent performance. The real nightmare comes with re-indexing. Every schema change forces a full re-index that takes 48-72 hours for large datasets. On top of that, you're constantly dealing with cluster management, sharding strategies, index optimization, cache tuning, backup and disaster recovery, and version upgrades that regularly include breaking changes. And there are structural limitations:

1. Context fragmentation: Long documents are interconnected webs, not independent paragraphs. A single question might require information from 20+ documents, and chunking destroys these relationships permanently.
2. Semantic search fails on numbers: "$45.2M" and "$45,200,000" have different embeddings. "Revenue increased 10%" and "Revenue grew by a tenth" rank differently. Tables full of numbers have poor semantic representations.
3. No causal understanding: RAG can't follow "See Note 12" → Note 12 → Schedule K. It can't understand that discontinued operations affect continuing operations, or trace how one financial item impacts another.
4. The vocabulary mismatch problem: Companies use different terms for the same concept ("Adjusted EBITDA" vs. "Operating Income Before Special Items"). RAG retrieves based on terms, not concepts.
5. Temporal blindness: RAG can't reliably distinguish Q3 2024 from Q3 2023, mixes current-period with prior-period comparisons, and has no understanding of fiscal year boundaries.

These aren't minor issues. They're fundamental limitations of the retrieval paradigm. Three months ago, I stumbled on an innovation in retrieval that blew my mind. In May 2025, Anthropic released Claude Code, an AI coding agent that works in the terminal. At first, I was surprised by the form factor. A terminal? Are we back in 1980? No UI? Back then, I was using Cursor, a product that excelled at traditional RAG. I gave it access to my codebase to embed my files, and Cursor ran a search on my codebase before answering my query. Life was good. But when testing Claude Code, one thing stood out: it was better and faster, not because its RAG was better but because there was no RAG. Instead of a complex pipeline of chunking, embedding, and searching, Claude Code uses direct filesystem tools:

1. Grep (ripgrep): Lightning-fast regex search through file contents. No indexing required; it searches live files instantly, with full regex support for precise pattern matching. It can filter by file type or use glob patterns, and returns exact matches with context lines.
2. Glob: Direct file discovery by name patterns. It finds files like `**/*.py` or `src/**/*.ts` instantly and returns them sorted by modification time (recency bias). Zero overhead—just filesystem traversal.
3. Task agents: Autonomous multi-step exploration. They handle complex queries requiring investigation, combine multiple search strategies adaptively, build understanding incrementally, and self-correct based on findings.

By the way, grep was invented in 1973. It's so... primitive. And that's the genius of it. Claude Code doesn't retrieve.
It investigates:

- Runs multiple searches in parallel (Grep + Glob simultaneously)
- Starts broad, then narrows based on discoveries
- Follows references and dependencies naturally
- No embeddings, no similarity scores, no reranking

It's simple, it's fast, and it's based on a new assumption: that LLMs will go from context-poor to context-rich. Claude Code proved that with sufficient context and intelligent navigation, you don't need RAG at all. The agent can:

- Load entire files or modules directly
- Follow cross-references in real time
- Understand structure and relationships
- Maintain complete context throughout the investigation

This isn't just better than RAG—it's a fundamentally different paradigm. And what works for code can work for any long documents, not just coding files. The context window explosion made Claude Code possible.

2022-2025, the context-poor era:

- GPT-4: 8K tokens (~12 pages)
- GPT-4-32k: 32K tokens (~50 pages)

2025 and beyond, the context revolution:

- Claude Sonnet 4: 200K tokens (~700 pages)
- Gemini 2.5: 1M tokens (~3,000 pages)
- Grok-4 Fast: 2M tokens (~6,000 pages)

At 2M tokens, you can fit an entire year of SEC filings for most companies. The trajectory is even more dramatic: we're likely heading toward 10M+ context windows by 2027, with Sam Altman hinting at billions of context tokens on the horizon. This represents a fundamental shift in how AI systems process information. Equally important, attention mechanisms are rapidly improving—LLMs are becoming far better at maintaining coherence and focus across massive context windows without getting "lost" in the noise. Claude Code demonstrated that with enough context, search becomes navigation:

- No need to retrieve fragments when you can load complete files
- No need for similarity when you can use exact matches
- No need for reranking when you follow logical paths
- No need for embeddings when you have direct access

It's mind-blowing.
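ripgrep itself is a compiled Rust binary, but the no-index model it embodies is easy to see in a pure-Python stand-in. The file names and contents below are invented for the demo:

```python
import re
from pathlib import Path
from tempfile import TemporaryDirectory

def grep(pattern: str, root: Path, glob: str = "**/*.txt"):
    """Live regex search over files: no index, no embeddings, no reranker."""
    rx = re.compile(pattern)
    return [(p.name, n, line.strip())
            for p in sorted(root.glob(glob))
            for n, line in enumerate(p.read_text().splitlines(), start=1)
            if rx.search(line)]

with TemporaryDirectory() as tmp:
    root = Path(tmp)
    # The moment a file hits the filesystem, it is searchable: zero indexing latency.
    (root / "10k.txt").write_text("Operating leases total $5B.\nSee Note 12 for details.\n")
    (root / "notes.txt").write_text("Note 12: excludes discontinued operations (Note 23).\n")
    hits = grep(r"Note 12", root)
    # hits → [("10k.txt", 2, "See Note 12 for details."), ("notes.txt", 1, ...)]
```

Contrast this with the RAG pipeline above: there is nothing to chunk, embed, fuse, or rerank, and a changed file needs no re-indexing before the next search sees it.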
LLMs are getting really good at agentic behaviors, meaning they can organize their work into tasks to accomplish an objective. Here's what tools like ripgrep bring to the search table:

- No setup: No index. No overhead. Just point and search.
- Instant availability: New documents are searchable the moment they hit the filesystem (no indexing latency!)
- Zero maintenance: No clusters to manage, no indices to optimize, no RAM to provision
- Blazing fast: For a 100K-line codebase, Elasticsearch needs minutes to index. Ripgrep searches it in milliseconds with zero prep.
- Cost: $0 infrastructure cost vs. a lot of $$$ for Elasticsearch

So back to our earlier example on SEC filings. An agent can understand SEC filing structure intrinsically:

- Hierarchical awareness: Knows that Item 1A (Risk Factors) relates to Item 7 (MD&A)
- Cross-reference following: Automatically traces "See Note 12" references
- Multi-document coordination: Connects 10-K, 10-Q, 8-K, and proxy statements
- Temporal analysis: Compares year-over-year changes systematically

For searches across thousands of companies or decades of filings, an agent might still use hybrid search, but now as one tool among others:

- Initial broad search using hybrid retrieval
- Agent loads full documents for top results
- Deep analysis within full context
- Iterative refinement based on findings

My guess is that traditional RAG becomes just another search tool, and that agents will prefer grep and reading whole files because they are context-rich and can handle long-running tasks.
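The navigation pattern above, search, load the full document, follow its references, can be sketched as a small loop. Everything here is hypothetical: the document names, their contents, and the `investigate` helper are invented to illustrate the idea, with full documents standing in for chunks:

```python
import re

# Toy document store standing in for full filings loaded into context.
DOCS = {
    "financials": "Operating leases: $5B. See Note 12.",
    "note_12": "Lease schedule. Excludes discontinued operations; See Note 23.",
    "note_23": "Discontinued operations add $2B of lease obligations.",
}

def investigate(start: str) -> list[str]:
    """Follow 'See Note N' references across documents, like an analyst would."""
    trail, queue, seen = [], [start], set()
    while queue:
        doc = queue.pop(0)
        if doc in seen:
            continue
        seen.add(doc)
        trail.append(doc)
        # Each cross-reference in the full text becomes the next navigation step.
        queue += [f"note_{n}" for n in re.findall(r"See Note (\d+)", DOCS[doc])]
    return trail

trail = investigate("financials")
# → ["financials", "note_12", "note_23"]
```

Because the agent reads whole documents, the reference chain is never severed the way chunking severs it; the $2B in Note 23 is reached mechanically rather than hoped-for via similarity scores.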
Consider our $6.5B lease obligation question as an example: Step 1: Find “lease” in main financial statements → Discovers “See Note 12” Step 2: Navigate to Note 12 → Finds “excluding discontinued operations (Note 23)” Step 3: Check Note 23 → Discovers $2B additional obligations Step 4: Cross-reference with MD&A → Identifies management’s explanation and adjustments Step 5: Search for “subsequent events” → Finds post-balance sheet $500M lease termination Final answer: $5B continuing + $2B discontinued - $500M terminated = $6.5B The agent follows references like a human analyst would. No chunks. No embeddings. No reranking. Just intelligent navigation. Basically, RAG is like a research assistant with perfect memory but no understanding: - “Here are 50 passages that mention debt” - Can’t tell you if debt is increasing or why - Can’t connect debt to strategic changes - Can’t identify hidden obligations - Just retrieves text, doesn’t comprehend relationships Agentic search is like a forensic accountant: - Follows the money systematically - Understands accounting relationships (assets = liabilities + equity) - Identifies what’s missing or hidden - Connects dots across time periods and documents - Challenges management assertions with data 1. Increasing Document Complexity - Documents are becoming longer and more interconnected - Cross-references and external links are proliferating - Multiple related documents need to be understood together - Systems must follow complex trails of information 2. Structured Data Integration - More documents combine structured and unstructured data - Tables, narratives, and metadata must be understood together - Relationships matter more than isolated facts - Context determines meaning 3. Real-Time Requirements - Information needs instant processing - No time for re-indexing or embedding updates - Dynamic document structures require adaptive approaches - Live data demands live search 4. 
Cross-Document Understanding
Modern analysis requires connecting multiple sources:
- Primary documents
- Supporting materials
- Historical versions
- Related filings

RAG treats each document independently. Agentic search builds cumulative understanding.

5. Precision Over Similarity
- Exact information matters more than similar content
- Following references beats finding related text
- Structure and hierarchy provide crucial context
- Navigation beats retrieval

The evidence is becoming clear. While RAG served us well in the context-poor era, agentic search represents a fundamental evolution. The potential benefits of agentic search are compelling:

- Elimination of hallucinations from missing context
- Complete answers instead of fragments
- Faster insights through parallel exploration
- Higher accuracy through systematic navigation
- Massive infrastructure cost reduction
- Zero index maintenance overhead

The key insight? Complex document analysis—whether code, financial filings, or legal contracts—isn’t about finding similar text. It’s about understanding relationships, following references, and maintaining precision. The combination of large context windows and intelligent navigation delivers what retrieval alone never could.

RAG was a clever workaround for a context-poor era. It helped us bridge the gap between tiny windows and massive documents, but it was always a band-aid. The future won’t be about splitting documents into fragments and juggling embeddings. It will be about agents that can navigate, reason, and hold entire corpora in working memory. We are entering the post-retrieval age. The winners will not be the ones who maintain the biggest vector databases, but the ones who design the smartest agents to traverse abundant context and connect meaning across documents. In hindsight, RAG will look like training wheels. Useful, necessary, but temporary. The next decade of AI search will belong to systems that read and reason end-to-end.
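Mechanically, the reference-following behavior in the $6.5B lease walkthrough earlier boils down to a small loop. Everything here is illustrative — the section names, amounts, and the `notes` structure are hypothetical stand-ins for real filing text:

```python
import re

def follow_references(notes, start):
    """Chase 'See Note N' cross-references and sum the obligations found.

    `notes` maps a section name to (section text, obligation amount in $).
    """
    seen, queue, total = set(), [start], 0
    while queue:
        section = queue.pop()
        if section in seen or section not in notes:
            continue
        seen.add(section)
        text, amount = notes[section]
        total += amount
        # Follow each cross-reference mentioned in this section's text.
        queue += re.findall(r"See (Note \d+)", text)
    return total

notes = {
    "Leases":  ("Operating lease obligations of $5B. See Note 23.", 5_000_000_000),
    "Note 23": ("Discontinued operations lease obligations.",       2_000_000_000),
}
```

Starting from the lease note yields $5B + $2B = $7B; subtracting the $500M termination found under subsequent events gives the final $6.5B.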
Retrieval isn’t dead—it’s just been demoted.


But But, You Were Supposed to Be a GPT Wrapper?!

My team and I are building Fintool, Warren Buffett as a Service. It's a set of AI agents that analyze massive amounts of financial data and documents to help institutional investors make investment decisions. To simplify for customers, we describe Fintool as a sort of ChatGPT for SEC filings and earnings calls. We got our fair share of "yOU aRe JuST a GPT wRapPer" from people who had no clue what they were talking about but wanted to sound smart and provocative. Anyway! For the more serious crowd, I thought it would be nice to disclose our infrastructure and its unique challenges. Our goal is to ingest as much financial data as possible: news, management presentations, internal notes, broker research, market data, rating agency reports, alternative data, internal data, and much more. We started with SEC filings, but our infrastructure is designed to scale and adapt, with no limit to the types of data sources it can handle. Our data ingestion pipeline uses Apache Spark to efficiently process vast amounts of structured and unstructured data. The primary data source is the SEC database, which provides, on average, around 3,000 filings daily. We've built a custom Spark job to pull data from the SEC, process HTML files, and distribute the workload across our Spark cluster for real-time ingestion. With SEC filings and earnings calls alone, we manage 70 million chunks, 2 million documents, and around 5 TB of data in Databricks for every ten years of data. Many documents are unstructured and often exceed 200 pages in length. Each data source has a dedicated Spark streaming job, ensuring a continuous flow of data into our system and making Fintool one of the very few real-time systems in production in our market. We outperform nearly all incumbents in processing time, often by hours. Maintaining 100% uptime across all these pipelines and catching errors early is a significant challenge.
Any failure in these processes could lead to incomplete or delayed data, affecting the reliability of Fintool. Our customers can’t miss a company’s earnings release or an 8-K filing announcing that an executive is departing. To address this, we have built robust monitoring tools that help us detect and resolve issues swiftly, keeping the system operational and dependable. To make sense of the different formats, we've developed a custom parser that handles both structured and unstructured data. This parser extracts millions of data points using a combination of unsupervised machine learning models, all optimized for financial documents. For instance, accurately extracting tables with numerical data and footnotes presents unique challenges: the numbers must be correctly linked to their respective headers, and important context from footnotes must be preserved. Imagine a company reports non-GAAP earnings with a footnote clarifying that $2 billion in employee stock-based compensation isn’t included; without accounting for that $2 billion, the earnings figures could be misleading! One of our goals is to handle as many complex operations offline as possible. This saves on costs and improves quality, because it allows us to thoroughly analyze the output, something that is not feasible during real-time user queries. We have recently partnered with OpenAI on a research project to use LLMs to extract every data point in SEC filings. Every week, we process 50 billion tokens, equivalent to 468,750 books of 200 pages each, or 12 times the size of Wikipedia. Accounting is exceptionally complex. SEC filings often use different terminologies or formats for similar items—terms like “Revenue,” “Net Sales,” or “Turnover” vary by company or industry—making consistent data extraction a challenge.
Key figures like "Net Income" may come with footnotes detailing adjustments (e.g., “excluding litigation costs”), and companies frequently report figures for different time periods, such as quarterly versus year-to-date, within the same filing. Some companies don’t report in USD, and others occasionally change accounting methods (e.g., revenue recognition policies), noted in footnotes, which requires careful adjustments to make financials comparable over time. It’s complex, but Fintool is bringing order to it all. Our advanced data pipelines are engineered to locate, verify, deduplicate, and cross-compare every data point, ensuring unmatched accuracy and insight. This is how we've built the most reliable financial fundamentals database on the market! Next, we break these documents into manageable, meaningful segments while preserving context—crucial for downstream tasks like search and question answering. We use a sliding window approach with a variable-sized window (typically 400 tokens) to ensure coherence between segments. We also employ hierarchical chunking to create a tree-like structure of document sections, capturing everything from top-level sections like "Financial Statements" down to specific sub-sections. Our system treats tables as atomic units, keeping table headers and data cells intact for accuracy. To maintain context, each chunk is enriched with metadata (e.g., document title, section headers), and consecutive chunks share a small overlap (about 10%) to ensure continuity. This allows us to accurately capture the narrative even in long documents; a 10-K annual report runs 150 to 200 pages. Those docs are then ready to be embedded! We compute embeddings for each document chunk using a fine-tuned open-source model running on our GPUs. The model was fine-tuned on hundreds of real-life examples of expert financial questions.
These embeddings allow us to represent complex financial data in a way that captures semantic meaning. For example, if a document mentions 'net income growth' alongside 'operating cash flow trends,' the embeddings capture the relationship between these terms, allowing the system to understand the context and link related financial concepts effectively. The embedding computation pipeline processes data in batches and stores the results in Elasticsearch, which supports vector storage and search through its dense_vector field type. Elasticsearch enables k-nearest neighbor (kNN) search using similarity metrics such as cosine similarity and dot product. Since we normalize our embeddings to unit length, cosine similarity and dot product yield equivalent results, allowing us to use either for efficient similarity search. We chose not to use a dedicated vector database, as it would add complexity and reduce performance, particularly when merging results from both keyword and vector searches. Managing this combination effectively without compromising speed and accuracy is challenging, which is why we opted for this more streamlined approach. To speed up our embeddings search, we quantize the embeddings, compressing them to significantly reduce memory usage—by as much as 75%. This reduction means we can access and process data faster, allowing for quicker responses while maintaining effective search performance. Quantization not only optimizes memory but also boosts efficiency across the entire search process. Our search infrastructure integrates both keyword-based and semantic search methods to deliver accurate and comprehensive answers. For keyword search, we use an enhanced BM25 algorithm, which helps us find relevant information based on traditional keyword matching. On the semantic side, we leverage vector-based similarity search using ElasticSearch to locate information based on meaning rather than just keywords. 
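A quick check of the equivalence claimed above — for unit-length vectors, cosine similarity and dot product coincide, so the cheaper dot product can back the kNN search:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two toy "embeddings", normalized to unit length as in the pipeline above.
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])
assert abs(dot(a, a) - 1.0) < 1e-12          # unit length
assert abs(cosine(a, b) - dot(a, b)) < 1e-12  # the two metrics coincide
```

With normalization done once at indexing time, every query-time comparison is a single dot product instead of a dot product plus two norms.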
Despite all the buzz around vector search, our evaluations revealed that relying on vector search alone falls short of expectations. While many startups offer vector databases combined with vector search as a service, we have more confidence in Elastic's technology. Through extensive optimizations, we’ve achieved a streamlined Elastic index of approximately 500GB, containing about 2 million documents for every 10 years of data. This combination of keyword and semantic search gives us hybrid retrieval, which significantly enhances search relevance and accuracy. For example, keyword search is ideal for finding specific financial terms like 'net income,' which require precise matching. Meanwhile, vector search helps answer broader questions, such as "companies showing signs of liquidity stress," which involve context and relationships between multiple financial metrics. We then use reranking to improve retrieval performance. Our re-ranker takes a list of candidate chunks and uses a cross-encoder model to assign a relevance score, ensuring the most relevant chunks are prioritized. The cross-encoder allows a deeper and more precise evaluation of the relationship between the query and each document, resulting in significantly more accurate final rankings. Re-ranking can add hundreds of milliseconds of latency but, in our experience, is worth it. Speaking of improving search, we have been exploring knowledge graphs since Microsoft published the GraphRAG framework. It uses an LLM to automatically extract data points and build a rich graph from a collection of text documents. This graph represents entities as nodes, relationships as edges, and claims as covariates on edges. An example of a node in the knowledge graph could be 'Apple Inc. (AAPL)' as an entity, representing the company. Relationships (edges) might include connections like 'has CEO' linked to 'Tim Cook' or 'sold shares on [date].'
These nodes and relationships help institutional investors quickly identify key details about companies, such as executive leadership changes, important filings, or financial events. GraphRAG automatically generates summaries for these entities. When a user submits a query, we plan to leverage the knowledge graph and community summaries to provide more structured and contextually relevant information than traditional retrieval-augmented generation approaches. For example, an institutional investor might ask, "Which companies in the S&P 500 are experiencing liquidity stress and have recently made executive changes?" GraphRAG supports both global search, to reason about holistic context (e.g., liquidity stress across the market), and local search, for specific entities (e.g., identifying companies with recent executive changes). This hybrid approach helps connect disparate pieces of information, providing more comprehensive and insightful answers. The challenge with GraphRAG lies in the high cost of both building and querying the graph, as well as managing query-time latency and integrating it with our keyword + vector search. A potential solution is an efficient, fast classifier that reserves graph search for only the most complex queries. We use LLMs for a variety of tasks such as understanding the query, expanding it, and classifying its type. For each user query, we trigger multiple classifiers that determine whether the question requires searching specific filings, calculating numerical values, or taking other specific actions. To handle these tasks, we utilize a variety of LLMs—from proprietary models to open-source Llama models—with different sizes and providers to balance speed and cost. For instance, we might use OpenAI GPT-4o for complex tasks and Llama-3 8B on Groq, a specialized provider for fast inference, for simpler tasks.
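One simple way to implement this kind of speed/cost balancing is to pick the cheapest model that clears a per-task quality bar. This is an illustrative sketch only — the model names, scores, and relative costs below are made up, not Fintool's actual routing table:

```python
# Hypothetical per-task benchmark scores, as a benchmarking service might
# produce, plus a relative cost per call for each model.
BENCHMARKS = {
    "query_classification": {"llama-3-8b": 0.93, "gpt-4o": 0.95},
    "answer_generation":    {"llama-3-8b": 0.71, "gpt-4o": 0.94},
}
COST = {"llama-3-8b": 1, "gpt-4o": 20}  # relative cost per call

def route(task, min_score=0.90):
    """Pick the cheapest model that clears the quality bar for this task."""
    good = [m for m, s in BENCHMARKS[task].items() if s >= min_score]
    if not good:  # nothing clears the bar: fall back to the best scorer
        return max(BENCHMARKS[task], key=BENCHMARKS[task].get)
    return min(good, key=COST.get)
```

Refreshing the score table as new models ship is what keeps the router model-agnostic: swapping in a better or cheaper model changes data, not code.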
We created an LLM Benchmarking Service that continuously evaluates the performance of these models across numerous tasks. This service helps us dynamically route each query to the best-performing model. Having a model-agnostic interface is crucial to ensure we are not constrained by any particular model, especially with new models emerging every six months with enhanced capabilities. This flexibility allows us to always leverage the best available tools for optimal performance. We don't spend any resources training or fine-tuning our own models; we wrote about this strategy in Burning Billions: The Gamble Behind Training LLM Models. As you can see, answering a user's question is not trivial. It relies on a massive infrastructure, dozens of classifiers, and a hybrid retrieval pipeline. Additionally, we use a specialized LLM pipeline to generate accurate citations for every piece of information in the response, which also serves as a way to fact-check everything the LLM outputs. For example, if the answer references a specific SEC filing, the LLM provides an exact citation, guiding the user directly to the original document. Evaluating and monitoring an LLM-based Retrieval Augmented Generation system presents its own challenges. Any problem could originate from various components—data pipelines, machine learning models for structuring data, the retrieval search and vector representation, the reranker, or the LLM itself. Identifying the root cause of an issue requires a comprehensive understanding of each part of the infrastructure and its interactions, ensuring that every step contributes effectively to the overall accuracy and reliability of the system. To address these challenges, we have developed specialized monitoring tools that help us catch potential errors across the entire pipeline. We also use Datadog to store extensive logs so we can quickly identify and fix production issues.
Obviously, we want to catch errors early, so we continually benchmark our product against finance-specific benchmarks. The catch is that some changes can improve our embeddings in isolation yet degrade the overall performance of the product. As you see, it’s very complex! There is so much more we could talk about, and I hope this provides a broad overview of our approach. Each of these sections could easily be expanded into a dedicated blog post! In short, I believe that making LLMs work in finance is both highly challenging and immensely rewarding. We're steadily building our infrastructure piece by piece, productizing and delivering each advancement along the way. Our ultimate goal is to create an autonomous "Warren Buffett as a Service" that can handle the workload of dozens of analysts, transforming the financial research landscape. Let me finish by sharing some of the things I'm most excited about for the future. Faster inference: many companies are working on specialized chips designed to deliver extremely low-latency, high-throughput processing with high parallelism. Today, we are using Groq, a provider capable of streaming at 800 tokens per second, and they now claim they can reach 2,000 tokens per second. To put this into context, processing at several thousand tokens per second means that complex responses will be delivered almost instantaneously. I'm more excited by faster inference than by smaller models like Llama 8B or Mistral 3B. While smaller LLMs are useful because they are faster, if larger models become extremely efficient and deliver superior intelligence, there may be no need for smaller models. The power of large, smart models would make them the optimal choice for most tasks. Why does this matter? With such speed, an advanced AI agent can take control of Fintool to analyze thousands of companies simultaneously, performing billions of searches on company data in a fraction of the time.
Imagine if Warren Buffett could read all filings, compute numbers, and analyze management teams instantly for thousands of companies. Cheaper cost per token: I'm excited by the price of superintelligence getting closer to zero. The cost per GPT token has already dropped by 99%, and I'm confident it will continue to drop due to intense competition between major players like Microsoft and Meta, as well as innovations in semiconductors and economies of scale in large data centers. With costs continuing to decrease, we are approaching a future where large-scale AI computations are affordable, enabling widespread adoption and insane innovations. Autonomous AI agents: multi-agent systems consist of AI agents that can work independently or collaborate with other agents to perform complex tasks. For example, these agents could autonomously collaborate on stress-testing scenarios or optimize complex investment strategies. Additionally, self-healing systems capable of real-time monitoring, debugging, and repairing themselves could, for instance, detect and correct discrepancies in market data or errors in algorithms, enhancing reliability and resilience. Onwards!


Fintool, Warren Buffett as a Service

As a dedicated Warren Buffett fan, I’ve made it a point to attend the Berkshire Hathaway Annual Meeting every year since I moved to the US. His personal values have greatly influenced my ethics in life, and I'm fascinated by his approach to business. I've written numerous blog posts over the years on investing, competitive moats, Intelligent CEOs, or whether to buy a house—all inspired by Buffett. Concepts like margin of safety and buying below intrinsic value were key to running and eventually selling my previous startup. When I sold my previous company—a legal search engine powered by AI—I invested a portion of my gains into BRK stock, trusting in Buffett’s methodology. But as someone who has spent over a decade working in AI, a question kept nagging at me: could an advanced language model do what Warren Buffett does? Jim Simons of Renaissance Technologies made over $100B in profits by using machine learning to analyze vast amounts of quantitative data and identify subtle patterns and anomalies that can be exploited for trading. He relied heavily on quantitative data, but what if we could do the same for qualitative, textual data now that LLMs have reasoning capabilities? Warren Buffett's letters, biographies, and investment decisions provide a wealth of knowledge about how to find, analyze, and understand companies. There are even textbooks on value investing that detail the step-by-step process. What if we could break down Buffett’s process into individual tasks and use an AI agent to replicate his approach? At Fintool, we took on that challenge. We deconstructed most of the tasks that Buffett performs to analyze a business—reading SEC filings, understanding earnings, evaluating management decisions—and we built an AI financial analyst to handle these tasks with precision and scale. In some fields, like law, language models are already performing well.
Ask an AI to draft an NDA or a Share Purchase Agreement (SPA), and it can quickly generate a document that’s almost ready to go, with minor tweaks. At worst, you might need to provide some context or feed in additional documents, but the model already knows the structure and intent. Ask ChatGPT to generate a Non-Disclosure Agreement (NDA) for a software company and it will do a great job. Ask ChatGPT to analyze the owner earnings over the past 5 years of founder-led companies in the S&P 500 and it will fail. Finance both demands the strengths of LLMs and exposes their weaknesses. Financial professionals require real-time data, but advanced LLMs like GPT-4 have a knowledge cut-off of October 2023. There is zero tolerance for errors—hallucinations simply aren't acceptable when billions of dollars are at stake. Finance involves processing vast numerical data, an area where LLMs often struggle, and requires scanning multiple companies comprehensively, while LLMs can struggle to effectively analyze even a single one. The combination of financial data complexity, the need for speed, and absolute accuracy makes it one of the toughest challenges for AI to tackle. Let's go back to our question: compare the owner earnings over the past 5 years of founder-led companies in the S&P 500. Our LLM Warren Buffett needs to do the following: Identify founder-led companies within the S&P 500 by reading at least 500 DEF 14A Proxy Statements (approximately 100 pages per document). Understand that Owner Earnings = Net Income + Depreciation and Amortization + Non-Cash Charges - Capital Expenditures (required to maintain the business) - Changes in Working Capital. Extract financial data from the past 5 years (net income, CapEx, working capital changes) for the 500 companies by reading at least 2,500 annual reports. Compute the data by comparing year-over-year owner earnings growth or decline, looking at trends such as increasing CapEx, expanding net income, or significant working capital changes.
Write a comprehensive, error-proof report. This is very hard: every step has to be correct. Institutional investors ask hundreds of questions like that. By reading Buffett's shareholder letters, biographies, and value investing textbooks, we broke down Buffett's workflow into specific tasks. Then, we started building our infrastructure piece by piece to replicate these tasks for institutional investors, allowing them to quantitatively and qualitatively analyze a business. I won't go into the hundreds of tasks we identified, but for instance, we created a "screener API" where you can ask qualitative questions about thousands of companies, like "Which tech companies are discussing increasing CapEx for AI initiatives?". With just one data type—SEC filings and earnings calls—we have 70 million chunks, 2 million documents, approximately 500GB of data in Elastic, and around 5TB of data in Databricks for every ten years of data. And that's just one part of the vast amount of data we handle!

From the Fintool company screener

We also built another API for our agents that can retrieve any number from any filing, along with its source. Additionally, we have an API that excels at computing numbers efficiently. For that challenge, we have partnered with OpenAI on a research project to use LLMs to extract every data point in SEC filings. Every week, we process 50 billion tokens, equivalent to 468,750 books of 200 pages each, or 12 times the size of Wikipedia. Our sophisticated data pipelines are designed to locate, verify, deduplicate, and compare every data point for accuracy and insight.

Fintool's "Spreadsheet Builder" answering a question on precise data points

We are continuously adding new capabilities to our infrastructure. Our Warren Buffett Agent will use these APIs around the clock to find investment opportunities, analyze them, and respond to customer requests. Although the final product is still in development, we already have a live version in use.
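Once the inputs are extracted from the filings, the owner-earnings arithmetic in the steps above is mechanical. A minimal sketch; the figures are invented for illustration, not real filings data:

```python
def owner_earnings(net_income: float,
                   depreciation_amortization: float,
                   non_cash_charges: float,
                   maintenance_capex: float,
                   working_capital_change: float) -> float:
    """Owner Earnings = Net Income + D&A + Non-Cash Charges
    - Maintenance CapEx - Changes in Working Capital."""
    return (net_income
            + depreciation_amortization
            + non_cash_charges
            - maintenance_capex
            - working_capital_change)

# Illustrative numbers (in $M), not taken from any real company.
years = {
    2019: owner_earnings(100, 20, 5, 30, 10),   # 85
    2020: owner_earnings(120, 22, 4, 35, 8),    # 103
}
growth = (years[2020] - years[2019]) / years[2019]
print(f"YoY owner-earnings growth: {growth:.1%}")
```

The hard part is not this formula: it is reliably extracting the five inputs from thousands of annual reports, which is where the retrieval and computation APIs come in.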
The results are promising. Fintool reaches 97% on FinanceBench, the industry-leading benchmark for financial questions for public equity analysts, far outpacing other models.

Delivering Practical Value to Customers Today

I refuse to let our website be a placeholder with vague statements like "we are an AI lab building financial agents." Instead, every part of our growing infrastructure is put to practical use and sold to real customers, including major hedge funds like Kennedy Capital and companies like PwC. Their feedback is essential in refining our product, which we believe will be a significant advancement for the industry. Today, customers use Fintool to ask broad questions like "List consumer staples companies in the S&P 500 that are discussing shrinkage" or niche questions like "Break down Nvidia's CEO compensation and equity package." They can also configure AI agents to scan news filings for critical information such as an executive departure or earnings restatements. This is only the beginning.

Why It Will Be Big

Institutional investors are among the most highly paid knowledge workers in the world. They make millions for their ability to sift through thousands of SEC filings, spot insights, and make calculated decisions on which companies to back. As Greylock noted in their article on vertical AI: “There are several attributes that make financial services well-suited to AI. The market is huge, with $11 trillion in market cap in the U.S. alone, and there's demonstrated demand for AI tools.” We couldn’t agree more. When you look at the daily responsibilities of these professionals, it’s easy to see where AI fits in. The work requires a mix of mathematical expertise and human judgment. Yet, a significant portion of their workload involves mundane, manual tasks—tasks that Fintool’s AI can automate and optimize.

A Massive and Profitable Industry

The financial research industry is one of the largest and most profitable software verticals in the world, dominated by a handful of key players.
Just take a look at the numbers:

Bloomberg: $12B in revenue
S&P Global: $12.5B in revenue, $6.6B EBITDA
FactSet: $1.8B in revenue, $842.5M EBITDA
MSCI: $2.5B in revenue, $1.7B EBITDA

These companies are highly successful because financial professionals are willing to pay a premium for tools that give them an edge. Active investment managers spend more than $30B per year on data and research services.

A Bloomberg Terminal

The Economics of AI in Finance

Adding to that, the unit economics of using AI are vastly better than hiring human analysts. At Fintool, we’re building software that can replace expensive knowledge workers, automating processes that once required teams of analysts. This matters even more given the industry's talent shortage. According to the venture firm NFX, “The biggest opportunities will exist where the unit economics of hiring AI are 100x better than hiring a person to do the job.” At Fintool, we fit perfectly into that framework. Here’s why:

Automatable Processes: From screening SEC filings to running detailed financial models, a large part of an investor's workflow can be done by AI.
Cost Savings: In an industry where top analysts are paid millions, the cost savings from using AI are astronomical.
Hiring Challenges: Recruiting top financial analysts is a competitive and costly process, often with long onboarding periods. AI can eliminate these pain points.
Tool Fragmentation: Today’s financial professionals juggle a wide array of tools. Fintool consolidates these into one powerful platform.
Vast Training Data: Fintool leverages proprietary data and vast amounts of public filings to create a unique advantage.

We’re creating Warren Buffett as a service—a platform that uses advanced language models to find financial opportunities at scale. With the unit economics favoring AI, and the immense potential to revolutionize how institutional investors work, we believe Fintool is positioned to be the next big thing in financial analysis.
If we succeed, we won’t just be building a tool to analyze businesses—we’ll be building the future of how financial professionals make decisions.


How to build a shitty product

Everyone wants the recipe to build a great product. But if you take Charlie Munger's advice to "always invert," you might ask: how do you build a truly shitty product? One that's confusing, frustrating, hard to understand, and makes you want to throw your computer out the window. Every organization sets out with the intent to build a good product. So why do so many of them end up creating something average? The answer lies in the structure and approach of the product team. A typical product team is composed of product managers, designers, and developers. Product managers (PMs) are the main touchpoint with users; they gather feedback, create specifications, and organize the roadmap. Designers create what they believe is a user-friendly UI/UX based on the PM specs and their interpretation of user needs. Developers, who may include data engineers, backend, frontend, and full-stack specialists, take these specifications and implement them into a product. Product teams often fall into the trap of designing and building based on assumptions or abstract user personas rather than real user interaction. PMs become gatekeepers of feedback, filtering and interpreting user needs before they ever reach designers or developers. By the time insights get translated into product decisions, they’ve lost touch with what users actually experience. This lack of direct feedback leads to products that don’t solve real problems because the team is too insulated from the people they're building for. Too often, product specifications are shaped by internal company constraints—usually engineering limitations—rather than customer needs. As Steve Jobs famously said, "You've got to start with the customer experience and work backwards to the technology." Inverting this process, where the tech defines what’s possible instead of the customer's needs, is a fast track to building something nobody wants.
Over-specifying also kills innovation because developers are reduced to coders implementing someone else's vision, without any flexibility to improve or innovate. The typical product team works sequentially: PMs specify, designers design, and developers build. This waterfall mentality feels efficient on paper but is inherently rigid. When each step is done in isolation, the process becomes fragile and slow to adapt to new information. The longer each team works in their silo without iteration, the more likely the end product will miss the mark. Who's ultimately responsible for the product's success or failure? PMs? Designers? Developers? Bureaucracy tends to dilute responsibility, and when everything is driven by consensus, mediocrity often follows. Consensus avoids disasters, but it also avoids greatness. True ownership, where someone is accountable for both success and failure, is missing. Some teams get so caught up in Agile, Scrum, or other project management frameworks that they forget the ultimate goal is building something users love. Meetings, standups, and sprint planning become bureaucratic rituals that distract from the real work. To build something truly great, you need craftsmen. Product builders who are deeply invested, who care about every detail, and who take responsibility from beginning to end. The builder has to be as close as possible to the customer. Talk to them, visit them in person, answer support queries, watch them use the product, and demo it to them. This kind of empathy—truly putting yourself in the customer's shoes—is rare. Builders also need to understand the customer’s underlying problem, not just the feature requests they articulate. Customers may ask for specific features, but often they don't know the best solution; they just know their pain points. The job of a great product builder is to uncover the real issue. 
As Paul Graham once said, " Empathy is probably the single most important difference between a good hacker and a great one. " You need to understand how little users understand, and if you assume they’ll figure it out on their own, you’re setting yourself up for failure. Builders need to use the product they create. That’s why B2C products are often better than B2B ones—builders use what they build and feel the pain of its shortcomings. Most great B2B products, like Vercel or GitHub, are made for developers by developers. It’s much harder to eat your own dog food when building vertical applications for niche users, like lawyers or doctors, but the best craftsmen find a way. The best products come from small, tight-knit teams with clear responsibility. When it’s easy to identify who’s responsible, it’s easier to make great things. Small teams can iterate quickly, and greatness comes through iterations. The boldest approach is to have the same person design, build, and refine the product. With AI coding tools, it's now possible to have a good engineer with taste and empathy that goes from listening to users to implementing a solution, without the need for PMs or designers. Instead of trying to launch a complete, polished product out of the gate, focus on building something small and functional. Once you have that, get it into the hands of users and iterate quickly based on their feedback. The magic happens in iteration, not in perfectionism. Real users will help you refine your ideas and identify what’s actually valuable. The faster you can cycle through feedback loops, the better your product becomes. Building a delightful product for a few core users is often better than trying to build something for everyone. By focusing on a specific audience, you can deeply understand their needs and create something truly valuable. A product that solves real problems for a small, dedicated group is more likely to gain traction and eventually appeal to a wider audience. 
When you build for core users, you create passionate advocates who can help drive growth organically. Paul Graham's "taste" metaphor from Hackers and Painters applies here: you should always strive for good taste in both code and design, removing unnecessary complexity. Simplicity doesn’t mean lacking features; it means that every feature has a purpose, and every line of code serves the user. Good taste in design and code means prioritizing what truly matters to users and avoiding bloat. A simple, elegant product is not only easier to maintain but also more delightful to use. It's also essential to kill features over time—removing what is no longer needed or valuable ensures the product remains focused and effective. Great products are created by small teams, yet team bloat is the pitfall of most companies. Big teams introduce layers of complexity, miscommunication, and slow decision-making. Small teams are nimble, communicate better, and move faster. When a team is small, it’s easier to stay aligned on the mission, and everyone has a clear stake in the product’s success. It also prevents diffusion of responsibility—everyone is accountable. This sounds ideal, but it's not the default approach—especially in large companies. Why? Because big companies prefer reducing the standard deviation of outcomes. Only a small percentage of developers can design great software independently, and it’s difficult for management to hire them - often they don't like to work for bureaucratic organizations. Instead of trusting one brilliant craftsman, most companies opt for a system where design is done by committee and developers just implement the designs. This approach reduces uncertainty. Great results are traded for predictably average ones. If a brilliant craftsman leaves, the company could be in trouble, but if a committee member leaves, it doesn't matter. There’s redundancy in every role. Take Google—you could fire half the workforce, and it would barely affect product quality.
But if you fired someone like Jony Ive from Apple’s small design team, there would be no iPhone. Similarly, look at Telegram Messenger—one of the best digital products ever. They have close to 1 billion active users and yet a small team of just 30 engineers. Pavel Durov takes all the customer-facing decisions while his brother and co-founder, Nikolai, handles decisions regarding infrastructure, cryptography, and backend. They've created amazing results, but if Pavel, Nikolai, or key programmers were to leave, the product would stagnate. Big companies dampen oscillations; they avoid disaster, but they also miss the high points. And that’s fine, because their goal isn’t to make great products—it's to be slightly better than their competition. As a reminder, my new startup is called Fintool . We are building Warren Buffett as a service, leveraging large language models (LLMs) to perform the tasks of institutional investors. We follow an approach that emphasizes small teams with clear responsibilities, a lack of rigid roles like product managers, and a relentless focus on speed and iteration. We keep our team extremely lean, with each member responsible for a specific section of the product. For example, we have one team member focused on data engineering to ingest terabytes of financial documents, another on machine learning for search, retrieval, and LLMs, and a full-stack engineer working on the product interface. By assigning clear ownership to each team member, we ensure accountability and expertise in every aspect of our product. Our accountability is customer-first, with engineers often emailing and interacting directly with customers. This approach means customers know exactly who to blame if something doesn't work. We believe high-performing teams do their best work and have the most fun in person. Remote work is highly inefficient, requiring the whole team to jump on Zoom meetings, write notes to share information, and lacking serendipity. 
Serendipity is the lifeblood of startups—one good idea shared spontaneously at the coffee machine can change the destiny of the company. Additionally, we value each other's company too much to spend our days in boring Zoom calls. We encourage every craftsman on our team to talk directly with customers, visit them in person, and implement the best solutions. We value discussions and brainstorming, but we minimize meetings to maintain fast iterations and provide high freedom for team members to choose their approach. We follow the "Maker's Schedule," as described by Paul Graham: makers need long, uninterrupted blocks of time to focus on deep work. A typical maker’s day is structured around productivity and creativity, where interruptions or frequent meetings can be disruptive (I hate meetings). We value speed and push to production every day. One of our core values is to "Release early, release often, and listen to your customers." Speed matters in business, so we push better-than-perfect updates to customers as soon as possible. We believe mastery comes from repeated experiments and learning from mistakes—it's about 10,000 iterations, not 10,000 hours. Another company value is "Clone and improve the best." We don't reinvent the wheel; we enhance proven successes. We are shameless cloners standing on the shoulders of giants. If a design or an existing pattern works well for our use case, we will copy it. Using AI tools, like Cursor, the AI code editor, is mandatory at Fintool. We believe AI provides a massive productivity advantage. Most people prefer sticking to their old ways of working, but that's not how we operate. We won't hire or retain team members who aren't AI-first. With the speed of AI-assisted front-end coding, we believe that traditional design tools like Figma are becoming less necessary. Anyone can create a nice-looking Figma mockup until they start implementing and discover UX challenges.
By leveraging a standard component library like Shadcn UI and using tools that convert prompts directly into interfaces, we can iterate faster and achieve better outcomes. A skilled engineer with good taste can design efficient and visually pleasing interfaces without the need for a designer. It keeps the team smaller and increases the speed. Our approach at Fintool focuses on leveraging the strengths of a small, empowered team, with each member deeply connected to the product's success. This method allows for rapid iteration, close customer relationships, and the ability to deliver a product that truly meets user needs. However, the main drawback is the high dependency on our people. If a key team member is on holiday or leaves the company, progress slows down significantly. We also rely heavily on hiring exceptional individuals—those who are not only talented but also open-minded, who like to interact with customers, and who have a craftsman's mindset and the discipline to work hard. Finding such people is extremely challenging, but it's essential for building something truly great. It's hard but worth it. We are hiring. "There is no easy way. There is only hard work, late nights, early mornings, practice, rehearsal, repetition, study, sweat, blood, toil, frustration, and discipline." - Jocko Willink


San Francisco Life: Insider Tips ♥️

I moved to San Francisco in August 2021, and it quickly became my favorite city. I love it so much that even when I go on vacation, I’m always excited to come back—sometimes I wish I didn’t have to leave at all. There’s so much to adore about this place: the perfect, temperate weather, the proximity to both beaches and stunning natural spots, the walkable and bike-friendly streets, the charming neighborhoods filled with colorful homes, the incredible food scene, and of course, being surrounded by some of the smartest people on the planet. The green zone is hands down the best part of San Francisco. It’s walkable, quiet, beautiful, and conveniently close to everything—grocery stores, restaurants, you name it. The blue zone is great too, though it has a more upscale feel and is a bit less walkable due to the hills. Still, it has its charm, just with a different vibe. The yellow zone is more affordable, but I wouldn’t recommend it unless you’re an avid surfer—it’s foggy for about half the year. As for the red zone, I’d advise staying away, as it’s at the heart of the city’s drug crisis. Other neighborhoods are fine, a bit more suburban and not quite as close to the action, but they offer a good balance of affordability and quality living. 
Where to eat

French: Ardoise, Routier
Pasta: Bella Trattoria, The Italian Homemade Company
Pizza: Tony's
Steak House: House of Prime Ribs
German: Suppenküche
Mediterranean: Beit Rima (Cole Valley), Kokkari
Brunch: Le Marais Bakery, Wooden Spoon
American Breakfast: Pork Store Cafe, Devil's Teeth Baking Company
Crêpes: La Sarrasine
Croissants: Arsicault (the one on Arguello; go during the week to avoid an hour-long line), Tartine (good, but not as good as Arsicault)
Burrito: Underdogs, La Taqueria
Ramen: Taishoken, Marufuku
Sushi: Ebisu
Ice cream: Salt and Straw, The Ice Cream Bar, Philmore Creamery, Bi-Rite Creamery
Coffee shop: Cafe Reveille, Sightglass, The Mill
Hot Chocolate: Dandelion
Bread: The Mill, Jane Bakery, Thorough Bread

Batteries to Bluffs Trail

Start at the Baker Beach Sea Cliff Access (12 25th Ave, San Francisco, CA 94121) or park here if you have a car. Walk Baker Beach and then climb the Sand Ladder. You will then turn left and start the Batteries to Bluffs Trail until the beautiful bridge view at Battery Boutelle. The trail is amazing. Be ready to climb a lot of stairs! I’ve hiked there more times than I can count and I still love it.

Lands End Trail

I recommend starting here and walking to the Lands End Labyrinth. The views are absolutely stunning, and it’s hard to believe that you are still in a major city! Most of the trail is kid-friendly and works if you have a stroller.

My favorite beaches

Baker Beach: where I like to fish, picnic, and play Spikeball with friends on a sunny afternoon. I love the incredible view of the bridge and the fact that it's less windy than Ocean Beach.

China Beach: a cozier, smaller version of Baker Beach. It’s slightly less accessible since you have to go down a hill, but there is parking at the top. I like it, even if I prefer Baker because the bridge feels closer. What bothers me a bit about China Beach is the abandoned old lifeguard station - so much wasted potential!
Ocean Beach: definitely my number one beach to watch the sunset and enjoy a good bonfire! My favorite is to bike there and stop at Fulton/Great Highway. I’ve been there so many times and it never disappoints. Please check fog.today first to verify that there is no fog at the beach.

Favorite Bike Rides

Hawk Hill: by far my favorite; I sometimes bike there twice a week. Unless you are an experienced cyclist, you will need an electric bike. I like to rent them from SF Wheels or Unlimited Biking for $80 for the whole day. Climbing Hawk Hill offers the best view of the bridge. The best part? Once you reach the top, the downhill is one of the most stunning rides in California.

Surfing

I’m a beginner wing foiler, and one of the best spots in the U.S. is Crissy Field. I recommend parking at Crissy Field South Beach. If you are more into regular surfing, Ocean Beach is a great spot for experienced surfers. If you are new to surfing, just drive to Pacifica, which is an easier spot!

Self-driving car: Waymo
Bike around neighborhoods: Castro, Duboce Triangle, Hayes Valley, Cole Valley up to Ocean Beach via the Golden Gate Park
City hikes: Mount Sutro to Twin Peaks, Baker Beach Coastal Trail, Lands End Trail
Cable Car: map
Sunrise: go to Corona Heights or Tank Hill
Alcatraz Island: book a night tour
Museums: Academy of Sciences (Thursday night nocturne; they have cocktails and a DJ)
Sunset: verify on fog.today that it’s not foggy and go to Baker Beach or Ocean Beach
Parks: Dolores, bike through the immense Golden Gate Park, walk in Crissy Field
Bouldering: Mission Cliffs, Movement
Surfing: take a lesson in Pacifica or go to Ocean Beach if you are experienced
Tennis: there are free tennis courts all over the city, like in Buena Vista, or you can book a court in Golden Gate Park
Jiu-jitsu: Ralph Gracie


Burning Billions: The Gamble Behind Training LLM Models

Why don’t you train your own large language model? I've been frequently asked this question over the past year. I wrote this piece in September 2023 but never published it, thinking the answer was obvious and would become even more apparent with time. I was asked the same question twice last week, so here is my perspective. As a reminder, Fintool is an AI equity research analyst for institutional investors. We leverage LLMs to discover financial insights beyond the reach of human analysis. Fintool helps summarize long annual reports, compute numbers, and find new investment opportunities. We have a front-row seat to witness how LLMs are revolutionizing the way information is organized, consumed, and created. Training large language models is challenging. It requires billions of capital to secure GPUs, hundreds of millions to label data, access to proprietary data sets, and the ability to hire the brightest minds. Vinod Khosla, an early OpenAI investor, estimated that “a typical model in 2025 will cost $5-10b to train.” Only hyperscalers like Google, Meta, or Microsoft, who are already spending $25B+ in CAPEX per year, can afford this game. A company like Meta can increase its CAPEX guidance by 3+ billion dollars to train frontier models, and that’s not a big deal considering their $43.847B free cash flow per year. Good luck competing with those guys! The additional challenge is the requirement to always train the next frontier model to stay in the race. If your model is not first, it might as well be last. Users and customers gravitate towards the best, leaving little market for inferior models. It’s a power law where the model with the optimal mix of intelligence, speed, and cost-effectiveness dominates. It’s a multi-billion dollar recurring expense, and the window for monetization is a function of the little time your model can stay at the top of the leaderboard before being outperformed.
Sequoia Capital recently emphasized that an estimated $600 billion in revenue would be necessary to justify the massive investments in AI data centers and GPUs. In my view, as in most technological booms, a large portion of the money invested will ultimately be wasted, similar to the dot-com bubble that led to excessive investment in telecom infrastructure. The telecom boom saw massive capital inflows into building out networks and laying vast amounts of fiber optic cable. Companies thrived initially, but as the bubble burst, it became evident that much of the infrastructure was redundant, leading to significant financial losses. Global Crossing filed for bankruptcy with $12.4 billion in debt, while WorldCom went bankrupt with $107 billion in largely worthless assets. Similarly, the current surge in investment in LLM infrastructure risks leading to overcapacity and inefficiencies. While a few key players may achieve significant rewards, many others will likely face considerable financial setbacks. Most companies entering the LLM race fail despite massive investments. Bloomberg's effort, BloombergGPT, trained on 363 billion tokens, was quickly outperformed by GPT-3.5 on financial tasks. Even well-funded startups struggle: Inflection, despite raising $1.525 billion, was acqui-hired by Microsoft. Adept, with $415M in funding, is rumored to be exploring a sale, and models developed by Databricks, IBM, or Snowflake are today absent from top LLM rankings. When I explain why Fintool doesn’t train its own LLM, the pundits always ask: “Well, in that case, why don’t you fine-tune your model on your vertical?” The rationale for fine-tuning is the hope of getting better quality on a set of tasks while reducing cost and increasing speed, because fine-tuned models are smaller than generalist models. In my opinion, this approach is not yet yielding results worth the millions invested.
For instance, OpenAI developed Codex, a model fine-tuned on a large corpus of code, and that model was outperformed by GPT-4, a large generic model. The same was true for fine-tuned text-to-SQL models, which were better on some narrow benchmarks but got outclassed by the next general model release. So far, every fine-tuned model has been outclassed by the next big generic model. The rapid decline in LLM prices, coupled with significant improvements in quality and latency, makes such investments increasingly unjustifiable, in my opinion. If you don’t like losing millions and billions of dollars, it’s better to stay away from this game. For most organizations, training or fine-tuning is driven by FOMO and a lack of understanding of technological trends. Only a few players, like B2C companies such as Character.ai, which processes 20,000 queries per second (approximately 20% of Google’s search volume), require their own models. LLMs are such a commodity that a leaked Google memo stated, “We have no moat, and neither does OpenAI.” It’s fairly easy to switch models, and the fact that open-source models are getting better accelerates the commoditization. There is still a premium for the most intelligent model, but most tasks don’t require the best intelligence. Commoditized tasks are already worth zero, while harder tasks are worth something, but not much. Training LLMs and selling intelligence as a service is not a great business. The research firm FutureSearch estimated that OpenAI makes $2.9B from ChatGPT products versus $510M a year from the API. The fact that the API of the leading provider is only 17% of its revenue exemplifies that most of the value creation and value capture happens at the application layer. Application layers like Fintool are developing model-agnostic infrastructure tailored to specific use cases, leveraging improvements in any AI model.
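To make the model-agnostic idea concrete, here is a minimal sketch, not Fintool's actual code; the `Model` type, the model names, and the prices are invented for illustration. The application codes against a tiny interface and treats the underlying LLM as a swappable commodity:

```python
# Hedged sketch of a model-agnostic layer: the provider-specific call is
# injected, so the application never hard-codes a single vendor's SDK.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    cost_per_1k_tokens: float
    complete: Callable[[str], str]  # provider-specific call, injected

def cheapest(models: list) -> Model:
    """Switching models becomes a one-line decision, not a rewrite."""
    return min(models, key=lambda m: m.cost_per_1k_tokens)

# Stub backends stand in for real provider SDKs:
models = [
    Model("frontier-model", 0.03, lambda p: f"[frontier] {p}"),
    Model("open-source-model", 0.001, lambda p: f"[oss] {p}"),
]
print(cheapest(models).name)  # prints "open-source-model"
```

Because the provider call is injected, moving from one vendor to another, or to an open-source model, touches one line of configuration rather than the application logic.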
Just as Charlie Munger practices "sit on your ass investing," waiting for the market to recognize the intrinsic value of his investments, I practice "sit on my ass product building," where I focus on creating complex workflows that meet specific user needs while counting on AI models becoming better, faster, and cheaper. When we started Fintool, the cost of analyzing an earnings call for a complex task was roughly $1 with GPT-4. A year later, the cost for GPT-4 has dropped by 79.17%, and the model is significantly smarter and faster. By running open-source models, we dropped the price to less than $0.01. So, without wasting our time and money on training or fine-tuning, we got better quality and speed with a 99.9% price drop. What’s not to like?
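The cost trajectory above is easy to sanity-check. The figures below simply restate the post's numbers, taking "less than $0.01" as roughly $0.001, consistent with the 99.9% claim:

```python
# Back-of-the-envelope check of the price drops quoted in the post.
gpt4_launch_cost = 1.00                             # ~$1 per earnings-call analysis at GPT-4 launch
gpt4_later_cost = gpt4_launch_cost * (1 - 0.7917)   # after the 79.17% price drop

open_source_cost = 0.001                            # "less than $0.01" via open-source models

total_drop = 1 - open_source_cost / gpt4_launch_cost
print(f"GPT-4 a year later: ${gpt4_later_cost:.4f}")  # ~$0.2083
print(f"Total price drop:   {total_drop:.1%}")        # 99.9%
```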


What We Learned Building the Largest GPT-Telegram Bot

Hello friends, I co-founded Doctrine, one of the largest AI legal search engines, and despite working on a search product for years, ChatGPT blew my mind. The underlying technology, commonly referred to as a large language model (LLM), is as revolutionary as the printing press or the internet.

Thanks for reading Nicolas Bustamante! Subscribe for free to receive new posts and support my work.

I was initially skeptical about yet another wave of AI hype, but the fusion of chat interfaces with LLMs got me excited. To understand the technology, my YC co-founder Edouard and I built Lou, the most popular GPT-4-powered chatbot on Telegram Messenger. With thousands of active users posing tens of thousands of questions daily, it became the ideal platform to understand the current state of the technology and explore potential use cases. Let me tell you what I have learned. Chat-based interfaces are the future of the web. In most cases, it's easier to ask a question in a chat and get an answer than to browse the web and read websites. It's a paradigm shift.

Search paradigm: keywords -> click on several links -> read webpages -> answer
Chat paradigm: question -> answer

It means most users no longer need to go on Google or visit a website. Google! Websites! It's the end of the internet as we know it. There are days when I don't search at all; I chat. I ask Lou all my questions, such as:

- Show me the popular API endpoints for the Telegram bot API.
- Write a short text message to my landlord to give him my notice.
- Recommend me a good book about Charlie Munger.

Furthermore, Lou offers a more intimate experience compared to Google. We discovered that some users even refer to Lou as their "best friend." Essentially, it's like having a brilliant friend available to help you around the clock. As a result, information retrieval has become a deeply personal experience.
It wouldn't be surprising if, in the near future, people forge strong friendships or even romantic connections with their AI companions. As voice and image generation technologies advance, the possibilities are virtually limitless. Operating an LLM-powered chatbot has led me to believe that people will increasingly rely on chat interfaces rather than traditional search. Chatting effectively consolidates keyword searching, link clicking, and website browsing into a single process. This approach is faster and more personalized and delivers higher-quality results. Naturally, chat models have some limitations at present. They lack access to live data, possess no memory, exhibit poor formatting, may generate irrelevant information, and do not suggest follow-up questions. However, these issues are solvable. We plan to release an updated version of Lou that enables users to access news, make purchases, check stock prices, and explore a host of other capabilities. As a result, I foresee chat-based interfaces capturing a substantial portion of the market share from Google. This shift is already evident, as ChatGPT reached 100 million active users within a few weeks. To provide context, Bing, which launched in 2009, only achieved 100 million daily active users in the previous month. Who will become the next Google? On one side, OpenAI holds all the cards. However, they may choose to concentrate on developing an infrastructure company that enables artificial general intelligence (AGI), rather than pursuing a B2C startup. On the flip side, tech giants like MAMAA face a daunting innovator's dilemma due to their bureaucratic nature. Embracing the chat interface could significantly reduce their ad search revenue. Nevertheless, they possess a captive user base, control distribution channels, operating systems, and even produce hardware! It's hard to tell who will do it, but it will transform the web. 
The global, horizontal chat interface is poised to dominate the internet in ways Google could never have imagined. This chat will serve as a super aggregator, maintaining direct relationships with users and enjoying near-zero marginal costs for onboarding new users while commoditizing suppliers. User interactions with the internet will increasingly occur via chat, compelling suppliers (all websites) to adapt their architecture to align with chat APIs. Why would anyone visit Zillow to find an apartment, Booking to reserve a hotel, or NerdWallet to compare insurance when the super-chat can provide answers and facilitate direct purchases? Just as these services previously optimized their products to fit Google's algorithms, they will now tailor their offerings to suit the chat interface. Commoditization will reach unprecedented levels, as, in many cases, websites will no longer have differentiated value propositions. The super-chat will prioritize the fastest, most affordable, and highest-rated options, driving commoditization and reducing profit margins to benefit consumers. Only the best horizontal player will withstand this shift. I also believe that AI chat solutions integrated vertically in the fields of legal, finance, and healthcare will evolve into monster businesses. I also anticipate a gradual transition from text-based to voice-based interfaces. Why type when you can converse with your AI assistant? In the long run, we may not even need phones, as earbuds and smart glasses could suffice. All right! Moving away from speculative ideas, let me share our insights from a technical perspective. The most remarkable experience is that GPT generates a significant amount of code, shortening our product development cycle. You can literally ask it to describe the Telegram API and write Python code to create a bot. How wild is that? We are currently dramatically underestimating the productivity boost this technology will bring humanity.
Another great thing is that GPT models are excellent at various NLP tasks, from coding to translating to building a recommendation system. Instead of using several machine learning models, we can use one API for almost everything. GPT outperforms most of the models out there, regardless of their specialization. For instance, GPT-4 outperforms Codex, an OpenAI model fine-tuned to write code. You might think it's expensive to run all your backend tasks on GPT, and you're partially correct. Yes, it's expensive, but not for long. It's a contrarian take, but I think LLMs will quickly be commoditized. The models' performance tends to plateau at a certain point. For tasks like finding an entity in a document or classifying questions, GPT-4 excels, but so do numerous open-source models. As time goes on, the quality and performance of these freely available open-source models keep improving, steadily narrowing the gap between them and their GPT counterparts. This progress promotes a competitive environment where cutting-edge technology becomes increasingly accessible to a wider audience. Consequently, the cost of using such models is expected to decline over time. OpenAI's recent substantial price reduction for its GPT-3.5 API serves as an example of this trend. Moreover, each day sees the rise of open-source models achieving GPT-like performance in specialized areas. It's likely that, in the near future, most chat interfaces will employ multiple models concurrently, directing queries to those that provide the most accurate responses at the most competitive rates. I foresee that most tasks performed by large language models (LLMs) will be available at no cost, except for highly complex tasks. The crucial factor will be maintaining a direct relationship with users and having access to a comprehensive, private dataset. Ok, now, something weird! My most peculiar experience involved prompt engineering.
Giving the model guidelines, such as specifying a particular formatting type, is done not through code but with plain English instructions. You communicate with the model in the same manner you would with a human, not a machine! For example, our prompt for our "code assistant" might be something like: "As an advanced chatbot Code Assistant, your primary goal is to assist users to write code. This may involve designing/writing/editing/describing code or providing helpful information. Where possible you should provide code examples to support your points and justify your recommendations or solutions. Make sure the code you provide is correct and can be run without errors. Be detailed and thorough in your responses. Your ultimate goal is to provide a helpful and enjoyable experience for the user. Format output in Markdown." The paradigm shift is remarkable; the most potent coding language has now become English, not JavaScript or Python! However, I should note that I'm not entirely convinced about the long-term potential of prompt engineering in its current form. We extensively used prompt engineering with GPT-3.5 but later discovered that GPT-4 was so proficient that much of the prompt engineering proved unnecessary. In essence, the better the model, the less you need prompt engineering or even fine-tuning on specific data. What I find even more intriguing is the idea that the model could auto-correct and improve itself, much like a living organism. As LLMs evolve, they have the potential to become increasingly autonomous, enabling them to auto-correct and improve themselves over time. One way this could be achieved is through continuous learning and adaptation, where LLMs refine their responses based on user feedback and real-time data. By giving them access to APIs, they will interact with various information sources to expand their knowledge base and maintain up-to-date information.
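The prompt-engineering pattern described above can be sketched in a few lines. This is a hypothetical illustration, not Lou's actual code: the "program" is an English system message, paired with the user's question before being sent to a chat endpoint.

```python
# Hypothetical sketch: the "program" is an English system prompt, not code.
CODE_ASSISTANT_PROMPT = (
    "As an advanced chatbot Code Assistant, your primary goal is to assist "
    "users to write code. Where possible you should provide code examples "
    "to support your points. Format output in Markdown."
)

def build_messages(question: str) -> list:
    """Pair the English instructions (system role) with the user's question."""
    return [
        {"role": "system", "content": CODE_ASSISTANT_PROMPT},
        {"role": "user", "content": question},
    ]

# These messages would then be passed to a chat API, e.g. (illustrative):
#   client.chat.completions.create(model="gpt-4", messages=build_messages(q))
```

Changing the assistant's behavior means editing the English string, not the surrounding code, which is exactly the paradigm shift described above.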
Over time, these advancements could result in self-sufficient AI agents capable of proactively learning from their environment and autonomously enhancing their performance, thereby transforming how we interact with technology and the digital world. Please note that this is not merely science fiction but rather an engineering challenge poised to be solved in the coming months. We live in such an exciting time! In conclusion, building Lou, the largest GPT-4-powered chatbot on Telegram, has provided invaluable insights into the potential of large language models and chat-based interfaces. The paradigm shift from keyword-based search to chat-based interactions is imminent, and it will redefine the way users engage with the internet. It has so far been an incredible learning experience. We will probably switch to a vertical AI chat product in the future, as it better fits our respective backgrounds.


The End of My Crypto Explorations

My crypto journey started in late 2012 when I encountered Bitcoin while reading about the free banking system for my high school thesis. As a fan of Hayek and Von Mises, I was fascinated by the idea of a currency free from the government's manipulation. I downloaded Bitcoin Core (the blockchain was less than 10GB!), made some transactions, and looked for things to buy. There were few people to transact with and nothing interesting to buy beyond the stuff on Silk Road. Bitcoin was volatile; its price collapsed from $1000+ in late 2013 to $200ish in August 2015. I watched the space on and off, perceiving it as a gigantic casino. Remember Namecoin, MaidSafe, Bitconnect, and Bitshares? All these coins had billions in volume and later disappeared, leaving investors penniless. I started Doctrine, an AI company operating in the legal industry (think Bloomberg for lawyers). I witnessed the 2017 crypto bubble, with thousands of projects raising tens of millions for non-existing products tackling non-existing problems. I was sickened by these pump-and-dumps and delighted to use AI to create value for thousands of customers! I ignored the space until 2018, when our first teammate, Antoine Riard, started to contribute to the Lightning Network, a protocol to make instant and cheap Bitcoin payments. Bitcoin had survived, and my friends kept building on Ethereum despite a 90% drop in price. Speculation had dried up, and promising use cases were emerging. I started running a Bitcoin and Lightning node on weekends to understand the state of the technology. Fast forward, I moved to San Francisco and decided to explore the space in 2022. I was thrilled to join revolutionary young builders working on decentralizing the Internet and improving our financial system. As a Bitcoin enthusiast, I looked at infrastructure products to sustain the Lightning Network, like node managers or stablecoins on Lightning via RGB.
It was tough because no one had yet built a successful company on the Lightning Network. First, it will take years, if not decades, to develop the network - a thing I've learned running the SF Lightning Dev Meetup - and, second, most people don't want to pay with crypto, especially when current payment systems are improving rapidly (see UPI in India or Pix in Brazil)! Most friends were building on Ethereum and Solana, so I looked at these options. I made it clear that I wasn't interested in building for speculative use cases. In my opinion, trading is a negative-sum game in which unsophisticated market participants lose their savings while exchanges and intermediaries capture gigantic, and often hidden, fees. The unfortunate truth is that crypto's current killer feature is the creation of a global, permissionless, gigantic casino of worthless digital assets. This is quite far from the ideas of decentralization, privacy, and unstoppable digital assets we read about in The Sovereign Individual. Those ideals are worth fighting for, so I started to believe that speculative use cases were temporary anomalies. Yes, token pump-and-dumps were disgusting, but tomorrow we will have equity tokens that are way better than today's paper shares. Yes, NFT collections of ugly profile pics are useless - a guy bought a picture of a rock for $1.8M lol - but it's the premise of NFTs as digital property rights on a decentralized and open ledger! That was my thought process for accepting today's crypto industry, but it wasn't easy. I met daily with crypto founders raving about their latest multi-million fundraising round or their secret NFT mints in which they flipped a jpeg for thousands of dollars. I asked questions about product usage, pain points solved for customers, and the business model, and I had never felt so old in my life! I was a 27-year-old tech founder, but I felt like a 70-year-old guy asking what seemed like irrelevant questions.
An avalanche of money from investors and retail traders can easily fake product-market fit. I had the great opportunity to help the founders of Nanoly, the largest data aggregator in decentralized finance. Hundreds of thousands of retail investors visited the website to find the best yields for their digital assets. Yield farming was all the rage, with juicy APYs of a couple hundred percent. Tokens were created out of thin air to reward token liquidity providers. I met full-time yield farmers and people who worked full-time launching tokens to feed this loop. WTF... Ultimately, the high-yield farming market collapsed, driving dozens of companies and funds into bankruptcy. Most of my contacts moved to NFTs, creating several collections of profile pictures and selling them to gamblers. Again, this use case sucks, but the promise of NFTs as unique digital property rights stored on a worldwide and permissionless ledger is interesting. I dug into crypto infrastructure products but came to a harsh realization. Crypto speculation is a vast and fast-growing market, while the other use cases are small. I've done hundreds of customer interviews and learned that most crypto organizations weren't buying crypto software, which explains why crypto products, from analytics to dev tools, struggle to generate revenue. I understand the narrative that these startups are waiting for the market to grow, but the difference between the Internet in 1999 and crypto today is that Amazon or Netflix had viable customers and growing revenue back then. The bear market helped me have honest conversations with founders. Most of them have raised millions and enjoyed the hype but are now wondering if they will ever reach product-market fit. Talking about fundraising: I got more offers while exploring crypto, and with better terms, than I could have dreamt of with my previous web2 startup (with dozens of millions of ARR, fast-growing and profitable)!
I think there are monster businesses to create in crypto around the casino use case. Anything that reduces the cost of trading or makes trading more convenient will be a big business. Most great businesses in the space are wallets with a trading feature (Metamask, Phantom, Fireblocks), exchanges (Binance, FTX, Opensea), fiat on-ramps (Moonpay, Transak), etc. Many founders I've met are iterating in crypto, hoping to launch a startup unrelated to trading. I made the same mistake of finding niches only to realize that the market wasn't there. If there are no viable customers and no traction, then there is no market - even if the idea seems valuable for humanity. In short, it's a good-looking technology looking for problems to solve. Note that crypto is complex; it takes months, if not years, to get a decent understanding of the tech stack. Adding to that difficulty, it's evolving fast, so you have to keep up with the latest developments - proof of stake, sharding, zk-rollups - making developing in the crypto industry harder than in web2. Exploring crypto from a tech perspective is fascinating and takes a long time, but what's today's use case beyond speculation? Even Vitalik Buterin, in a Time interview, recognized that "the peril is you have these $3 million monkeys, and it becomes a different kind of gambling," adding that "there definitely are lots of people that are just buying yachts and Lambos" and "those are often far from what's actually the best for the world." I had a fantastic time and met very talented builders and explorers; many of them will build great companies inside or outside this industry. Crypto combines the best dreamers pushing the frontier of a decentralized civilization and the worst snake oil scammers. The positive energy in the space and the amount of creative destruction are breathtaking. I have no doubt that the industry will mature over the following decades!
I've decided to stop exploring crypto and focus on other sectors and technologies that suit me better. It's hard to convey the stress and anxiety caused by the constant ups and downs of being a founder. I want to thank all my friends and family for being by my side in my entrepreneurial journey! If you found this article valuable, please consider sharing it 🙌


Startups Selling Sand in the Desert

Today's story is about startup guys who work extra hard, match all their competitors' features, lower their prices, increase the scope of their free plan, spend millions to generate pennies, and give everything to kill their rivals because, after all, it's a war for survival! These guys are selling sand in the desert. Most entrepreneurs compete to be the best. They think there can only be one winner, like in war or sport. To win the competition, rivals must be eradicated through relentless execution, price warfare, and constant product imitation. Those entrepreneurs live in what economists call pure and perfect competition. The latter concept refers to a competitive state where all companies sell equivalent products, driving profits down to the marginal cost of production. I confess that as a consumer, I love this zero-sum game. I remember traveling for free in 2015 in San Francisco when Uber and Lyft engaged in a price war. The same happened in Paris in 2017, where I ate at no cost for weeks when food delivery companies were involved in a race to the bottom. What more could one want in life than free food and free transportation funded by VC money? Compete harder, please! The ones who compete to be the best are losers. Because competition is for losers. This form of competitive convergence is the path to mutually assured destruction. Unlike sport, there can be multiple winners in business. One should aim at being the only one selling water in the desert. The antidote to the disease of competition is a unique and singular value proposition. Michael Porter is one of the brightest minds regarding competitive analysis. His articles What Is Strategy? (1996) and The Five Competitive Forces That Shape Strategy (2008), as well as his many books, are excellent. If you don't have time to read his complex work, I recommend reading Understanding Michael Porter by Joan Magretta.
Porter's solution to the competitive dilemma is to thrive on being unique, not the best, focusing on creating value, not beating rivals. He defines strategy as "building defenses against the competitive forces or finding a position in the industry where the forces are weakest." He identifies five forces that determine an industry's structure, indicating its competitiveness and thus its profitability:

- The intensity of rivalry among existing competitors. Sometimes, rival firms are irrationally committed to the business, and financial performance isn't the primary goal. For instance, FANG companies often provide products for free, whatever the cost, to preserve their market position. What I worry about most is dumb guys burning millions hoping to kill competitors. Look at the scooter company Bird; they raised and spent $723M for a business that is today valued at $170M. Even if you were a reasonable entrepreneur in this market, you wouldn't have survived this mindless capital allocation. (btw, thank you for the free rides!)
- The bargaining power of buyers. Influential buyers can lower prices while demanding more product value. The buyer captures all the value creation, not the company selling the product. Companies that sell to a highly concentrated industry, such as plane manufacturers or telecommunications carriers, deal with powerful buyers.
- The bargaining power of suppliers. Powerful suppliers will charge high prices and ask for favorable terms, reducing their customers' profitability. Think about companies selling semiconductors in today's shortage. They can ask for outrageous prices because buyers have no alternatives.
- The threat of substitutes. There is no high profitability if it's easy to shift to a product that offers the same value proposition. Most B2B SaaS productivity software falls into this trap. They have a lot of users but no customers paying a reasonable price, because all the products are the same and it's easy to switch.
- The threat of new entrants. If it is easy to enter an industry by creating a similar product, then profitability will be low. Amazon Web Services enjoys significant profit because entering its industry is very hard.

I underestimated how much an industry's structure determines business success. In short, as Marc Andreessen put it: the market always wins. The most determinant factor of a startup's success is the market. He wrote: "In a great market -- a market with lots of real potential customers -- the market pulls product out of the startup." "Conversely, in a terrible market, you can have the best product in the world and an absolutely killer team, and it doesn't matter -- you're going to fail." Andy Rachleff sums it up:

- When a great team meets a lousy market, market wins.
- When a lousy team meets a great market, market wins.
- When a great team meets a great market, something special happens.

Entrepreneurs should aim at building unique and defendable products in a highly profitable and fast-growing industry. In short, products with significant competitive moats! My idol, Warren Buffett, wrote: "We think of every business as an economic castle. And castles are subject to marauders. And in capitalism, with any castle, you have to expect that millions of people out there are thinking about ways to take your castle away. Then the question is, what kind of moat do you have around that castle that protects it?" It's not the size of the castle that matters but how defensible it is! Buffett again: "The most important thing to me is figuring out how big a moat there is around the business. What I love, of course, is a big castle and a big moat with piranhas and crocodiles." A business protected by crocodiles, excellent! What are these moats?

- Intangible assets: benefits such as patents, brands, reputation, or proprietary processes. Think about Coca-Cola, a company that has sold the same beverage since 1886 and whose brand is a childhood symbol for billions of people. Who can compete with that?
- Scale: it allows a limited number of players to provide low-cost services while enjoying high margins. Think about Vanguard, which has $7.2 trillion of assets under management, allowing them to reduce commissions while still earning profits. The same goes for retail companies such as Costco or insurance businesses such as GEICO.
- High switching costs: they make it costly and risky for customers to switch providers. ERPs and CRMs such as Salesforce or SAP are so embedded in the customer's organization that it's impossible to drop this software.
- Network effect: when the value of a service or product becomes more compelling as more people use it. By far my favorite. Consider Facebook; it's not hard to build a similar web app, but impossible to replicate their 2.93 billion monthly active users, who generate a great data network effect. I recommend reading the great Network Effects Bible by James Currier.
- Regulation: when laws protect incumbents with, for instance, local rules, FDA approvals, or licenses. Regulation significantly increases the cost of entry and sometimes even prevents new entrants to the market.

I like to analyze a business from the perspective of competitive moats. From my standpoint, every business attribute is either:

- easy to replicate
- hard to replicate
- impossible to replicate

A great company has many "impossible to replicate" attributes. Teams who focus on building features similar to competitors' to "match their feature sets" don't get that a great business is built on uniqueness. A good strategy requires trade-offs; it's more about what you don't do than what you do. Go unique, or go home! You will be pleased to know that not all moats are created equal. Morningstar did a study comparing competitive moats and the associated profitability. Morningstar learned that firms with wide moats are far more profitable than narrow-moat firms. These wide-moat companies benefit from multiple moat sources that defend their business.
Interestingly, network effect is rated the best moat, while scale is the least likely to drive great performance. What is wild is that only 10% of the 1,500 stocks that Morningstar tracks are considered wide-moat companies! An excellent way to know if a company has powerful moats is to consider its ability to raise prices substantially. Warren Buffett said: "The single most important decision in evaluating a business is pricing power. If you've got the power to raise prices without losing business to a competitor, you've got a very good business. And if you have to have a prayer session before raising the price 10 percent, then you've got a terrible business." My favorite burrito place in San Francisco kept raising prices, trying to keep up with inflation, so I stopped going. Restaurants are a lousy business because of the many alternatives. Sorry guys, my burrito loyalty stops at $15. The goal of a successful enterprise is to earn profits. It means capturing the value in an industry by having a better position than rivals, suppliers, new entrants, substitutes, and even customers! A good way to analyze a company's performance and its competitive moats is to focus on return on invested capital (ROIC). In the long run, sustainable value creation is the difference between the return on invested capital (ROIC) and the cost of capital. What matters is how much capital the company can invest at a rate above its cost of capital, and for how long. The length of the competitive advantage period is crucial. According to Morningstar, the durability of economic profits is far more important than their magnitude. Quoting Buffett in his 1992 letter: "the best business to own is one that over an extended period can employ large amounts of incremental capital at very high rates of return."
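The value-creation test above (ROIC versus the cost of capital) can be sketched in a few lines. All figures below are illustrative, not drawn from any real company:

```python
# Economic profit sketch: value is created only when ROIC exceeds
# the cost of capital. All figures are made up for illustration.

def roic(nopat: float, invested_capital: float) -> float:
    """Return on invested capital: after-tax operating profit / capital."""
    return nopat / invested_capital

def economic_profit(nopat: float, invested_capital: float, wacc: float) -> float:
    """Dollars of value created (or destroyed) in a period."""
    return (roic(nopat, invested_capital) - wacc) * invested_capital

# A wide-moat business: 20% ROIC against an 8% cost of capital.
moat = economic_profit(nopat=200, invested_capital=1_000, wacc=0.08)

# A no-moat business: returns competed down to the cost of capital.
commodity = economic_profit(nopat=80, invested_capital=1_000, wacc=0.08)

print(round(moat, 2))       # positive: value created
print(round(commodity, 2))  # zero: accounting profit, no economic profit
```

The point of the second case is the one the article makes: a business can report profits yet create no value once its returns are competed down to its cost of capital.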
Regarding capital allocation per moat type, I like Connor Leonard's framework: Low/No Moat: companies that may be perfectly well run and sell good products/services, but which do not exhibit characteristics that prevent other companies from competing away their profits if they start earning attractive returns. Most companies fall into this category. Legacy Moat-Dividend: a company that is insulated from competition but does not have much opportunity to grow through reinvesting cash flow, so it pays most of its cash earnings out as dividends. Legacy Moat-Outsider: a company that is insulated from competition but does not have much opportunity to grow through reinvesting cash flow, so it deploys its cash flow to acquire other companies, pay dividends, and opportunistically buy back stock. Reinvestment Moat: a company that is insulated from competition and has the opportunity to reinvest its cash flow into growing the business. Capital-Light Compounder: a company that is insulated from competition and has the opportunity to grow, but which doesn't need to reinvest much cash to do so and is therefore able to return cash to shareholders even while growing. The stability of the moat over time is a critical factor. Economic moats are rarely stable; they get a little bit wider or narrower every day. There is a relentless regression to the mean in which companies' moats fade and returns trend towards the industry average. In this matter, not all industries are created equal. Some industries, such as food and beverage, revert to the mean quickly, while others, such as banking, revert more slowly. More importantly, the long-term average differs between terrible sectors, such as real estate or utilities, and good ones, such as software or professional services. Still, there are always great defensible businesses in good as well as bad industries. Michael J.
Mauboussin did such an analysis in an article I highly recommend reading: Measuring the Moat: Assessing the Magnitude and Sustainability of Value Creation (2016). I like Mauboussin's work, which showcases a framework for analyzing different industries and companies' positions in the value chain. Mauboussin starts by creating an industry map to understand the competitive landscape and, very importantly, the distribution of profits over time. Focusing on profits is crucial because some businesses build great products with millions of users but no ability to generate profits. Mauboussin then measures the industry's stability, its attractiveness based on Porter's five forces, and tries to assess the likelihood of disruption by innovation. Pro tip: he provides a checklist of questions for assessing value creation on page 53. I think it's an analysis every company should perform to understand its business. Ok, ok, it is a lot. What did we learn? Choose a highly profitable and fast-growing market. Create a product well positioned in the value chain to capture profits. Focus on the company's uniqueness to avoid competition. Keep reinforcing the competitive moats. Reinvest cash at a high rate of return. The final competitive battle: the Startup Guy vs. the Intelligent CEO. When the Startup Guy talks about how great the team is, the Intelligent CEO focuses on the market and industry structure. When the Startup Guy talks about how disruptive the marketing is, the Intelligent CEO focuses on the position in the value chain. When the Startup Guy talks about product adoption, the Intelligent CEO focuses on the durability and the widening of competitive moats. When the Startup Guy talks about revenue growth, the Intelligent CEO focuses on profits and reinvestment opportunities. Startup Guys sell sand in the Sahara while Intelligent CEOs are the only ones selling water in the hot desert!
If you found this article valuable, please consider sharing it 🙌


Panic in Startupland!

Startupland is in panic mode. Months ago, pundits lectured about the new normal; now, they are counting their losses. What happened? The new normal was a bunch of things: 100x ARR fundraising multiples, IPO shares that doubled on the first trading day, cash flows treated like a bad disease, large secondary rounds for founders and insiders, hedge funds flooding the later-stage market, meme stocks that 10x'd overnight, oversubscribed fundraising rounds every six months, acquisitions paid for with over-valued stock, and more strangeness I can't even recall. The pundits argued that it was caused by an acceleration of software adoption after COVID. Work from home, Zoom calls, and quick digital transformation supposedly unlocked trillions of value. The software industry was bigger than ever, so valuations surged, and the revenue was supposed to match... in the future! I was skeptical. I wrote in How to Beat the Market: In a bull market, speculators believe that "this time is different," which leads to exuberance. A bubble is characterized by the fact that people believe that some new development will change the world, that patterns that have been the rule in the past, such as business cycles, will no longer occur, or that the rules regarding valuation norms and standards of value and safety have changed. More often than not, the time is no different, and the pendulum swings back. My opinion is the following: central banks lowered interest rates, flooding the market with money and creating unstoppable inflation and asset speculation. When the value of cash decreases fast, investors rush to find productive assets to preserve purchasing power. Too many trillions of dollars chasing a few assets sent prices to the moon. Central bankers created a gigantic misallocation of resources - as they have always done since the dawn of time. What is happening now? Central bankers are increasing interest rates to fight sky-high inflation, causing the stock market to collapse.
The Federal Reserve issued its biggest rate hike in two decades, and more rate rises are expected. Accordingly, tech stocks got hammered. -80% for Zoom. Are you serious? The first metaverse company and the pillar of the remote economy! It hurts. Shopify at -80%? Aren't we supposed to shop exclusively online now? Robinhood's valuation is less than all the capital it has raised: wasn't day trading Dogecoin a sure thing? Peloton at -90%? Where are all the digital bikers? Is profitability a thing now? Did Tech Twittos lie to us, or what? :( Let's talk numbers. The attentive reader will notice a sempiternal reversion to the mean. Multiples skyrocketed and are now coming back to their historical average. Nothing more. The "new normal" was bullshit talk similar to Yale economist Irving Fisher predicting, on the verge of the Great Depression, in 1929: "Stock prices have reached what looks like a permanently high plateau." Well, thanks for the tip! The multiple of enterprise value to next-twelve-months revenue (EV / NTM) went crazy. The median multiple reached 22x in late 2021, sending the cumulative market capitalization of all public SaaS companies to $2tn! The market has since cooled off, wiping out $1tn of market cap (!!) and reverting back to the mean with an EV / NTM of 7.2x. Source: Meritech trading data. Consider a SaaS company trading at the median multiple: $1 of additional revenue added $22 of enterprise value back in late 2021. It meant companies were encouraged to invest or acquire up to $22 to generate $1 of revenue. Of course, 22x was the median, and some companies, such as Snowflake, traded at 93x. High burn was praised, and investors pushed for more spending in this unprecedented environment. What could go wrong? Today, $1 of revenue contributes only $7.2 of enterprise value - and it keeps falling. Companies likely didn't manage to adjust their burn rate to the sudden reversion to the mean. It means that they are destroying capital fast.
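The multiple compression described above is easy to make concrete. The 22x and 7.2x medians come from the article; the company's revenue figure below is hypothetical:

```python
# Enterprise value implied by a revenue multiple: EV = multiple * NTM revenue.
# Multiples (22x, 7.2x) are the medians quoted in the article;
# the revenue figure is made up for illustration.

def enterprise_value(ntm_revenue: float, ev_ntm_multiple: float) -> float:
    return ntm_revenue * ev_ntm_multiple

ntm_revenue = 50_000_000  # hypothetical $50M of next-twelve-months revenue

peak = enterprise_value(ntm_revenue, 22.0)   # late-2021 median: 22x
today = enterprise_value(ntm_revenue, 7.2)   # post-correction median: 7.2x

# Same business, same revenue: multiple compression alone wipes out
# roughly two-thirds of the paper valuation.
print(f"peak:  ${peak:,.0f}")
print(f"today: ${today:,.0f}")
```

Nothing about the underlying business changed between the two lines; only the price the market pays per dollar of revenue did.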
What supposedly made sense yesterday is now totally dumb. The classic mistake is to base business assumptions on an all-time-high, speculative market. Last October, in "I Raise Therefore I am", I wrote: The high exit multiples of today drive higher fundraising valuations, but what if exit multiples go back to their historical average? The fundamental issue is that the company's fundraising valuation, terms - aka liquid prefs - and burn rate won't adapt to the new exit environment. A lot of wealth will be destroyed that way. What does it mean for the immediate future? Fewer unicorns. The "new normal" unicorn was easy: reach $10M ARR by burning tons of cash, raise at 100x ARR, and voila! A $1bn valuation in TechCrunch. Now, according to a prominent VC, Matt Turck: "to justify a $1B valuation, a cloud unicorn today would need to plan on doing $178M in revenues in the next 12 months if you apply the current median cloud software multiple (5.6x forward rev)." Well, it's finally not that easy to be worth $1 billion. By the way, you should follow Matt, who is smart and hilarious. Unit economics matter. If you have a high burn rate, low business efficiency, and a short runway... Houston, there is a problem! You just went from being the best in class to the worst in class. Pay attention; the teachers changed the rules! Gross margins, net dollar retention, EBITDA, CAC, burn multiple... all of that suddenly matters. A lot. When free money stops, businesses discover the underlying quality of their operations. Startup bankruptcies. Keith Rabois put it simply: "If you have a high burn rate and have raised money at high prices, you're going to run into a brick wall very fast." The end of free money. Who could have imagined that? Startups with a high burn rate that don't manage to become profitable will die like all bad businesses. Expect hiring freezes and mass layoffs.
It's the ruthless natural selection of capitalism. Startups got capital-killed. This one is subtle: if the startup raised at a crazy valuation, the cash is still here, but the equity might be worthless. It's the Uber, Dropbox, Oscar Health, Lemonade, Robinhood (etc., etc.) scenario. For instance, Oscar Health raised a total of $1.6bn for a current valuation of $1bn. Many startups that raised money in the last two years will share the same fate. Many VCs will go bust. In the investment biz, price matters. Pay too much, and you will never see your money again. Tiger Global, the daring "new normal" king, took a $17 billion hit on its investments - the biggest dollar decline for a hedge fund in history, according to the FT. Ouch. Many VC firms will struggle to raise additional funds and close their doors. Investments in venture funds are already dropping (-19% quarter-over-quarter, according to CB Insights). Wow, brutal lessons. Cash flows matter, burn matters, gross margin matters, churn matters, and valuation matters. Building a good business matters. Is your SaaS startup screwed? How do you quickly assess the damage without spending days in a war room? Introducing the hype ratio and the burn multiple: Hype ratio = capital raised / annual recurring revenue. Burn multiple = net burn / net new ARR. If either ratio is above 3, there is a problem. It's not uncommon to see startups with a 10+ ratio these days. One day you were riding a unicorn, only to learn, later on, that you were riding a pig! Reversion to the mean is inevitable. I would say that it's a severe crisis only when multiples go below their historical average. Most people look at returns on paper without asking if the price made sense in the first place. Price relative to fundamentals such as revenue or earnings means a lot, especially in its historical context. A few years of irrational exuberance don't make a market. The crisis is, for now, pretty light.
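The two health checks introduced above (hype ratio and burn multiple, with the article's threshold of 3) can be sketched as follows; the company figures are hypothetical:

```python
# Two quick efficiency checks from the article:
#   hype ratio    = capital raised / ARR
#   burn multiple = net burn / net new ARR
# A value above 3 on either signals trouble. Figures are made up.

def hype_ratio(capital_raised: float, arr: float) -> float:
    return capital_raised / arr

def burn_multiple(net_burn: float, net_new_arr: float) -> float:
    return net_burn / net_new_arr

# A hypothetical "new normal" darling: raised $100M on $8M ARR,
# and burned $30M last year to add $5M of net new ARR.
hype = hype_ratio(100_000_000, 8_000_000)    # 12.5 -> way above 3
burn = burn_multiple(30_000_000, 5_000_000)  # 6.0  -> inefficient growth

for name, value in [("hype ratio", hype), ("burn multiple", burn)]:
    verdict = "problem" if value > 3 else "healthy"
    print(f"{name}: {value:.1f} ({verdict})")
```

Both numbers are well past the threshold, so this hypothetical company is, in the article's terms, a pig rather than a unicorn.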
The 2008 crash sent the NASDAQ to a level last seen in July 1995! Thirteen years of stock appreciation were gone! Boom. Today's crash sent the NASDAQ to a level last seen... last year. Well. The market is still expensive (NASDAQ Composite - 45-year historical chart). So, how did we get here? Why are so many people surprised (and totally broke) by a classic reversion to the mean? The irrational exuberance of a bull market and a reversion to the mean are common. What is striking each time is the willingness of market participants to bullshit themselves about a new paradigm to jump into the speculative market. A lot of investors and founders jump all in, right off the cliff. Why is that? Part of the answer is greed and envy. I wrote in How to Beat the Market: Greed is an extremely powerful force that overcomes common sense, prudence, and the memory of painful past lessons. It's hard to stay prudent when every speculator around is enjoying significant profits. The combination of the pressure to conform and the desire to get rich causes investors to drop their independence and skepticism, leading to their capitulation by buying into the speculative market. Greed is a drug that affects the investor's rational thinking, while envy forces investors to comply with the herd. The other part is incentives, the powerful force that drives the world. Investors who gamble in the speculative market get short-term rewards such as high mark-ups from later fundraising rounds, which allow them to raise more money from LPs and enjoy more fees. Founders enjoy large cash-outs and money to fuel their business regardless of the unit economics. It's easy to compromise the long-term future to pocket a big cheque. What about buying a $133M mansion after having sold $292M on IPO day? Short-term payday, long-term hell. Ok, what about the silver lining? Good businesses will have the time of their life.
Fast-growing, profitable startups will have the opportunity to buy out struggling competitors, invest in an environment where CAC will decrease, and hire people with fair compensation packages. Startups with positive cash flows are the cool kids again - until the next exuberance, of course. It will be fine. Stock market crashes come and go. It's a good reminder that the value of a business is the net present value of its future cash flows. Cash flows are king (I know... such a boomer mindset!). This makes me even more admiring of people like Warren Buffett or Mark Leonard. Strong-willed people who think for themselves and have the guts to resist short-term temptations. They are the ones ridiculed, the ones not invited to fancy dinners, the ones not covered in magazines, but they are the ones who win. They fight the institutional imperative and peer pressure, preserve a margin of safety, and are fearful when others are greedy and greedy when others are fearful! They are the Intelligent CEOs. If you found this article valuable, please consider sharing it 🙌


Where is Your Firephone?

Hello, I am Nicolas Bustamante. I'm an entrepreneur and I write about long-term company building. Check out some of my popular posts: The Intelligent CEO, The Impact of the Highly Improbable, Surviving Capitalism with Competitive Moats. Subscribe to receive actionable advice on company building 👇 Subscribe now. Have you ever heard about the Firephone? Amazon spent hundreds of millions building a revolutionary smartphone but discontinued it in 2015, one year after its introduction. For most, the Firephone was a massive disappointment; for Amazon's CEO, it was a healthy failure that contributed to Amazon's success. I agree, and one of my favorite questions when scaling up is now: where is your Firephone? After reaching product-market fit, early-stage startups focus on improving one product. The ambition is to add new features quickly to satisfy early customers. The constant iterations lead to fast revenue growth, a high net promoter score, low churn, and high upsell. After a while, the main product isn't sufficient to drive additional growth, and it becomes necessary to develop new product lines. These new product initiatives leverage the existing technology and the current competitive advantages. These sustaining innovations are adjacent to the initial developments and sold to existing customers, generating new sales and often higher margins for the company. It's relatively straightforward to develop adjacent product lines because customers often ask for the developments, and it's easy to make financial projections before investing. As time goes by, it gets harder to grow through these types of sustaining innovations, and companies have to embrace radical innovations. Developing radically new products is challenging but crucial in highly competitive markets. Failure to innovate leads to irrelevance, and thus bankruptcy. Such an innovative process is challenging because there is so much uncertainty about future success.
The investment required is more significant than for sustaining innovations, while the margins seem inferior to the core product's. Additionally, radical innovations might not target current profitable customers but niches in a new fast-growing market. All of this creates substantial career risk in most companies, because the team might be held accountable for the lengthy and costly failures that happen along the way. The paradox is that managers who successfully launched incremental innovations often fail to commercialize radically new product lines. Using the same processes of analyzing risk-adjusted returns and talking to customers, they weed out the disruptive product initiatives that are key to tomorrow's growth. Overcoming these challenges requires leaders who understand the power-law distribution of returns: a few product successes will cover the cost of many failed ones. In the words of Jeff Bezos: "a small number of winners pay for dozens, hundreds of failures. And so every single important thing that we have done has taken a lot of risk-taking, perseverance, guts, and some of them have worked out, most of them have not." There is a world between understanding the power-law distribution and creating a company culture that rewards big, bold bets. It requires adopting a long-term perspective, accepting the loss of tons of money, and waiting patiently for that outlier to generate a significant outcome. Additionally, the best organizations fail early, fail often, and don't gamble the company's future on one product launch. They seek positive optionality over time, with a low downside and a big upside! From my perspective, the innovative culture is one of the most impressive things about Amazon. They started as an e-commerce company but now dominate the cloud industry with AWS and push the frontiers of hardware with Kindle, Alexa, or Amazon Go. All that while losing billions of dollars on failed product innovations! And you, where is your Firephone?
If you found this article valuable, please consider sharing it 🙌


I Joined Figures as an Advisor

Six months ago, an ex-teammate and friend, Grégoire, introduced me to Virgile to discuss his startup. Figures is software to benchmark, review, and communicate compensation plans. I happily jumped on a call and did my best to answer Virgile's questions by asking him more questions. I then sent him a follow-up email sharing my experience and adding several articles to challenge his perspective. After this initial contact, a strange thing happened. Having read all the articles, Virgile replied and asked more thoughtful questions. I answered them for what I thought would be the end of the conversation, as often happens. We continued our discussion, and before I knew it, we were talking about competitive moats, fundraising, OKR methodology, hiring plans, culture, and much more. That was the beginning of my role as an advisor for Figures. I like working with Virgile, an intelligent, humble, hard-working, and funny person, as well as Bastien, Figures' CTO, who independently created an outstanding product that rivals well-funded US competitors. I am grateful to work with them because I like the team and Figures' product matters. I have experienced firsthand the pain of crafting and communicating compensation plans. My company suffered from unnecessary tensions because we lacked data on specific positions. Regrettably, our business plan second-guessed potential salaries and yearly raises, generating countless mistakes. I remember my team struggling to gather sufficient, up-to-date data per job and seniority level. The manual process was painful and inaccurate. With Figures, accessing compensation market data is seamless.
I love the dashboard that showcases a compensation index for every department, job, and seniority level. Figures also emphasizes a gender-equality index that gave us a fresh perspective on this crucial topic. Today, I use Figures' explorer to browse market compensation data whenever I doubt a specific comp. Even better, Figures has a tool to compare a candidate's expected comp to our compensation plan and the whole market. Better information drives better decisions for companies and employees alike. Figures' product is essential, and it solves an age-old problem for companies. Figures is creating a new market category by adding reviewing and communication features to traditional benchmarking software. Virgile and Bastien have already built solid competitive moats with a valuable data network effect, a strong channel-partnership distribution, and great integrations with payroll software. Working with Figures has made me a better builder. While I believe founders should focus on nothing but their early-stage business, late-stage founders should probably work with other startups. At some point, analyzing other people's businesses is essential to step up and broaden one's perspective. Helping Figures gave me great insight into how to improve my own business. I'm writing this to put my reputation at risk. Because I'm available whenever the founders want to tackle a challenge, it will be my responsibility if Figures doesn't reach its ambitious goals. Nevertheless, any success is due to the Figures team's hard work and dedication. They are the ones building, selling, and operating the business to create value for their customers, and they are crushing it! If you're working in Europe on compensation plans, do yourself a favor: schedule a demo with Virgile and buy Figures. If you're an ambitious startup builder, consider joining Figures.
They are hiring, remotely or in Paris: an engineering team leader, a senior product designer, a CSM, a head of marketing, and country launchers for the UK, Spain, and Benelux. If you found this article valuable, please consider sharing it 🙌


Look How Big My Team is!

I'm used to getting asked how many employees work at my startup. For most people, it's a quick way to assess a startup's success. Supposedly, the bigger, the better. But does it really reflect success? To cut to the chase, headcount is a lousy metric. Paul Graham has a great tweet that says: "When people visit your startup, they should be surprised how few people you have. A visitor who walks around and is impressed by the magnitude of your operation is implicitly saying, 'Did it really take all these people to make that crappy product?'" I agree with him, and I'm often surprised by how many people and how much capital it has taken to build mediocre businesses. Headcount is a vanity metric, similar to followers on social media or fundraising amounts. Plenty of examples exist where small teams outcompete large teams in the same market. Small teams leverage speed as a competitive advantage, which is key to winning over competitors. Other benefits include better communication, more engagement, more profits to reinvest, and, overall, better productivity. Small groups avoid the Ringelmann effect, the tendency for individuals to become increasingly less productive as the size of their group increases. More importantly, the headcount constraint drives creativity and innovative solutions. Inflating one's team is frequently a bad idea because throwing more people at a problem doesn't solve it faster. It often leads to what I call the "hiring death cycle": a startup faced with a problem tends to hire more people, making it harder to solve the problem and thus requiring even more hires.
The death cycle is reinforced by the "next hire fallacy," in which, supposedly, the next recruit will suddenly solve the problem. It's common to fall into this vicious cycle, and hard to break out of it. Controlling headcount expansion while experiencing fast growth is challenging. Targeting an efficient number of teammates is tricky because it's difficult to know and test when fewer people could achieve more. Additionally, managers have an incentive to grow headcount, as it means more responsibility, prestige, and better compensation. Entrepreneurs and investors often push to hire more, as it gives an impression of faster progress. However, the key is to focus on the business's performance over time. An excellent way to assess the efficiency of a business is to compare its revenue to its number of employees. The best companies increase their revenue per employee as they scale, making revenue grow faster than costs. One of the critical metrics for SaaS startups is annual recurring revenue (ARR) divided by full-time employees (FTE). For instance, the median ARR per FTE ratio for private SaaS startups in the $10-$20M revenue range is $138,889. The same benchmark exists for all publicly traded SaaS companies, with the median being $260,045. The benchmarks show that great companies keep increasing their ARR/FTE over time. So, one of the few good reasons to ask for a startup's headcount is to compare it to revenue and quickly evaluate the business's soundness. Say a fast-growing SaaS startup has 80 employees and $15M of annual recurring revenue. That implies an excellent $187,500 ARR/FTE ratio, and it's likely a fantastic business. Voila! You can now better assess the success of most startups. If you found this article valuable, please consider sharing it 🙌
