Posts in Analytics (20 found)
Jim Nielsen 1 week ago

My Number One “Resource Not Found”

The data is in. The number one requested resource on my blog which doesn’t exist is: According to Netlify’s analytics, that resource was requested 15,553 times over the last thirty days. Same story for other personal projects I manage: iOS Icon Gallery: 18,531 requests. macOS Icon Gallery: 10,565 requests. “That many requests and it serves a 404? Damn Jim, you better fix that quick!” Nah, I’m good. Why fix it? I have very little faith that the people who I most want to respect what’s in that file are going to do so. So for now, I’m good serving a 404 for . Change my mind.

0 views
Evan Schwartz 3 weeks ago

Scour - October Update

Hi friends, In October, Scour ingested 1,042,894 new posts from 14,140 sources. I was also training for the NYC Marathon (which is why this email comes a few days into November)! Last month was all about Interests: Your weekly email digest now includes a couple of topic recommendations at the end. And, if you use an RSS reader to consume your Scour feed, you’ll find interest recommendations in that feed as well. When you add a new interest on the Interests page, you’ll now see a menu of similar topics that you can click to quickly add. You can browse the new Popular Interests page to find other topics you might want to add. Infinite scrolling is now optional. You can disable it and switch back to explicit pages on your Settings page. Thanks Tomáš Burkert for this suggestion! Earlier, Scour’s topic recommendations were a little too broad. I tried to fix that and now, as you might have noticed, they’re often too specific. I’m still working on solving this “Goldilocks problem”, so more on this to come! Finally, here are a couple of my favorite posts that I found on Scour in October: Introducing RTEB: A New Standard for Retrieval Evaluation; Everything About Transformers; Turn off Cursor, turn on your mind. Happy Scouring! - Evan

1 views
Jack Vanlightly 3 weeks ago

How Would You Like Your Iceberg Sir? Stream or Batch Ordered?

Today I want to talk about stream analytics, batch analytics and Apache Iceberg. Stream and batch analytics work differently, and both can be built on top of Iceberg, but because of their differences there can be a tug-of-war over the Iceberg table itself. In this post I am going to use two real-world systems, Apache Fluss (streaming tabular storage) and Confluent Tableflow (Kafka-to-Iceberg), as a case study for these tensions between stream and batch analytics. Apache Fluss uses zero-copy tiering to Iceberg. Recent data is stored on Fluss servers (using the Kafka replication protocol for high availability and durability) but is then moved to Iceberg for long-term storage. This results in one copy of the data. Confluent Kora and Tableflow use internal topic tiering and Iceberg materialization, copying Kafka topic data to Iceberg, such that we have two copies (one in Kora, one in Iceberg). This post will explain why the two have chosen different approaches and why both are totally sane, defensible decisions. First we should understand the concepts of stream-order and batch-order. A streaming Flink job typically assumes its sources come with stream-order. For example, a simple SELECT * Flink query assumes the source is (loosely) temporally ordered, as if it were a live stream. It might be historical data, such as starting at the earliest offset of a Kafka topic, but it is still loaded in a temporal order. Windows and temporal joins also depend on the source being stream-ordered to some degree, to avoid needing large/infinite window sizes which blow up the state. A Spark batch job typically hopes that the data layout of the Iceberg table is batch-ordered (say, partitioned and sorted by business values like region, customer, etc.), thus allowing it to efficiently prune data files that are not relevant, and to minimize costly shuffles. If Flink is just reading a Kafka topic from start to end, it’s nothing special.
But we can also get fancy by reading from two data sources: one historical and one real-time. The idea is that we can unify historical data from Iceberg (or another table format) and real-time data from some kind of event stream. We call reading from the historical source bootstrapping. Streaming bootstrap refers to running a continuous query that reads historical data first and then seamlessly switches to live streaming input. To switch from the historical to the real-time source, we need to make that switch at a given offset. The notion of a “last tiered offset” is a correctness boundary that ensures that the bootstrap and the live stream blend seamlessly without duplication or gaps. This offset can be mapped to an Iceberg snapshot. Fig 1. Bootstrap a streaming Flink job from historical then switch to real-time. However, if the historical Iceberg data is laid out with a batch-order (partitioned and sorted by business values like region, customer, etc.) then the bootstrap portion of a SELECT * will appear completely out-of-order relative to stream-order. This breaks the expectations of the user, who wants to see data in the order it arrived (i.e., stream-order), not a seemingly random one. We could sort the data first from batch-order back to stream-order in the Flink source before it reaches the Flink operator level, but this can get really inefficient. Fig 2. Sort batch-ordered historical data in the Flink source task. If the table has been partitioned by region and sorted by customer, but we want to sort it by the time it arrived (such as by timestamp or Kafka offset), this will require a huge amount of work and data shuffling (in a large table). The result is not only a very expensive bootstrap, but also a very slow one (after all, we expect fast results with a streaming query). So we hit a wall: Flink wants data ordered temporally for efficient streaming bootstrap.
Batch workloads want data ordered by value (e.g., columns) for effective pruning and scan efficiency. These two data layouts are orthogonal. Temporal order preserves ingest locality; value order preserves query locality. You can’t have both in a single physical layout. Fluss is a streaming tabular storage layer built for real-time analytics which can serve as the real-time data layer for lakehouse architectures. I did a comprehensive deep dive into Apache Fluss recently, diving right into the internals, if you are interested. Apache Fluss takes a clear stance. It’s designed as a streaming storage layer for data lakehouses, so it optimizes Iceberg for streaming bootstrap efficiency. It does this by maintaining stream-order in the Iceberg table. Fig 3. Fluss stores real-time and historical data in stream-order. Internally, Fluss uses its own offset (akin to the Kafka offset) as the Iceberg sort order. This ensures that when Flink reads from Iceberg, it sees a temporally ordered sequence. The Flink source can literally stream data from Iceberg without a costly data shuffle. Let’s take a look at a Fluss log table. A log table can define: Optional partitioning keys (based on one or more columns). Without them, a table is one large partition. The number of buckets per partition. The bucket is the smallest logical subdivision of a Fluss partition. Optional bucketing key for hash-bucketing. Otherwise rows are added to random buckets, or round-robin. The partitioning and buckets are both converted to an Iceberg partition spec. Fig 4. An example of the Iceberg partition spec and sort order. Within each of these Iceberg partitions, the sort order is the Fluss offset. For example, we could partition by a date field, then spread the data randomly across the buckets within each partition. Fig 5. The partitions of an Iceberg table visualized. Inside Flink, the source will generate one “split” per table bucket, routing them by bucket id to split readers.
Due to the offset sort order, each Parquet file should contain contiguous blocks of offsets after compaction. Therefore each split reader naturally reads Iceberg data in offset order until it switches to the Fluss servers for real-time data (also in offset order). Fig 6. Flink source bootstraps from Iceberg visualized. Once the lake splits have been read, the readers start reading from the Fluss servers for real-time data. This is great for Flink streaming bootstrap (it is just scanning the data files as a cheap sequential scan). Primary key tables are similar but have additional limitations on the partitioning and bucketing keys (as they must be subsets of the primary key). A primary key, such as device_id, is not a good partition column as it’s too fine-grained, leading us to use an unpartitioned table. Fig 7. Unpartitioned primary key table with 6 buckets. If we want Iceberg partitioning, we’ll need to add another column (such as a date) to the primary key and then use the date column for the partitioning key (and device_id as a bucket key for hash-bucketing). This makes the device_id non-unique though. In short, Fluss is a streaming storage abstraction for tabular data in lakehouses and stores both real-time and historical data in stream-order. This layout is designed for streaming Flink jobs. But if you have a Spark job trying to query that same Iceberg table, pruning is almost useless as the table does not use a batch-optimized layout. Fluss may well decide to support Iceberg custom partitioning and sorting (batch-order) in the future, but it will then face the same challenges of supporting streaming bootstrap from batch-ordered Iceberg. Confluent’s Tableflow (the Kafka-to-Iceberg materialization layer) took the opposite approach. It stores two copies of the data: one stream-ordered and one optionally batch-ordered. Kafka/Kora internally tiers log segments to object storage, which is a historical data source in stream-order (good for streaming bootstrap).
Iceberg is a copy, which allows for stream-order or batch-order; it’s up to the customer. Custom partitioning and sort order are not yet available at the time of writing, but they’re coming. Fig 8. Tableflow continuously materializes a copy of a Kafka topic as an Iceberg table. I already wrote why I think zero-copy Iceberg tiering is a bad fit for Kafka specifically. Much also applies to Kora, which is why Tableflow is a separate distributed component from Kora brokers. So if we’re going to materialize a copy of the data for analytics, we have the freedom to allow customers to optimize their tables for their use case, which is often batch-based analytics. Fig 9. Copy 1 (original): Kora maintains stream-ordered live and historical Kafka data. Copy 2 (derived): Tableflow continuously materializes Kafka topics as Iceberg tables. If the Iceberg table is also stored in stream-order then Flink could do an Iceberg streaming bootstrap and then switch to Kafka. This is not available right now in Confluent, but it could be built. There are also improvements that could be made to historical data stored by Kora/Kafka, such as using a columnar format for log segments (something that Fluss does today). Either way, the materialization design provides the flexibility to execute a streaming bootstrap using a stream-order historical data source, allowing the customer to optimize the Iceberg table according to their needs. Batch jobs want value locality (data clustered by common predicates), aka batch-order. Streaming jobs want temporal locality (data ordered by ingestion), aka stream-order. With a single Iceberg table, once you commit to one, the other becomes inefficient. Given this constraint, we can understand the two different approaches: Fluss chose stream-order in its Iceberg tables to support stream analytics constraints and avoid a second copy of the data.
That’s a valid design decision as, after all, Fluss is a streaming tabular storage layer for real-time analytics that fronts the lakehouse. But it does mean giving up the ability to use Iceberg’s layout levers of partitioning and sorting to tune batch query performance. Confluent chose one stream-ordered copy in Kora and one optionally batch-ordered Iceberg copy (via Tableflow materialization), letting the customer decide the optimum Iceberg layout. That’s also a valid design decision as Confluent wants to connect systems of all kinds, be they real-time or not. Flexibility to handle diverse systems and diverse customer requirements wins out. But it does require a second copy of the data (causing higher storage costs). As the saying goes, the opposite of a good idea can be a good idea. It all depends on what you are building and what you want to prioritize. The only losing move is pretending you can have both (stream-optimized and batch-optimized workloads) in one Iceberg table without a cost. Once you factor in the compute cost of using one format for both workloads, the storage savings disappear. If you really need both, build two physical views and keep them in sync. Some related blog posts that are relevant to this one: Beyond Indexes: How Open Table Formats Optimize Query Performance; Why I’m not a fan of zero-copy Apache Kafka-Apache Iceberg; Understanding Apache Fluss.
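To make the "last tiered offset" boundary and the ordering problem from the post concrete, here is a minimal Python sketch. It is a toy model, not how Flink, Fluss, or Tableflow implement anything: `Row`, `bootstrap_then_stream`, and the sample events are all hypothetical names and data invented for illustration. It reads the historical copy up to the boundary, switches to the live source after it, and compares a stream-ordered historical layout with a batch-ordered one.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Row:
    offset: int    # ingestion position, akin to a Kafka or Fluss offset
    region: str    # business columns a batch job would filter on
    customer: str


def bootstrap_then_stream(
    historical: Iterable[Row], live: Iterable[Row], last_tiered_offset: int
) -> Iterator[Row]:
    """Read historical rows up to the boundary, then live rows after it."""
    for row in historical:
        if row.offset <= last_tiered_offset:
            yield row
    for row in live:
        if row.offset > last_tiered_offset:  # skip rows already served from history
            yield row


# Hypothetical data: six events in arrival order.
events: List[Row] = [
    Row(0, "eu", "c2"), Row(1, "us", "c1"), Row(2, "eu", "c1"),
    Row(3, "us", "c2"), Row(4, "eu", "c3"), Row(5, "us", "c1"),
]
last_tiered_offset = 3
tiered = [r for r in events if r.offset <= last_tiered_offset]
live = [r for r in events if r.offset >= 2]  # live store overlaps the tiered range

# The same tiered rows in the two physical layouts discussed in the post.
stream_ordered = sorted(tiered, key=lambda r: r.offset)
batch_ordered = sorted(tiered, key=lambda r: (r.region, r.customer))

stream_result = [r.offset for r in bootstrap_then_stream(stream_ordered, live, last_tiered_offset)]
batch_result = [r.offset for r in bootstrap_then_stream(batch_ordered, live, last_tiered_offset)]

print(stream_result)  # [0, 1, 2, 3, 4, 5]
print(batch_result)   # [2, 0, 1, 3, 4, 5]
```

Both runs return every offset exactly once, so the boundary prevents duplicates and gaps in either layout. Only the stream-ordered layout returns rows in arrival order without a sort; the batch-ordered one would need the expensive re-sort the post describes.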

0 views
Jeff Geerling 2 months ago

Digging deeper into YouTube's view count discrepancy

For a great many tech YouTube channels, views have been markedly down from desktop ("computer") users since August 10th (or so). This month-long event has kicked up some dust, enough that two British YouTubers, Spiffing Brit and Josh Strife Hayes, are having a very British argument over who's right about the root cause. Spiffing Brit argued it's a mix of YouTube's seasonality (it's back to school season) and channels falling off, or as TechLinked puts it, "git gud", while Josh Strife Hayes points out the massive number of channels which identified a historic shift down in desktop views (compared to mobile, tablet, and TV) starting after August 10. This data was corroborated by this Moist Critical video as well.

0 views
Martin Fowler 3 months ago

Actions to improve impact intelligence

Sriram Narayan continues his article on impact intelligence by outlining five actions that can be taken to improve impact intelligence: introduce robust demand management, pay down measurement debt, introduce impact validation, offer your CFO/COO an alternative to ROI, and equip your teams.

0 views
Martin Fowler 3 months ago

The Reformist CTO’s Guide to Impact Intelligence

The productivity of knowledge workers is hard to quantify and often decoupled from direct business outcomes. The lack of understanding leads to many initiatives, bloated tech spend, and ill-chosen efforts to improve this productivity. Sriram Narayan begins an article that looks at how to avoid this by developing an intelligence of the business impact of their work across a network connecting output to proximate and downstream impact.

0 views
A Smart Bear 4 months ago

Max MRR: Your growth ceiling

Your company will stop growing sooner than you think. The "Max MRR" metric predicts revenue plateaus based on churn and new revenue.
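The excerpt only says the metric is based on churn and new revenue, so the sketch below rests on an assumption: a simple steady-state model in which new MRR per month and the monthly churn rate are both constant. Under that model growth stops when churned revenue equals new revenue, which gives a ceiling of new MRR divided by churn rate. The function names and figures are hypothetical.

```python
import math


def max_mrr(new_mrr_per_month: float, monthly_churn_rate: float) -> float:
    """Plateau where monthly churned MRR equals monthly new MRR (assumed model)."""
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("monthly_churn_rate must be in (0, 1]")
    return new_mrr_per_month / monthly_churn_rate


def simulate_mrr(start_mrr: float, new_mrr_per_month: float,
                 monthly_churn_rate: float, months: int) -> float:
    """Apply churn, then add new revenue, once per month."""
    mrr = start_mrr
    for _ in range(months):
        mrr = mrr - mrr * monthly_churn_rate + new_mrr_per_month
    return mrr


# Hypothetical company: $10k of new MRR each month, 5% monthly churn.
ceiling = max_mrr(10_000, 0.05)
after_one_year = simulate_mrr(50_000, 10_000, 0.05, months=12)
after_fifty_years = simulate_mrr(50_000, 10_000, 0.05, months=600)

print(round(ceiling))            # 200000
print(after_one_year < ceiling)  # True: still growing, but toward the ceiling
print(math.isclose(after_fifty_years, ceiling, rel_tol=1e-6))  # True
```

The simulation approaches the ceiling from below and never passes it. In this model, raising the plateau requires either more new revenue per month or lower churn.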

0 views
Grumpy Gamer 4 months ago

Death By Scrolling Part 2

If you haven’t read my previous post about Death By Scrolling way back in February, I suggest you do. Of course this is my lazy way of doing the 2nd promised blog post for Death By Scrolling. In all fairness, I started to write it and it seemed awfully familiar, so I went back and checked and sure enough I had already written about it. But, I’ll do another real post… I asked for beta testers on Mastodon and got close to 300 sign-ups. I didn’t want to invite everyone all at once. There is an old saying that you can only make a first impression once. Every time I make a new beta version I invite 25 more people. Couple of stats: About 25% of the people never redeem the Steam key. Or they redeem it weeks later. This is a little surprising, but maybe it shouldn’t be. People are busy. Of the people who did redeem the key, a third play the game once or twice and never again. This is not surprising. Death By Scrolling is a rogue-like and you die a lot. I do mean a lot, it’s right in the title. Some people do not like this type of game, and I’m OK with that. Maybe half the people who play the game never visit the Discord. We can get only so much info from analytics. Having a conversation about what you like and don’t like is very helpful. Again, this isn’t too unexpected. The players that do play more than a few times play a lot and that is good to see. It’s nice to see strategies emerge that we, as the designers, didn’t think of. That is always a good sign. This is the first time I’ve done a large-ish beta test for one of my games and it’s been fascinating and very insightful. I’m about to invite the next group of 25 testers. If you’re among this group, please visit the Discord. – Ron

0 views
Peter Steinberger 5 months ago

stats.store: Privacy-First Sparkle Analytics

How curiosity about VibeTunnel users led me to build stats.store - a free, open source analytics backend for Sparkle using AI tools, all while cooking dinner.

0 views
James O'Claire 6 months ago

The Trackers and SDKs in ChatGPT, Claude, Grok and Perplexity

Well, for a quick weekend recap I’m going to look at which 3rd party SDKs and API calls I can find in the big 4 Android chat apps. We’ll be using free data from AppGoblin, which you can feel free to browse at any of the links below or on tables. Data is collected via de-compiled SDKs and MITM API traffic. Let’s look first at the development tools. These were interesting to me because I had assumed I’d see more of the dynamic JavaScript libraries like React. Instead we see these are all classic Kotlin apps. If you click through the Chat App names you’ll see the more detailed breakdowns of which specific parts of the libraries they’re using like e (in app animations library) or Kotlin Coil Compose or Square’s . Wow, way more than I expected and with quite the variety! I guess it’s enough that we can further break these down. As is common now, most apps have more than one analytics tracker. First up let’s recognize Google, it’s across every app in multiple ways. The main one that is used in most apps is the . GMS which is required for both Firebase and Google Play Services. Here’s an example of the measurement SDKs related to this: Next was statsig.com and wow! I was blown away that I found this one in 3 of the 4 apps. This looks like a super popular company and I was surprised as I hadn’t heard of them before. Looking around, they look a bit more developer / product focused, but have tons of features and look quite popular. Finally in the analytics section we’ll add the classic segment.com (marketing analytics) and sentry.io (deployment analytics) which get to call OpenAI and Anthropic their clients. It’s always interesting how every company from games to AI ends up needing multiple analytics platforms and probably still depends most on its home BI/Backend. Here’s where the money is at. Now SUPER cool is that RevenueCat is now in both OpenAI and Perplexity.
RevenueCat helps apps use React Native updatable web payment / subscription walls so that marketers can change those sections of the apps without needing to do an entire app update. I believe Perplexity is using Stripe, but that could also be a part of their bigger app ecosystem. livekit.io (AppGoblin: livekit.io) is an AI voice platform which is used by OpenAI and Grok. I’m surprised that OpenAI uses this, as they were quite early to the voice game, but perhaps they use this for some deeper custom voice tools. Perplexity has the most interesting third party tools with MapBox and Shopify. I believe MapBox, which delivers mapping tiles, is used for some of Perplexity’s image generation tools like adding circles/lines etc. to maps. After seeing Shopify in Perplexity, I realized there wasn’t a Shopify SDK found for OpenAI (despite checking recently). They have been rolling out shopping features as a way to monetize their app, so I am curious if these are just implemented via API or if they were obfuscated well enough to not be found. If you’re still interested, you can also check out the API calls recorded by each app while open. The data is scrubbed, and I’m not sharing the clear text JSONs associated, but you can see some of the endpoints related to the SDKs. If you have further questions about these, or have a specific piece of data (say GPS, or email) that you’d like to check if it is sent along to any of these, just let me know and we can do further research: https://appgoblin.info/apps/com.openai.chatgpt/data-flows https://appgoblin.info/apps/com.anthropic.claude/data-flows If you have feedback please join the https://appgoblin.info Discord; you can find the link on the home page.

0 views
Jefferson Heard 7 months ago

Example metrics and fitness functions for Evolutionary Architecture

In my previous article, Your SaaS's most important trait is Evolvability, I talk about the need to define fitness functions that ladder up to core company metrics like NPS, CSAT, GRR, and COGS. Just today I had a great followup where a connection on LinkedIn asked me for specifics for an early stage SaaS. I think it'd be valuable to follow up that post with some examples from that conversation. Pick a metric first that's important to the company at large. For early stage SaaS, I'd say that's NPS. It's easy to collect, low touch, and Promoters are the people who will help you clinch renewals and propagate your SaaS to their colleagues at other organizations. The more promotable your software is, the less work your sales and renewals folks will have to do to move their pipeline. Promoters are people who think your software is a joy to use, and that everyone should be using it over whatever they're using today. At an early stage, whatever your software is, you have one or two killer features that really drive engagement and dominate a user's experience of your product. You're asking yourself, "What metrics do I have control over that make the experience Promotion Worthy?" If your killer feature is messaging, how long does it take for messages and read-receipts to arrive? How long until someone notices lag? How fast is fast enough that improvements aren't noticed? If your killer feature is delivering support through AI, how many times does a user redirect the AI agent for a single question? How complex an inquiry can your AI handle before that's too great? How long does it take for a response to come back? If your killer feature is a calendar, how long does it take for someone to build an appointment, how long does it take to sync to their other calendars, and how close to "on-time" are reminders being delivered? If your killer feature is your financial charting, how up to date are the charts, and how long does it take for a dashboard to load and update? The point is to make it concrete and measurable. Once you can measure it, you want to know two things: What's the minimum acceptable bound? What's the point of diminishing returns? Build a now. Measure continuously. Find the trend. Build that into your Site Reliability practice. Push your engineering team to understand what levers they have to control that function and know how quickly they can adapt if it starts trending negative. As your software and company grow, you'll accumulate functions like this for measuring the fitness of your software for common use-cases. It won't be "one key metric" but one or two metrics for each persona.
Pivots happen. M&As happen. Product requirements shift as the horizon gets closer. For the kinds of changes you learn to expect as an executive, how well does your tech team adapt to change? Do they get thrown into crunch-time in the last 30 days of every project? Does software ship with loose ends and fast-follows that impinge on the next project's start time? Does technical debt accumulate and affect customer experience, support burden, or COGS? As a top software architect or VP of Engineering, these are the kinds of things you measure to see if the team is healthy and if the software is healthy under it. Change is life. Change is necessary for growth. In a healthy, growing company, change is constant. But change introduces stress. Your software architecture's ability to absorb this stress and adapt to new circumstances faster than your competition without creating longer term problems is the ultimate measure of its quality.
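As a rough illustration of "make it concrete and measurable", here is a minimal Python sketch of a latency fitness function for the messaging example. Everything in it is an assumption for illustration: the `LatencyFitness` class, the 1000 ms and 200 ms thresholds, the weekly sample windows, and the choice of the 95th percentile are hypothetical, not taken from the article. It encodes the two questions (minimum acceptable bound, point of diminishing returns) as a 0-to-1 score and fits a trend across measurement windows.

```python
from dataclasses import dataclass
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[rank - 1]


@dataclass(frozen=True)
class LatencyFitness:
    acceptable_ms: float   # minimum acceptable bound: at or above this scores 0
    diminishing_ms: float  # point of diminishing returns: at or below this scores 1

    def score(self, p95_ms: float) -> float:
        if p95_ms >= self.acceptable_ms:
            return 0.0
        if p95_ms <= self.diminishing_ms:
            return 1.0
        return (self.acceptable_ms - p95_ms) / (self.acceptable_ms - self.diminishing_ms)


def trend(scores: Sequence[float]) -> float:
    """Least-squares slope of score per window; negative means getting worse."""
    n = len(scores)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((i - mean_x) * (s - mean_y) for i, s in enumerate(scores))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical weekly samples of message delivery latency in milliseconds.
weekly_samples: List[List[float]] = [
    [100, 120, 150, 180, 200],
    [100, 150, 200, 300, 400],
    [150, 200, 300, 450, 600],
    [200, 300, 500, 700, 800],
]
fitness = LatencyFitness(acceptable_ms=1000, diminishing_ms=200)
scores = [fitness.score(percentile(week, 95)) for week in weekly_samples]

print(scores)         # [1.0, 0.75, 0.5, 0.25]
print(trend(scores))  # -0.25
```

A negative slope while every score is still above zero is the early warning the article asks for: the function is trending toward the acceptable bound before any user-visible breach.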

0 views
Nicky Reinert 8 months ago

Adobe Launch DTM Naming Conventions

I’ve worked with Adobe Tracking Suite (which is Adobe Launch and all its siblings) for quite a while and I saw many tracking implementations and tag managers, some quite chaotic. At some point I felt the need to write down some basic rules to navigate those messy libraries. Hope that helps you, …

0 views
A Smart Bear 8 months ago

All pretty models are wrong, but some ugly models are useful

Identifying useful frameworks for companies, strategy, markets, and organizations, instead of those that just look pretty in PowerPoint.

0 views
Dizzy Zone 9 months ago

On Umami

I’ve been using Umami analytics on this blog for quite some time now. I self-host an instance on my homelab. I spent a bit of time researching self-hosted analytics and generally they had a few issues. First, I’d like the analytics platform to be privacy focused. No cookies, GDPR compliant, no PII. Umami checks this box. Second, many of the alternatives had quite a bit of hardware resource overhead once hosted. They would either consume a ton of memory, use a lot of CPU, or require me to host something like ClickHouse to run them. Since this blog is not the New York Times I feel like hosting a specific database for it would be overkill. My homelab is rather small, so keeping things minimal is how I manage to stretch it. Umami consumes around 1% of a CPU core and ~240MiB of RAM on my homelab. Third, Postgres as the datastore. Postgres is my go-to database, and I host tons of small tools that use it as the backend. I like having a single instance that can then be easily backed up & restored, without having to resort to a ton of different databases. Therefore, any analytics tool would have to use it. Fourth, the tracking script should be minimal in size, so as not to affect load times too badly. The blog itself is pretty lightweight, so any tracking script bloat would defeat that. The Umami script has a content length of 1482 bytes once gzipped, not too shabby. Generally, I’ve been happy with the choice. However, there is one thing in Umami that annoys me more than it probably should: the visit timer. Apparently, the visit time is only updated once a user navigates to another page on the blog. If they simply leave, there’s no visit duration stored whatsoever. This makes the visit time tracker completely useless. I’m not the first one to notice this but the issue has since been moved to a discussion which has seen no progress. Good news is there are a few things one could do: perhaps add a custom event to track this?
Or fork Umami, since it’s open source and fix it. Both of these fall strictly into my “can’t be arsed to do” category, so I guess it’s not that important. Thanks for reading! Perhaps there are other analytics tools that tick the boxes above? Let me know in the comments below.

0 views
Danny McClelland 10 months ago

Privacy

I believe privacy is a fundamental right, and I’ve designed this blog to respect yours. What I Track This blog uses Umami Analytics to collect minimal, anonymous page view data. I track this information solely to understand which content resonates with readers, helping me focus my design and writing efforts on what’s genuinely valuable to my audience. What I collect: Page views and basic navigation patterns General geographic regions (country level only) Referrer information (which site led you here) Device type (desktop, mobile, tablet) What I don’t collect:

0 views

Fintool, Warren Buffett as a Service

As a dedicated Warren Buffett fan, I’ve made it a point to attend the Berkshire Hathaway Annual Meeting every year since I moved to the US. His personal values have greatly influenced my ethics in life, and I'm fascinated by his approach to business. I've written numerous blog posts over the years on investing , competitive moats , Intelligent CEO s, or whether to buy a house —all inspired by Buffett. Concepts like margin of safety and buying below intrinsic value were key to running and eventually selling my previous startup. When I sold my previous company—a legal search engine powered by AI—I invested a portion of my gains into BRK stocks, trusting in Buffett’s methodology. But as someone who has spent over a decade working in AI, a question kept nagging at me: Could an advanced language model do what Warren Buffett does? Jim Simons from Renaissance Technology made over $100B in profits by using machine learning to analyze vast amounts of quantitative data to identify subtle patterns and anomalies that can be exploited for trading. He relies heavily on quantitative data, but what if we could now do the same for qualitative textual data now that LLMs have reasoning capabilities? Warren Buffett's letters, biographies, and investment decisions provide a wealth of knowledge about how to find, analyze, and understand companies. There are even textbooks on value investing that detail the step-by-step process. What if we could break down Buffett’s process into individual tasks and use an AI agent to replicate his approach? At Fintool, we took on that challenge. We deconstructed most of the tasks that Buffett performs to analyze a business—reading SEC filings, understanding earnings, evaluating management decisions—and we built an AI financial analyst to handle these tasks with precision and scale. In some fields, like law, language models are already performing well. 
Ask an AI to draft an NDA or a Share Purchase Agreement (SPA), and it can quickly generate a document that's almost ready to go, with minor tweaks. At worst, you might need to provide some context or feed in additional documents, but the model already knows the structure and intent.

Ask ChatGPT to generate a Non-Disclosure Agreement (NDA) for a software company and it will do great. Ask ChatGPT to analyze the owner earnings over the past 5 years of founder-led companies in the S&P 500 and it will fail.

Finance both demands the strengths of LLMs and exposes their weaknesses. Financial professionals require real-time data, but advanced LLMs like GPT-4 have a knowledge cut-off of October 2023. There is zero tolerance for errors: hallucinations simply aren't acceptable when billions of dollars are at stake. Finance involves processing vast amounts of numerical data, an area where LLMs often struggle, and it requires scanning many companies comprehensively, while LLMs can struggle to analyze even a single one effectively. The combination of financial data complexity, the need for speed, and absolute accuracy makes it one of the toughest challenges for AI to tackle.

Let's go back to our question: Compare the owner earnings over the past 5 years of founder-led companies in the S&P 500. Our LLM Warren Buffett needs to do the following:

1. Identify founder-led companies within the S&P 500 by reading at least 500 DEF14A Proxy Statements (approximately 100 pages per document).
2. Understand that Owner Earnings = Net Income + Depreciation and Amortization + Non-Cash Charges - Capital Expenditures (required to maintain the business) - Changes in Working Capital.
3. Extract financial data from the past 5 years (net income, CapEx, working capital changes) for the 500 companies by reading at least 2,500 annual reports.
4. Compute the data by comparing year-over-year owner earnings growth or decline, looking at trends such as increasing CapEx, expanding net income, or significant working capital changes.
5. Write a comprehensive, error-proof report.

This is very hard: every step has to be correct. Institutional investors ask hundreds of questions like that.

By reading Buffett's shareholder letters, biographies, and value investing textbooks, we broke down Buffett's workflow into specific tasks. Then, we started building our infrastructure piece by piece to replicate these tasks for institutional investors, allowing them to quantitatively and qualitatively analyze a business.

I won't go into the hundreds of tasks we identified, but for instance, we created a "screener API" where you can ask qualitative questions on thousands of companies, like "Which tech companies are discussing increasing Capex for AI initiatives?". With just one data type, SEC filings and earnings calls, we have 70 million chunks, 2 million documents, approximately 500GB of data in Elastic, and around 5TB of data in Databricks for every ten years of data. And that's just one part of the vast amount of data we handle!

From Fintool company screener

We also built another API for our agents that can retrieve any number from any filing, along with its source. Additionally, we have an API that excels at computing numbers efficiently. For that challenge, we have partnered with OpenAI on a research project to use LLMs to extract every data point in SEC filings. Every week, we process 50 billion tokens, equivalent to 468,750 books of 200 pages each, or 12 times the size of Wikipedia. Our data pipelines are designed to locate, verify, deduplicate, and compare every data point for accuracy and insight.

Fintool "Spreadsheet Builder" answering a question on precise data points

We are continuously adding new capabilities to our infrastructure. Our Warren Buffett Agent will use these APIs around the clock to find investment opportunities, analyze them, and respond to customer requests. Although the final product is still in development, we already have a live version in use.
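To make the owner earnings question from earlier concrete, the formula and the year-over-year comparison can be sketched in a few lines of Python. This is an illustration only, not Fintool's implementation: the `AnnualFigures` type, the function names, and the sample numbers are all hypothetical, and the sketch assumes the extraction step has already produced clean figures in a single currency unit.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AnnualFigures:
    """One fiscal year of inputs (hypothetical type), all in the same unit, e.g. $M."""
    year: int
    net_income: float
    depreciation_amortization: float
    other_non_cash_charges: float
    maintenance_capex: float          # CapEx required to maintain the business
    change_in_working_capital: float  # positive means working capital increased


def owner_earnings(f: AnnualFigures) -> float:
    """Owner Earnings = Net Income + D&A + Non-Cash Charges - CapEx - Change in WC."""
    return (
        f.net_income
        + f.depreciation_amortization
        + f.other_non_cash_charges
        - f.maintenance_capex
        - f.change_in_working_capital
    )


def yoy_growth(series: list[AnnualFigures]) -> dict[int, float]:
    """Year-over-year owner earnings growth, keyed by the later year."""
    ordered = sorted(series, key=lambda f: f.year)
    growth: dict[int, float] = {}
    for prev, curr in zip(ordered, ordered[1:]):
        base = owner_earnings(prev)
        if base == 0:
            continue  # growth is undefined against a zero base
        growth[curr.year] = (owner_earnings(curr) - base) / abs(base)
    return growth


# Made-up figures for a single company, purely for illustration.
sample = [
    AnnualFigures(2022, 100.0, 20.0, 5.0, 30.0, 10.0),
    AnnualFigures(2023, 120.0, 22.0, 4.0, 34.0, 10.0),
]
print({f.year: owner_earnings(f) for f in sample})
print(yoy_growth(sample))
```

The arithmetic itself is the easy part. The difficulty described above lies in filling these fields correctly for 500 companies over five years, since a single mis-extracted CapEx or working capital figure silently corrupts every downstream comparison.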
The results are promising. Fintool reaches 97% on FinanceBench, the industry-leading benchmark for financial questions for public equity analysts, far outpacing any other model.

Delivering Practical Value to Customers Today

I refuse to let our website be a placeholder with vague statements like "we are an AI lab building financial agents." Instead, every part of our growing infrastructure is put to practical use and sold to real customers, including major hedge funds like Kennedy Capital and companies like PwC. Their feedback is essential in refining our product, which we believe will be a significant advancement for the industry.

Today, customers use Fintool to ask broad questions like "List consumer staples companies in the S&P 500 that are discussing shrinkage?" or niche questions like "Break down Nvidia CEO compensation and equity package." They can also configure AI agents to scan news filings for critical information such as an executive departure or earnings restatements. This is only the beginning.

Why It Will Be Big

Institutional investors are among the most highly paid knowledge workers in the world. They make millions for their ability to sift through thousands of SEC filings, spot insights, and make calculated decisions on which companies to back.

As Greylock noted in their article on vertical AI: “There are several attributes that make financial services well-suited to AI. The market is huge, with $11 trillion in market cap in the U.S. alone, and there's demonstrated demand for AI tools.” We couldn't agree more. When you look at the daily responsibilities of these professionals, it's easy to see where AI fits in. The work requires a mix of mathematical expertise and human judgment. Yet a significant portion of their workload involves mundane, manual tasks that Fintool's AI can automate and optimize.

A Massive and Profitable Industry

The financial research industry is one of the largest and most profitable software verticals in the world, dominated by a handful of key players.
Just take a look at the numbers:

- Bloomberg: $12B in revenue
- S&P Global: $12.5B in revenue, $6.6B EBITDA
- FactSet: $1.8B in revenue, $842.5M EBITDA
- MSCI: $2.5B in revenue, $1.7B EBITDA

These companies are highly successful because financial professionals are willing to pay a premium for tools that give them an edge. Active investment managers spend more than $30B per year on data and research services.

A Bloomberg Terminal

The Economics of AI in Finance

Adding to that, the unit economics of using AI are vastly better than hiring human analysts. At Fintool, we're building software that can replace expensive knowledge workers, automating processes that once required teams of analysts. This is crucial given that the industry faces a talent shortage.

According to the venture firm NFX, “The biggest opportunities will exist where the unit economics of hiring AI are 100x better than hiring a person to do the job.” At Fintool, we fit perfectly into that framework. Here's why:

- Automatable Processes: From screening SEC filings to running detailed financial models, a large part of an investor's workflow can be done by AI.
- Cost Savings: In an industry where top analysts are paid millions, the cost savings from using AI are astronomical.
- Hiring Challenges: Recruiting top financial analysts is a competitive and costly process, often with long onboarding periods. AI can eliminate these pain points.
- Tool Fragmentation: Today's financial professionals juggle a wide array of tools. Fintool consolidates these into one powerful platform.
- Vast Training Data: Fintool leverages proprietary data and vast amounts of public filings to create a unique advantage.

We're creating Warren Buffett as a service: a platform that uses advanced language models to find financial opportunities at scale. With the unit economics favoring AI and the immense potential to revolutionize how institutional investors work, we believe Fintool is positioned to be the next big thing in financial analysis.
If we succeed, we won't just be building a tool to analyze businesses; we'll be building the future of how financial professionals make decisions.

0 views
Binary Igor 1 years ago

Simple yet Scalable Web Analytics: JSON in SQL with batch inserts

When building landing pages and blogs, we usually want to have some traffic data and its analytics. Monitoring activity on our web pages turns out to be quite useful ... Similarly, when we build web applications, we want to have analytical data to understand the behaviors and interactions of our users.

0 views
pSYoniK 3 years ago

Simple Website Analytics Using CaddyServer Logs

I previously discussed my recent move towards digital minimalism. As part of that transition, I switched to a static website and removed any external services that I felt weren't adding value. I still wanted some basic analytics, but all the solutions I found added unnecessary bloat, and for my use case I didn't need anything fancy, since I don't use analytics to sell you something or target content. With that in mind, I decided to develop a small console application that parses the Caddy server logs generated by this page and outputs the result to the console.

0 views