ava's blog Yesterday

my data should not be your cookie jar

It’s 1970. You walk into the store, grab a bunch of apples, go to the cash register, pay with cash, and walk out. What kinds of data have been automatically processed about you while doing that? Very little. Most likely none, as CCTV footage relied on the development of VHS to be viable, and IP cameras transmitting video over networks only took off in the 90s. Fast forward to today. Depending on where you live, your supermarket has good cameras everywhere; some, like the super fancy new experiments, have recognition technology that detects what items you grab so that you can just pay without scanning, or even just walk out and have it subtracted automatically from your account. This isn’t just Amazon stores; the German chain Rewe is trying to get into it too - I personally know someone who works in the “Smart Store Rollout” department of its subsidiary Lekkerland. A more mundane but very common thing for the big stores is tracking you with RFID technology: they track where you are and how long you stay at specific spots through a network of fixed RFID readers and the RFID tag on your loyalty card or shopping cart (or the individual scanners Rewe offers nowadays!). By noting the time and location of each tag read, the system can create a map of your path and your duration of stay within the store.

Your supermarket might also have an app for specific sales and offers. For a little while, mine even made it seem as if you could only buy specific products by paying through the app. They dropped that after a while, but I’m sure it got many people to download it and make an account - as, of course, you could not use it without one. At the checkout, you might opt for self-checkout now. I’ve seen that stores in the US record your face and your hands scanning the products, so in case you try to sneak something, they have clear proof and identification options. That video gets analyzed and stored for a while.
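The path-and-dwell mapping described above takes very little code once the tag reads exist. A minimal sketch - the timestamps, reader locations, and grouping logic are all invented for illustration, not taken from any real store system:

```python
from datetime import datetime
from itertools import groupby

# Hypothetical reads from a fixed in-store reader network: each entry is
# (timestamp, reader location) for one tag (loyalty card or cart).
reads = [
    ("09:00:10", "entrance"),
    ("09:00:40", "produce"),
    ("09:01:10", "produce"),
    ("09:03:30", "dairy"),
    ("09:04:00", "checkout"),
]

def dwell_times(reads):
    """Collapse consecutive reads at the same reader into a path with an
    approximate duration of stay (in seconds) per location."""
    parsed = [(datetime.strptime(t, "%H:%M:%S"), loc) for t, loc in reads]
    path = []
    for loc, group in groupby(parsed, key=lambda read: read[1]):
        stamps = [ts for ts, _ in group]
        path.append((loc, (stamps[-1] - stamps[0]).total_seconds()))
    return path

print(dwell_times(reads))
# [('entrance', 0.0), ('produce', 30.0), ('dairy', 0.0), ('checkout', 0.0)]
```

With denser reads, the same grouping yields exactly the "map of your path and duration of stay" described - which is the point: this is trivially easy for the store to compute.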
Either way, you might use a loyalty card you signed up for with your real name and address to collect points or get a discount, which tracks exactly what you bought, and you’ll likely pay by card. Your bank account then holds a bit more information about where you shop and when than if you had just withdrawn cash. If you’re like me, you also pay contactless via phone or watch, giving a processor like Google Pay or Apple Pay some info as well. All that for quickly getting something at the grocery store - something that would not have given the companies much meaningful data about you specifically even just 55 years ago. Of course, some of these things are avoidable, and no one forces you to use apps, bank cards, and loyalty cards, but still. These things are not presented as the data harvesters they are, but as convenience and a way to save money or time, targeting vulnerable groups the most.

But why even go to the store? Maybe you live in a country with delivery options like Instacart and the like. One more service related to the groceries you buy, one more app, one more user account. What if you can’t or don’t wanna cook? Just get a delivery via DoorDash, UberEats, Lieferando, or the equivalent in your country. More data about you - and that’s just food. What if you aren’t buying apples at the grocery store, but lamps, frames, or a new bed cover? Nowadays, you’d most likely either have a similar shopping experience to the grocery store, or you’ll shop online on the company’s website or app, which may or may not also show ads, place tracking cookies, or read other data on your phone. They might get you with a 5% off coupon if you just sign up for their newsletter! So you do. Not many use a throwaway email address or immediately unsubscribe. Now they have constant access to you and your attention if they want it, not just while you’re at their physical stores.
A marketing email popping up at the right time creates desires and a suggestion to do some online window shopping, again creating data as you use their website or app. And then there are the shipping companies… What about the news? You can still buy magazines and newspapers at the store and the corner shop/kiosk, or from those little newspaper vending machines that drop one if you put in a coin. But everything is moving to digital nowadays, saving waste and printing costs, so to read the same newspaper online, you have to either pay with your data or pay more than print used to cost - and even then, still pay with your data. Subscribing to the digital version or unlocking a single article via a one-time payment still tracks you and still shows you ads on many, many news sites. And what do you pay for? If you’re unlucky, it is the same article copy-pasted across 10 different newspapers, or a completely AI-generated article with zero human effort. For comparison: just buying the print edition at a coin vending machine leaves them completely in the dark about you. That was just normal.

I notice this in all kinds of industries and parts of life now - it’s why everything now requires an app and a sign-up. Your local café, your hairdresser, your e-scooter. Hell, I even saw that anti-nail-biting nail polish now comes with an app. New washing machines and refrigerators report back to their manufacturers. Why is it now accepted that every place, every product company is a data aggregation company as well? Why is my data the cookie jar that companies keep getting their hands stuck in while acting entitled? Hello, I already paid you - why are you not ashamed of your obvious greed? What tires me about all of this is that we are supposed to pretend this is all normal, as if it has always been this way, and pretend that this isn’t just double-dipping: I pay money, and then I also generate money with my data.
In the case of loyalty cards and discounts, you could say there is a fair trade, as the price gets lowered, but that is the minority. The majority of the time, we are tracked and profiled with no advantage for us, no compensation. And even when there is one, and the default pricing is higher if you don’t share data, that ends up being financial discrimination and affects your choice significantly. As prices rise everywhere, paying with our data gets us almost no relief and is just an ever-growing additional income stream on the side for these companies. Despite having this pile of digital gold to pad their wallets, they still pretend that they have to raise prices all the time for all kinds of situations, and then never lower them when those situations resolve, as the profit from the data - sold to advertisers and AI companies - is concentrated at the top of the chain. Companies used to be fine selling via means that did not track and invade your life this hard; now we’re supposed to pretend these things are essential. Essential for what? More ads? More manipulation? Better sales numbers? More money for the CEO? They are not essential. We could drop three quarters of these mechanisms with no discernible change to the user experience or product access. The reality is that literal essentials are gatekept behind this constant harassment and evaluation.

How long until not complying with this surveillance regime downright hurts? When you cannot pay cash, or you cannot get into the store without scanning a QR code via their app for authentication, or pricing is personalized based on the profile they have about you - compiled not just from the store’s data, but from data they bought from data brokers? Your loyalty status, past purchases, your income information, credit score, propensity-to-pay algorithms, Meta social media info, …? Premium loyalty tiers where you ironically pay for more privacy?
Predictive technology wrongfully classifying you as a high risk for stealing and banning you from the store? I’m tired of every niche jumping on this opportunity to be the next Cambridge Analytica. You are a hardware store, not a data broker! I keep swatting your hand out of the jar, but you are back in there every time I look.

Published 29 Nov, 2025

Simon Willison 4 days ago

Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson

I talked with CL Kao and Dori Wilson for an episode of their new Data Renegades podcast titled Data Journalism Unleashed with Simon Willison. I fed the transcript into Claude Opus 4.5 to extract this list of topics with timestamps and illustrative quotes. It did such a good job I'm using what it produced almost verbatim here - I tidied it up a tiny bit and added a bunch of supporting links.

What is data journalism and why it's the most interesting application of data analytics [02:03] "There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering. And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist."

The origin story of Django at a small Kansas newspaper [02:31] "We had a year's paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty. And at the time we thought we were building a content management system."

Building the "Downloads Page" - a dynamic radio player of local bands [03:24] "Adrian built a feature of the site called the Downloads Page. And what it did is it said, okay, who are the bands playing at venues this week? And then we'll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week."

Working at The Guardian on data-driven reporting projects [04:44] "I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process."

Washington Post's opioid crisis data project and sharing with local newspapers [05:22] "Something the Washington Post did that I thought was extremely forward thinking is that they shared [the opioid files] with other newspapers. They said, 'Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?'"

NICAR conference and the collaborative, non-competitive nature of data journalism [07:00] "It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole."

ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02] "The Baltimore Banner are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it's a very healthy newsroom. They do amazing data reporting... And I believe they're almost breaking even on subscription revenue [correction, not yet], which is astonishing."

The "shower revelation" that led to Datasette - SQLite on serverless hosting [10:31] "It was literally a shower revelation. I was in the shower thinking about serverless and I thought, 'hang on a second. So you can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?'"

Datasette's plugin ecosystem and the vision of solving data publishing [12:36] "In the past I've thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who's going to solve data like publishing tables full of data on the internet? So that was my original goal."

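The "shower revelation" idea above - bake data into a read-only SQLite file and ship it as a static asset - can be sketched in a few lines. The file name and `trees` table here are invented for illustration; the `mode=ro` URI parameter is what makes the baked file safe to serve:

```python
import os
import sqlite3
import tempfile

# Bake a small database into a file ahead of deployment (illustrative data).
path = os.path.join(tempfile.mkdtemp(), "baked.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE trees (id INTEGER, species TEXT)")
conn.executemany("INSERT INTO trees VALUES (?, ?)", [(1, "oak"), (2, "maple")])
conn.commit()
conn.close()

# Serve queries against the file opened read-only: writes now fail, so the
# database can travel with the application like any other static asset.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT count(*) FROM trees").fetchone()[0])  # 2
try:
    ro.execute("INSERT INTO trees VALUES (3, 'pine')")
except sqlite3.OperationalError:
    print("writes are rejected")
```

Datasette itself does far more than this, of course; the sketch only shows why a read-only database sidesteps the "no Postgres on serverless" problem.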
Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59] "Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York."

Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40] "It turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I'm like, 'Wow, that's going to get me thrown out of a window.'" Bellingcat: Food Delivery Leak Unmasks Russian Security Agents

The frustration of open source: no feedback on how people use your software [16:14] "An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it."

Open office hours on Fridays to learn how people use Datasette [16:49] "I have an open office hours Calendly, where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people."

Data cleaning as the universal complaint - 95% of time spent cleaning [17:34] "I know every single person I talk to in data complains about the cleaning that everyone says, 'I spend 95% of my time cleaning the data and I hate it.'"

Version control problems in data teams - Python scripts on laptops without Git [17:43] "I used to work for a large company that had a whole separate data division and I learned at one point that they weren't using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly."

The Carpentries organization teaching scientists Git and software fundamentals [18:12] "There's an organization called The Carpentries. Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that."

Data documentation as an API contract problem [21:11] "A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren't doing it."

The importance of "view source" on business reports [23:21] "If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I'm thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%."

Fact-checking process for data reporting [24:16] "Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they're confident enough to publish them."

Queries as first-class citizens with version history and comments [27:16] "I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that's changed over time and be able to post comments on it."

Two types of documentation: official docs vs.
temporal/timestamped notes [29:46] "There's another type of documentation which I call temporal documentation where effectively it's stuff where you say, 'Okay, it's Friday, the 31st of October and this worked.' But the timestamp is very prominent and if somebody looks that in six months time, there's no promise that it's still going to be valid to them."

Starting an internal blog without permission - instant credibility [30:24] "The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it."

Building a search engine across seven documentation systems [31:35] "It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company."

The TIL (Today I Learned) blog approach - celebrating learning basics [33:05] "I've done TILs about 'for loops' in Bash, right? Because okay, everyone else knows how to do that. I didn't... It's a value statement where I'm saying that if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out how to learn 'for loops' in Bash."

Coding agents like Claude Code and their unexpected general-purpose power [34:53] "They pretend to be programming tools but actually they're basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything."

Skills for Claude - markdown files for census data, visualization, newsroom standards [36:16] "Imagine a markdown file for census data. Here's where to get census data from. Here's what all of the columns mean. Here's how to derive useful things from that. And then you have another skill for here's how to visualize things on a map using D3...
At the Washington Post, our data standards are this and this and this." Claude Skills are awesome, maybe a bigger deal than MCP

The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22] "The terminal is now accessible to people who never learned the terminal before 'cause you don't have to remember all the commands because the LLM knows the commands for you. But isn't that fascinating that the cutting edge software right now is it's like 1980s style— I love that. It's not going to last. That's a current absurdity for 2025."

Cursor for data? Generic agent loops vs. data-specific IDEs [38:18] "More of a notebook interface makes a lot more sense than a Claude Code style terminal 'cause a Jupyter Notebook is effectively a terminal, it's just in your browser and it can show you charts."

Future of BI tools: prompt-driven, instant dashboard creation [39:54] "You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we'll do a bar chart over time and these numbers feel big so we'll put those in a big green box."

Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06] "LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff."

LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36] "You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, 'here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,' and they just do it."

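The extraction pattern in the FOIA quote above - a scanned page in, fixed-schema JSON out - usually needs a validation step on top, since 95-98% accuracy still leaves real errors in a large batch. A minimal sketch of that validation side, with a canned string standing in for the model's response and field names that are purely illustrative:

```python
import json

# Canned stand-in for what a model might return for one scanned
# police-report page; the field names are illustrative, not from any
# real system mentioned in the conversation.
model_response = '''{"arresting_officer": "J. Smith",
                     "incident_date": "2024-03-01",
                     "description": "..."}'''

REQUIRED = {"arresting_officer", "incident_date", "description"}

def parse_record(raw):
    """Parse the model's JSON and fail loudly on missing fields, so bad
    extractions are flagged for human review instead of slipping through."""
    record = json.loads(raw)
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

record = parse_record(model_response)
print(record["arresting_officer"])  # J. Smith
```

In a real pipeline the rejected records would go into a review queue; the point is that the cheap part (schema checking) catches a useful share of the few-percent error rate.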
Data enrichment: running cheap models in loops against thousands of records [44:36] "There's something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well." datasette-enrichments

Multimodal LLMs for images, audio transcription, and video processing [45:42] "At one point I calculated that using Google's least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive." Correction: with Gemini 1.5 Flash 8B it would cost 173.25 cents

First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54] "I hated C++ 'cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something."

Biggest production bug: crashing The Guardian's MPs expenses site with a progress bar [47:46] "I tweeted a screenshot of that progress bar and said, 'Hey, look, we have a progress bar.' And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar." Crowdsourced document analysis and MP expenses

Favorite test dataset: San Francisco's tree list, updated several times a week [48:44] "There's 195,000 trees in this CSV file and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who."

Showrunning TV shows as a management model - transferring vision to lieutenants [50:07] "Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them." The Eleven Laws of Showrunning by Javier Grillo-Marxuach

Hot take: all executable code with business value must be in version control [52:21] "I think it's inexcusable to have executable code that has business value that is not in version control somewhere."

Hacker News automation: GitHub Actions scraping for notifications [52:45] "I've got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire."

Dream project: whale detection camera with Gemini AI [53:47] "I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale."

Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23] "Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing." Mark Steel's in Town available episodes

Favorite fiction genre: British wizards caught up in bureaucracy [55:06] "My favorite genre of fiction is British wizards who get caught up in bureaucracy...
I just really like that contrast of like magical realism and very clearly researched government paperwork and filings." The Laundry Files, Rivers of London, The Rook

I used a Claude Project for the initial analysis, pasting in the HTML of the transcript since that included elements. The project uses the following custom instructions:

You will be given a transcript of a podcast episode. Find the most interesting quotes in that transcript - quotes that best illustrate the overall themes, and quotes that introduce surprising ideas or express things in a particularly clear or engaging or spicy way. Answer just with those quotes - long quotes are fine.

I then added a follow-up prompt saying:

Now construct a bullet point list of key topics where each item includes the mm:ss in square braces at the end

Then suggest a very comprehensive list of supporting links I could find

Here's the full Claude transcript of the analysis.
Crowdsourced document analysis and MP expenses Favorite test dataset: San Francisco's tree list, updated several times a week [48:44] "There's 195,000 trees in this CSV file and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who." Showrunning TV shows as a management model - transferring vision to lieutenants [50:07] "Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them." The Eleven Laws of Showrunning by Javier Grillo-Marxuach Hot take: all executable code with business value must be in version control [52:21] "I think it's inexcusable to have executable code that has business value that is not in version control somewhere." Hacker News automation: GitHub Actions scraping for notifications [52:45] "I've got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire." Dream project: whale detection camera with Gemini AI [53:47] "I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale." 
Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23] "Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing." Mark Steel's in Town available episodes Favorite fiction genre: British wizards caught up in bureaucracy [55:06] "My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings." The Laundry Files , Rivers of London , The Rook
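The data-enrichment pattern mentioned in the notes (running a cheap model in a loop against thousands of records and attaching structured output) is simple enough to sketch generically. This is a minimal illustration, not anything from the episode: `call_model` is a placeholder for whatever LLM client you use, and the prompt shape and failure handling are assumptions.

```python
import json

def enrich(records, call_model, field="enrichment"):
    """Run a model over each record and attach its parsed JSON output.

    `call_model` is a stand-in for a cheap LLM API call; records whose
    responses aren't valid JSON are collected as failures instead of
    crashing the whole batch.
    """
    enriched, failures = [], []
    for record in records:
        prompt = f"Return JSON with a 'category' for: {record['text']}"
        try:
            payload = json.loads(call_model(prompt))
            enriched.append({**record, field: payload})
        except (json.JSONDecodeError, KeyError):
            failures.append(record)
    return enriched, failures
```

At the price point of the cheaper models, a loop like this over thousands of records costs very little, which is exactly the appeal described in the conversation.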

ava's blog 1 week ago

📌 i got my data protection law certificate!

On the 30th of October, I officially finished my data protection law certificate! I'm a bit late to post this because I was so busy and still needed to wait for the actual paper to arrive plus getting a frame and all. :) The certificate ('Diploma of Advanced Studies') is designed to take 3 semesters part-time. I finished it in one semester with a grade average of 2,2 1 while continuing my other part-time degree (a Bachelor of Laws, LL.B) and full-time work. It is quite a bit more intensive than the 2-week crash courses for becoming a data protection officer and I had to write 6 exams in total, but it qualifies me to be one and permits me to call myself a certified consultant for data protection law. I'll have to refresh it every 4 years with a refresher course, or lose it. While I love to write about commercial tech and social media through a privacy lens here and burn for that topic in private, I intend my career/professional focus to be health data and AI. I already work with pharmaceutical databases in my job, and I wouldn't wanna miss that part of my work day. My first of hopefully many pieces of paper on that wall 2 . Would love to do AIGP, CIPP/E, CIPM and ISO 27001 Lead Implementer some time, and obviously finish my Bachelor degree and start a Master's in data protection law. This cert consisted of the first 3 modules of that Master's degree already, so I know what's ahead of me and I know I can do it. :) Now I'm off to another MRI, because my body is being difficult. I hope to post more soon <3 Reply via email Published 20 Nov, 2025
1. In case there is confusion, it is the opposite of the American GPA system: 1,0 is good, 4,0 is bad. ↩
2. I may even get a second frame already to also put up the actual grade records next to it. The one on the wall is just the naming rights proof. ↩

Jim Nielsen 2 weeks ago

Data Storage As Files on Disk Paired With an LLM

I recently added a bunch of app icons from macOS Tahoe to my collection. Afterwards, I realized some of them were missing relational metadata. For example, I have a collection of iMovie icons through the years which are related in my collection by their App Store ID. However, the latest iMovie icon I added didn’t have this ID. This got me thinking, "Crap, I really want this metadata so I can see apps over time. Am I gonna have to go back through each icon I just posted and find their associated App Store ID?” Then I thought: “Hey, I bet AI could figure this out — right? It should be able to read through my collection of icons (which are stored as JSON files on disk), look for icons with the same name and developer, and see where I'm missing and .” So I formulated a prompt (in hindsight, a really poor one lol): look through all the files in and find any that start with and then find me any icons like iMovie that have a correlation to other icons in where it's missing and But AI did pretty well with that. I’ll save you the entire output, but Cursor thought for a bit, then asked to run this command: I was like, “Ok. I couldn’t write that myself, but that looks about right. Go ahead.” It ran the command, thought some more, then asked to run another command. Then another. It seemed unsatisfied with the results, so it changed course, wrote a node script, and asked permission to run that. I looked at it and said, “Hey that’s probably how I would’ve approached this.” So I gave permission. It ran the script, thought a little, then rewrote it and asked permission to run again. Here’s the final version it ran: And with that, boom! It found a few newly-added icons with corollaries in my archive, pointed them out, then asked if I wanted to add the missing metadata. The beautiful part was I said “go ahead” and when it finished, I could see and review the staged changes in git. 
This let me double check the LLM’s findings against my existing collection to verify everything looked right — just to make sure there were no hallucinations. Turns out, storing all my icon data as JSON files on disk (rather than in a database) wasn’t such a bad idea. Part of the reason I’ve never switched from static JSON files on disk to a database is because I always figured it would be easier for future me to find and work with files on disk (as opposed to learning how to set up, maintain, and query a database). Turns out that wasn’t such a bad bet. I’m sure AI could’ve helped me write some SQL queries to do all the stuff I did here. But what I did instead already fit within a workflow I understand: files on disk, modified with scripting, reviewed with git, checked in, and pushed to prod. So hey, storing data as JSON files in git doesn’t look like such a bad idea now, does it, future Jim? Reply via: Email · Mastodon · Bluesky
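For anyone curious what the matching step looks like without an LLM, here's a minimal sketch in Python. The field names (`name`, `developer`, `appStoreId`) and the one-JSON-file-per-icon layout are assumptions about the collection's schema, not Jim's actual script.

```python
import json
from collections import defaultdict
from pathlib import Path

def find_missing_ids(icon_dir):
    """Group icon JSON files by (name, developer) and, for entries missing
    an App Store ID, suggest the ID already carried by siblings in the group.
    Field names here are illustrative, not a real schema."""
    groups = defaultdict(list)
    for path in Path(icon_dir).glob("*.json"):
        icon = json.loads(path.read_text())
        groups[(icon.get("name"), icon.get("developer"))].append((path, icon))

    fixes = {}
    for entries in groups.values():
        known = {i["appStoreId"] for _, i in entries if i.get("appStoreId")}
        if len(known) == 1:  # only suggest when the group agrees on one ID
            (app_id,) = known
            for path, icon in entries:
                if not icon.get("appStoreId"):
                    fixes[path.name] = app_id
    return fixes
```

The nice part, as the post says, is that the output is just edits to files on disk, so git diff remains the review tool.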

Evan Schwartz 3 weeks ago

Scour - October Update

Hi friends, In October, Scour ingested 1,042,894 new posts from 14,140 sources. I was also training for the NYC Marathon (which is why this email comes a few days into November)! Last month was all about Interests:
- Your weekly email digest now includes a couple of topic recommendations at the end. And, if you use an RSS reader to consume your Scour feed, you’ll also find interest recommendations in that feed as well.
- When you add a new interest on the Interests page, you’ll now see a menu of similar topics that you can click to quickly add.
- You can browse the new Popular Interests page to find other topics you might want to add.
- Infinite scrolling is now optional. You can disable it and switch back to explicit pages on your Settings page. Thanks Tomáš Burkert for this suggestion!
Earlier, Scour’s topic recommendations were a little too broad. I tried to fix that and now, as you might have noticed, they’re often too specific. I’m still working on solving this “Goldilocks problem”, so more on this to come! Finally, here were a couple of my favorite posts that I found on Scour in October:
- Introducing RTEB: A New Standard for Retrieval Evaluation
- Everything About Transformers
- Turn off Cursor, turn on your mind
Happy Scouring! - Evan

Robin Moffatt 3 weeks ago

Tech Radar (Nov 2025) - data blips

The latest Thoughtworks Tech Radar is out. Here are some of the more data-related ‘blips’ (as they’re called on the radar) that I noticed. Each item links to the blip’s entry where you can read more information about Thoughtworks’ usage and opinions on it.
- Databricks Assistant
- Apache Paimon
- Delta Sharing
- Naive API-to-MCP conversion
- Standalone data engineering teams
- Text to SQL

Evan Schwartz 2 months ago

Scour - September Update

Hi friends, Welcome if you've recently signed up! And if you've been using Scour for a while, product updates are back! This summer was one where life got in the way of me working on Scour, but I'm back at it now. Please share any feedback you have on the feedback board to help me prioritize what to work on next! Since the last update in May, Scour has tripled the amount of content it is ingesting per month and it scoured 1,535,995 posts from 13,875 sources in September.
- Interest recommendations are now more specific and, hopefully, more interesting! Take a look at what it suggests for you and let me know what you think.
- You can now see the history of all the articles you clicked on through Scour. That should make it easier to find those good posts you read without needing to search for them through your feed. Relatedly, Scour now syncs the read state across devices (so links you've clicked will appear muted, even if you clicked them on another device). Thanks to u/BokenPhilia for the suggestion!
- Scour's weekly update emails and the RSS/Atom/JSON feeds it produces now include links to 🌰 Love, 👍 Like, or 👎 Dislike posts. Thank you to @ashishb and an anonymous user for suggesting this addition!
- You can now flag posts as Harmful, Off-Topic, Low-Quality, etc. (look for the flag icon below the link). Thank you to Andy Piper for the suggestion! On a related note, I also switched away from using LlamaGuard to identify harmful content. It was flagging too many okay posts, not finding many bad ones, and it became the single most expensive cost of operating this service. Scour now uses a domain blocklist along with explicit flagging to remove harmful content. Thank you to an anonymous user for the feedback!
- The RSS/Atom/JSON feeds produced by Scour now include a short preview of the content to help you decide if the article is worth reading.
Tripling the number of items Scour ingests means there are more posts to search through to find your hidden gems when you load your feed. That, unfortunately, slowed down the page load speed, especially when you're scouring All Feeds and/or looking at a timeframe of the past week or month. (Thank you to Adam Gluck for pointing out this slowness!) I spent quite a bit of time working on speeding up the feed loading again and cut the time by ~35% when you're scouring All Feeds. If you're interested in the technical details behind this speedup, you can read the blog post I wrote about it: Subtleties of SQLite Indexes. Finally, here were a couple of my favorite posts that I found on Scour in September:
- Identifying and Upweighting Power-Niche Users to Mitigate Popularity Bias in Recommendations
- Rules for creating good-looking user interfaces, from a developer
- Improving Cursor Tab with online RL
Happy Scouring!

Justin Duke 2 months ago

Hidden coupons

Much of our work at Buttondown revolves around resolving amorphous bits of state and cleaning it up to our ends, particularly state from exogenous sources. This manifests itself in a lot of ways: SMTP error codes, importing archives, et cetera. But one particularly pernicious source is Stripe. An author can come to Buttondown having already set up a Stripe account, whether for some ad hoc use case or because they were using a separate paid subscriptions platform such as Substack or Ghost that also interfaces with Stripe. And one of the first things we do is slurp up all that data so we understand exactly what their prior history is, how many paid subscribers they have, et cetera. As you might imagine, this is very, very effective, because the biggest perceived barrier for users is friction and how difficult it is for them to move from one place to another. And every time we can make it incrementally easier for them, it's worth our while. However, as you can also imagine, we deal with a lot of edge cases and idiosyncratic bits of behavior from Stripe. (And if anyone from Stripe is reading this essay, please don't interpret it as that large of a complaint, because Connect is a pretty impressive bit of engineering, janky as it is.) One thing we have to do is pull in all coupon and discount data. This is for a variety of reasons that are all uninteresting. The point of this essay is to talk about a divergence, and where the abstraction breaks down. You might think, as we once did, that the way to do this is pretty simple. You compile a list of all the available coupons, and then you iterate through every single subscription looking for said coupons. This is also the approach outlined in the docs and surfaced in the dashboard, so your naivete is excusable. 
However, this neglects an entirely different genre of discount: ad hoc discounts that are created and applied during the checkout session process, as well as probably a couple of other places of which I'm unaware. To find these, you must iterate through the subscriptions themselves: I'm sure there are a lot of interesting and nuanced reasons why these intangible coupons are not actually available through the core endpoint — I also don't care! It is a bad abstraction that I can get two different answers for "what are the coupons for this account?"; it is particularly bad because the "real" answer comes from looking in the non-obvious place. At the same time, I am sympathetic. "I should not have to create a dedicated Coupon object just to apply a single discount to a single subscription" is a very reasonable papercut that I understand Stripe's desire to solve; in so doing, they created a different (and perhaps more esoteric) problem. This is why API design is a fun and interesting problem.
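A stripped-down illustration of the divergence: the dicts below merely mimic the shape of Stripe's subscription objects (the exact field layout varies by API version, so treat this as a sketch, not Stripe's documented schema). The point is simply that the catalog listing and the per-subscription walk disagree.

```python
def coupons_from_catalog(coupons):
    """Coupons visible via the documented list endpoint."""
    return {c["id"] for c in coupons}

def coupons_from_subscriptions(subscriptions):
    """Ad hoc discounts only surface on the subscriptions themselves."""
    found = set()
    for sub in subscriptions:
        discount = sub.get("discount") or {}
        coupon = discount.get("coupon")
        if coupon:
            found.add(coupon["id"])
    return found

catalog = [{"id": "SPRING10"}]
subscriptions = [
    {"id": "sub_1", "discount": {"coupon": {"id": "SPRING10"}}},
    # created ad hoc during a checkout session, never in the catalog:
    {"id": "sub_2", "discount": {"coupon": {"id": "adhoc_2f9"}}},
    {"id": "sub_3", "discount": None},
]
missing = coupons_from_subscriptions(subscriptions) - coupons_from_catalog(catalog)
```

The two functions give two different answers to "what are the coupons for this account?", and only the second one is complete, which is the complaint in a nutshell.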

ava's blog 2 months ago

challenges around AI and the GDPR

Today, I once again met up with my mentor around data protection law and had some questions about his view on the compatibility of AI with several aspects of the General Data Protection Regulation (GDPR). The talk went really well and was super engaging, but I soon had to run home so I would not miss the GDPRhub meeting by noyb.eu! Quick plug: Consider donating to noyb.eu or even becoming a member if you care about defending data protection and privacy from Meta, Google et al. :) I am volunteering as a Country Reporter for them! The meeting included a really great presentation on the exact topics I had discussed earlier, and motivated me to write a post about it, so here we are. The first issues start with the actual aggregation of the data specific models are trained on. There are, of course, a lot of internal, small models being trained on a very limited dataset, for a highly specific purpose. It is usually in the best interest of their developers to keep it that way to not dilute the output and keep it small. What would scraping half the net do for a model that is supposed to help with a specific template document at work? But for the big ones (ChatGPT, Copilot, Grok and more) that are decidedly supposed to be all-rounders that can be used for anything, there is a clear incentive to vacuum up anything they can. This directly contradicts the principle of data minimization . Article 5 lists several principles relating to the processing of personal data, and data minimization is the idea that processing should be adequate, relevant and limited to what is necessary in relation to the purposes. In practice, this means acquiring the least amount of personal data needed to get the job done. For example: In recent rulings, it was declared unlawful that train companies in the EU 1 demand your gender, email address and/or telephone number just to buy a train ticket, as this information is needed neither for a customer to use a train, nor for the company to provide the service. 
It should be optional to share that, at least. This principle usually also incorporates aspects of another principle: storage limitation . Personal data should only be stored as long as is necessary, but how do you decide how long a name, address, telephone number and similar information is necessary for the purposes in a huge dataset? Depending on the methods, it might even be impossible to remove. Furthermore, there's the principle of purpose limitation . Processing needs a stated purpose that is specific, explicit and legitimate. With limited models, this may be easier, as their use is very targeted; but it is a point of contention in the legal discourse whether something as vague as "AI model training" or similar purposes are specific enough, and whether companies can invoke an exemption for their models based on scientific research purposes. Also: It is likely that the data they are scraping has been put out there or been acquired for a different purpose. If I consent to having my picture taken and put on the company website to promote our product and drive customer engagement, I am not consenting to my image being used for AI training 5 years down the line, for example. A change of purpose requires information, maybe even renewed consent, but that is a mixed bag. It is relatively easy to inform and get consent from users if they have an account on your platform, as you can show them a note about it and let them make a decision - but what about data scraping outside of platforms? How can people outside of services like the ones Meta, Microsoft, X or Google own get a chance to consent or be informed about their personally identifiable data being used to train AI? 
The GDPR handles information requirements in two main ways: Article 13 applies when companies (usually referred to as "controllers") obtain data and consent directly from you, and Article 14 applies when the controller obtains your data indirectly (for example, by using someone else's datasets, or by having your data transmitted to them by the company that actually obtained it directly from you). So in the case of the broad scraping taking place, companies are still technically required to inform you based on Article 14. The problem is that this is obviously not feasible in practice. If they scrape your name and have nothing else, how will they contact you? How are their employees supposed to search through terabytes of data to find personally identifiable data? 2 How would they detect sensitive data that comes with extra requirements, like data related to health, sexual identity, ethnic origin, religion and more ( Article 9 ) or the data of children ( Article 8 ), and fulfill them? How are they supposed to contact millions of people, and track who has opted out and who has opted in, who hasn't replied, etc.? Article 14 acknowledges this issue in section 5, which says that they don't have to inform you if doing so would not be feasible or would require disproportionate effort. This effectively means that most big companies training AI on extremely large datasets scraped from the entire net are off the hook from telling data subjects (you, as the affected person) about it. Consent is similarly tricky. Article 7 of the GDPR sets conditions for consent: the controller (company) needs to demonstrate that you gave consent 3 , and the way consent is obtained must not be misleading, must be distinguishable from other matters, accessible, easy to understand and not coerced, and it is not binding if it fails these standards. You have the right to withdraw your consent at any time. 
Again, this might be fulfilled in your user account settings, but how will you think of withdrawing consent if you have never even been informed that you are affected, and never had a chance to consent to begin with? By law, and by the way the tech works, withdrawing consent only covers future processing and will not affect past processing, and not all processing needs to have its legal basis in consent. 4 Article 6 covers the legal bases for any processing of personal data in the EU or of EU citizens. In practice, that means that usually one or more of Article 6 a) - f) will apply and is used as the legal basis. Only one of them needs to be fulfilled for processing to be lawful, and consent is only one of them. a) covers you giving consent for one or more specific purposes (blanket consent doesn't count!). b) is for when the processing is necessary to fulfill a contract - think of ordering from an online shop, which needs to give your address to the shipping company. c) covers legal obligations; for example, back when restaurants were required to ask for your name and contact info to comply with the Covid-19 measures the government put in place. d) is for the niche cases where processing your data is necessary to protect you or someone else - for example, security camera footage you appear in being used to help find a thief that stole from you or a neighbor. e) is a bit vague, covering what's "necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller". f) is the catch-all, saying processing is necessary due to a legitimate interest of the company or a third party, unless these are overridden by the interests and fundamental rights and freedoms of the data subject (you as the affected person), especially if it is a child. 
As covered above, consent can be acquired with a user account, but not otherwise, so for a chunk of data scraped from the web outside of social media platforms, that falls away. For example, if they scrape my blog, I did not give consent, and I also do not have a contract with them. They have no legal obligation to scrape it, the scraping doesn't protect me or someone else, it is likely not in the public interest and they aren't an official authority either. That leaves the catch-all: legitimate interest . 'Legitimate interest' can only be claimed as a legal basis if the processing violates no part of the GDPR , so that it cannot be used as a loophole; but arguably, as seen above, there are at least a lot of open questions, if not downright violations of the law as it stands. Additionally, the sensitive data I already mentioned (from Article 9) cannot be processed under legitimate interest. That would be a chunk of data without a legal basis, then. It's also a delicate matter to discuss in general - if you believe AI will significantly advance humanity and is a superpower, you will surely argue that this is a legitimate interest to pursue and that the rights and freedoms of data subjects don't trump it. But how do you prove this speculation about the future? What if we focus solely on more realistic interests, like the economic incentive of companies (the legitimate interest in keeping up with competition or making a lot of money) or vaguer reasons like "providing a high quality tool serving our users and helping in research and development"? Would that be enough? How would the company argue that, for example, my blog data is necessary for this legitimate interest? But aside from that, there are several things that could impact the rights and freedoms of you, the data subject. There is the problem that the more data someone acquires, the more likely it is that data about you is in it. 
It doesn't have to immediately be personal data, but it is recognized in the discourse that when enough data is available, data points can link up and become personally identifiable data that is no longer anonymous or unrelated. As an easy-to-understand example: An address by itself is not personal data, as it cannot be linked to anyone, but if you slowly fill in occupation, hair color, car brand and online orders, it becomes more likely that you can deduce which of the residents it is. A username like " RickRoller2000 " with no other information is not by itself personally identifiable data, but if you suddenly also have the actual name of the person behind the account and link it to their username, the username becomes personally identifiable data. That poses the issue that data which previously did not fall under the GDPR can suddenly come to be covered by it. This introduces more complexity about everything I have mentioned prior, and that also affects your rights. Additionally, there are inherent risks and difficulties of enforcing your rights (specific to the product) that need to be taken into account when judging legitimate interest vs. rights and freedoms. Things like:
- Once in it, your data might be difficult or impossible to remove (violating the data protection goal of intervenability, and the right to deletion in Article 17 ). At best, the output may be restricted as in Article 18 .
- Objection/withdrawal of consent only works ex nunc , not ex tunc (= it doesn't affect past processing).
- The possibility of harmful hallucinations that slander people - like when a journalist reports about a murder, and the model generates that the journalist committed the murder. If it weren't trained on these articles, it wouldn't happen. There are less severe cases too, where the output mischaracterizes or misattributes someone's work or impact. As some of these models increasingly seek to replace a Google Search or Wikipedia, this should not be underestimated. At best, the output could be influenced by corrections done by the company 5 via Article 16 .
- Possible inability to fulfill your right of access to your data (obtaining a copy) or confirmation that your data is being used (based on Article 15 ).
- A high likelihood of being uninformed, not having consented, or not having had a proper chance to consent; either because you don't have a user account, or you don't check your account settings to notice a new setting suddenly popping up, and also because Meta and others have set very short deadlines to opt out . Many services have decided to implement it this way to get as much consent as possible, but this also violates the GDPR, as consent needs to be freely and clearly given and opted into (in practice: via sliders or checkboxes that are not activated by default) and the principle of privacy-friendly defaults (also called "Privacy by Design" and "Privacy by Default" ). You need to know what you consent to, and you can't do that if you never go into your settings to see an already-enabled setting.
- Doubts about the companies' ability to apply different protections and requirements to different types of data in a dataset, for example health data and the data of children; the difficulty for children to understand the consequences and properly consent.
- It is arguably hard for non-experts in tech and AI (or: the average layperson) to understand how the data is processed and what the consequences are, not just due to the details of training and output generation, but also due to how new this sort of AI use and prevalence is, and how difficult it is to predict how it will be used in the future. How can you realistically give consent to something if you can't really judge its impact on you?
The GDPR is from 2016, came into effect in 2018 and was deliberately worded in a tech-neutral way to still welcome innovation and fit around a variety of products and services. However, it could hardly have predicted this. 
What's easy to enforce with phones, computers, or social media is a lot harder with training large language models or image generators. That raises the question - should the GDPR be changed? I personally hate that approach, but it is being discussed. We had hopes for the AI Act, and we could see some positive effects from the Digital Markets Act. However, the AI Act has the most regulations and requirements for high-risk AI, and basically none for low-risk systems. It seems ChatGPT and co. would very likely fall under low risk, meaning it doesn't even meaningfully regulate the most popular and powerful huge models. On the other hand, there are Data Protection Authorities that have a very supportive "make it happen" approach to it all, which has led to the Superior Court of Cologne ruling very favorably on Meta AI opting everyone in by default and offering a short time to opt out, saying Article 5 section 2 b) DMA (= that the company may not combine personal data from the relevant core platform service with personal data from any further core platform services, from any other services provided by them, or with personal data from third-party services) is not applicable. They also based this on the facts that the data is public, that Meta has legitimate interests that override the rights of users, that the purpose is specified enough, that the data and its use are low risk, and that there are no less intrusive ways for Meta to do this. They consider 6 weeks a long enough time to opt out. That decision is definitely regarded as horseshit by everyone I talked to about this stuff, but it nonetheless happened (even if it will be challenged). It is unfortunately a reality that there is intense lobbying, a strongly political arms race, and fears that Europe will fall further behind in tech innovation, and this definitely colors the approaches and decisions of regulatory bodies.
The Hamburg DPA also apparently said that even if the data was unlawfully collected, it doesn't mean that the hosting and use of the model is unlawful, which further muddies the waters and is hard to justify. As it is now, something has to give, and it is a little scary to see which side will likely give in. Now excuse me as I collapse on the sofa; I wrote this for four hours straight right after work and the meeting, and I am so tired. Reply via email Published 23 Sep, 2025

1. I know of specific cases in France and Germany (x) (x) ↩
2. And that is not all, as I will get to the issue of anonymous data turning into personally identifiable data later on. ↩
3. We're getting to some exemptions to that later. ↩
4. You also have a right to object based on Article 21, which means the company needs to stop processing that data unless it demonstrates compelling legitimate reasons for the processing which override your interests, rights, and freedoms as the affected person. But realistically, how do they stop processing your data if it is in a large dataset that grows daily, is used for training again and again, and in some way arguably is part of what decides the quality of the output? ↩
5. No idea how specifically it is handled internally there, but I've seen successful corrections, or the model suddenly not giving an answer to a prompt that previously worked and showed lies. ↩

Jeff Geerling 2 months ago

Digging deeper into YouTube's view count discrepancy

For a great many tech YouTube channels, views from desktop ("computer") users have been markedly down since August 10th (or so). This month-long event has kicked up some dust—enough that two British YouTubers, Spiffing Brit and Josh Strife Hayes, are having a very British argument 1 over who's right about the root cause. Spiffing Brit argued it's a mix of YouTube's seasonality (it's back-to-school season) and channels falling off, or as TechLinked puts it, "git gud", while Josh Strife Hayes points out the massive number of channels which identified a historic shift down in desktop views (compared to mobile, tablet, and TV) starting after August 10. This data was corroborated by this Moist Critical video as well.

Robin Moffatt 2 months ago

Stumbling into AI: Part 4—Terminology Tidy-up (and a little rant)

Having looked at MCP, Models, and RAG, I realised that I've been mentally skirting around something that I don't really understand, so I'm going to expose myself to some ridicule here and try to understand better: what's the difference between AI and ML? Aren't they just the same? OK, we're doing this, are we? I thought AI was just ✨magic✨? And ML was the thing that got data scientists mad stacks ten years ago before everyone realised you couldn't do shit without good data and processes? To me, a layperson in this space, watching it from the sidelines, AI and ML have been interchangeable. In fact, you'd get conferences and conference tracks titled "AI/ML"—because it's all kind of the same thing anyway, right? This is, of course, factually incorrect and presumably infuriating to anyone actually working in the field. The whole purpose of this blog series has been for me to at least reduce the number of unknown unknowns in my knowledge of this space—to build up a mental map of the different areas and terms so that at least I know where to go and look when encountering something that I know I don't know. With that framing in mind, this is roughly how I understand the terms AI and ML:

James O'Claire 2 months ago

The 300 Most Common Android Data Endpoints (and the Companies Behind Them)

Last week I wrote a blog working through a few unknown endpoints. My goal was to bring some attention to these lesser-known endpoints where many apps send their data. This post is split into two sections. The first, at the top here, covers endpoints that do not have landing pages and have not been tagged. The second is my full list of the top 300 endpoints called by Android apps and the companies that own them. Where did this data come from? I've been collecting this data with AppGoblin and all code is open source. I've been running ~60k apps in an Android emulator (~50k ran successfully), mapping each endpoint called by the apps while they are open for 1 minute. This means most ad networks and analytics are well represented, as they often load on start. Here is a Google Sheet with the same data. The data may be updated for correctness today, but I will not keep it updated over time. For up-to-date mobile ad data, check AppGoblin in the future. Here is the full list. Note there are some untagged endpoints here, such as the IP geo tagging ones or large app publishers that I have not tagged yet as I was unsure how to categorize them.
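As a sketch of the kind of aggregation behind such a list, here is how one might tally hostnames from captured app traffic. The URLs and the `top_endpoints` function are hypothetical illustrations, not AppGoblin's actual pipeline:

```python
from collections import Counter
from urllib.parse import urlparse

def top_endpoints(request_urls, n=3):
    """Count the hostnames contacted across captured app traffic.
    Ad/analytics endpoints that load on app start rise to the top."""
    counts = Counter(urlparse(u).hostname for u in request_urls)
    return counts.most_common(n)

# Hypothetical traffic captured during a one-minute emulator run.
logged = [
    "https://app-measurement.com/a",
    "https://graph.facebook.com/v17.0/activities",
    "https://app-measurement.com/config/app",
    "https://firebaseinstallations.googleapis.com/v1/projects",
]
print(top_endpoints(logged, n=2))
```

Run across ~50k apps, the same tally surfaces the handful of analytics hosts that dominate the top of the list.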

Jack Vanlightly 2 months ago

Understanding Apache Fluss

This is a data system internals blog post. So if you enjoyed my table formats internals blog posts, or my writing on Apache Kafka internals or Apache BookKeeper internals, you might enjoy this one. But beware, it's long and detailed. Also note that I work for Confluent, which runs Apache Flink but neither runs nor contributes to Apache Fluss. However, this post aims to be a faithful and objective description of Fluss. Apache Fluss is a table storage engine for Flink being developed by Alibaba in collaboration with Ververica. To write this blog post, I reverse engineered a high-level architecture by reading the Fluss code from the main branch (and running tests) in August 2025. This follows the same approach as my writing about Kafka, Pulsar, BookKeeper, and the table formats (Iceberg, Delta, Hudi and Paimon), as the code is always the true source of information. Unlike the rest, I have not had time to formally verify Fluss in TLA+ or Fizzbee, though I did not notice any obvious issues that are not already logged in a GitHub issue. Let's get started. We'll start with some high-level discussion in the Fluss Overview section, then get into the internals in the Fluss Cluster Core Architecture and Fluss Lakehouse Architecture sections. In simplest terms, Apache Fluss has been designed as a disaggregated table storage engine for Apache Flink. It runs as a distributed cluster of tablet and coordinator servers, though much of the logic is housed in client-side modules in Flink. Fig 1. Fluss architecture components Fluss provides three main features: Low-latency table storage: append-only tables, known as Log Tables, and keyed, mutable tables, known as Primary Key Tables (PK tables), that emit changelog streams. Tiering to lakehouse table storage: currently to Apache Paimon (Apache Iceberg coming soon). Client-side abstractions for unifying low-latency and historical table storage. Fig 2.
Fluss logical model This post uses terminology from A Conceptual Model for Storage Unification , including terms such as internal tiering, shared tiering, materialization, and client/server-side stitching. This post also uses the term “real-time” mostly to distinguish between lower-latency hot data and colder, historical data. Last year I did a 3-part deep dive into Apache Paimon (a lakehouse table format), which was born as a table storage engine for Flink (originally named Flink Table Store). Paimon advertises itself today as “ A lake format that enables building a Realtime Lakehouse Architecture ”. Now Apache Fluss is being developed, also with the aim of being the table storage engine for Flink. Both projects (Paimon and Fluss) offer append-only tables and primary key tables with changelogs. So why do we need another table storage solution for Flink?  The answer is a combination of efficiency and latency. Paimon was specifically designed to be good at streaming ingestion (compared to Iceberg/Delta), however, it is still relatively slow and costly for real-time data. Paimon is also not particularly efficient at generating changelogs, which is one of Flink’s primary use cases. This is not a criticism of Paimon, only a reality of building decentralized table stores directly on object storage. Fluss in many ways is the realization that an object-store-only table format is not enough for real-time data. Apache Fluss is a table storage service initially designed to replace or sit largely in front of Paimon, offering lower latency table storage and changelogs. In the background, Fluss offloads data to Paimon. Fluss can also rely on its own internal tiering without any offloading to Paimon at all. Having two tiering mechanisms can sometimes be confusing, but we’ll dig into that later. Fig 3. 
Fluss and Paimon forming real-time/historical table storage for Flink With the introduction of a fast tier (real-time) and a larger slower tier (historical), Flink needs a way to stitch these data stores together. Fluss provides client-side modules that provide a simple API to Flink, but which under the hood, do the work of figuring out where the cut-off exists between real-time and historical data, and stitching them together. One of Flink’s main roles is that of a changelog engine (based on changes to a materialized view). A stateful Flink job consumes one or more input streams, maintains a private materialized view (MV), and emits changes to that materialized view as a changelog (such as a Kafka topic). Fig 4. Flink as a changelog engine However, the state maintained by Flink can grow large, placing pressure on Flink state management, such as its checkpointing and recovery. A Flink job can become cumbersome and unreliable if its state grows too large. Flink 2.0 has introduced disaggregated state storage (also contributed by Alibaba) which aims to solve or at least mitigate the large state problem by offloading the state to an object store. In this way Flink 2.0 should be able to better support large state Flink jobs. In the VLDB paper Disaggregated State Management in Apache Flink® 2.0 , the authors stated that “ we observe up to 94% reduction in checkpoint duration, up to 49× faster recovery after failures or a rescaling operation, and up to 50% cost savings. ” Paimon and Fluss offer a different approach to the large state problem. Instead of only offloading state, they expose the materialized data itself as a shared table that can be accessed by other jobs. This turns what was previously private job state into a public resource, enabling new patterns such as lookup joins. Fig 5. Flink table storage project evolution Paimon was the first table storage engine for Flink. It provides append-only tables, primary key tables and changelogs for primary key tables. 
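The changelog-engine idea described above can be made concrete with a minimal sketch (plain Python, not Flink APIs): a job maintains a materialized view and emits change events for each upsert or delete. The change types mirror those used later in the post (+I insert, -U/+U update before/after, and a delete type, written -D here):

```python
def apply_upsert(view, key, value):
    """Apply one upsert/delete to a materialized view (a dict here)
    and return the change events a changelog engine would emit."""
    old = view.get(key)
    if value is None:                       # delete
        if old is None:
            return []                       # nothing to delete
        del view[key]
        return [("-D", key, old)]
    view[key] = value
    if old is None:
        return [("+I", key, value)]         # insert
    return [("-U", key, old), ("+U", key, value)]  # update before/after

mv, changelog = {}, []
for k, v in [("a", 1), ("a", 2), ("b", 7), ("a", None)]:
    changelog.extend(apply_upsert(mv, k, v))
print(mv, changelog)
```

In Flink the `view` is private job state (the source of the large-state problem); in Paimon and Fluss it becomes a shared table that other jobs can read directly.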
If you are interested in learning about Paimon internals, then I did a pretty detailed deep dive and even formally verified the protocol. While Paimon is no doubt one of the best table formats for streaming ingest, it still has limits. One of the main limitations is how it supports changelogs, which is a concern of the Paimon writer (or compactor). Maintaining a changelog may require lookups against the target Paimon table (slow) and caching results (placing memory pressure on Flink). Alternatives include generating changelogs based on full compactions, which add latency, can combine changes (losing some change events), and significantly impact compaction performance. In short, Paimon didn't nail the changelog engine job and remains higher latency than regular databases and event streams. Fluss provides these very same primitives, append-only table, primary key table and changelog, but optimized for lower latency and better efficiency than Paimon (for small real-time writes). Fluss stores the primary key table across a set of tablet servers with RocksDB as the on-server storage engine. Fluss solves Paimon's changelog issues by efficiently calculating the changes based on existing data stored in RocksDB. This is more efficient than Paimon's lookups and higher fidelity (and more efficient) than Paimon's compaction-based changelog generation. Whether Fluss is a better changelog (and state management) engine than Flink 2.0 with disaggregated state storage remains to be seen, but one really nice thing about a Fluss Primary Key table (and a Paimon PK table, for that matter) is that it turns the formerly private MV state of a Flink job into a shared resource for other Flink jobs, providing support for lookup joins where previously another Flink job would need to consume the changelog. Fig 6.
Fluss PK Table offloads MV/changelog state management from Flink and provides a shared data primitive for other Flink jobs, with lookup (lookup join) support. Append-only tables are based on an adapted copy of the Apache Kafka log replication and controller code. Where Fluss diverges is that it is a table storage engine exposing a table API, whereas Kafka gives you untyped byte streams where schemas are optional. Kafka is extremely permissive and unopinionated about the data it replicates. One of the key features missing from Kafka, if you want to use it as an append-only table storage engine, is being able to selectively read a subset of the columns or a subset of the rows as you can with a regular database. Fluss adds this capability by enforcing tabular schemas and serializing records in a columnar format (Arrow IPC). This allows clients to include projections (and in the future, filters too) in their read requests, which get pushed down to the Fluss storage layer. With enforced tabular schemas and columnar storage, Kafka log replication can be made into a simple append-only table storage solution. Fig 7. The Fluss client serializes record batches into columnar Arrow IPC batches, which are appended to segment files by the tablet servers. On the other end, the client that reads the batches converts the columnar Arrow back into records. While Fluss storage servers can push down projections to the file-system reading level, this doesn't apply to tiered segment data, where column pruning on small batches would be prohibitively expensive. As we'll see in this deep dive, Fluss doesn't just build on Kafka for append-only tables (log tables) but also for the changelogs of primary key tables and as the durability mechanism for PK Tables. With the addition of Fluss as a lower-latency table storage engine, we now have the problem of stitching together real-time and historical data.
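The projection pushdown described above can be sketched in miniature. Here a columnar record batch is modeled as a plain dict of column name to value list; real Fluss uses Arrow IPC batches, so this only shows the shape of the idea:

```python
def project_batches(batches, columns):
    """Prune unwanted columns from columnar record batches, as the
    storage layer does when a fetch request carries a projection.
    A batch is modeled as a dict of column name -> list of values."""
    for batch in batches:
        yield {col: batch[col] for col in columns}

# A "segment" of two concatenated columnar batches (toy data).
segment = [
    {"id": [1, 2], "name": ["a", "b"], "score": [0.5, 0.9]},
    {"id": [3], "name": ["c"], "score": [0.1]},
]
print(list(project_batches(segment, ["id", "score"])))
```

The saving is in network IO: only the projected columns leave the server, while the reader still touches each batch's metadata on disk.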
As I described in my Conceptual Model for Storage Unification , it comes down to two main points: API abstraction . Stitching together the different physical storage into one logical model. Physical storage management (tiering, materialization, lifecycle management). Fig 8. Fluss real-time and historical data Fluss places the stitching and conversion logic client-side, with interactions with the Fluss coordinators to discover the necessary metadata.  The physical storage management is split up between a trifecta of: Fluss storage servers (known as Tablet servers) Fluss coordinators (based on the ZooKeeper Kafka controller) Fluss clients (where conversion logic is housed) Most storage management involves tiering and backup: Internal tiering is performed by storage servers themselves, much like tiered storage in Kafka. Internal tiering allows for the size of Fluss-cluster-hosted log tables and PK table changelogs to exceed the storage drive capacity of the storage servers. RocksDB state is not tiered and must fit on disk. RocksDB snapshots are taken periodically and stored in object storage, for both recovery and a source of historical reads for clients (but this is not tiering). Lakehouse tiering is run as a Flink job, using Fluss libraries for reading from Fluss storage and writing to Paimon. Fluss coordinators are responsible for managing tiering state and assigning tiering work to tiering jobs. Schema evolution, a critical part of storage management, is on the roadmap, so Fluss may not be ready for most production use cases quite yet. With that introduction, let’s dive into the internals to see how Fluss works. We’ll start by focusing on the Fluss cluster core architecture and then expand to include the lakehouse integration. Fluss has three main components: Tablet servers , which form the real-time storage component. Coordinator servers , which are similar to KRaft in Kafka, storing not only general metadata but acting as a coordination layer. 
Fluss clients, which present a read/write API to Flink, and do some of the unification work to meld historical and real-time data. The clients interact with both the Fluss coordinators and Fluss tablet servers, as well as reading directly from object storage. Fig 9. Fluss core architecture (without lakehouse or Flink) A Fluss Log Table is divided into two logical levels that define how data is sharded: At the top level, the table is divided into partitions , which are logical groupings of data by partition columns, similar to partitions in Paimon (and other table formats). A partition column could be a date column, or a country, etc. Within each partition, data is further subdivided into table buckets , which are the units of parallelism for writes and reads; each table bucket is an independent append-only stream and corresponds to a Flink source split (discussed later). This is also how Paimon divides the data within a partition, aligning the two logical models closely. Fig 10. Log table sharding scheme. Each table bucket is physically stored as a replicated log tablet by the Fluss cluster . A log tablet is the equivalent of a Kafka topic partition, and its code is based on an adapted copy of Kafka and its replication protocol. When a Fluss client appends to a Log Table, it must identify the right table partition and table bucket for any given record. Table partition selection is based on the partition column(s), while bucket selection is based on a similar approach to Kafka producers, using schemes such as round-robin, sticky, or hashes of a bucket key. Fig 11. Table partition and table bucket selection when writing. Fluss clients write to table buckets according to the partition column(s) to choose the table partition, then a bucket scheme, such as round-robin, sticky, or hash-based (like Kafka producers choosing partitions). Each log tablet is a replicated append-only log of records, built on the Kafka replication protocol.
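The two-level routing just described can be sketched as follows, with a hash-based bucket scheme and a simple round-robin fallback. Function names are illustrative, not Fluss client APIs:

```python
import zlib

def choose_partition(record, partition_cols):
    """A table partition is identified by the partition column values
    (e.g. a date or country column)."""
    return tuple(record[c] for c in partition_cols)

def choose_bucket(record, num_buckets, bucket_key=None, _rr=[0]):
    """Pick a table bucket: hash of the bucket key when one is set,
    otherwise simple round-robin (echoing Kafka producer schemes).
    The mutable default _rr holds round-robin state for the sketch."""
    if bucket_key is not None:
        return zlib.crc32(str(record[bucket_key]).encode()) % num_buckets
    _rr[0] += 1
    return _rr[0] % num_buckets

rec = {"dt": "2025-08-10", "user": "u42", "amount": 3}
print(choose_partition(rec, ["dt"]), choose_bucket(rec, 4, bucket_key="user"))
```

Hash-based bucketing keeps all writes for one key in one bucket (and so in one log tablet), which is what makes per-key ordering and PK-table lookups work.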
It is the equivalent of a Kafka topic partition. So by default, each log tablet has three replicas, with one leader and two followers. Fig 12. Log Tablet replication is based on Kafka partition replication. It has the concept of the high watermark and the ISR, though the recent work of hardening the protocol against simultaneous correlated failures is not included (as it is based on the old ZooKeeper controller, not KRaft). Fluss optionally stores log table batches in a columnar format to allow for projection pushdown to the filesystem level, as well as obtaining the other benefits of Arrow. Fluss clients accumulate records into batches and then serialize each batch into Arrow vectors, using Arrow IPC. Log tablet replicas append these Arrow IPC record batches to log segment files on disk (as-is). These Arrow record batches are self-describing, with metadata that allows the file reader to read only the requested columns from disk (when a projection is included in a fetch request). Projection pushdown consists of a client including a column projection in its fetch requests, and that projection getting pushed all the way down to the file storage level, which prunes unwanted columns while reading data from disk. This avoids network IO, but may come with additional storage IO overhead if the columns of the projection are widely fragmented across a file. Storing log table data as a sequence of concatenated Arrow IPC record batches is quite different to using a single Parquet file as a segment file. A segment file with 100 concatenated Arrow IPC record batches stores each batch as its own self-contained columnar block, so reading a single column across the file requires touching every batch's metadata and buffers, whereas a Parquet file lays out columns contiguously across the entire file, allowing direct, bulk column reads with projection pushdown.
This adds some file IO overhead compared to Parquet and makes pruning columns in tiered log segments impractical and expensive. But serializing batches into Parquet files on the client is also not a great choice, so this approach of Arrow IPC files is a middle ground. It may be possible in the future to use compaction to rewrite segment files into Parquet for more optimal columnar access of tiered segment data. On the consumption side, Fluss clients convert the Arrow vectors back into rows. This columnar storage is entirely abstracted away from application code, which is record (row) based. Each log record has a change type, which can be used by Flink and Paimon in streaming jobs/streaming ingestion: +I (Insert), +U (Update After), -U (Update Before), and -D (Delete). For Log Tables, the Fluss client assigns each record the +A (append) change type. The remaining change types are used for the changelogs of PK tables, which are explained in the PK Table section. Fluss has two types of tiering: Internal Tiering (akin to traditional tiered storage), which is covered in this section. Lakehouse Tiering , which is covered in the next major section, Fluss Lakehouse Architecture. Tablet servers internally tier log tablet segment files to object storage, which matches the general approach of Kafka. The difference is that when a Fluss client fetches from an offset that is tiered, the log tablet returns only the metadata of a set of tiered log segments, so the Fluss client can download those segments itself. This offloads work from Fluss servers during catch-up reads. Fig 13. Log tablet with internal segment tiering and replication. Fluss made the decision to place some of the logic on the client for melding local data on tablet server disks with internally tiered data on object storage. This diverges from the Kafka API, which does not support this, thus placing the burden of downloading tiered segments on the Kafka brokers. The actual tiering work is performed as follows.
Each tablet server has a RemoteLogManager which is responsible for tiering segments to remote storage via log tiering tasks. This RemoteLogManager can only trigger tiering tasks for log tablets that have their leader replicas on this server. Each tiering task works as follows: A task is scoped to a single log tablet, and identifies local log segment files to upload (and remote ones to expire based on TTL). It uploads the target segment files to object storage. It commits them by: creating a new manifest file with the metadata of all current remote segments, writing the manifest file to object storage, and sending a CommitRemoteLogManifestRequest to the coordinator, which contains the path to the manifest file in remote storage (the coordinator does not store the manifest itself). Once committed, expired segment files are deleted in object storage. Asynchronously, the coordinator notifies the log tablet replica via a NotifyRemoteLogOffsetsRequest so the replica knows: The offset range that is tiered (so it knows when to serve tiered metadata to the client, and when to read from disk). What local segments can be deleted. As I mentioned earlier, because clients download tiered segment files, the network IO benefits of columnar storage are limited to the hottest data stored on Fluss tablet servers. Even if a client doing a full table scan only needs one column, it must still download entire log segment files. There is no way around this except to use a different columnar storage format than Arrow IPC with N record batches per segment file. It's worth understanding a little about Flink's Datastream Source API, in order to understand how Flink reads from Fluss. Splits : Flink distributes the work of reading into splits, which are independent chunks of input (e.g. a file or topic partition). Split enumeration : A split enumerator runs on the JobManager and discovers source inputs, generating corresponding splits, and assigning them to reader tasks.
Readers : Each TaskManager runs a source reader, which reads the sources described by its assigned splits and emits records to downstream tasks. This design cleanly separates discovery & coordination (enumerator) from data reading (readers), while keeping splits small and resumable for fault tolerance. When Flink uses Fluss as a source, a Fluss split enumerator runs on the JobManager to discover and assign splits (which describe table buckets). Each TaskManager hosts a source reader, which uses a split reader to fetch records from its assigned table bucket splits and emit them downstream. Fig 14. Fluss integration of Log Table via Flink Source API (without lakehouse integration). In this way, a Log Table source parallelizes the reading of all table buckets of the table, emitting the records to the next operators in the DAG. While log tablets are built on an adapted version of Kafka replication, there are some notable differences: Fluss uses two levels of sharding :  Table partitions, via a partition column(s). Multiple table buckets per partition. Obligatory tabular schemas : A Fluss table must have a flat table schema with primitive types (structs, arrays, maps on the roadmap). Columnar storage: Allowing for projection pushdown (which also depends on a schema). Tiered storage (internal tiering) : Clients download tiered segments (Tablet servers only serve tiered segment metadata to clients). Fluss has no automatic consumer group protocol . This role is performed by Flink assigning splits to readers. A primary key table is also organized into two logical levels of partitions and table buckets. Clients write data to specific partitions and buckets as described in the log tablet section. However, the Primary Key table has a different API as it is a mutable table: Writes:  Upserts and Deletes to the table via a PutKV API. Lookups and Prefix Lookups against the table. 
Changelog scans (which can involve a hybrid read of KV snapshot files plus changelog, discussed later). Each table bucket of a PK table is backed by a KV Tablet, which emits changes to a child log tablet. KV Tablet state is composed of: A RocksDB table for storing the keyed table state. A Log Tablet for storing the changelog. This inner log tablet also acts as a write-ahead log (WAL), as described in the Writing to a KV Tablet subsection. Fig 15. A KV tablet and its replicated child log tablet for the changelog. Unlike log tablets, KV tablets do not store data in Arrow format as data is stored in RocksDB. However, the child log tablet uses Arrow as normal. There are a few other notable differences to Log Tablets in how data is written, tiered and read, which is covered in the next subsections. The PutKV API accepts a KV batch which contains a set of key-value records. When the value is not null, Fluss treats this as an upsert, and when the value is null, it treats it as a delete. The KV tablet performs the write as follows: For each record in the batch: Perform a read against the RocksDB table to determine the change to be emitted to the changelog.  A change record could be: a DELETE record with the old row, if the write contains a null value. an UPDATE_BEFORE record with the original record and an UPDATE_AFTER record with the new record, if a record exists and the write has a non-null value. an INSERT with the new record, if no existing record exists. Buffer both the new changelog records and the RocksDB write in memory for now. Append all buffered changelog records (as an Arrow IPC record batch) to the child log tablet and wait for the batch to get committed (based on Kafka replication protocol). Once the change records are committed, perform the buffered RocksDB writes. The new records overwrite the existing records by default, but read about merge engines and partial updates in the next subsection. 
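The write path above can be sketched as follows, with a dict standing in for RocksDB and a list for the replicated changelog/WAL. The ordering mirrors the protocol described in the post: change records are appended to the changelog (and committed via replication) before the buffered KV writes are applied. Batch-internal read-your-writes semantics are simplified away here:

```python
def put_kv(table, changelog, batch):
    """One PutKV batch: read old values, derive change records,
    append them to the changelog/WAL first, then apply KV writes."""
    changes, kv_writes = [], []
    for key, value in batch:
        old = table.get(key)
        if value is None:                    # null value => delete
            if old is not None:
                changes.append(("-D", key, old))
                kv_writes.append((key, None))
        elif old is None:                    # no existing record => insert
            changes.append(("+I", key, value))
            kv_writes.append((key, value))
        else:                                # existing record => update pair
            changes += [("-U", key, old), ("+U", key, value)]
            kv_writes.append((key, value))
    changelog.extend(changes)                # commit to WAL first (durability)
    for key, value in kv_writes:             # then apply to the KV store
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return changes

tbl, wal = {}, []
put_kv(tbl, wal, [("k1", "v1")])
put_kv(tbl, wal, [("k1", "v2"), ("k2", "v9")])
print(tbl, wal)
```

Because the changelog is committed before RocksDB is touched, replaying it from the last snapshot offset is enough to rebuild the KV state after a disk loss, as described in the Durability discussion below.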
Durability : A log tablet is like a Kafka partition: it is replicated across multiple servers, with one leader and multiple followers. The KV tablet itself is unreplicated, though it does regularly upload RocksDB snapshot files to object storage. Therefore, the changelog acts as a replicated write-ahead log (WAL) for the KV tablet. If the server disk were to die, the state could be recovered by downloading the latest KV tablet snapshot file and replaying the changelog from the corresponding next offset (the last offset is stored with the snapshot). Interestingly, the KV tablet leader has no followers, and moves with the log tablet leader. When a leader election occurs in the log tablet, the KV tablet leader changes with it. The new KV tablet leader (on the tablet server of the new log tablet leader) must download the latest RocksDB snapshot file and replay the changelog to recover the state of the former KV leader. This means that large RocksDB state could be an issue for availability due to the large amount of state needing to be downloaded and replayed on each leader election. This design may change in the future. By default, PK tables have no merge engine. The new row overwrites the old row (via the DefaultRowMerger class) as described in the last subsection. But Fluss supports using the merge types FIRST_ROW and VERSIONED in PK tables. Each merge operation has an old row and new row (either of which could be null). Merge types: None: New row replaces old row. FIRST_ROW: Keep the old row, unless it is null, in which case take the new row. VERSIONED: Take whichever row has the highest version (the new row version is supplied by the client). The Flink source for Fluss can use FIRST_ROW, but VERSIONED doesn't seem to be used anywhere yet. Partial update : An update does not need to include all columns, allowing for partial updates. Partial updates cannot be combined with merge types.
If a write only contains a subset of the columns, the PartialUpdateRowMerger class is used instead of the DefaultRowMerger class. It goes column by column, taking each column from the new row if present, else from the old row, to create a merged row. The keyed state in RocksDB cannot be tiered; it must fit entirely on disk. Snapshots are taken and stored in object storage, but they serve backup/recovery and historical reads, not tiering. Therefore, the key cardinality should not be excessive. The changelog is a log tablet, and is tiered as described in the log table tiering subsection. Fig 16. KV Tablet (and child Log Tablet) read, write, tiering and replication. A Fluss client can send lookup and prefix lookup requests to the KV tablet leader, which get translated into RocksDB lookups and prefix lookups. It is also possible to scan the changelog, and there are different options for the starting offset here: Earliest offset : Read the entire change stream from start to finish, assuming infinite retention of the log tablet. Latest offset : Read only the latest changes from the log tablet (but will miss historical data). Full : Bootstrap from a table RocksDB snapshot then switch over to reading from the log tablet. When using the full mode, the process works as follows: The Fluss client contacts the coordinator server to find out the KV snapshot files, along with the changelog offset they correspond to. The Fluss client downloads the RocksDB snapshot files and initializes a RocksDB instance based on the files. The Fluss client iterates over the records of the RocksDB table, treating each record as an insert (+I) row kind. The Fluss client switches to reading from the log tablet directly, starting from the next offset after the snapshot. This follows the same logic as with a log tablet, where the client may not receive actual batches, but metadata of tiered log segments.
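The merge behaviors described earlier (DefaultRowMerger, FIRST_ROW, VERSIONED, and the PartialUpdateRowMerger) can be sketched with rows as dicts. This illustrates the semantics only; the function names and row representation are not the actual Fluss classes:

```python
def merge_rows(old, new, merge_type="NONE", old_ver=None, new_ver=None):
    """Merge-type semantics: NONE = new row wins (DefaultRowMerger),
    FIRST_ROW keeps the first non-null row, VERSIONED keeps the row
    with the higher client-supplied version."""
    if merge_type == "FIRST_ROW":
        return old if old is not None else new
    if merge_type == "VERSIONED":
        if old is None:
            return new
        return new if new_ver >= old_ver else old
    return new                               # NONE: new row replaces old

def partial_update(old, new):
    """PartialUpdateRowMerger semantics: per column, take the new
    value when present, else keep the old one (rows are dicts; a
    missing/None column means 'not included in this write')."""
    if old is None:
        return dict(new)
    merged = dict(old)
    for col, val in new.items():
        if val is not None:
            merged[col] = val
    return merged

print(merge_rows({"v": 1}, {"v": 2}, "FIRST_ROW"))
print(partial_update({"a": 1, "b": 2}, {"b": 9}))
```

Whatever the merge result is becomes both the new RocksDB value and the basis for the emitted -U/+U changelog pair.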
Of course, this is usually within the Flink source API architecture of split enumeration and split readers. Fig 17. Fluss integration with the Flink Source API. The differences between the log table and primary key table sources are highlighted in green. While Flink provides the HybridSource abstraction that allows for reading from a bounded source until completion, then switching to an unbounded source, Fluss has chosen to implement this within the Split abstraction itself: For table buckets that are PK-based, the split enumerator also requests the metadata of the KV snapshots from the coordinator. It creates hybrid splits which contain the snapshot and log tablet metadata. For hybrid splits, each split reader first loads the bounded snapshot into a RocksDB instance and processes the records, then switches to the unbounded log tablet (the changelog). Schema evolution is listed on the roadmap. Fluss coordinators play a key role in the following functions:
Regular cluster metadata
Internal tiering metadata and coordination
Lakehouse tiering coordination (see next section)
Serving metadata directly to clients (for client-side stitching)
Next we’ll see how the Fluss core architecture is extended to include the lakehouse. Before we begin, let’s define some terms to make the discussion more precise. These are terms that only exist in this post to make explaining things easier (the Fluss project does not define these abstractions):
Logical table (log table, primary key table). Partitioned and bucketed.
Fluss table . Served by a Fluss cluster. Stored as on-disk and internally tiered segment/snapshot files.
Lakehouse table . A Paimon table (Iceberg coming soon), fed by lakehouse tiering.
Fig 18. Logical table maps onto a Fluss table (hosted by a Fluss cluster) and, optionally, a Lakehouse table. Fluss is a set of client-side and server-side components, with lakehouse integration being predominantly client-side.
We can break up lakehouse integration into the following pieces (which are somewhat interrelated):
Lakehouse tiering : Copying from Fluss tables to Lakehouse tables.
Storage unification : Unifying historical lakehouse data with real-time Fluss tablet server data.
A logical table could map to only a Fluss table, or it could map to a combination of a Fluss table and a Lakehouse table. It is possible for there to be overlap (and therefore duplication) between the Fluss and Lakehouse tables. As with internal tiering, only Log Tables and the changelogs of Primary Key Tables can be tiered to a lakehouse. Paimon itself converts the changelog stream back into a primary key table. Right now Apache Paimon is the main open table format (OTF) that is supported, due to its superior stream ingestion support, but Apache Iceberg integration is on the way. Lakehouse tiering is driven by one or more Flink jobs that use client-side Fluss libraries to:
Learn which tables must be tiered (via the Fluss coordinators).
Do the tiering (read from the Fluss table, write to the Lakehouse table).
Notify the coordinators of tiering task completion or failure, and report the lakehouse snapshot metadata and how it maps to Fluss table offsets.
Fig 19. Flink-based lakehouse tiering components. A table tiering task is initiated by a lakehouse tiering process (run in Flink) sending a heartbeat to the coordinator, telling it that it needs work. The coordinator will provide it with a table that needs tiering, including metadata about the current lakehouse snapshot and the log offset of every table bucket that the snapshot corresponds to. The tiering task will start reading each table bucket from these offsets. Fig 20. Internal topology of the lakehouse tiering Flink job. The tiering read process reads (log scan) from the Fluss table and is essentially the same as described in the Log Table and Primary Key Table subsections of the Fluss Core Architecture section.
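The heartbeat-driven task assignment described above can be sketched as follows. This is a toy model under invented names (the real protocol lives in the Fluss coordinator and tiering service): the coordinator tracks, per table, the last offsets committed to the lakehouse, and hands them out as the start offsets for the next tiering round.

```python
class Coordinator:
    """Toy stand-in for the Fluss coordinator's tiering-assignment role."""

    def __init__(self, tiered_offsets):
        # table -> {bucket: offset the lakehouse snapshot corresponds to}
        self.tiered_offsets = tiered_offsets
        self.pending = list(tiered_offsets)  # tables awaiting a tiering round

    def heartbeat(self):
        """A tiering job asks for work; return a task, or None if none is due.

        The task carries the per-bucket start offsets, so the job resumes
        exactly where the last committed lakehouse snapshot left off."""
        if not self.pending:
            return None
        table = self.pending.pop(0)
        return {"table": table,
                "start_offsets": dict(self.tiered_offsets[table])}
```

A tiering job would loop: send a heartbeat, read each bucket from the returned offsets, write to the lakehouse, then report completion (the commit side is sketched further below in the article's own description).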
For log tables, the table buckets are assigned as splits to the various source readers, which use the Fluss client to read the log tablet of each table bucket. Under the hood, the Fluss client fetches from tablet servers (the leader replica of each log tablet), which may return data payloads or tiered log segment metadata to the client. For primary key tables, the tiering job scans the changelog, generating hybrid splits and assigning them to the readers. The Fluss client in each reader will start by downloading the KV snapshot files (RocksDB) and iterate over the RocksDB table, and then will switch to fetching from the changelog log tablet. The tiering write process involves writing the records as data files to the lakehouse (using Paimon/Iceberg libraries). The specifics are not important here, but the write phase requires an atomic commit, and so newly written but not committed files are not part of the table yet. The tiering commit process involves the Paimon commit but also updates the Fluss coordinator with the committed lakehouse snapshot metadata and the last tiered offset of each table bucket of the Fluss table. Basically, we need coordination to ensure that tiering doesn’t skip or duplicate data, and that clients can know the switch over point from lakehouse to Fluss cluster. The coordinators in turn notify the tablet servers of the lakehouse offset of each table bucket, which is a key part of the storage unification story, as we’ll discuss soon. The Flink topology itself is pretty standard for writing to a lakehouse, with multiple reader tasks emitting records for multiple Paimon writer tasks, with a single Paimon committer task that commits the writes serially to avoid commit conflicts. There is one writer per Paimon bucket, and it seems plausible that the Paimon table partitioning and bucketing will match Fluss (though I didn’t get around to confirming that). 
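The read-write-commit cycle above can be condensed into a sketch that shows why the coordination prevents skipped or duplicated data: offsets are only reported back after the (atomic) lakehouse commit, and the next round resumes from them. All names are invented for illustration; lists and dicts stand in for log tablets, the lakehouse, and coordinator state.

```python
def tiering_round(committed_offsets, bucket_logs, lake_files):
    """Run one tiering round.

    committed_offsets: bucket -> next offset to tier (coordinator state).
    bucket_logs: bucket -> list of records (stand-in for Fluss log tablets).
    lake_files: list collecting committed lakehouse 'data files'."""
    staged = []
    new_offsets = {}
    for bucket, log in bucket_logs.items():
        start = committed_offsets.get(bucket, 0)
        staged.append((bucket, log[start:]))   # write phase: not yet committed
        new_offsets[bucket] = len(log)
    lake_files.append(staged)                  # stand-in for the atomic commit
    committed_offsets.update(new_offsets)      # report offsets to coordinator
    return new_offsets
```

Because `committed_offsets` only advances after the commit, a crashed round is simply re-run from the old offsets; a completed round never re-reads data it already tiered.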
Evolving Fluss and Paimon tables in unison will be required, but schema evolution is not implemented yet. Each logical table has a simple state machine that governs when it can be tiered. Fig 21. Table tiering job state transitions, managed by Fluss coordinators. The key to storage unification in Fluss is the compatibility of the different storage formats, and the work of the client to stitch data from different storage tiers and formats together transparently under a simple API. This job of stitching the different storage together is shared across the Fluss client and the Fluss-Flink module. Fig 22. Fluss uses client-side stitching at two layers of the client-side abstractions. So far I’ve avoided tackling this question. Fluss makes for an interesting case study. It uses two types of tiering:
Tiering of Fluss storage (log segments) to object storage, performed by tablet servers. This is classic internal tiering.
Tiering of Fluss storage to the lakehouse (Paimon), performed by Flink using Fluss client-side modules and coordinated by Fluss coordinators.
In my first pass I equated lakehouse tiering to shared tiering as described in my post A Conceptual Model for Storage Unification . In that post, I expressed reservations about the risks and complexity of shared tiering in general. My reservations come from the complexity of using a secondary storage format to store the majority of the primary system’s data, and from the performance problems of not being able to optimize the secondary storage for both the primary and secondary systems. But I soon realized that Fluss lakehouse tiering is another form of internal tiering, as both storage tiers serve the same analytical workload (and the same compute engine). A Fluss cluster is akin to a durable distributed cache in front of Paimon, where Flink reads/writes hot data from/to this durable distributed cache and reads cold data from Paimon. Together, both storage types form the primary storage.
The Paimon table only needs to support analytical workloads and thus can make full use of the various table organization options in Paimon without penalizing another system. Additionally, the logical data model of the Fluss API is extremely close to both Flink and Paimon (by design), which reduces the costs and risks associated with the conversion of formats. Fig 23. Fluss cluster storage and Paimon storage form the primary storage for the primary system (Flink). This is very different to a Kafka architecture that might tier to a lakehouse format. Kafka serves event streams, largely in event-driven architectures, and Paimon serves as an analytics table. The two workloads are completely different (sequential vs analytical). Having Kafka place the majority of its data in a lakehouse table would present many of the issues of shared tiering discussed in my storage unification blog post, namely conversion risk and performance constraints due to using the same storage for completely different access patterns. Likewise, Kafka is unopinionated regarding payloads, with many being schemaless, or using any one of Avro, Protobuf, JSON and JSON Schema. Payloads can be arbitrarily nested in any of the above formats, complicating the bidirectional conversions between lakehouse formats and Kafka payloads. As I noted in the last subsection, lakehouse tiering is a form of internal tiering, as the lakehouse tier serves the same workload and compute engine. Of course, the Paimon table could be read by other systems too, so there is still room for some conflicting requirements around data organization, but far less than between a purely sequential workload (like Kafka) and an analytical workload. However, curiously, Fluss now has two types of internal tiering:
Tiering of log segment files (as-is). This is the direct-access form of tiering discussed in my conceptual model post.
Tiering of log data (in its logical form), which is the API-access form of tiering.
The question is: why does Fluss need both? And why are they not lifecycle-linked? Fluss table storage is expired based on a TTL, and each lakehouse format provides TTL functionality too. But data is not expired in Fluss storage once it has been copied to the lakehouse. Here are some thoughts: First of all, I am sure linking the lifecycles of Fluss table and lakehouse table data will eventually be implemented, as it looks pretty trivial. It might be more reliable to keep internal log segment tiering, despite also using lakehouse tiering, should the lakehouse become unavailable for a period. For example, without internal tiering acting as an efficient spill-over safety mechanism, the write availability of Fluss would be tightly linked to the availability of the lakehouse catalog. Without lifecycle-linked tiers, lakehouse tiering actually has more in common with materialization (data copy, not move). This is not a bad thing; there are cases where it might be advantageous to keep the lifecycle of tiered log segments separate from that of the lakehouse. For example, some Fluss clients may wish to consume Log Tables in a purely sequential fashion and might benefit from reading tiered log segments rather than a Paimon table. Fluss also places Kafka API support on the roadmap, which makes maintaining internal tiering and lakehouse tiering as separately lifecycle-managed processes compelling (as it avoids the issues of conversion and performance optimization constraints). Kafka clients could be served from internally tiered log segments, and Flink could continue to merge both Fluss table and lakehouse table data, thereby avoiding the pitfalls of shared tiering (as the lakehouse still only has to support one master). This is another case where lakehouse tiering starts looking more like materialization, when using Fluss as a Kafka-compatible system. It can be confusing having two types of tiering, but it also allows users to configure Fluss to support different workloads.
If Fluss is only a lakehouse real-time layer, then it might make sense to only use lakehouse tiering. However, if Fluss needs to support different access patterns over the same data, then it makes sense to use both (using the materialization approach). Apache Fluss can be seen as a realization that Apache Paimon, while well-suited to streaming ingestion and materialized views, is not sufficient as the sole table storage engine for Flink. Paimon remains a strong option for large-scale, object-store-backed tables, but it falls short when low-latency stream processing requires efficient changelogs and high-throughput, small writes to table storage. Fluss is designed to fill this gap. It provides the low-latency tier with append-only and primary key tables, supports efficient changelog generation, and integrates with Paimon through tiering. Client-side modules in Flink then stitch these tiers together, giving a unified view of real-time and historical data. Fluss aims to broaden its support to include other analytics engines such as Apache Spark and other table formats such as Iceberg, making it a more generic real-time layer for lakehouses. It should probably eventually be seen as an extension to Paimon (et al) more than as a table storage engine made specially for Flink. What stands out about Fluss is how it shifts between tiering and materialization depending on its role. As a real-time layer in front of a lakehouse, it functions more like a tiering system, since both the Fluss cluster and the lakehouse serve the same analytical workload. If it eventually extends to support the Kafka API as an event streaming system, it would resemble a materialization approach instead. Fluss still faces several adoption challenges. Schema evolution is not yet supported, and lifecycle management remains limited to simple TTL-based policies rather than being tied to lakehouse tiering progress.
Its replication-based design also inherits the same networking cost concerns that Kafka faces in cloud environments. Flink 2.0 disaggregated state storage solves the large state problem without the networking costs, but the state remains private to the Flink job. Fluss has placed direct-to-object-storage writes on its roadmap and so should eventually address this problem for workloads with high networking costs. Finally, given that Fluss copies heavily from Kafka for its log tablet storage, it raises questions for the Apache Kafka community as well. Features such as columnar storage, projection pushdown, and stronger integration of schemas are all central to how Fluss turns a (sharded) log into an append-only table. While Kafka has traditionally remained unopinionated about payloads, there are benefits to adding schema-aware storage. It may be worth considering whether some of these ideas have a place in Kafka’s future too.

tonsky.me 3 months ago

Talk: Why can’t computers count? @ Podlodka

How computers represent numbers – from int and float to NaN, BigInt, decimals and complex numbers. We walked through the whole numeric zoo: discussed why the different types exist, where they let you down, and why 0.1 + 0.2 ≠ 0.3 is not a bug but a feature.
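The 0.1 + 0.2 ≠ 0.3 point is easy to reproduce in any language with IEEE 754 doubles; a quick Python illustration:

```python
# 0.1 and 0.2 have no exact binary (base-2) representation, so IEEE 754
# doubles store the nearest representable values; their sum lands just
# above 0.3, which is why exact equality fails.
a = 0.1 + 0.2
print(a)          # 0.30000000000000004
print(a == 0.3)   # False

# Comparing with a tolerance, or computing in decimal, avoids the surprise.
import math
from decimal import Decimal
print(math.isclose(a, 0.3))              # True
print(Decimal("0.1") + Decimal("0.2"))   # 0.3
```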

Jack Vanlightly 3 months ago

A Conceptual Model for Storage Unification

Object storage is taking over more of the data stack, but low-latency systems still need separate hot-data storage. Storage unification is about presenting these heterogeneous storage systems and formats as one coherent resource. Not one storage system and storage format to rule them all, but virtualizing them into a single logical view. The primary use case for this unification is stitching real-time and historical data together under one abstraction. We see such unification in various data systems:
Tiered storage in event streaming systems such as Apache Kafka and Pulsar
HTAP databases such as SingleStore and TiDB
Real-time analytics databases such as Apache Pinot, Druid and ClickHouse
The next frontier in this unification is lakehouses, where real-time data is combined with historical lakehouse data. Over time we will see greater and greater lakehouse integration with lower-latency data systems. In this post, I create a high-level conceptual framework for understanding the different building blocks that data systems can use for storage unification, and the kinds of trade-offs involved. I’ll cover seven key considerations when evaluating design approaches. I’m doing this because I want to talk in the future about how different real-world systems do storage unification, and I want to use a common set of terms that I will define in this post. From my opening paragraph, the word “virtualizing” may jump out at you, and that is where we’ll start. I posit that the primary concept behind storage unification is virtualization . Virtualization in software refers to the creation of an abstraction layer that separates logical resources from their physical implementation. The abstraction may allow one physical resource to appear as multiple logical resources, or multiple physical resources to appear as a single logical resource. We could use the terms storage virtualization and data virtualization, though personally I find the distinction too nuanced for this post.
I will use the term data virtualization. A virtualization layer can present a simple, unified API that stitches together different physical storage systems and formats behind the scenes. For example, the data may exist across filesystems with a row-based format and object storage in a columnar format, but the application layer sees one unified logical model. Data virtualization is the combination of:
Frontend abstraction : Stitching together the different physical storage into one logical model.
Backend work : Physical storage management (tiering, materialization, lifecycle management).
Let’s dig into each in some more detail. How data is written and managed across these different storage mediums and formats is a key part of the data virtualization puzzle. This management includes:
Data organization / format
Data tiering
Data materialization
Data lifecycle
“Data organization” is about how data is optimized for specific access patterns. Data could be stored in a row-based format, a graph-based format, a columnar format, and so on. Sometimes we might choose to store the same data in multiple formats in order to efficiently serve different query semantics (lookup, graph, analytics, changelogs). That is, we balance trade-offs between access semantics and the cost of writes, the cost of storage and the cost of reads. Trade-off optimization is a key part of system design. “Tiering” is about moving data from one storage tier (and possibly storage format) to another, such that both tiers are readable by the source system and data is only durably stored in one tier. While the system may use caches, only one tier is the source of truth for any given data item. Usually, storage cost is the main driver. In the next subsection, I’ll describe how there are two types of tiering: Internal vs Shared.
“Materialization” is about making data of a primary system available to a secondary system, by copying data from the primary storage system (and format) to the secondary storage system, such that both data copies are maintained (albeit with different formats). The second copy is not readable from the source system, as its purpose is to feed another data system. Copying data to a lakehouse for access by various analytics engines is a prime example. “Data lifecycle management” governs concerns such as data lifetime and data compatibility . Tiering implies lifecycle-linked storage tiers where the canonical data is deleted in the source tier once copied to the new tier (move semantics). Materialization implies much weaker lifecycle management, with no corresponding deletion after copy. Data can be stored with or without copies, in different storage systems and different formats, but the logical schema of the data may evolve over time. Therefore compatibility is a major lifecycle management concern, not only across storage formats but also across time. We can classify tiering into two types:
Internal Tiering is data tiering where only the primary data system (or its clients) can access the various storage tiers. For example, Kafka tiered storage is internal tiering. These internal storage tiers as a whole form the primary storage .
Shared Tiering is data tiering where one or more data tiers are shared between multiple systems. The result is a tiering-materialization hybrid, serving both purposes. Tiering to a lakehouse is an example of shared tiering.
Internal tiering is the classic tiering that we all know today. The emergence of lakehouse tiering is relatively new, and it is a form of shared tiering. But with sharing comes shared responsibility. Once tiered data is in shared storage, it serves as the canonical data source for multiple systems, which requires an extra layer of discipline, control and coordination.
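The move-vs-copy distinction between tiering and materialization can be shown in a toy sketch. The dicts below stand in for whole storage systems, and the names are purely illustrative; the point is only the semantics: tiering removes the source copy once the data lands in the new tier, while materialization leaves the source untouched.

```python
def tier(source_tier, target_tier, key):
    """Move semantics: after tiering, the target tier holds the only
    durable copy of the data item (the source copy is deleted)."""
    target_tier[key] = source_tier.pop(key)

def materialize(primary, secondary, key):
    """Copy semantics: both the primary and the secondary keep the data,
    so their lifecycles can diverge."""
    secondary[key] = primary[key]
```

After `tier(hot, cold, "seg1")`, the item exists only in `cold`; after `materialize(primary, secondary, "t")`, it exists in both, which is exactly why materialization implies weaker lifecycle management.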
Shared tiering is a kind of hybrid between internal tiering and materialization. It serves two purposes:
Tiering : Store historical data in cheaper storage.
Materialization : Make the data available to a secondary system, using the secondary system’s data format.
A key aspect of shared tiering is the need for bidirectional lossless conversion between “primary protocol + format” and “secondary protocol + format”. Getting this part right is critical, as the majority of the primary system’s data will be stored in the secondary system’s format. The main driver for shared tiering is avoiding data duplication to minimize storage costs. But while storage costs can be lowered, shared tiering also comes with some challenges due to it serving two purposes:
Lifecycle management . Unlike materialization, the data written to the secondary system remains tied to the primary. They are lifecycle-linked, and therefore data management should remain with the primary system. Imagine shared tiering where a Kafka topic operated by a Kafka vendor tiers to a Delta table managed by Databricks. Who controls the retention policy of the Kafka topic? How do we ensure that table maintenance by Databricks doesn’t break the ability of the tiering logic to read back historical data?
Schema management . Schema evolution of the primary format (such as Avro or Protobuf) may not match the schema evolution of the secondary format (such as Apache Iceberg). Therefore, great care must be taken when converting from the primary to the secondary format, and back again. With multiple primary formats and multiple secondary formats, plus changing schemas over time, the work of ensuring long-term bidirectional compatibility should not be underestimated.
Exposing data . Tiering requires bidirectional lossless conversion between the primary and secondary formats. When tiering to an Iceberg table, we must include all the metadata of the source system, headers, and all data fields.
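The bidirectional lossless conversion requirement can be stated as a round-trip property: converting a record to the secondary format and back must reproduce it exactly, including metadata such as headers. The converters below are trivial stand-ins (a field rename), invented purely to illustrate the check; a real pair would map between, say, Kafka records and Iceberg rows.

```python
def to_secondary(record):
    """Stand-in for primary -> secondary conversion (e.g. stream record
    to table row). A real converter must carry ALL fields and metadata."""
    return {"k": record["key"], "v": record["value"], "h": record["headers"]}

def to_primary(row):
    """Stand-in for secondary -> primary conversion (table row back to
    stream record), needed to serve catch-up reads from the shared tier."""
    return {"key": row["k"], "value": row["v"], "headers": row["h"]}

def round_trips(record):
    """The lossless property: primary -> secondary -> primary is identity."""
    return to_primary(to_secondary(record)) == record
```

Any field the secondary schema cannot represent (or silently coerces) breaks this property, which is the fidelity risk discussed in the list above.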
If there is sensitive data that we don't want written to the secondary system, then materialization might be better.

- Fidelity. Great care must be taken to ensure that the bidirectional lossless conversion between formats does not lose fidelity.
- Security/encryption. End-to-end encryption will of course make shared tiering impossible, and there may be other encryption-related challenges, such as exposing encryption metadata in the secondary system.
- Performance overhead. There may be a large conversion cost when serving lagging or catch-up consumers from secondary storage, due to the conversion from the secondary format back to the primary.
- Performance optimization. The shared tier serves two masters, and here data organization can really make the difference in terms of performance and efficiency. But what can benefit the secondary can penalize the primary, and vice versa. An example is Z-order compaction in a lakehouse table, where files may be reorganized (rewritten), changing data locality so that data is grouped into data files according to common predicates, for example compacting by date and by country (when where clauses and joins frequently use those columns). While this reorganization can vastly improve lakehouse performance, it can make reads by the primary system fantastically inefficient compared to internal tiering.
- Risk. Once data is tiered, it becomes the one canonical source. Should any silent conversion issue occur when translating from the primary format to the secondary format, there is no recourse. Taking Kafka as an example, topics can have Avro, Protobuf and JSON among other formats. The secondary format could be an open table format such as Iceberg, Delta, Hudi or Paimon, with data files in Avro, Parquet or ORC. With so many combinations of primary->secondary->primary format conversions, there is some unavoidable additional risk associated with shared tiering.
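One way to guard against silent fidelity loss is a round-trip check: convert a record to the secondary representation and back, and assert equality. A minimal sketch, using a hypothetical flattened row representation of a Kafka-style record (field names are illustrative):

```python
# Illustrative fidelity check for bidirectional conversion: a Kafka-like
# record is flattened into a "table row" (think one Parquet row) and then
# reconstructed. Shared tiering needs this round trip to be lossless for
# every field, including headers and metadata, not just key and value.
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    offset: int
    timestamp_ms: int
    key: bytes
    value: bytes
    headers: tuple = ()  # ((name, value_bytes), ...)

def to_row(r: Record) -> dict:
    return {
        "offset": r.offset,
        "timestamp_ms": r.timestamp_ms,
        "key": r.key,
        "value": r.value,
        "headers": list(r.headers),  # easy to drop in a naive mapping
    }

def from_row(row: dict) -> Record:
    return Record(row["offset"], row["timestamp_ms"], row["key"],
                  row["value"], tuple(row["headers"]))

rec = Record(41, 1700000000000, b"user-7", b'{"amount": 9}',
             (("trace-id", b"abc"),))
assert from_row(to_row(rec)) == rec  # lossless round trip
```

A real implementation would run such checks across every supported primary format and schema version, since that is where silent divergence tends to creep in.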
Internal tiering is comparatively simple: you tier data in its primary format. In the case of Kafka, you tier log segment files directly. While zero-copy shared tiering sounds attractive, there are practical implications that must be considered. I am sure that shared tiering can be made to work, but only with great care, and only if the secondary storage remains under the ownership and maintenance of the primary system.

Note: The risks of fidelity and compatibility issues can be mitigated by storing the original bytes alongside the converted columns, but this introduces the very data duplication that many proponents of shared tiering are advocating against.

I wrote about data platform composability, enabled by the open table formats. The main premise is that we can surface the same physical tabular data in two platforms. One platform acts as the primary, writing to the table and performing table management. Secondary platforms surface that table to their own users as a read-only table. That way, we can compose data platforms while keeping table ownership and management responsibility clear. This same approach works for shared tiering: the primary system should have full ownership (and management responsibility) of the lakehouse tiered data, making it a purely read-only resource for secondary systems. In my opinion, this is the only sane way to do shared lakehouse tiering, but it adds the burden of lakehouse management to the primary system.

Where does the work of stitching different storage sources live? Should it exist server-side, or is it better as a client-side abstraction? One option is to place that stitching and conversion logic server-side, on a cluster of nodes that serve the API requests from clients. This is the current choice of:

- Kafka API compatible systems. Kafka clients expect brokers to serve a seamless byte stream from a single logical log, so Kafka brokers do the work of stitching disk-bound log segments with object-store-bound log segments.
- HTAP and real-time analytics databases.

Another option is to place that stitching logic client-side, inside the client library. Through some kind of signalling or coordination mechanism, the client must know how to stitch data from two or more storage systems, possibly using different storage formats. The client must house the data conversion logic to surface the data in the logical model. If this is a database API, then the client will also have to perform the query logic, hopefully pushing down some of the work to the various storage sources. Client-side stitching can make sense if the client sits above two separate high-level APIs, such as a stream processor above Kafka and a lakehouse. But it's also possible to place stitching client-side within a single data system protocol. The benefit of client-side stitching is that we unburden the storage cluster from this work. For example, it would be possible to make Kafka clients download remote log segments instead of the Kafka brokers, freeing brokers from the load generated by lagging consumers. On the downside, putting the stitching and conversion client-side makes clients more complicated and can make concepts such as storage format evolution and compatibility more difficult. A lot depends on what kind of control the operator has over the clients. If clients are tightly controlled and kept in sync with the storage systems and their evolution, then client-side stitching might be feasible. If, however, clients are not managed carefully and many different client versions are in production, this can complicate long-term evolution of data and storage. I've seen many issues from customers running multiple Kafka client versions against the same cluster, often very old versions due to constraints in the customer's environment.
Placing the stitching work server-side, either directly in the primary cluster or in a proxy layer, removes the client complexity and evolution headaches, but at the cost of extra load and complexity server-side. Arguably, we might prefer the complexity to live server-side, where it can be better controlled.

Materialization and tiering have some similarities. Both involve copying data from one storage location to another, and possibly from one format to another. Where they diverge is that tiering implies a tight data lifecycle between storage tiers and materialization does not. For tiering, we need a job that:

1. Learns the current tiering state from the metadata store.
2. Reads the data in the source tier.
3. Potentially converts it to the format of the destination tier (such as from a row-based format to a columnar format).
4. Writes it to the destination tier.
5. Updates the associated metadata in the metadata service.
6. Finally deletes the data in the source tier.

That job could be the responsibility of a primary cluster (that serves reads and writes) or of a secondary component whose only responsibility is to perform the tiering. If a secondary component is used, it might be able to access the data directly from its storage location, or it may access the data through a higher-level API. The same choices exist for materialization. Which is best? It's all context dependent. In open-source projects, we typically see all this data management work centralized, hence Kafka brokers taking on this responsibility. In managed (serverless) cloud systems, data management is usually separated into a dedicated data management component, for better scaling and separation of concerns. Tiering involves reads and writes to both tiers, whereas materialization requires read access to the primary's data and write access to the secondary. What kind of access to primary storage and secondary storage is used? Next we'll look at direct vs API access for reads and writes to both tiers.
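The tiering steps above can be sketched as a single job function, with dicts standing in for the storage tiers and metadata store (all names are illustrative):

```python
# Sketch of the tiering job described above: learn state, read, convert,
# write, commit metadata, then delete from the source tier. The ordering
# matters: a crash before the metadata commit leaves the source copy intact.

def run_tiering_job(metadata: dict, source: dict, dest: dict,
                    convert=lambda v: v) -> list:
    tiered = []
    already_tiered = set(metadata["tiered"])           # 1. learn current state
    for key in sorted(k for k in source if k not in already_tiered):
        data = source[key]                             # 2. read source tier
        dest[key] = convert(data)                      # 3+4. convert and write
        metadata["tiered"].append(key)                 # 5. commit to metadata
        del source[key]                                # 6. delete source copy
        tiered.append(key)
    return tiered

meta = {"tiered": []}
local, remote = {"seg-001": b"a", "seg-002": b"b"}, {}
moved = run_tiering_job(meta, local, remote)
assert moved == ["seg-001", "seg-002"]
assert local == {} and set(remote) == {"seg-001", "seg-002"}
```

A real job would also need idempotence on retry and atomicity in step 5, which is exactly where the metadata service's correctness protocol comes in.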
We have two types of access:

- Direct-access: The tiering/materialization process (whether integrated or external) directly reads the primary storage data files. Example: Kafka brokers read local log segments from the filesystem and write them to the second tier (object storage).
- API-access: The tiering/materialization process uses the primary's API to read the primary's data. Example: tiering/materialization could be the responsibility of a separate component, which reads data via the Kafka API and writes it to the second tier.

Direct access might be more efficient than API-access, but it might also be less reliable if the primary performs maintenance operations that change the files while tiering is running. It may be necessary to add coordination to overcome such conflicts. For both materialization and shared tiering, we must decide how to write data to the secondary system. In the case of a lakehouse, we would likely do it via a lakehouse API such as an Iceberg library (API-access). Shared tiering must also be able to read back tiered data from the secondary system.

A key consideration is that the primary must maintain a mapping of its logical and/or physical model onto the secondary storage, for example, mapping a Kafka topic partition offset range to a given tiered file. The primary needs this in order to download the right tiered files to serve historical reads. This mapping could also map to the logical model of the secondary system, such as row identifiers of an Iceberg table. Let's look at some example strategies of Iceberg shared tiering:

- API-Access. The primary uses the Iceberg library to write tiered data and to read tiered data back again. It maintains a mapping of the primary logical model to the secondary logical model. Example: Kafka uses the Iceberg library to tier log segments, maps topic partition offset ranges to Iceberg table row identifier ranges.
It uses the Iceberg library to read tiered data using predicate clauses.

- Hybrid-Access (write via API, read via direct). The primary uses the Iceberg library to tier data but keeps track of the data files (Parquet) it has written, with a mapping of the primary logical model to secondary file storage. To serve historical reads, the primary knows which Parquet files to download directly, rather than going through the Iceberg library, which may be less efficient.

The direct-access strategy could be problematic for shared tiering, as it bypasses the secondary system's API and abstractions (violating encapsulation and leading to potential reliability issues). The biggest issue in the case of lakehouse tiering is that table maintenance might reorganize files and delete the files tracked by the primary. API-access might be preferable unless secondary maintenance can be modified to preserve the original Parquet files (causing data duplication) or to update the primary on the changes it has made so the primary can make the necessary mapping changes (adding a coordination component to table maintenance). Another consideration is that if a custom approach is used where, for example, additional custom metadata files are maintained side-by-side with Iceberg files, then Iceberg table maintenance cannot be used and maintenance itself must be a custom job of the primary. We ideally want one canonical source where the data lifecycle is managed.

Whether stitching and conversion is done client-side or server-side, we need a metadata/coordination service to give out the necessary metadata that translates the logical data model of the primary to its physical location and layout. Tiering jobs, whether run as part of a primary cluster or as a separate service, must base their tiering work on the metadata maintained in this central metadata service.
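The logical-to-physical mapping the primary maintains, from topic partition offset ranges to tiered files, might be sketched like this (segment names and file layout are hypothetical):

```python
# Illustrative mapping from Kafka-style offset ranges to tiered files, of
# the kind a metadata service might hand out. A historical read collects
# every file whose [base_offset, end_offset] range overlaps the request.
import bisect

# (base_offset, end_offset, file) per tiered segment, sorted by base_offset.
mapping = [
    (0,    999,  "part-0/seg-0.parquet"),
    (1000, 1999, "part-0/seg-1000.parquet"),
    (2000, 2999, "part-0/seg-2000.parquet"),
]

def files_for(start: int, end: int) -> list:
    """Return the tiered files covering offsets [start, end]."""
    bases = [base for base, _, _ in mapping]
    i = bisect.bisect_right(bases, start) - 1  # last segment starting <= start
    out = []
    for base, last, path in mapping[max(i, 0):]:
        if base > end:
            break  # sorted by base_offset, so nothing further can overlap
        if last >= start:
            out.append(path)
    return out

assert files_for(1500, 2100) == ["part-0/seg-1000.parquet",
                                 "part-0/seg-2000.parquet"]
```

This is exactly the state that table maintenance can invalidate under direct-access reads: if compaction rewrites `part-0/seg-1000.parquet` into a new file, the mapping must be updated or reads will fail.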
Tiering jobs learn of the current tiering state, inspect what new tierable data exists, do the tiering, and then commit that work by updating the metadata service again (and deleting the source data). In some cases, the metadata service could even be a well-known location in object storage, with some kind of snapshot or root manifest file (and an associated protocol for correctness). When client-side stitching is performed, clients must somehow learn of the different storage locations of the data they need. There are two main patterns here:

1. The clients directly ask the metadata service for this information, and then request the data from whichever storage tier it exists on.
2. The client simply sends reads to a primary cluster, which serves either the data (if stored on the local filesystem) or metadata (if stored on a separate storage tier).

The second case requires that the primary cluster knows the metadata of tiered data in order to respond with metadata instead of data. This may be readily available if the tiering job runs on the cluster itself. The cluster could also be notified of metadata updates by the metadata component.

What governs the long-term compatibility of data across different storage services and storage formats? Is there a canonical logical schema from which all secondary schemas are derived? Or are primary and secondary schemas managed separately somehow? How are they kept in sync? What manages the logical schema, and how does physical storage remain compatible with it? If direct-access is used to read shared tiered data and maintenance operations periodically reorganize storage, how does the metadata maintained by the primary stay in sync with secondary storage? Again, this comes down to coordination between metadata services, the tiering/materialization jobs, maintenance jobs, catalogs, and whichever system is stitching the different data sources together (client or server-side).
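The first client-side pattern, asking the metadata service where data lives and then fetching from that tier, could be sketched as follows (every component here is an illustrative stand-in):

```python
# Sketch of client-side stitching: the client asks a metadata service which
# tier currently holds each segment, then reads local and tiered storage
# and stitches the results into one logical stream.

LOCAL, TIERED = "local", "tiered"

class MetadataService:
    """Stand-in: maps each segment name to the tier that currently holds it."""
    def __init__(self, placement: dict):
        self.placement = placement

    def locate(self, segment: str) -> str:
        return self.placement[segment]

def read_stream(segments, meta, local_store, tiered_store):
    out = []
    for seg in segments:
        store = local_store if meta.locate(seg) == LOCAL else tiered_store
        # Conversion from the tier's storage format would happen here.
        out.extend(store[seg])
    return out

meta = MetadataService({"seg-0": TIERED, "seg-1": TIERED, "seg-2": LOCAL})
tiered = {"seg-0": [0, 1], "seg-1": [2, 3]}
local = {"seg-2": [4, 5]}
assert read_stream(["seg-0", "seg-1", "seg-2"],
                   meta, local, tiered) == [0, 1, 2, 3, 4, 5]
```

The stitching logic is trivial here; the hard parts in practice are keeping the placement metadata fresh and housing the format conversion inside every client version in production.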
Many abstractions and components may be in play. Lakehouse formats provide excellent schema evolution features, but these need to be governed tightly with the source system, which may have different schema evolution rules and limitations. When shared tiering is used, the only sane choice is for the shared tiered data to be managed by the primary system, with read-only access for the secondary systems.

If we want to expose the primary's data in a secondary system, should we use shared tiering, or materialization (presumably with internal tiering)? This is an interesting and multi-faceted question. We should consider two principal factors:

1. Where the stitching/conversion logic lives (client or server).
2. The pros/cons of shared tiering vs the pros/cons of materialization.

When the stitching is client-side, tiering vs materialization may not make a difference. Materialization also requires metadata to be maintained regarding the latest position of the materialization job. A client armed with this metadata can stitch primary and secondary data together as a single logical resource. We might be using Flink and want to combine real-time Kafka data with historical lakehouse data. Flink sits above the high-level APIs of the two different storage systems. Whether the Kafka and lakehouse data are tightly lifecycle-linked with tiering, or more loosely with materialization, is largely unimportant to Flink. It only needs to know the switchover point from batch to streaming. If we want to provide the data in lakehouse format so Spark jobs can slice and dice it, then either shared tiering or materialization is an option. Shared tiering might be preferable if reducing storage cost (by avoiding data duplication) is the primary concern. However, other factors are also at play, as explained earlier in the challenges of shared tiering.
Materialization might be preferable if:

- The primary and secondary systems have completely different access patterns, such that maintaining two copies of the data, each in its respective format, is best. The secondary can organize the data for its own performance, and the primary uses internal tiering, maintaining its own optimized copy.
- The primary does not want to own the burden of long-term management of the secondary storage.
- The primary does not have control over the secondary storage (to the point where it cannot fully manage its lifecycle).
- Performance- and reliability-conscious folks prefer to avoid the inherent risks associated with shared tiering, in terms of conversion logic over multiple schemas over time, performance constraints due to data organization limitations, and so on.
- The secondary only really needs a derived dataset. For example, the lakehouse just wants a primary-key table rather than an append-only stream, so the materializer performs key-based upserts and deletes as part of the materialization process.

Data duplication avoidance is certainly a key consideration, but by no means always the most important. The subject of storage unification (aka data virtualization) is a large and nuanced one. You can choose to place the virtualization layer predominantly client-side or server-side, each with its pros and cons. Data tiering and data materialization are both valid options, and can even be combined. Just because the primary system chooses to materialize data in a secondary system does not remove the benefits of internally tiering its own data. Tiering can come in the form of internal tiering or shared tiering, where shared tiering is a kind of hybrid that serves both primary and secondary systems. Shared tiering links a single storage layer to both primary and secondary systems, each with its own query patterns, performance needs, and logical data model.
This has advantages, such as reducing data duplication, but it also means lifecycle policies, schema changes, and format evolution must be coordinated (and battle-tested) so that the underlying storage remains compatible with both primary and secondary systems. With clear ownership by the primary system and disciplined management, these challenges can be manageable. Without them, shared tiering becomes more of a liability than an advantage. While on paper materialization may seem like more work, as two different systems must remain consistent, the opposite is more likely to be true. By keeping the canonical data a private concern of the primary data system, we free the primary from potentially complex and frictionful compatibility work, juggling competing concerns and different storage technologies with potentially diverging future evolution. I would like to underline that making consistent copies of data is a long-standing and well-understood data problem. The urge to simply remove all data copies is understandable, as storage cost is a factor. But there are many other, often more important, factors involved, such as performance constraints, reliability, and lifecycle management complexity. If reducing storage at-rest cost is the main concern, then shared tiering, with its additional complexity, may be worth it.

I hope this post has been food for thought. With this conceptual framework, I will be writing in the near future about how various systems in the data infra space perform storage unification work:

- Tiered storage in event streaming systems such as Apache Kafka and Pulsar
- HTAP databases such as SingleStore and TiDB
- Real-time analytics databases such as Apache Pinot, Druid and Clickhouse

Frontend abstraction: stitching together the different physical storage into one logical model. Backend work: physical storage management (tiering, materialization, lifecycle management).