Posts in Data (20 found)
Tara's Website 2 weeks ago

Data first, programs as guests

Some ideas don’t arrive suddenly. They form slowly, through repetition and exposure, until one day they become visible. Over the years, I’ve noticed that the systems I’ve always felt most at home in share a specific trait. It took me a long time to see the bigger picture and name it clearly. In those systems, data is the center of gravity, not programs. Programs are transient.

0 views
ava's blog 2 weeks ago

pay or okay - is it really?

While browsing news websites, you may have seen a pop-up like this: That's one form of what "Pay or Okay" can look like. The model was first introduced by newspapers in Austria and Germany in 2018, but in 2023, Meta adopted it for Instagram and Facebook. What this means is: Either you agree to the tracking and get to use the website for free, or you refuse the tracking and have to pay.

Maybe this doesn't sound so bad to you; after all, if they lose out on tracking that generates money, they should be compensated, right? Unfortunately, it's not that easy! You see: Not every Pay or Okay system is set up the same way, or even the way it sounds at first. I wouldn't fault you for thinking that paying should mean there's no tracking at all, or only the most essential tracking and no ads, but that's not true. With many websites like The Guardian above, you pay just to opt out of the ads being personalized. You'll still see ads, you'll still have cookies and "similar technologies" (other tracking) being employed against you. Despite paying monthly, your data is still harvested and your reading behavior tracked. To me, this is a sort of "double-dipping", as it still results in some data selling on top of my monthly payment.

Some research shows that publishers on average earn €0.24 per user per month from personalized tracking and €3.24 per user per month from the paid option 1. If I'm going to pay for this and there's increased revenue, I want there to be the minimum amount of tracking, not just less. I don't want you to take my money and still somehow monetize my data! There are regional differences in pricing too, with the most extreme in France: If you read French online news sites, you'd pay ~800% of the average total digital advertising revenue per user if you wanted to refuse tracking. What you pay and what your data is worth are not equal; they are just milking you on top of it.

Pay or Okay models can, depending on the implementation, lead to a double payment too. You might be paying to be tracked less, and then also need to pay to access paywalled content separately. This tends to happen in setups where it's combined with a freemium model, in which some content is freely accessible while some is paywalled. Even in setups where the paid mode to reduce tracking is just their normal subscription (usually called a "hard paywall", or "metered paywall" if you get limited free samples), it means the popup is simply advertisement for their subscription and has little to do with choice.

The sad reality is that instead of empowering users to make a choice, this is once again engaging in dark patterns. Not only is one of the options often automatically pre-selected, placed higher, or emphasized with colors, but it's obviously easier to just click to agree and be done with it instead of setting up payment first. Research papers about this show that this model leads to consent rates of 99% to 99.9% 2, even though only 0.16% to 7% of people actually want to be tracked or see personalized advertising online 3. This is hardly reconcilable with Article 7(3) GDPR, under which withdrawing consent should be as easy as giving it. That means: Not only does this put a price on the human right of informational self-determination, but it also makes it a hassle to enforce and stick to as a user. Another issue is that it's pricing people out of actually getting to make a decision freely.
If you struggle financially (or are just a teen with no or little income), it's not worth it to spend money each month just for less tracking - you have bigger problems! If you cannot afford it, you're either forced to agree to the tracking or exit the site. Even if you pay the fee for one news site, you'd surely not pay it for the handful of others you visit. In Germany, paying the reject fee on 29 of the top 100 websites that used Pay or Okay (including news, weather, "social" media networks and others) amounts to an overall cost of over €1,528.87 per year according to noyb.eu. That's more than the average yearly spending on clothes in Germany. There's also no geographical pricing adjustment, so if you are in an economically weaker country wanting to read German or French news, you'd still have to pay those high prices. So far, I haven't seen a single site that allows you to pay a rejection fee per article with their Pay or Okay pop-up; it was all or nothing, in a recurring subscription. That's unfortunate, because a user shouldn't have to enter a subscription model to avoid tracking while viewing one article of a site they might not visit again. This, together with paywalls, is adding to the issue of people increasingly getting their news from third parties that are freely available, but may skew it to their advantage.

Of course independent, investigative journalism needs to be compensated and kept alive. But digital advertising, according to estimates by the European Media Industry Outlook, only accounts for about 10% of the revenue of the press, with targeted advertising being only about 5%. For comparison: Their Figure 50 graphic shows print circulation still makes up roughly 50% of revenue! Given that on average only about 5% of press revenue comes from targeted advertising, implementing Pay or Okay likely increases income only a little. That is not enough to save the press, so we should not be misled by economic interests into denying that this has a significant negative impact on our decision to be tracked or not. This doesn't sound like a legitimate (economic) interest that overrides the users' interests according to Article 6(1)(f) GDPR.

Tracking isn't even that useful for news sites: The World Association of News Publishers says that >50% of global programmatic ("personalized") advertisement spending instead goes to Alibaba, Alphabet, Amazon, ByteDance and Meta. In comparison, news publishers still take in more from directly sold advertisements. That makes sense: The big platforms already work with algorithms and hyper-personalizing the user experience, while news publishers come from a long past of offering people a fixed, non-personalized ad space in the newspaper. Even if they wanted to use more fitting advertising, there is still the option of contextualized advertising, which is only linked to a specific medium or content without needing to use the users' personal data.

Of course you could say "Who the hell cares? Just install an ad-blocker and other privacy-focused browser extensions!" and you'd not be wrong. Allegedly, due to increased blocking or rejection of tracking and cookies, only about 30% of internet users are even exposed to targeting 4. I have doubts about this number, because many people do not engage via browsers, but within apps that don't allow interference. But if we believe it, that means even with an artificially inflated 99% consent rate due to Pay or Okay pop-ups, most of those consents don't actually translate into ad revenue.
Still, there's always an arms race between tracking/advertising and blocking, and we should enable a free choice even for people who aren't knowledgeable enough about this stuff and are still getting tracked without their consent, or forced to. Caring about privacy in this aspect requires people to know a lot:

- how tracking and advertising works
- the negative aspects of advertising (why would you possibly not want it? Not just annoying placement, but possible psychological effects)
- the fact that many of these sites have 100+ (sometimes even 1000+) partners they share the data with
- what data is tracked
- how it can be misused, leaked, etc.
- that ad-blockers and other software exist
- that you can use a browser version instead of the app

Just imagine telling all of that to your grandparents. Ask the average person what cookie banners are about; many will not be able to tell you. They are like Terms of Service, Privacy Policies, or EULAs to people. They just know if they click yes, they'll get to where they wanna go faster. There's no informed choice there because many people are not sat down and educated about it, and Pay or Okay pop-ups work the same. I prefer to work on shitty implementations and legal loopholes rather than put the responsibility on the user to know about the latest issues or technical solutions.

Unfortunately, it seems like we are moving on with this. Despite the European Data Protection Board stating in its 08/2024 opinion that large online platforms relying on a binary choice between consenting or paying a fee is generally not legal, no consequences have followed. Data Protection Authorities, like the ones in Germany, have stayed silent on the matter. In the Digital Omnibus to overhaul parts of the GDPR and other laws around digital rights, they write: "Considering the importance of advertising revenue for independent journalism as an indispensable pillar of a democratic society, media service providers as defined in Regulation (EU) 2024/1083 (European Media Freedom Act) should not be obliged to respect such signals." "Such signals" meaning automated signals of refusing tracking/cookies. This unfortunately shows that in the future, if this goes through, "Pay or Okay" will be seen as acceptable because choice does not matter for news media, even though it was previously (aside from a CJEU judgment in 2023) contentious or denied for large platforms. If it is allowed for one, it should technically be allowed for others, because the GDPR doesn't differentiate between different groups of controllers for these things. That means a future in which we still continue to fight back against ad-tech, and not just paywalls for content, but paywalls to our right to choose as well.

Reply via email Published 14 Feb, 2026

Study by Müller-Tribbensee et al. ↩ This is also mentioned in an interview with Dirk Freitag, CEO of Contentpass, a service that offers Pay or Okay services to publications. ↩ See for example the attitude towards tracking on Facebook. ↩ Read here and here about targeting issues, for example. ↩

0 views
matduggan.com 2 weeks ago

The Small Web is Tricky to Find

One of the most common requests I've gotten from users of my little Firefox extension ( https://timewasterpro.xyz ) has been more options around the categories of websites that you get returned. This required me to go through and parse the website information to attempt to put them into different categories. I tried a bunch of different approaches but ended up basically looking at the websites themselves to see if there was anything that looked like a tag or a hint on each site. This is the end conclusion of my effort at putting stuff into categories. Unknown just means I wasn't able to get any sort of data about it. This is the result of me combining Ghost, Wordpress and Kagi Small Web data sources.

Interestingly one of my most common requests is "I would like less technical content", which as it turns out is tricky to provide because it's pretty hard to find. Such sites sort of exist, but less technical users don't seem to have bought into the value of the small web and owning your own web domain (or if they have, I haven't been able to figure out a reliable way to find them). This is an interesting problem, especially because a lot of the tools I would have previously used to solve it are... basically broken. It's difficult for me to really use Google web search to find anything at this point even remotely like "give me all the small websites" because everything is weighted to steer me away from that towards Reddit. So anything that might be a little niche is tricky to figure out. So there's no point in building a web extension with a weighting algorithm to return less technical content if I cannot find a big enough pool of non-technical content to surface. It isn't that these sites don't exist; it's just that we never really figured out a way to reliably surface "what is a small website".

So from a technical perspective I have a bunch of problems. I think I can solve... some of these, but the more I work on the problem the more I'm realizing that the entire concept of "the small web" had a series of pretty serious problems. First I need to reliably sort websites into a genre, which can be a challenge when we're talking about small websites because people typically write about whatever moves them that day. Most of the content on a site might be technical, but some of it might not be. Big sites tend to be more precise with their SEO settings, but small sites that don't care don't do that, so I have fewer reliable signals to work with. Then I need to come up with a lot of different feeding systems for independent websites. The Kagi Small Web was a good starting point, but Wordpress and Ghost websites have a much higher ratio of non-technical content. I need those sites, but it's hard to find a big batch of them reliably. Once I have the type of website as a general genre and I have a series of locations, then I can start to reliably distribute the types of content you get.

Google was the only place on Earth sending any traffic there. Because Google was the only one who knew about it, there never needed to be another distribution system. Now that Google is broken, it's almost impossible to recreate that magic of becoming the top of the list for a specific subgenre without a ton more information than I can get from public records.
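The "tag or hint" check is the kind of thing that's easy to sketch and hard to make reliable. A rough illustration in Go (not the extension's actual code; the generator meta tag and the /tag//category URL patterns it looks for are just assumptions about what a site might expose):

```go
// classify fetches a site's homepage and looks for two weak signals:
// a "generator" meta tag (to spot WordPress/Ghost) and tag/category links
// (to guess topics). Small sites often expose neither, which is the problem.
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strings"
)

var (
	generatorRe = regexp.MustCompile(`(?i)<meta[^>]+name="generator"[^>]+content="([^"]+)"`)
	tagLinkRe   = regexp.MustCompile(`(?i)href="[^"]*/(?:tag|tags|category)/([a-z0-9-]+)`)
)

func classify(url string) (engine string, tags []string, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20)) // first 1 MiB is plenty
	if err != nil {
		return "", nil, err
	}
	html := string(body)
	if m := generatorRe.FindStringSubmatch(html); m != nil {
		engine = m[1]
	}
	seen := map[string]bool{}
	for _, m := range tagLinkRe.FindAllStringSubmatch(html, -1) {
		t := strings.ToLower(m[1])
		if !seen[t] {
			seen[t] = true
			tags = append(tags, t)
		}
	}
	return engine, tags, nil
}

func main() {
	engine, tags, err := classify("https://example.com")
	fmt.Println(engine, tags, err)
}
```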

0 views
ava's blog 1 month ago

my journey into data protection, part one

Growing up, I wished for more radical honesty and openness around careers and opportunities online. How others achieved anything was beyond me. I was simply missing the experience and maturity to even guess how others' successes came to be, so it was often a big mystery to me. Many people are ashamed of freely sharing what it actually took. I guess some feel that if they reveal it, people will nitpick about what could have been done better, some guard their connections, others don't want to put out there that they're actually a 'nepo baby'. People are embarrassed about the failures on the way and would prefer to make it all seem effortless and instant on the outside.

I want to do my part to be as open as is sensible about my path of trying to work in data protection/privacy; my challenges and my failures, the reasons for doing what I did, and my thoughts during some of the difficult moments and choices. I originally wanted to keep updating this for years until I hit a specific milestone and then release it, but even just writing down everything that happened until now was a lot. So I guess there will be two parts, with the second part coming one day :)

I actually used to think I was too stupid for law. I admired law students and was secretly jealous, because I was intrigued but thought I could never do it. In 2017-2018, right as the GDPR was about to come into effect, I saw lots of ads about Data Protection Officers, as they'd be needed soon. Offers for companies to send their employees to 2-week crash courses, or companies emerging whose lawyers could be your external officers. I saw these ads and thought: "Wow, this would be so cool. But no shot that I could actually do it." After all, I'm probably bad at law, and I was only working as a trainee at the time. Only the established IT guys would be sent to these courses, right? I was very interested in online privacy already at that point, but more focused on reducing unnecessary tracking and deleting social media.

In the school part of my traineeship, I actually did have to learn some law, and I was surprisingly good at it. Soon after, I found out you don't always need to become a full-on lawyer via the two Staatsexamen (state exams) in Germany - you can also do a Bachelor of Laws (LL.B). That would fit me more! I chose to enroll at a distance-learning university in 2022 to do the degree part-time, as I had finished my traineeship and begun a full-time position at the same place in 2021. At that point, I did that just for me, with no goals to make this relevant for my career in any way. That enabled me to take it slow, with no pressure to finish as quickly as possible or with the best grades. Still, I started my studies with great grades.

The degree had some elective courses, and one of them was data protection law. That made me even more interested (daring to dream, and all) and soon after, I considered taking that elective in the Winter semester of 2024/2025 to make sure the field really suits me and to learn more. It went great and cemented my interest and passion for the field. Months prior, I had seen that the same university offered a 1.5-year Advanced Studies degree that would also certify me as a Data Protection Consultant upon completion and enable work as a Data Protection Officer. The problem: To qualify, you needed a finished Bachelor degree or more. I'd been almost halfway through mine, but it would still be years, and who knows if I'd truly be able to finish it successfully?
So I looked more deeply into it, and on one sub-page, they sneakily added an exception for people who had no degree but whose work involved data protection law concerns. To prove that, they required a CV signed by you, and a document from your employer confirming it. I shot my shot and asked my (very nice and supportive) boss, and she was on board. Admittedly, we exaggerated some parts of the work and tried to focus hard on the few things that would fit, like the way I managed the user accounts in our database. I applied for the program, and I got accepted in November 2024! I couldn't be happier. It cost close to 3k, and I paid it off in 10 months (March - December 2025), no interest. It's meant to be completed as two exams a semester, but I ended up grinding hard and finishing it early, taking all 6 exams in one semester (6 months instead of the 1.5 years), all while continuing full-time work and my part-time LL.B.

During that program, I was looking to gain more experience and network connections in the field, so I messaged the Data Protection Officer at my place of work. I said I had questions about the field in general and the job itself, and wanted his advice on what people looking to enter the field should do or bring to the table. He accepted gladly, and we kept meeting up like every other month for over a year to discuss things, like questions I had about specific paragraphs and principles, or how to implement something in practice. He shared a lot of practical tips with me, as well as how our workplace had implemented Microsoft or OpenAI products. He also took a lot of interest in the exams I wrote!

In July 2025, I started volunteering for noyb.eu. I became a Country Reporter first for Germany, later also for Austria. That helped me in multiple ways:

- I kept up-to-date with current legislation, cases and problems via their newsletters, blog posts, and internal communication, like their interesting presentations during the Country Reporter meetings.
- It gave me a space to connect with like-minded people.
- I was practicing reading case law and writing English legal jargon.
- I could build up a reputation in the space.

I also chose to attend the Beschäftigendatenschutztag 2025 in Munich at the end of October 2025 (just as I had finished the cert) for a similar reason: I wanted to learn from others, show presence and get a feel for how the professionals discuss. I sadly couldn't attend the Datenschutzkonferenz that year. These events are super ultra expensive. Usually, companies cover their employees' fees when sending them there, but... no one was sending me. I wasn't hired in any role that would make my employer send me there and cover the costs, so I had to do it on my own. I got a student discount of 50%, which brought it down to a little over 500 Euro. Still expensive for me, though. I didn't live anywhere near Munich, so my lovely wife suggested we stay at my in-laws' place near Nuremberg for the week and I take the train to Munich for the two event days.

I was ready to put myself out there. I wanted to put all of my experience and credentials so far to good use and learn more, so I started asking for more data protection related tasks at work. Our DPO had become my mentor by that point, would have loved to work with me, and could use the help, so I requested that. Unfortunately, leadership was neither willing to create a new role in his team, nor interested in allowing me to internally transfer there. They saw data protection as an annoying topic and did not want more people working on it. That gave me my first taste of how employers would see me... My boss, meanwhile, tried to involve me in new projects that were tangentially related as best as she could as I kept asking for that, but most things fell through or were not giving me enough to chew on, through no fault of any of us.
Unfortunately, my employer struggled with a lot of budget cuts at the time, which didn't help. My job involves a lot of pharmaceutical health data, and it's a part I love about it. So I decided that, as best as I could, I would like to specialize in data protection around health data. A promising niche, as AI and projects combining health databases all around the world could lead to amazing breakthroughs, but need a lot of safety and oversight. Hopefully, I could combine my experience working for my employer with my extra qualifications. Luckily, they had just built up a research data center that would focus on health data, and were searching for a Data Protection Consultant for it. Their wishlist was intense: A finished degree in law, IT, and related fields, ideally the state exams. I knew though that our place always gets a very small number of applications, and such lists in job postings always describe the ideal candidate; many places usually have to settle for less. I easily suited the rest of the requirements and the tasks for the job:

- me being halfway done with the Bachelor degree,
- having the Advanced Studies Degree,
- already having worked there for years, no onboarding needed, knowing the organization and its processes well,
- both my boss and my mentor (our DPO), who would have sung praises about me and were named as references in my application,
- a document proving I had already attended events in the industry, and
- my volunteer work showing I am passionate, hardworking and always up-to-date on the field.

And why not just try? The worst that could happen is a no. So I applied. First, they extended the application deadline. I already knew it was hopeless by then. Then they rejected me, and re-posted it externally. Even 5 months later, that spot doesn't seem to have been filled, but the listing is gone as well. They'd rather hire no one than hire me, because of a missing Bachelor's/Master's degree or StEx - though I assume I was already filtered out by HR, because HR doesn't know how to properly judge the qualifications in that field. For many people, law is when you are a full lawyer with both state exams, and that's it (like I used to think!). They don't know much about the LL.B or LL.M, and they sure as hell don't know the extra degree you can get on top of a Bachelor's degree (or, in my case, as an exception, if your work qualifies you). What should have qualified me to at least get an interview got me thrown out of the (non-existent) pile, even though I was very likely the only applicant.

That taught me my first important lesson: Until I clear the arbitrary lines they draw in the sand, I have to bypass HR. A while later, I read a tweet thread by user gabriel1 that was saying the same; especially: "never compete when applying for jobs, there are hundreds of applicants with better grades and universities than you. [...]" "straight to managers, ceo, ppl with incentive for the company to go well. HR people play losers game, they just don't want to make mistakes. if you are bad but are from harvard they can just say "oh he was supposed to be good" and they have an excuse. so they'll dislike you" I'd take this to heart.

I was also rejected when I applied to another company. I was sure I could at least get an interview. Their requirements were loose and low, and I could meet everything perfectly. Instead, I was rejected a week after I submitted my application, so I didn't even make it to further consideration. The reason for that was possibly the fact that I admitted to volunteering at noyb.eu in the motivational letter. That taught me my second valuable lesson: Unless employers directly approach me first and already know about it, I should rethink mentioning my volunteering. I thought it would show passion and knowledge in the field, but what it instead communicated was that I was being an activist. Noyb pushes to hold private companies and corporations accountable, and this is the opposite of what companies hiring privacy professionals actually want.
What they want is someone who can make anything happen with a good legal justification. They aren't interested in ethics; they want new tech, especially AI, at any cost, as long as their Compliance department finds ways to make the processing compliant with as little cost, as few obstacles, and as little delay as possible. They were scared I was going to be someone who would delay and veto things. This fear doesn't make much sense, as DPOs (here in Germany at least) do not have authority to issue any instructions or decide the course of action; all they can do is advise and document what they have advised, so if leadership goes against recommendations, the DPO can't do anything about it. If a DPO does anything else, they're overstepping. What they can do is prove they have said otherwise, and aren't liable for anything. I guess that is either not something leadership knows, or they want the DPOs who wanna enable anything to be their fall guys. It sucks, but I guess in the overall hellscape we are in, it makes sense.

Gabriel also talked about a personal demo being better than a simple CV and motivational letter, and I thought long and hard about how I'd apply that to my field (as it is much easier to do in any field that values portfolios, like tech). I couldn't develop a demo of anything in that way; no use recording a video introducing myself and... showing a Data Protection Impact Assessment for Microsoft Teams? It doesn't work like that in my field. What I could come up with instead was:

- Developing missing compliance documents and concepts for my workplace. We didn't have any internal GDPR-compliant deletion concepts ("Löschkonzepte") at all (not house-wide, not department-wide, not even in sub-departments and teams). I cannot show these in a demo to other companies, but it would at least be a sort of portfolio/demo internally.
- Continuing my volunteer work, showcasing it with a list of all my contributions, and making it into noyb.eu's newsletter with a newly translated and summarized court case (as they highlight new decisions there with attribution). That would only attract people who are okay with it.
- Continuing to write about data protection law on the blog, and in a different, more professional way on my other, work-friendly website as well.
- Being open about searching for work online, so people working for fitting companies who read my blog could stumble across it as well.
- Potentially making a LinkedIn, though I have preferred not to so far.
- Pursuing a blog project I had loosely thought about more seriously: inviting DPOs and other privacy professionals to answer questions in an interview I'd post on my blog.

I knew the last point would be slim chances, but I didn't realize how slim until I tried it. The e-mail addresses for the DPOs of the companies I messaged were mostly automated and strictly for access requests only. They weren't meant for human exchanges. I didn't receive a reply from the first two I messaged and knew I had to change my approach for the rest of my list; probably finding out via LinkedIn or other means who exactly is hired in their Privacy Compliance teams and messaging them directly. I also recognized my disadvantage: I'm not "big" already. No podcast personality, not a panel speaker, not a known author, not a big blogger in that space. My blog isn't hosted on Substack and the interview wouldn't be posted on LinkedIn. All of these things could give the people I reached out to some reassurance about who I am and that they will be featured somewhere "reputable". I still continue trying to make that happen.

Other things I have been doing to bruteforce my way in somehow:

1. I submitted an idea to our Idea Management team about implementing data protection coordinators ("Datenschutzkoordinatoren"). This is standard practice for other companies, very common, but we don't have them despite our DPO/my mentor approving of it. Leadership doesn't want to, and had rejected the idea 5 years prior. But I had better ammo than the old idea submitter, and with AI in the workplace now, things have shifted massively, warranting a reevaluation of the idea. I expect it to be rejected, but at least I tried. This could open the door to me being the data protection coordinator for my department, at least.

2. I indeed created a deletion concept for my team. My mentor/DPO was very happy with it overall when I briefly showed my work in a meeting, and I've sent it to him for more in-depth feedback. Once that is done, I will move on to making one for our sub-department, and then maybe one day, the whole department.
No one is asking me for this, but I have a lot of unused time at work, want to show my skills, and help fix a severe compliance gap my workplace has been in for years now.

3. We had an internal seminar on the "Data Analysis and Real World Interrogation Network" (DARWIN EU), an EU initiative coordinated by the European Medicines Agency to generate and use real-world evidence (RWE) to support the evaluation and supervision of medicines and treatments and to enhance regulatory decision-making by drawing on anonymized data from routine healthcare appointments. Many countries' health databases exchanging data, and possibly in the future using AI for better insight, was totally my jam. We got the contact info of the initiative coordinators in case we have questions and ideas, and I sent an e-mail basically asking how to get involved as a privacy professional in the project. No answer so far.

4. We have an AI Coordinators group at work that always welcomes new ideas, input and help. During one of their presentations showing the current progress of AI adoption in-house and how well it works (not at all!), I sent in a question asking how employees can get involved in the project in terms of privacy compliance. I didn't receive an answer until the next day; it was worded very nicely, but it also showcased our internal rigidity again. In other workplaces, employees can be used more fluidly and assigned across departments if it makes sense, but in our case, they sadly had to be very insistent that you cannot get deeply involved in the actual work unless you are officially part of that team, aside from submitting ideas. And obviously, the compliance needs were already covered by our DPO/my mentor. What they suggested instead was that I could try developing an internal GPT model focused on privacy compliance. That made me a little mad! I want to work. I want to think. I don't want to train my replacement for a job I don't yet have, but want. I want you to ask me one day! And for now, the way LLMs are, I cannot recommend asking one for legal advice, and I can't train it to be better; the hallucinations are, for now, a fundamental flaw I cannot solve. That's the point where I arrived at another lesson: While I keep my options open, I'll likely never work for a private company; I am better suited for regulatory bodies, NGOs, research, and academia. I have much more fun genuinely diving deep into the law and ethics, writing opinion pieces, maybe even proposals, and helping with research and papers, etc., than playing doormat for IT guys who want a new toy.

5. I made it a goal to do more case translations and summaries for noyb this year, with at least one case each week on Saturday. I hit 10 completed cases total a few days ago.

6. I have applied to and been accepted into the volunteer pool of The Midas Project, a watchdog nonprofit working to ensure that AI technology is safe and helpful to everyone. They lead strategic initiatives to monitor tech companies and counter corporate propaganda. Their releases have been very informative and have also drawn the attention of OpenAI, who are challenging them legally. You read that right - I sort of doubled down on the volunteering, despite the very real negative consequences. I'm not sure yet if I will stay; they only offer Fellowships (= opportunities to volunteer on a project) every couple of months.
I'm also noticing a bit of a weird vibe compared to noyb, and I actually have quite a bone to pick with Effective Altruism, which is a big influence on it and the people in the space. But I hope I can learn valuable lessons in AI governance, and I'm praying that it is not dominated by people with very grandiose conspiracy theories about AGI.

That marks my progress at the end of January 2026; almost 2 years since fully plunging into data protection law. Writing all of this out, I realize how much I have managed in that time. It feels simultaneously long and short. I'll have to remember that when I get sad about handing in my Bachelor thesis in 2028 :) Not gonna lie, I have felt crushed and discouraged lately. It sucks when you feel like your true interests, skills and passions don't matter or are a flaw in others' eyes. The praise I get cannot move the mountains that are seemingly in my way. But it's the year of rejection, so I'll take it.

If you have made it this far, thank you! And happy Data Protection Day, which was yesterday. You should read 5 myths about data protection debunked!

Reply via email Published 29 Jan, 2026

0 views
Stratechery 1 month ago

An Interview with Kalshi CEO Tarek Mansour About Prediction Markets

An interview with Kalshi co-founder and CEO Tarek Mansour about the value of prediction markets.

0 views
Xe Iaso 1 month ago

Backfilling Discord forum channels with the power of terrible code

Hey all! We've got a Discord so you can chat with us about the wild world of object storage and get any help you need. We've also set up Answer Overflow so that you can browse the Q&A from the web. Today I'm going to discuss how we got there and solved one of the biggest problems with setting up a new community or forum: backfilling existing Q&A data so that the forum doesn't look sad and empty. All the code I wrote to do this is open source in our glue repo. The rest of this post is a dramatic retelling of the thought process and tradeoffs that were made as a part of implementing, testing, and deploying this pull request. Ready? Let's begin!

There's a bunch of ways you can think about this problem, but given the current hype zeitgeist and contractual obligations we can frame this as a dataset management problem. Effectively we have a bunch of forum question/answer threads on another site, and we want to migrate the data over to a new home on Discord. This is the standard "square peg to round hole" problem you get with Extract, Transform, Load (ETL) pipelines and AI dataset management (mostly taking your raw data and tokenizing it so that AI models work properly). So let's think about this from an AI dataset perspective. Our pipeline has three distinct steps:

1. Extracting the raw data from the upstream source and caching it in Tigris.
2. Transforming the cached data to make it easier to consume in Discord, storing that in Tigris again.
3. Loading the transformed data into Discord so that people can see the threads in app and on the web with Answer Overflow.

When thinking about gathering and transforming datasets, it's helpful to start by thinking about the modality of the data you're working with. Our dataset is mostly forum posts, which is structured text. One part of the structure contains HTML rendered by the forum engine. This, the "does this solve my question" flag, and the user ID of the person that posted the reply are the things we care the most about. I made a bucket for this (in typical recovering-former-SRE fashion it's named for a completely different project) with snapshots enabled, and then got cracking. Tigris snapshots will let me recover prior state in case I don't like my transformations.

When you are gathering data from one source in particular, one of the first things you need to do is ask permission from the administrator of that service. You don't know if your scraping could cause unexpected load leading to an outage. It's a classic tragedy-of-the-commons problem that I have a lot of personal experience in preventing. When you reach out, let the administrators know the data you want to scrape and the expected load – a lot of the time, they can give you a data dump, and you don't even need to write your scraper. We got approval for this project, so we're good to go!

To get a head start, I adapted an old package of mine to assemble User-Agent strings in such a way that gives administrators information about who is requesting data from their servers, along with contact information in case something goes awry. Here's an example User-Agent string: This gives administrators the following information:

- The name of the project associated with the requests (tigris-gtm-glue, where gtm means "go-to-market", which is the current in-vogue buzzword translation for whatever it is we do).
- The Go version, computer OS, and CPU architecture of the machine the program is running on, so that administrator complaints can be more easily isolated to individual machines.
- A contact URL for the workload; in our case it's just the Tigris home page.
- The name of the program doing the scraping, so that we can isolate root causes down even further. Specifically it's the last path element of the path the kernel passed to the executable.
- The hostname where the workload is being run, so that we can isolate down to an exact machine or Kubernetes pod. In my case it's the hostname of my work laptop.

This seems like a lot of information, but realistically it's not much more than the average Firefox install attaches to each request: The main difference is adding the workload hostname purely to help debugging a misbehaving workload.
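The example string itself didn't survive this excerpt, but a minimal Go sketch of assembling that kind of self-identifying User-Agent from the pieces listed above might look like this (the layout and helper name are illustrative, not the actual package):

```go
// scraperUserAgent builds a User-Agent that tells an administrator which
// project is scraping, what toolchain/OS/arch it runs on, how to reach us,
// which program made the request, and which host it ran from.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"
)

func scraperUserAgent(project, contactURL string) string {
	hostname, _ := os.Hostname()
	program := filepath.Base(os.Args[0]) // last path element of the executable path
	return fmt.Sprintf("%s (%s; %s/%s; +%s) %s on %s",
		project, runtime.Version(), runtime.GOOS, runtime.GOARCH,
		contactURL, program, hostname)
}

func main() {
	fmt.Println(scraperUserAgent("tigris-gtm-glue", "https://www.tigrisdata.com/"))
}
```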
This is a concession that makes each workload less anonymous, however keep in mind that when you are actively scraping data you are being seen as a foreign influence. Conceding more data than you need to is just being nice at that point. One of the other "good internet citizen" things to do when doing benign scraping is to try to reduce the amount of load you cause to the target server.

In my case the forum engine is a Rails app (Discourse), which means there's a few properties of Rails that work to my advantage. Fun fact about Rails: if you append .json to the end of a URL, you typically get a JSON response based on the inputs to the view. For example, consider my profile on Lobsters at https://lobste.rs/~cadey . If you instead head to https://lobste.rs/~cadey.json , you get a JSON view of my profile information. This means that a lot of the process involved gathering a list of URLs with the thread indices we wanted, then constructing the thread URLs with .json slapped on the end to get machine-friendly JSON back. This made my life so much easier.

Now that we have easy ways to get the data from the forum engine, the next step is to copy it out to Tigris directly after ingesting it. In order to do that I reused some code I made ages ago as a generic data storage layer, kinda like Keyv in the Node ecosystem. One of the storage backends was a generic object storage backend. I plugged Tigris into it and it worked on the first try. Good enough for me! Either way: this is the interface I used: By itself this isn't the most useful, however the real magic comes with my adaptor type. This uses Go generics to do type-safe operations on Tigris such that you have 90% of what you need for a database replacement. When you do any operations on an adaptor, the following happens:

- Key names get prefixed automatically.
- All data is encoded into JSON on write and decoded from JSON on read using the Go standard library.
- Type safety at the compiler level means the only way you can corrupt data is by having different "tables" share the same key prefix. Try not to do that! You can use Tigris bucket snapshots to help mitigate this risk in the worst case.

In the future I hope to extend this to include native facilities for forking, snapshots, and other nice-to-haves like an in-memory cache to avoid IOPs pressure, but for now this is fine.
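The actual interface and adaptor live in the glue repo, but a minimal sketch of the idea, with hypothetical names, looks something like this: one key prefix per "table", JSON on write, JSON on read, and a type parameter to keep each table type-safe (a map stands in for the object-storage backend here):

```go
// A toy version of a generics-based storage adaptor: keys get a per-table
// prefix, values are marshalled to/from JSON, and the type parameter keeps
// each table's contents type-safe at compile time.
package main

import (
	"encoding/json"
	"fmt"
)

// KV is the generic storage backend (the real one talks to object storage).
type KV interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
}

type memKV struct{ m map[string][]byte }

func (s *memKV) Get(k string) ([]byte, error) { return s.m[k], nil }
func (s *memKV) Set(k string, v []byte) error { s.m[k] = v; return nil }

// Adaptor is one "table": a key prefix plus JSON (de)serialization of T.
type Adaptor[T any] struct {
	prefix string
	kv     KV
}

func (a Adaptor[T]) Set(key string, value T) error {
	b, err := json.Marshal(value)
	if err != nil {
		return err
	}
	return a.kv.Set(a.prefix+"/"+key, b)
}

func (a Adaptor[T]) Get(key string) (T, error) {
	var out T
	b, err := a.kv.Get(a.prefix + "/" + key)
	if err != nil {
		return out, err
	}
	return out, json.Unmarshal(b, &out)
}

type Thread struct {
	Title  string `json:"title"`
	Solved bool   `json:"solved"`
}

func main() {
	kv := &memKV{m: map[string][]byte{}}
	threads := Adaptor[Thread]{prefix: "threads", kv: kv}
	_ = threads.Set("42", Thread{Title: "How do I set bucket ACLs?", Solved: true})
	t, _ := threads.Get("42")
	fmt.Printf("%+v\n", t)
}
```

As noted above, the failure mode is two tables sharing a prefix; bucket snapshots are the safety net for that.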
As the data was being read from the forum engine, it was saved into Tigris. All future lookups to the data I scraped happened from Tigris, meaning that the upstream server only had to serve the data I needed once instead of having to constantly re-load and re-reference it like the latest batch of abusive scrapers seem to do. So now that I have all the data, I need to do some massaging to comply both with Discord's standards and with some arbitrary limitations we set on ourselves:

- Discord needs Markdown; the forum engine posts are all HTML.
- We want to remove personally-identifiable information from those posts just to keep things a bit more anonymous.
- Discord has a limit of 2048 characters per message, and some posts will need to be summarized to fit within that window.

In general, this means I needed to take the raw data from the forum engine and streamline it down to this Go type: In order to make this happen, I ended up using a simple AI agent to do the cleanup. It was prompted to do the following:

- Convert HTML to Markdown: Okay, I could have gotten away with using a dedicated library for this like html2text, but I didn't think about that at the time.
- Remove mentions and names: Just strip them out or replace the mentions with generic placeholders ("someone I know", "a friend", "a colleague", etc.).
- Keep "useful" links: This was left intentionally vague, and random sampling showed that it was good enough.
- Summarize long text: If the text is over 1000 characters, summarize it to less than 1000 characters.

I figured this should be good enough, so I sent it to my local DGX Spark running GPT-OSS 120b via llama.cpp and manually looked at the output for a few randomly selected threads. The sample was legit, which is good enough for me. Once that was done I figured it would be better to switch from the locally hosted model to a model in a roughly equivalent weight class (gpt-5-mini). I assumed that the cloud model would be faster and slightly better in terms of its output. This test failed because I have somehow managed to write code that works great with llama.cpp on the Spark but results in errors using OpenAI's production models. I didn't totally understand what went wrong, but I didn't dig too deep because I knew that the local model would probably work well enough. It ended up taking about 10 minutes to chew through all the data, which was way better than I expected and continues to reaffirm my theory that GPT-OSS 120b is a good enough generic workhorse model, even if it's not the best at coding.

From here things worked: I was able to ingest things and made a test Discord to try things out without potentially getting things indexed. I had my tool test-migrate a thread to the test Discord and got a working result. To be fair, this worked way better than expected (I added random name generation, and as a result our CEO Ovais became Mr. Quinn Price for that test), but it felt like one thing was missing: avatars. Having everyone in the migrated posts use the generic "no avatar set" avatar certainly would work, but I feel like it would look lazy. Then I remembered that I also have an image generation model running on the Spark: Z-Image Turbo. Just to try it out, I adapted a hacky bit of code I originally wrote on stream while I was learning to use voice coding tools to generate per-user avatars based on the internal user ID. This worked way better than I expected when I tested how it would look with each avatar attached to their own users. In order to serve the images, I stored them in the same Tigris bucket, but set ACLs on each object so that they were public, meaning that the private data stayed private, but anyone can view the objects that were explicitly marked public when they were added to Tigris. This let me mix and match the data so that I only had one bucket to worry about. This reduced a lot of cognitive load, and I highly suggest that you repeat this pattern should you need this exact adaptor between this exact square peg and round hole combination.

Now that everything was working in development, it was time to see how things would break in production! In order to give the façade that every post was made by a separate user, I used a trick that my friend who wrote Pluralkit (an accessibility tool for a certain kind of neurodivergence) uses: using Discord webhooks to introduce multiple pseudo-users into one channel. I had never combined forum channels with webhook pseudo-users like this before, but it turned out to be way easier than expected. All I had to do was add the right parameter when creating a new thread and the right parameter when appending a new message to it. It was really neat and made it pretty easy to associate each thread ingressed from Discourse into its own Discord thread.
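The post doesn't name the parameters, but my understanding (worth double-checking against Discord's webhook docs) is that the Execute Webhook call takes `username` and `avatar_url` for the pseudo-user and `thread_name` to start a new forum thread, with follow-up messages targeting an existing thread via a `thread_id` query parameter instead. A sketch under those assumptions:

```go
// postAsPseudoUser executes a Discord webhook so the message appears to come
// from a made-up user, creating a new forum thread in the process.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func postAsPseudoUser(webhookURL, threadName, username, avatarURL, content string) error {
	payload, err := json.Marshal(map[string]string{
		"thread_name": threadName, // assumed: creates a new thread in a forum channel
		"username":    username,   // display name for this pseudo-user
		"avatar_url":  avatarURL,  // e.g. a public object in the same bucket
		"content":     content,
	})
	if err != nil {
		return err
	}
	resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("discord webhook returned %s", resp.Status)
	}
	return nil
}

func main() {
	err := postAsPseudoUser(
		"https://discord.com/api/webhooks/ID/TOKEN", // placeholder webhook URL
		"How do I set bucket ACLs?",
		"Quinn Price",
		"https://example.com/avatars/1234.png",
		"Imported from the old forum: here's how we solved it...",
	)
	fmt.Println(err)
}
```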
Then all that was left was to run the Big Scary Command™ and see what broke. A couple messages were too long (which was easy to fix by simply manually rewriting them, doing the right state layer brain surgery, deleting things on Discord, and re-running the migration tool). However, 99.9% of messages were correctly imported on the first try. I had to double check a few times, including the bog-standard wakefulness tests. If you've never gone deep into lucid dreaming before, a wakefulness test is where you do something obviously impossible to confirm that it does not happen, such as trying to put your fingers through your palm. My fingers did not go through my palm. After having someone else confirm that I wasn't hallucinating more than usual, I found out that my code did in fact work, and as a result you can now search through the archives on community.tigrisdata.com or via the MCP server! I consider that a massive success.

As someone who has seen many truly helpful answers get forgotten in the endless scroll of chats, I wanted to build a way to get that help in front of users when they need it by making it searchable outside of Discord. Finding AnswerOverflow was pure luck: I happened to know someone who uses it for the support Discord for the Linux distribution I use on my ROG Ally, Bazzite. Thanks, j0rge! AnswerOverflow also has an MCP server so that your agents can hook into our knowledge base to get the best answers. To find out more about setting it up, take a look at the "MCP Server" button on the Tigris Community page. They've got instructions for most MCP clients on the market. Worst case, configure your client to access this URL: And bam, your agent has access to the wisdom of the ancients.

But none of this is helpful without the actual answers. We were lucky enough to have existing Q&A in another forum to leverage. If you don't have the luxury, you can write your own FAQs and scenarios as a start. All I can say is, thank you to the folks who asked and answered these questions – we're happy to help, and know that you're helping other users by sharing. Connect with other developers, get help, and share your projects. Search our Q&A archives or ask a new question. Join the Discord.

0 views
Jim Nielsen 1 month ago

You Can Just Say No to the Data

“The data doesn’t lie.” I imagine that’s what the cigarette companies said. “The data doesn’t lie. People want this stuff. They’re buying it in droves. We’re merely giving them what they want.” Which sounds more like an attempt at exoneration than a reason to exist. Demand can be engineered. “We’re giving them what they want” ignores how desire is shaped, even engineered (algorithms, dark patterns, growth hacking, etc.). Appealing to data as the ultimate authority — especially when fueled by engineered desire — isn’t neutrality, it’s an abdication of responsibility. Satiating human desire is not the highest aspiration. We can do so much more than merely supply what the data says is in demand. Stated as a principle: Values over data. Data tells you what people consume, not what you should make. Values, ethics, vision, those can help you with the “should”. “What is happening?” and “What should happen?” are two completely different questions and should be dealt with as such. The more powerful our ability to understand demand, the more important our responsibility to decide whether to respond to it. We can choose not to build something, even though the data suggests we should. We can say no to the data. Data can tell you what people clicked on, even help you predict what people will click on, but you get to decide what you will profit from. Reply via: Email · Mastodon · Bluesky

0 views

Lessons from Building AI Agents for Financial Services

I’ve spent the last two years building AI agents for financial services. Along the way, I’ve accumulated a fair number of battle scars and learnings that I want to share. Here’s what I’ll cover:

- The Sandbox Is Not Optional - Why isolated execution environments are essential for multi-step agent workflows
- Context Is the Product - How we normalize heterogeneous financial data into clean, searchable context
- The Parsing Problem - The hidden complexity of extracting structured data from adversarial SEC filings
- Skills Are Everything - Why markdown-based skills are becoming the product, not the model
- The Model Will Eat Your Scaffolding - Designing for obsolescence as models improve
- The S3-First Architecture - Why S3 beats databases for file storage and user data
- The File System Tools - How ReadFile, WriteFile, and Bash enable complex financial workflows
- Temporal Changed Everything - Reliable long-running tasks with proper cancellation handling
- Real-Time Streaming - Building responsive UX with delta updates and interactive agent workflows
- Evaluation Is Not Optional - Domain-specific evals that catch errors before they cost money
- Production Monitoring - The observability stack that keeps financial agents reliable

Why financial services is extremely hard. This domain doesn’t forgive mistakes. Numbers matter. A wrong revenue figure, a misinterpreted guidance statement, an incorrect DCF assumption. Professional investors make million-dollar decisions based on our output. One mistake on a $100M position and you’ve destroyed trust forever. The users are also demanding. Professional investors are some of the smartest, most time-pressed people you’ll ever work with. They spot bullshit instantly. They need precision, speed, and depth. You can’t hand-wave your way through a valuation model or gloss over nuances in an earnings call. This forces me to develop an almost paranoid attention to detail. Every number gets double-checked. Every assumption gets validated. Every model gets stress-tested. You start questioning everything the LLM outputs because you know your users will. A single wrong calculation in a DCF model and you lose credibility forever. I sometimes feel that the fear of being wrong has become our best feature.

Over the years building with LLMs, we’ve made bold infrastructure bets early, and I think we have been right. For instance, when Claude Code launched with its filesystem-first agentic approach, we immediately adopted it. It was not an obvious bet and it was a massive revamp of our architecture. I was extremely lucky to have Thariq from Anthropic’s Claude Code team jump on a Zoom and open my eyes to the possibilities. At the time the whole industry, including Fintool, was building elaborate RAG pipelines with vector databases and embeddings. After reflecting on the future of information retrieval with agents I wrote “the RAG obituary” and Fintool moved fully to agentic search. We even decided to retire our precious embedding pipeline. Sad, but whatever is best for the future! People thought we were crazy. The article got a lot of praise and a lot of negative comments. Now I feel most startups are adopting these best practices. I believe we’re early on several other architectural choices too. I’m sharing them here because the best way to test ideas is to put them out there. Let’s start with the biggest one.

When we first started building Fintool in 2023, I thought sandboxing might be overkill. “We’re just running Python scripts,” I told myself. “What could go wrong?” Haha. Everything. Everything could go wrong. The first time an LLM decided to `rm -rf /` on our server (it was trying to “clean up temporary files”), I became a true believer. Here’s the thing: agents need to run multi-step operations. A professional investor asks for a DCF valuation, and that’s not a single API call. The agent needs to research the company, gather financial data, build a model in Excel, run sensitivity analysis, generate complex charts, iterate on assumptions. That’s dozens of steps, each potentially modifying files, installing packages, running scripts. You can’t do this without code execution. And executing arbitrary code on your servers is insane. Every chat application needs a sandbox. Today each user gets their own isolated environment. The agent can do whatever it wants in there. Delete everything? Fine. Install weird packages? Go ahead. It’s your sandbox, knock yourself out.

The architecture looks like this: Three mount points. Private is read/write for your stuff. Shared is read-only for your organization. Public is read-only for everyone. The magic is in the credentials. We use AWS ABAC (Attribute-Based Access Control) to generate short-lived credentials scoped to specific S3 prefixes. User A literally cannot access User B’s data. The IAM policy uses `${aws:PrincipalTag/S3Prefix}` to restrict access. The credentials physically won’t allow it. This is also very good for Enterprise deployment.
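A sketch of what that kind of ABAC statement can look like (the bucket name and prefix layout here are made up; the `${aws:PrincipalTag/S3Prefix}` policy variable is the mechanism described above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PrivatePrefixOnly",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::example-sandbox-bucket/${aws:PrincipalTag/S3Prefix}/*"
    },
    {
      "Sid": "ListOwnPrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::example-sandbox-bucket",
      "Condition": {
        "StringLike": { "s3:prefix": "${aws:PrincipalTag/S3Prefix}/*" }
      }
    }
  ]
}
```

Because the short-lived session credentials carry the user’s prefix as a principal tag, any request outside that prefix fails at the IAM layer, which is what “the credentials physically won’t allow it” means in practice.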
We also do sandbox pre-warming. When a user starts typing, we spin up their sandbox in the background. By the time they hit enter, the sandbox is ready. There’s a 600-second timeout, extended by 10 minutes on each tool use. The sandbox stays warm across conversation turns. So sandboxes are amazing, but the under-discussed magic of sandboxes is their support for the filesystem. Which brings us to the next lesson learned, about context.

Your agent is only as good as the context it can access. The real work isn’t prompt engineering; it’s turning messy financial data from dozens of sources into clean, searchable context the model can actually use. This requires massive domain expertise from the engineering team.

The heterogeneity problem. Financial data comes in every format imaginable:

- SEC filings: HTML with nested tables, exhibits, signatures
- Earnings transcripts: Speaker-segmented text with Q&A sections
- Press releases: Semi-structured HTML from PRNewswire
- Research reports: PDFs with charts and footnotes
- Market data: Snowflake/databases with structured numerical data
- News: Articles with varying quality and structure
- Alternative data: Satellite imagery, web traffic, credit card panels
- Broker research: Proprietary PDFs with price targets and models
- Fund filings: 13F holdings, proxy statements, activist letters

Each source has different schemas, different update frequencies, different quality levels. The agent needs one thing: clean context it can reason over.

The normalization layer. Everything becomes one of three formats:

- Markdown for narrative content (filings, transcripts, articles)
- CSV/tables for structured data (financials, metrics, comparisons)
- JSON metadata for searchability (tickers, dates, document types, fiscal periods)

Chunking strategy matters. Not all documents chunk the same way:

- 10-K filings: Sectioned by regulatory structure (Item 1, 1A, 7, 8...)
- Earnings transcripts: Chunk by speaker turn (CEO remarks, CFO remarks, Q&A by analyst)
- Press releases: Usually small enough to be one chunk
- News articles: Paragraph-level chunks
- 13F filings: By holder and position changes quarter-over-quarter

The chunking strategy determines what context the agent retrieves. Bad chunks = bad answers.

Tables are special. Financial data is full of tables and CSVs. Revenue breakdowns, segment performance, guidance ranges. LLMs are surprisingly good at reasoning over markdown tables, but they're terrible at reasoning over HTML `<table>` tags or raw CSV dumps. The normalization layer converts everything to clean markdown tables.

Metadata enables retrieval. The user asks the agent: "What did Apple say about services revenue in their last earnings call?" To answer this, Fintool needs:

- Ticker resolution (AAPL → correct company)
- Document type filtering (earnings transcript, not 10-K)
- Temporal filtering (most recent, not 2019)
- Section targeting (CFO remarks or revenue discussion, not legal disclaimers)

This is why `meta.json` exists for every document. Without structured metadata, you're doing keyword search over a haystack. It speeds up the search, big time! Anyone can call an LLM API. Not everyone has normalized decades of financial data into searchable, chunked markdown with proper metadata. The data layer is what makes agents actually work.

The Parsing Problem

Normalizing financial data is 80% of the work. Here's what nobody tells you. SEC filings are adversarial. They're not designed for machine reading. They're designed for legal compliance:

- Tables span multiple pages with repeated headers
- Footnotes reference exhibits that reference other footnotes
- Numbers appear in text, tables, and exhibits—sometimes inconsistently
- XBRL tags exist but are often wrong or incomplete
- Formatting varies wildly between filers (every law firm has their own template)

We tried off-the-shelf PDF/HTML parsers. They failed on:

- Multi-column layouts in proxy statements
- Nested tables in MD&A sections (tables within tables within tables)
- Watermarks and headers bleeding into content
- Scanned exhibits (still common in older filings and attachments)
- Unicode issues (curly quotes, em-dashes, non-breaking spaces)

The Fintool parsing pipeline:

Raw Filing (HTML/PDF)
↓
Document structure detection (headers, sections, exhibits)
↓
Table extraction with cell relationship preservation
↓
Entity extraction (companies, people, dates, dollar amounts)
↓
Cross-reference resolution (Ex. 10.1 → actual exhibit content)
↓
Fiscal period normalization (FY2024 → Oct 2023 to Sep 2024 for Apple)
↓
Quality scoring (confidence per extracted field)

Table extraction deserves its own work. Financial tables are dense with meaning. A revenue breakdown table might have:

- Merged header cells spanning multiple columns
- Footnote markers (1), (2), (a), (b) that reference explanations below
- Parentheses for negative numbers: $(1,234) means -1234
- Mixed units in the same table (millions for revenue, percentages for margins)
- Prior period restatements in italics or with asterisks

We score every extracted table on:

- Cell boundary accuracy (did we split/merge correctly?)
- Header detection (is row 1 actually headers, or is there a title row above?)
- Numeric parsing (is "$1,234" parsed as 1234 or left as text?)
- Unit inference (millions? billions? per share? percentage?)

Tables below 90% confidence get flagged for review.
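A minimal sketch of what that gate can look like in code (the field names and the min-based aggregation are illustrative assumptions, not our production scorer):

```python
from dataclasses import dataclass

@dataclass
class TableScore:
    cell_boundaries: float   # did we split/merge cells correctly?
    header_detection: float  # is row 1 actually headers?
    numeric_parsing: float   # "$1,234" -> 1234.0, "(1,234)" -> -1234.0
    unit_inference: float    # millions vs. billions vs. per share

    @property
    def confidence(self) -> float:
        # The weakest dimension dominates: one bad axis ruins the whole table.
        return min(self.cell_boundaries, self.header_detection,
                   self.numeric_parsing, self.unit_inference)

CONFIDENCE_THRESHOLD = 0.90

def admit_table(table_markdown: str, score: TableScore) -> str | None:
    """Return the table for the agent's context, or None to flag it for review."""
    if score.confidence < CONFIDENCE_THRESHOLD:
        return None  # flagged for human review; never enters context
    return table_markdown
```

Taking the minimum rather than an average is one conservative choice among several; the point is that the gate sits between extraction and the agent's context.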
Low-confidence extractions don't enter the agent's context—garbage in, garbage out.

Fiscal period normalization is critical. "Q1 2024" is ambiguous:

- Calendar Q1 (January-March 2024)
- Apple's fiscal Q1 (October-December 2023)
- Microsoft's fiscal Q1 (July-September 2023)
- "Reported in Q1" (filed in Q1, but covers the prior period)

We maintain a fiscal calendar database for 10,000+ companies. Every date reference gets normalized to absolute date ranges. When the agent retrieves "Apple Q1 2024 revenue," it knows to look for data from October-December 2023. This is invisible to users but essential for correctness. Without it, you're comparing Apple's October revenue to Microsoft's January revenue and calling it "same quarter."

Skills Are Everything

Here's the thing nobody tells you about building AI agents: the model is not the product. The skills are now the product. I learned this the hard way. We used to try making the base model "smarter" through prompt engineering. Tweak the system prompt, add examples, write elaborate instructions. It helped a little. But skills were the missing part.

In October 2025, Anthropic formalized this with Agent Skills, a specification for extending Claude with modular capability packages. A skill is a folder containing a `SKILL.md` file with YAML frontmatter (name and description), plus any supporting scripts, references, or data files the agent might need. We'd been building something similar for months before the announcement. The validation felt good, but more importantly, having an industry standard means our skills can eventually be portable.

Without skills, models are surprisingly bad at domain tasks. Ask a frontier model to do a DCF valuation. It knows what DCF is. It can explain the theory. But actually executing one? It will miss critical steps, use wrong discount rates for the industry, forget to add back stock-based compensation, skip sensitivity analysis. The output looks plausible but is subtly wrong in ways that matter.

The breakthrough came when we started thinking about skills as first-class citizens. Like part of the product itself. A skill is a markdown file that tells the agent how to do something specific. Here's a simplified version of our DCF skill:
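(An illustrative sketch of the shape such a file takes; the frontmatter fields match the spec above, but the steps are condensed and hypothetical, not the full methodology.)

```markdown
---
name: dcf-valuation
description: Build a discounted cash flow valuation for a public company, with sensitivity analysis.
---

# DCF Valuation

1. Pull the last 5 years of revenue, EBIT, D&A, capex, and working capital from the latest 10-K.
2. Project free cash flow for 5 years; state every growth and margin assumption explicitly.
3. Add back stock-based compensation consistently (do not skip this).
4. Estimate WACC from the company's capital structure; sanity-check it against industry norms.
5. Compute terminal value (Gordon growth and exit multiple); flag if the two disagree by more than 20%.
6. Run a sensitivity table across WACC and terminal growth.
7. Cite the source document for every input number.
```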
That's it. A markdown file. No code changes. No production deployment. Just a file that tells the agent what to do.

Skills are better than code. This matters enormously:

1. Non-engineers can create skills. Our analysts write skills. Our customers write skills. A portfolio manager who's done 500 DCF valuations can encode their methodology in a skill without writing a single line of Python.

2. No deployment needed. Change a skill file and it takes effect immediately. No CI/CD, no code review, no waiting for release cycles. Domain experts can iterate on their own.

3. Readable and auditable. When something goes wrong, you can read the skill and understand exactly what the agent was supposed to do. Try doing that with a 2,000-line Python module.

We have a copy-on-write shadowing system: Priority: private > shared > public. So if you don't like how we do DCF valuations, write your own. Drop it in `/private/skills/dcf/SKILL.md`. Your version wins.

Why we don't mount all skills to the filesystem. This is important. The naive approach would be to mount every skill file directly into the sandbox. The agent can just `cat` any skill it needs. Simple, right? Wrong. Here's why we use SQL discovery instead:

1. Lazy loading. We have dozens of skills with extensive documentation; the DCF skill alone has 10+ industry guideline files. Loading all of them into context for every conversation would burn tokens and confuse the model. Instead, we discover skill metadata (name, description) upfront, and only load the full documentation when the agent actually uses that skill.

2. Access control at query time. The SQL query implements our three-tier access model: public skills available to everyone, organization skills for that org's users, private skills for individual users. The database enforces this. You can't accidentally expose a customer's proprietary skill to another customer.

3. Shadowing logic. When a user customizes a skill, their version needs to override the default. SQL makes this trivial—query all three levels, apply priority rules, return the winner. Doing this with filesystem mounts would be a nightmare of symlinks and directory ordering.

4. Metadata-driven filtering. The `fs_files.metadata` column stores parsed YAML frontmatter. We can filter by skill type, check if a skill is main-agent-only, or query any other structured attribute—all without reading the files themselves.

The pattern: S3 is the source of truth, a Lambda function syncs changes to PostgreSQL for fast queries, and the agent gets exactly what it needs when it needs it.

Skills are essential. I cannot emphasize this enough. If you're building an AI agent and you don't have a skills system, you're going to have a bad time. My biggest argument for skills is that top models (Claude or GPT) are post-trained on using Skills. The model wants to fetch skills. Models just want to learn, and what they want to learn is our skills... Until they ate it.

The Model Will Eat Your Scaffolding

Here's the uncomfortable truth: everything I just told you about skills? It's temporary, in my opinion. Models are getting better. Fast. Every few months, there's a new model that makes half your code obsolete. The elaborate scaffolding you built to handle edge cases? The model just... handles them now. When we started, we needed detailed skills with step-by-step instructions for some simple tasks. "First do X, then do Y, then check Z." Now? For a simple task we can often just say "do an earnings preview" and the model figures it out (kind of!).

This creates a weird tension. You need skills today because current models aren't smart enough. But you should design your skills knowing that future models will need less hand-holding. That's why I'm bullish on markdown files versus code for model instructions. It's easier to update and delete.

We send detailed feedback to AI labs. Whenever we build complex scaffolding to work around model limitations, we document exactly what the model struggles with and share it with the lab research team. This helps inform the next generation of models. The goal is to make our own scaffolding obsolete.
The S3-First Architecture

Here's something that surprised me: S3 for files is a better database than a database. We store user data (watchlists, portfolio, preferences, memories, skills) in S3 as YAML files. S3 is the source of truth. A Lambda function syncs changes to PostgreSQL for fast queries.

Writes → S3 (source of truth)
↓
Lambda trigger
↓
PostgreSQL (fs_files table)
↓
Reads ← Fast queries

Why?

- Durability: S3 has 11 9's. A database doesn't.
- Versioning: S3 versioning gives you audit trails for free
- Simplicity: YAML files are human-readable. You can debug with `cat`.
- Cost: S3 is cheap. Database storage is not.

The pattern:

- Writes go to S3 directly
- List queries hit the database (fast)
- Single-item reads go to S3 (freshest data)

The sync architecture. We run two Lambda functions to keep S3 and PostgreSQL in sync:

S3 (file upload/delete)
↓
SNS Topic
↓
fs-sync Lambda → Upsert/delete in fs_files table (real-time)

EventBridge (every 3 hours)
↓
fs-reconcile Lambda → Full S3 vs DB scan, fix discrepancies

Both use upsert with timestamp guards—newer data always wins. The reconcile job catches any events that slipped through (S3 eventual consistency, Lambda cold starts, network blips).

User memories live here too. Every user has a `/private/memories/UserMemories.md` file in S3. It's just markdown—users can edit it directly in the UI. On every conversation, we load it and inject it as context. This is surprisingly powerful. Users write things like "I focus on small-cap value stocks" or "Always compare to industry median, not mean" or "My portfolio is concentrated in tech, so flag concentration risk." The agent sees this on every conversation and adapts accordingly. No migrations. No schema changes. Just a markdown file that the user controls.

Watchlists work the same way. YAML files in S3, synced to PostgreSQL for fast queries. When a user asks about "my watchlist," we load the relevant tickers and inject them as context. The agent knows what companies matter to this user. The filesystem becomes the user's personal knowledge base. Skills tell the agent how to do things. Memories tell it what the user cares about. Both are just files.

The File System Tools

Agents in financial services need to read and write files. A lot of files. PDFs, spreadsheets, images, code. Here's how we handle it. ReadFile handles the complexity of all those formats. WriteFile creates artifacts that link back to the UI. Bash gives persistent shell access with a 180-second timeout and a 100K-character output limit. Path normalization on everything (LLMs love trying path traversal attacks, it's hilarious).
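To give a feel for the tool surface, here is a rough sketch of what ReadFile and WriteFile can look like behind the scenes (hypothetical signatures and a hypothetical /workspace mount, not our production definitions):

```python
from pathlib import Path

SANDBOX_ROOT = Path("/workspace")  # hypothetical sandbox mount point

def normalize(path: str) -> Path:
    """Resolve the path inside the sandbox and reject traversal attempts."""
    resolved = (SANDBOX_ROOT / path).resolve()
    if not resolved.is_relative_to(SANDBOX_ROOT):
        raise ValueError(f"path escapes sandbox: {path}")
    return resolved

def read_file(path: str, max_bytes: int = 1_000_000) -> str:
    """ReadFile: return file contents, truncated so huge filings don't blow up context."""
    data = normalize(path).read_bytes()[:max_bytes]
    return data.decode("utf-8", errors="replace")

def write_file(path: str, content: str) -> dict:
    """WriteFile: persist an artifact and return metadata the UI can link to."""
    target = normalize(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return {"path": str(target), "bytes": len(content.encode("utf-8"))}
```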
Bash is more important than you think. There's a growing conviction in the AI community that filesystems and bash are the optimal abstraction for AI agents. Braintrust recently ran an eval comparing SQL agents, bash agents, and hybrid approaches for querying semi-structured data. The results were interesting: pure SQL hit 100% accuracy but missed edge cases. Pure bash was slower and more expensive but caught verification opportunities. The winner? A hybrid approach where the agent uses bash to explore and verify, SQL for structured queries.

This matches our experience. Financial data is messy. You need bash to grep through filing documents, find patterns, explore directory structures. But you also need structured tools for the heavy lifting. The agent needs both—and the judgment to know when to use each. We've leaned hard into giving agents full shell access in the sandbox. It's not just for running Python scripts. It's for exploration, verification, and the kind of ad-hoc data manipulation that complex tasks require. But complex tasks mean long-running agents. And long-running agents break everything.

Temporal Changed Everything

Before Temporal, our long-running tasks were a disaster. A user asks for a comprehensive company analysis. That takes 5 minutes. What if the server restarts? What if the user closes the tab and comes back? What if... anything? We had a homegrown job queue. It was bad. Retries were inconsistent. State management was a nightmare. Then we switched to Temporal and I wanted to cry tears of joy! Temporal handles worker crashes, retries, everything. If a Heroku dyno restarts mid-conversation (happens all the time lol), Temporal automatically retries on another worker. The user never knows.

The cancellation handling is the tricky part. A user clicks "stop"; what happens? The activity is already running on a different server. We use heartbeats sent every few seconds. We run two worker types:

- Chat workers: User-facing, 25 concurrent activities
- Background workers: Async tasks, 10 concurrent activities

They scale independently. Chat traffic spikes? Scale chat workers. Next is speed.

Real-Time Streaming

In finance, people are impatient. They're not going to wait 30 seconds staring at a loading spinner. They need to see something happening. So we built real-time streaming. The agent works, you see the progress.

Agent → SSE Events → Redis Stream → API → Frontend

The key insight: delta updates, not full state. Instead of sending "here's the complete response so far" (expensive), we send "append these 50 characters" (cheap).

Streaming rich content with Streamdown. Text streaming is table stakes. The harder problem is streaming rich content: markdown with tables, charts, citations, math equations. We use Streamdown to render markdown as it arrives, with custom plugins for our domain-specific components. Charts render progressively. Citations link to source documents. Math equations display properly with KaTeX. The user sees a complete, interactive response building in real-time.

AskUserQuestion: Interactive agent workflows. Sometimes the agent needs user input mid-workflow. "Which valuation method do you prefer?" "Should I use consensus estimates or management guidance?" "Do you want me to include the pipeline assets in the valuation?" We built an `AskUserQuestion` tool that lets the agent pause, present options, and wait for an answer. When the agent calls this tool, the agentic loop intercepts it, saves state, and presents a UI to the user. The user picks an option (or types a custom answer), and the conversation resumes with their choice. This transforms agents from autonomous black boxes into collaborative tools. The agent does the heavy lifting, but the user stays in control of key decisions. Essential for high-stakes financial work where users need to validate assumptions.

Evaluation Is Not Optional

"Ship fast, fix later" works for most startups. It does not work for financial services. A wrong earnings number can cost someone money. A misinterpreted guidance statement can lead to bad investment decisions. You can't just "fix it later" when your users are making million-dollar decisions based on your output. We use Braintrust for experiment tracking. Every model change, every prompt change, every skill change gets evaluated against a test set. Generic NLP metrics (BLEU, ROUGE) don't work for finance. A response can be semantically similar but have completely wrong numbers. Building eval datasets is harder than building the agent.
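To make "test case" concrete before getting into the categories below, here is the rough shape one of these domain evals can take (a sketch with hypothetical names, not our actual harness or Braintrust's API):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    query: str
    context_docs: list[str]   # documents injected as context (may include planted traps)
    expected_value: float     # ground-truth number
    expected_unit: str        # "USD billions", "per share", ...
    expected_source: str      # document the answer must cite

def grade(case: EvalCase, answer_value: float, answer_unit: str, cited_source: str) -> bool:
    """A response passes only if the number, the unit, and the citation are all right."""
    numbers_match = abs(answer_value - case.expected_value) < 1e-6
    units_match = answer_unit == case.expected_unit
    grounded = cited_source == case.expected_source
    return numbers_match and units_match and grounded

# In the spirit of the adversarial-grounding tests described below:
case = EvalCase(
    query="What was Apple's revenue last fiscal year?",
    context_docs=["fake_analyst_note.md", "aapl_10k.md"],  # the fake note plants a wrong figure
    expected_value=94.0,
    expected_unit="USD billions",
    expected_source="aapl_10k.md",
)
```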
We maintain ~2,000 test cases across categories:

Ticker disambiguation. This is deceptively hard:

- "Apple" → AAPL, not APLE (Apple Hospitality REIT)
- "Meta" → META, not MSTR (which some people call "meta")
- "Delta" → DAL (airline) or is the user talking about delta hedging (options term)?

The really nasty cases are ticker changes. Facebook became META in 2021. Google restructured under GOOG/GOOGL. Twitter became X (but kept the legal entity). When a user asks "What happened to Facebook stock in 2023?", you need to know that FB → META, and that historical data before Oct 2021 lives under the old ticker. We maintain a ticker history table and test cases for every major rename in the last decade.

Fiscal period hell. This is where most financial agents silently fail:

- Apple's Q1 is October-December (fiscal year ends in September)
- Microsoft's Q2 is October-December (fiscal year ends in June)
- Most companies' Q1 is January-March (calendar year)

"Last quarter" on January 15th means:

- Q4 2024 for calendar-year companies
- Q1 2025 for Apple (they just reported)
- Q2 2025 for Microsoft (they're mid-quarter)

We maintain fiscal calendars for 10,000+ companies. Every period reference gets normalized to absolute date ranges. We have 200+ test cases just for period extraction.

Numeric precision. Revenue of $4.2B vs $4,200M vs $4.2 billion vs "four point two billion." All equivalent. But "4.2" alone is wrong—missing units. Is it millions? Billions? Per share? We test unit inference, magnitude normalization, and currency handling. A response that says "revenue was 4.2" without units fails the eval, even if 4.2B is correct.

Adversarial grounding. We inject fake numbers into context and verify the model cites the real source, not the planted one. Example: We include a fake analyst report stating "Apple revenue was $50B" alongside the real 10-K showing $94B. If the agent cites $50B, it fails. If it cites $94B with proper source attribution, it passes. We have 50 test cases specifically for hallucination resistance.

Eval-driven development. Every skill has a companion eval. The DCF skill has 40 test cases covering WACC edge cases, terminal value sanity checks, and stock-based compensation add-backs (models forget this constantly). PR blocked if eval score drops >5%. No exceptions.

Production Monitoring

Our production setup is built around a few pieces. We auto-file GitHub issues for production errors. An error happens, and an issue gets created with full context: conversation ID, user info, traceback, links to Braintrust traces and Temporal workflows. Paying customers get a `priority:high` label. Model routing by complexity: simple queries use Haiku (cheap), complex analysis uses Sonnet (expensive). Enterprise users always get the best model.

The biggest lesson isn't about sandboxes or skills or streaming. It's this: The model is not your product. The experience around the model is your product. Anyone can call Claude or GPT. The API is the same for everyone. What makes your product different is everything else: the data you have access to, the skills you've built, the UX you've designed, the reliability you've engineered, and frankly how well you know the industry, which is a function of how much time you spend with your customers.

Models will keep getting better. That's great! It means less scaffolding, less prompt engineering, less complexity. But it also means the model becomes more of a commodity. Your moat is not the model. Your moat is everything you build around it.
For us, that's financial data, domain-specific skills, real-time streaming, and the trust we've built with professional investors. What's yours? Thanks for reading! Subscribe for free to receive new posts and support my work.

0 views
ava's blog 1 month ago

privacy is a value we can lose

Sometimes, I think about the fact that society at large could just stop caring about data protection and privacy, and there goes everything that I worked towards and am passionate about. Humbling. These laws are young. Not that people didn't want privacy before, it's just that as more data was collected, recorded and then processed via the earliest information processing systems (punch card systems and early computers), more needed to be protected. The more that is written down and stored, the more this need arises. 1890 saw the right to privacy emerge in the US. Later on, people were understandably more wary of governments collecting data on citizens after WWI and WWII. Still, the world's oldest data protection law is from 1970. Important law work around the idea of protecting personal data happened from then onwards - Germany's Volkszählungsurteil 1983, the US' HIPAA 1996, the EU's 1995 Data Protection Directive and 2002 ePrivacy Directive, the General Data Protection Regulation (GDPR) being passed in 2016 and going into effect in 2018, just to name the big, well-known ones. But governments, priorities and views change. This could be a blip in history. You can already see a sort of resignation in many people ("They track us all anyway, what does it matter? I have nothing to hide." etc.) and the selling of data is becoming a very established and acceptable practice. The air around it is sort of like: "Oh well, we want to use these services and advertise on them, and they have a lot of costs associated with hosting billions of users and we want to target our ads better. If that's the price we have to pay, so be it." I already wrote about data being the cookie jar, where even non-data-broker businesses now want your data to sell as an additional income stream. Everyone nowadays has an incentive to collect as much data as possible, not just to sell it to AI companies for a good sum, but also to potentially train their own AI. Businesses feel pressured to implement AI into anything they can, which raises the risk of employees entering sensitive data into it and sending it straight to OpenAI et al. for training - that is, if they aren't using unapproved, so-called "shadow AI" on shady websites or wrappers, where it's unclear who receives the data. Data protection officers and other privacy professionals feel coerced into going along with some dicey setups and risky processing activities because they can't afford to be seen as a Luddite who will advise against everything and "hinder progress", aka cost saving via AI replacement. Even I was told that I should probably make a LinkedIn account so companies would get a signal that I am not the "activist type" and I'd have a higher chance of being hired! Governments are also shifting further right and toward fascism in a lot of places, which goes hand-in-hand with fewer protections, deregulation, and increased surveillance and criminalization. These types of parties and leaders do not care about upholding privacy if it means they get to target groups more easily - just look at how ICE tracks people in the US. I wrote about the EU's Digital Omnibus a while ago, which threatens to severely weaken the GDPR. The parties backing this deregulation and even asking for more are far-right parties and fascist tech bros. The unfortunate reality is: What would have raised eyebrows just 10-20 years ago is shrugged at now. We got used to a level of data harvesting that used to be unacceptable.
I wonder sometimes if, or rather when, we will reach a point at which privacy is no longer even valued on paper. A point at which EU governments value total surveillance under the guise of digitalization, immigration control and protection of kids over heeding the EU Charter of Fundamental Rights, specifically Articles 7 and 8, which guarantee the right to respect for private and family life and protect personal data. A point at which the majority sees so much value in extreme data harvesting tech like social media, smartphones and AI that no cost is too great and they'd rather give up privacy than lose access or have a slightly worse tool. People sometimes say to me that this field is so safe, as tech use will only increase and always be deeply integrated into our lives. I wouldn't be so sure about the first part; it assumes that everyone will always see personal data as worth protecting, and I don't think that's a given. Privacy and control over your own data are not natural laws; they are a social and political choice that only exists as long as people care enough to defend it. Reply via email Published 24 Jan, 2026

0 views
ava's blog 1 month ago

i'm looking for work!

I'm currently employed full-time working with pharmaceutical databases, but I'm looking to shift into job roles centered around Data Protection Law, like Compliance and Privacy, or Data Governance, preferably in the 📍Nuremberg/Erlangen/Fürth area 🇩🇪, which I'm relocating to from NRW. My current role in drug regulatory affairs has already given me hands-on experience with highly regulated environments and sensitive data, which is a strong foundation that I'm bringing into the new role. This could be...

- Data Protection Officer
- Privacy/Data Protection Consultant
- Compliance/Regulatory Counsel (Privacy)
- Data Governance Manager

... or similar roles! :) In October 2025, I finished a 1.5-year advanced studies program in 6 months to be a certified consultant in data protection law. Aside from that, I've been a part-time student pursuing a Bachelor of Laws (LL.B) at a distance-learning university since 2022, and I'm over halfway done. I'm looking to add a Master's in Data Protection Law in the future. In my free time, I write this blog, particularly about data protection law and tech. I also volunteer as a Country Reporter for noyb.eu on their GDPRhub project, translating and summarizing court cases pertaining to national and European data protection law, specifically German and Austrian cases. You can see my current list of contributions here, and there are more to come. When possible, I also attend events and conferences, like the 2nd Beschäftigtendatenschutztag 2025 in Munich. I'm very passionate about the work and love to self-teach and research. I'm particularly interested in working within a team in a hybrid working setup, with a regular in-office presence to collaborate and learn. That said, I remain open to fully remote roles if the role and organization are a good match. Looking ahead, I would be very open to pursuing additional professional certifications where they are relevant to the role, such as the AIGP or ISO 27001 Lead Implementer. This is a snapshot of what I'm currently working toward and excited about! If you think my profile could be a good fit, or if you're working in this space and feel like exchanging notes, or just know people who do, I'm always happy to hear from you. Published 10 Jan, 2026, last updated 12 hours, 16 minutes ago.

7 views
Manuel Moreale 1 month ago

How Do You Read My Content

Recently, Kev posted a survey on his site to figure out how people access his content. Big fan of asking people directly and the results are not at all surprising to me. As I said to him, RSS traffic on my server is VERY high. But it's fun to get more datapoints so I created a similar survey and I'd really appreciate it if you could take probably 10 seconds to answer it. It's literally 1 question. I'll keep the form live for a week and then publish the results. Thank you :) Thank you for keeping RSS alive. You're awesome. Email me :: Sign my guestbook :: Support for 1$/month :: See my generous supporters :: Subscribe to People and Blogs

0 views
Christian Jauvin 2 months ago

Metaphysical Boldness

Digital physics is the body of mathematical and philosophical work treating the universe and the way it works as a giant digital computer. This is often associated with cellular automata, and names like Konrad Zuse, John von Neumann, Stephen Wolfram, etc. What I find fascinating about this field is that the models it suggests are making very deep metaphysical claims: if they are true, it means that the underlying structure of the world is much different than we think, and radically simpler in a sense. Take the lattice gas automaton for instance. A version of it is a hexagonal cellular automaton with very simple collision rules, not more complicated than the famous Rule 30 or 110, for 1D cellular automata. The impressive thing about it is that a simulation running this rule with many particles can be shown to approximate the Navier-Stokes equations, which are the classical complicated mathematics used to describe the dynamics of fluids. Following Wolfram, I find it very appealing to consider the idea that the world is not somehow running "hidden mathematics", somewhere and somehow, to solve some complicated equations in a seemingly magical way, but rather, that things are radically simpler, in that the world is simply implementing a set of trivially simple rules. The world is not concerned with, or made of, mathematics; mathematics just emerges, with inherent and irreducible complexity, from extreme simplicity.
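For a sense of just how simple such rules are, here is a minimal sketch of the 1D Rule 30 update (standard definition, periodic boundary; the variable names are my own):

```python
def rule30_step(cells: list[int]) -> list[int]:
    """One update of Rule 30: each cell looks at (left, center, right) neighbors."""
    n = len(cells)
    out = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # Rule 30: new cell = left XOR (center OR right)
        out.append(left ^ (center | right))
    return out

# Start from a single live cell and watch complexity emerge from a trivial rule.
row = [0] * 31
row[15] = 1
for _ in range(15):
    print("".join("#" if c else "." for c in row))
    row = rule30_step(row)
```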

0 views
ava's blog 3 months ago

my data should not be your cookie jar

It’s 1970. You walk into the store, grab a bunch of apples, go to the cash register, pay with cash, and walk out. What kinds of data have been automatically processed about you while doing that? Very little. Most likely, none, as CCTV footage relied on the development of VHS to be viable, and IP cameras transmitting video over networks only took off in the 90s. Fast forward to today. Depending on where you live, your supermarket has good cameras everywhere; some, like the super fancy new experiments, have recognition technology that detects what items you grab so that you can just pay without scanning, or even just walk out, having it subtracted automatically from your account. This isn’t just Amazon stores; German store Rewe is trying to get into that too, as I know someone personally who works in their sub-company Lekkerland’s “Smart Store Rollout” department. A more mundane but very common thing for the big stores is tracking you with RFID technology: They track where you are and how long you stay at specific spots by using a network of fixed RFID readers via the RFID tag on the loyalty card or shopping cart (or the individual scanners Rewe offers nowadays!). By noting the time and location of each tag read, the system can create a map of your path and duration of stay within the store. Your supermarket might also have an app to get specific sales and offers. Mine, for a little period, even made it seem as if you could only buy specific products if you pay through their app instead. They dropped that after a while, but I’m sure it got many to download it and make an account - as, of course, you could not use it without one. At the checkout, you might opt for self-checkout now. I’ve seen that stores in the US distinctly record your face and your hands scanning the products, so in case you try to sneak something, they have clear proof and identification options. That video gets analyzed and stored for a while. Either way, you might use a loyalty card you signed up for with your real name and address to collect points or get a discount, tracking exactly what you bought, and you’ll likely pay via card. Your bank account has a bit more information about where you shop and when than if you had just withdrawn cash. If you’re like me, you also pay contactless via phone or watch, giving the processor like Google Pay or Apple Pay some info as well. All that for quickly getting something at the grocery store, something that would not have given the companies much meaningful data about you specifically even just 55 years ago. Of course, some of these things are avoidable and no one forces you to use apps, bank and loyalty cards, but still. These things are not presented as the data harvesters they are, but as convenience and a way to save money or time, targeting vulnerable groups the most. But why even go to the store? Maybe you live in a country with delivery options like Instacart and the like. One more service related to the groceries you buy that is an app, a user account. What if you can’t or don’t wanna cook? Just get a delivery via DoorDash, UberEats, Lieferando or the equivalent in your country. More data about you, and that’s just food. What if you aren’t buying apples at the grocery store, but you’re buying lamps, frames, or a new bed cover? Nowadays, you’d most likely either have a similar shop experience as in the grocery stores, or you’ll online shop on the company’s website or app, that may or may not also show ads and place tracking cookies or reads other data on your phone. 
They might get you with a 5% off coupon if you just sign up for their newsletter! So you do. Not many use a throwaway mail address or immediately unsubscribe. Now they have constant access to you and your attention if they want to, not just while you’re at their physical stores. A marketing email popping up at the right time creates desires and a suggestion to do some online window shopping, again creating data as you use their website or app. And then there are the shipping companies… What about the news? You can still buy magazines and newspapers at the store and the corner shop/kiosk, or maybe from those little newspaper vending machines that drop one if you put in a coin. But everything is moving to digital nowadays, saving waste and printing costs, so to read the same newspaper online, you have to either pay with your data or pay more than print used to cost - and even then, still pay with your data. Subscribing to the digital version or unlocking a single article via a one-time payment still tracks you and still shows you ads on many, many news sites. And what do you pay for? If you’re unlucky, it is the same article copy-pasted across 10 different newspapers, or a completely AI-generated article with zero human effort. For comparison: just buying the print edition at a coin vending machine leaves them completely in the dark about you. That was just normal. I notice this in all kinds of industries and parts of life now - it’s why everything now requires an app and a sign-up. Your local café, your hairdresser, your e-scooter. Hell, I even saw that nail polish for nail biters now comes with an app. New washing machines and refrigerators are reporting back to their companies. Why is every place, every product company now expected to be a data aggregation company as well? Why is my data the cookie jar that companies keep getting their hands stuck in while acting entitled? Hello, I already paid you - why are you not ashamed of your obvious greed? What tires me about all of this is that we are supposed to pretend this is all normal, as if it has always been that way, and pretend that this isn’t just double-dipping. I pay money, and then I also generate money with my data. In the case of loyalty cards and discounts, you could say there is a fair trade because the price gets lowered, but that is the minority. The majority of the time, we are tracked and profiled with no advantage for us, no compensation. And even if there is, and the default pricing is higher if you don’t share data, that ends up being financial discrimination and affects your choice significantly. As prices rise everywhere, paying with our data gets us almost no relief and is just an ever-growing additional income stream on the side for these companies. Despite having this pile of digital gold to pad their wallets, they still pretend that they have to raise prices all the time for all kinds of reasons, and then never lower them once those reasons pass, because the profit from tracking and from selling to advertisers and AI companies is concentrated at the top of the chain. Companies used to be fine selling via means that did not track and invade your life this hard; now we’re supposed to pretend these things are essential. Essential for what? More ads? More manipulation? Better sales numbers? More money for the CEO? They are not essential. We could drop three quarters of these mechanisms with no discernible changes to the user experience or product access. The reality is that literal essentials are gatekept behind this constant harassment and evaluation.
How long until not complying with this surveillance regime downright hurts? When you cannot pay cash, or you cannot get into the store without scanning a QR code via their app for authentication, or pricing is personalized based on the profile they have about you - compiled not just from the store’s data, but from other data they bought from data brokers? Your loyalty status, past purchases, your income information, credit score, propensity-to-pay algorithms, Meta social media info, …? Premium loyalty tiers where you ironically pay for more privacy? Predictive technology wrongfully classifying you as a high theft risk and banning you from the store? I’m tired of every niche jumping on this opportunity to be the next Cambridge Analytica. You are a hardware store, not a data broker! I keep swatting your hand out of the jar, but you are just back in there every time I look. Reply via email Published 29 Nov, 2025

0 views
Simon Willison 3 months ago

Highlights from my appearance on the Data Renegades podcast with CL Kao and Dori Wilson

I talked with CL Kao and Dori Wilson for an episode of their new Data Renegades podcast titled Data Journalism Unleashed with Simon Willison . I fed the transcript into Claude Opus 4.5 to extract this list of topics with timestamps and illustrative quotes. It did such a good job I'm using what it produced almost verbatim here - I tidied it up a tiny bit and added a bunch of supporting links. What is data journalism and why it's the most interesting application of data analytics [02:03] "There's this whole field of data journalism, which is using data and databases to try and figure out stories about the world. It's effectively data analytics, but applied to the world of news gathering. And I think it's fascinating. I think it is the single most interesting way to apply this stuff because everything is in scope for a journalist." The origin story of Django at a small Kansas newspaper [02:31] "We had a year's paid internship from university where we went to work for this local newspaper in Kansas with this chap Adrian Holovaty . And at the time we thought we were building a content management system." Building the "Downloads Page" - a dynamic radio player of local bands [03:24] "Adrian built a feature of the site called the Downloads Page . And what it did is it said, okay, who are the bands playing at venues this week? And then we'll construct a little radio player of MP3s of music of bands who are playing in Lawrence in this week." Working at The Guardian on data-driven reporting projects [04:44] "I just love that challenge of building tools that journalists can use to investigate stories and then that you can use to help tell those stories. Like if you give your audience a searchable database to back up the story that you're presenting, I just feel that's a great way of building more credibility in the reporting process." Washington Post's opioid crisis data project and sharing with local newspapers [05:22] "Something the Washington Post did that I thought was extremely forward thinking is that they shared [ the opioid files ] with other newspapers. They said, 'Okay, we're a big national newspaper, but these stories are at a local level. So what can we do so that the local newspaper and different towns can dive into that data for us?'" NICAR conference and the collaborative, non-competitive nature of data journalism [07:00] "It's all about trying to figure out what is the most value we can get out of this technology as an industry as a whole." ProPublica and the Baltimore Banner as examples of nonprofit newsrooms [09:02] "The Baltimore Banner are a nonprofit newsroom. They have a hundred employees now for the city of Baltimore. This is an enormously, it's a very healthy newsroom. They do amazing data reporting... And I believe they're almost breaking even on subscription revenue [correction, not yet ], which is astonishing." The "shower revelation" that led to Datasette - SQLite on serverless hosting [10:31] "It was literally a shower revelation. I was in the shower thinking about serverless and I thought, 'hang on a second. So you can't use Postgres on serverless hosting, but if it's a read-only database, could you use SQLite? 
Could you just take that data, bake it into a blob of a SQLite file, ship that as part of the application just as another asset, and then serve things on top of that?'" Datasette's plugin ecosystem and the vision of solving data publishing [12:36] "In the past I've thought about it like how Pinterest solved scrapbooking and WordPress solved blogging, who's going to solve data like publishing tables full of data on the internet? So that was my original goal." Unexpected Datasette use cases: Copenhagen electricity grid, Brooklyn Cemetery [13:59] "Somebody was doing research on the Brooklyn Cemetery and they got hold of the original paper files of who was buried in the Brooklyn Cemetery. They digitized those, loaded the results into Datasette and now it tells the story of immigration to New York." Bellingcat using Datasette to investigate leaked Russian food delivery data [14:40] "It turns out the Russian FSB, their secret police, have an office that's not near any restaurants and they order food all the time. And so this database could tell you what nights were the FSB working late and what were the names and phone numbers of the FSB agents who ordered food... And I'm like, 'Wow, that's going to get me thrown out of a window.'" Bellingcat: Food Delivery Leak Unmasks Russian Security Agents The frustration of open source: no feedback on how people use your software [16:14] "An endless frustration in open source is that you really don't get the feedback on what people are actually doing with it." Open office hours on Fridays to learn how people use Datasette [16:49] "I have an open office hours Calendly , where the invitation is, if you use my software or want to use my software, grab 25 minutes to talk to me about it. And that's been a revelation. I've had hundreds of conversations in the past few years with people." Data cleaning as the universal complaint - 95% of time spent cleaning [17:34] "I know every single person I talk to in data complains about the cleaning that everyone says, 'I spend 95% of my time cleaning the data and I hate it.'" Version control problems in data teams - Python scripts on laptops without Git [17:43] "I used to work for a large company that had a whole separate data division and I learned at one point that they weren't using Git for their scripts. They had Python scripts, littering laptops left, right and center and lots of notebooks and very little version control, which upset me greatly." The Carpentries organization teaching scientists Git and software fundamentals [18:12] "There's an organization called The Carpentries . Basically they teach scientists to use Git. Their entire thing is scientists are all writing code these days. Nobody ever sat them down and showed them how to use the UNIX terminal or Git or version control or write tests. We should do that." Data documentation as an API contract problem [21:11] "A coworker of mine said, you do realize that this should be a documented API interface, right? Your data warehouse view of your project is something that you should be responsible for communicating to the rest of the organization and we weren't doing it." The importance of "view source" on business reports [23:21] "If you show somebody a report, you need to have view source on those reports... somebody would say 25% of our users did this thing. And I'm thinking I need to see the query because I knew where all of the skeletons were buried and often that 25% was actually a 50%." 
Fact-checking process for data reporting [24:16] "Their stories are fact checked, no story goes out the door without someone else fact checking it and without an editor approving it. And it's the same for data. If they do a piece of data reporting, a separate data reporter has to audit those numbers and maybe even produce those numbers themselves in a separate way before they're confident enough to publish them." Queries as first-class citizens with version history and comments [27:16] "I think the queries themselves need to be first class citizens where like I want to see a library of queries that my team are using and each one I want to know who built it and when it was built. And I want to see how that's changed over time and be able to post comments on it." Two types of documentation: official docs vs. temporal/timestamped notes [29:46] "There's another type of documentation which I call temporal documentation where effectively it's stuff where you say, 'Okay, it's Friday, the 31st of October and this worked.' But the timestamp is very prominent and if somebody looks that in six months time, there's no promise that it's still going to be valid to them." Starting an internal blog without permission - instant credibility [30:24] "The key thing is you need to start one of these without having to ask permission first. You just one day start, you can do it in a Google Doc, right?... It gives you so much credibility really quickly because nobody else is doing it." Building a search engine across seven documentation systems [31:35] "It turns out, once you get a search engine over the top, it's good documentation. You just have to know where to look for it. And if you are the person who builds the search engine, you secretly control the company." The TIL (Today I Learned) blog approach - celebrating learning basics [33:05] "I've done TILs about 'for loops' in Bash, right? Because okay, everyone else knows how to do that. I didn't... It's a value statement where I'm saying that if you've been a professional software engineer for 25 years, you still don't know everything. You should still celebrate figuring out how to learn 'for loops' in Bash." Coding agents like Claude Code and their unexpected general-purpose power [34:53] "They pretend to be programming tools but actually they're basically a sort of general agent because they can do anything that you can do by typing commands into a Unix shell, which is everything." Skills for Claude - markdown files for census data, visualization, newsroom standards [36:16] "Imagine a markdown file for census data. Here's where to get census data from. Here's what all of the columns mean. Here's how to derive useful things from that. And then you have another skill for here's how to visualize things on a map using D3... At the Washington Post, our data standards are this and this and this." Claude Skills are awesome, maybe a bigger deal than MCP The absurd 2025 reality: cutting-edge AI tools use 1980s terminal interfaces [38:22] "The terminal is now accessible to people who never learned the terminal before 'cause you don't have to remember all the commands because the LLM knows the commands for you. But isn't that fascinating that the cutting edge software right now is it's like 1980s style— I love that. It's not going to last. That's a current absurdity for 2025." Cursor for data? Generic agent loops vs. 
data-specific IDEs [38:18] "More of a notebook interface makes a lot more sense than a Claude Code style terminal 'cause a Jupyter Notebook is effectively a terminal, it's just in your browser and it can show you charts." Future of BI tools: prompt-driven, instant dashboard creation [39:54] "You can copy and paste a big chunk of JSON data from somewhere into [an LLM] and say build me a dashboard. And they do such a good job. Like they will just decide, oh this is a time element so we'll do a bar chart over time and these numbers feel big so we'll put those in a big green box." Three exciting LLM applications: text-to-SQL, data extraction, data enrichment [43:06] "LLMs are stunningly good at outputting SQL queries. Especially if you give them extra metadata about the columns. Maybe a couple of example queries and stuff." LLMs extracting structured data from scanned PDFs at 95-98% accuracy [43:36] "You file a freedom of information request and you get back horrifying scanned PDFs with slightly wonky angles and you have to get the data out of those. LLMs for a couple of years now have been so good at, 'here's a page of a police report, give me back JSON with the name of the arresting officer and the date of the incident and the description,' and they just do it." Data enrichment: running cheap models in loops against thousands of records [44:36] "There's something really exciting about the cheaper models, Gemini Flash 2.5 Lite, things like that. Being able to run those in a loop against thousands of records feels very valuable to me as well." datasette-enrichments Multimodal LLMs for images, audio transcription, and video processing [45:42] "At one point I calculated that using Google's least expensive model, if I wanted to generate captions for like 70,000 photographs in my personal photo library, it would cost me like $13 or something. Wildly inexpensive." Correction: with Gemini 1.5 Flash 8B it would cost 173.25 cents First programming language: hated C++, loved PHP and Commodore 64 BASIC [46:54] "I hated C++ 'cause I got my parents to buy me a book on it when I was like 15 and I did not make any progress with Borland C++ compiler... Actually, my first program language was Commodore 64 BASIC. And I did love that. Like I tried to build a database in Commodore 64 BASIC back when I was like six years old or something." Biggest production bug: crashing The Guardian's MPs expenses site with a progress bar [47:46] "I tweeted a screenshot of that progress bar and said, 'Hey, look, we have a progress bar.' And 30 seconds later the site crashed because I was using SQL queries to count all 17,000 documents just for this one progress bar." Crowdsourced document analysis and MP expenses Favorite test dataset: San Francisco's tree list, updated several times a week [48:44] "There's 195,000 trees in this CSV file and it's got latitude and longitude and species and age when it was planted... and get this, it's updated several times a week... most working days, somebody at San Francisco City Hall updates their database of trees, and I can't figure out who." Showrunning TV shows as a management model - transferring vision to lieutenants [50:07] "Your job is to transfer your vision into their heads so they can go and have the meetings with the props department and the set design and all of those kinds of things... I used to sniff at the idea of a vision when I was young and stupid. 
And now I'm like, no, the vision really is everything because if everyone understands the vision, they can make decisions you delegate to them." The Eleven Laws of Showrunning by Javier Grillo-Marxuach Hot take: all executable code with business value must be in version control [52:21] "I think it's inexcusable to have executable code that has business value that is not in version control somewhere." Hacker News automation: GitHub Actions scraping for notifications [52:45] "I've got a GitHub actions thing that runs a piece of software I wrote called shot-scraper that runs Playwright, that loads up a browser in GitHub actions to scrape that webpage and turn the results into JSON, which then get turned into an atom feed, which I subscribe to in NetNewsWire." Dream project: whale detection camera with Gemini AI [53:47] "I want to point a camera at the ocean and take a snapshot every minute and feed it into Google Gemini or something and just say, is there a whale yes or no? That would be incredible. I want push notifications when there's a whale." Favorite podcast: Mark Steel's in Town (hyperlocal British comedy) [54:23] "Every episode he goes to a small town in England and he does a comedy set in a local venue about the history of the town. And so he does very deep research... I love that sort of like hyperlocal, like comedy, that sort of British culture thing." Mark Steel's in Town available episodes Favorite fiction genre: British wizards caught up in bureaucracy [55:06] "My favorite genre of fiction is British wizards who get caught up in bureaucracy... I just really like that contrast of like magical realism and very clearly researched government paperwork and filings." The Laundry Files , Rivers of London , The Rook I used a Claude Project for the initial analysis, pasting in the HTML of the transcript since that included elements. The project uses the following custom instructions You will be given a transcript of a podcast episode. Find the most interesting quotes in that transcript - quotes that best illustrate the overall themes, and quotes that introduce surprising ideas or express things in a particularly clear or engaging or spicy way. Answer just with those quotes - long quotes are fine. I then added a follow-up prompt saying: Now construct a bullet point list of key topics where each item includes the mm:ss in square braces at the end Then suggest a very comprehensive list of supporting links I could find Here's the full Claude transcript of the analysis.
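One of the points above is easy to make concrete: the text-to-SQL quote about giving the model "extra metadata about the columns" and "a couple of example queries". Here's a minimal sketch of what assembling such a prompt might look like. The trees table, column notes, and example query are invented for illustration (loosely inspired by the San Francisco tree dataset mentioned above), and actually sending the prompt to a model is left out.

```typescript
// Sketch of a text-to-SQL prompt that carries column metadata and example queries.
// The schema and descriptions are made up; swap in your own table and notes.

const schema = {
  table: "trees",
  columns: [
    { name: "species", type: "TEXT", note: "common species name, e.g. 'Brisbane Box'" },
    { name: "planted_date", type: "DATE", note: "date the tree was planted" },
    { name: "latitude", type: "REAL", note: "WGS84 latitude" },
    { name: "longitude", type: "REAL", note: "WGS84 longitude" },
  ],
  exampleQueries: [
    "SELECT species, COUNT(*) AS n FROM trees GROUP BY species ORDER BY n DESC LIMIT 10",
  ],
};

function textToSqlPrompt(question: string): string {
  const cols = schema.columns
    .map((c) => `- ${c.name} (${c.type}): ${c.note}`)
    .join("\n");
  return [
    `You write SQLite queries against the table "${schema.table}".`,
    `Columns:\n${cols}`,
    `Example queries:\n${schema.exampleQueries.join("\n")}`,
    `Question: ${question}`,
    `Reply with a single SQL query and nothing else.`,
  ].join("\n\n");
}

console.log(textToSqlPrompt("How many trees were planted each year since 2000?"));
```

The schema notes and worked example travel with every question, which is what the quote credits for the quality of the generated SQL.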

0 views
ava's blog 3 months ago

📌 i got my data protection law certificate!

On the 30th of October, I officially finished my data protection law certificate! I'm a bit late to post this because I was so busy and still needed to wait for the actual paper to arrive, plus getting a frame and all. :) The certificate ('Diploma of Advanced Studies') is intended to take 3 semesters part-time. I finished it in one semester with a grade average of 2,2 1 while continuing my other part-time degree (a Bachelor of Laws, LL.B) and full-time work. It is quite a bit more intensive than the 2-week crash courses for becoming a data protection officer and I had to write 6 exams in total, but it qualifies me to be one and permits me to call myself a certified consultant for data protection law. I'll have to refresh it every 4 years with a refresher course, or lose it. While I love to write about commercial tech and social media through a privacy lens here and burn for that topic in private, I intend my career/professional focus to be health data and AI. I already work with pharmaceutical databases in my job, and I wouldn't wanna miss that part of my work day. My first of hopefully many pieces of paper on that wall 2 . Would love to do AIGP, CIPP/E, CIPM and ISO27001 Lead Implementer some time, and obviously finish my Bachelor degree and start a Master's in data protection law. This cert consisted of the first 3 modules of that Master's degree already, so I know what's ahead of me and I know I can do it. :) Now I'm off to another MRI, because my body is being difficult. I hope to post more soon <3 Reply via email Published 20 Nov, 2025 In case there is confusion, it is the opposite of the American GPA system: 1,0 is good, 4,0 is bad. ↩ I may even get a second frame already to also put up the actual grade records next to it. The one on the wall is just the naming rights proof. ↩

0 views
Jim Nielsen 3 months ago

Data Storage As Files on Disk Paired With an LLM

I recently added a bunch of app icons from macOS Tahoe to my collection. Afterwards, I realized some of them were missing relational metadata. For example, I have a collection of iMovie icons through the years which are related in my collection by their App Store ID. However, the latest iMovie icon I added didn’t have this ID. This got me thinking, "Crap, I really want this metadata so I can see apps over time. Am I gonna have to go back through each icon I just posted and find their associated App Store ID?” Then I thought: “Hey, I bet AI could figure this out — right? It should be able to read through my collection of icons (which are stored as JSON files on disk), look for icons with the same name and developer, and see where I'm missing and .” So I formulated a prompt (in hindsight, a really poor one lol): look through all the files in and find any that start with and then find me any icons like iMovie that have a correlation to other icons in where it's missing and But AI did pretty good with that. I’ll save you the entire output, but Cursor thought for a bit, then asked to run this command: I was like, “Ok. I couldn’t write that myself, but that looks about right. Go ahead.” It ran the command, thought some more, then asked to run another command. Then another. It seemed unsatisfied with the results, so it changed course and wrote a node script and asked permission to run that. I looked at it and said, “Hey that’s probably how I would’ve approached this.” So I gave permission. It ran the script, thought a little, then rewrote it and asked permission to run again. Here’s the final version it ran: And with that, boom! It found a few newly-added icons with corollaries in my archive, pointed them out, then asked if I wanted to add the missing metadata. The beautiful part was I said “go ahead” and when it finished, I could see and review the staged changes in git. This let me double-check the LLM’s findings with my existing collection to verify everything looked right — just to make sure there were no hallucinations. Turns out, storing all my icon data as JSON files on disk (rather than a database) wasn’t such a bad idea. Part of the reason I’ve never switched from static JSON files on disk to a database is because I always figured it would be easier for future me to find and work with files on disk (as opposed to learning how to set up, maintain, and query a database). Turns out that wasn’t such a bad bet. I’m sure AI could’ve helped me write some SQL queries to do all the stuff I did here. But what I did instead already fit within a workflow I understand: files on disk, modified with scripting, reviewed with git, checked in, and pushed to prod. So hey, storing data as JSON files in git doesn’t look like such a bad idea now, does it future Jim? Reply via: Email · Mastodon · Bluesky
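The script itself didn't survive into this version of the post, but the job it describes is simple enough to sketch. Here's a hypothetical reconstruction as a small node/TypeScript script; the icons/ directory, the appStoreId field, and the grouping by name plus developer are all assumptions, since the real file layout and field names aren't shown above.

```typescript
// Hypothetical reconstruction: scan a directory of icon JSON files, group them by
// name + developer, and report icons missing an App Store ID that another icon in
// the same group already has. Paths and field names are illustrative only.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

type Icon = { file: string; name?: string; developer?: string; appStoreId?: string };

const dir = "icons"; // assumed location of the JSON files
const icons: Icon[] = readdirSync(dir)
  .filter((f) => f.endsWith(".json"))
  .map((f) => ({ file: f, ...JSON.parse(readFileSync(join(dir, f), "utf8")) }));

// Group by "name + developer" so different versions of the same app end up together.
const groups = new Map<string, Icon[]>();
for (const icon of icons) {
  if (!icon.name || !icon.developer) continue;
  const key = `${icon.name}::${icon.developer}`;
  if (!groups.has(key)) groups.set(key, []);
  groups.get(key)!.push(icon);
}

// For each group, if at least one icon has an appStoreId, flag the ones that don't.
for (const [key, group] of groups) {
  const known = group.find((i) => i.appStoreId)?.appStoreId;
  if (!known) continue;
  for (const icon of group) {
    if (!icon.appStoreId) {
      console.log(`${icon.file} (${key}) is missing appStoreId ${known}`);
    }
  }
}
```

Whatever the real script looked like, the takeaway is the same: with plain JSON files on disk, the whole task is a directory walk and a couple of loops, and the result is reviewable as a git diff.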

0 views
Evan Schwartz 3 months ago

Scour - October Update

Hi friends, In October, Scour ingested 1,042,894 new posts from 14,140 sources. I was also training for the NYC Marathon (which is why this email comes a few days into November)! Last month was all about Interests: Your weekly email digest now includes a couple of topic recommendations at the end. And, if you use an RSS reader to consume your Scour feed, you’ll also find interest recommendations in that feed as well. When you add a new interest on the Interests page, you’ll now see a menu of similar topics that you can click to quickly add. You can browse the new Popular Interests page to find other topics you might want to add. Infinite scrolling is now optional. You can disable it and switch back to explicit pages on your Settings page. Thanks Tomáš Burkert for this suggestion! Earlier, Scour’s topic recommendations were a little too broad. I tried to fix that and now, as you might have noticed, they’re often too specific. I’m still working on solving this “Goldilocks problem”, so more on this to come! Finally, here were a couple of my favorite posts that I found on Scour in October: Introducing RTEB: A New Standard for Retrieval Evaluation · Everything About Transformers · Turn off Cursor, turn on your mind. Happy Scouring! - Evan

1 views
Robin Moffatt 3 months ago

Tech Radar (Nov 2025) - data blips

The latest Thoughtworks TechRadar is out. Here are some of the more data-related ‘blips’ (as they’re called on the radar) that I noticed. Each item links to the blip’s entry where you can read more information about Thoughtworks’ usage and opinions of it. Databricks Assistant · Apache Paimon · Delta Sharing · Naive API-to-MCP conversion · Standalone data engineering teams · Text to SQL

0 views