Latest Posts (14 found)
Robin Moffatt 1 week ago

Stumbling into AI: Part 6—I've been thinking about Agents and MCP all wrong

Ever tried to hammer a nail in with a potato? Nor me, but that’s what I’ve felt like I’ve been attempting to do when trying to really understand agents, as well as to come up with an example agent to build. As I wrote about previously, citing Simon Willison, an LLM agent runs tools in a loop to achieve a goal. Unlike building ETL/ELT pipelines, these were some new concepts that I was struggling to fit to an even semi-plausible real-world example.

That’s because I was thinking about it all wrong. For the last cough 20 cough years I’ve built data processing pipelines, either for real or as examples based on my previous experience. It’s the same pattern, always:

Data comes in
Data gets processed
Data goes out

Maybe we fiddle around with the order of things (ELT vs ETL), maybe a particular example focusses more on one particular point in the pipeline—but all the concepts remain pleasingly familiar. All I need to do is figure out what goes in the boxes. I’ve even extended this to be able to wing my way through talking about applications and microservices (kind of). We get some input, we make something else happen. Somewhat stretching beyond my experience, admittedly, but it’s still the same principles. When this thing happens, make a computer do that thing.

Perhaps I’m too literal, perhaps I’m cynical after too many years of vendor hype, or perhaps it’s just how my brain is wired—but I like concrete, tangible, real examples of something. So when it comes to agents, particularly with where we’re at in the current hype-cycle, I really wanted to have some actual examples on which to build my understanding. In addition, I wanted to build some of my own. But where to start?

Here was my mental model; literally what I sketched out on a piece of paper as I tried to think about what real-world example could go in each box to make something plausible. But this is where I got stuck, and spun my proverbial wheels for several days.
Every example I could think of ended up with me uttering, exasperated… but why would you do it like that?

My first mistake was focussing on the LLM bit as needing to do something to the input data. I had a whole bunch of interesting data sources (like river levels, for example) but my head blocked on "but that’s numbers, what can you get an LLM to do with those?!". The LLM bit of an agent, I mistakenly thought, demanded unstructured input data for it to make any sense. After all, if it’s structured, why aren’t we just processing it with a regular process—no need for magic fairy dust here. This may also have been an over-fitting of an assumption based on my previous work with an LLM to summarise human-input data in a conference keynote.

The tool bit baffled me just as much. With hindsight, the exact problem turned out to be the solution. Let me explain…

Whilst there are other options, in many cases an agent calling a tool is going to do so using MCP. Thus, grabbing the dog firmly by the tail and proceeding to wag it, I went looking for MCP servers. Looking down a list of hosted MCP servers that I found, I saw that there were only about half a dozen that were open, including GlobalPing, AlphaVantage, and CoinGecko. Flummoxed, I cast around for an actual use of one of these, with an unstructured data source. Oh jeez…are we really going to do the 'read a stream of tweets and look up the stock price/crypto-token' thing again?

The mistake I made was this: I’d focussed on the LLM bit of the agent definition: an LLM agent runs tools in a loop to achieve a goal. Actually, what an agent is about is this: […] runs tools. The LLM bit can do fancy LLM stuff—but it’s also there to just invoke the tool(s) and decide when they’ve done what they need to do. A tool is quite often just a wrapper on an API. So what we’re saying is, with MCP, we have a common interface to APIs. That’s…all.
We can define agents to interact with systems, and the way they interact is through a common protocol: MCP. When we load a web page, we don’t concern ourselves with what Chrome is doing, and unless we stop and think about it we don’t think about the TCP and HTTP protocols being used. It’s just the common way of things talking to each other. And that’s the idea with MCP, and thus tool calling from agents. (Yes, there are other ways you can call tools from agents, but MCP is the big one at the moment.)

Given this reframing, it makes sense why there are so few open MCP servers. If an MCP server is there to offer access to an API, who leaves their API open for anyone to use? Well, read-only data providers like CoinGecko and AlphaVantage, perhaps. In general though, the really useful thing we can do with tools is change the state of systems. That’s why any SaaS platform worth its salt is rushing to provide an MCP server. Not to jump on the AI bandwagon per se, but because if this is going to be the common protocol by which things get automated with agents, you don’t want to be there offering Betamax when everyone else has VHS. SaaS platforms will still provide their APIs for direct integration, but they will also provide MCP servers. There’s also no reason why applications developed within an organisation wouldn’t offer MCP either, in theory.

No, not really. It actually makes a bunch of sense to me. I personally also like it a lot from a SQL-first, not-really-a-real-coder point of view. Let me explain. If you want to build a system to respond to something that’s happened by interacting with another external system, you have two choices now:

Write custom code to call the external system’s API. Handle failures, retries, monitoring, etc. If you want to interact with a different system, you now need to understand the different API, work out calling it, write new code to do so.

Write an agent that responds to the thing that happened, and have it call the tool.
The agent framework now standardises handling failures, retries, and all the rest of it. If you want to call a different system, the agent stays pretty much the same. The only thing that you change is the MCP server and tool that you call. You could write custom code—and there are good examples of where you’ll continue to. But you no longer have to.

For Kafka folk, my analogy here would be data integration with Kafka Connect. Kafka Connect provides the framework that handles all of the sticky and messy things about data integration (scale, error handling, types, connectivity, restarts, monitoring, schemas, etc etc etc). You just use the appropriate connector with it and configure it. Different system? Just swap out the connector. You want to re-invent the wheel and re-solve a solved problem? Go ahead; maybe you’re special. Or maybe NIH is real ;P

So…what does an actual agent look like now, given this different way of looking at it? Sure, the LLM could do a bunch of clever stuff with the input. But it can also just take our natural language expression of what we want to happen, and make it so. Agents can use multiple tools, from multiple MCP servers.

Confluent launched Streaming Agents earlier this year. They’re part of the fully-managed Confluent Cloud platform and provide a way to run agents like I’ve described above, driven by events in a Kafka topic.

Is this over-engineered? Do you even need an agent? Why not just do this? You can. Maybe you should. But…don’t forget failure conditions. And restarts. And testing. And scaling. All these things are taken care of for you by Flink. Although having the runtime considerations taken care of for you is nice, let’s not forget another failure vector which LLMs add into the mix: talking shite (a.k.a. hallucinations).
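To make the "runs tools in a loop" framing concrete, here is a minimal Python sketch. Everything in it is hypothetical: decide_next_step stands in for the LLM call, and the TOOLS registry stands in for an MCP server. The point is the shape of the thing—swapping the target system means swapping a tool entry, not rewriting the loop.

```python
# Minimal sketch of a tools-based agent loop (all names are hypothetical).
# An LLM would normally decide which tool to call next; here a stub
# stands in for it so the example is self-contained and runnable.

def send_slack_message(channel: str, text: str) -> str:
    # Stand-in for an MCP tool that wraps the Slack API.
    return f"posted to {channel}: {text}"

# The "MCP server" part: a registry of named tools. Swapping systems
# (Slack -> Teams, say) means changing this registry, not the loop.
TOOLS = {"send_slack_message": send_slack_message}

def decide_next_step(goal: str, history: list) -> dict:
    # Stand-in for the LLM: given the goal and what has happened so far,
    # either pick a tool call or declare the goal achieved.
    if not history:
        return {"tool": "send_slack_message",
                "args": {"channel": "#alerts", "text": goal}}
    return {"done": True}

def run_agent(goal: str) -> list:
    # The agent itself: run tools in a loop until the goal is achieved.
    history = []
    while True:
        step = decide_next_step(goal, history)
        if step.get("done"):
            return history
        result = TOOLS[step["tool"]](**step["args"])
        history.append(result)

print(run_agent("River level at Sheffield exceeded threshold"))
```

The loop, the retry/monitoring plumbing, and the decision-making stay constant; only the tool registry changes per target system—which is the Kafka Connect analogy below in code form.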
Compared to a lump of Python code which either works or doesn’t, LLMs keep us on our toes by sometimes confidently doing the wrong thing. However, how do we know it’s wrong? Our Python program might crash, or throw a nicely-handled error, but left to its own devices an AI Agent will happily report that everything worked even if it actually made up a parameter for a tool call that doesn’t exist. There are mitigating steps we can take, but it’s important to recognise the trade-offs between the approaches.

Permit me to indulge this line of steel-manning, because I think I might even have a valid argument here. Let’s say we’ve built the above simplistic agent that sends a Slack message when a data point is received. Now we want to enhance it to also include information about the weather forecast. Our streaming agent above changes to just amending the prompt and adding a new tool (just DDL statements, defining the MCP server and its tools), whilst the bespoke application might have a seemingly-innocuous small addition.

But consider what this looks like in practice. Figuring out the API, new lines of code to handle calling it, failures, and so on. Oh, whilst you’re at it; don’t introduce any bugs into the bespoke code. And remember to document the change. Not insurmountable, and probably a good challenge if you like that kind of thing. But is it as straightforward as literally changing the prompt in an agent to use an additional tool, and letting it figure the rest out (courtesy of MCP)?

Let’s not gloss over the reality too much here though; whilst adding a new tool call into the agent is definitely easier and less prone to introducing code errors, LLMs are by their nature non-deterministic—meaning that we still need to take care with the prompt and the tool invocation to make sure that the agent is still doing what it’s designed to do.
You wouldn’t be wrong to argue that at least the non-Agent route (of coding API invocations directly into your application) can actually be tested and proved to work.

There are different types of AI Agent—the one I’ve described is a tools-based one. As I mentioned above, its job is to run tools. The LLM provides the natural language interface with which to invoke the tools. It can also, optionally, do additional bits of magic:

Process [unstructured] input, such as summarising or extracting key values from it
Decide which tool(s) need calling in order to achieve its aim

But at the heart of it, it’s about the tool that gets called. That’s where I was going wrong with this. That’s the bit I needed to think differently about :)

Robin Moffatt 3 weeks ago

Tech Radar (Nov 2025) - data blips

The latest Thoughtworks Tech Radar is out. Here are some of the more data-related ‘blips’ (as they’re called on the radar) that I noticed. Each item links to the blip’s entry where you can read more information about Thoughtworks' usage of and opinions on it.

Databricks Assistant
Apache Paimon
Delta Sharing
Naive API-to-MCP conversion
Standalone data engineering teams
Text to SQL

Robin Moffatt 1 month ago

Blog Writing for Developers - slides

A presentation about effective blog writing for developers, covering why to blog, what to write about, and how to structure your content. This presentation covers:

Why developers should blog
What topics to write about
How to structure and write effective content
Tools and platforms for technical writing
Using AI in the writing process

No video, but you can listen to the recording here (or download it for offline listening). Apologies for the voice quality—I was getting over a bad cold! 🤧 The presentation is built from AsciiDoc source using reveal.js. You can find the source here.

Robin Moffatt 1 month ago

Stumbling into AI: Part 5—Agents

A short series of notes for myself as I learn more about the AI ecosystem as of Autumn [Fall] 2025. The driver for all this is understanding more about Apache Flink’s Flink Agents project, and Confluent’s Streaming Agents.

I started off this series—somewhat randomly, with hindsight—looking at Model Context Protocol (MCP). It’s a helper technology to make things easier to use and provide a richer experience. Next I tried to wrap my head around Models—mostly LLMs, but also with an addendum discussing other types of model too. Along the lines of MCP, Retrieval Augmented Generation (RAG) is another helper technology that on its own doesn’t do anything but combined with an LLM gives it added smarts. I took a brief moment in part 4 to try and build a clearer understanding of the difference between ML and AI.

So whilst RAG and MCP combined make for a bunch of nice capabilities beyond models such as LLMs alone, what I’m really circling around here is what we can do when we combine all these things: Agents! But…what is an Agent, both conceptually and in practice? Let’s try and figure it out.

Let’s begin with Wikipedia’s definition: In computer science, a software agent is a computer program that acts for a user or another program in a relationship of agency. We can get more specialised if we look at Wikipedia’s entry for an Intelligent Agent: In artificial intelligence, an intelligent agent is an entity that perceives its environment, takes actions autonomously to achieve goals, and may improve its performance through machine learning or by acquiring knowledge. Citing Wikipedia is perhaps the laziest ever blog author’s trick, but I offer no apologies 😜. Behind all the noise and fuss, this is what we’re talking about: a bit of software that’s going to go and do something for you (or your company) autonomously.
LangChain have their own definition of an Agent, explicitly identifying the use of an LLM: An AI agent is a system that uses an LLM to decide the control flow of an application. The blog post from LangChain as a whole gives more useful grounding in this area and is worth a read. In fact, if you want to really get into it, the LangChain Academy is free and the Introduction to LangGraph course gives a really good primer on Agents and more.

Meanwhile, the Anthropic team have a chat about their definition of an Agent. In a blog post Anthropic differentiates between Workflows (that use LLMs) and Agents: Workflows are systems where LLMs and tools are orchestrated through predefined code paths. Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.

Independent researcher Simon Willison also uses the LLM word in his definition: An LLM agent runs tools in a loop to achieve a goal. He explores the definition in a recent blog post: I think “agent” may finally have a widely enough agreed upon definition to be useful jargon now, in which Josh Bickett’s meme demonstrates how much of a journey this definition has been on. That there’s still discussion and ambiguity nearly two years after this meme was created is telling.

My colleague Sean Falconer knows a lot more about this than I do. He was a guest on a recent podcast episode in which he spells things out: [Agentic AI] involves AI systems that can reason, dynamically choose tasks, gather information, and perform actions as a more complete software system. [1] [Agents] are software that can dynamically decide its own control flow: choosing tasks, workflows, and gathering context as needed. Realistically, current enterprise agents have limited agency[…]. They’re mostly workflow automations rather than fully autonomous systems. [2] In many ways […] an agent [is] just a microservice.
[3]

A straightforward software Agent might do something like: Order more biscuits when there are only two left. We take this code, stick it on a server and leave it to run. One happy Agent, done. An AI Agent could look more like this:

Other examples of AI Agents include:

Coding Agents. Everyone’s favourite tool (when used right). It can reason about code, it can write code, it can review PRs. One of the trends that I’ve noticed recently (October 2025) is the use of Agents to help with some of the up-front jobs in software engineering (such as data modelling and writing tests), rather than full-blown code that’s going to ship to production. That’s not to say that coding Agents aren’t being used for that, but by using AI to accelerate certain tasks whilst retaining human oversight (a.k.a. HITL) it makes it easier to review the output rather than just trusting to luck that reams and reams of code are correct. There’s a good talk from Uber on how they’re using AI in the development process, including code conversion, and testing.

Travel booking. Perhaps you tell it when you want to go, the kind of vacation you like, and what your budget is; it then goes and finds where it’s nice at that time of year, figures out travel plans within your budget, and either proposes an itinerary or even books it for you. Another variation could be you tell it where, and then it integrates with your calendar to figure out the when. This is a canonical example that is oft-cited; I’d be interested if anyone can point me to an actual implementation of it, even if just a toy one.

I saw this in a blog post from Simon Willison that made me wince, but am leaving the above in anyway just to serve as an example of the confusion/hype that exists in this space.

Agentic comes from Agent plus -ic, the latter meaning of, relating to, or characterised by. So Agentic AI is simply AI that is characterised by an Agent, or Agency.
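The straightforward biscuit-ordering Agent described above might be sketched like this, with check_stock and place_order as hypothetical stand-ins for real integrations (a cupboard inventory and a shop API, say):

```python
# Hypothetical sketch of the rule-based biscuit-ordering Agent.
# check_stock and place_order stand in for real integrations.

PANTRY = {"biscuits": 2}

def check_stock(item: str) -> int:
    # How many packets are left?
    return PANTRY[item]

def place_order(item: str, quantity: int) -> str:
    # Stand-in for calling a shop's ordering API.
    PANTRY[item] += quantity
    return f"ordered {quantity} x {item}"

def biscuit_agent():
    # The whole "agent": a fixed rule, no LLM in sight.
    if check_stock("biscuits") <= 2:
        return place_order("biscuits", quantity=12)
    return None

print(biscuit_agent())  # → ordered 12 x biscuits
```

Note there is nothing "AI" here at all: it acts autonomously on your behalf, which is all the base definition of a software agent requires.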
Contrast that to AI that’s you sat at the ChatGPT prompt asking it to draw pictures of a duck dressed as a clown. Nothing Agentic about that—just a human-led and human-driven interaction. "AI Agents" becomes a bit of a mouthful with the qualifier, so much of the current industry noise is simply around "Agents". That said, "Agentic AI" sounds cool, so gets used as the marketing term in place of "AI" alone.

So we’ve muddled our way through to some kind of understanding of what an Agent is, and what we mean by Agentic AI. But how do we actually build one? All we need is an LLM (such as access to the API for OpenAI or Claude), something to call that API, and a way to call external services (e.g. MCP servers) if the LLM determines that it needs to use them. So in theory we could build an Agent with some lines of bash, some API calls, and a bunch of sticky-backed plastic. This is a grossly oversimplified example (and is missing elements such as memory)—but it hopefully illustrates what we’re building at the core of an Agent. On top of this goes all the general software engineering requirements of any system that gets built (suitable programming language and framework, error handling, LLM output validation, guard rails, observability, tests, etc etc).

The other nuance that I’ve noticed is that whilst the above simplistic diagram is 100% driven by an LLM (it decides what tools to call, it decides when to iterate) there are plenty of cases where an Agent is to some degree rules-driven. So perhaps the LLM does some of the autonomous work, but then there’s a bunch of good ol' if-else statements in there too. This is also borne out by the notion of "Workflows" when people talk about Agents. An Agent doesn’t wake up in the morning and set out on its day serving only to fulfill its own goals and enrichment. More often than not an Agent is going to be tightly bound into a pre-defined path with a limited range of autonomy.
What if you want to actually build this kind of thing for real? That’s where tools like LangGraph and LangChain come in. Here’s a notebook with an example of an actual Agent built with these tools. LlamaIndex is another framework, with details of building an Agent in their docs.

As we build up from the so-simple-it-is-laughable strawman example of an Agent above, one of the features we’ll soon encounter is the concept of memory. The difference between a crappy response and a holy-shit-that’s-magic response from an LLM is often down to context. The richer the context, the better a chance it has at generating a more accurate output. So if an Agent can look back on what it did previously, determining what worked well and what didn’t, perhaps even taking into account human feedback, it can then generate a more successful response the next time. You can read a lot more about memory in this chapter of Agentic Design Patterns by Antonio Gulli. This blog post from "The BIG DATA guy" is also useful: Agentic AI, Agent Memory, & Context Engineering.

This diagram from Generative Agents: Interactive Simulacra of Human Behavior (J.S. Park, J.C. O’Brien, C.J. Cai, M.R. Morris, P. Liang, M.S. Bernstein) gives a good overview of a much richer definition of an Agent’s implementation. The additional concepts include memory (discussed briefly above), planning, and reflection. Also check out Paul Iusztin’s talk from QCon London 2025 on The Data Backbone of LLM Systems. Around the 35-minute mark he goes into some depth around Agent architectures.

Just as you can build computer systems as monoliths (everything done in one place) or microservices (multiple programs, each responsible for a discrete operation or domain), you can also have one big Agent trying to do everything (probably not such a good idea) or individual Agents each good at their particular thing that are then hooked together into what’s known as a Multi-Agent System (MAS).
Sean Falconer’s family meal planning demo is a good example of a MAS. One Agent plans the kids' meals, one the adults' meals, another combines the two into a single plan, and so on.

Human-in-the-loop (HITL) is a term you’ll come across, referring to the fact that Agents might be pretty good, but they’re not infallible. In the travel booking example above, do we really trust the Agent to book the best holiday for us? Almost certainly we’d want—at a minimum—the option to sign off on the booking before it goes ahead and sinks £10k on an all-inclusive trip to Bognor Regis. Then again, we’re probably happy enough for an Agent to access our calendars without asking permission, and as to whether they need permission or not to create a meeting is up to us and how much we trust them. When it comes to coding, having an Agent write code, test it, fix the broken tests, compare it to a spec, and iterate is really neat. On the other hand, letting it decide what to run…less so 😅.

Every time an Agent requires HITL, it reduces its autonomy and/or responsiveness to situations. As well as simply using smarter models that make fewer mistakes, there are other things that an Agent can do to reduce the need for HITL such as using guardrails to define acceptable parameters. For example, an Agent is allowed to book travel but only up to a defined threshold. That way the user gets to trade off convenience (no HITL) with risk (unintended first-class flight to Hawaii).

📃 Generative Agents: Interactive Simulacra of Human Behavior
🎥 Paul Iusztin - The Data Backbone of LLM Systems - QCon London 2025
📖 Antonio Gulli - Agentic Design Patterns
📖 Sean Falconer - https://seanfalconer.medium.com/

Robin Moffatt 2 months ago

Stumbling into AI: Part 4—Terminology Tidy-up (and a little rant)

Having looked at MCP, Models, and RAG, I realised that I’ve been mentally skirting around something that I don’t really understand, so I’m going to expose myself to some ridicule here and try to understand better: what’s the difference between AI and ML? Aren’t they just the same?

OK, we’re doing this, are we? I thought AI was just ✨magic✨? And ML was the thing that got data scientists mad stacks ten years ago before everyone realised you couldn’t do shit without good data and processes? To me, a layperson in this space, watching it from the sidelines, AI and ML have been interchangeable. In fact, you’d get conferences and conference tracks titled "AI/ML"—because it’s all kind of the same thing anyway, right? This is, of course, factually incorrect and presumably infuriating to anyone actually working in the field.

The whole purpose of this blog series has been for me to at least reduce the number of unknown unknowns in my knowledge in this space—to build up a mental map of the different areas and terms so that I at least know where to go and look when encountering something that I know I don’t know. With that framing in mind, this is roughly how I understand the terms AI and ML:

Robin Moffatt 2 months ago

Stumbling into AI: Part 3—RAG

A short series of notes for myself as I learn more about the AI ecosystem as of September 2025. The driver for all this is understanding more about Apache Flink’s Flink Agents project, and Confluent’s Streaming Agents. Having poked around MCP and Models, next up is RAG.

RAG has been one of the buzzwords of the last couple of years, with any vendor worth its salt finding a way to crowbar it into their product. I’d been sufficiently put off it by the hype to steer away from actually understanding what it is. In this blog post, let’s fix that—because if I’ve understood it correctly, it’s a pattern that’s not scary at all.

First up: RAG stands for Retrieval-Augmented Generation. Put another way, it’s about Generation (using LLMs, like we saw before and like you use every day), in which the prompt given to the LLM is Augmented by the Retrieval of additional context.
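That Retrieval → Augment → Generate flow can be sketched in a few lines of Python. Note the retrieval here is a deliberately naive keyword overlap (real systems typically use embeddings and a vector store), and call_llm is a hypothetical stand-in for whatever model API you use:

```python
# Naive RAG sketch: retrieve relevant context, augment the prompt,
# then generate. call_llm is a hypothetical stand-in for a real model API.

DOCUMENTS = [
    "Flink Agents is a sub-project of Apache Flink for building agents.",
    "Kafka Connect is a framework for data integration with Apache Kafka.",
    "RAG augments an LLM prompt with retrieved context.",
]

def retrieve(question: str, docs: list, k: int = 1) -> list:
    # Deliberately naive retrieval: rank documents by keyword overlap.
    q_words = set(question.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def augment(question: str, context: list) -> str:
    # Build the augmented prompt: retrieved context first, then the question.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"

def call_llm(prompt: str) -> str:
    # Stand-in for the Generation step (an OpenAI/Claude API call, say).
    return f"(answer generated from a {len(prompt)}-char prompt)"

question = "What is Kafka Connect?"
prompt = augment(question, retrieve(question, DOCUMENTS))
print(call_llm(prompt))
```

The LLM itself is untouched; all RAG does is front-load the prompt with relevant material before generation, which is why it is a pattern rather than a technology.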

Robin Moffatt 2 months ago

Stumbling into AI: Part 2—Models

A short series of notes for myself as I learn more about the AI ecosystem as of September 2025. The driver for all this is understanding more about Apache Flink’s Flink Agents project, and Confluent’s Streaming Agents. Having poked around MCP and got a broad idea of what it is, I want to next look at Models. What used to be as simple as "I used AI" actually boils down into several discrete areas, particularly when one starts looking at using LLMs beyond writing a rap about Apache Kafka in the style of Monty Python and using it to build agents (like the Flink Agents that prompted this exploration in the first place).

This is what it’s all about right here. Large Language Models (LLMs) are what piqued the interest of nerds outside the academic community in 2023 and the broader public a year or so later. What used to be an "OMFG have you seen this" moment is now somewhat passé. Of course I can ask my computer to write my homework assignment for me. Of course I can use my phone to explain the nuances of the leg-before-wicket rule in cricket. Without a model, the whole AI sandcastle collapses.

There are many dozens of LLMs. The most well-known ones are grouped into families and include GPT, Claude, and Gemini. Within these there are different models, such as GPT-5, Claude 4.1, and so on. Often these models themselves have variants, specific to certain tasks like writing software code, generating images, or understanding audio. The big companies behind the models include:

Robin Moffatt 2 months ago

Stumbling into AI: Part 1—MCP

A short series of notes for myself as I learn more about the AI ecosystem as of September 2025. The driver for all this is understanding more about Apache Flink’s Flink Agents project, and Confluent’s Streaming Agents. The first thing I want to understand better is MCP.

For context, so far I’ve been a keen end-user of LLMs, for generating images, proof-reading my blog posts, and general lazyweb stuff like getting it to spit out the right syntax for a bash one-liner. I use Raycast with its Raycast AI capabilities to interact with different models, and I’ve used Cursor to vibe-code some useful (and less useful, fortunately never deployed) functionality for this blog. But what I’ve not done so far is dig any further into the ecosystem beyond. Let’s fix that! So, what is MCP?

Robin Moffatt 3 months ago

Interesting links - August 2025

Not got time for all this? I’ve marked 🔥 for my top reads of the month :)

🔥 Ben Rogojan (a.k.a. SeattleDataGuy) has a great list of 5 Things in Data Engineering That Still Hold True After 10 Years (guess what: data modelling matters, if you start with crap data you’ll end with crap data, and so on…).
Veronika Durgin shares some good tips for building resilient data pipelines.
Some good pointers for why you might want to modernise your data platform, and how to pick your stack if you do so.
🔥 Aleksandr Klein has a thoughtful post about The Mythic Journey of Data Quality Maturity

Robin Moffatt 3 months ago

Kafka to Iceberg - Exploring the Options

You’ve got data in Apache Kafka. You want to get that data into Apache Iceberg. What’s the best way to do it? Perhaps inevitably, the answer is: IT DEPENDS. But fear not: here is a guide to help you navigate your way to choosing the best solution for you 🫵. I’m considering three technologies in this blog post:

Robin Moffatt 4 months ago

Connecting Apache Flink SQL to Confluent Cloud Kafka broker

This is a quick blog post to remind me how to connect Apache Flink to a Kafka topic on Confluent Cloud. You may wonder why you’d want to do this, given that Confluent Cloud for Apache Flink is a much easier way to run Flink SQL. But, for whatever reason, you’re here and you want to understand the necessary incantations to get this connectivity to work.

There are two versions of this connectivity: with, and without, using the Schema Registry for Avro. First off, you need to get two things:

The address of your Confluent Cloud broker
An API key pair with authorisation to access the topic that you want to read/write
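For the without-Schema-Registry case, the shape of the incantation is roughly this sketch. The table columns, topic name, and broker address are placeholders for your own values; the security options are the Flink Kafka SQL connector's standard pass-through Kafka client properties:

```sql
-- Sketch: reading a Confluent Cloud topic from Flink SQL (JSON, no Schema Registry).
-- Replace the bootstrap server and API key/secret placeholders with your own values.
CREATE TABLE my_topic (
  order_id STRING,
  amount   DOUBLE
) WITH (
  'connector'                    = 'kafka',
  'topic'                        = 'my_topic',
  'properties.bootstrap.servers' = '<YOUR_BROKER>.confluent.cloud:9092',
  'properties.security.protocol' = 'SASL_SSL',
  'properties.sasl.mechanism'    = 'PLAIN',
  'properties.sasl.jaas.config'  = 'org.apache.kafka.common.security.plain.PlainLoginModule required username="<API_KEY>" password="<API_SECRET>";',
  'scan.startup.mode'            = 'earliest-offset',
  'format'                       = 'json'
);
```

The API key pair goes into the SASL/PLAIN JAAS config as username and password; everything prefixed `properties.` is handed straight to the underlying Kafka client.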

Robin Moffatt 4 months ago

Interesting links - July 2025

Not got time for all this? I’ve marked 🔥 for my top reads of the month :) First up, allow me a shameless plug for my blog posts this month:

Writing to Apache Iceberg on S3 using Kafka Connect with Glue catalog
Keeping your Data Lakehouse in Order: Table Maintenance in Apache Iceberg

🔥 Building Streaming Data Pipelines, Part 2: Data Processing and Enrichment with Flink SQL (see also Part 1)

Robin Moffatt 4 months ago

Keeping your Data Lakehouse in Order: Table Maintenance in Apache Iceberg

Plus the data file itself for the table (along with its record_count and file_size_in_bytes metadata): s3://warehouse/rmoff/customers/data/00000-0-e3b7a202-2481-4d9f-9b7c-9830908a425a-0-00001.parquet

After a few more changes to the data on the table, what started off as five files in the bucket is now ten times that:

Robin Moffatt 4 months ago

Writing to Apache Iceberg on S3 using Kafka Connect with Glue catalog

Without wanting to mix my temperature metaphors, Iceberg is the new hawtness, and getting data into it from other places is a common task. I wrote previously about using Flink SQL to do this, and today I’m going to look at doing the same using Kafka Connect. Kafka Connect can send data to Iceberg from any Kafka topic. The source Kafka topic(s) can be populated by a Kafka Connect source connector (such as Debezium), or a regular application producing directly to it. I’m going to use AWS’s Glue Data Catalog, but the sink also works with other Iceberg catalogs.

Kafka Connect is a framework for data integration, and is part of Apache Kafka. There is a rich ecosystem of connectors for getting data in and out of Kafka, and Kafka Connect itself provides a set of features that you’d expect for a resilient data integration platform, including scaling, schema handling, restarts, serialisation, and more. The Apache Iceberg connector for Kafka Connect was originally created by folk at Tabular and has subsequently been contributed to the Apache Iceberg project (via a brief stint on a Databricks repo following the Tabular acquisition).
