Latest Posts (11 found)
Alex Jacobs 1 month ago

The Case Against pgvector

If you’ve spent any time in the vector search space over the past year, you’ve probably read blog posts explaining why pgvector is the obvious choice for your vector database needs. The argument goes something like this: you already have Postgres, vector embeddings are just another data type, why add complexity with a dedicated vector database when you can keep everything in one place? It’s a compelling story. And like most of the AI influencer bullshit that fills my timeline, it glosses over the inconvenient details.

I’m not here to tell you pgvector is bad. It’s not. It’s a useful extension that brings vector similarity search to Postgres. But after spending some time trying to build a production system on top of it, I’ve learned that the gap between “works in a demo” and “scales in production” is… significant.

What bothers me most: the majority of content about pgvector reads like it was written by someone who spun up a local Postgres instance, inserted 10,000 vectors, ran a few queries, and called it a day. The posts are optimistic, the benchmarks are clean, and the conclusions are confident. They’re also missing about 80% of what you actually need to know. I’ve read through dozens of these posts:

- Understanding Vector Search and HNSW Index with pgvector
- HNSW Indexes with Postgres and pgvector
- Understand Indexes in pgvector
- External Indexing for pgvector
- Exploring Postgres pgvector HNSW Index Storage
- pgvector v0.5.0: Faster semantic search with HNSW indexes
- Early Look at HNSW Performance with pgvector
- Vector Indexes in Postgres using pgvector: IVFFlat vs HNSW
- Vector Database Basics: HNSW Index
- PostgreSQL Vector Indexing with HNSW

They all cover the same ground: here’s how to install pgvector, here’s how to create a vector column, here’s a simple similarity search query. Some of them even mention that you should probably add an index. What they don’t tell you is what happens when you actually try to run this in production.
Let’s start with indexes, because this is where the tradeoffs start. pgvector gives you two index types: IVFFlat and HNSW. The blog posts will tell you that HNSW is newer and generally better, which is… technically true but deeply unhelpful.

IVFFlat (Inverted File with Flat quantization) partitions your vector space into clusters. During search, it identifies the nearest clusters and only searches within those. (Image source: “IVFFlat or HNSW index for similarity search?” by Simeon Emanuilov)

HNSW (Hierarchical Navigable Small World) builds a multi-layer graph structure for search. (Image source: “IVFFlat or HNSW index for similarity search?” by Simeon Emanuilov)

None of the blogs mention that building an HNSW index on a few million vectors can consume 10 GB of RAM or more (depending on your vector dimensions and dataset size). On your production database. While it’s running. For potentially hours.

In a typical application, you want newly uploaded data to be searchable immediately. A user uploads a document, you generate embeddings, insert them into your database, and they should be available in search results. Simple, right? When you insert new vectors into a table with an index, one of two things happens:

- IVFFlat: The new vectors are inserted into the appropriate clusters based on the existing structure. This works, but it means your cluster distribution gets increasingly suboptimal over time. The solution is to rebuild the index periodically. Which means downtime, or maintaining a separate index and doing an atomic swap, or accepting degraded search quality.
- HNSW: New vectors are added to the graph structure. This is better than IVFFlat, but it’s not free. Each insertion requires updating the graph, which means memory allocation, graph traversals, and potential lock contention.

Neither of these is a deal-breaker in isolation. But here’s what happens in practice: you’re inserting vectors continuously throughout the day.
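A quick back-of-the-envelope on the memory claim above. This is a rough sketch under stated assumptions (4 bytes per float32 component, roughly 2×m neighbor links of 8 bytes per vector), not pgvector's actual accounting — real builds also vary with ef_construction, layer structure, and allocator overhead:

```python
def hnsw_build_memory_estimate(rows, dims, m=16):
    """Rough, assumption-laden estimate of HNSW build memory.

    4 bytes per float32 vector component, plus ~2*m neighbor links
    of 8 bytes each per vector for the graph. Illustrative only.
    """
    vector_bytes = rows * dims * 4   # the raw vectors themselves
    graph_bytes = rows * 2 * m * 8   # neighbor lists for the graph
    return vector_bytes + graph_bytes

# 5M OpenAI-sized embeddings (1536 dims): the raw vectors alone
# are ~30 GB before any graph overhead.
est = hnsw_build_memory_estimate(5_000_000, 1536)
```

Even with generous rounding, the point stands: the vectors themselves dominate, and the build wants them resident in memory.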
Each insertion is individually cheap, but the aggregate load adds up. Your database is now handling your normal transactional workload, analytical queries, AND maintaining graph structures in memory for vector search.

Let’s say you’re building a document search system. Users upload PDFs, you extract text, generate embeddings, and insert them. The user expects to immediately search for that document. Here’s what actually happens:

- With no index: The insert is fast, the document is immediately available, but your searches do a full sequential scan. This works fine for a few thousand documents. At a few hundred thousand? Your searches start taking seconds. Millions? Good luck.
- With IVFFlat: The insert is still relatively fast. The vector gets assigned to a cluster. But whoops, a problem. Those initial cluster assignments were based on the data distribution when you built the index. As you add more data, especially if it’s not uniformly distributed, some clusters get overloaded. Your search quality degrades. You rebuild the index periodically to fix this, but during the rebuild (which can take hours for large datasets), what do you do with new inserts? Queue them? Write to a separate unindexed table and merge later?
- With HNSW: The graph gets updated on each insert through incremental insertion, which sounds great. But updating an HNSW graph isn’t free—you’re traversing the graph to find the right place to insert the new node and updating connections. Each insert acquires locks on the graph structure. Under heavy write load, this becomes a bottleneck. And if your write rate is high enough, you start seeing lock contention that slows down both writes and reads.

Here’s the real nightmare: you’re not just storing vectors. You have metadata—document titles, timestamps, user IDs, categories, etc. That metadata lives in other tables (or other columns in the same table). You need that metadata and the vectors to stay in sync.
In a normal Postgres table, this is easy—transactions handle it. But when you’re dealing with index builds that take hours, keeping everything consistent gets complicated.

For IVFFlat, periodic rebuilds are basically required to maintain search quality. For HNSW, you might need to rebuild if you want to tune parameters or if performance has degraded. The problem is that index builds are memory-intensive operations, and Postgres doesn’t have a great way to throttle them. You’re essentially asking your production database to allocate multiple (possibly dozens of) gigabytes of RAM for an operation that might take hours, while continuing to serve queries. You end up with workaround strategies: staging tables with index swaps, double-writing to two indexes, building on replicas, or just accepting eventual consistency. None of these are “wrong” exactly. But they’re all workarounds for the fact that pgvector wasn’t really designed for high-velocity real-time ingestion.

Okay, but let’s say you solve your index and insert problems. Now you have a document search system with millions of vectors. Documents have metadata—maybe they’re marked with a publication status. A user searches for something, and you only want to return published documents. Simple enough. But now you have a problem: should Postgres filter on status first (pre-filter) or do the vector search first and then filter (post-filter)?

This seems like an implementation detail. It’s not. It’s the difference between queries that take 50ms and queries that take 5 seconds. It’s also the difference between returning the most relevant results and… not.

Pre-filter works great when the filter is highly selective (1,000 docs out of 10M). It works terribly when the filter isn’t selective—you’re still searching millions of vectors. Post-filter works when your filter is permissive. Here’s where it breaks: imagine you ask for 10 results filtered to published documents. pgvector finds the 10 nearest neighbors, then applies your filter. Only 3 of those 10 are published.
You get 3 results back, even though there might be hundreds of relevant published documents slightly further away in the embedding space. The user searched, got 3 mediocre results, and has no idea they’re missing way better matches that didn’t make it into the initial k=10 search. You can work around this by fetching more vectors (say, 100) and then filtering, but now you’re doing far more distance calculations, guessing at the right oversampling factor, and still not sure it’s enough. With pre-filter, you avoid this problem, but you get the performance problems I mentioned. Pick your poison.

Now add another dimension: you’re filtering by user_id AND category AND date_range. What’s the right strategy now? The planner will look at table statistics, index selectivity, and estimated row counts and come up with a plan. That plan will probably be wrong, or at least suboptimal, because the planner’s cost model wasn’t built for vector similarity search.

And it gets worse: you’re inserting new vectors throughout the day. Your index statistics are outdated. The plans get increasingly suboptimal until you ANALYZE the table. But ANALYZE on a large table with millions of rows takes time and resources. And it doesn’t really understand vector data distribution in a meaningful way—it can tell you how many rows match a predicate, but not how clustered those vectors are in the embedding space, which is what actually matters for search performance.

You end up with hacks: query rewriting for different user types, partitioning your data into separate tables, CTE optimization fences to force the planner’s hand, or just fetching way more results than needed and filtering in application code. None of these are sustainable at scale.

Dedicated vector databases have solved this. They understand the cost model of filtered vector search and make intelligent decisions: OpenSearch’s k-NN plugin, for example, lets you specify pre-filter or post-filter behavior. Pinecone automatically handles filter selectivity. Weaviate has optimizations for common filter patterns. With pgvector, you get to build all of this yourself.
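To make the post-filter failure mode above concrete, here is a toy sketch in Python. The row format, the "published" status field, and the oversampling factor are illustrative assumptions; a real system would do this in SQL against an index, with the same fundamental problem:

```python
import math

def post_filter_search(query_vec, rows, k=10, oversample=10):
    """Toy post-filter: oversample nearest neighbors, then drop rows
    that fail the metadata filter. You can still come back short."""
    # Take the k * oversample nearest rows by L2 distance (brute force
    # here; an index would return the same candidate set).
    by_distance = sorted(rows, key=lambda r: math.dist(query_vec, r["embedding"]))
    candidates = by_distance[: k * oversample]
    # Apply the metadata filter after the fact. If too few candidates
    # survive, the user silently gets fewer (or worse) results.
    hits = [r for r in candidates if r["status"] == "published"]
    return hits[:k]
```

The guessed `oversample` factor is the whole problem: too low and you return short result sets, too high and you burn distance calculations for nothing.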
Or live with suboptimal queries. Or hire a Postgres expert to spend weeks tuning your query patterns.

Oh, and if you want hybrid search—combining vector similarity with traditional full-text search—you get to build that yourself too. Postgres has excellent full-text search capabilities. pgvector has excellent vector search capabilities. Combining them in a meaningful way? That’s on you. You need to weight vector similarity against text relevance, normalize scores from two different scoring systems, and probably implement something like Reciprocal Rank Fusion. Again, not impossible. Just another thing that many dedicated vector databases provide out of the box.

Timescale has released pgvectorscale, which addresses some of these issues. It adds a more memory-efficient search backend (StreamingDiskANN), better support for incremental index builds, and improved filtering performance. This is great! It’s also an admission that pgvector out of the box isn’t sufficient for production use cases. pgvectorscale is still relatively new, and adopting it means adding another dependency, another extension, another thing to manage and upgrade. For some teams, that’s fine. For others, it’s just more evidence that maybe the “keep it simple, use Postgres” argument isn’t as simple as it seemed.

Oh, and if you’re running on RDS, pgvectorscale isn’t available. AWS doesn’t support it. So enjoy managing your own Postgres instance if you want these improvements, or just… keep dealing with the limitations of vanilla pgvector. The “just use Postgres” simplicity keeps getting simpler.

I get the appeal of pgvector. Consolidating your stack is good. Reducing operational complexity is good. Not having to manage another database is good. But here’s what I’ve learned: for most teams, especially small teams, dedicated vector databases are actually simpler. With a managed vector database (Pinecone, Weaviate, Turbopuffer, etc.), you typically get hybrid search, real-time indexing without memory spikes, horizontal scaling, and monitoring built for vector workloads out of the box. Yes, it’s another service to pay for. But compare that against the cost of over-provisioning your Postgres instance, the engineering time spent tuning queries and managing rebuilds, and the features you’re not shipping. Turbopuffer starts at $64/month with generous limits. For a lot of teams, the managed service is actually cheaper.

pgvector is an impressive piece of technology. It brings vector search to Postgres in a way that’s technically sound and genuinely useful for many applications. But it’s not a panacea.
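On the hybrid-search point: the Reciprocal Rank Fusion mentioned in this post is small enough to sketch. The k=60 constant comes from the original RRF paper; the toy result lists are illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists (e.g. vector search + full-text
    search): score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c"]   # from similarity search
text_hits = ["b", "d", "a"]     # from full-text search
fused = reciprocal_rank_fusion([vector_hits, text_hits])  # ['b', 'a', 'd', 'c']
```

The appeal of RRF is that it only needs ranks, so you sidestep the problem of normalizing two incompatible scoring systems — which is exactly the part pgvector leaves to you.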
Understand the tradeoffs. If you’re building a production vector search system:

- Index management is hard. Rebuilds are memory-intensive, time-consuming, and disruptive. Plan for this from day one.
- Query planning matters. Filtered vector search is a different beast than traditional queries, and Postgres’s planner wasn’t built for this.
- Real-time indexing has costs. Either in memory, in search quality, or in engineering time to manage it.
- The blog posts are lying to you (by omission). They’re showing you the happy path and ignoring the operational reality.
- Managed offerings exist for a reason. There’s a reason that Pinecone, Weaviate, Qdrant, and others exist and are thriving. Vector search at scale has unique challenges that general-purpose databases weren’t designed to handle.

The question isn’t “should I use pgvector?” It’s “am I willing to take on the operational complexity of running vector search in Postgres?” For some teams, the answer is yes. You have database expertise, you need the tight integration, you’re willing to invest the time. For many teams—maybe most teams—the answer is probably no. Use a tool designed for the job. Your future self will thank you.
IVFFlat pros:
- Lower memory footprint during index creation
- Reasonable query performance for many use cases
- Index creation is faster than HNSW

IVFFlat cons:
- Requires you to specify the number of lists (clusters) upfront
- That number significantly impacts both recall and query performance
- The commonly recommended formula is a starting point at best
- Recall can be… disappointing depending on your data distribution
- New vectors get assigned to existing clusters, but clusters don’t rebalance without a full rebuild

HNSW pros:
- Better recall than IVFFlat for most datasets
- More consistent query performance
- Scales well to larger datasets

HNSW cons:
- Significantly higher memory requirements during index builds
- Index creation is slow—painfully slow for large datasets
- The memory requirements aren’t theoretical; they are real, and they’ll take down your database if you’re not careful
Workarounds for rebuilding indexes while serving traffic:
- Write to a staging table, build the index offline, then swap it in (but now you have a window where searches miss new data)
- Maintain two indexes and write to both (double the memory, double the update cost)
- Build indexes on replicas and promote them
- Accept eventual consistency (users upload documents that aren’t searchable for N minutes)
- Provision significantly more RAM than your “working set” would suggest

The problems with oversampling in post-filter search:
- You’re doing way more distance calculations than needed
- You still don’t know if 100 is enough
- Your query performance suffers
- You’re guessing at the right oversampling factor

The filtering strategy question:
- Apply all filters first, then search? (Pre-filter)
- Search first, then apply all filters? (Post-filter)
- Apply some filters first, search, then apply remaining filters? (Hybrid)
- Which filters should you apply in which order?

How dedicated vector databases handle filtered search:
- Adaptive strategies: Some databases dynamically choose pre-filter or post-filter based on estimated selectivity
- Configurable modes: Others let you specify the strategy explicitly when you know your data distribution
- Specialized indexes: Some build indexes that support efficient filtered search (like filtered HNSW)
- Query optimization: They track statistics specific to vector operations and optimize accordingly

What hybrid search requires you to build:
- Decide how to weight vector similarity vs. text relevance
- Normalize scores from two different scoring systems
- Tune the balance for your use case
- Probably implement Reciprocal Rank Fusion or something similar

What pgvectorscale adds:
- StreamingDiskANN, a new search backend that’s more memory-efficient
- Better support for incremental index builds
- Improved filtering performance

What managed vector databases typically provide:
- Intelligent query planning for filtered searches
- Hybrid search built in
- Real-time indexing without memory spikes
- Horizontal scaling without complexity
- Monitoring and observability designed for vector workloads

The cost comparison:
- The cost of a managed vector database for your workload, vs.
- the cost of over-provisioning your Postgres instance to handle index builds, vs.
- the engineering time to tune queries and manage index rebuilds, vs.
- the opportunity cost of not building features because you’re fighting your database

Alex Jacobs 6 months ago

A Production Framework for LLM Feature Evaluation

After several years of integrating LLMs into production systems, I’ve observed a consistent pattern: the features that deliver real value rarely align with what gets attention at conferences. While the industry focuses on AGI and emergent behaviors, the mundane applications—data extraction, classification, controlled generation—are quietly transforming how we build software.

This post presents a framework I’ve developed for evaluating LLM features based on what actually ships and scales. It’s deliberately narrow in scope, focusing on patterns that have proven reliable across multiple deployments rather than exploring the theoretical boundaries of what’s possible. Through trial, error, and more error, I’ve found that LLMs consistently excel in three specific areas. When I’m evaluating a potential AI feature, I ask: “Does this clearly fit into one of these categories?” If not, it’s probably not worth pursuing (yet).

The first area is data extraction. This is the unsexy workhorse of AI features. Think of it as having an intelligent data entry assistant who never gets tired of parsing messy inputs. It’s valuable because humans hate data entry, traditional parsing is brittle, and LLMs handle ambiguity and format variations gracefully. Real examples I’ve built include a PDF-to-JSON converter, an API response mapper, and a customer feedback analyzer.

The key insight here is that LLMs excel at handling structural variance and ambiguity—the exact things that make traditional parsers brittle. A single well-crafted prompt can replace hundreds of lines of mapping logic, regex patterns, and edge case handling. The model’s ability to understand intent rather than just pattern match is what makes this category so powerful.

Production considerations: For high-volume extraction from standardized formats, purpose-built services like Reducto offer better economics and reliability than raw LLM calls. These platforms have already solved for edge cases around OCR quality, table extraction, and format variations. The build-vs-buy calculation here typically favors buying unless you have unique requirements or scale that justifies the engineering investment.
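One pattern worth making concrete is the validation layer that usually sits between the model's extraction output and your data model. The field names, schema, and error handling below are illustrative assumptions, not from the post:

```python
import json

# Hypothetical schema for an extracted form; your fields will differ.
REQUIRED_FIELDS = {"name", "email", "amount"}

def parse_extraction(raw_llm_output):
    """Validate an LLM extraction response before it touches your
    data model: parse the JSON, then check required fields."""
    data = json.loads(raw_llm_output)
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        # Fail loudly rather than storing a partial record.
        raise ValueError(f"extraction missing fields: {sorted(missing)}")
    return data

record = parse_extraction('{"name": "Ada", "email": "ada@example.com", "amount": 42}')
```

The point is that the prompt replaces the parser, but not the validation: probabilistic output still needs a deterministic gate before it enters your system.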
The second area is controlled generation. This is probably what most people think of when they hear “AI features,” but the key is being specific about the use case. It’s valuable because it reduces cognitive load on users, provides consistent quality and tone, and can synthesize large amounts of information quickly. Real examples I’ve built include smart report generation, a meeting summarizer, and a documentation assistant.

The critical lesson here is that unconstrained generation is rarely what you want in production. Effective generation features require explicit boundaries: output structure, length constraints, tone guidelines, and forbidden topics. The challenge isn’t getting the model to generate—it’s getting it to generate within your specific constraints reliably. This is where prompt engineering transitions from art to engineering: defining schemas, enforcing structural requirements, and building validation layers. The most successful generation features I’ve seen treat the LLM as one component in a larger pipeline, not a magic box.

The third area is classification. This is where LLMs really shine compared to traditional ML. What used to require thousands of labeled examples and complex training pipelines can now be done with a well-crafted prompt. It’s valuable because you need no labeled training data, the model handles edge cases and ambiguity, and categories can be adjusted without retraining.

The architectural advantage here is profound: you’re essentially defining classifiers declaratively rather than imperatively. No training data, no model selection, no hyperparameter tuning—just clear descriptions of your categories. The model’s pre-trained understanding of language and context does the heavy lifting. This fundamentally changes the iteration cycle. Adding a new category or adjusting definitions happens in minutes, not weeks. The trade-off is less fine-grained control over the decision boundary, but for most business applications, this is a feature, not a bug.

Scaling considerations: Production deployments require structured output guarantees, prompt optimization, and solid evals and observability.

Let me save you some pain by sharing what consistently fails. LLMs are great at general knowledge but terrible at specialized domains without extensive context. If you need deep expertise, you still need experts. Sub-100ms response times and high-frequency calls remain outside the practical envelope for LLM applications.
The latency floor of current models, even with optimizations like speculative decoding, makes them unsuitable for hot-path operations. LLMs are probabilistic. If you need 100% accuracy (financial calculations, legal compliance, etc.), use traditional code.

When someone comes to me with an AI feature idea, I check it against the categories and failure modes above. For teams evaluating their first LLM feature, I recommend starting with categorization. The reasoning is purely pragmatic: it has the clearest evaluation metrics, the most forgiving failure modes, and provides immediate value. You can validate the approach with a small dataset and scale incrementally. The implementation complexity is also minimal—you’re essentially building a discriminator rather than a generator, which sidesteps many of the challenges around hallucination, output formatting, and content safety. Most importantly, when classification confidence is low, you can gracefully fall back to human review without breaking the user experience.

The gap between AI demos and production systems remains vast. The features that succeed in production share a common trait: they augment existing workflows rather than attempting to replace them entirely. They handle the tedious, error-prone tasks that humans perform inconsistently, freeing cognitive capacity for higher-value work. This isn’t a limitation—it’s the current sweet spot for LLM applications. The technology excels at tasks that are simultaneously too complex for traditional automation but too mundane to justify human attention. Understanding this paradox is key to building AI features that actually ship.

Why data extraction is valuable:
- Humans hate data entry
- Traditional parsing is brittle and breaks with slight format changes
- LLMs can handle ambiguity and variations gracefully

Extraction examples:
- PDF to JSON converter: Taking uploaded forms (PDFs, images, even handwritten docs) and extracting structured data. What used to require complex OCR pipelines and regex nightmares now works with a simple prompt.
- API response mapper: Taking inconsistent third-party API responses and mapping them to your internal data model. Every integration engineer’s nightmare—different field names, nested structures that change randomly, optional fields that are sometimes null and sometimes missing entirely.
- Customer feedback analyzer: Extracting actionable insights from the stream of unstructured feedback across emails, Slack, support tickets. Automatically pulling out feature requests, bug reports, severity, and sentiment. What used to be a PM’s full-time job.

Why generation is valuable:
- Reduces cognitive load on users
- Provides consistent quality and tone
- Can process and synthesize large amounts of information quickly

Generation examples:
- Smart report generation: Taking raw data and generating human-readable reports with insights and recommendations.
- Meeting summarizer: Processing transcripts to extract key decisions, action items, and important discussions.
- Documentation assistant: Generating first drafts of technical documentation from code comments and README files.

Why classification is valuable:
- No need for labeled training data
- Can handle edge cases and ambiguity
- Easy to adjust categories without retraining

What production deployments require:
- Structured output guarantees: Libraries like Pydantic AI and Outlines enforce schema compliance at the token generation level, eliminating post-processing failures.
- Prompt optimization: DSPy and similar frameworks apply optimization techniques to prompt engineering, treating it as a learnable parameter rather than a manual craft.
- Evals, Observability, and Error Analysis: This could and will likely eventually be its own post.
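The declarative-classifier pattern and the human-review fallback recommended above can be sketched in a few lines. The category names, descriptions, prompt wording, and confidence threshold are all illustrative assumptions:

```python
# Hypothetical categories; adjusting one is a one-line edit, no retraining.
CATEGORIES = {
    "bug_report": "The user describes broken or incorrect behavior.",
    "feature_request": "The user asks for new functionality.",
    "question": "The user asks how to accomplish something.",
}

def build_classifier_prompt(text):
    """The 'classifier' is just category descriptions in a prompt."""
    lines = [f"- {name}: {desc}" for name, desc in CATEGORIES.items()]
    return (
        "Classify the message into exactly one category:\n"
        + "\n".join(lines)
        + f"\n\nMessage: {text}\nCategory:"
    )

def route_classification(label, confidence, threshold=0.8):
    """Graceful degradation: low-confidence labels go to a human
    instead of being acted on automatically."""
    return ("auto", label) if confidence >= threshold else ("human_review", label)

prompt = build_classifier_prompt("The export button crashes the app")
```

How you obtain the confidence value (logprobs, self-reported score, a verifier pass) is its own design decision; the routing logic stays the same either way.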

Alex Jacobs 8 months ago

A Computer Made This

The past 24 hours have had me navigating an existential crisis while simultaneously being gaslit by friends, family, and colleagues about what’s going on. And that’s probably fair of them—I have a tendency to overreact to things, to be a bit dramatic. I am 100% the guy in this panel right now. But 4o image generation is insane.

I’ve been working in the LLM space since before ChatGPT shifted everything. I’ve closely followed the progress. I test every new release. I tell my friends that every AI app they send me is slop. I am not easily impressed. But this feels like another ChatGPT moment. This isn’t just better distribution (is hiding your state-of-the-art model in a Discord chat behind /commands really the best way to get people to use it?). This feels foundational. It’s not just a better diffusion model—it’s actual reasoning in pixel space.

I’ve been on the fence about whether AGI (whatever that even means) is possible. Can we actually bottle intelligence into an electric rock? But it doesn’t take much napkin math to pencil this out a few years. (True believers might ask where I’ve been, but rest easy brethren—I am yours now.) It brings to mind this 100% real needlepoint of an Ilya Sutskever quote.

Trying to game out the second- and third-order effects of an image generation model feels strange, even dumb. Infinite Ghibli? What are you worried about? Ghibli gonna take all the jobs? If I’m a graphic designer, it is over for me today. But I’m not—I’m a software engineer, my job is safe! I have felt like Chicken Little screaming into the void about the computers coming for a few years now. Today I feel some combination of awe and dread. (This probably ends with most of us becoming electricians so we can wire up the data centers.)

I won’t even try to get into what the post-reality-filter stage of society we’re about to enter looks like, or what this will do to the meme economy. (My parents already can’t tell the difference between AI-generated and real images.
Maybe I can’t either.) I think this tweet (shared with me by a friend this morning) just about sums it up. But to that, I say:

Below is a series I’ve been working on to try and demonstrate this phenomenon I’m experiencing. My fiancée and I got engaged last October, and we captured an amazing photo (maybe my favorite picture ever). So I’ve been trying to recreate it in every style possible. The consistency of this model is incredible—and the content filters are tuned to low right now. I won’t be surprised if by the time you’re reading this, most of these styles will be blocked (already seems to be happening :/ )

Our original photo:
Lego:
Claymation:
Sesame Street:
Scooby-Doo:
Neon Sign:
Tim Burton:
Hey Arnold:
Victorian Botanical Print:
Wes Anderson:
Pixar:
Vintage Comic:
Peanuts:
“Yellow Submarine Family”:
Construction Paper:
Architectural Blueprint:
Medieval Manuscript:
Street Art Stencil:
Pixelated Video Game:
1960s Style Cartoon:
Stop Motion:
And finally, Ghibli:

Other models could already do this! No, they couldn’t.

Alex Jacobs 9 months ago

RAG: From Context Injection to Knowledge Integration

Retrieval-Augmented Generation (RAG) has rapidly become a cornerstone in the practical application of Large Language Models (LLMs). Its promise is compelling: to expand LLMs beyond their training data by connecting them to external knowledge sources – from enterprise databases and real-time data streams to proprietary knowledge bases. The allure of RAG lies in its apparent simplicity – augment the LLM’s input context with retrieved information, and witness enhanced output quality. However, beneath this layer of simplicity lies a more complex reality: it’s a bit of a hack. RAG only works because LLMs are generally robust. The more you think about it, the clearer it becomes that it shouldn’t really work, and should serve only as a stepping stone to a new paradigm.

At their core, LLMs are generative models that produce text by navigating through a high-dimensional latent space. During pre-training on large datasets, these models learn to map language into this space, capturing relationships between words, phrases, and concepts. Text generation isn’t a simple lookup process - it’s a sequential operation where the model predicts each token based on both the previous context and its learned representations.

RAG changes this core process significantly. Rather than relying only on the model’s learned representations, RAG injects external information directly into the context window alongside the user’s query. While this works well in practice, it raises important questions about the theoretical and architectural implications:

- Impact on Generation Quality: How does inserting external information affect the model’s learned generation process? Does mixing training-derived and retrieved information create inconsistencies in the model’s outputs?
- Information Integration: Can the model effectively combine information from different sources during generation? Or is it simply stitching together pieces without truly understanding how they relate?
- Architectural Fitness: Are transformer architectures and their training objectives actually suited for combining retrieved information with generation? Or are we forcing an approach that doesn’t align with how these models were designed to work?

These theoretical concerns manifest in several practical ways.

Current RAG implementations often struggle with:
- Abrupt transitions between retrieved content and generated text
- Inconsistent voice and style when mixing sources
- Difficulty maintaining coherent reasoning across retrieved facts
- Limited ability to synthesize information from multiple sources

The transformer’s attention mechanism faces significant challenges:
- Managing attention across disconnected chunks of information
- Balancing focus between query, retrieved content, and generated text
- Handling potentially contradictory information from different sources
- Maintaining coherence when dealing with multiple retrieved documents

RAG systems often struggle to resolve conflicts between:
- The model’s pretrained knowledge
- Retrieved information
- Different retrieved sources
- User queries and retrieved content

Recent research and development suggest several promising directions for addressing these limitations.

Future systems might:
- Process retrieved information before injection
- Maintain explicit source tracking throughout generation
- Use structured knowledge representations
- Implement hierarchical attention mechanisms

Advanced approaches could:
- Evaluate source reliability and relevance
- Resolve conflicts between sources
- Maintain provenance information
- Generate explicit citations and references

New architectures might include:
- Dedicated pathways for retrieved information
- Specialized attention mechanisms for source integration
- Dynamic context window management
- Explicit fact-checking mechanisms

Anthropic’s Citations API represents a significant step beyond traditional RAG implementations. While the exact implementation details aren’t public, we can make informed speculations about its architectural innovations based on the capabilities it demonstrates. The Citations API likely goes beyond simple prompt engineering to include fundamental architectural changes:

Enhanced Context Processing
- Specialized attention mechanisms for source document processing
- Dedicated layers for maintaining source awareness throughout generation
- Architectural separation between query processing and source document handling
- Advanced chunking and document representation strategies

Citation-Aware Generation
- Built-in tracking of source-claim relationships
- Automatic detection of when citations are needed
- Dynamic weighting of source relevance
- Real-time fact verification against sources

Training Innovations
- Custom loss functions for citation accuracy
- Source fidelity metrics during training
- Explicit training for source grounding
- Specialized datasets for citation learning

The system likely employs several key mechanisms:

Dual-Stream Processing
- Separate processing paths for user queries and source documents
- Specialized attention heads for citation tracking
- Fusion layers for combining information streams
- Dynamic context management

Source Integration
- Fine-grained document chunking
- Semantic similarity tracking
- Citation boundary detection
- Provenance preservation

Training Approach
- Multi-task training combining generation and citation
- Custom datasets focused on source grounding
- Citation-specific loss functions
- Source fidelity metrics

The Citations API and similar emerging technologies point to a future where knowledge integration isn’t just an add-on but a core capability of language models. This evolution requires moving beyond simply stuffing context windows with retrieved documents toward architectures specifically designed for knowledge-aware generation.

The next generation of these systems will likely feature:
- Native citation capabilities
- Real-time fact verification
- Seamless source integration
- Dynamic knowledge updates
- Explicit handling of source conflicts

As we move forward, the goal isn’t to patch the limitations of current RAG systems but to fundamentally rethink how we combine language models with external knowledge. This might lead to entirely new architectures specifically designed for knowledge-enhanced generation, moving us beyond the current paradigm of context window injection toward truly integrated knowledge-aware AI systems.

Alex Jacobs 1 year ago

Deep Dive into Python Async Programming

In the ever-evolving landscape of software development, responsiveness and efficiency are paramount. Modern applications, especially those dealing with network requests, user interfaces, or concurrent operations, often demand the ability to handle multiple tasks seemingly at the same time. For a long time, traditional threading and multiprocessing were the go-to solutions in Python for achieving concurrency. However, Python’s Global Interpreter Lock (GIL) and the overhead associated with thread management can sometimes limit the effectiveness of these approaches, especially for I/O-bound tasks. Enter asynchronous programming. Async Python offers a powerful paradigm for writing concurrent code that is both efficient and, arguably, more intuitive for certain types of applications. If you’ve encountered the keywords async and await in Python and found yourself intrigued but also slightly mystified, you’re not alone. While the surface-level syntax of async Python might seem straightforward, understanding what’s happening beneath the hood is crucial for truly leveraging its power and avoiding common pitfalls. This post is your comprehensive guide to demystifying async Python. We’ll go far beyond the basic syntax, diving deep into the core concepts that underpin asynchronous programming in Python. We’ll explore the event loop, coroutines, tasks, futures, context switching, and even touch upon how async compares to traditional threading and parallelism. By the end of this journey, you’ll not only be able to write async Python code, but you’ll also possess a solid understanding of the mechanisms that make it all tick. Let’s start with the syntax that you’ll encounter most frequently when working with async Python: the async and await keywords. These two are the fundamental building blocks for writing asynchronous code in Python. In Python, you declare a function as asynchronous by using the async def syntax instead of the regular def. This seemingly small change has profound implications.
An async function, also known as a coroutine function, doesn’t execute like a regular synchronous function. Instead, calling it returns a coroutine object. Think of a coroutine object as a promise of work to be done later. It’s not the work itself, but rather a representation of that work, ready to be executed when the time is right. In this example, when we call our coroutine function, it doesn’t immediately print “Hello from async function!”. Instead, it creates and returns a coroutine object. To actually execute the code within the coroutine, we need to use asyncio.run(), which sets up and runs an event loop (more on event loops shortly) to manage the execution of our coroutine. The await keyword is the other half of the async duo. It can only be used inside async functions. await marks the point where an asynchronous function can pause its execution and yield control back to the event loop. This is the crucial mechanism that enables concurrency in async Python without relying on threads. When you await something, you are essentially saying: “I need to wait for this asynchronous operation to complete. While I’m waiting, I’m going to yield control back to the event loop so it can work on other tasks. Once this operation is done, please resume my execution from right here.” In our example, asyncio.sleep(1) is a simulated asynchronous operation that represents waiting for 1 second. During this second, the coroutine pauses, and the event loop is free to execute other coroutines or handle other events. Once the sleep duration is over, the event loop resumes the coroutine from the line after the await statement, and it prints “Async function finished.” Important Note: You can only await objects that are awaitable. In practice, this usually means you’re awaiting other coroutines, Task or Future objects (which we’ll discuss later), or objects that implement the special __await__ method. Standard synchronous functions are not awaitable. At the heart of async Python lies the event loop. Think of the event loop as the central conductor of an orchestra.
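A minimal sketch of the example described above; the function name say_hello and the return value are assumptions, since the post's original snippet isn't shown:

```python
import asyncio

async def say_hello():
    print("Hello from async function!")
    await asyncio.sleep(1)  # pause here; the event loop is free to run other work
    print("Async function finished.")
    return "done"           # returned so the sketch shows asyncio.run's result

coro = say_hello()          # calling it only creates a coroutine object
result = asyncio.run(coro)  # the event loop actually executes it
```

Note that `say_hello()` by itself does nothing observable; the work only happens once the coroutine object is driven by the event loop.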
It’s responsible for managing and scheduling the execution of all your asynchronous tasks. It’s a single-threaded loop that constantly monitors for events and dispatches tasks to be executed when those events occur. Python’s standard library provides the asyncio module, which includes a built-in event loop implementation. This default event loop is written in Python and is generally sufficient for many use cases. However, for performance-critical applications, especially those dealing with high-performance networking, you might consider using alternative event loop implementations. One popular option is uvloop: a blazing-fast, drop-in replacement for asyncio’s event loop. It’s written in Cython and built on top of libuv, the same high-performance library that powers Node.js. uvloop is significantly faster than the default event loop, especially for network I/O. To use uvloop, you typically need to install it separately (pip install uvloop) and then set it as the event loop policy when your application starts. Other event loop implementations exist, but asyncio’s default loop and uvloop are the most commonly used in Python async programming. Choosing between them often depends on the performance requirements of your application. For most general async tasks, asyncio’s default loop is perfectly adequate. For high-load network applications, uvloop can provide a noticeable performance boost. To truly understand async Python, it’s helpful to think about it in layers. We’ve already touched upon coroutines and the event loop. Let’s now delve into the roles of Tasks and Futures. As we discussed earlier, coroutines are the asynchronous functions you define using async def. They represent units of asynchronous work. Coroutines themselves are not directly executed by the event loop. Instead, they need to be wrapped in something that the event loop can manage and schedule. This “something” is a Task. A Task in asyncio is essentially a wrapper around a coroutine that allows the event loop to schedule and manage its execution.
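A sketch of the uvloop setup described above, wrapped in a try/except so it degrades gracefully to the default loop when uvloop isn't installed:

```python
import asyncio

try:
    import uvloop  # assumed installed via `pip install uvloop`
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
except ImportError:
    pass  # fall back to asyncio's built-in event loop

async def main():
    await asyncio.sleep(0.01)
    # report which event loop implementation is actually running
    return type(asyncio.get_running_loop()).__name__

loop_name = asyncio.run(main())
print(loop_name)  # e.g. a selector-based loop name under the default policy
```

The rest of your async code is unchanged; swapping the policy is the entire migration.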
When you want to run a coroutine concurrently within the event loop, you typically create a Task from it. You can create a Task using asyncio.create_task(). In this example, we create two Tasks from the same coroutine function. asyncio.create_task() schedules these coroutines to be run by the event loop concurrently. When we await the two Tasks, we are waiting for them to complete and retrieving their results. Tasks are essential for managing the lifecycle of coroutines within the event loop. They provide methods to cancel a Task, check whether it is done, retrieve its result, and inspect any exception it raised. A Future is an object that represents the eventual result of an asynchronous operation. It’s a placeholder for a value that might not be available yet. Tasks in asyncio are actually a subclass of Futures. Futures are used extensively in async Python to represent the outcome of operations that are performed asynchronously, such as network I/O, file I/O, and concurrent computations. A Future object has a state that can be pending, running, done, or cancelled. You can interact with a Future to check whether it’s done, retrieve its result or exception, attach callbacks, or cancel it. When you await a Task (or any awaitable Future-like object), you are essentially waiting for that Future to become “done” and then retrieving its result or handling any exceptions. Now, let’s delve deeper into how coroutines are actually executed and how context switching works in async Python. Async Python uses cooperative multitasking. This is in contrast to preemptive multitasking used by operating systems for threads and processes. This cooperative nature has important implications: there is no true parallelism within a single event loop, but you get excellent responsiveness for I/O-bound work, lower context-switching overhead, and more deterministic behavior than preemptive threading. When a coroutine reaches an await statement, it yields control to the event loop, is registered as waiting on the awaited operation, and is resumed at the same point once that operation completes. This pause-and-resume mechanism is what allows asynchronous code to be written in a seemingly sequential style, even though it’s actually being executed in an interleaved and non-blocking manner. It’s crucial to understand when async Python is the right choice and when traditional threading or multiprocessing might be more appropriate. Async Python excels in scenarios where your application is I/O-bound. This means that the primary bottleneck is waiting for external operations to complete, such as network requests, file I/O, or waiting for user input. In these cases, the CPU is often idle while waiting for I/O operations.
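The two-Task pattern described above might look roughly like this; the names fetch, t1, and t2 are assumptions standing in for the post's missing snippet:

```python
import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)  # simulated I/O wait
    return f"{name} finished"

async def main():
    # Wrap coroutines in Tasks so the event loop runs them concurrently
    t1 = asyncio.create_task(fetch("task1", 0.05))
    t2 = asyncio.create_task(fetch("task2", 0.05))
    # Awaiting the Tasks waits for completion and retrieves their results
    return await t1, await t2

r1, r2 = asyncio.run(main())
```

Because both Tasks are scheduled before either is awaited, the two simulated waits overlap instead of running back to back.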
Async Python allows you to utilize this idle time by letting other coroutines run while one is waiting for I/O. It’s highly efficient for handling many concurrent I/O operations with minimal overhead. Example: Web Server. A web server that handles many concurrent requests is a classic example where async Python shines. While one request is being processed (which often involves waiting for database queries, external API calls, etc.), the server can be handling other requests concurrently. This example uses aiohttp, an async HTTP client and server library. The handler coroutine performs an asynchronous HTTP request, using aiohttp to fetch data without blocking the server. The server can handle many requests concurrently, making it highly scalable for I/O-bound web applications. Threads, especially when used with Python’s threading module, are suitable for tasks that are more CPU-bound and can benefit from concurrency (even if not true parallelism due to the GIL). CPU-Bound Tasks: Tasks that spend most of their time performing computations on the CPU, rather than waiting for I/O. Examples include image processing, numerical computations, data analysis, and cryptographic operations. Concurrency (with GIL limitations): Python’s GIL (Global Interpreter Lock) prevents true parallelism for CPU-bound tasks in standard CPython threads. Only one thread can hold the Python interpreter lock at any given time. However, threads can still provide concurrency by releasing the GIL during I/O operations or certain blocking system calls. This can improve responsiveness even for CPU-bound tasks if they involve some I/O or blocking. Example: CPU-Bound Computation (with threading for concurrency). In this example, a worker function simulates a CPU-intensive operation. We create multiple threads to run this task concurrently. While the GIL limits true parallelism for CPU-bound Python code, threads can still provide some concurrency and potential performance improvement, especially if the tasks involve some I/O or blocking operations. For purely CPU-bound tasks, however, the benefits might be limited by the GIL.
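A sketch of the threading example described above, under the assumption that the CPU-intensive work is a simple sum of squares (the post's actual workload isn't shown):

```python
import threading

def cpu_task(n, results, index):
    # Simulated CPU-intensive work; the GIL serializes this pure-Python loop
    results[index] = sum(i * i for i in range(n))

results = [None] * 4
threads = [
    threading.Thread(target=cpu_task, args=(50_000, results, i))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for all workers to finish
```

All four threads run "concurrently", but since the work never releases the GIL, wall-clock time is roughly the same as running the four calls sequentially.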
For truly CPU-bound and computationally intensive tasks that need true parallelism and to bypass the GIL limitations, multiprocessing using Python’s multiprocessing module is the way to go. Example: CPU-Bound Computation (with multiprocessing for parallelism). In this multiprocessing example, we create separate processes to run the same CPU-bound task. Because each process has its own interpreter and bypasses the GIL, we can achieve true parallelism and significantly speed up CPU-intensive computations on multi-core systems. Async Python offers a powerful and elegant way to write concurrent code, particularly for I/O-bound applications. Understanding the underlying mechanisms – the event loop, coroutines, tasks, futures, and cooperative multitasking – is key to effectively leveraging its benefits. While async Python is not a silver bullet for all concurrency problems, and it’s not a direct replacement for threading or multiprocessing in all cases, it provides a compelling and often more efficient alternative for many modern application scenarios. By mastering async Python, you gain a valuable tool in your development arsenal, enabling you to build responsive, scalable, and performant applications in the asynchronous world. So, embrace the async and await duo, dive into the event loop, and unlock the power of asynchronous programming in Python!

The event loop’s core cycle works like this:

- Task Queue: The event loop maintains a queue of tasks (usually coroutines wrapped in Task objects) that are ready to be executed or resumed.
- Event Monitoring: The event loop also monitors for various events, such as network sockets becoming ready for reading or writing, timers expiring, or file operations completing. It typically uses efficient system calls like select, epoll, or kqueue (depending on the operating system) to monitor these events without blocking.
- Task Execution and Resumption: When an event occurs that makes a task ready to proceed (e.g., data is available on a socket that a task is waiting to read from), the event loop picks up that task from the queue and executes it until it encounters an await statement.
- Yielding Control with await: When a coroutine reaches an await statement, it effectively tells the event loop, “I need to wait for this operation. Please pause me and let someone else run.” The event loop then takes control and looks for other tasks in the queue that are ready to run.
- Resuming Execution: Once the awaited operation completes (e.g., the network request returns, the timer expires), the event loop is notified. It then puts the paused coroutine back into the task queue, ready to be resumed at the point where it left off.
- Looping Continuously: The event loop continues this process of monitoring events, executing tasks, and pausing/resuming coroutines in a loop until there are no more tasks to run or the program is explicitly stopped.

Task methods:

- Cancel a Task: task.cancel()
- Check if a Task is done: task.done()
- Get the result of a Task: task.result() (if done)
- Get exceptions raised during Task execution: task.exception() (if any)

Futures represent asynchronous operations such as:

- Network I/O: Reading data from a socket, sending a request to a server.
- File I/O: Reading or writing to a file (in an async context).
- Concurrent computations: Tasks running in parallel (within the same event loop or in different threads/processes).

A Future’s state can be:

- Pending: The asynchronous operation is still in progress.
- Running: The operation is currently being executed.
- Done: The operation has completed successfully or with an exception.
- Cancelled: The operation has been cancelled.
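A short sketch tying together the Task methods and Future states above: a Task that gets cancelled ends up both done and cancelled.

```python
import asyncio

async def never_finishes():
    await asyncio.sleep(3600)  # would "run" for an hour

async def main():
    task = asyncio.create_task(never_finishes())
    await asyncio.sleep(0)  # give the task a chance to start
    task.cancel()           # request cancellation
    try:
        await task          # raises CancelledError once cancellation lands
    except asyncio.CancelledError:
        pass
    return task.done(), task.cancelled()

done, cancelled = asyncio.run(main())
```

After cancellation, task.result() would raise CancelledError rather than return a value, which is why the sketch inspects done()/cancelled() instead.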
You can interact with a Future to:

- Check if it’s done: future.done()
- Get the result: future.result() (raises an exception if the future isn’t done yet, or re-raises the operation’s exception if one occurred)
- Get exceptions: future.exception() (returns the exception if one occurred, otherwise None)
- Add callbacks: future.add_done_callback() (run a function when the future is done)
- Cancel the future: future.cancel()

- Preemptive Multitasking (Threads/Processes): In preemptive multitasking, the operating system’s scheduler decides when to switch between threads or processes. It can interrupt a running thread/process at any time and switch to another, even if the running thread/process doesn’t explicitly yield control. This is typically based on time slices and priority levels.
- Cooperative Multitasking (Async Python): In cooperative multitasking, coroutines voluntarily yield control back to the event loop when they encounter an await statement. The event loop then decides which coroutine to run next. Context switching only happens at these explicit points. A coroutine will continue to run until it reaches an await or completes.

- No True Parallelism (within a single event loop): Within a single event loop running in a single thread, true parallelism is not achieved. Coroutines take turns running. If a coroutine doesn’t await frequently and performs long-running CPU-bound operations, it can block the event loop and prevent other coroutines from making progress.
- Responsiveness: Cooperative multitasking is excellent for I/O-bound tasks. While one coroutine is waiting for I/O, another can run, keeping the application responsive.
- Less Overhead: Context switching in cooperative multitasking is generally lighter than preemptive context switching between threads or processes. There’s less operating system overhead involved.
- Deterministic Behavior (mostly): Because context switching happens only at explicit points, the execution flow of async code is often more predictable and easier to reason about compared to multithreaded code, which can have race conditions and unpredictable scheduling.
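To make the Future interface described above concrete, here's a minimal sketch using asyncio's low-level Future API (the 42 payload is, of course, made up):

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    future = loop.create_future()  # starts out pending: no result yet
    # Arrange for the future to be resolved shortly, as real I/O code would
    loop.call_later(0.01, future.set_result, 42)
    value = await future           # suspends this coroutine until the future is done
    return value, future.done()

value, done = asyncio.run(main())
```

Awaiting the future is what hands control back to the event loop; the call_later callback then fires, and the coroutine resumes with the result.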
- Expression: The expression after await (e.g., another coroutine, a Task, or a Future) must be awaitable.
- Yielding Control: The coroutine effectively “pauses” its execution at the await point. It returns control back to the event loop.
- Event Loop Takes Over: The event loop becomes active again. It looks at its task queue for other tasks that are ready to run.
- Registering for Resumption: The coroutine, along with information about where it paused (the line after the await), is registered with the event loop as being “waiting” for the completion of the awaited operation.
- Awaited Operation Proceeds: The awaited operation (e.g., network request, timer) proceeds asynchronously in the background (often managed by non-blocking system calls).
- Event Notification: When the awaited operation is complete, the event loop receives a notification (e.g., socket becomes readable, timer expires).
- Resuming the Coroutine: The event loop puts the paused coroutine back into the task queue, marked as ready to be resumed.
- Coroutine Resumes: When the event loop gets around to executing this coroutine again, it resumes from the exact point where it was paused (right after the await statement). It now has access to the result of the awaited operation (if any).

Typical I/O-bound workloads:

- Network requests: Fetching data from APIs, making HTTP requests, communicating with databases over a network.
- File I/O: Reading and writing to files (especially over a network file system).
- Waiting for user input: In GUI applications or interactive systems.

- CPU-Bound Tasks: Tasks that spend most of their time performing computations on the CPU, rather than waiting for I/O. Examples include image processing, numerical computations, data analysis, and cryptographic operations.
- Concurrency (with GIL limitations): Python’s GIL (Global Interpreter Lock) prevents true parallelism for CPU-bound tasks in standard CPython threads. Only one thread can hold the Python interpreter lock at any given time.
However, threads can still provide concurrency by releasing the GIL during I/O operations or certain blocking system calls. This can improve responsiveness even for CPU-bound tasks if they involve some I/O or blocking.

- True Parallelism: Multiprocessing creates separate processes, each with its own Python interpreter and memory space. Processes run in parallel on multiple CPU cores, achieving true parallelism for CPU-bound tasks.
- Bypassing the GIL: Each process has its own GIL, so the GIL limitation of threads is overcome.
- Higher Overhead: Process creation and inter-process communication have more overhead compared to threads or async tasks. Processes consume more system resources (memory, process management overhead).

General guidelines:

- I/O-Bound, High Concurrency: Async Python (asyncio) is often the best choice.
- CPU-Bound with some I/O, Responsiveness: Threads (threading) can be considered, but be mindful of GIL limitations for pure CPU-bound tasks.
- CPU-Bound, True Parallelism, Max Performance: Processes (multiprocessing) are essential, especially for computationally intensive tasks on multi-core machines.
- Hybrid Applications: You can combine async and multiprocessing. For example, use async for handling network I/O and multiprocessing for CPU-bound background tasks.

Alex Jacobs 2 years ago

Mastering Integration Testing with FastAPI

Follow along with all the code here If you’re a backend developer working with FastAPI, you already know the framework excels at simplifying API development with features like async support and automated Swagger docs. However, when it comes to integration testing–particularly mocking external services like MongoDB, AWS S3, and third-party APIs—the waters can get murky. This post is your guide to navigating these complexities. This isn’t a one-stop-shop for all things testing; the focus is squarely on integration testing within FastAPI. Prerequisites: Familiarity with FastAPI, PyTest, MongoDB, and AWS S3 is assumed. If you’re new to any of these technologies, you may want to get up to speed before proceeding. Integration testing involves combining individual units of code and testing them as a group. This type of testing aims to expose faults in the interactions between integrated units. In our context, we could also probably call these API tests (and you’ll see that’s how I name my test files). In the context of FastAPI, these units often involve your API endpoints, external databases like MongoDB, and other services such as AWS S3. When compared to unit tests, integration tests in FastAPI present unique difficulties. These challenges mostly stem from the interaction with external dependencies. Mocking these dependencies is not always straightforward, and mistakes can lead to false positives or negatives, undermining the purpose of the tests. So, why focus on integration testing in FastAPI? Because it’s here that you validate that various services and databases work in harmony. By ensuring that these integrated units function as expected, you not only increase code reliability but also save debugging time in the long run. Unit tests are great for testing individual units of code, but they don’t test the interactions between these units. 
I also find integration tests (especially this kind, which test our endpoints) are particularly useful before doing a large refactor–going from a v0 to v1 of an app, where unit tests might not make a ton of sense because you’re rewriting all of your ‘units’. Before we delve into the technicalities of mocking various elements, let’s take a quick look at the FastAPI application we’ll be using as a test subject. This app serves as a sandbox for our integration tests, built to showcase different aspects of FastAPI that commonly require mocking in a test environment. Our sample application has a straightforward schema designed for demonstration purposes. Our tests live in a dedicated tests directory, where they are organized according to the application components they evaluate. We employ PyTest for test execution and adhere to its conventions for discovering tests and initial setup. This includes utilizing the conftest.py configuration file for shared hooks and fixtures. As we move through the article, we’ll see that FastAPI provides an easy way to do dependency injection, which is especially useful for our testing scenarios. Using dependency_overrides, we can inject our mocked functions into the FastAPI app, but remember, this only works for endpoints/functions using the Depends syntax. You can read more about dependency injection in FastAPI here . In our FastAPI application, we have endpoints that require authentication. During testing, we don’t want to use real authentication because it would require us to manage real user credentials and tokens. This would complicate our tests and potentially expose sensitive information. Therefore, we mock the authentication process. Before we get to dependency injection or mocking out our services, let’s make sure our auth works. We have simple authentication defined in the project that checks if a user is in the database, if their password matches, and returns a token if they are. (I am not going to focus on how to do authentication, but you can check out the code in the repo if you’re interested.)
For the purposes of this tutorial, our database is just a dictionary with one hard-coded user, along with a login endpoint and some simple tests to verify that the login endpoint works as expected. These initial tests don’t require mocking the authentication process because they are designed to test the endpoint’s basic functionality. They act as a foundational layer upon which we will build more complex test scenarios that require mocking. Now, let’s say we have another endpoint that requires authentication. We’ll use the Depends function to inject our authentication function into our endpoint. Our endpoint returns information about the user logged in. Authentication is handled by our auth function, which is injected into our endpoint using Depends. Since we don’t want to use real authentication in our tests, we’re going to mock this function. I’m going to show multiple ways of doing this so you can choose the one that works best for you. The most standard way of mocking this function is to use the dependency_overrides feature of FastAPI, before we initialize our TestClient. You’ll see we declare a mock function that returns a TokenData object. Then, via dependency_overrides, we ‘inject’ our mock function into our app in place of the real auth function. This code is essentially replacing the auth function in our app with a mock function that returns the TokenData object hardcoded into the function we’re patching with. Then, in our test we pass in the client fixture, and we can see that our test passes. (We really don’t even need to include auth headers, since we’re mocking the auth function, but I’m keeping it here for clarity.) Now, what if we want to mock the auth function in a different way? We can do this by mocking the auth function directly in our test. First, we’ll create a second client fixture that doesn’t patch the auth function. Now we’re going to write a test to verify that auth fails (since we’re not patching the auth function). To make this work, there are two approaches.
The first is to mock the auth function directly in the test. All we’ve done here is moved the code that patches the auth function into the test itself. This could be useful if you wanted to mock the auth function in some tests, but not others (but only wanted to have one client fixture). Next, we’ll do the same thing, but we’re going to use a new fixture and factory function to create our mock auth function. So first, we’ll create a new fixture, and then we’ll use this fixture in our test. Notice that when we’re passing in the fixture, we’re passing in the function itself, not the return value of the function. This is because the fixture is a factory function that returns a function, so we need to pass in the function itself. This may not seem very useful (and for mocking auth, it may not be), but there are plenty of scenarios when you may want to mock a function in some tests, but not others. This is a good way to do that. Our FastAPI application makes external API calls to fetch weather data. During testing, we don’t want to make real API calls because they can be slow and unreliable. Therefore, we mock the external API calls. We mock the external API calls by patching the function in our FastAPI application that is responsible for making the actual API call and returning the weather data. We replace it with a mock function that always returns predefined weather data. But this function doesn’t use dependency injection (the Depends function) in our app, so we have to mock it in a different way. First, let’s look at our endpoint and the weather-fetching function it calls. Pretty simple… we are hitting an external API, though, so we need to mock this out in our tests. We have three ways to mock the external API calls. Using unittest.mock.patch, we patch our function to return ‘sunny’.
We then verify our endpoint returns the expected response (and verify that our mock function was called–this last part probably isn’t really necessary here, but is a good demonstration). Let’s suggest that we may want to test multiple endpoints that call the weather function. We could patch the function in each test, but that’s a lot of code duplication. Instead, we can create a fixture that patches the function. First, we’ll set up our fixture (this is another factory function). We’re using unittest.mock’s Mock and patch helpers in order to replace the function with a mock function that returns ‘rainy’. Note: As before, we’re using a factory function to create our mock function, so we need to pass in the function itself, not the return value of the function. In our test, we just need to again pass in the fixture as an argument. The above example probably isn’t very useful in practice. In the real world, if we’re testing multiple endpoints that call the same function, we probably want to test different scenarios. For example, we may want to test that our endpoint returns the correct response for different weather conditions. We could do this by creating multiple fixtures that patch the function with different mock functions, but this is a lot of code duplication and basically defeats the purpose of using a fixture at all. Instead, we can use a parametrized fixture to pass in values to our fixture to make it behave differently depending on our test. To do this, we’re going to create another fixture, but this one will take a parameter. It essentially wraps our previous factory function, but now we can pass in a parameter to the fixture. And then, when we call our test, we need to decorate it with pytest.mark.parametrize, and pass in the fixture as an argument. I’ve set up two tests here so we can really see how it works. The parametrized fixture is more complicated to set up, but it is incredibly useful in practice when you have to simulate different responses from external APIs.
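Since the post's snippets aren't shown here, this is a self-contained sketch of the unittest.mock.patch technique, with hypothetical stand-ins for the endpoint and the weather function:

```python
from unittest.mock import patch

# Hypothetical stand-ins; the post's real names aren't shown
def get_weather(city: str) -> str:
    raise RuntimeError("would call a real external API")

def weather_endpoint(city: str) -> dict:
    return {"city": city, "weather": get_weather(city)}

# Patch the function in the module where it is looked up
with patch(f"{__name__}.get_weather", return_value="sunny") as mock_get:
    response = weather_endpoint("Paris")

# Verify the mock was actually exercised
mock_get.assert_called_once_with("Paris")
```

The key detail is the patch target: you patch the name where it is *used*, not where it was originally defined.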
Our FastAPI application interacts with MongoDB. During testing, we might not want to (or really be able to in some cases) hit a real database with real/mocked data. Instead, we can mock MongoDB by using the mongomock library, which simulates a MongoDB client. (Note: In my experience, mongomock can be difficult to work with and takes some practice to get working, but we’re going to proceed with it for now.) In our application, we employ context management for database interactions, utilizing Python’s with statement. This demands that the object used within the with statement must implement context management protocols, specifically the __enter__ and __exit__ methods. The mongomock library doesn’t implement these methods, so to align with these requirements, we define a custom MockMongoClient class to wrap our mongomock client. This class mimics the behavior of the actual MongoClient by implementing the __enter__ and __exit__ methods. This ensures compatibility with the existing code that expects a context-managed database client. Using our MockMongoClient class, we can now create our mock database client. We’ll do this in a fixture so we can reuse it in multiple tests. I’m going to initialize two separate fixtures here, one with an empty database and one with some data initialized. (In theory, we should be able to set the scope of the fixture to module or session and just initialize the database once, add data through other tests, and then use that data for testing later, but I haven’t been able to get that working with mongomock. The project seems to have been designed more with unit tests in mind and is not a complete implementation or drop-in replacement. If you know how to make this work, please let me know!) We create the empty fixture first, and then initialize some data in the second. And now we can write our tests. We’ll pass in our fixture and again use the dependency_overrides feature to inject our mock database client into our app (overriding the database-client dependency). We’ve got three simple tests here.
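A sketch of the MockMongoClient wrapper idea; a stand-in class replaces mongomock's client here so the snippet is self-contained (in the real code you would wrap mongomock.MongoClient):

```python
# Stand-in for mongomock.MongoClient, which lacks the context-manager protocol
class FakeMongoClient:
    def close(self):
        self.closed = True

class MockMongoClient:
    """Wraps a client so it can be used in a `with` statement."""
    def __init__(self, client):
        self.client = client

    def __enter__(self):
        return self.client           # hand back the wrapped client

    def __exit__(self, exc_type, exc, tb):
        self.client.close()          # mimic the real MongoClient's cleanup
        return False                 # don't swallow exceptions

with MockMongoClient(FakeMongoClient()) as db_client:
    pass  # application code would query db_client here
```

The application code that does `with get_db_client() as db:` never has to know it received a wrapped mock instead of a real MongoClient.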
First, we test that our endpoint correctly returns a 404 if no user preferences exist (we use our non-initialized fixture for this). Next, we use our initialized fixture to test that our endpoint correctly returns the user preferences. Finally, we test that we can save user preferences (we use our non-initialized fixture for this, but it should work with either). And that’s it! Pretty simple to write the actual tests once we have our mocked database client set up correctly. Our FastAPI application interacts with AWS S3 for storing and retrieving user profile pictures. During testing, we don’t want to use a real S3 bucket because it would require us to manage real AWS resources. This would complicate our tests and potentially incur costs. Therefore, we mock the S3 bucket. (The same patterns used here could be applied to other AWS services.) We mock the S3 bucket by using the mock_s3 decorator from the moto library, which simulates an S3 bucket. This is done in a fixture so we can reuse it in multiple tests. (moto is a great library for mocking AWS services, and you’ll see we are able to set our fixture scope to session, allowing us to maintain bucket state across multiple tests.) This looks similar to our MockMongo fixtures, but this time, rather than pass our fixture into our tests, we pass it into the client fixture. Now we can write our tests. We’ll start by testing that we get the right message if no profile picture exists for the user. Now we’ll test adding one. Finally, we can test getting one. (Notice how this is using the same file we uploaded in the previous test; this is because we’re using the session scope for our mock_s3_bucket fixture.) This was a fairly deep dive. We’ve unraveled the intricacies of integration testing in FastAPI, which is not without its challenges when it comes to mocking external dependencies.
We've gone through a variety of techniques to mock authentication, from simple dependency_overrides to more advanced fixture-based strategies. We've also tackled how to mock external APIs using Python's unittest.mock.patch and pytest's parametrized fixtures. When it comes to databases, MongoDB adds another layer of complexity. We've seen how mongomock can be a useful tool, albeit with its own set of limitations. We crafted custom mock MongoDB clients and fixtures to ease this pain point. As for AWS S3, the moto library proved to be a robust tool, enabling us to mock S3 buckets effectively, even allowing state persistence across tests. The aim has been to arm you with a set of tools and strategies for your FastAPI testing arsenal. Whether it's a simple authentication mock or a more complex external service, you should now be equipped to tackle these head-on. Happy testing.

Alex Jacobs 2 years ago

CheeseGPT

This toy project was originally created for a guest lecture to a Data Science 101 course (and its quality may reflect that :) This post extends that lecture and is designed to provide a high-level understanding and example of Retrieval-Augmented Generation (RAG). We'll go through the steps of creating a RAG-based LLM system, explaining what we're doing along the way, and why. You can follow along with the slides and code here.

CheeseGPT combines Large Language Models (LLMs) with the advanced capabilities of Retrieval-Augmented Generation (RAG). At its core, CheeseGPT uses OpenAI's GPT-4 model for natural language processing. This model serves as the backbone for generating human-like text responses. However, what sets CheeseGPT apart is its integration with LangChain and a Redis database containing all of the information on Wikipedia relating to cheese. When a user asks a question, the system uses RAG to retrieve the most relevant information/documents from its vector database and then includes those in its message to the LLM. This gives the LLM specific and up-to-date information to use, extending beyond the data it was trained on.

The image below, flowing from right to left (steps 1-5), shows the high-level design of this. The user's query is passed into our embedding model. We do a similarity search against our database to retrieve the documents most relevant to our user's question. These are then included in the context passed to our LLM. Below, we'll outline the steps to building this system.

NOTE: This is an example, and probably doesn't make a ton of sense as a useful system. (For one, we're getting our data from Wikipedia, which is already contained within the training data of GPT-4.) This is meant to be a high-level example that shows how a RAG-based system can work, and what the possibilities are when integrating external data with LLMs (proprietary data, industry-specific technical docs, etc.)
As with most projects, getting and munging your data is one of the most time-consuming yet crucial elements. For our CheeseGPT example, this involved scraping Wikipedia for cheese-related articles, generating embeddings, and storing them in a Redis database. Below, I'll outline these steps with code snippets for clarity.

We start by extracting content from Wikipedia. We made a recursive function to fetch pages related to cheese, including summaries and sections. (Note: This function could definitely be improved.) This is a very greedy (and lazy) approach. We don't discriminate at all, and we end up with a ton of noise (things not related to cheese at all), but for our purposes of example, it works.

Next, we need to generate our embeddings from our collected documents. Embeddings are high-dimensional, continuous vector representations of text, words, or other types of data, where similar items have similar representations. They capture semantic relationships and features in a space where operations like distance or angle measurement can indicate similarity or dissimilarity. In machine learning, embeddings are used to convert categorical, symbolic, or textual data into a form that algorithms can process more effectively, enabling tasks like natural language processing, recommendation systems, and more sophisticated pattern recognition.

With our textual data collected, we'll be using OpenAI and LangChain to generate our embeddings. There are lots of different ways to generate embeddings (plenty of packages that run locally, too), but using the OpenAI API to get them is fast and easy for us (and also dirt cheap).

NOTE: In a true production system, there would be much more consideration taken around generating embeddings. This is arguably the most important step in a RAG-based system. We'd need to experiment with chunk size to see what gives us the best results. We'd need to explore our vectors to make sure they're working as expected, remove noise, etc.
LangChain makes it very easy to create embeddings and store them in Redis without much thought, but this step requires extreme care to generate good results in a production system. The snippet below takes our scraped Wikipedia sections, generates embeddings for them using OpenAI's embeddings API, and stores them in Redis. Again, LangChain abstracts away a ton of complexity and makes this really easy for us.

Our RAG operates by creating an embedding of the user's question and then finding the most semantically similar documents in our database (via cosine similarity between the embedding of our user's query and the N closest documents in our database). We then include these documents/snippets in our request to the LLM, telling it that they are the most relevant documents based on a similarity search. The LLM can then use these documents as reference when generating its response. Here's a simplified overview of the process with code snippets:

The user's query is converted into an embedding using the OpenAI API. This embedding represents the semantic content of the query in a format that can be compared against the pre-computed embeddings of the database articles. We then use the query embedding to perform a similarity search in the Redis database. It retrieves a set number of articles that are most semantically similar to the query. The retrieved articles are formatted and integrated into the prompt for GPT-4. This allows GPT-4 to use the information from these articles to generate a response that is not only contextually relevant but also rich in content. Finally, the enriched prompt is fed to GPT-4, which generates a response based on both the user's query and the additional context provided by the retrieved articles.

Through this process, CheeseGPT effectively combines the generative power of GPT-4 with the information retrieval capabilities of RAG, resulting in responses that are informative, accurate, and contextually rich.
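Stripped of the Redis and LangChain machinery, the retrieval step is just cosine similarity between the query embedding and each stored document embedding. Here is a pure-Python sketch with toy 3-dimensional vectors (real OpenAI embeddings have ~1536 dimensions, and Redis would do this search for us):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, docs, k=2):
    """docs: list of (text, embedding) pairs; returns the k most similar texts."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy "embeddings": the first two point in a cheese-ish direction, the third doesn't
docs = [
    ("Cheddar is an aged cow's-milk cheese", [0.9, 0.1, 0.0]),
    ("Brie has a soft bloomy rind", [0.8, 0.2, 0.1]),
    ("The Eiffel Tower is in Paris", [0.0, 0.1, 0.9]),
]

print(top_k([1.0, 0.0, 0.0], docs, k=2))
# -> the two cheese documents, with the Paris one filtered out
```

The retrieved texts are what get spliced into the prompt in the next step.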
CheeseGPT's chat interface is an important component, orchestrating the interaction between the user, the retrieval-augmented generation system, and the underlying Large Language Model (LLM). For the purposes of our example, we have built the bindings for the interface, but did not create a fully interactive interface. Let's dive into the key functions that make this interaction possible.

This function establishes a connection to the Redis database, where the precomputed embeddings of cheese-related Wikipedia pages are stored. Filters are applied to ensure that irrelevant sections like 'External Links' and 'See Also' are excluded from the search results.

This function ensures that duplicate content from the search results is removed, enhancing the quality of the final output. (This is necessary in our case because we were greedy/lazy when pulling our data and generating our vectors.)

This key function performs a similarity search in the Redis database using the user's query, filtered and deduplicated. The function formats the search results, making them readable and including the source information for transparency.

This is what our message looks like when we send it to GPT-4. Our system prompt is first and includes instructions for the model to use the retrieved documents when answering the question. In our user message, you can see the user's question, and then the documents we retrieved, presented as a list with some formatting. This function prepares the input for the LLM, combining the system prompt, user question, and the retrieved documents.

The integration of these functions creates a seamless flow from the user's question to the LLM's informed response, enabling CheeseGPT to provide expert-level insights into the world of cheese. Putting it all together might look something like…

So, let's compare a question using our system vs. asking ChatGPT. We'll use the same question above.
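The prompt-assembly step described above can be sketched as a small function. The prompt wording and dict shapes here are illustrative, not the original CheeseGPT code; the output matches the message format the OpenAI chat API expects:

```python
def build_messages(system_prompt, question, retrieved_docs):
    """Combine the system prompt, the user's question, and the retrieved
    snippets into a chat-completion message list."""
    # Present each retrieved snippet with its source for transparency
    context = "\n".join(f"- [{doc['source']}] {doc['text']}" for doc in retrieved_docs)
    user_content = (
        f"Question: {question}\n\n"
        f"Relevant documents (from a similarity search):\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "You are CheeseGPT. Use the provided documents when answering.",
    "What kind of rind does Brie have?",
    [{"source": "Brie - Wikipedia", "text": "Brie has a soft, bloomy rind."}],
)
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

This list would then be passed straight to the chat-completions endpoint.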
Using our system, we get this response: And if we ask ChatGPT the same question… I'm not sure which of these answers is more correct, and it doesn't matter for the purposes of this example. The point is that we were able to retrieve and include our own information, external to the model, and make it use that information in its response. It's clear how many amazing use cases there are for something like this! Hopefully this high-level toy example was able to shed some light on what a RAG-based system may look like. Check out the additional resources linked below for more in-depth information. Thanks for reading!

https://github.com/ray-project/llm-applications/blob/main/notebooks/rag.ipynb
https://github.com/pchunduri6/rag-demystified
https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1

Alex Jacobs 2 years ago

Effective Error Handling

All too frequently, we, as developers, are faced with the task of handling errors. By developing effective messaging for our users, we can make their experience with our software much more pleasant. While the task of crafting these messages can seem mundane, it's a necessary step in making our software more user-friendly and intuitive. And yet, all too frequently, end users will go out of their way to completely avoid even attempting to read the error message. And after a long, hard day writing bug-free code and solving customer problems, there is nothing less satisfying than a story that contains a screenshot of an error message that explains exactly the action the user needs to take to resolve the mistake they have made.

Below we present a simple and effective approach to handling errors in client-side fleshware. If it's not yet clear, I'll take this opportunity to say that this is (mostly) a joke.

type the error message to confirm you've seen it and disable the error modal

I'm certainly no UI designer, but with a little love, I expect this to be a core product feature in all modern softwares going forward. Check out the fiddle below to make your own improvements! https://jsfiddle.net/eg142dkp/2/

First I had to learn JavaScript. Then I had to learn CSS and HTML. After all that, I had to learn about Hugo and shortcodes. By that point, I was a certified full-stack developer, and I was able to create the following shortcode to render the modal. It goes right here in my directory, and then in my markdown file, I can just call it like this. Hugo is neat.

It is worth updating this post to say that I gave this prompt to the OpenAI model and within seconds had a fully working implementation. Check out OpenAI's fiddle for comparison. Maybe we're all wasting our time. https://jsfiddle.net/Lkyxuc0v/

Alex Jacobs 3 years ago

Running Jupyter lab behind NGINX--Part 2

In the last post, we left off with a working reverse proxy, but we couldn't access Jupyter Lab due to its auth enforcement. Because of how we're setting this up, we will be handling authentication upstream of Jupyter Lab, and we don't want to rely on it for handling authentication. What we are going to do here is generally considered "unsafe." Again, if you're looking to do this for your team, check out Jupyter Hub; it probably makes more sense for your use case.

To disable token auth, we will update our Jupyter Lab config. There is an extensive config file for Jupyter Lab. In a production environment, I recommend using it (you can generate a sample file by running `jupyter lab --generate-config`). But, for this toy example, we will pass our config as command-line arguments. To disable token auth and to allow same-origin requests, we're going to update our Jupyter Lab Dockerfile ENTRYPOINT to include these arguments. Our Dockerfile should now look like this. And if we rebuild and start our Docker Compose again, we now get through to Jupyter!

But if we try to open the Python kernel, we'll notice it's having trouble connecting. Opening our browser dev tools shows that there is an issue with how our proxy is handling WebSockets. We'll have to update our Nginx config to address this. We will add these lines to our / location in the server block to set the headers WebSockets need. And now, if we restart our containers using the updated config, we'll see our kernel connects!

If you're wondering how we will handle security when we're basically giving whoever is using this a terminal into our cloud, the answer is using AWS to isolate the instance via IAM roles/policies. We aren't going to get too much into that in this post, but it is a valid concern. There isn't much we can do to prevent a privilege escalation/container escape by a sophisticated user, but we can at least not give root access. We're going to update our Jupyter Dockerfile to have a new user, 'jupyter', and we'll run Jupyter Lab as this user.
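The WebSocket-related headers described above look roughly like this. This is a sketch, assuming the upstream service is named `jupyter` in docker-compose (as in Part 1) and listens on 8888:

```nginx
location / {
    proxy_pass http://jupyter:8888;

    # WebSocket support: Jupyter's kernels talk over websockets, and these
    # headers let Nginx upgrade the connection instead of dropping it
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}
```

`proxy_http_version 1.1` matters because the `Upgrade` mechanism does not exist in the HTTP/1.0 that Nginx uses upstream by default.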
We're also going to update our ENTRYPOINT so the Jupyter Lab root directory is set to the jupyter user's home directory. Our Dockerfile should now look like this. If we reload our site, we'll see that the working directory is now set to the jupyter user's home directory, and if we try to write outside of it, we'll get a permissions error.

It's important to note that while this makes it a little more difficult for a malicious user to take over this 'instance', we will be giving them access to the internet, the ability to download and install packages, execute code, etc. It would not be too difficult for someone with malicious intent to get around this. Changing the user and working directory does more to keep an innocent user from accidentally breaking something.

Great! Now we have disabled token authentication, added a system user (who is now running Jupyter), and changed our notebook directory to our user's directory! In the next post, we'll set up a task definition and deploy to ECS.
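The user and ENTRYPOINT changes described above might look something like the fragment below. This is a sketch, not the post's exact Dockerfile: the base image, flag names (newer Jupyter versions use `ServerApp.*` where older ones used `NotebookApp.*`), and paths are assumptions:

```dockerfile
# Create an unprivileged user and run Jupyter Lab as them,
# rooted in their home directory
RUN useradd --create-home --shell /bin/bash jupyter
USER jupyter
WORKDIR /home/jupyter

ENTRYPOINT ["jupyter", "lab", \
            "--ip=0.0.0.0", "--port=8888", \
            "--notebook-dir=/home/jupyter", \
            "--ServerApp.token="]
```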

Alex Jacobs 3 years ago

Running Jupyter lab behind NGINX--Part 1

Jupyter Lab is an open-source, web-based IDE for notebooks with Python and R support, geared towards the data science crowd. It's a powerful, mature application with a potentially complex configuration. Our requirement was to deliver Jupyter Lab to users so that each user would have their own isolated "instance." There is an off-the-shelf solution for this called Jupyter Hub that probably makes the most sense for your organization. This example will be a proof of concept of how you could roll your own solution.

Our first step will be getting Jupyter Lab up and running in a container. There are many Docker images available on [Docker Hub](https://hub.docker.com/) for Jupyter Lab, but since we're rolling everything ourselves, we might as well make our own image. It also gives us more control over our code, and it's a pretty simple Dockerfile. There are probably some good arguments for why you should use Alpine or something else as the base image here, but I'm a sucker for Ubuntu. Since this isn't a Docker tutorial, I'm not going to go into great detail here about what each line in this Dockerfile does, but assume that it installs Jupyter Lab and configures it to run on port 8888. We'll expand on the Jupyter Lab config (and make some changes) later, but for now, this works fine.

We're going to use Docker Compose to run this. Our compose file looks like this. We can run this with a single command, which will start our Jupyter Lab container and make it available at http://127.0.0.1:8888/lab/

Next, we need to put together an Nginx Dockerfile. While we could use the official Nginx image, in keeping with the theme, we're going to create our own Nginx image (and it's also really simple). Pretty straightforward. Our config file is also pretty simple. We're going to use port 8000, and we're going to simply forward all requests directly to Jupyter Lab. The final piece that will tie these together is our docker-compose file. Our docker-compose is pretty simple as well.
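A minimal compose file for this setup might look like the sketch below. The service names and build directories are assumptions for illustration:

```yaml
version: "3"
services:
  jupyter:
    build: ./jupyter      # our custom Jupyter Lab image
    expose:
      - "8888"            # reachable by other services, not the host

  nginx:
    build: ./nginx        # our custom Nginx image
    ports:
      - "8000:8000"       # the only port exposed to the host
    depends_on:
      - jupyter
```

Compose puts both services on a shared network, so Nginx can reach Jupyter at the hostname `jupyter`.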
By using Docker Compose, the networking between the containers is handled for us, and we can refer to the Jupyter container by its service name in our nginx.conf. Now, let's head to http://127.0.0.1:8000, and… Awesome! We're being proxied to Jupyter Lab. But we see a page requiring token auth. This is because Jupyter is currently configured to enforce it. In the next post, we'll deal with this and some other things regarding permissions, creating a user, and making a task definition for deploying this configuration to ECS.

Alex Jacobs 3 years ago

Splitting SRA into FASTQ with SRAToolkit, Python, and Docker

SRA (Sequence Read Archive) is a file format used by NCBI, EBI, etc., for storing genomic read data. It works with multiple file types (BAM, HDF5, FASTQ). In our case, we're going to be focusing on FASTQ. The first step of many pipelines is converting SRA into FASTQ, which will be our focus in this post. If you're working as an individual or a scientist, you probably want to go ahead and use SRA Toolkit to download your files. For our purposes here, though, we're going to assume you already have your files downloaded (and are probably using something other than SRA Toolkit for file I/O).

SRA Toolkit is pretty frustrating to use. It is not designed for programmatic use or as part of larger systems (and the developers seem hostile to the idea that someone would even try to do this 😲). It wants you to do an interactive configuration on every install: https://github.com/ncbi/sra-tools/issues/77 We get around this with a dumb hack to make it think we've gone through this process and configured it. That's what's happening in lines 25-26. (It's kind of messy to create this config file like this; it would probably be better to make this a file and copy it in, but for our purposes, I want to contain everything in a single file with no external dependencies.)

We will be using Docker and Python for this, so our first step is to create a Dockerfile with the tools we need installed. Here's our Dockerfile. I've added comments to explain what I'm doing, but if you don't know anything about Docker, this isn't a day-one tutorial, so check out one of those first.

Our Python script for this is pretty simple. We're assuming that our SRA file has been downloaded. We're going to be running SRA Toolkit using the Python subprocess module. First, we need to check that our SRA is valid. We use the `vdb-validate` tool to do this. If we get a good return code, we will test whether it's paired-end data. I'm not going to go into a ton of detail about paired vs.
single-end data, but suffice it to say that it's more effective to use paired-end data. You can read more here: https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html In most cases, data from modern experiments is paired, but it's essential to know. In this case, it may not seem helpful, but in a production environment we would be passing these files on to another step in a pipeline, and we need to be able to tell whether we will have a single read file or multiple in order to configure the next step properly. To determine this, we stand on the shoulders of those who've come before us and use a version of a function described here: https://www.biostars.org/p/139422/

Once we determine the data type, we'll pass our SRA to `fastq-dump` to split it. We're going to tell it to give us gzipped output files. If the result is successful, we're simply going to return the paths of the files (which, in this case, we're just going to log to stdout rather than upload). That's basically it. We've also added some simple error handling. I've commented the file below to better explain what we're doing.

Next, we're going to need some data. SRA files are often pretty large (sometimes hundreds of gigabytes). Typically, S. cerevisiae RNA-seq datasets are pretty small but also fully functional, so we're going to be using one of those: https://trace.ncbi.nlm.nih.gov/Traces/?run=SRR21712309 (To find this, I went to https://ncbi.nlm.nih.gov/sra, entered S. cerevisiae in the search bar, and selected the first one :)

Now that we have data, we can run this container using Docker. We will bind a directory on our host machine to a data directory in our container. This folder is where our input data will live, and it is where the FASTQ files our script generates will be written. To run it, we'll simply run the container, and we'll see output logged to stdout. We'll also see these files appear in our bound directory.
The paths to these will match the file paths logged at the end. And that's it! This is a lot of explanation for a simple toy example, but hopefully it's helpful to someone just getting started!
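The validate / detect-paired / split flow described in this post can be sketched roughly as follows. This is not the original script: function names are illustrative, and it assumes the SRA Toolkit binaries (`vdb-validate`, `fastq-dump`) are on the PATH inside the container:

```python
import subprocess


def is_valid_sra(path: str) -> bool:
    """vdb-validate exits 0 when the archive checks out."""
    result = subprocess.run(["vdb-validate", path], capture_output=True)
    return result.returncode == 0


def count_fastq_lines(dump_output: str) -> int:
    """Count non-empty lines in a fastq-dump text dump."""
    return len([line for line in dump_output.splitlines() if line.strip()])


def is_paired(path: str) -> bool:
    """Dump a single spot to stdout; a paired-end spot emits 8 FASTQ lines,
    a single-end spot emits 4 (the biostars trick linked above)."""
    result = subprocess.run(
        ["fastq-dump", "-X", "1", "-Z", "--split-spot", path],
        capture_output=True, text=True,
    )
    return count_fastq_lines(result.stdout) == 8


def split_command(path: str, outdir: str) -> list:
    """Build the fastq-dump invocation that writes gzipped, split FASTQ files."""
    return ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir, path]
```

The real script would run `split_command(...)` with `subprocess.run`, check the return code, and log the resulting file paths to stdout.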
