XGBoost Is All You Need
import LLMFeatureDemo from "../../components/blog/XGBoostIsAllYouNeed/LLMFeatureDemo.astro";
import ModelComparisonDemo from "../../components/blog/XGBoostIsAllYouNeed/ModelComparisonDemo.astro";
import FeatureImportanceDemo from "../../components/blog/XGBoostIsAllYouNeed/FeatureImportanceDemo.astro";

{/* <!-- TODO: Add a more concrete opening story - maybe a specific moment at the startup where you realized asking LLMs for direct answers was broken? --> */}

I spent two and a half years at a well-funded search startup building systems that used LLMs to answer questions via RAG (Retrieval-Augmented Generation). We'd retrieve relevant documents, feed them to an LLM, and ask it to synthesize an answer. I came out of that experience with one overwhelming conviction: we were doing it backwards.

The problem was that we were asking LLMs "what's the answer?" instead of "what do we need to know?"

LLMs are brilliant at reading and synthesizing information at massive scale. You can spawn effectively unlimited instances in parallel to process thousands of documents, extract insights, and transform unstructured text into structured data. They're like having an army of research assistants who never sleep and work for pennies.

{/* <!-- TODO: Add personality - maybe joke about why you picked this problem, or a detail about trying the "ask LLM directly" approach first and failing? --> */}

Forecasting how many rushing yards an NFL running back will gain in their next game is a perfect example of this architecture. The outcome is influenced by historical statistics (previous yards, carries, opponent defense), qualitative factors (recent press coverage, injury concerns, offensive line health), and game context (Vegas betting lines, projected workload).

{/* <!-- TODO: Add personality - show a real example of ChatGPT giving a plausible-sounding but wrong prediction? Make it funny? --> */}

You could ask ChatGPT's Deep Research feature to predict every game in a given week. It would use web search to gather context, think about each matchup, and give you predictions. This approach is fundamentally broken. It's unscalable (each prediction requires manual prompting and waiting), the output is unstructured (you'd need to parse each response by hand and log it in a spreadsheet), it's unreliable (LLMs are trained to sound plausible, not to optimize for numerical accuracy), and you can't learn from it (each prediction is independent, so there's no way to improve based on what worked).

This is the "ask the LLM what's the answer" approach. It feels like you're doing AI, but you're really just creating an expensive, slow research assistant that makes gut-feel predictions.

{/* <!-- TODO: Add personality - maybe contrast this with how a human would do feature engineering? Show the "aha" moment when you realized this approach? --> */}

Instead of asking "How many yards will Derrick Henry rush for?", we ask the LLM to transform unstructured information into structured features: search for recent press coverage and rate the sentiment 1-10, analyze injury reports and rate the concern level 1-5, evaluate the opponent's run defense and rate its weakness 1-10. This is scalable (run 100+ feature extractions in parallel), structured (everything becomes a number XGBoost can use), and it improves over time (XGBoost learns which features actually matter).

I started with basic statistical features from the NFL API: yards and carries from the previous week, 3-week rolling averages, that kind of thing. These are helpful, but they miss important context.
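
For what it's worth, these baseline features amount to a few lines of pandas. Here's a minimal sketch with a toy frame and illustrative column names, not the NFL API's actual schema:

```python
import pandas as pd

# Toy frame standing in for weekly stats pulled from the NFL API.
# One row per player per week; column names are illustrative.
games = pd.DataFrame({
    "player": ["D. Henry"] * 5,
    "week": [1, 2, 3, 4, 5],
    "rush_yards": [82, 115, 63, 97, 140],
    "rush_attempts": [18, 24, 14, 20, 26],
})

games = games.sort_values(["player", "week"])
grouped = games.groupby("player")

# Previous-week stats plus 3-week rolling averages, shifted so each row
# only sees games that were already played (no leakage from the target week).
games["prev_yards"] = grouped["rush_yards"].shift(1)
games["prev_carries"] = grouped["rush_attempts"].shift(1)
games["avg_yards_3wk"] = grouped["rush_yards"].transform(lambda s: s.shift(1).rolling(3).mean())
games["avg_carries_3wk"] = grouped["rush_attempts"].transform(lambda s: s.shift(1).rolling(3).mean())
```
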
So I had the LLM engineer seven qualitative features: press coverage sentiment, injury concerns, opponent defense weakness, offensive line health, Vegas sentiment, projected workload share, and game script favorability. An agent loop with web search processed context about each player and game to populate these features, searching for news in the week leading up to the game and rating each factor on a numerical scale.

<LLMFeatureDemo />

Once we run this process for every running back each week, we end up with a dataset that has both statistical and LLM-engineered qualitative features.

{/* <!-- TODO: Add personality - what were you hoping for? What did you expect to happen? --> */}

I split the data chronologically, early weeks for training and later weeks for testing, and trained two models: a baseline using only statistical features (previous yards, carries, rolling averages), and an enhanced model using both the statistical and LLM-engineered features.

{/* <!-- TODO: Add personality - show your emotional reaction to seeing these numbers. Were you shocked? Skeptical? Did you run it again to make sure? --> */}

<ModelComparisonDemo />

The LLM-enhanced model reduced prediction error by 22.6%. The baseline model was actually worse than just predicting the average yards (R² of -0.025), while the enhanced model explained 38.6% of the variance.

But that's not the interesting part. The interesting part is what XGBoost actually learned.

{/* <!-- TODO: Add personality - build up the surprise here. Maybe say "I looked at the feature importance rankings expecting to see..." --> */}

<FeatureImportanceDemo />

Six of the top seven most important features are LLM-engineered. The top feature is average carries over the last 3 weeks (statistical). The second most important is press coverage sentiment (LLM), followed by game script prediction (LLM), Vegas sentiment (LLM), projected workload share (LLM), offensive line health (LLM), and injury concern (LLM).

I didn't tell XGBoost that press sentiment matters more than injury concerns, or that game script prediction is more important than offensive line health. The model discovered these patterns on its own by analyzing which features actually correlated with rushing yards.

The most predictive LLM feature, press coverage sentiment, captures momentum and narrative that doesn't show up in raw statistics. When a running back is getting positive press coverage, they tend to get more carries and perform better. XGBoost found this signal and learned to weight it heavily.

This is the power of the hybrid approach: LLMs transform messy, unstructured context into clean features. XGBoost discovers which features actually matter. Neither could do this alone.

{/* <!-- TODO: Add personality - make the transition to "this is actually a bigger problem" more dramatic. Show frustration with the current state? --> */}

This isn't just about NFL predictions. Email prioritization, Slack message routing, pull request quality assessment, prediction market opportunities, customer support triage: every one of these problems has the same structure. Some structured data combined with unstructured context that needs to be transformed into a prediction.

The architecture is identical every time: use LLMs in parallel to extract features from unstructured data, combine them with structured features, train XGBoost to find patterns, deploy and iterate. Setting this up from scratch takes way too much time.
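
To give a sense of what that scaffolding involves, here's a rough sketch of the extraction half: one agent call per player-week that turns scraped context into the seven numeric features. The rubric wording, the `gpt-4o-mini` model choice, and the example data are placeholders rather than the notebook's exact code, and I'm assuming the web-search step has already produced a blob of context text.

```python
import json
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any LLM with JSON output works here

# The seven qualitative features and the scale the model is asked to use.
FEATURE_RUBRIC = {
    "press_sentiment": "tone of this week's press coverage, 1-10",
    "injury_concern": "level of injury concern, 1-5",
    "opp_run_def_weakness": "how exploitable the opponent's run defense is, 1-10",
    "oline_health": "offensive line health, 1-10",
    "vegas_sentiment": "how favorable the betting market looks, 1-10",
    "workload_share": "projected share of backfield carries, 0-1",
    "game_script": "how run-friendly the expected game script is, 1-10",
}

def extract_features(player: str, week: int, context: str) -> dict:
    """Turn scraped news/odds text for one player-week into numeric features."""
    prompt = (
        f"You are scouting {player} ahead of NFL week {week}.\n"
        f"Context gathered from the web:\n{context}\n\n"
        "Return a JSON object with exactly these keys, each set to a single number:\n"
        + json.dumps(FEATURE_RUBRIC, indent=2)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(resp.choices[0].message.content)
    return {"player": player, "week": week, **scores}

# One (player, week, context) tuple per starting running back; the real pipeline
# builds this list from the web-search step.
scouting_contexts = [
    ("Derrick Henry", 10, "...scraped articles, injury reports, and odds text..."),
]

# Fan out: every extraction runs in parallel, and every result is a clean numeric row.
with ThreadPoolExecutor(max_workers=16) as pool:
    feature_rows = list(pool.map(lambda args: extract_features(*args), scouting_contexts))
```
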
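
The modeling half is roughly a chronological split, a baseline XGBoost regressor on the statistical columns, an enhanced one on everything, and a look at what the enhanced model leaned on. Again a sketch: the file name, column names, week-12 cutoff, and hyperparameters are assumptions, not the notebook's exact setup.

```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical merged dataset: one row per player-week, statistical + LLM features.
df = pd.read_csv("rb_weekly_features.csv")

stat_cols = ["prev_yards", "prev_carries", "avg_yards_3wk", "avg_carries_3wk"]
llm_cols = ["press_sentiment", "injury_concern", "opp_run_def_weakness",
            "oline_health", "vegas_sentiment", "workload_share", "game_script"]
target = "rush_yards"

# Chronological split: train on early weeks, test on later ones, never the reverse.
train, test = df[df["week"] <= 12], df[df["week"] > 12]

def fit_and_score(cols):
    model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
    model.fit(train[cols], train[target])
    preds = model.predict(test[cols])
    return model, mean_absolute_error(test[target], preds), r2_score(test[target], preds)

baseline, base_mae, base_r2 = fit_and_score(stat_cols)
enhanced, enh_mae, enh_r2 = fit_and_score(stat_cols + llm_cols)
print(f"baseline MAE {base_mae:.1f} (R2 {base_r2:.3f}) vs enhanced MAE {enh_mae:.1f} (R2 {enh_r2:.3f})")

# Which features did the enhanced model actually lean on?
importance = pd.Series(enhanced.feature_importances_, index=stat_cols + llm_cols)
print(importance.sort_values(ascending=False))
```

The chronological split is the important design choice here: a random split would let the model train on games from the future and inflate the test numbers.
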
I want tools that make this trivial: upload your data, describe what you want to predict, and get back a trained model with a deployment-ready API.

{/* <!-- TODO: Add personality - make this section angrier? More pointed? This is your villain reveal. --> */}

The tools I'm describing could exist today. The technology is mature and proven. So why hasn't anyone built them?

Random forests don't raise $1B rounds. Founders are building pure-LLM systems because that's what gets funded. VCs get excited about foundation models and AGI, not about elegant hybrid architectures that combine 2019-era XGBoost with LLM feature engineering.

This is the real problem with modern AI development. Not that the technology isn't good enough; it's that incentives are backwards. VC-led engineering is bad engineering. The best technical solutions rarely align with what makes a compelling pitch deck. Everyone's building the wrong thing because they're building what raises money instead of what solves problems.

If you're a builder who cares more about solving real problems than raising huge rounds, there's a massive opportunity here. Build the boring, practical tools that let people deploy these hybrid systems in minutes instead of weeks. Build what actually works instead of what sounds impressive.

{/* <!-- TODO: Add personality - end on a more concrete note? What are YOU going to build next? What do you wish existed? --> */}

The future of ML isn't pure LLMs or pure classical ML; it's knowing which tool to use for which job. Don't ask LLMs "what's the answer?" Ask them "what do we need to know?" Then let XGBoost find the patterns in those answers.

Want to see the full implementation? Check out the complete Jupyter notebook walkthrough with all the code, data processing steps, training, and visualizations.