Latest Posts
Jampa.dev 2 days ago

Writing with AI without the Slop

I suck at writing. I open too many parentheses, and my thoughts scatter (everywhere). So when ChatGPT launched, I thought it would finally replace Grammarly. But LLMs have their own problems: "It's not just x, it's y." Rhetorical questions? Affirmative answers! "Here's the kicker:" (that preface was entirely unnecessary). And in the end, it ends with recaps that repeat everything already said, now with bullet points.

The problem with AI text is that when you read it, your first thought is: "Did this person actually invest time in this, or did they write a two-line prompt and expect me to read something they never even thought about?" As some people put it: "I'd rather just read the prompt." The current state of Reddit, basically.

LLMs can't be genuine because they don't know how to be a person. They read text from multiple public sources and average it out. They weren't trained by eavesdropping on authentic conversations or messages. (At least I hope not.) The more the AI creates for you, the worse the output becomes. That's why when you ask it to "keep it casual," it turns into "How do you do, fellow kids?", and when you ask for a professional tone, it becomes "Alas, who'd've done this?" If you want LLMs to cook, you need to provide the ingredients.

As a general writing (and cooking) tip, start it raw. Don't use autocorrect. In fact, don't even look at what you're typing. Close your eyes and let the raw ideas flow, along with the grammatical mistakes and misconstrued sentences. Just make it coherent enough. Make bullet points that answer: "What's the point of me writing this?" Then connect those bullet points with your personality, which dictates how you link sentences. A serious person uses serious connectors; a casual person throws in verbal expressions (and memes).

How LLMs can help

Once you have a first draft, the key is using the right edits. The biggest mistake people make is in the prompting. If you prompt like a casual writer, the AI treats you like one. Saying "Improve the text below for my email" makes the AI slopify everything: it accesses the neural latent space of "this person needs my help immensely." You need to signal: "Hey, I know what I'm writing. I just need help improving the flow while keeping my own words." You can do this by using the verbiage editors and publishers use during the different editing phases, from solidifying the overall scope down to minor edits like correcting grammar.

The LLM won't write for you, but it can still help immensely, because writing words is not the hard part once you get the hang of it. For me, editing takes 80% of the overall time. Most people start as slow writers because they try to write and edit simultaneously. With chain-of-thought in newer models, you don't need much prompt engineering anymore; you just need to know the right words so the LLM's thinking can go into the embeddings.

Content editing improves flow and structure at the sentence level. It is useful when you know what you want to say but are unsure how to connect your thoughts. It's also the most destructive phase, so it's better to use it only once. Example prompt: "You are a content editor. Improve the flow of the sentences and make the text stronger and more structured." The AI will make many edits to make your text make sense, and the places where it misunderstood your intentions will stick out like sore thumbs. You will need to adjust them and add points that solidify your premise. As you add (and cut) content for a second draft, it's time to move to line editing.
Line editing is where AI shines, especially for short texts like announcements. Use it when you know what you want to say and how you want to say it, but specific words escape you, or the phrasing could be simpler. I spend most of my time here, line editing multiple times until nothing stands out badly. Example prompt: "Line edit this (Slack message / blog post)."

Proofreading happens when you've "mastered" the copy. It's always safe to run multiple times without fearing the AI will destroy your voice, which matters because you will keep tempting yourself to write small additional bits here and there. Example prompt: "You're Grammarly, fix the mistakes in the text:" This is basically a cheap Grammarly (but better).

Writing text is not magic, and you must put in the effort. Even if we get better AI, I don't think we will ever remove the AI scent from text writing. So we as humans will need to write until we get tired and don't even want to finis-

Thanks for reading Jampa.dev! Subscribe for free to receive new posts. (And avoid getting shot by a snip-

Note: I've added all the editing phases of this article here. You can see how the content changed from draft to final edit. I used Claude Sonnet for the editing part. Overall, I did one content edit and 18 line edits (on different snippets), and I lost count of how much proofreading I used.
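If you want to run these passes from the command line, here is a minimal sketch using the Claude Code CLI in print mode (an assumption on my part; any LLM CLI that reads stdin works, and the file names are hypothetical):

```bash
# A minimal sketch of the three editing passes, assuming the `claude` CLI
# is installed and treats the piped file as the text to edit.

# 1. Content edit: run once only; it is the most destructive pass.
claude -p "You are a content editor. Improve the flow of the sentences and make the text stronger and more structured." < draft.md > draft-2.md

# 2. Line edit: repeat until nothing stands out badly.
claude -p "Line edit this blog post." < draft-2.md > draft-3.md

# 3. Proofread: safe to run as many times as you like.
claude -p "You're Grammarly, fix the mistakes in the text:" < draft-3.md > final.md
```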

Jampa.dev 1 week ago

Things I’ve learned in my 7 Years implementing AI

Even though the impact of LLMs is unlike anything we've seen before, the assumptions people make about them feel familiar. For context: I wasn't the "PhD scientist" working on models. I was the guy who productionized their proof-of-concept code and turned it into something people could actually use. I worked in industries ranging from software/hardware automated testing at Motorola to small startups dealing with accessibility and education. So here is what I've learned:

This AI hype cycle is missing the mark by building ChatGPT-like bots and "✨" buttons that perform single OpenAI API calls. For example, Notion, Slack, and Airtable now lead with "AI" in their page titles instead of the core value they provide. Slack calls itself "AI Work Management & Productivity Tools," but has anyone chosen Slack for its AI features? Most of these companies seem lost on how to implement AI. A simple vector semantic search on Slack would outperform what they've shipped as "AI" so far. People don't use these products because of "✨" AI features.

The best AI applications work beneath the surface to empower users. Jeff Bezos commented on this (in 2016!). You don't see AI as a chatbot on the Amazon homepage. You see it in "demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations." That's where AI comes in: not as "the thing" but as "the tool that gets you to the thing."

(Relevant XKCD, which is not relevant anymore...)

What if a problem that took a team of PhDs one year to solve could be solved better in four hours? That's where LLMs shine. When I worked on accessibility for nonverbal people, one of our projects aimed to make communication cards ("I want," "Eat," "Yes," "No") context-aware, so nonverbal users could express their desires faster, similar to an autocomplete. For example, the user is home at 7 AM and taps the "I want to eat" card. The next cards should anticipate their needs (which are more likely to be breakfast items), but there are caveats: what a person typically eats for breakfast depends on their country, the type of establishment they are in (home, hotel, restaurant), the day of the week, and, of course, current personal preferences, which also change over time.

After a year of work, our team of researchers from two universities achieved a 55% hit rate on the suggested options. It was a massive success at the time; we even won an award for best accessibility solution. When ChatGPT 3.5 was released, I replicated a solution for this project and, after hacking over a weekend, got an 82% accuracy rate when running against the same test database.
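To give a flavor of the idea, here is an illustrative sketch of that weekend prototype (not the original code or prompt; the cards, context, and CLI usage are all assumptions):

```bash
# Illustrative sketch: ask an LLM to rank communication cards given the
# user's context. Card names and context are made up for the example.
claude -p "A nonverbal user is at home at 7 AM on a Tuesday.
They tapped the cards: 'I want' -> 'to eat'.
Rank these cards by how likely the user wants them next:
bread, soup, cereal, juice, pizza, coffee.
Return only the top four, one per line."
```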
AI skeptics ask, "If AI is so good, why don't we see a lot of new startups?" Ask any founder: coding isn't even close to the most challenging part of creating a startup. What I do see is a boom in internal tools. This year alone, I shipped projects that would never have been viable before. As an engineering manager, spending weeks coding means neglecting the team, and the "nice to have" bucket is where a project dies: it means there is no engineering capacity to tackle it, so it goes into backlog limbo. Until now. Now I can build these projects using Claude, running prompts and reviewing the output between meetings. I see many people releasing new things that are incredibly helpful and productive, which would not have happened without Claude or Cursor.

Like all tools before it, we're coming closer to the top of the S-curve for LLMs. (Take the graph with a grain of salt: it is hard to compare earlier models because most benchmarks came much later.) The last releases were unimpressive. Does anyone know a real application where ChatGPT 5 can do something that o3 could not?

The good news is that what we have is enough for most people. AI tools as limited as KNNs are still valuable today. This also kills the reverse FOMO: "If I wait for the technology to mature, I won't have to deal with its early quirks" is less relevant now. But AI research is definitely not over; we will still see cheaper, faster, and open models, like ones that can run on a mobile device yet are as capable as ChatGPT 4o.

Creating AI models is hard, but working with them is simple. I put off implementing earlier AI tools because I couldn't grasp how neural networks, sigmoids, and all that worked. Then someone said, "What are you doing? If you want to apply the technology, just use Scikit-learn." If you've never used AI for coding, install Claude Code and start using it for small tasks. That gets you 70% of AI's current benefits without diving into prompt optimization or chain-of-thought mechanics. Eventually, you'll hit bottlenecks and need to learn to leverage LLMs better. You'll realize you still need to review the code and CLI commands, you'll naturally get better at prompting, and you'll learn when (and when not) to use it.

AI is the new Agile: something simple that makes you faster but has limits, yet people position it as the solution to every problem, preaching, "Oh, you're using (AI / Agile) wrong. In fact, it seems like what you need is even more (AI / Agile)."

The tool has limits, especially when breaking new ground. LLMs are limited by their training data. For example, when I tried to vibecode a mod for a recently released Unity game, the AI failed to complete even a basic hook. Automatic railway gates replaced crossing attendants, but if those gates worked only 99% of the time (or even 99.99%), would that be good enough? LLMs are very far from 99% accuracy. They fix problems, but they tend to miss the root cause. I see many cases where the LLM suggests a fix by adding multiple lines that an experienced engineer would make by removing one. Recognizing this requires senior-level skills, such as valuing simplicity over complexity and the knowledge gained from dealing with similar bugs in the past. This creates a problem for juniors who, when using LLMs, will have problem-solving done for them and won't develop this skill, hurting their code-reviewing abilities. I see many companies that have stopped hiring juniors altogether.

The Internet was a bubble in 1999, and you know the result: the internet died completely, but it was good for a while. Man, I miss the Internet. But seriously, we are seeing great tools coming to boost productivity (and a new era of AI memes) while VCs and Big Tech pay for most of them. It's a win-win.

Thanks for reading Jampa.dev! Subscribe for free to receive new posts and support my work. Also, here is my current favorite SORA video: (Warning: LOUD) (I had to remove the video because a bug in Substack causes the space bar to play the video instead of scrolling down, sorry for the jumpscare. Here's the Reddit link instead: https://www.reddit.com/r/SoraAi/comments/1nwcx9e/some_body_cam_footage/ )

Jampa.dev 1 month ago

Using Claude Code SDK to Reduce E2E Test Time by 84%

End-to-end (E2E) tests sit at the top of the test pyramid because they're slow, fragile, and expensive. But they're also the only tests that verify complete user workflows actually work across systems. Due to time constraints, most teams run E2E nightly to avoid CI bottlenecks. However, this means bugs can slip through to production and become harder to fix, because there are so many changes to sift through when isolating the root cause.

But what if we could run only the relevant E2E tests for the specific code changes of a PR? Instead of waiting hours for the entire suite, we could get results in under 10 minutes, catch bugs before they ship, and keep our master branch always clean.

The first logical step toward running only relevant tests would be glob patterns: we tell the system what to test by matching file paths, for example mapping changes under a checkout folder to the checkout specs (paths illustrative). But globs are very limited. They require constant maintenance as the codebase evolves; every new feature would require updating the glob patterns file. More importantly, they cast too wide a net. A change to something like a shared button component might need to trigger every E2E test that involves any page with a button interaction, depending on how deep the change is.

So, how can we determine which E2E tests should run for a given PR with both coverage and precision? We need coverage because missing a critical test could let bugs slip through to production. But we also need precision, because running tests that will obviously pass just wastes time and resources.

The naive approach might be to dump the entire repository and the changes into an LLM and ask it to figure out which tests are relevant. But this completely falls apart in practice: repositories can easily contain millions of tokens' worth of code, which makes it a non-starter for all AI models. Claude Code takes a fundamentally different approach because of one key differentiator: tool calls. Instead of trying to process your entire codebase, Claude Code strategically examines specific files, searches for patterns, traces dependencies, and incrementally builds up an understanding of your changes.

So here's the hypothesis: if I see a PR, I know which E2E tests it should run, because I know the codebase. The question is: can Claude Code replicate my human intuition by searching for it? Let's build and find out.

For the E2E selection to be successful, Claude needs to know what I know: the PR modifications, the E2E tests, and the codebase structure. We need to glue all three together in a well-crafted prompt.

The PR modifications are perhaps the easiest piece; we can leverage git to get exactly what we need. We start with the basic diff command against the base branch. This gives us the changes of a branch, but we can do much better. First, we want git to be less verbose, so we add `-w` to focus on the actual code changes rather than whitespace noise. We also don't care about deleted files, since we'll need to remove references to them in existing files anyway (unless we don't care about those tests), so we add `--diff-filter=ACMR` to exclude deleted files and focus on (A)dded, (C)opied, (M)odified, and (R)enamed files. Finally, we need some strategic excludes, because PRs generally contain large generated files that would blow up our token count, so we add pathspec excludes to keep things manageable. Putting it all together, a sketch of the command (flags reconstructed from the description above; the base branch and exclude list are illustrative):
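```bash
# Sketch of the diff command (reconstructed; adjust the base branch and
# the exclude list to your repository).
#   -w                   ignore whitespace-only changes
#   --diff-filter=ACMR   keep only Added, Copied, Modified, Renamed files
git diff origin/main...HEAD -w --diff-filter=ACMR \
  -- . ':(exclude)package-lock.json' ':(exclude)*.snap'
```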
The result is a clean diff showing only the actual code modifications.

For the list of E2E tests, we could hardcode the test files in our prompt, but that violates the single-source-of-truth principle. We already maintain this list for our daily benchmarks, so let's reuse it. For example, if the test configuration lives in a WebdriverIO config file (say, `wdio.conf.js`), we can extract it programmatically, e.g. with a small Node one-liner (a sketch that assumes the config exports a standard `suites` object; adapt to your setup):
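```bash
# Sketch: list the E2E suite names straight from the WebdriverIO config,
# reusing our single source of truth. Assumes a CommonJS wdio.conf.js
# exporting `config` with a `suites` map.
node -e "
const { config } = require('./wdio.conf.js');
console.log(Object.keys(config.suites).join('\n'));
"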
This dynamically reads the file and outputs our exact test suite configuration.

The prompt needs to be precise about what we want. We start by setting clear expectations. The key phrase here is "think deep". This tells Claude Code not to be lazy with its analysis (while spending more thinking tokens). Without it, the output was very inconsistent; I used to joke that without it, Claude runs in "engineering manager mode," delegating the work.

Next, we set boundaries. The "only run tests listed" constraint was added because Claude was being "too smart," finding work-in-progress spec files and scheduling them to run. We added a final instruction to err on the side of inclusion, because it is better to run a few extra specs than to leave a relevant test out.

I asked for JSON output, and since I didn't want Claude's judgment to be a black box, I requested two keys: the list of tests to run and an explanation. This makes it easy to benchmark whether the reasoning is sound. I initially tried JSON mode, asking Claude to output only JSON, but Claude has strong internal system instructions and couldn't stop adding commentary. I first fixed this with a regex JSON parser to strip the commentary, but when you use regex to solve a problem, you get two problems. Then I realized: Claude Code is used to writing files, duh. So instead of fighting with JSON mode and regex, I asked it to write the JSON to a file. Works every time!

The final pipeline combines everything with what might be the ugliest bash command known to humankind, and the resulting prompt is piped to Claude. A sketch of what the invocation could look like (file names are hypothetical, and the flag names follow the Claude Code CLI, so verify them against your version):
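```bash
# Sketch: assemble the prompt (instructions + diff + suite list) and pipe
# it to Claude Code in print mode. File names are hypothetical; check the
# flags against your Claude Code version.
cat instructions.md pr-diff.txt e2e-suites.txt \
  | claude -p --allowedTools "Read" "Grep" "Glob" "Write"
```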

We allow the Write tool so Claude can write our results file. By the way, you should never use `--dangerously-skip-permissions`, which gives all permissions, including arbitrary shell and network access. I am surprised by how many people are taught to do this. If we did add that flag, someone could write in the prompt file and instruct Claude to read our environment variables and send them to a URL using Fetch(). Since the CI runs when a PR is opened, not merged, this would be similar to a "0-click" exploit.

I won't lie: this exceeded my expectations. We used to run all core tests, which took 44 minutes (and today it would take more than 2 hours, since we keep adding tests). Most PRs now complete E2E testing in less than 7 minutes, even for larger changes. Even if it performed worse, it would still be an incredible success, because our system has so many complexities that other types of tests (unit and integration) are nowhere near as effective as E2E.

The solution scales well because adding E2E test names consumes few tokens, and PR changes are mostly constant in size. Claude doesn't read all the test files: it focuses on the ones with semantic naming and explores the modified files' patterns, which is surprisingly effective.

Did Claude catch all the edge cases? Yes, and I'm not exaggerating: Claude never missed a relevant E2E test. It does tend to run more tests than needed, which is fine. Better safe than sorry.

How much does it cost? Without getting into sensitive details, the solution costs about $30 per contributor per month. Despite the steep price, it actually saves money on mobile device farm runners, and I expect these costs will drop as models become cheaper. Overall, we're saving money, saving developer time, and preventing bugs that would make it to production. So it's a win-win-win!

Thanks for reading Jampa.dev! Subscribe for free to receive new posts!

Jampa.dev 1 month ago

Why AI for coding is so polarizing

If you spend any time online, you've probably seen the wildly different opinions on using LLMs for coding. On one side, Twitter bros brag about how they built "a $1k revenue app in just 10 days using AI." On the other, engineers refuse to use any LLM tool at all; you'll find them in every thread, insisting that AI sucks, produces garbage code, and only adds to technical debt.

(Alt text: The most civilized Anti-AI vs Pro-AI conversation on Twitter.)

Joking aside, some people use AI to do great things daily, while others have problems with it and have given up. The difference is context.

An LLM has no sapience. Everything the AI cooks up is a product of its training corpus, fine-tuning, and a system + user prompt (with a bit of randomness for seasoning). No matter how clever your prompt is, the training data is its foundation; this is why companies are so aggressively scraping the web. If you create a new language tomorrow called FunkyScript, the AI will be terrible at it, regardless of your prompt.

This explains the different experiences of AI detractors and champions. On one hand, you have people new to coding, working on greenfield projects with popular tools like Tailwind and React (which have a massive training corpus). On the other hand, you have engineers working with more niche tools. A great example is CircleCI's YAML configuration: CircleCI's documentation is difficult for an AI to ingest (because it sucks), so the AI starts hallucinating and spitting out code for GitHub Actions instead.

Then there's the context window, the "short-term memory" of the AI. It's a known issue that the more context you stuff into a prompt, the "dumber" the model can get. When you're working on a greenfield project, there are no existing files or dependencies, so you don't need to provide much context, which saves you from spending tokens on it.

But greenfield projects aren't the norm. The norm is a legacy codebase built by multiple people who changed many parts and then left the company, with parts that don't make sense even to a human, much less to an LLM. All this extra context weighs down the LLM. Consider the same prompt: "Change all the colors to blue on my Auth page." In a new project, the AI can probably find and handle the relevant files. But in a mature codebase, that auth page is tied to a color system, which is part of a larger design system. Now the AI is in trouble. Throw in some unit tests that will inevitably break, and the AI is completely lost.

"Hey AI, you broke this stuff," you say, thinking you are not using AI well enough. Then the AI sycophantically replies: "You are absolutely right! Let me try another approach!" Now you're the one in trouble. It's time to shut the AI down and salvage what you can from the wreckage.

This isn't a perfect fix, but there is a strategy to make the AI less destructive and, eventually, genuinely helpful. You'll have to decide whether the upfront effort is worth it compared to manually coding. It won't be worth it for the FunkyScript codebase, but I have succeeded on niche stacks, like mobile E2E. In complex codebases, an AI must relearn your project's unique patterns with every prompt. The solution is to give it that knowledge upfront, rather than making it rediscover everything at "runtime." Having a good rules file (a CLAUDE.md, for example) that the LLM reads before performing a task helps the AI understand what makes your project different from its training-data defaults. That file is not for you to say "do it right, stop making it wrong," like a lot of people do.
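We can even use the AI itself to write a first draft of that file, for example by running the prompt below over the codebase and saving the output (a sketch assuming the Claude Code CLI; file names are hypothetical):

```bash
# Sketch: draft a rules file by letting the AI study the codebase first.
# Assumes the Claude Code CLI; onboarding-prompt.txt holds the prompt
# shown below.
claude -p "$(cat onboarding-prompt.txt)" --allowedTools "Read" "Grep" "Glob" \
  > CLAUDE.md
```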
Here is the example prompt. You should provide more high-level context for a real project, especially if your README.md sucks:

"You are a senior engineer onboarding another senior engineer to our codebase. Analyze the provided files at a high level. Study the structure and patterns, then write a document explaining how to work on it. Highlight the parts that differ from common industry patterns for this language and framework."

For example, do you use Bun instead of npm? Inline styles instead of CSS? These are crucial details the model needs to know; otherwise, it will default to the most common patterns in its training data.

So, the next time someone gives an opinion on AI that differs from yours, maybe don't immediately jump to arguing. They aren't necessarily doomers who will be replaced, nor grifters selling snake oil. Consider that not every engineer works on your stack or codebase.

... Or maybe they are all koopas:

Thanks for reading Jampa.dev! Subscribe for free to receive my shitposts and Goomba fallacies.

Jampa.dev 2 months ago

My advice to the new generation of software engineers

The job market is tough for junior engineers right now, and many companies have drastically reduced hiring for those roles. Some claim this is due to AI, which still needs someone to operate it. Others blame outsourcing, a practice that's been part of the industry since, well, forever. But the truth is, junior engineers have never had it easy. When I first started applying, I had seven years of experience writing software as a hobby and still struggled to get an interview. It wasn't until I was on the other side of the table, hiring juniors myself, that I finally understood what I had been doing wrong.

Ironically, the current situation is worse for juniors because, just a few years ago, things were too easy for everyone. Let's rewind to the COVID era. Governments initiated ZIRP (zero-interest-rate policy), meaning "safe investments" like bonds became less attractive. Investors with a lot of cash lying around needed another avenue to generate returns, and tech startups became a huge target for investment. Also, during COVID, people needed to digitize their processes: tools like Zoom, rarely used before, suddenly became multibillion-dollar companies. And with more people at home, industries like advertising, entertainment, and gaming also saw a massive infusion of cash.

These factors caused companies to hire like crazy, creating a shortage of software engineers. With few senior engineers available, the high tide raised all the boats, and companies started poaching engineers from each other, including juniors. Recruiters were reaching out even to boot camp graduates; I had a friend who was paid to attend a boot camp, with a job offer lined up at the end.

However, the hiring frenzy created a long-term problem for juniors. Everyone realized they were losing their junior engineers to other companies. Since juniors require training and would quickly get an offer elsewhere as "Senior" and leave anyway, companies stopped hiring them. With interest rates up, our industry is in a downturn (unless you slap "AI" on your product). I still believe juniors will make a comeback: the core principles of breaking into this industry haven't changed, and the demand for software engineers still exists. So, how do you do it when the odds are stacked against you?

Even before COVID, many computer science graduates left the industry after getting their degrees. The problem was that many of them focused only on getting good grades and forgot about the actual craft of programming. Imagine yourself as an employer reading resumes. What would set a candidate apart: their GPA, or the fact that they worked on live projects that people are actually using?

I worked at a company that hired many juniors. When we reviewed their resumes, they mostly fell into two buckets, with a very imbalanced split. 95% of resumes just listed a boot camp certificate or a college GPA, with the rest of the space filled with fun facts, big headshots, and flashy modern designs. The other 5% listed side projects, college research assistant work, public GitHub repos, Jupyter notebooks, or personal websites. We only interviewed candidates from that 5% bucket. We knew there were capable people with high potential in the 95% pile, but like most companies, we didn't have the budget or time to interview everyone.

It used to be that you got the job, and then you got the experience. Now it's the other way around.
If you're in college, you're in the perfect place to build connections and find opportunities, but very few students take advantage of it. Don't wait until your final semester, or until you desperately need an internship, to start thinking about your career. Unless you plan to be a researcher and pursue a master's degree, college is primarily a launchpad for your employment prospects.

It's also a chance to taste different areas of computer science. Most boot camp grads end up as mobile, front-end, or back-end engineers, but in college you can explore other segments, like embedded devices, firmware, or even game development. While you're there, find other students interested in building cool stuff; many of the biggest tech unicorns were founded by people who met in college and shared a passion. You can also pursue research opportunities with professors. You'll learn a ton, and some have valuable industry contacts who can provide strong referrals.

If you're not in college, you can start your career by doing "odd jobs" instead of only pursuing full-time employment. For example, try creating a startup; even if the idea is bad, it's the fastest way to learn. Freelancing is another possibility, like building a product for that friend-of-a-friend who owns a business. This path isn't for everyone, though. After a while, you might spend more time on business and negotiation than on becoming a better coder. Those are important skills, but they can get boring if your passion is the code itself. Another option, if you're in a country where college is cheap or free, is to go to college and apply the advice I mentioned earlier. That's what I did: I knew how to code but wanted to learn more.

The most important thing is to keep making things and sharing them. It doesn't need to be a viable business or make a single cent; it can just be something you find helpful that might be useful to others, too. Even when you "fail," you meet many new people, and even silly projects can lead to amazing things. I enjoy making scrapers, so back when Pokémon Go was at its peak, I built a map for my city. One of my first users was the CTO of one of the biggest companies in my city, who encouraged me to apply to his company.

If you do enough of that, your resume will eventually cross the 95%/5% gap, and people will start calling you for interviews. These "odd jobs" are career-defining: you will be forced to learn about optimization, caching, Redis, N+1 queries, microservices, and DevOps, and you can drop the links in your resume.

So, after a while, you will start to get interviews! Which is only half the battle. Interviewing is a skill that has almost nothing to do with your actual skill as a programmer, but LeetCode-style problems aren't going away. You need to read at least *Cracking the Coding Interview*, even if you aren't aiming for a FAANG job. And even after you've read it, prepare to flop a few interviews.

Remember that CTO who invited me to interview? I totally blew it. They asked me how I would design a database system using the Windows filesystem and folders, and I basically told them the idea was silly: "Why would you create a production database with TXT files on Windows? If you need a NoSQL-style system, why not use an actual RDBMS and avoid the Windows overhead?" That's exactly how their system was built, and they didn't appreciate my candidness. You will make mistakes in your first interviews. That's fine; it's how you learn to navigate the corporate world.
You'll learn what you can say and, more importantly, what you can't. In the end, despite no one saying so, most interviewers aren't looking for the best candidate; they're more concerned with avoiding hiring the worst ones. Once you land that first job offer and accumulate years of real-world experience, finding the next job gets easier. (It's never easy, of course, unless you have great connections.)

One final piece of advice: don't focus on money too early in your career. Career growth is way more important. Joining a large enterprise might offer more job security, but a startup often gives you more opportunities to shine and get promoted faster.

Thanks for reading Jampa.dev! Subscribe for free to receive new posts and support my work.

Jampa.dev 6 months ago

Testing the Big Five LLMs: Which AI Can Better Redesign My Landing Page?

The best thing about AI is that it can code the snippets I am not passionate about. I am glad that I no longer need to think hard to write JavaScript .reduce() or any Swift code. With the new flagship models coming in hot this month, like Gemini 2.5 and o1-pro, I thought it would be perfect to try them out in the category I suck at most: visual design.

This is the perfect opportunity to replace the designer in me who is terrible with hand-eye coordination and always got bad grades in art classes because my teacher thought I was "not taking it seriously enough." However, it is cheaper to benchmark LLMs than to go to therapy.

The Problem

I am writing a FOSS book about Engineering Management, and I have created a monstrosity of a home page below. As you can clearly see, I should get a designer.

Job to be done

We will take a screenshot and the code of the current homepage, feed them to multiple flagship AI models, and ask them to make it less horrible. The most popular reasoning models currently are Google Gemini 2.5, OpenAI o1 Pro High, xAI Grok 3 Think, DeepSeek R1, and Claude 3.7 Sonnet, so we are going with those.

System Prompt

I tweaked the system prompt in some preemptive benchmarks to improve common confusion points and things I missed. I find this prompt good enough, even for non-reasoning models:

One thing I am learning with system prompts is that if you ask an LLM for too much, it starts failing catastrophically. For example, it is okay to ask an LLM agent to create a unit test for a component, but if you ask it to "create all missing unit tests in my codebase," it shits the bed terribly. So I am going to keep it simple.

Scorecard

Let's establish a system to evaluate the AI results. I have created a scorecard for what we want the most:

Visual Design (50 points) - This is what we came for, so it should be half of the overall score.

Interactivity (25 points) - Mouse hovers and scroll animations, basically "making it pop."

Code Quality (15 points) - We should judge the code, since visual improvements are good but shouldn't come at the cost of maintainability.

Dark Mode Compatibility (10 points) - A "nice to have": our prompt doesn't even mention it, so we can focus on the above. If the AI messes this up, it is a quick fix.

Let's go!

I ran my current code against all the models and will share the code and the prompt when possible. Here is the original branch from which the code is taken.

Deepseek R1

Let's start with the oldest one (I cannot believe I am calling an LLM released in January old). Deepseek produced a great concept with a few caveats. The art doesn't mean much, the underlined links are fugly, and, most importantly, the shadows are terrible; the shadow-on-hover effect is dated. But at least it works with dark mode! The code is also not bad; it adds a lot of SCSS, but that is expected.

Result: Visual Design: 30 | Interactivity: 15 | Dark mode: 10 | Code Quality: 10

Gemini 2.5

Google's new LLM has performed very well in many of my benchmarks; it makes the best of even the worst prompts. This is very good. I have few complaints about this design: I don't like the double columns of the chapters, but that is about it. The hovering is great and presents the chapter in a very solid way! In terms of code, it did surprisingly well, and it even fixed my bad dark mode logic. On the other hand, it added a lot of unnecessary comments, which would not be ideal for pushing to main as is.
Result: Visual Design: 40 | Interactivity: 25 | Dark mode: 10 | Code Quality: 10

Grok 3

Grok is the first one to disappoint me so far, but at least it made me feel better about my own design. It repeated the book icon, "locked" the chapters as if I were selling a SaaS plan, and visually nested the cards too much. Overall, there were some small changes, but none were positive. It's not stellar on the code side either, but at least I don't see any downsides.

Result: Visual Design: 10 | Interactivity: 15 | Dark mode: 5 | Code Quality: 10

o1 Pro High

o1 knows the key to my heart: I love blue and gradients. Maybe the fact that it did exactly what I like without me asking creeps me out. I really like the header; it draws more attention than the others. And instead of adding a generic SVG, it went with the best approach given current LLM capabilities. The chapters menu is also not bad; I like that it starts over the banner, so it flows better. The padding around the chapter cards could be better, and the broken white padding around the whole site might have been a mistake. The dark mode is utterly broken. On the code side, there is nothing bad; I could merge this as is.

Result: Visual Design: 40 | Interactivity: 25 | Dark mode: 0 | Code Quality: 15

Con... Oh wait, it's Claude Sonnet 3.7, with a steel chair!

I almost forgot Claude, to be honest. I always hear compliments about its code quality and how people prefer it in Cursor and Windsurf, and I am glad to see its greatness in UI design as well. That gradient is super sleek, and the subtle blue backdrop is also cool. The SVG, while not the most incredible one, was at least the most relevant. I just didn't like that it added the famous "3 boxes with icons" that you see in every landing page template, and the bullet points and double columns are not visually pleasing. Ironically, since Claude LLMs are praised for their code quality, I expected more: it tried to import a new font and to make the book's page flip, which clearly did not work.

Result: Visual Design: 35 | Interactivity: 20 | Dark mode: 15 | Code Quality: 5

Conclusion

The results are in:

Gemini 2.5: 40 + 25 + 10 + 10 = 85 points

o1 Pro High: 40 + 25 + 0 + 15 = 80 points

Claude 3.7 Sonnet: 35 + 20 + 15 + 5 = 75 points

Deepseek R1: 30 + 15 + 10 + 10 = 65 points

Grok 3: 10 + 15 + 5 + 10 = 40 points

One thing I missed in all the LLMs was improving the header text. I know that my callout is terrible, because I am very bad at selling things. The system prompt even mentioned this, but all the LLMs ignored it. At least all of them succeeded in fixing 2 grammatical errors!

It is impressive that the AI "knows" how to better design a website without any visual aid to validate it afterward. Sure, it's not the Linear home page, and it won't win any Awwwards, but in the end, I think that is the current LLM limitation: it is a blender of text absorbed by the corpus, resulting in an average of all designs worldwide. I am also impressed by how far the AI has advanced. If I had tried this a few months ago, the results wouldn't have been ready-to-run code; there would have been a few improvements in the code, but mostly not visual ones.

In the end, yes, any of these LLMs except Grok would improve the current landing page, and I should have just applied the improvements instead of writing an article. However, I don't want an average book page. Even if I don't get paid for it, I want an excellent one. So, for that reason, I am still getting a designer.

Thanks for reading my blog! Subscribe for free to receive new posts about AI and Tech Careers content.

Jampa.dev 6 months ago

The Battle for Attention

LinkedIn shows that I have six notifications. I know that none are interesting, but I click on them anyway; like an empty fridge, I always fall for it. And if I don't read them, it will start sending me emails.

Speaking of emails, I am unsure when I last had a real conversation over one. Most of them just want to tell me that their company exists. "We are updating our privacy policy." We both know that's not why you are emailing me; no one, not even you, cares about your privacy policy. Email shadow advertising takes many other forms too, like "How did we do? Give us feedback that we won't ever read" and "Your package has updates! (It's in a transit city that doesn't matter.)"

When we talk about attention, we always think about social media, which tries to get you addicted somehow. If you want to test this, create a new account on Twitter / TikTok / Instagram and see what the algorithmic feed's "seed" posts are. They are always quasi-nudity, religion, public fights, hustlebros, or politics. It is bad even when there is no algorithmic feed. Some years ago, Reddit's "all" page used to entertain and surface interesting news, but not anymore. If I open my Reddit /r/all page, all the posts are about <x> destroying the world, but what can I do? None of those posts offer anything interesting or actionable. Most are rage-bait relationship stories written by AI and "clever comebacks" at a boomer politician who doesn't even know what an "ecks" is.

(Source: ExtraFabulousComics)

It is bad, even for creators

It didn't even click with me until I started a startup: attention is more valuable than short-term financial success. Views and signups are cool, but it is better if users return to you daily and spend a lot of time in your app. Later, you can figure out how to turn that attention into money. There is no limit to wealth, but there is a limit to people's lives and how much of them they spend on your app.

If you want to show the world something you made, you need to gather that attention. "Build it and they will come" is BS, as anyone who has ever created something knows. And if you want that attention, you need to piggyback on existing platforms and try to syndicate some of theirs. But they will always get the last laugh when they rug-pull you, and "your attention" is stolen back into the "platform's attention." For example, if I post on LinkedIn with a link attached, no one will ever see the post, because LinkedIn doesn't want people clicking links that take them off the site. If you are a YouTube creator, you know that people subscribing is not enough; you have to ask them to hit the bell, because the algorithm sometimes won't prioritize your videos to your own subscribers and will prefer random videos instead. Ironically, even Substack (this platform) does this. When I go to Substack.com, my home is not my dashboard but other, unrelated people's blogs, nothing close to the things I care about. Why, Substack, why are you not opening my dashboard by default?

The only way out of this hell is to treat your time as a currency. "Time is money" takes on a whole new meaning: now, your time is someone else's money. I have the same disregard for people who want to steal my time as I do for people trying to scam me out of my money. Send me an AI cold email, and I report it as spam; have fun getting deservedly banned by Gmail. Mobile notifications are a privilege I no longer grant to any social media. The attention-seeking spam keeps coming in, but I slowly push it out.
Nowadays, I limit my social media to subscribed subreddits and Hacker News. Honestly, I sometimes can't resist going to Reddit's /r/all page and my local news websites, and I regret it every time. But like a diet, I know I might improve if I consume less garbage.

Jampa.dev 7 months ago

How promotions happen after Senior

From junior to senior level, promotions depend mainly on meeting the proficiency bar for technical skills and communication abilities. But after reaching a senior level, whether your next step is toward Staff or Manager, clearing the proficiency bar alone isn't enough; there must also be a clear "business need." Positions typically become available when your company expands, someone leaves, or new initiatives emerge. The "business need" means that advancing beyond senior relies not just on you but also on your company's circumstances. This makes getting promoted in companies that aren't growing much harder, because new roles or opportunities are rarely created, and when a new senior role does open up, many internal candidates will compete for it.

This is what makes growing companies attractive. They constantly need to find new ways to capture more market share, resulting in frequent hiring, the formation of new teams and initiatives, and, consequently, more senior+ roles. If you want to accelerate your career growth, consider joining a rapidly growing startup. You'll likely have greater opportunities to handle significant challenges and rapidly develop your skills. The trade-off is usually stability and possibly a lower salary.

Before accepting a new role, take the time to research the company's growth trajectory. Publicly available metrics such as headcount growth, Glassdoor reviews, funding rounds, and media coverage are good indicators of a company's health. For larger companies, you can also assess long-term potential by reviewing customer experiences on Reddit or other forums; generally, satisfied customers point to a positive trajectory for the company.

It's important not to take a passive approach to your career advancement. There's a well-known saying in Western culture: "The squeaky wheel gets the grease." In conversations with friends who mentioned deep frustrations at work, I would often ask, "What does your manager say when you discuss it?" Surprisingly often, they had never even brought it up. Ideally, your manager should initiate conversations about your career development at least every six months. Unfortunately, many managers never discuss career development, meaning you'll likely need to be the one to bring it up.

Nobody is more invested in your career than you. Be explicit about your career goals during your one-on-ones with your manager. Doing so enables more relevant feedback tailored to your desired path and helps identify the skills to focus on. If your current manager understands your goals for advancement beyond senior, they can mentor you more effectively by highlighting relevant lessons from situations that arise; without an awareness of your aspirations, they may miss opportunities to provide valuable context or guidance. Proactively seek opportunities with your manager to learn the essential skills for your desired path, whether it's interviewing candidates, running meetings effectively, strategic thinking, or directly interacting with customers.

If possible, schedule occasional skip-level meetings with your director. These conversations offer insight into the company's higher-level challenges, and they double as an opportunity to get to know, and build rapport with, someone who could potentially become your future boss. Senior leaders generally know exactly which individual contributors (ICs) on their teams are fully prepared to step into higher-level roles if needed.
Typically, these people are highly competent senior ICs whom teammates naturally approach when they need help. They're effective communicators, strong mentors, and skilled at analyzing business requirements alongside technical tradeoffs, which makes them "go-to" advisors for their colleagues. Another invaluable trait senior leadership looks for is curiosity, which is essential when offering feedback constructively. For instance, if a colleague takes an unusual approach in their work, instead of calling them out directly, ask why they chose that method rather than the accepted "ideal" approach. Framing your concern as a question demonstrates trust in their abilities and encourages them to reflect and learn.

One of the most powerful actions you can take if you're seeking career growth is to position yourself as a "force multiplier." Although this concept often goes unnoticed, it significantly impacts a project's success. Force-multiplier work is work that, once completed, enables others on the team to deliver faster and more effectively. Whether you aspire to become a Staff-level engineer or a Manager, becoming a force multiplier is an essential milestone you'll need to achieve. Staff-level leaders don't necessarily do 2-10 times the direct work of a senior individual contributor. Instead, they identify the most impactful opportunities, remove critical roadblocks, or improve infrastructure, making their teams significantly more productive.

For example, if frequent production bugs constantly stall your team's progress, implementing critical automated testing can be a significant multiplier: it ensures improvements can ship safely and saves hours of debugging later. Another scenario: shifting requirements frequently force team members to redo their work. One individual who communicates effectively with stakeholders and creates comprehensive specifications before the team's efforts begin can save many hours of redundant work.

Force-multiplier efforts increase everyone's efficiency. Imagine you reduce your teammates' task times by 5 minutes each, and they complete around 3 tasks per day. Across a 5-person team, that's 75 minutes saved per day; over roughly 250 working days, you've effectively added **more than 300 hours, close to 40 working days, per year** to your team's productivity!

Identifying possible multipliers can be challenging since they're often highly specific to a particular context or team dynamic. However, common areas worth exploring include:

* Overly complex processes or architecture forcing team members to duplicate effort across multiple areas, introducing errors.
* A lack of robust quality checks, slowing the team down with manual testing and debugging cycles.
* Outdated or inefficient tools complicating maintenance and hindering effective troubleshooting or productivity.
* Non-technical process inefficiencies, like unclear task definitions or ineffective planning meetings, which add unnecessary overhead.

The greater the number of people impacted by your improvement, the stronger the multiplier effect. Look proactively for gaps within your team's workflow or processes. Your manager can help you spot these bottlenecks, but as someone actively involved in day-to-day work, you're perfectly placed to notice inefficiencies firsthand. Asking your teammates directly what they like or dislike about the project can also yield valuable insights. Strive to progressively take on additional responsibilities through incremental trust-building with your manager and teammates.
Promotions to Staff+ roles appear to happen overnight but actually take months of gradual progression. Like many others, I started as an individual contributor at a growing startup, and my responsibilities gradually shifted from technical tasks to hiring, mentoring, planning, and client interaction. Eventually, I found myself spending most of my time guiding and leading rather than executing individually. When it became clear that I was essentially doing the higher-level work without official recognition, I approached senior leadership about it. Shortly afterward, I received the official recognition and title, along with greater strategic responsibilities.

Finally, even if promotion opportunities don't appear at your current company, the skills and experiences you develop make excellent talking points when interviewing elsewhere.

Thanks for reading Jampa.dev! Subscribe for free to receive new posts.

Jampa.dev 9 months ago

Becoming a force multiplier

One of the most impactful things you can do is aim to become a force multiplier. The concept is under-noticed among engineers but can make a sizable difference in a project. Force-multiplier work is work that, when delivered, allows other people to deliver faster. Whether you want to be a Staff Engineer or an Engineering Manager, this is the most crucial bar you need to clear.

A staff engineer or an architect does not do 2-10x the work of a senior engineer. Instead, they make teams 2-10x more productive by working on the areas that improve everyone's output. If you make an improvement that shaves 5 minutes off each engineer's task, and each engineer performs about 3 tasks in a working day, then on a team of 5 people you just gave your team roughly 375 hours of work in a year. There is no 10x engineer, but there is work done by one engineer that can have a 10x impact.

For example, a team plagued by constantly shipped bugs leads an engineer to create critical automated testing. The result is fewer bugs breaking production, and every engineer on the team speeds up, reassured that their tickets won't break when shipping; it also saves hours of debugging. Another case is a team forced into frequent back-and-forth because technical requirements keep changing. One engineer who communicates effectively with stakeholders can produce excellent technical specifications before the first line of code is written, saving the team hours of rewriting.

The larger the number of people impacted by a positive change, the more powerful the multiplier. Not every change needs to be dramatic; some of the most productive ones simply remove communication overhead.

Seeking opportunities

Force multipliers are difficult to detect because they are specific to a team and codebase. But as a general rule, good candidates are the bottlenecks and inefficiencies that shape the daily aspects of an engineer's work, for example:

* Overengineered code architecture that requires engineers to write duplicated code across codebases, with occasional bugs.
* Quality-control gaps, especially in automation, which make engineers spend more time testing their tickets or shipping bugs that require new tickets to fix.
* Outdated or reinvent-the-wheel code that introduces bugs, requires constant attention, and makes it hard to find solutions online because documentation only covers newer versions.
* Non-technical cases, such as proposing improvements to the shipping process (ticket structure, ceremonies) to avoid additional meeting overhead.

Therefore, look for gaps in your team. Your manager can help you identify them, but as someone working actively in the code, you are privileged to know how things really are. Asking your fellow engineers what they dislike but still have to spend time on is also a good way to get that information.

I am writing an open-source book for engineering managers (EMs) seeking to enhance their skills and senior engineers aspiring to transition into an EM role. If that is your thing, check out the GitHub.

Jampa.dev 11 months ago

Google AI tools are surprisingly underrated

Google has a problem with releases: they start by announcing the product, which generates a lot of hype, but all we get is a landing page and a paper. After a few months, when people stop caring, Google quietly releases the tool… but it's a slow rollout, US-only. 🤦 Compare this with OpenAI's strategy: they create hype long before releasing, then casually drop a closed beta, with a public release right after.

It's also Google's fault that nobody follows its tools; the names keep changing. They invested millions in Bard just to rename it Gemini. Gemini itself has multiple tiers and versions with caveats, like "1.5 Pro" being better than "1.0 Ultra", or "1.5 Pro 002", where the last number is padded with two zeros, meaning more are coming to confuse everyone.

A "Bard" ad I saw in Shibuya, Japan. So much money wasted on brand recognition…

So why should you care about these tools when the AI is far from ChatGPT-4o's capability (Gemini feels more like GPT-3.5)? Because it excels at needle-in-a-haystack problems. ChatGPT fumbles a lot of its tokens: GPT-4o theoretically gives you 16k tokens, which seems like a lot, but the more tokens you feed it, the more it tends to forget the earlier ones. It probably uses a "rolling-window" approach, so it doesn't weigh those earlier tokens as much as it should, yet you still pay for all of them.

If you want to use Gemini's tools, you must also use different websites with generic names that will probably change twice before the project gets killed. The tools are Notebook LLM and AI Studio. Not to be confused with LLM Studio, a popular FOSS tool not made by Google. I told you it was confusing.

Notebook LLM sells itself short. It claims to be a tool for studying and brainstorming ideas using your documents; they also claim it can turn documents into an AI podcast(?). At first, it seems like something a student would use to cheat… I mean, get help with their exams. But the tool shines as an excellent "Google Search for your documents." Feed it large documents and it can retrieve information and make interesting critiques.

For example, I am using it extensively for my new engineering-manager book. Having the AI write extensive critiques, with sources to back them up, helps me write better without losing my agency as an author and turning things into slop. I don't want generative AI writing for me, since it only gives generic advice from SEO-hungry websites; using Notebook LLM means I am still in control of my writing. And unlike ChatGPT, which is always positive, Notebook LLM's feedback is real critique that is sometimes very humbling.

AI Studio

What Notebook LLM does for large documents, AI Studio does for large videos. It is very useful for extracting details from any media you have, and it seems to be the only tool that can. I imagine it being a game changer for people who document things on video, like scouts for filming locations or real estate agents writing a pitch.

I sent it 3 videos I had made for later reference when visiting a house and asked it to write a pitch from them. One thing that amazed me: it spotted a grapevine and used it in the pitch! Overhyped pitch aside, it was impressive how well it gathered features from the property.

I've also been using it for transcribing and annotating video evidence. My house-inspection video is 17 minutes long, and there was no way I was ever going back to watch the whole thing again.
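For reference, here is roughly what that workflow could look like through the Gemini API rather than the AI Studio web UI. This is a minimal sketch assuming the google-generativeai Python SDK; the file name and prompt are placeholders, not the ones I used:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the video via the File API and wait for processing to finish.
video = genai.upload_file(path="house_inspection.mp4")  # placeholder name
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

# Ask the Flash model to transcribe and pull out action points.
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content([
    video,
    "Transcribe this inspection video and list every defect as an action item.",
])
print(response.text)
```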
This transcription helps a lot with getting the action points. Something is very strange, though: since the Flash model worked so well, I assumed the "Pro" model would work even better, but surprisingly, it was way worse. I don't have an explanation for this.

Gemini Flash 1.5: It was surprisingly good despite its verbosity… So Gemini 1.5 Pro should be even better, right?

Gemini 1.5 Pro: Not only did it take 1:20 minutes to run, but the output is more incomplete than the Flash version, and it triggered a content warning for no reason.

Another limitation from Google is that the model is fine-tuned toward safety. Some very casual videos I've uploaded triggered a warning that I might be breaking the ToS. It also uses the word "diverse" a lot; its internal instructions probably tell it to focus on diversity, but it ends up using the word for everything, from "diversity of rooms" to "diverse problems." It uses "diverse" the way ChatGPT uses "delve."

Gemini is far from beating ChatGPT, and perhaps even Claude Sonnet. Google is playing it safe with what it has, but we should not count them out of the game. These tools are currently free, and Notebook LLM does not train on your data; with ChatGPT, segmenting documents and videos of this size into tokens and images is prohibitively expensive. The thing left for us to decide is how long "free" will last…

If you are interested in the engineering-manager book, I will send you the first chapters for free. They cover the best and worst parts of the job, how to succeed in the manager interview, and how the role differs across companies. If you provide feedback, I will send you the digital version of the book for free once it is released!

Jampa.dev 1 year ago

Wardriving for a place to live

I moved to a small city where I don't know anyone, and now I want to buy a place. Problem is: there are no listings anywhere, almost no real estate agents, and no updated information on the web. In a city where you have to call to order deliveries, and Google Maps won't even give you the phone number, how do I find a place to live? Moving to a high-quality neighborhood here costs far less than a normal one in a big city, so the search is worth it, especially for a first-time buyer.

Driving around, I see plenty of properties "for sale", announced with a sign and a phone number, no real estate agency attached. I could find one just by driving. But the city is big, with too many streets to cover and too many unmapped dead ends; it has expanded organically, and high- and low-infrastructure areas are just one street apart.

The city is a colonial-era city where the main export was/is sugarcane, near a major highway, with a lot of dead ends and no urban planning.

With Street View, you can see the "good parts" of town. But that is still too much work, and honestly, I can barely remember where I "drove" afterwards. If there were a way to "cull" neighborhoods, I would know where the potential homes are without driving into dangerous places, and I could narrow my search to only those spots.

The plan(TM):

Get all streets in the city area and points equally distributed along each road.

Get all the Street View pictures.

Make GPT-4 the judge of the neighborhoods.

Plot it on a map!

I didn't bother with Google Maps at first because it has too many "false" streets and would probably cost something. It would take years of my time to download the 30 MB JS bundle of the Google Cloud Console and navigate its menus. So I asked ChatGPT to generate code using OpenStreetMap. Not gonna lie, I didn't even want to read the code it made: "I just want the .exe," I said. But ChatGPT got it horribly wrong and wouldn't fix it, so, like a caveman, I had to read it and fix it myself.

For the roads, I marked a point every 70 m (in imperial units, that's 0.76 football fields). I had a "bug" where every intersection also got a point, but that's because an intersection is the "beginning of a road." Duh! (A sketch of this sampling step is below.)

It is super expensive now to use Google APIs, and Playwright is easier than reading Google documentation, so I am cashing in the favor for having contributed to Google Maps long enough.

One huge tip: make the image dimensions a multiple of 512 for OpenAI. A 512x512 image costs 4x less than a 513x513 one; the documentation says images are split into 512x512 tiles of 85 tokens each, so I used 1024x512, wide and only 2 tiles!

I made sure to capture one angle along the road and another rotated 180° to avoid bias: rated separately, the first image scores higher than the second, despite being two angles of the same point.

After the images were downloaded, I spent hours prompt-writing and got mad that the first short draft I made was the best one. GPT-4 gets biased if you give it too many details; if Brazil is mentioned, it starts to sugarcoat how bad the street is: "This precarious street without sidewalk […] is typical of a Brazilian suburb - I give 4/5" - GPT with more context. Was that a roast? Was the AI being ironic? And if you spell out which factors to consider in the prompt, it hyperfocuses on finding a positive aspect in the input: a street with an abandoned lot was given credit for its "greenery".
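Here is a minimal sketch of that point-sampling step, assuming the osmnx and shapely stack; the city name is a placeholder, and the code is my reconstruction, not the ChatGPT-generated script from the post:

```python
import osmnx as ox

# Pull the drivable street network from OpenStreetMap (placeholder city name).
G = ox.graph_from_place("Some City, Brazil", network_type="drive")
edges = ox.graph_to_gdfs(G, nodes=False)

# Project to a metric CRS so lengths and offsets are in meters.
edges = ox.projection.project_gdf(edges)

points = []
for line in edges.geometry:
    # Start at 70 m instead of 0 so an intersection (the "beginning of the
    # road") doesn't get its own duplicate point -- the "bug" mentioned above.
    offset = 70.0
    while offset < line.length:
        points.append(line.interpolate(offset))
        offset += 70.0

print(f"{len(points)} sample points to fetch from Street View")
```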
ChatGPT probably has instructions to be positive about whatever you put in the prompt, which is not ideal here. Another good surprise is that the new structured output is awesome: you don't need to YELL at the AI not to over-explain things or to safeguard it, and it saved a lot of prompt-calibration time.

Last thing: there are loose spots, a few good houses on an overall bad street. Those are not ideal candidates, so I used something similar to KNN to remove that "noise" (a rough sketch of this pass is at the end of the post). Yes, I wrote O(n**2) code, since my time is worth more than the energy my Mac will burn.

Overall it was good enough: looking at the ratings and the images in a vacuum, the way the AI did, I would rate them the same. There were no disagreements between me and the AI that would invalidate the whole thing. I also removed the beginning of each street in my code and dropped nearby points to save some tokens. Here is the final thing, using QGIS and TIN interpolation among the points:

Insights

The good news: every place got at least a 1. Despite being told to rate from 0-5, it gave no legit zeros, even on muddy roads with heavy vegetation. That turned out to be a good thing: it independently decided to reserve 0 for "not found" images, which I could easily filter out.

The bad news: no place got a 5, and the one 4 was removed by the KNN pass, which is good for the quality of the rating but bad news for the city overall. I think a 4-5 would be posh American suburbia, with an HOA, grass lawns, and a wide asphalt road with no overhead wires, and we don't have that here. The only place that got a 4 did so because someone left their garage door open; the AI saw a pool and thought: "Hmmm, yes, very upscale".

A surprise was how consistently it judged by its own parameters: every rating was done in isolation, yet scores stayed stable across nearby road spots. Sudden jumps from 1 to 3, for example, were rare. Another surprise was that the AI loved parks. It could be the saddest playground ever; if there was a park, the spot got at least a 1-point bump. That makes sense, though: a park is a "permanent view" of sorts.

Codepen from another run (ignore the Google warning modal).

What's next

Now I need to visit those places and drive around to see what is actually for sale. Next steps might be getting a 360° camera, putting it on the roof of the car, à la Street View, and taking pictures to look for "for sale" signs. Stay tuned!
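For the curious, here is a rough sketch of that KNN-ish noise-removal pass. It is my reconstruction under assumptions, with hypothetical names: points are (x, y, score) tuples, and k and the disagreement threshold are made up:

```python
import math

def remove_noise(points, k=5, max_gap=2.0):
    """points: list of (x, y, score) tuples. Returns the points kept.

    Drops points whose score disagrees strongly with the average score of
    their k nearest neighbours. Deliberately O(n^2); n is small here.
    """
    kept = []
    for x, y, score in points:
        # All other points, sorted by distance; take the k closest.
        neighbours = sorted(
            (p for p in points if p != (x, y, score)),
            key=lambda p: math.hypot(p[0] - x, p[1] - y),
        )[:k]
        if not neighbours:
            kept.append((x, y, score))
            continue
        local_avg = sum(p[2] for p in neighbours) / len(neighbours)
        # Keep the point only if its score is close to the local average.
        if abs(score - local_avg) <= max_gap:
            kept.append((x, y, score))
    return kept
```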
