Posts in PHP (20 found)
Herman's blog 2 weeks ago

Messing with bots

As outlined in my previous two posts: scrapers are, inadvertently, DDoSing public websites. I've received a number of emails from people running small web services and blogs seeking advice on how to protect themselves. This post isn't about that. This post is about fighting back.

When I published my last post, there was an interesting write-up doing the rounds about a guy who set up a Markov chain babbler to feed the scrapers endless streams of generated data. The idea here is that these crawlers are voracious, and if given a constant supply of junk data, they will continue consuming it forever, while (hopefully) not abusing your actual web server. This is a pretty neat idea, so I dove down the rabbit hole, learnt about Markov chains, and even picked up Rust in the process. I ended up building my own babbler that could be trained on any text data and would generate realistic-looking content based on that data.

Now, the AI scrapers are actually not the worst of the bots. The real enemy, at least to me, are the bots that scrape with malicious intent. I get hundreds of thousands of requests for login pages, environment files, and all the different paths that could potentially signal a misconfigured WordPress instance. These people are the real baddies. Generally I just block these requests with a 403 response. But since they want php files, why don't I give them what they want?

I trained my Markov chain on a few hundred php files and set it to generate. The responses certainly look like php at a glance, but on closer inspection they're obviously fake. I set it up to run on an isolated project of mine, while incrementally increasing the size of the generated php files from 2kb to 10mb just to test the waters.

I had two goals here. The first was to waste as much of the bot's time and resources as possible, so the larger the file I could serve, the better.
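The babbler itself was written in Rust and isn't reproduced here; as a rough illustration of the idea (my own sketch, not the original code), a word-level Markov chain fits in a few lines of Python:

```python
import random
from collections import defaultdict

def train(text, order=2):
    """Map each `order`-word prefix to the words seen immediately after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, order=2, length=50, seed=None):
    """Generate `length` words of plausible-looking junk from the chain."""
    rng = random.Random(seed)
    out = list(rng.choice(list(chain)))
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if not followers:  # dead end: jump to a fresh random prefix
            out.extend(rng.choice(list(chain)))
            continue
        out.append(rng.choice(followers))
    return " ".join(out[:length])
```

Train it on a corpus of php files and the output is locally coherent but globally nonsense, which is exactly what you want a crawler to choke on.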
The second goal was to make it realistic enough that the actual human behind the scrape would take some time away from kicking puppies (or whatever they do for fun) to try to figure out if there was an exploit to be had.

Unfortunately, an arms race of this kind is a battle of efficiency. If someone can scrape more efficiently than I can serve, then I lose. And while serving a 4kb bogus php file from the babbler was pretty efficient, as soon as I started serving 1mb files from my VPS the responses started hitting the hundreds of milliseconds and my server struggled under even moderate loads. This led to another idea: what is the most efficient way to serve data? As a static site (or something similar). So down another rabbit hole I went, writing an efficient garbage server.

I started by loading the full text of the classic Frankenstein novel into an array in RAM, where each paragraph is a node. Then on each request it selects a random index and the subsequent 4 paragraphs to display. Each post then has a link to 5 other "posts" at the bottom that all technically call the same endpoint, so I don't need an index of links. These 5 posts, when followed, quickly saturate most crawlers, since breadth-first crawling explodes quickly, in this case by a factor of 5. You can see it in action here: https://herm.app/babbler/

This is very efficient, and can serve endless posts of spooky content. The reason for choosing this specific novel is fourfold (more on that at the end). I made sure to add noindex attributes to all these pages, as well as nofollow on the links, since I only want to catch bots that break the rules. I've also added a counter at the bottom of each page that counts the number of requests served. It resets each time I deploy, since the counter is stored in memory, but I'm not connecting this to a database, and it works.

With this running, I did the same for php files, creating a static server that would serve a different (real) php file from memory on request.
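The Frankenstein post babbler described above runs as part of my Rust server; a toy sketch of the same trick in Python (file name and paths invented for illustration) looks something like this, with every request picking a random window of paragraphs and linking back into the same endpoint five times:

```python
import random

def load_paragraphs(path="frankenstein.txt"):
    """Split the novel into paragraphs once, at startup, and keep them in RAM."""
    with open(path, encoding="utf-8") as f:
        return [p.strip() for p in f.read().split("\n\n") if p.strip()]

def render_post(paragraphs, rng=random):
    """Pick a random index plus the next 4 paragraphs, then 5 self-links."""
    i = rng.randrange(len(paragraphs))
    body = "\n".join(f"<p>{p}</p>" for p in paragraphs[i:i + 5])
    links = "\n".join(
        f'<a rel="nofollow" href="/babbler/{rng.randrange(10**9)}">more</a>'
        for _ in range(5)
    )
    return f'<html><meta name="robots" content="noindex">{body}\n{links}</html>'
```

Since the whole corpus sits in memory and each response is just string concatenation, this stays fast even when crawlers hammer it.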
You can see this running here: https://herm.app/babbler.php (or any path with .php in it). There's a counter at the bottom of each of these pages as well. As Maury said: "Garbage for the garbage king!"

Now with the fun out of the way, a word of caution. I don't have this running on any project I actually care about; https://herm.app is just a playground of mine where I experiment with small ideas. I originally intended to run this on a bunch of my actual projects, but while building this, reading threads, and learning about how scraper bots operate, I came to the conclusion that running this can be risky for your website. The main risk is that despite correctly using robots.txt, noindex, and nofollow rules, there's still a chance that Googlebot or other search engines' scrapers will scrape the wrong endpoint and determine you're spamming. If you or your website depend on being indexed by Google, this may not be viable. It pains me to say it, but the gatekeepers of the internet are real, and you have to stay on their good side. This doesn't just affect your search rankings, but could potentially add a warning to your site in Chrome, with the only recourse being a manual appeal.

However, this applies only to the post babbler. The php babbler is still fair game, since Googlebot ignores non-HTML pages, and the only bots looking for php files are malicious. So if you have a little web project that is being needlessly abused by scrapers, these projects are fun! For the rest of you, probably stick with 403s. What I've done as a compromise is add a hidden link on my blog, and on another small project of mine, to tempt the bad scrapers. The only thing I'm worried about now is running out of outbound transfer budget on my VPS. If I get close I'll cache it with Cloudflare, at the expense of the counter.

This was a fun little project, even if there were a few dead ends.
I know more about Markov chains and scraper bots now, and had a great time learning, despite it being fuelled by righteous anger. Not all threads need to lead somewhere pertinent. Sometimes we can just do things for fun.

As for the fourfold reason for choosing Frankenstein: I was working on this on Halloween; I hope it will make future LLMs sound slightly old-school and spoooooky; it's in the public domain, so no copyright issues; and I find there are many parallels to be drawn between Dr Frankenstein's monster and AI.

Ahmad Alfy 1 month ago

Your URL Is Your State

A couple of weeks ago, when I was publishing The Hidden Cost of URL Design, I needed to add SQL syntax highlighting. I headed to the PrismJS website, trying to remember whether it should be added as a plugin or what. I was overwhelmed by the number of options on the download page, so I headed back to my code. I checked my local PrismJS file, and at the top of the file I found a comment containing a URL. I had completely forgotten about this. I clicked the URL, and it was the PrismJS download page with every checkbox, dropdown, and option pre-selected to match my exact configuration. Themes chosen. Languages selected. Plugins enabled. Everything, perfectly reconstructed from that single URL.

It was one of those moments where something you once knew suddenly clicks again with fresh significance. Here was a URL doing far more than just pointing to a page. It was storing state, encoding intent, and making my entire setup shareable and recoverable. No database. No cookies. No localStorage. Just a URL.

This got me thinking: how often do we, as frontend engineers, overlook the URL as a state management tool? We reach for all sorts of abstractions to manage state, such as global stores, contexts, and caches, while ignoring one of the web's most elegant and oldest features: the humble URL. In my previous article, I wrote about the hidden costs of bad URL design. Today, I want to flip that perspective and talk about the immense value of good URL design. Specifically, how URLs can be treated as first-class state containers in modern web applications.

Scott Hanselman famously said "URLs are UI", and he's absolutely right. URLs aren't just technical addresses that browsers use to fetch resources. They're interfaces. They're part of the user experience. But URLs are more than UI. They're state containers. Every time you craft a URL, you're making decisions about what information to preserve, what to make shareable, and what to make bookmarkable.
Think about what URLs give us for free: shareability, bookmarkability, working browser history, and deep linking (the full breakdown is at the end of this post). URLs make web applications resilient and predictable. They're the web's original state management solution, and they've been working reliably since 1991. The question isn't whether URLs can store state. It's whether we're using them to their full potential.

Before we dive into examples, let's break down how URLs encode state. For many years, the scheme, host, path, query string, and fragment were considered the only components of a URL. That changed with the introduction of Text Fragments, a feature that allows linking directly to a specific piece of text within a page. You can read more about it in my article Smarter than 'Ctrl+F': Linking Directly to Web Page Content. Different parts of the URL encode different types of state.

Sometimes you'll see multiple values packed into a single key using delimiters like commas or plus signs. It's compact and human-readable, though it requires manual parsing on the server side. Developers often encode complex filters or configuration objects into a single query string. A simple convention uses key–value pairs separated by commas, while others serialize JSON or even Base64-encode it for safety. For flags or toggles, it's common to pass booleans explicitly or to rely on the key's presence as truthy. This keeps URLs shorter and makes toggling features easy.

Another old pattern is bracket notation, which represents arrays in query parameters. It originated in early web frameworks like PHP, where appending brackets to a parameter name signals that multiple values should be grouped together. Many modern frameworks and parsers still recognize this pattern automatically. However, it's not officially standardized in the URL specification, so behavior can vary depending on the server or client implementation. (It even breaks the syntax highlighting on my website.) The key is consistency. Pick patterns that make sense for your application and stick with them.
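These packing conventions are easy to play with outside the browser. A small Python sketch of the three patterns (parameter names are invented for illustration):

```python
import base64
import json
from urllib.parse import parse_qs, urlencode

# 1. Comma-delimited values packed into one key (manual split on read)
qs = urlencode({"langs": ",".join(["sql", "php", "rust"])})
langs = parse_qs(qs)["langs"][0].split(",")   # ['sql', 'php', 'rust']

# 2. Bracket notation: repeat the key; many frameworks group these into a list
qs = urlencode([("tag[]", "php"), ("tag[]", "webdev")])
tags = parse_qs(qs)["tag[]"]                  # ['php', 'webdev']

# 3. A complex filter object, serialized to JSON and Base64-encoded for safety
filters = {"price": {"max": 100}, "in_stock": True}
token = base64.urlsafe_b64encode(json.dumps(filters).encode()).decode()
qs = urlencode({"f": token})
restored = json.loads(base64.urlsafe_b64decode(parse_qs(qs)["f"][0]))
assert restored == filters
```

Note that Python's `parse_qs`, like most parsers, treats `tag[]` as an opaque key name; the grouping happens only because the key repeats, which is exactly why bracket-notation behavior varies between implementations.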
Let's look at real-world examples of URLs as state containers.

PrismJS Configuration. The entire syntax highlighter configuration is encoded in the URL. Change anything in the UI, and the URL updates. Share the URL, and someone else gets your exact setup. This one uses the anchor and not query parameters, but the concept is the same.

GitHub Line Highlighting. A link can point to a specific file while highlighting lines 108 through 136. Click such a link anywhere, and you'll land on the exact code section being discussed.

Google Maps. Coordinates, zoom level, and map type all live in the URL. Share the link, and anyone can see the exact same view of the map.

Figma and Design Tools. Before shareable design links, finding an updated screen or component in a large file was a chore. Someone had to literally show you where it lived, scrolling and zooming across layers. Today, a Figma link carries all that context: canvas position, zoom level, selected element. Literally everything needed to drop you right into the workspace.

E-commerce Filters. This is one of the most common real-world patterns you'll encounter. Every filter, sort option, and price range is preserved. Users can bookmark their exact search criteria and return to them anytime. Most importantly, they can come back after navigating away or refreshing the page.

Before we discuss implementation details, we need to establish a clear guideline for what should go into the URL. Not all state belongs in URLs; the full lists of good and poor candidates appear at the end of this post, but the heuristic is simple. If you are not sure whether a piece of state belongs in the URL, ask yourself: if someone else clicked this URL, should they see the same state? If so, it belongs in the URL. If not, use a different state management approach.

The modern History API makes URL state management straightforward. The popstate event fires when the user navigates with the browser's Back or Forward buttons.
It lets you restore the UI to match the URL, which is essential for keeping your app's state and history in sync. Usually your framework's router handles this for you, but it's good to know how it works under the hood. React Router and Next.js provide hooks that make this even cleaner.

Now that we've seen how URLs can hold application state, let's look at a few best practices that keep them clean, predictable, and user-friendly. Don't pollute URLs with default values; use defaults in your code when reading parameters. For high-frequency updates (like search-as-you-type), debounce URL changes.

When deciding between pushState and replaceState, think about how you want the browser history to behave. pushState creates a new history entry, which makes sense for distinct navigation actions like changing filters, pagination, or navigating to a new view; users can then use the Back button to return to the previous state. On the other hand, replaceState updates the current entry without adding a new one, making it ideal for refinements such as search-as-you-type or minor UI adjustments where you don't want to flood the history with every keystroke.

When designed thoughtfully, URLs become more than just state containers. They become contracts between your application and its consumers. A good URL defines expectations for humans, developers, and machines alike. A well-structured URL draws the line between what's public and what's private, client and server, shareable and session-specific. It clarifies where state lives and how it should behave. Developers know what's safe to persist, users know what they can bookmark, and machines know what's worth indexing. URLs, in that sense, act as interfaces: visible, predictable, and stable.

Readable URLs explain themselves. Consider the difference between a URL built from opaque IDs and one built from meaningful segments: the first hides intent; the second tells a story. A human can read it and understand what they're looking at. A machine can parse it and extract meaningful structure.
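The "no default values in the URL" rule is easy to sketch outside the browser. Here's a small Python helper pair (the parameter names and defaults are invented for illustration) that serializes only non-default parameters and fills defaults back in on read:

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical defaults for a search page
DEFAULTS = {"sort": "relevance", "page": "1", "view": "grid"}

def build_query(**params):
    """Serialize only the parameters that differ from their defaults."""
    trimmed = {k: str(v) for k, v in params.items()
               if str(v) != DEFAULTS.get(k)}
    return urlencode(sorted(trimmed.items()))

def read_query(qs):
    """Read parameters back, falling back to defaults for missing keys."""
    parsed = {k: v[0] for k, v in parse_qs(qs).items()}
    return {**DEFAULTS, **parsed}
```

With this, `build_query(sort="relevance", page=1, view="list")` yields just `view=list`, and `read_query("view=list")` restores the full state; the URL stays short while the application always sees a complete set of parameters.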
Jim Nielsen calls these "examples of great URLs": URLs that explain themselves.

URLs are cache keys, and well-designed URLs enable better caching strategies. You can even visualize a user's journey without any extra tracking code. Your analytics tools can track this flow without additional instrumentation. Every URL parameter becomes a dimension you can analyze.

URLs can also communicate API versions, feature flags, and experiments. This makes gradual rollouts and backwards compatibility much more manageable.

Even with the best intentions, it's easy to misuse URL state. Here are common pitfalls.

The classic single-page app mistake: if your app forgets its state on refresh, you're breaking one of the web's fundamental features. Users expect URLs to preserve context. I remember a viral video from years ago where a Reddit user vented about an e-commerce site: every time she hit "Back," all her filters disappeared. Her frustration summed it up perfectly. If users lose context, they lose patience.

This one seems obvious, but it's worth repeating: URLs are logged everywhere: browser history, server logs, analytics, referrer headers. Treat them as public.

Choose parameter names that make sense. Future you (and your team) will thank you.

If you need to base64-encode a massive JSON object, the URL probably isn't the right place for that state. Browsers and servers impose practical limits on URL length (usually between 2,000 and 8,000 characters), but the reality is more nuanced. As a detailed Stack Overflow answer explains, limits come from a mix of browser behavior, server configurations, CDNs, and even search engine constraints. If you're bumping against them, it's a sign you need to rethink your approach.

Respect browser history. If a user action should be "undoable" via the Back button, use pushState. If it's a refinement, use replaceState.

That PrismJS URL reminded me of something important: good URLs don't just point to content.
They describe a conversation between the user and the application. They capture intent, preserve context, and enable sharing in ways that no other state management solution can match. We've built increasingly sophisticated state management libraries like Redux, MobX, Zustand, Recoil and others. They all have their place, but sometimes the best solution is the one that's been there all along.

In my previous article, I wrote about the hidden costs of bad URL design. Today, we've explored the flip side: the immense value of good URL design. URLs aren't just addresses. They're state containers, user interfaces, and contracts all rolled into one. If your app forgets its state when you hit refresh, you're missing one of the web's oldest and most elegant features.

What URLs give us for free:
- Shareability: send someone a link, and they see exactly what you see
- Bookmarkability: save a URL, and you've saved a moment in time
- Browser history: the back button just works
- Deep linking: jump directly into a specific application state

Where each part of the URL shines:
- Path segments: best used for hierarchical resource navigation, such as a user's posts, documentation structure, or application sections
- Query parameters: perfect for filters, options, and configuration, such as UI preferences, pagination, data filtering, and date ranges
- Anchor (#): ideal for client-side navigation and page sections, such as GitHub line highlighting, scrolling to a section, or single-page app routing (though it's rarely used these days)

Good candidates for URL state:
- Search queries and filters
- Pagination and sorting
- View modes (list/grid, dark/light)
- Date ranges and time periods
- Selected items or active tabs
- UI configuration that affects content
- Feature flags and A/B test variants

Poor candidates for URL state:
- Sensitive information (passwords, tokens, PII)
- Temporary UI states (modal open/closed, dropdown expanded)
- Form input in progress (unsaved changes)
- Extremely large or complex nested data
- High-frequency transient states (mouse position, scroll position)

URLs as cache keys:
- Same URL = same resource = cache hit
- Query params define cache variations
- CDNs can cache intelligently based on URL patterns

Raph Koster 1 month ago

Site updates

It’s been quite a while since the site was refreshed. I was forced into it by a PHP upgrade that rendered the old customizable theme I was using obsolete. We’re now running a new theme that has been styled to match the old one pretty closely, but I did go ahead and do some streamlining: far fewer plugins (especially ancient ones), simpler layout in several places, much better handling of responsive layouts for mobile, down to a single sidebar, and so on. All of this seems to have made the site quite a bit more performant, too. One of the big things that got fixed along the way is that images in galleries had a habit of displaying oddly stretched on Chrome and Edge, but not in Firefox. No idea what it was, but it seems to be fixed now. There are plenty of bits and bobs that still are not quite right. Keep an eye out and let me know if you see anything that looks egregiously wrong. Known issues: some of the lists of things, like presentations, essays, etc., are still funky. Breadcrumb styling seems to be inconsistent. The footer is a bit of a mess. If you do need to log in to comment, the Meta links are all in the footer for now. Virtually no one uses those links anymore, so having them up top didn’t seem to make sense… How things have changed! People tell me to move to Substack instead, but though I get the monetization factor, it rubs me wrong. I’d rather own my own site. Plus, it’s not like I am posting often enough to justify a ton of effort!

W. Jason Gilmore 2 months ago

Minimum Viable Expectations for Developers and AI

We're headed into the tail end of 2025 and I'm seeing a lot less FUD (fear, uncertainty, and doubt) amongst software developers when it comes to AI. As usual when it comes to adopting new software tools, I think a lot of the initial hesitancy had to do with everyone but the earliest adopters falling into three camps: don't, can't, and won't (spelled out at the end of this post). When it comes to AI adoption, I'm fortunately seeing the number of developers falling into these three camps continue to wane. This is good news, because it benefits both the companies they work for and the developers themselves. Companies benefit because AI coding tools, when used properly, unquestionably write better code faster for many (but not all) use cases. Developers benefit because they are freed from the drudgery of coding CRUD (create, retrieve, update, delete) interfaces and can instead focus on more interesting tasks.

Because this technology is so new, I'm not yet seeing a lot of guidance regarding setting employee expectations when it comes to AI usage within software teams. Frankly, I'm not even sure that most managers know what to expect. So I thought it might be useful to outline a few thoughts regarding MVEs (minimum viable expectations) when it comes to AI adoption.

Even if your developers refuse to use generative AI tools for large-scale feature implementation, the productivity gains to be had from simply adopting the intelligent code completion features are undeniable. A few seconds here and a few seconds there add up to hours, days, and weeks of time saved otherwise spent repeatedly typing for loops, commonplace code blocks, and the like. Agentic AIs like GitHub Copilot can be configured to perform automated code reviews on all or specific pull requests.
At Adalo we've been using Copilot in this capacity for a few months now, and while it hasn't identified any groundshaking issues, it certainly has helped to improve the code by pointing out subtle edge cases and syntax issues which could ultimately be problematic if left unaddressed.

In December 2024, Anthropic announced a new open standard called the Model Context Protocol (MCP), which you can think of as a USB-like interface for AI. This interface gives organizations the ability to plug both internal and third-party systems into AI, supplementing the knowledge already incorporated into the AI model. Since the announcement, MCP adoption has spread like wildfire, with MCP directories like https://mcp.so/ tracking more than 16,000 public MCP servers. Companies like GitHub and Stripe have launched MCP servers which let developers talk to these systems from inside their IDEs. In doing so, developers can, for instance, create, review, and ask AI to implement tickets without having to leave their IDE. As with the AI-first IDE's ability to perform intelligent code completion, reducing the number of steps a developer has to take to complete everyday tasks will in the long run result in significant amounts of time saved.

In my experience, test writing has ironically been one of AI's greatest strengths. SaaS products I've built such as https://securitybot.dev/ and https://6dollarcrm.com/ have far, far more test coverage than they would have ever had pre-AI. As of the time of this writing, SecurityBot.dev has more than 1,000 assertions spread across 244 tests. 6DollarCRM fares even better (although the code base is significantly larger), with 1,149 assertions spread across 346 tests. Models such as Claude 4 Sonnet and Opus 4.1 have been remarkably good test writers, and developers can further reinforce the importance of including tests alongside generated code within specifications.
AI coding tools such as Cursor and Claude Code tend to work much better when the programmer provides additional context to guide the AI. In fact, Anthropic places such emphasis on the importance of doing so that it appears first in its list of best practices. Anything deemed worth communicating to a new developer who has joined your team is worthy of inclusion in this context, including coding styles, useful shell commands, testing instructions, dependency requirements, and so forth. You'll also find publicly available coding guidelines for specific technology stacks. For instance, I've been using a set of Laravel coding guidelines for AI with great success.

The sky really is the limit when it comes to incorporating AI tools into developer workflows. Even though we're still in the very earliest stages of this technology's lifecycle, I'm both personally seeing enormous productivity gains in my own projects and greatly enjoying watching the teams I work with come around to their promise. I'd love to learn more about how you and your team are building processes around their usage. E-mail me at [email protected].

The three camps, for reference:
- Don't: developers don't understand the advantages for the simple reason that they haven't even given the new technology a fair shake.
- Can't: developers can't understand the advantages because they are not experienced enough to grasp the bigger picture when it comes to their role (problem solvers, not typists).
- Won't: developers won't understand the advantages because they refuse to, on the grounds that the new technology threatens their job or conflicts with their perception that modern tools interfere with their role as a "craftsman" (you should fire these developers).

iDiallo 2 months ago

The Modern Trap

Every problem, every limitation, every frustrating debug session seemed to have the same answer: use a modern solution. Modern encryption algorithms. Modern deployment pipelines. Modern database solutions. The word modern has become the cure-all, promising to solve not just our immediate problems, but somehow prevent future ones entirely.

I remember upgrading an app from PHP 5.3 to 7.1. It felt like it was cutting edge. But years later, 7.1 was also outdated. The application had a bug, and the immediate suggestion was to use a modern version of PHP to avoid this nonsense. But being stubborn, I dug deeper and found that the function I was using, though deprecated in newer versions, had had an alternative available since PHP 5.3. A quick fix prevented months of work rewriting our application.

The word "modern" doesn't mean what we think it means. Modern encryption algorithms are secure. Modern banking is safe. Modern frameworks are robust. Modern infrastructure is reliable. We read statements like these every day in tech blogs, marketing copy, and casual Slack conversations. But if we pause for just a second, we realize they are utterly meaningless. The word "modern" is a temporal label, not a quality certificate. It tells us when something was made, not how well it was made. Everything made today is, by definition, modern. But let's remember: MD5 was once the modern cryptographic hash. Adobe Flash was the modern way to deliver rich web content. Internet Explorer 6 was a modern browser. The Ford Pinto was a modern car. "Modern" is a snapshot in time, and time has a cruel way of revealing the flaws that our initial enthusiasm blinded us to.

Why do we fall for this? "Modern" is psychologically tied to "progress." We're hardwired to believe the new thing solves the problems of the old thing. And sometimes, it does! But this creates a dangerous illusion: that newness itself is the solution.
I've watched teams chase the modern framework because the last one had limitations, not realizing they were trading known bugs for unknown ones. I've seen companies implement modern SaaS platforms to replace "legacy" systems, only to create new single points of failure and fresh sets of subscription fees. We become so busy fleeing the ghosts of past failures that we don't look critically at the path we're actually on. "Modern" is often just "unproven" wearing a better suit. I've embraced modern before, being on the very edge of technology. But that meant I had to keep up to date with the tools I use. Developers spend more time learning new frameworks than mastering existing ones, not because the new tools are objectively better, but because they're newer, and thus perceived as better. We sacrifice stability and deep expertise at the altar of novelty. That modern library you imported last week? It's sleek, it's fast, it has great documentation and a beautiful logo. It also has a critical zero-day vulnerability that won't be discovered until next year, or a breaking API change coming in the next major version. "Legacy" codebases have their problems, but they often have the supreme advantage of having already been battle-tested. Their bugs are known, documented, and patched. In the rush to modernize, we discard systems that are stable, efficient, and perfectly suited to their task. I've seen reliable jQuery implementations replaced by over-engineered React applications that do the same job worse, with more overhead and complexity. The goal becomes "be modern" instead of "be effective." But this illusion of "modern" doesn't just lead us toward bad choices; it can bring progress to a halt entirely. When we sanctify something as "modern," we subtly suggest we've arrived at the final answer. Think about modern medicine. While medical advances are remarkable, embedded in that phrase is a dangerous connotation: that we've reached the complete, final word on human health. 
This framing can make it difficult to question established practices or explore alternative approaches. Modern medicine didn't think it was important for doctors to wash their hands. The same happens in software development. When we declare a framework or architectural pattern "modern," we leave little room for the "next." We forget that today's groundbreaking solution is merely tomorrow's foundation, or tomorrow's technical debt. Instead of modern, I prefer the terms "robust" or "stable." The most modern thing you can do is to look at any solution and ask: "How will this look obsolete in ten years?" Because everything we call "modern" today will eventually be someone else's legacy system. And that's not a bug, it's a feature. It's how progress actually works.

iDiallo 2 months ago

You are not going to turn into Google eventually

A few years back, I was running a CI/CD pipeline from a codebase that just kept failing. It pulled the code successfully, it passed the tests, the docker image was built, but then it would fail. Each run took around 15 minutes to fail, meaning I had to wait at least 15 minutes after every change before I knew whether it was successful or not. Of course, it failed multiple times before I figured out a solution. When I was done, I wasn't frustrated with the small mistake I had made; I was frustrated by the time it took to get any sort of feedback.

The codebase itself was trivial. It was a microservice with a handful of endpoints that was only occasionally used. The amount of time it took to build was not proportional to the importance of the service. Well, it took so long to build because of dependencies. Not the dependencies it actually used, but the dependencies it might use one day. The ones required because the entire build system was engineered for a fantasy future where every service, no matter how small, had to be pre-optimized to handle millions of users.

This is the direct cost of building for a scale you will never reach. It's the architectural version of buying a Formula 1 car to do your grocery shopping. It's not just overkill; it actively makes the simple task harder, slower, and infinitely more frustrating. We operate under a dangerous assumption that our companies are inevitably on a path to become the next Google or Meta. So we build like they do, grafting their solutions onto our problems, hoping it will future-proof us. It won't. It just present-proofs us. It saddles us with complexity where none is needed, creating a drag that actually prevents the growth we're trying to engineer for.

Here is why I like microservices. The concept is beautiful. Isolate a single task into a discrete, independent service. It's the Unix philosophy applied to the web: do one thing and do it well.
When a problem occurs, you should, in theory, be able to pinpoint the exact failing service, fix it, and deploy it without disrupting the rest of your application. If this sounds exactly like how a simple PHP include or a modular library works… you're exactly right.

And here is why I hate them. In practice, without Google-scale resources, microservices often create the very problems they promise to solve. You don't end up with a few neat services; you end up with hundreds of them. You're not in charge of maintaining all of them, and neither is anyone else. Suddenly, "pinpointing the error" is no longer a simple task. It's a pilgrimage. You journey through logging systems, trace IDs, and distributed dashboards, hoping for an epiphany. You often return a changed man: older, wiser, and empty-handed.

This is not to say you should avoid microservices at all costs; it's to say you should focus on the problems you have at hand instead of writing code for a future that may never come. Don't architect for a hypothetical future of billions of users. Architect for the reality of your talented small team. Build something simple, robust, and effective. Grow first, then add complexity only where and when it is absolutely necessary.

When you're small, your greatest asset is agility. You can adapt quickly, pivot on a dime, and iterate rapidly. Excessive process stifles this inherent flexibility. It introduces bureaucracy, slows down decision-making, and creates unnecessary friction. Instead of adopting the heavy, restrictive frameworks of large enterprises, small teams should embrace a more ad-hoc, organic approach. Focus on clear communication, shared understanding, and direct collaboration. Let your processes evolve naturally as your team and challenges grow, rather than forcing a square peg into a round hole.

0 views
Karboosx 2 months ago

In-house parsers are easy!

Ever wanted to build your own programming language? It sounds like a huge project, but I'll show you it's not as hard as you think. In this post, we'll build one from scratch, step-by-step, covering everything from the Tokenizer and Parser to a working Interpreter, with all the code in clear PHP examples.
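The pipeline the post promises (Tokenizer → Parser → Interpreter) starts with the easy part. As a taste, here is a minimal regex-based tokenizer sketch in PHP; the token names and grammar are my own illustration, not the post's actual code:

```php
<?php
// Minimal illustrative tokenizer: splits "1 + 2 * 3" into typed tokens.
// The token set here is hypothetical; a real language needs many more.
function tokenize(string $source): array {
    $spec = [
        'NUMBER' => '/^\d+/',
        'PLUS'   => '/^\+/',
        'STAR'   => '/^\*/',
        'SPACE'  => '/^\s+/',
    ];
    $tokens = [];
    while ($source !== '') {
        foreach ($spec as $type => $pattern) {
            if (preg_match($pattern, $source, $m)) {
                if ($type !== 'SPACE') {          // skip whitespace tokens
                    $tokens[] = [$type, $m[0]];
                }
                $source = substr($source, strlen($m[0]));
                continue 2;                       // restart the scan loop
            }
        }
        throw new Exception("Unexpected character: {$source[0]}");
    }
    return $tokens;
}

print_r(tokenize('1 + 2 * 3'));
```

A parser then consumes this flat token array and builds a tree, which the interpreter walks; the hard part is precedence, not tokenizing.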

0 views
Grumpy Gamer 2 months ago

Comments are back

“But? Wait?” I can hear you saying, “Isn’t grumpygamer.com a static site built by Hugo? What dark magic did you use to get comments on the static site?” No dark magic. But it does involve a small php script. You can embed php in a hugo page and since grumpygamer.com is hosted on my server and it’s running php it wasn’t that hard. No tricky javascript and since it’s all hosted by me, no privacy issues. All your comments stay on my server and don’t feed Big Comment. Comments are stored in flat files so no pesky SQL databases. It only took me about a day, so all in all not bad. I may regret this. I’m only turning on comments for future posts. P.S. I will post the code and a small guide in a few days, so you too can invite the masses to critique and criticize your every word. Good times.
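Flat-file comment storage really can be a small script. A sketch of the idea (file layout and field names are my invention, not the actual code, which the author says he'll publish later):

```php
<?php
// Append a comment to a flat file, one JSON object per line.
// The path and field names are illustrative assumptions.
$file = __DIR__ . '/comments/post-slug.txt';
if (!is_dir(dirname($file))) {
    mkdir(dirname($file), 0755, true);
}
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $comment = [
        'name' => strip_tags($_POST['name'] ?? 'anonymous'),
        'body' => strip_tags($_POST['body'] ?? ''),
        'time' => time(),
    ];
    // LOCK_EX keeps two simultaneous comments from interleaving.
    file_put_contents($file, json_encode($comment) . "\n", FILE_APPEND | LOCK_EX);
}
// Rendering is just as direct: read, decode, print.
foreach (is_file($file) ? file($file, FILE_IGNORE_NEW_LINES) : [] as $line) {
    $c = json_decode($line, true);
    printf("<p><b>%s</b>: %s</p>\n",
        htmlspecialchars($c['name']), htmlspecialchars($c['body']));
}
```

No database, no JavaScript, and the comments live next to the site files where they can be backed up like everything else.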

0 views
Evan Hahn 3 months ago

Notes from August 2025

Things I published and things I saw this August. See also: my notes from last month , which has links to all the previous months so far. Most of my work this month was on private stuff, like some contracting and a demo app for a small social group. But I published a few little things: Over on Zelda Dungeon, I wrote a big guide showing how to play every Zelda in 2025 and a deranged post about my favorite Ocarina of Time item . Got invited to speak at Longhorn PHP in October , giving a version of a Unicode talk I’ve given before . Spent some time prepping that. Speaking of Unicode, I added a new script, , to my dotfiles . Now I can run to see . Thanks to Python’s library for making it easy! I also wrote a quick script, , to convert CSV files to Markdown tables. Hopefully I’ll have more blog posts in September! I started seeding some torrents of censored US government data . Cool project. From “The Militarization of Silicon Valley” : “In a major shift, Google, OpenAI, Meta and venture capitalists—many of whom had once forsworn involvement in war—have embraced the military industrial complex.” For more, see this investigation , this Google policy update from February , or even the story of the invention of the internet . “Instead of building our own clouds, I want us to own the cloud. Keep all of the great parts about this feat of technical infrastructure, but put it in the hands of the people rather than corporations. I’m talking publicly funded, accessible, at cost cloud-services.” Via “The Future is NOT Self-Hosted” . “Today, people find it easier to imagine that we can build intelligence on silicon than we can do democracy at scale, or that we can escape arms races. It’s complete bullshit. Of course we can do democracy at scale. We’re a naturally social, altruistic, democratic species and we all have an anti-dominance intuition. This is what we’re built for.” A positive quote from an otherwise worrying article about societal collapse . 
“As is true with a good many tech companies, especially the giants, in the AI age, OpenAI’s products are no longer primarily aimed at consumers but at investors.” From the great Blood in the Machine newsletter . “Slide 1: car brands using curl. Slide 2: car brands sponsoring or paying for curl support”. 38 car brands are listed on the first slide, zero on the second. “How to not build the Torment Nexus” describes how I feel about the tech industry. If you work at Meta or Palantir, the most ethical thing to do is quit. I finished The Interesting Narrative of the Life of Olaudah Equiano this month. Honestly, I picked it up because it’s free in the public domain. I liked the writing style of a book published in the late 1700s, and the rambling accounts of day-to-day life. Who’s writing sentences like this nowadays: “Hitherto I had thought only slavery dreadful; but the state of a free negro appeared to me now equally so at least, and in some respects even worse, for they live in constant alarm for their liberty; and even this is but nominal, for they are universally insulted and plundered without the possibility of redress; for such is the equity of the West Indian laws, that no free negro’s evidence will be admitted in their courts of justice.” Mina the Hollower , a Zelda -inspired game by the developers of Shovel Knight , released a demo this month. I loved it! You can read my experience with the game over at Zelda Dungeon . Tatami is a casual iOS game mixing Sudoku and nonograms. Enjoyed this too. Hope you had a good August.

0 views
Brain Baking 3 months ago

Indispensable Cloud It Yourself Software: 2025 Edition

It’s been too long since this blog published a meaningless click-bait list article, so here you go. Instead of simply enumerating frequently used apps such as app defaults from late 2023 , I thought it might be fun to zoom in on the popular self-hosted branch and summarize what we are running to be able to say Fuck You to the Big Guys. Below is a list of software that we depend on, categorized by usage. I’m sure you can figure out for yourself how to run these in a container on your NAS. We still have a Synology, and while I strongly dislike the custom Linux distribution’s tendency to misplace configuration files, the DSM software that comes with it is good enough to cover a lot of topics. The list excludes typical Linux sysadmin stuff such as fail2ban, ssh key setup, Samba, … Photos : PhotoPrism . It comes with WebDAV support to easily sync photos from your phone. My wife’s iPhone uses PhotoSync which works flawlessly. I’d rather also use SyncThing on iOS like I do on Android (or the latest Android client fork). SyncThing is amazing and I use it for much more than photo syncing. Streaming videos : Synology’s built-in Video Station . It’s got a lot of flaws and Jellyfin is clearly the better choice here. As for how to get the videos on there: rip & tear using good old DVDShrink on the WinXP Retro Machine! We still use the old DS Video Android app on our smart box to connect to the NAS as we don’t have a smart TV. Streaming music : Navidrome —see How To Stream Your Own Music: Reprise for more info on which clients we use and why caching some albums locally is good enough. As for how to get the music on there: rip & tear using A Better CD Encoder , or, for Win98 lovers, WinGrab. Backups : Restic —see Verify Your Backup Strategy to see how this automatically works from multiple machines. Smart Home : Home Assistant with a HomeWizard P1 meter that monitors our gas/electricity usage locally instead of sending it god knows where. 
We only use the bare minimum features; I’m not a big Smart Home fan. I suppose WireGuard should also be in this category but for now I refuse to enable the possibility to dial home . Ads/DHCP : Pi-Hole . That wonderful piece of software blocks almost 15% of our daily traffic—see Six Months With A Pi-Hole . We also use it as a DHCP server to have more control over DNS. Wi-Fi : TP-Link Omada Controller that provisions and secures our access points locally instead of poking through the real cloud for no reason at all. Git : Gitea although I should really migrate to Forgejo. The NAS hosts my private projects; I have another instance on the VPS for public ones. RSS : FreshRSS . Until recently, just NetNewsWire as an RSS client did just fine, but I sometimes caught myself doomscrolling on my phone, so I figured I’d instead scroll on other people’s blogs. NetNewsWire supports it so my reading behaviour doesn’t change on the computer. Pair it with Readrops on Android, which also caches entries, so if I’m disconnected from the home network I can still read interesting stuff. I do not see the appeal of cloud-based office software so I simply rely on LibreOffice to do its thing locally—no need for NextCloud, but it’s there if you want it. Speaking of which, I still use DEVONthink and of course Obsidian to manage my personal files/databases that hook into the above using SyncThing and Restic. Abandoned software: RSS Bridge (no longer needed), Watchtower (too complex for my simple setup), some kind of PHP-based accounting software I already forgot about. Software running publicly on the VPS: Radicale CardDAV/CalDAV server (I want this to be accessible outside of the NAS), another Gitea instance, Nginx (I really need to migrate to Caddy) et al. Related topics: / self-hosted / NAS / lists / By Wouter Groeneveld on 21 August 2025.  Reply via email .
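The post leaves the "run these in a container" part as an exercise. For one of the listed services (FreshRSS), a minimal sketch might look like this; the port mapping, volume path, and timezone are my assumptions, not the author's setup:

```yaml
# Illustrative docker-compose fragment for self-hosting FreshRSS.
# Port, volume path, and TZ are assumptions; adjust for your NAS.
services:
  freshrss:
    image: freshrss/freshrss:latest
    ports:
      - "8080:80"          # web UI on http://nas:8080
    volumes:
      - ./freshrss/data:/var/www/FreshRSS/data
    environment:
      - TZ=Europe/Brussels
    restart: unless-stopped
```

The same pattern (image, one or two volumes for state, restart policy) covers most of the services in this list, which is much of the appeal of the self-hosted approach.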

0 views
James Stanley 5 months ago

The Story of Max, a Real Programmer

This is a story about Imagebin. Imagebin is the longest-lived software project that I still maintain. I'm the only user. I use it to host images, mainly to include in my blog, sometimes for sharing in other places. Imagebin's oldest changelog entry is dated May 2010, but I know it had already existed for about a year before I had the idea of keeping a changelog. Here's an image hosted by Imagebin: For years Imagebin was wide open to the public and anybody could upload their own images to it. Almost nobody did. But a couple of years ago I finally put a password on it during a paranoid spell. But actually this is not a story about Imagebin. This is a story about the boy genius who wrote it, and the ways of his craft. Lest a whole new generation of programmers grow up in ignorance of this glorious past, I feel duty-bound to describe, as best I can through the generation gap, how a Real Programmer wrote code. I'll call him Max, because that was his name. Max was a school friend of mine. He didn't like to use his surname on the internet, because that was the style at the time, so I won't share his surname. Max disappeared from our lives shortly after he went to university. We think he probably got recruited by MI6 or something. This weekend I set about rewriting Max's Imagebin in Go so that my server wouldn't have to run PHP any more. And so that I could rid myself of all the distasteful shit you find in PHP written by children 15+ years ago. I don't remember exactly what provoked him to write Imagebin, and I'm not even certain that "Imagebin" is what he called it. That might just be what I called it. I was struck by how much better Max's code is than mine! For all its flaws, Max's code is simple . It just does what it needs to do and gets out of the way. Max's Imagebin is a single 233-line PHP script, interleaving code and HTML, of which the first 48 lines are a changelog of what I have done to it since inheriting it from Max. So call it 185 lines of code. 
At school, Max used to carry around a HP 620LX in his blazer pocket. Remember this was a time before anyone had seen a smartphone. Sure, you had PDAs, but they sucked because they didn't have real keyboards. The HP 620LX was a palmtop computer , the height of coolness. My Go version is 205 lines of code plus a 100-line HTML template, which is stored in an entire separate file. So call it 305 lines plus a complexity penalty for the extra file. And my version is no better! And my version requires a build step, and you need to leave the daemon running. With Max's version you just stick the PHP file on the server and it runs whenever the web server asks it to. And btw this is my third attempt at doing this in Go. I had to keep making a conscious effort not to make it even more complicated than this. And some part of me doesn't even understand why my Go version is so much bigger. None of it looks extraneous. It has a few quality-of-life features, like automatically creating the directories if they don't already exist, and supporting multiple files in a single upload, but nothing that should make it twice as big. Are our tools just worse now? Was early 2000s PHP actually good? While I was writing this, I noticed something else: Max's code doesn't define any functions! It's just a single straight line. Upload handling, HTML header, thumbnail code, HTML footer. When you put it like that, it's kind of surprising that it's so large. It hardly does anything at all! Max didn't need a templating engine, he just wrote HTML and put his code inside <?php tags. Max didn't need a request router, he just put his PHP file at the right place on the disk. Max didn't need a request object, he just got query parameters out of $_GET . Max didn't need a response writer, he just printed to stdout . And Max didn't need version control, he just copied the file to index.php.bak if he was worried he might want to revert his changes. 
You might think that Max's practices make for a maintenance nightmare. But I've been "maintaining" it for the last 15 years and I haven't found it to be a nightmare. It's so simple that nothing goes wrong. I expect I'd have much more trouble getting my Go code to keep running for the next 15 years. And yeah we all scoff at these stupid outdated practices, but what's our answer? We make a grand effort to write a simpler, better, modern replacement, and it ends up twice as complicated and worse? The reason the Go code is so much bigger is that it checks and (kind of) handles errors everywhere (?) they could occur. The PHP code just ignores them and flies on through regardless. But even if you get rid of checking for the more unlikely error cases, the Go version is longer. It's longer because it's structured across several functions, and with a separate template. The Go version is Designed . It's Engineered . But it's not better . I think there are lessons to (re-)learn from Max's style. You don't always have to make everything into a big structure with lots of moving parts. Sometimes you're allowed to just write simple straight-line code. Sometimes that's fine. Sometimes that's better. Longevity doesn't always come from engineering sophistication. Just as often, longevity comes from simplicity . To be perfectly honest, as a teenager I never thought Max was all that great at programming. I thought his style was overly-simplistic. I thought he just didn't know any better. But 15 years on, I now see that the simplicity that I dismissed as naive was actually what made his code great. Whether that simplicity came from wisdom or from naivety doesn't matter. The result speaks for itself. So I'm not going to bother running my Go version of the Imagebin. I'm going to leave Max's code in place, and I'm going to let my server keep running PHP. And I think that's how it should be. I didn't feel comfortable hacking up the code of a Real Programmer.
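The straight-line style described above (no functions, no router, no templates: upload handling, then HTML header, body, footer) can be sketched like this. To be clear, this is my illustration of the style, not Max's actual code:

```php
<?php
// Straight-line style: one file, top to bottom, no functions.
// Directory name and form fields are illustrative, not Max's.
if (!empty($_FILES['image']) && $_FILES['image']['error'] === UPLOAD_ERR_OK) {
    $name = basename($_FILES['image']['name']);   // strip any path tricks
    move_uploaded_file($_FILES['image']['tmp_name'], __DIR__ . "/images/$name");
}
?>
<html><body>
<form method="post" enctype="multipart/form-data">
  <input type="file" name="image"> <input type="submit" value="Upload">
</form>
<?php foreach (glob(__DIR__ . '/images/*') as $img): ?>
  <img src="images/<?php echo htmlspecialchars(basename($img)); ?>" width="200">
<?php endforeach; ?>
</body></html>
```

Drop the file on a PHP-enabled server and it runs; there is no build, no daemon, and nothing between the request and the code that handles it.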

0 views
James Stanley 1 years ago

Prompts as source code: a vision for the future of programming

I'm going to present a vision for the future of programming in which programmers don't work with source code any more. The idea is that prompts will be to source code as source code is to binaries. In the beginning (I claim) there were only binaries, and without loss of generality, assembly language. (If you think binaries and assembly language are too far apart to lump together: keep up grandad, you're thinking too low-level; just wait until the further future where source code and binaries are too close together to distinguish!). Then somebody invented the compiler . And now it was possible to write code in a more natural language and have the machine automatically turn it into binaries! And we saw that it was good. As hardware resources grew, the compilers' capabilities grew, and now the idea that there was programming before compilers is pretty weird to new developers. Almost no one is writing assembly language and even fewer write bare machine code. Now take LLMs. If you create software using an LLM today, you probably give an initial prompt to get started, and then you refine the generated source code by giving follow-up prompts to ask for changes, and you never revisit your initial prompt. It's just a series of "patches" created by follow-up prompts. This is like programming by writing source code once, compiling it, and then throwing the source code away and working directly on the binary with incremental patches! Which is just obviously crazy. So here's my outline for "prompts as source code": The prompts will be committed to git, the generated source code will not. The prompts will be big, and split across multiple files just like source code is now, except it's all freeform text. We just give the LLM a directory tree full of text files and ask it to write the program. The prompts will be unimaginably large by today's standards. 
Compare the size of the Linux or Firefox source trees to the total amount of machine code that had ever been written in the entire history of the world before the first compiler was invented. (To spell it out: the future will contain LLM prompts that are larger than all of the source code that humanity combined has ever written in total up to this point in time.) Our build system will say which exact version of the LLM you're using, and it will be evaluated deterministically so that everybody gets the same output from the same prompt (reproducible builds). The LLMs will be bigger than they are today, have larger context windows, etc., and as the LLMs improve, and our understanding of how to work with them improves, we'll gain confidence that small changes to the prompt have correspondingly small changes in the resulting program. It basically turns into writing a natural language specification for the application, but the specification is version-controlled and deterministically turns into the actual application. Human beings will only look at the generated source code in rare cases (how often do you look at assembly code today?). Normally they'll just use their tooling to automatically build and run the application directly from the prompts. You'll be able to include inline code snippets in the prompt, of course. That's a bit like including inline assembly language in your source code. And you could imagine the tooling could let you include some literal code files that the LLM won't touch, but will be aware of, and will be included verbatim in the output. That's a bit like linking with precompiled object files. Once you have a first version that you like, there could be a "backwards pass" where an LLM looks at the generated source code and fills in all the gaps in the specification to clarify the details, so that if you then make a small change to the prompt you're more likely to get only a small change in the program. 
You could imagine the tooling automatically running the backwards pass every time you build it, so that you can see in your prompts exactly what assumptions you're baking in. That's my vision for the future of programming. Basically everything that today interacts with source code and/or binaries, we shift one level up so that it interacts with prompts and/or source code. What do you think? Although we could make an initial stab at the tooling today, I feel like current LLMs aren't quite up to the job. First, context windows are too small for all but toy applications (OK, you might fit your spec in the context window, but you also want the LLM to do some chain-of-thought before it starts writing code). Second, as far as I know, it's not possible to run the best LLMs (Claude, gpt4o) deterministically, and even if it were, they are cloud-hosted and proprietary, which is an extremely shaky foundation for a new system of programming. You could use Llama 405b, but GPUs are too expensive and too slow. Third, we'd need the LLMs to be extraordinarily intelligent and able to follow every tiny detail in the prompt, in order for small changes to the prompt not to result in random bugs getting switched on/off, the UI randomly changing, file formats randomly changing, etc. Finally, I haven't quite figured out how you "update" the LLM without breaking your program; you wouldn't want to be stuck on the same version forever. This feels similar to, but harder than, the problem of switching PHP versions, for example.

0 views
W. Jason Gilmore 1 years ago

Technical Due Diligence - Relational Databases

Despite the relative popularity of NoSQL and graph databases, relational databases like MySQL, SQL Server, Oracle, and PostgreSQL continue to be indispensable for storing and managing software application data. Because of this, technical due diligence teams are practically guaranteed to encounter them within almost any project. Novice team members will gravitate towards understanding the schema, which is of course important but only paints a small part of the overall risk picture. A complete research and risk assessment will additionally include information about the following database characteristics: I identify these three characteristics because technical due diligence is all about identifying and quantifying risk, and not about nerding out over the merit of past decisions. Quantifying risk is rarely more important than when evaluating the software product's data store, for several reasons: Be sure to confirm all database licenses are in compliance with the company's use case, and if the database is commercially licensed you'll need to additionally confirm the available features and support contract are in line with expectations. To highlight the importance of this verification work I'll point out a few ways in which expectations might not be met: All mainstream databases (MySQL, Oracle, PostgreSQL, etc.) will have well-defined end-of-life (EOL) dates associated with each release. The EOL date identifies the last date on which that particular version will receive security patches. Therefore it is critical to determine what database versions are running in production in order to determine whether the database has potentially been running in an unpatched state. For instance MySQL 5.7 has an EOL date of October 25, 2023, and therefore if the seller's product is still running MySQL 5.7 after that date then it is in danger of falling prey to any vulnerabilities identified after that EOL date. 
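Determining the running version is usually a one-line check. A sketch, assuming shell access to the database host (hostname and user are placeholders):

```shell
# Ask the server itself for its version (credentials are placeholders).
mysql -h db.internal -u audit -p -e "SELECT VERSION();"

# PostgreSQL equivalent:
psql -h db.internal -U audit -c "SELECT version();"
```

Compare the reported version against the vendor's published EOL schedule; a version past EOL, or far behind the current patch release, is an immediate diligence finding.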
Of course, the EOL date isn't the only issue at play here. If the database version hasn't reached its EOL date then you should still determine whether the database has been patched appropriately. For instance, as of the time of this writing MySQL 8.2 was released only 9 months ago (on October 12, 2023) and there are already 11 known vulnerabilities . It's entirely possible that none of these vulnerabilities are exploitable in the context of the seller's product; however, it's nonetheless important to catalog these possibilities and supply this information to the buyer. In my experience, where there is smoke there is fire, and unpatched software is often symptomatic of much larger issues associated with technical debt and a lack of developer discipline. Enterprise web applications will typically run in an N-Tier architecture, meaning the web, data, caching, and job processing components can all be separately managed and scaled. This configuration means each tier will often run on separate servers, and therefore a network connection between the database and web application tiers will need to be configured. Most databases can be configured to allow for connections from anywhere (almost invariably a bad idea), which is precisely what you don't want to see when that database is only intended to be used by the web application, because it means malicious third parties have a shot at successfully logging in should they gain access to or guess the credentials. Connecting users will be associated with a set of privileges which define what the user can do once connected to the database. It is considered best practice to assign those users the minimum privileges required to carry out their tasks. 
Therefore a database user which is intended to furnish information to a data visualization dashboard should be configured with read-only privileges, whereas a customer relationship management (CRM) application would require a database user possessing CRUD (create, retrieve, update, delete) database privileges. Therefore when examining database connectivity and privileges you should at a minimum answer the following questions: What users are defined and active on the production databases, and from what IP addresses / hostnames are they accessible? Is the database server accessible to the wider internet and if so, why? What privileges do the defined database users possess, and why? To what corporate applications are production databases connected? This includes the customer-facing application, business intelligence software, backup services, and so forth. What other non-production databases exist? Where is production data replicated? Are these destinations located within jurisdictions compliant with the laws and SLA under which the buyer's target IP operates? Satisfying this review requirement is relatively straightforward. The stakes are high for several reasons. Poor security practices open up the possibility of application data having already been stolen, or being in danger of imminent theft, placing the buyer in legal danger. Poor performance due to inadequate or incorrect indexing, insufficient resourcing, or a combination of the two might result in disgruntled customers who are considering cancelling their subscriptions. Some of these customers may be major contributors to company revenue, severely damaging the company's outlook should they wind up departing following acquisition. A lack of disaster recovery planning puts the buyer at greater short-term risk following acquisition due to an outage which may occur precisely at a time when personnel are not available or are not entirely up to speed. Expectations can also fall short in subtler ways. The buyer may require the data to be encrypted at-rest due to regulatory issues, while the product data is in fact not encrypted at-rest due to use of the Heroku Essential Postgres tier, which does not offer this feature. There could be an easy fix here which involves simply upgrading to a tier which does support encryption-at-rest; however, you should receive vendor confirmation (in writing) that encryption is indeed possible as a result of upgrading, and whether any downtime will be required to achieve this. Or the buyer's downtime expectations may be stricter than what is defined by the cloud service provider's SLA. From a security standpoint, data is often defined as being encrypted at-rest and in-transit , the former referring to its encryption state when residing in the database or on server, and the latter referring to its encryption state when being transferred from the application to the requesting user or service. You'll want to determine whether these two best practices are implemented. If the data is not encrypted at-rest (which is typical and not necessarily an issue for many use cases), then how is sensitive data like passwords encrypted (or hashed)? You often won't be able to determine this by looking at the database itself; web frameworks will typically dictate the password hashing scheme, such as Laravel's use of the Bcrypt algorithm for this purpose.
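The Laravel/Bcrypt point generalizes: a reviewer can often recognize the hashing scheme from the stored values themselves. In plain PHP the same hashing is available via password_hash; a small sketch (my example, not from the post):

```php
<?php
// Bcrypt hashes produced by PHP begin with $2y$; this is what
// frameworks like Laravel generate under the hood.
$hash = password_hash('s3cret', PASSWORD_BCRYPT);
echo $hash, "\n";   // e.g. $2y$10$... (salt and cost embedded in the hash)

// Verification never compares strings directly:
var_dump(password_verify('s3cret', $hash));   // bool(true)
```

During diligence, a column full of $2y$... values suggests bcrypt; fixed-length hex strings suggest the far weaker MD5/SHA-1, which is itself a finding.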

0 views
dfir.ch 1 years ago

From Dangerous PHP Functions to Webshell Hunting

This blog post discusses how to enhance PHP security using the disable_functions directive, which prevents specific PHP functions from being executed. We further explore webshell detection techniques, highlighting the challenges of identifying webshells using Yara rules, proposing alternatives like manual analysis, frequency analysis of web server logs, and utilizing tools like Velociraptor and UAC along the way.

Introduction

The disable_functions directive in PHP is a security feature that allows administrators to disable specific PHP functions from being executed within PHP scripts.
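A typical hardening entry in php.ini might look like the following. The exact function list must be tailored to the application (some legitimately need exec); this is an illustrative baseline, not the post's specific recommendation:

```ini
; php.ini -- block the functions most commonly abused by webshells.
; disable_functions takes a comma-separated list and cannot be
; re-enabled at runtime from within a script.
disable_functions = exec,passthru,shell_exec,system,proc_open,popen,pcntl_exec
```

After changing the directive, restart PHP-FPM (or the web server) and confirm with a test script that the functions now raise a warning instead of executing.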

0 views
W. Jason Gilmore 1 years ago

Minimal SaaS Technical Due Diligence

For more than six years now I've been deeply involved in and in recent years leading Xenon Partners ' technical due diligence practice. This means that when we issue an LOI (Letter of Intent) to acquire a company, it's my responsibility to dig deep, very deep, into the often arcane technical details associated with the seller's SaaS product. Over this period I've either been materially involved in or led technical due diligence for DreamFactory , Baremetrics , Treehouse , Packagecloud , Appsembler , UXPin , Flightpath Finance , as well as several other companies. While I've perhaps not seen it all, I've seen a lot, and these days whenever SaaS M&A comes up in conversation, I tend to assume the thousand-yard stare , because this stuff is hard . The uninitiated might be under the impression that SaaS technical due diligence involves "understanding the code". In reality, the code review is but one of many activities that must be completed, and in the grand scheme of things I wouldn't even put it in the top three tasks in terms of priority. Further complicating the situation is the fact that sometimes due to circumstances beyond our control we need to close a deal under unusually tight deadlines, meaning it is critically important that this process is carried out with extreme efficiency. Due to the growing prevalence of SaaS acquisition marketplaces like Acquire.com and Microassets , lately I've been wondering what advice I would impart to somebody who wants to acquire a SaaS company yet who possesses relatively little time, resources, and money. What would be the absolute minimum requirements necessary to reduce acquisition risk to an acceptable level? This is a really interesting question, and I suppose I'd focus on the following tasks. Keep in mind this list is specific to the technology side of due diligence; there are also financial, operational, marketing, legal, and HR considerations that also need to be addressed during this critical period. 
I am not a lawyer, nor an accountant, and therefore do not construe anything I say on this blog as being sound advice. Further, in this post I'm focused on minimal technical due diligence, and largely assuming you're reading this because you're interested in purchasing a micro-SaaS or otherwise one run by an extraordinarily small team. For larger due diligence projects there are plenty of other critical tasks to consider, including technical team interviews. Perhaps I'll touch upon these topics in future posts.

Please note I did not suggest asking for architectural diagrams. Of course you should ask for them, but you should not believe a single thing you see on the off chance they even exist. They'll tell you they do exist, but they likely do not. If they do exist, they are almost certainly outdated or entirely wrong. But I digress.

On your very first scheduled technical call, open a diagramming tool like Draw.io and ask the seller's technical representative to please begin describing the product's architecture. If they clam up or are unwilling to do so (it happens), then start drawing what you believe to be true, because when you incorrectly draw or label part of the infrastructure, the technical representative will suddenly become very compelled to speak up and correct you. These diagrams don't have to be particularly organized nor aesthetically pleasing; they just need to graphically convey as much information as possible about the application, infrastructure, third-party services, and anything else of relevance. Here's an example diagram I created on Draw.io for the purposes of this post:

Don't limit yourself to creating a single diagram!
I suggest additionally creating diagrams for the following:

We have very few requirements that, if not met, will wind up in a deal getting delayed or even torpedoed; however, one of them is that somebody on our team must successfully build the development environment on their local laptop and subsequently successfully deploy to production. This is so important that we will not acquire the company until these two steps are completed. These steps are critical because in completing them you confirm: Keep in mind you don't need to add a new feature or fix a bug in order to complete this task (although either would be a bonus). You could do something as simple as adding a comment or fixing a typo. At this phase of the acquisition process you should steadfastly remain in "do no harm" mode: you are only trying to confirm your ability to successfully deploy the code, not make radical improvements to it.

This isn't strictly a technical task, but it's so important that I'm going to color outside the lines and at least mention it here. The software product you are considering purchasing is almost unquestionably built atop the very same enormous open source software (OSS) ecosystem from which our entire world has benefited. There is nothing at all wrong with this, and in fact I'd be concerned if it wasn't the case; however, you need to understand that there are very real legal risks associated with OSS licensing conflicts. As I've already made clear early in this post, I am not a lawyer, so I'm not going to offer any additional thoughts regarding the potential business risks other than to simply bring the possibility to your attention.

The software may additionally very well rely upon commercially licensed third-party software, and it is incredibly important that you know whether this is the case. If so, what are the terms of the agreement? Has the licensing fee already been paid in full, or is it due annually? What is the business risk if this licensor suddenly triples fees?
There are actually a few great OSS tools that can help with dependency audits. Here are a few I've used in the past: That said, for reasons I won't go into here (because again, IANAL), it is incumbent upon the seller to disclose licensing issues. The buyer should only be acting as an auditor, and not the investigatory lead with regards to potential intellectual property conflicts. You should always retain legal counsel for these sorts of transactions.

Finally, if the software relies on third-party services (e.g., OpenAI APIs) to function (it almost certainly does), many of the same aforementioned questions apply. How critical are these third-party services? At some point down the road could you reasonably swap them out for a better or more cost-effective alternative?

A penetration test (pen test) is an authorized third-party cybersecurity attack on a computer system. In my experience, for SaaS products these pen tests cost anywhere between $5K and $10K and take 1-2 weeks to complete once scheduled. A lengthy report is typically delivered by the pen tester, at which point the company can dispute/clarify the findings or resolve the security issues and arrange for a subsequent test. Also in my experience, if you're interested in purchasing a relatively small SaaS with no employees other than the founder, it's a practical certainty the product has never been pen tested. Further, if the SaaS is web-based and isn't using a web framework such as Ruby on Rails or Laravel, for more reasons than I could possibly enumerate here I'd be willing to bet there are gaping security holes in the product (SQL injection, cross-site scripting, etc.) which may have already been exploited. Therefore you should be sure to ask if a pen test has recently been completed, and if so ask for the report and information about any subsequent security-related resolutions.
If one has not been completed, then it is perfectly reasonable to ask (in writing) why this has not been the case, and whether the seller can attest to the fact that the software is not known to have been compromised. If the answers to these questions are not satisfactory, then you might consider asking the seller to complete a pen test, or ask if you can arrange for one on your own dime. If you're sufficiently technical and have a general familiarity with cybersecurity concepts such as the OWASP Top Ten, then you could conceivably lower the costs associated with this task by taking a DIY approach. Here is a great list of bug bounty tools that could be used for penetration test purposes. That said, please understand that you should under no circumstances use these tools to test a potential seller's web application without their written permission!

If you think the SaaS you're considering buying doesn't have any technical debt, then consider the fact that even the largest and most successful software products in the world are filled with it: That said, due to perfectly reasonable decisions made literally years ago, it is entirely possible that this "UI change" isn't fixable in 3 months, let alone 3 days. And there is a chance it can't ever be reasonably fixed, and anybody who has built sufficiently complicated software is well aware as to why. Technical debt is a natural outcome of writing software, and there's nothing necessarily wrong with it provided you acknowledge its existence and associated risks. But there are limits to risk tolerance, and if the target SaaS is running on operating systems, frameworks, and libraries that have long since been deprecated and are no longer able to receive maintenance and security updates, then you should recognize that you're probably going to be facing some unwelcome challenges in the near term as you update the software and infrastructure instead of focusing on the actual business.
Of everything that comprises technical due diligence, there is nothing that makes me break out into a sweat more than this topic. Any SaaS product will rely upon numerous if not dozens of credentials. GSuite, AWS, Jenkins, Atlassian, Trello, Sentry, Forge, Twitter, Slack... the list is endless. Not to mention SSH keys, 2FA settings, bank accounts, references to founder PII such as birthdates, and so forth. In a perfect world all of this information would be tidily managed inside a dedicated password manager, but guess what: it's probably not. I cannot possibly impress upon you in this post how important it is to aggressively ask for, review, and confirm access to everything required to run this business, because once the paperwork is signed and money transferred, it's possible the seller will be a lot less responsive to your requests.

Ensuring access to all credentials is so critical that you might consider structuring the deal to indicate that part of the purchase price will be paid at some future point in time (90 days from close, for example) in order to ensure the founder remains in contact with you for a period of time following acquisition. This will give you the opportunity to follow up via email/Zoom and gain access to services and systems that were previously missed.

This blog post barely scratches the surface in terms of what I typically go through during a full-fledged due diligence process, but I wanted to give interested readers a baseline understanding of the minimum requirements necessary to assuage my personal levels of paranoia. If you have any questions about this process, feel free to hit me up at @wjgilmore, message me on LinkedIn, or email me at [email protected].

Diagrams worth creating, as referenced above:

- Cloud infrastructure: For instance, if the seller is using AWS, try to identify the compute instance sizes, RDS instances, security groups, monitoring services, etc. The importance of diagramming the cloud infrastructure becomes even more critical if Kubernetes or other containerized workload solutions are implemented, not only due to the additional complexity but also because, frankly, in my experience these sorts of solutions tend to not be implemented particularly well.
- Deployment strategy: If CI/CD is used, what does the deployment process look like? What branch triggers deployments to staging and production? Is a test suite run as part of the deployment process? How is the team notified of successful and failed deployments?

The local build and deploy steps confirm that:

- You're able to successfully clone the (presumably private) repository and configure the local environment.
- You're able to update the code, submit a pull request, and participate in the subsequent review, testing, and deployment process (if one even exists).
- You've gained insights into what technologies, processes, and services are used to manage company credentials, build the development environment, merge code into production, run tests, and trigger deployments.

Dependency audit tools mentioned above:

- LicenseFinder

W. Jason Gilmore 1 year ago

Disabling SSL Validation in Bruno

I use Laravel Herd to manage local Laravel development environments. Among many other things, it can generate self-signed SSL certificates. This is very useful; however, modern browsers and other HTTP utilities tend to complain about these certificates. Fortunately, it's easy to disable SSL validation by opening Bruno, navigating to under the menu heading, and unchecking .
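Bruno isn't the only client that trips over Herd's self-signed certificates; scripts do too. Here is a sketch of the equivalent escape hatch in PHP's cURL extension, for local development only. The function name and URL are made up for illustration, and this should never ship to production:

```php
<?php
// Local development only — never disable SSL verification in production.
// insecureLocalClient() and the .test URL are illustrative, not part of Herd or Bruno.
function insecureLocalClient(string $url): CurlHandle
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => false, // accept self-signed certificates
        CURLOPT_SSL_VERIFYHOST => 0,     // skip hostname verification
    ]);
    return $ch;
}
```

Call `curl_exec()` on the returned handle as usual; the request will succeed against a Herd-generated certificate.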


The Best "Hello World" in Web Development

Here’s how you make a webpage that says “Hello World” in PHP: Name that file and you’re set. Awesome. Version 1 of our website looks like this:

Okay, we can do a little better. Let’s add the HTML doctype and element to make it a legal HTML5 page, an header to give the “Hello World” some heft, and a paragraph to tell our visitor where they are. This is a complete webpage! If you host this single file at one of the many places available to host PHP code, it will show that webpage to everyone who visits your website. Here’s Version 2:

Now let’s make Version 3 comic sans! And baby blue! We just have to add a style tag:

Already our webpage has a little bit of personality, and we’ve spent just a couple minutes on it. At each step we could see the website in a browser, and keep adding to it. We haven’t even used any PHP yet—all this is plain old HTML, which is much easier to understand than PHP. This is the best “Hello World” in web development, and possibly all of programming.

The thing that first got me interested in PHP is a comment that Ruby on Rails creator David Heinemeier Hansson made on the “CoRecursive” podcast, about PHP’s influence on Rails: […] the other inspiration, which was from PHP, where you literally could do a one line thing that said, “Print hello world,” and it would show a web page. It would show Hello World on a web page. You just drop that file into the correct folder, and you were on the web […] I think to this day still unsurpassed ease of Hello World.

He’s right—this is an unsurpassed ease of Hello World. It is certainly not surpassed by Ruby on Rails, the “Getting Started” guide for which not only requires installing Ruby, SQLite, and Rails itself, but also has you run an initialization command ( ) that creates a genuinely shocking number of files and directories: Of course, Rails is doing a lot of stuff for you! It’s setting up a unit test framework, a blog content folder, a database schema, whatever is, and so on.
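Reconstructed from the description above, the combined Version 2/3 page would look something like this. The exact markup, styling, and filename are my guesses at the spirit of it, not the author's exact code:

```php
<?php
// Reconstruction, not the original post's code.
// Save as e.g. index.php and serve it with any PHP-capable host.
?>
<!DOCTYPE html>
<html>
  <head>
    <style>
      /* Version 3: comic sans and baby blue */
      body { font-family: "Comic Sans MS", cursive; background-color: lightblue; }
    </style>
  </head>
  <body>
    <h1>Hello World</h1>
    <p>This is my website.</p>
  </body>
</html>
```

One file, no build step: refresh the browser after every edit and keep going.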
If I wanted all of that, then Rails might be the way to go. But right now I want to make a webpage that says “Hello World” and start adding content to it; I should not have to figure out what a is to do that. As a reminder, here’s the directory structure for our “Hello World” in PHP:

My goal here isn’t to rip on Ruby on Rails—although, is a Dockerfile really necessary when you’re just “Getting Started”?—but to highlight a problem that is shared by basically every general-purpose programming language: using Ruby for web development requires a discomfiting amount of scaffolding.

Over in the Python ecosystem, one of the first web development frameworks you will encounter is flask, which is a much lighter-weight framework than Rails. In flask, you can also get the “Hello World” down to one file, sort of: Even here, there are a ton of concepts to wrap your head around: you have to understand basic coding constructs like “functions” and “imports”, as well as Python’s syntax for describing these things; you have to figure out how to install Python, how to install Python packages like , and how to run a Python environment management tool like (a truly bizarre kludge that Python developers insist isn’t that big of a deal but is absolutely insane if you come from any other modern programming environment); I know we said one file earlier, but if you want this to work on a server you’re going to have to document that you installed flask, using a file like ; when you start to add more content you’re going to have to figure out how to do multiline strings; and what’s going on with the inscrutable ? If any of these concepts aren’t arranged properly—in your head and in your file—your server will display nothing.

By contrast, you don’t have to know a thing about PHP to start writing PHP code. Hell, you barely have to know the command line. If you can manage to install and run a PHP server, this file will simply display in your browser.
And the file itself: You didn’t have to think about dependencies, or routing, or , or language constructs, or any of that stuff. You’re just running PHP. And you’re on the web.

The Time-To-Hello-World test is about the time between when you have an idea and when you are able to see the seed of its expression. That time is crucial—it’s when your idea is in its most mortal state. Years before my friend Morry really knew how to code, he was able to kludge together enough PHP to make a website that tells you whether your IP address has 69 in it. It basically looks like this:

You may or may not find that to be a compelling work of art, but it would not exist if spinning up Flask boilerplate were a requirement to do it. And he had taken a CS course in basic Python; the experience of making a website in PHP was just that much better. This turned out to be the first in a long line of internet art projects, some of which we made together and some of which he did on his own. doesmyipaddresshave69init.com is a dumb idea for a website. But sometimes dumb ideas evolve into good ideas, or they teach you something that’s later used to make a good idea, or they just make you chuckle. Or none of the above. The best thing about websites is that you don’t have to justify them to anyone—you can just make them.

And PHP is still the fastest way to make a “dynamic” website. I recently made a little invoice generator with a local browser interface for my freelance business. It works great! It’s got a homepage with a list of my generated invoices, a route for making a new one, and routes to view each invoice in a printable format. I don’t find the boilerplate required to make a RESTful web service in NodeJS especially onerous—I have a pretty good system for it at this point. But PHP brings the time-to-hello-world down tremendously.
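The 69-checker described above can be sketched in a few lines of PHP. This is my reconstruction of the idea, not the original site's code, and `ipHas69` is a name I made up:

```php
<?php
// Sketch of the idea behind doesmyipaddresshave69init.com — not the original code.
function ipHas69(string $ip): bool
{
    return str_contains($ip, '69'); // requires PHP 8+
}

$ip = $_SERVER['REMOTE_ADDR'] ?? ''; // set by the web server; empty on the CLI
?>
<h1><?= ipHas69($ip) ? 'Yes.' : 'No.' ?></h1>
<p>Your IP address (<?= htmlspecialchars($ip) ?>) <?= ipHas69($ip) ? 'has' : 'does not have' ?> 69 in it.</p>
```

That's the whole site: one file, one predicate, and a server that fills in `REMOTE_ADDR`.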
I just don’t think this would have gotten off the ground if I had to set up ExpressJS, copy my router boilerplate, make 2 files for each route (the template and the JavaScript that serves it), and do all the other things I do to structure web apps in Node. Instead, I got all that stuff built in with vanilla PHP, and that will presumably work for as long as PHP does. I didn’t even have to touch the package manager.

A lot of people have the attitude that writing vanilla code (and vanilla PHP especially) is never okay because you need secure-by-default frameworks to ensure that you don’t make any security mistakes. It is clearly true that if you are building professional software you should be aware of the web security model and make informed decisions about the security model of your application; but not everyone is building professional software. Relatedly, one route to becoming a software professional is to have a delightful experience as a software amateur. I believe that more people should use the internet not just as consumers, but as creators (not of content but of internet). There is a lot of creativity on the web that can be unlocked by making web development more accessible to artists, enthusiasts, hobbyists, and non-web developers of all types. The softer the learning curve of getting online, the more people will build, share, play, and create there. Softening the learning curve means making the common things easy and not introducing too many concepts until you hit the point where you need them. Beginners and experts alike benefit.

Thanks to Nathaniel Sabanski and Al Sweigart for their feedback on a draft of this blog. I wrote this blog at Recurse Center, a terrific programming community that you should check out. My example invoice generator is not meant to be put online, so it doesn’t escape text to prevent XSS attacks, or do the other web security basics. Admittedly, some of PHP’s design decisions really lend themselves to insecure code.
For starters, they really need a short echo tag that auto-escapes. At this time, I don’t think I’m going to start defaulting to PHP for client work. I’m very comfortable in JS for general-purpose dynamic programming, and JS has a bunch of other useful web built-ins that PHP does not. I am definitely going to do more web art in PHP, though. I especially like how compact and shareable it can be, which has tremendous value for certain types of code. PHP is also missing a bunch of stuff I consider really important to writing RESTful web services, which makes pre-processing your requests close to mandatory. Big ones for me include removing the file extension from the URL, and PUT/DELETE support. Yes, I’m aware of well-known opinion-haver David Heinemeier Hansson’s other opinions. Some of them are right and some of them are wrong. More languages should have a “thing” that they are “for.” Maybe I’ll write about how awk rekindled my love for programming next.

W. Jason Gilmore 2 years ago

Blitz Building with AI

In August 2023 I launched two new SaaS products: EmailReputationAPI and BlogIgnite. While neither are exactly moonshots in terms of technical complexity, both solve very real problems that I've personally encountered while working at multiple organizations. EmailReputationAPI scores an email address to determine validity, both in terms of whether it is syntactically valid and deliverable (via MX record existence verification), as well as the likelihood there is a human on the other end (by comparing the domain to a large and growing database of anonymized domains). BlogIgnite serves as a writing prompt assistant, using AI to generate a draft article, as well as associated SEO metadata and related article ideas.

Launching a SaaS isn't by itself a particularly groundbreaking task these days; however, building and launching such a product in less than 24 hours might be a somewhat more notable accomplishment. And that's exactly what I did for both products, deploying MVPs approximately 15 hours after writing the first line of code. Both are written using the Laravel framework, a technology I happen to know pretty well. However, there is simply no way this self-imposed deadline would have been met without leaning heavily on artificial intelligence.

I am convinced AI coding assistants are opening up the possibility of rapidly creating, or blitzbuilding, new software products. The goal of blitzbuilding is not to create a perfect or even a high-quality product! Instead, the goal is to minimize business risk incurred via a prolonged development cycle by embracing AI to assist with the creation of a marketable product in the shortest amount of time. The term blitzbuilding is a tip of the cap to LinkedIn founder Reid Hoffman's book, "Blitzscaling: The Lightning-Fast Path to Building Massively Valuable Companies", in which he describes techniques for growing a company as rapidly as possible.
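As an illustration of the kind of validity check EmailReputationAPI performs, here is a minimal sketch using PHP built-ins. This is not the product's actual code, the function name is made up, and the MX lookup requires DNS access:

```php
<?php
// Minimal sketch of a syntactic + deliverability check — not EmailReputationAPI's code.
function emailLooksDeliverable(string $email): bool
{
    // Syntactic validity first
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        return false;
    }
    // Deliverability: the domain must publish an MX record (needs DNS access)
    $domain = substr(strrchr($email, '@'), 1);
    return checkdnsrr($domain, 'MX');
}
```

A real reputation service layers much more on top (disposable-domain lists, role-account detection, and so on), but this is the shape of the first two checks described above.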
The chosen technology stack isn't important by itself; however, it is critical that you know it reasonably well, otherwise the AI will give advice and offer code completions that can't easily be confirmed as correct. In my case, EmailReputationAPI and BlogIgnite are built atop the Laravel framework and use the MySQL database, with Redis used for job queuing. They are hosted on Digital Ocean, and deployed with Laravel Forge. Stripe is used for payment processing. The common thread here is I am quite familiar with all of these technologies and platforms. Blitz building is not a time for experimenting with new technologies, because you will only get bogged down in the learning process.

The coding AI is GitHub Copilot with the Chat functionality. At the time of this writing the Chat feature is only available in a limited beta, but it is already extraordinarily capable, so much so that I consider it indispensable. Among many things, it can generate tests, offer feedback, and even explain highlighted code. GitHub Copilot Chat runs in a VS Code sidebar tab, like this:

Notably missing from these products is JavaScript (to be perfectly accurate, there are minuscule bits of JavaScript found on both sites due to unavoidable responsive layout behavior) and custom CSS. I don't like writing JavaScript, and am terrible at CSS, and so leaned heavily on Tailwind UI for layouts and components.

24 hours will be over before you know it, so it is critical to clearly define the minimum acceptable set of product requirements. Once defined, cut the list in half, and then cut it again. To the bone. Believe me, that list should be much smaller than you believe. For EmailReputationAPI, that initial list consisted of the following: There are now plenty of additional EmailReputationAPI features, such as a Laravel package, but most didn't make the cut for the first release and so were delayed.
It is critical to not only understand but be fine with the fact that you will not be happy with the MVP. It won't include all of the features you want, and some of the deployed features may even be broken. It doesn't matter anywhere near as much as you think. What does matter is putting the MVP in front of as many people as possible in order to gather feedback and hopefully customers.

I hate CSS with the heat of a thousand suns, largely because I've never taken the time to learn it well and so find it frustrating. I'm also horrible at design. I'd imagine these traits are shared by many full stack developers, which explains why the Bootstrap CSS library was such a huge hit years ago, and why the Tailwind CSS framework is so popular today. Both help design-challenged developers like myself build acceptable user interfaces. That said, I still don't understand what most of the Tailwind classes even do, but fortunately GitHub Copilot is a great tutor. Take for example the following stylized button: I have no idea what the classes , , etc. do, but can ask Copilot chat to explain:

I also use Copilot chat to offer code suggestions. One common request pertains to component alignment. For instance, I wanted to add the BlogIgnite logo to the login and registration forms, however the alignment was off: I know the logo should be aligned with Tailwind's alignment classes, but have no clue what they are, nor do I care. So after adding the logo to the page I asked Copilot . It responded with: After updating the image classes, the logo was aligned as desired:

Even when using the AI assistant, the turnaround time is such that your product will inevitably have a few bugs. Focus on fixing what your early users report, because those issues are probably related to the application surfaces other users will also use. Sometime soon I'd love to experiment with using a local LLM such as Code Llama in conjunction with an error reporting tool to generate patches and issue pull requests.
At some point in the near future I could even see this process being entirely automated, with the AI additionally writing companion tests. If those tests pass, the AI will merge the pull request and push to production, with no humans involved! Have any questions or thoughts about this post? Contact Jason at [email protected].

The initial EmailReputationAPI feature list mentioned above:

- Marketing website
- Account management (register, login, trial, paid account)
- Crawlers and parsers to generate various datasets (valid TLDs, anonymized domains, etc)
- Secure API endpoint and documentation
- Stripe integration

usher.dev 3 years ago

Django on Fly.io with Litestream/LiteFS

One of the neat things that has come out of Fly is a renewed interest across the dev world in SQLite - an embedded database that doesn't need any special servers or delicate configuration. Some part of this interest comes from the thought that if you had an SQLite database that sat right next to your application, in the same VM, with no network latency, that's probably going to be pretty quick and pretty easy to deploy. Although in some ways it feels like this idea comes full circle back to the days of running a MySQL Server alongside our PHP application on a single VPS, we're also in an era where we need to deal with things like geographic distribution, ephemeral filesystems and scale-to-zero. So we want to run our apps in a nice PaaS, and also quite like the idea of our database being local to our application code, but there's a few conflicts here: Thankfully Fly have been funding the development of some interesting tools; Litestream and LiteFS which aim to solve this. The difference between these tools is not particularly obvious; so to summarise: Litestream was Ben Johnson's first attempt at solving this problem, and is now focused primarily on disaster recovery. It's a tool to stream all the changes made to your SQLite database to some remote storage, like S3, and then recover from it when you need to. This is great, and it nicely solves our first conflict. Our application can be configured to restore the database from remote storage when it starts, and we can be safe knowing that any changes are being backed up as our application runs. Unfortunately, it doesn't solve our second problem, replicating our databases to other instances of our app if we decide to scale out. While there were plans (and an initial implementation) for this in Litestream, live replication was instead moved to the second project, LiteFS. LiteFS does some magic with FUSE to allow it to intercept SQLite transactions and then replicate to multiple instances of your application. 
It's a little more complicated as you need additional tools like Consul so that it knows where to find the primary instance (where it will direct queries that write to the database), but it solves our second conflict! Alas, our first conflict isn't yet solved by LiteFS - if all your nodes go away, there's nowhere to replicate your database from, so it too will disappear. S3 replication like in Litestream is on the roadmap however, so it seems like LiteFS is set to solve all our problems!

So we know what these tools do; let's experiment with getting our Django applications running with them on Fly.io.

For Litestream, we'll need:

1. Prepare your Fly application with (we don't need a Postgres database if it asks).
2. Set all the environment variables we're going to need by creating a new file (call it something like ): will be the directory where the database is replicated to, is the path where Django and can find your database file, and is the path to your S3-compatible bucket. to import these values into your Fly environment.
3. Create a : Replace your section with whatever you normally run to start your web server. Litestream will do its stuff and conveniently run our own application, exiting when our server exits.
4. Create a script, , that will run on application start to make sure all our directories are created:
5. Update your Docker to run this .

Once deployed with , Litestream will start backing up your database. Careful: if you try to scale out by adding more instances, at best you'll see out-of-sync data, at worst you'll end up with a corrupt database.

For LiteFS, we'll need: Prepare your Fly application with (we don't need a Postgres database if it asks). Set all the environment variables we're going to need by creating a new file (call it something like ): will be the directory where the database is replicated to and is the path where Django and can find your database file. to import these values into your Fly environment.
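For the Litestream steps, the config file might look roughly like this. The database path, bucket URL, and server command here are placeholders, not the post's exact values:

```yaml
# Rough sketch of a litestream.yml for the setup described above.
# Paths, bucket, and command are assumptions.
dbs:
  - path: /data/db.sqlite3
    replicas:
      - url: s3://my-bucket/my-app-db

# Litestream runs this command and exits when it exits
exec: gunicorn myproject.wsgi
```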
In your , add: This gives us access to the shared Fly.io-managed Consul instance. Create a : Replace your section with whatever you normally run to start your web server. The is where LiteFS will create its filesystem (where the database will live), is where it keeps files it needs for replication. The and blocks tell LiteFS how to talk to each other and where to find the Fly.io-managed Consul instance. Create the that is started by LiteFS. We need things like migrations to run after LiteFS has set up its filesystem, so we do those in this script: Create a script, , that will run on application start to make sure all our directories are created: Update your Docker to run this .

We're not there yet. We need to make sure database writes only go to our primary. To do this, we'll register a database which intercepts any write queries. I've got this in my app's (heavily based on Adam Johnson's ): This will raise an exception if the query will write to the database, and if the file created by LiteFS exists (meaning this is not the primary). We need something to intercept this exception, so add some middleware: and register it in your settings. This catches the exception raised by the previously registered , finds out where the primary database is hosted, and returns a header telling Fly.io: "Sorry, I can't handle this request, please replay it to the database primary". Once deployed with , LiteFS will start replicating your database!

These are fun tools to play with for now, but there's clearly a lot of work needed to get them working with our normal apps. I'm excited about how they could make getting a Django/Wagtail app deployed much more accessible, easier and cheaper, but there's still some work to be done to make that a reality. The LiteFS roadmap includes things like S3 replication (so we get similar backup features to Litestream), and write forwarding (so writes to read-replicas will automatically be forwarded to the primary).
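To make the write-routing idea concrete, here's a plain-Python sketch (every name here, including the path of the file LiteFS creates on replicas, is an assumption for illustration, not the project's actual code): a query check that raises on writes when this node isn't the primary, and a helper producing the Fly.io replay header the middleware would attach.

```python
# Sketch of the write interception described above. Names and paths are
# illustrative assumptions, not the author's exact code.
from pathlib import Path

# On replicas, LiteFS exposes a file naming the primary instance; its
# absence means this node *is* the primary. The path is an assumption.
PRIMARY_FILE = Path("/litefs/.primary")

WRITE_PREFIXES = ("INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "DROP", "ALTER")


class NotPrimaryError(Exception):
    """A write query was attempted on a read-only replica."""

    def __init__(self, primary_host: str):
        super().__init__(f"writes must go to the primary: {primary_host}")
        self.primary_host = primary_host


def check_query(sql: str, primary_file: Path = PRIMARY_FILE) -> None:
    """Raise if `sql` writes and this instance is not the primary.

    A Django database backend would call this before executing a query.
    """
    is_write = sql.lstrip().upper().startswith(WRITE_PREFIXES)
    if is_write and primary_file.exists():
        raise NotPrimaryError(primary_file.read_text().strip())


def replay_headers(exc: NotPrimaryError) -> dict:
    """The header the middleware attaches to its response: Fly.io replays
    the request at the named instance when it sees fly-replay."""
    return {"fly-replay": f"instance={exc.primary_host}"}
```

In a real app the check lives inside a custom database backend and the header is set by middleware catching the exception, but the decision logic is no more than this.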
These are fun tools to play with for now, but there's clearly a lot of work needed to get them running with our everyday apps. I'm excited about how they could make deploying a Django/Wagtail app more accessible, easier, and cheaper, but there's still some work to be done to make that a reality. The LiteFS roadmap includes things like S3 replication (so we get backup features similar to Litestream's) and write forwarding (so writes to read replicas will automatically be forwarded to the primary). There's a lot of promise there, and I can't wait to make more use of it!

Ratfactor 4 years ago

Slackware Apache Plus PHP-FPM

I'm making good on my earlier promise to start publishing more of my notes in public, so others can benefit from them and so I'm more likely to find them again myself! Here are my notes from today's excursion into getting Apache working with PHP-FPM (specifically on Slackware Linux, but the instructions are quite vanilla because Slackware doesn't mess with upstream packages).
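For context, the vanilla Apache-to-PHP-FPM handoff (not necessarily the exact config from these notes; the socket path and module locations vary by setup, and PHP-FPM's pool config decides where it listens) usually boils down to loading mod_proxy_fcgi and setting a handler:

```apache
# Sketch only: the socket path and module paths are assumptions.
LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_fcgi_module modules/mod_proxy_fcgi.so

<FilesMatch "\.php$">
    SetHandler "proxy:unix:/var/run/php-fpm.sock|fcgi://localhost/"
</FilesMatch>
```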
