BioHacker News | Faking a JPEG

▲Faking a JPEG(ty-penguin.org.uk)

259 points by todsacerdoti 13 hours ago | 19 comments

▲tomsmeding 5 hours ago

They do have a robots.txt [1] that disallows robot access to the spigot tree (as expected), but removing the /spigot/ part from the URL seems to still lead to Spigot. [2] The /~auj namespace is not disallowed in robots.txt, so even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

[1]: https://www.ty-penguin.org.uk/robots.txt

[2]: https://www.ty-penguin.org.uk concatenated with /~auj/cheese (don't want to create links there)
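
For anyone wanting to check this kind of thing locally, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules below are an assumption for illustration (a single Disallow on the spigot subtree), not a copy of the site's real robots.txt, and `ExampleBot` is a made-up user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only: just the spigot subtree is disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /~auj/spigot/
"""


def allowed(path, agent="ExampleBot"):
    """Return True if the hypothetical rules above let `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


print(allowed("/~auj/spigot/index.html"))  # inside the disallowed subtree
print(allowed("/~auj/cheese"))             # sibling path, not covered by the rule
```

Under those assumed rules, a polite crawler would skip the first path and still consider itself free to fetch the second.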

▲bstsb 3 hours ago

previously the author wrote in a comment reply about not configuring robots.txt at all:

> I've not configured anything in my robots.txt and yes, this is an extreme position to take. But I don't much like the concept that it's my responsibility to configure my web site so that crawlers don't DOS it. In my opinion, a legitimate crawler ought not to be hitting a single web site at a sustained rate of > 15 requests per second.

▲yorwba 1 hour ago

The spigot doesn't seem to distinguish between crawlers that make more than 15 requests per second and those that make less. I think it would be nicer to throw up a "429 Too Many Requests" page when you think the load is too much and only poison crawlers that don't back off afterwards.
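
Roughly what I have in mind, as a sketch only: count requests per client in a sliding window, answer 429 for the first few violations, and only switch to garbage after that. The 15-per-second threshold comes from the quote above; the strike count, the class name and the string return values are my own assumptions, and this is not how Spigot works.

```python
import time
from collections import defaultdict, deque

MAX_REQUESTS = 15          # per window, taken from the quoted 15 requests/second
WINDOW_SECONDS = 1.0
STRIKES_BEFORE_POISON = 3  # assumed tolerance, not from the article


class Throttle:
    """Decide per request whether to serve, send 429, or serve garbage."""

    def __init__(self):
        self.hits = defaultdict(deque)   # client -> timestamps inside the window
        self.strikes = defaultdict(int)  # client -> over-limit requests so far

    def decide(self, client, now=None):
        now = time.monotonic() if now is None else now
        window = self.hits[client]
        window.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while window and now - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) <= MAX_REQUESTS:
            return "serve"
        # Over the limit: warn with 429 first, poison only repeat offenders.
        # Strikes never reset in this sketch; a real version would expire them.
        self.strikes[client] += 1
        if self.strikes[client] <= STRIKES_BEFORE_POISON:
            return "429"
        return "poison"
```

The `client` key would typically be an IP address, which is exactly the part that stops working once requests are spread over many addresses.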

▲evgpbfhnr 59 minutes ago

when crawlers use a botnet to only make one request per ip per long duration that's not realistic to implement though..

▲josephg 3 hours ago

> even well-intentioned crawlers, if they somehow end up there, can get stuck in the infinite page zoo. That's not very nice.

So? What duty do web site operators have to be "nice" to people scraping your website?

▲gary_0 3 hours ago

The Marginalia search engine or archive.org probably don't deserve such treatment--they're performing a public service that benefits everyone, for free. And it's generally not in one's best interests to serve a bunch of garbage to Google or Bing's crawlers, either.

▲suspended_state 3 hours ago

The point is that not every web crawler is out there to scrape websites.

▲andybak 44 minutes ago

Unless you define "scrape" to be inherently nefarious - then surely they are? Isn't the definition of a web crawler based on scraping websites?

▲jandrese 7 hours ago

I wonder if you could mess with AI input scrapers by adding fake captions to each image? I imagine something like:

    (big green blob)

    "My cat playing with his new catnip ball".


    (blue mess of an image)

    "Robins nesting"

▲Dwedit 5 hours ago

A well-written scraper would check the image against a CLIP model or other captioning model to see if the text there actually agrees with the image contents.

▲Simran-B 5 hours ago

Then captions that are somewhat believable? "Abstract digital art piece by F. U. Botts resembling wide landscapes in vibrant colors"

▲Someone 1 hour ago

Do scrapers actually do such things on every page they download? Sampling a small fraction of a site to check how trustworthy it is, I can see happen, but I would think they’d rather scrape many more pages than spend resources doing such checks on every page.

Or is the internet so full of garbage nowadays that it is necessary to do that on every page?

▲a-biad 12 minutes ago

I am a bit confused about the context. What exactly is the point of exposing fake data to web crawlers?

▲marcod 10 hours ago

Reading about Spigot made me remember https://www.projecthoneypot.org/

I was very excited 20 years ago, every time I got emails from them saying that the scripts and donated MX records on my website had helped catch a harvester:

> Regardless of how the rest of your day goes, here's something to be happy about -- today one of your donated MXs helped to identify a previously unknown email harvester (IP: 172.180.164.102). The harvester was caught a spam trap email address created with your donated MX:

▲notpushkin 3 hours ago

This is very neat. Honeypot scripts are fairly outdated though (and you can’t modify them according to ToS). The Python one only supports CGI and Zope out of the box, though I think you can make a wrapper to make it work with WSGI apps as well.

▲mrbluecoat 12 hours ago

> I felt sorry for its thankless quest and started thinking about how I could please it.

A refreshing (and amusing) attitude versus getting angry and venting on forums about aggressive crawlers.

▲ASalazarMX 12 hours ago

Helped without doubt by the capacity to inflict pain and garbage unto those nasty crawlers.

▲Szpadel 2 hours ago

the worst offender I saw is meta.

they have facebookexternalhit bot (they sometimes use default python request user agent) that (as they documented) explicitly ignores robots.txt

it's (as they say) used to validate links if they contain malware. But if someone would like to serve malware the first thing they would do would be to serve innocent page to facebook AS and their user agent.

they also re-check every URL every month to validate if this still does not contain malware.

the issue is as follows: some bad actors spam Facebook with URLs to expensive endpoints (like some search with random filters) and Facebook then provides them with a free ddos service against their competition. they flood you with > 10 r/s for days every month.

▲EspadaV9 12 hours ago

I like this one

https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...

Some kind of statement piece

▲creatonez 9 hours ago

For the full experience:

Firefox: Press F12, go to Network, click No Throttling > change it to GPRS

Chromium: Press F12, go to Network, click No Throttling > Custom > Add Profile > Set it to 20kbps and set the profile

▲extraduder_ire 6 hours ago

Good mention. There's probably some good art to be made by serving similar jpeg images with the speed limited server-side.
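
A rough sketch of the server-side version: a generator that hands out the image in small chunks and pauses between them. The 2500 bytes/second default is just 20 kbps converted to bytes, and the function name and chunk size are made up. Since Spigot is a Flask app, a view could presumably wrap a generator like this in a streaming response, but I have not tried it against that code.

```python
import time


def throttled_chunks(data, bytes_per_second=2500, chunk_size=256, sleep=time.sleep):
    """Yield `data` in chunks, pausing so the average rate is bytes_per_second.

    `sleep` is injectable so the pacing can be tested without actually waiting.
    """
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        yield chunk
        # Wait for as long as this chunk "should" take at the target rate.
        sleep(len(chunk) / bytes_per_second)
```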

▲myelinsheep 10 hours ago

Anything with Shakespeare in it?

▲EspadaV9 7 hours ago

Looks like he didn't get time to finish

https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...

Terry Pratchett has one I'd like to think he'd approve of. Just a shame I'm unable to see the 8th colour, I'm sure it's in there somewhere.

https://www.ty-penguin.org.uk/~auj/spigot/pics/2025/03/25/fa...

▲derefr 11 hours ago

> It seems quite likely that this is being done via a botnet - illegally abusing thousands of people's devices. Sigh.

Just because traffic is coming from thousands of devices on residential IPs, doesn't mean it's a botnet in the classical sense. It could just as well be people signing up for a "free VPN service" — or a tool that "generates passive income" for them — where the actual cost of running the software, is that you become an exit node for both other "free VPN service" users' traffic, and the traffic of users of the VPN's sibling commercial brand. (E.g. scrapers like this one.)

This scheme is known as "proxyware" — see https://www.trendmicro.com/en_ca/research/23/b/hijacking-you...

▲cAtte_ 11 hours ago

sounds like a botnet to me

▲whatsupdog 8 hours ago

Botnet with extra steps.

▲ronsor 11 hours ago

because it is, but it's a legal botnet

▲derefr 11 hours ago

Eh. To me, a bot is something users don't know they're running, and would shut off if they knew it was there.

Proxyware is more like a crypto miner — the original kind, from back when crypto-mining was something a regular computer could feasibly do with pure CPU power. It's something users intentionally install and run and even maintain, because they see it as providing them some potential amount of value. Not a bot; just a P2P network client.

Compare/contrast: https://en.wikipedia.org/wiki/Winny / https://en.wikipedia.org/wiki/Share_(P2P) / https://en.wikipedia.org/wiki/Perfect_Dark_(P2P) — pieces of software which offer users a similar devil's bargain, but instead of "you get a VPN; we get to use your computer as a VPN", it's "you get to pirate things; we get to use your hard drive as a cache node in our distributed, encrypted-and-striped pirated media cache."

(And both of these are different still to something like BitTorrent, where the user only ever seeds what they themselves have previously leeched — which is much less questionable in terms of what sort of activity you're agreeing to play host to.)

▲tgsovlerkhgsel 11 hours ago

AFAIK much of the proxyware runs without the informed consent of the user. Sure, there may be some note on page 252 of the EULA of whatever adware the user downloaded, but most users wouldn't be aware of it.

▲kazinator 3 hours ago

Faking a JPEG is not only less CPU intensive than making one properly, but by doing so you are fuzzing whatever malware is on the other end; if it is decoding the JPEG and isn't robust, it may well crash.

▲112233 7 hours ago

So how do I set up an instance of this beautiful flytrap? Do I need a valid personal blog, or can I plop something on cloudflare to spin on their edge?

▲ffsm8 7 hours ago

It's a flask app, he linked to it

https://github.com/gw1urf/spigot/

▲Modified3019 10 hours ago

Love the effort.

That said, these seem to be heavily biased towards displaying green, so one “sanity” check would be if your bot is suddenly scraping thousands of green images, something might be up.

▲lvncelot 4 hours ago

Nature photographers around the world rejoice as their content becomes safe from scraping.

▲ykonstant 4 hours ago

Next we do it with red and blue :D

▲recursive 9 hours ago

Mission accomplished I guess

▲superjan 6 hours ago

There is a particular pattern (block/tag marker) that is illegal in the compressed JPEG stream. If I recall correctly you should insert a 0x00 after a 0xFF byte in the output to avoid it. If there is interest I can follow up later (not today).
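
In the meantime, a minimal sketch of that byte stuffing in Python. `stuff_ff` and `fake_scan_data` are names I made up, and this only covers the random filler bytes, not the headers or the end-of-image marker.

```python
import os


def stuff_ff(data):
    """Insert 0x00 after every 0xFF so no byte pair reads as a JPEG marker."""
    return data.replace(b"\xff", b"\xff\x00")


def fake_scan_data(length):
    """Random filler for the entropy-coded segment, made marker-safe."""
    return stuff_ff(os.urandom(length))
```

The output is slightly longer than `length`, since about one random byte in 256 is 0xFF and each of those gains a trailing zero.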

▲lblume 13 hours ago

Given that current LLMs do not consistently output total garbage, and can be used as judges in a fairly efficient way, I highly doubt this could even in theory have any impact on the capabilities of future models. Once (a) models are capable enough to distinguish between semi-plausible garbage and possibly relevant text and (b) companies are aware of the problem, I do not think data poisoning will be an issue at all.

▲jesprenj 12 hours ago

Yes, but you still waste their processing power.

▲immibis 11 hours ago

There's no evidence that the current global DDoS is related to AI.

▲ykonstant 4 hours ago

We have investigated nobody and found no evidence of malpractice!

▲lblume 4 hours ago

The linked page claims that most identified crawlers are related to scraping for training data of LLMs, which seems likely.

▲bschwindHN 12 hours ago

You should generate fake but believable EXIF data to go along with your JPEGs too.

▲russelg 9 hours ago

They're taking the valid JPEG headers from images already on their site, so it's possible those are already in place.

▲electroglyph 5 hours ago

there's no metadata in the example image

▲bigiain 6 hours ago

Fake exif data with lat/longs showing the image was taken inside Area 51 or The Cheyenne Mountain Complex or Guantanamo Bay...

▲derektank 12 hours ago

From the headline that's actually what I was expecting the link to discuss

▲jekwoooooe 1 hour ago

It’s our moral imperative to make crawling cost prohibitive and also poison LLM training.

▲BubbleRings 1 hour ago

Is there a reason you couldn’t generate your images by grabbing random rectangles of pixels from one source image and pasting them into a random location in another source image? Then you would have a fully valid jpg that no AI could easily identify as generated junk. I guess that would require much more CPU than your current method, huh?
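
Something like this is what I mean, sketched on plain nested lists of pixel values rather than real images. The function name is made up, and decoding the sources and re-encoding the result as JPEG (the expensive part) is left out entirely.

```python
import random


def paste_random_patch(src, dst, rng):
    """Return a copy of dst with one random rectangle of src pasted into it.

    src and dst are lists of rows of pixel values; rng is a random.Random.
    """
    h = rng.randint(1, min(len(src), len(dst)))
    w = rng.randint(1, min(len(src[0]), len(dst[0])))
    sy = rng.randint(0, len(src) - h)      # top-left corner in the source
    sx = rng.randint(0, len(src[0]) - w)
    dy = rng.randint(0, len(dst) - h)      # top-left corner in the destination
    dx = rng.randint(0, len(dst[0]) - w)
    out = [row[:] for row in dst]          # leave dst itself untouched
    for y in range(h):
        out[dy + y][dx:dx + w] = src[sy + y][sx:sx + w]
    return out
```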

▲hashishen 11 hours ago

the hero we needed and deserved

▲puttycat 10 hours ago

> compression tends to increase the entropy of a bit stream.

Does it? Encryption increases entropy, but not sure about compression.

▲gregdeon 10 hours ago

Yes: the reason why some data can be compressed is because many of its bits are predictable, meaning that it has low entropy per bit.

▲JCBird1012 10 hours ago

I can see what was meant with that statement. I do think compression increases Shannon entropy by virtue of it removing repeating patterns of data - Shannon entropy per byte of compressed data increases since it’s now more “random” - all the non-random patterns have been compressed out.

Total information entropy - no. The amount of information conveyed remains the same.
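
A quick way to see the per-byte effect, as a sketch: measure the entropy of the byte histogram before and after zlib compression. The input text is made up (random picks from a tiny vocabulary), and the histogram is only a zeroth-order estimate of Shannon entropy, so treat the numbers as indicative.

```python
import math
import random
import zlib
from collections import Counter


def entropy_per_byte(data):
    """Shannon entropy of the byte histogram, in bits per byte (0 to 8)."""
    total = len(data)
    return -sum(n / total * math.log2(n / total) for n in Counter(data).values())


# Made-up, fairly predictable input: random words from a small vocabulary.
rng = random.Random(0)
words = ["cheese", "penguin", "spigot", "crawler", "garbage", "jpeg", "robot", "zoo"]
text = " ".join(rng.choice(words) for _ in range(20000)).encode()
packed = zlib.compress(text, 9)

print(f"original:   {len(text)} bytes, {entropy_per_byte(text):.2f} bits/byte")
print(f"compressed: {len(packed)} bytes, {entropy_per_byte(packed):.2f} bits/byte")

# Lossless round trip: the same information, packed into fewer bytes.
assert zlib.decompress(packed) == text
```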

▲gary_0 4 hours ago

Technically with lossy compression, the amount of information conveyed will likely change. It could even increase the amount of information of the decompressed image, for instance if you compress a cartoon with simple lines and colors, a lossy algorithm might introduce artifacts that appear as noise.

▲dheera 12 hours ago

> So the compressed data in a JPEG will look random, right?

I don't think JPEG data is compressed enough to be indistinguishable from random.

SD VAE with some bits lopped off gets you better compression than JPEG and yet the latents don't "look" random at all.

So you might think Huffman encoded JPEG coefficients "look" random when visualized as an image but that's only because they're not intended to be visualized that way.

▲maxbond 10 hours ago

Encoded JPEG data is random in the same way cows are spherical.

▲BlaDeKke 10 hours ago

Cows can be spherical.

▲bigiain 6 hours ago

And have uniform density.

▲anyfoo 5 hours ago

Yeah, but in practice you only get that in a perfect vacuum.