Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis and how it has gotten much worse in the last few months.
You could, but I feel it’s tricky to get right. Most small websites manage this with some form of bot detection for visitors, either a service like Cloudflare or an open source tool like Anubis.
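For anyone wondering what the Anubis-style approach boils down to: as far as I understand it, the visitor's browser has to solve a small proof-of-work puzzle before the page is served, which is cheap for one human and expensive for a scraper hitting millions of pages. Here's a rough Python sketch of that idea. It is not Anubis's actual code; the function names (`make_challenge`, `solve`, `verify`) and the `difficulty` parameter are made up for illustration.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random challenge string to a new visitor."""
    return secrets.token_hex(16)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce until the SHA-256 hex digest
    of challenge+nonce starts with `difficulty` zero characters."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking the visitor's answer costs a single hash."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty=4)  # hypothetical difficulty value
    print("accepted:", verify(challenge, nonce, difficulty=4))
```

The asymmetry is the point: solving takes many hashes on average, verifying takes one. Raising `difficulty` by one makes the client's work roughly 16 times harder while the server's check stays the same.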
There are different ways to tackle this, and it sucks that we’re forced to put time and effort into dealing with it.
There’s a clever trick from Cloudflare:
https://blog.cloudflare.com/ai-labyrinth/
Poisoning the well at scale. I love it.