Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about why this has been particularly hard on wikis, and how it has gotten much worse in the last few months.
There’s a clever trick from Cloudflare, called AI Labyrinth, that lures misbehaving crawlers into a maze of AI-generated decoy pages:
https://blog.cloudflare.com/ai-labyrinth/
Poisoning the well at scale. I love it.