Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis and how it has gotten much worse in the last few months.
Yes, I use this block list as well as my own additions (mostly IPs of misbehaving bots):
https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker
It’s written specifically for Apache, but that’s what I use. Other lists of this kind are available.
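
For anyone not on Apache, the general idea behind this kind of list (match the user agent against known bad names, match the client IP against known bad ranges) can be sketched in a few lines of Python. This is only an illustration and not how the linked project is implemented; the function name, the agent substrings, and the IP range below are all made up for the example (the range is a reserved documentation network).

```python
import ipaddress

# Hypothetical sample entries; real block lists are far longer.
BAD_AGENT_SUBSTRINGS = ["examplebot", "badscraper"]
# 203.0.113.0/24 is a reserved documentation range, used here as a placeholder.
BAD_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_blocked(user_agent: str, remote_ip: str) -> bool:
    """Return True if the request matches the agent list or the IP list."""
    ua = user_agent.lower()
    # Case-insensitive substring match on the User-Agent header.
    if any(bad in ua for bad in BAD_AGENT_SUBSTRINGS):
        return True
    # Membership test of the client address against each blocked network.
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in BAD_NETWORKS)
```

Keep in mind that user agents are trivially spoofed, which is why my own additions are mostly IPs.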