Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and how it has gotten much worse in the last few months.
Do you have links or tutorials that would help to deal with these issues?
Yes, I use this block list as well as my own additions (mostly IPs of misbehaving bots):
https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker
It’s specifically for Apache, but that’s what I use. Other block lists of this kind are available too.
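
For the "own additions" part, here is a rough sketch of how a hand-kept list of bad IPs could be turned into an Apache config fragment. It is not part of the linked project and does not use its file layout. It only assumes Apache 2.4 with the standard `Require not ip` directive from mod_authz_host. The function name `render_apache_denylist` and the sample addresses (taken from the reserved documentation ranges) are made up for illustration.

```python
import ipaddress


def render_apache_denylist(entries):
    """Return an Apache 2.4 <RequireAll> block that denies the given IPs/CIDR ranges.

    Hypothetical helper: `entries` is any iterable of strings such as
    "192.0.2.10" or "198.51.100.0/24". Blank lines and "#" comments are skipped.
    Invalid addresses raise ValueError, so typos are caught before Apache sees them.
    """
    networks = set()  # a set removes duplicate entries
    for entry in entries:
        entry = entry.strip()
        if not entry or entry.startswith("#"):
            continue
        # strict=False normalizes ranges written with host bits set,
        # e.g. 198.51.100.7/24 becomes 198.51.100.0/24
        networks.add(ipaddress.ip_network(entry, strict=False))

    lines = ["<RequireAll>", "    Require all granted"]
    # Sort by IP version first so IPv4 and IPv6 addresses are never compared directly
    for net in sorted(networks, key=lambda n: (n.version, n.network_address, n.prefixlen)):
        # Write single hosts as plain addresses, ranges in CIDR notation
        if net.prefixlen == net.max_prefixlen:
            target = str(net.network_address)
        else:
            target = str(net)
        lines.append(f"    Require not ip {target}")
    lines.append("</RequireAll>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example addresses only, not real offenders
    bad_bots = ["192.0.2.10", "198.51.100.7/24", "# noisy crawler", "192.0.2.10"]
    print(render_apache_denylist(bad_bots))
```

The output would go inside a `<Directory>` or `<Location>` section of the site config, followed by a config test and reload. Blocking by IP only helps against bots that reuse addresses, so this complements a user-agent based list rather than replacing it.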