Data poisoning could be a tool we use to identify AI that has used copyrighted material, or we could use it to mess with AI.
https://www.vice.com/en/article/infinite-ai-homer-simpson-cover-songs-poisoned-soulseek/
https://mosis.eecs.utk.edu/harmonycloak.html
https://mosis.eecs.utk.edu/publications/meerza2024harmonycloak.pdf



Yep. But just providing a list of millions of URLs and saying “we trained on this”, as some past models have done, didn’t make replication possible either; by the time anyone re-fetches them all, many of the URLs will inevitably have changed or disappeared.
That’s exactly why projects like Common Crawl exist, though!