Data poisoning could be a tool we use to identify AI that has used copyrighted material

ekZepp@lemmy.world · 7 days ago

YourMomsTrashman@lemmy.world · 6 days ago

Traditionally in machine learning, it has been standard practice to state which datasets and/or pretrained models were used, so that the results are transparent and can be replicated. With GPT-2, it was “the common crawl and our own crawled 8 million web pages”, and since then I feel this has mostly been left out, with authors falling back on (easily manipulated) benchmarks instead 😬

Arthur Besse@lemmy.ml · 6 days ago

Yep. But just providing a list of millions of URLs and saying “we trained on this”, as some models have done in the past, didn’t make replication possible either: by the time anyone re-fetches them all, many of the URLs will inevitably have changed or disappeared.
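
To make the drift problem concrete, here is a minimal sketch, assuming a dataset manifest that stores a SHA-256 hash next to each URL (the URLs, page contents, and function names are all hypothetical, and the "fetches" are simulated with in-memory dicts rather than real HTTP requests). With hashes recorded at training time, anyone re-fetching the list can at least detect which pages have changed or vanished, though they still can't recover the original text.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a hex SHA-256 digest of a page body."""
    return hashlib.sha256(content).hexdigest()


def build_manifest(pages: dict) -> dict:
    """Map each URL to the hash of the content seen at training time."""
    return {url: fingerprint(body) for url, body in pages.items()}


def find_drift(manifest: dict, refetched: dict) -> dict:
    """Compare a later re-fetch against the manifest.

    Returns a dict of URL -> "missing" or "changed" for every page
    that no longer matches what was originally recorded.
    """
    drift = {}
    for url, expected in manifest.items():
        body = refetched.get(url)
        if body is None:
            drift[url] = "missing"
        elif fingerprint(body) != expected:
            drift[url] = "changed"
    return drift


# Hypothetical snapshot taken when the dataset was assembled.
at_training_time = {
    "https://example.com/a": b"original article text",
    "https://example.com/b": b"a page that will be edited",
    "https://example.com/c": b"a page that will disappear",
}

# Hypothetical re-fetch some time later: one page edited, one gone.
later = {
    "https://example.com/a": b"original article text",
    "https://example.com/b": b"edited text",
}

manifest = build_manifest(at_training_time)
drift = find_drift(manifest, later)
print(drift)
```

A bare URL list has nothing to compare against, so this kind of check isn't even possible; the hashes only tell you *that* the data changed, which is why an archived copy of the content itself is what actually enables replication.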

YourMomsTrashman@lemmy.world · 6 days ago

That’s exactly why projects like Common Crawl exist, though!
