Data Poisoning could be a tool we use to identify AI that has used copyrighted material

ekZepp@lemmy.world · 10 days ago

Data Poisoning could be a tool we use to identify AI that has used copyrighted material

Arthur Besse@lemmy.ml · 10 days ago

identify AI that has used copyrighted material

but, that is basically all modern “AI”.

(the only LLM i’ve heard of which actually claims that its training corpus is freely licensed is Apertus…)

youcantreadthis@quokk.au · edit-2 10 days ago

We callin it Plagiarized Information Stochastic Stupidity now. The only PISS you’ve heard of.

Hackworth@piefed.ca · 10 days ago

Adobe claims to only train their image generator, Firefly, on images from their stock library.

very_well_lost@lemmy.world · 10 days ago

cloudskater@piefed.blahaj.zone · 8 days ago

Even if it didn’t use copyrighted stuff, the concept of “generative” AI is fascist to its very core.

HaraldvonBlauzahn@feddit.org · 8 days ago

the concept of “generative” AI is fascist to its very core.

Can you explain? I might miss the connection.

YourMomsTrashman@lemmy.world · 9 days ago

Traditionally, with machine learning, it is standard practice to mention what datasets and/or pretrains were used, so that the results are transparent and can be replicated. With GPT-2, it was “the common crawl and our own crawled 8 million web pages”, and since then I feel it’s mostly left out, falling back on (easily manipulated) benchmarks instead 😬

Arthur Besse@lemmy.ml · 9 days ago

Yep. But just providing a list of millions of URLs and saying “we trained on this” as some models in the past have done also didn’t make it possible to replicate; by the time anyone re-fetches them all, many of the URLs will inevitably have changed or disappeared.

YourMomsTrashman@lemmy.world · 8 days ago

That’s exactly why projects like Common Crawl exist though!

Poison Your Data. Fight Back Against AI.