• YourMomsTrashman@lemmy.world
    link
    fedilink
    arrow-up
    2
    ·
    6 days ago

    Traditionally, with machine learning, it is standard practice to mention what datasets and/or pretrains were used, so that the results are transparent and can be replicated. With GPT-2, it was “the common crawl and our own crawled 8 million web pages”, and since then I feel it’s mostly left out, falling back on (easily manipulated) benchmarks instead 😬

    • Arthur Besse@lemmy.ml
      link
      fedilink
      English
      arrow-up
      2
      ·
      6 days ago

      Yep. But just providing a list of millions of URLs and saying “we trained on this” as some models in the past have done also didn’t make it possible to replicate; by the time anyone re-fetches them all, many of the URLs will inevitably have changed or disappeared.