• ElectricVocalist@jlai.lu · 6 upvotes · 1 hour ago

    but I would assume there’s an arms race going on behind-the-scenes between Cloudflare and the bot developers

    No. CF lost years ago, and the checks can be bypassed easily. All it really does is blacklist IPs that generate insane traffic, and there is a lot of margin below that threshold.
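As a rough illustration of the "blacklist IPs above a traffic threshold" idea and the margin it leaves, here is a minimal sliding-window sketch. It is not how Cloudflare actually works; the class name, the threshold of 5 requests per second, and the example IPs are all assumptions made up for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Blacklist an IP once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.blacklist = set()

    def allow(self, ip, now):
        """Record one request at time `now`; return False if the IP is blocked."""
        if ip in self.blacklist:
            return False
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_requests:
            self.blacklist.add(ip)
            return False
        return True


limiter = SlidingWindowLimiter(max_requests=5, window_seconds=1.0)
# A bot pacing itself at 4 requests per second stays under the threshold.
slow_bot = [limiter.allow("203.0.113.7", t * 0.25) for t in range(20)]
# A bot sending 10 requests within 0.1 s trips the limit and gets blacklisted.
fast_bot = [limiter.allow("203.0.113.9", t * 0.01) for t in range(10)]
```

The slow bot is never blocked even though it hammers the site continuously, which is the "margin" in question: anything just under the threshold passes.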

    • smeg@infosec.pub · English · 1 upvote · 50 minutes ago

      I put my little blog behind Cloudflare because I was tired of it going down due to scrapers overwhelming my little VPS.

  • Thorry@feddit.org · 28 upvotes · 12 hours ago

    Yeah, hosting just about anything is terrible these days. These AI scrapers just can’t act normally. There was nothing wrong with the way Googlebot and Bingbot work: they scrape the website, respect robots.txt and nofollow, and rate limit themselves so as not to overload the servers. It was just fine.
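For reference, the well-behaved pattern described here (check robots.txt, then pace the requests) can be sketched with Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user agent, the URLs, and the `plan_crawl` helper are hypothetical; this only plans a schedule and does not fetch anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def plan_crawl(urls, user_agent="ExampleBot"):
    """Return (url, earliest_fetch_time_in_seconds) pairs for allowed URLs only."""
    delay = parser.crawl_delay(user_agent) or 1  # fall back to 1 s if unset
    schedule = []
    next_slot = 0
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # robots.txt disallows this path, so skip it
        schedule.append((url, next_slot))
        next_slot += delay
    return schedule


schedule = plan_crawl([
    "https://example.org/",
    "https://example.org/private/drafts",
    "https://example.org/posts/1",
])
```

The disallowed URL never makes it into the schedule, and the remaining fetches are spaced by the site's requested crawl delay.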

    These days the AI scrapers go absolutely apeshit: they issue dozens of requests every second and try to scrape anything and everything, going so far as to make up URLs just to see if they get lucky. My blocklist is huge and I need to keep updating it all the time. And every now and again one slips through and absolutely slams the server. This causes an alert and I need to act right away. It’s fucking terrible.
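One way to catch clients that make up URLs is to look at how often they hit pages that do not exist. The sketch below is an assumption about how such a check could look, not the commenter's actual setup: the `find_url_guessers` name, the thresholds, and the sample log are invented, and a real access log would need parsing first.

```python
from collections import Counter


def find_url_guessers(log, min_requests=5, max_miss_ratio=0.5):
    """Return client IPs whose share of 404 responses exceeds max_miss_ratio.

    `log` is an iterable of (ip, status_code) pairs, a simplified stand-in
    for a parsed access log.
    """
    total = Counter()
    misses = Counter()
    for ip, status in log:
        total[ip] += 1
        if status == 404:
            misses[ip] += 1
    return {
        ip for ip, count in total.items()
        if count >= min_requests and misses[ip] / count > max_miss_ratio
    }


sample_log = (
    [("198.51.100.4", 200)] * 9 + [("198.51.100.4", 404)]        # normal visitor
    + [("198.51.100.8", 404)] * 8 + [("198.51.100.8", 200)] * 2  # URL guesser
)
blocklist = find_url_guessers(sample_log)
```

The `min_requests` floor is there so a visitor who follows a single dead link does not end up blocked.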

    AI is already shit, why do those companies go out of their way to be even more shit?

  • CapuccinoCoretto@lemmy.world · 28 upvotes · 13 hours ago

    One thing I want to see is poisoned wells. When you detect scrapers, don’t stop them, feed them pseudo content designed to COST them. Make their training data poisonous and damaging. Make it cost them to purge it, and difficult and expensive to identify it.
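A minimal sketch of the "feed them pseudo content" idea, assuming the scraper has already been detected by some other means. It uses a word-level Markov chain to produce text that looks vaguely plausible but means nothing. The seed text and the `build_chain`, `babble`, and `respond` helpers are hypothetical, and whether this actually costs a scraper anything is not something the sketch can show.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in `text`."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def babble(chain, start, length, seed):
    """Generate `length` words of plausible-looking nonsense."""
    rng = random.Random(seed)  # seeded so a given URL always yields the same page
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Dead end: restart from any word that has followers.
        word = rng.choice(followers) if followers else rng.choice(sorted(chain))
        output.append(word)
    return " ".join(output)


SEED_TEXT = (
    "the server logs show the crawler ignores the rules and the crawler "
    "requests the same pages while the server slows down and the logs grow"
)
chain = build_chain(SEED_TEXT)


def respond(is_scraper, real_page, path):
    """Serve real content to people and generated filler to flagged clients."""
    if is_scraper:
        return babble(chain, start="the", length=40, seed=path)
    return real_page


fake_page = respond(True, "<real article>", "/wiki/some-page")
```

Seeding the generator with the request path keeps each fake page stable across visits, so the filler is harder to spot by simply requesting the same URL twice.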

    • TheOctonaut@piefed.zip · English · 3 upvotes · 4 hours ago

      Unless a significant portion of the internet does this, and we’re talking hundreds of millions of pages, the only cost here is to you.

      LLMs are statistics. They don’t “remember” their training. They just know what statistically speaking the next words should be. But sure, be the web dev version of þorn guy.
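The dilution argument can be illustrated with a toy next-word model built from raw bigram counts. Real LLMs are neural networks and far more complicated than this, so treat it only as a sketch of the statistical intuition; the corpus, the 999-to-1 ratio, and the `next_word_probs` helper are made up for the example.

```python
from collections import Counter


def next_word_probs(corpus, prev_word):
    """Estimate P(next word | prev_word) from raw bigram counts."""
    counts = Counter()
    for doc in corpus:
        words = doc.split()
        for a, b in zip(words, words[1:]):
            if a == prev_word:
                counts[b] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


clean = ["put cheese on pizza"] * 999
poisoned = ["put glue on pizza"]

# One poisoned page among a thousand barely moves the estimate.
probs = next_word_probs(clean + poisoned, "put")
```

In this toy setup the poisoned continuation ends up with a probability of about one in a thousand, which is the sense in which a single page is a rounding error unless a large share of the corpus is poisoned.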

      • ATPA9@feddit.org · 3 upvotes · 2 hours ago

        Remember the glue on pizza? Sometimes it takes just one stupid post somewhere to poison an LLM.

        • TheOctonaut@piefed.zip · English · 3 upvotes · 2 hours ago

          Glue on pizza was the result of an early version of an agent tool: built-in search. It wasn’t an output of the LLM model (yes I know, ATM machine) itself. It was an LLM using a tool to find a search result from a site considered reputable (yes, I know) and presenting it to the user as fact - an instructions problem, not a statistical one.

    • hansolo@lemmy.today · 7 upvotes · 10 hours ago

      I really want a tutorial on how to do this. I think it’s a great way to practice self-aggrandizement by making myself the pretend king of a pretend country.

  • Droopy@programming.dev · English · 9 upvotes · 12 hours ago

    but those that do run these wikis will be in the fast pass line at the gates of heaven. Please don’t give up. I never use gipity