OpenAI just admitted it can’t identify AI-generated text. That’s bad for the internet and it could be really bad for AI models.

In January, OpenAI launched a system for identifying AI-generated text. This month, the company scrapped it.

  • @Hamartiogonic@sopuli.xyz · 93 points · 2 years ago

    Text written before 2023 is going to be exceptionally valuable, because it’s the only text we can be reasonably sure wasn’t contaminated by an LLM.

    This reminds me of some research institutions pulling up sunken ships so that they can harvest the steel and use it to build sensitive instruments. You see, before the nuclear tests there was hardly any man-made radiation anywhere. However, after America and the Soviet Union started nuking stuff like there’s no tomorrow, pretty much all steel produced on Earth has been a little bit contaminated. Not a big issue for normal people, but scientists building super-sensitive equipment certainly notice the difference between pre-nuclear and post-nuclear steel.

      • @evatronic@lemm.ee · 6 points · 2 years ago

        It is also worth noting that we can still make steel with little or no radioactive contamination; it’s just really expensive and difficult, and it happens in very small quantities.

    • @lily33@lemmy.world · 5 points · 2 years ago

      Not really. If it’s truly impossible to tell the text apart, then it doesn’t really pose a problem for training AI. Otherwise, next-gen AI will be able to tell apart text generated by current-gen AI, and that text will get filtered out. So only the most recent data will contain unfiltered, shitty AI-generated stuff, but they don’t train AI on super-recent text anyway.

      • @Womble@lemmy.world · 17 points · 2 years ago

        This is not the case. Model collapse is a studied phenomenon for LLMs: quality deteriorates when models are trained on data generated by the models themselves. It might not be an issue if there were thousands of base models out there, but IIRC there are only 3–5 base models that all the others are derivatives of.
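
        The effect is easy to reproduce in miniature. Here’s a minimal sketch, not of an LLM but of a toy Gaussian “model” fitted with numpy, showing the fit-then-resample loop that drives collapse; every name and parameter in it is illustrative, not something from the article:

        ```python
        # Toy illustration of model collapse: the "model" is just a Gaussian
        # fitted to its training data. Each generation trains exclusively on
        # samples drawn from the previous generation's model.
        import numpy as np

        rng = np.random.default_rng(0)
        data = rng.normal(loc=0.0, scale=1.0, size=100)  # generation 0: "human" text

        for gen in range(1, 1001):
            mu, sigma = data.mean(), data.std()     # "train" on the current corpus
            data = rng.normal(mu, sigma, size=100)  # "generate" the next corpus
            if gen % 200 == 0:
                print(f"generation {gen:4d}: sigma = {sigma:.3f}")

        # sigma drifts toward zero: each fit-and-resample step loses a bit of
        # tail mass, so later generations concentrate around the mean and
        # diversity collapses.
        ```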

        • @lily33@lemmy.world · 1 point · 2 years ago (edited)

          I don’t see how that affects my point.

          • Today’s AI detectors can’t tell apart the output of today’s LLMs.
          • Future AI detectors WILL be able to tell apart the output of today’s LLMs.
          • Of course, future AI detectors won’t be able to tell apart the output of future LLMs.

          So at any point in time, only recent text could be “contaminated”. The claim that “all text after 2023 is forever contaminated” just isn’t true. Researchers would simply have to be a bit more careful about including it.
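
          For what it’s worth, the curation scheme this implies is simple to express. A minimal sketch, assuming a hypothetical future detector callable that reliably flags the output of older LLMs; `Document`, `filter_training_set`, and the cutoff year are all made-up names for illustration:

          ```python
          from dataclasses import dataclass
          from typing import Callable, List

          @dataclass
          class Document:
              text: str
              year: int

          def filter_training_set(
              docs: List[Document],
              detector: Callable[[str], bool],  # True = flagged as AI-generated
              cutoff_year: int = 2023,
          ) -> List[Document]:
              """Keep pre-cutoff text unconditionally; for newer text, keep only
              what the current-generation detector does not flag. Only text newer
              than the detector itself remains genuinely uncertain."""
              return [d for d in docs if d.year < cutoff_year or not detector(d.text)]

          # Hypothetical usage with a placeholder detector (not a real model):
          corpus = [Document("written in 2019", 2019), Document("written in 2024", 2024)]
          flag_recent = lambda text: "2024" in text
          print([d.year for d in filter_training_set(corpus, flag_recent)])
          # -> [2019]
          ```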