Houston, we have a problem: GPT plagiarism

[Image: an ancient Egyptian drawing of the Ouroboros, the snake swallowing its tail. Wikimedia Commons, courtesy of the Egyptian Museum of Berlin.]

You've been hearing a lot about hallucinations. Some call them confabulations, but let's call them what they are: bad results. Sometimes terrible results. "Hallucination" is anthropomorphic language, a way of personifying the tool to persuade investors and consumers that there is intelligence behind what it executes. That should concern any business leader investing significantly in building these tools into their infrastructure. I have a longer piece coming on the alleged reasoning capability, which is also proving to be a hollow claim, unless producing a written text that explains or displays the work of reasoning counts as reasoning itself.

Yesterday, I set out to write about the "hallucination problem" and why the alleged panacea, RAG (retrieval-augmented generation), is unlikely to fix it: a system cannot be generative without being able to "hallucinate." But while preparing that piece and discussing it with our team, we encountered another disturbing problem, which I will only allude to here. Vaguely. We don't like to be hasty. Many things can affect the quality of these tools' output, and we don't want to jump to conclusions.
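
To make that architectural point concrete, here is a minimal sketch of the retrieval half of a RAG pipeline. The corpus, query, and prompt template are invented for illustration, and scikit-learn's TF-IDF similarity stands in for the dense-vector search a production system would use. The thing to notice is that retrieval only assembles a prompt; the answer still comes from the same probabilistic generator, which is why grounding can reduce, but not remove, fabricated output.

```python
# Minimal RAG retrieval sketch (illustrative only).
# Assumes scikit-learn; the corpus and query are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The Ouroboros is an ancient symbol of a serpent eating its own tail.",
    "Transformer models predict the next token from the preceding context.",
    "Retrieval-augmented generation prepends retrieved passages to a prompt.",
]
query = "Why doesn't retrieval stop a model from making things up?"

# Step 1: retrieve the most similar passage (TF-IDF cosine similarity
# stands in for the dense-vector search a real system would use).
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])
best = cosine_similarity(query_vector, doc_vectors).argmax()

# Step 2: assemble the augmented prompt. This is all that RAG adds.
prompt = f"Context: {corpus[best]}\n\nQuestion: {query}\nAnswer:"
print(prompt)

# Step 3 (not shown): the prompt goes to the same generative model,
# which still samples tokens probabilistically. The "generation" in
# RAG is exactly the mechanism that can confabulate.
```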

Generative transformer technology makes predictions over dense training data based on frequency and pattern. Put plainly, it guesses what should appear as an answer to a given query. Related to this, there is controversy over the claim that it sometimes performs no synthesis at all: it picks a specific text, lifts it in its entirety, and presents it as a "generated answer." This is what The New York Times lawsuit against OpenAI is about. The interface shows the answer appearing a few characters at a time, creating the illusion of generation, but many people report that their work was not just illegally scraped and used to "teach" the model (another anthropomorphizing word); the model simply regurgitates that content verbatim.
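
For readers who want to see the "guessing" for themselves, here is a minimal next-token loop, assuming the Hugging Face transformers library and the small public gpt2 checkpoint (a stand-in, since the internals of proprietary models can't be inspected). Nothing in the loop consults a vetted answer; each step simply appends whichever token the trained weights score as most probable.

```python
# Greedy next-token generation with a small public model.
# Assumes: pip install transformers torch; gpt2 stands in for larger models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The snake swallowing its tail is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        logits = model(ids).logits          # scores for every vocabulary token
        next_id = logits[0, -1].argmax()    # pick the single most likely token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
# At no point is an answer "looked up"; the continuation is whatever
# sequence of tokens the trained weights score as most probable.
```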

ChatGPT's apologists claim this is not the case and will condescendingly imply that it's because the people making the claim don't understand the tech. They insist it does not plagiarize or copy and paste. It is, however, becoming a challenging thing to prove. Plagiarism detectors (even ones that predate ChatGPT) now appear to use GPT technology, and they all seem to have trouble with recent journalism and with text that comes from ChatGPT. They also return erratic results on text found on the internet that GPT itself generated before it was published in an article or on a website. Did you follow that? GPT, whether by fault or by design, will not flag its own text as plagiarism even when you can manually prove that it is. It becomes a hall of mirrors fairly quickly, doesn't it? The snake swallowing its tail. We've been discussing this idea for a while: when a GPT is fed a steady diet of its own content, the model degrades; it literally cannot handle its own synthetic data well. Meanwhile, the internet is accelerating toward being mostly GPT-generated text, with some predictions putting the figure at 90% by 2026.
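
Proving verbatim reuse "manually," as I said above, doesn't require a detector at all. A naive n-gram overlap check will surface any long word-for-word sequence shared between a model's output and a suspected source. The example strings below are invented, but this is a sketch of the kind of check we mean:

```python
# Naive verbatim-overlap check: report long word n-grams shared between
# a generated text and a suspected source. Example strings are invented.

def shared_ngrams(generated: str, source: str, n: int = 8) -> set[str]:
    """Return every n-word sequence that appears in both texts."""
    def ngrams(text: str) -> set[str]:
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(generated) & ngrams(source)

generated = ("the quick brown fox jumps over the lazy dog while the "
             "cat watches from the fence")
source = ("reports say the quick brown fox jumps over the lazy dog "
          "every single morning")

for fragment in shared_ngrams(generated, source):
    print("verbatim match:", fragment)
```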

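The degradation we're describing has a well-known toy illustration (a crude analogue, not a claim about any specific GPT): fit a model to data, sample from the model, refit on the samples, and repeat. In the sketch below, a Gaussian plays the model, and dropping the rarest 10% of samples each generation stands in for a language model's tendency to overweight its high-probability output. Watch the diversity, measured as standard deviation, drain away.

```python
# Toy illustration of model collapse: each generation is trained only on
# samples from the previous generation, and (like a model that favors
# high-probability continuations) the rare tail samples are lost first.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1000)  # generation 0: real data

for generation in range(8):
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation}: std={sigma:.3f}")
    samples = rng.normal(loc=mu, scale=sigma, size=1000)
    # Keep only the 90% most typical samples: the tails of the
    # distribution disappear first, so each refit sees less diversity.
    central = np.abs(samples - mu) < np.quantile(np.abs(samples - mu), 0.9)
    data = samples[central]
```
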
We are digging into this a little more. We want to replicate the results independently across different machines. Why different machines? Because I also have evidence that ChatGPT lifted its own text, and mine, directly from prior chats in my history to answer one of my queries.

In the meantime, we'd like to know whether you are following what we're saying and running your own experiments. We have more to come. For now, I suspect things will get stranger before they get clearer.

Singular XQ is 100% publicly funded by individuals who support free and open research into emerging technology and its social and cultural impacts. Consider sponsoring our podcast or subscribing to the paid version of this newsletter to support us. As a fiscally sponsored 501(c)(3), we can provide tax deductions if needed.