What is ChatGPT's "deep research"?
In February of this year, you could be excused for missing the announcement of ChatGPT's "deep research" offering. Built on OpenAI's o3 "reasoning" model, it is, according to the announcement on their website, "an agent that uses reasoning to synthesize large amounts of online information and complete multi-step research tasks for you. Available to Pro users today, Plus and Team next."
The trouble is that no one has substantiated the claim to reasoning. A loud and credentialed crowd of AI professionals and scholars argues that LLMs don't reason and that reasoning is not what they are designed to do. According to this team of Apple researchers, there is no evidence that these mathematical models perform reasoning tasks at all, though they may give the uninitiated the illusion that they do. Instead, they are "fragile" (ibid.) mathematical models (changing a single name in a problem can decrease accuracy by more than 10%, for example) that can only replicate patterns they have already seen. Further, they don't demonstrate "understanding" of their own mathematical operations. To quote the conclusion of the paper, in which the authors introduce a new benchmark for reasoning:
The high variance in LLM performance on different versions of the same question, their substantial drop in performance with a minor increase in difficulty, and their sensitivity to inconsequential information indicate that their reasoning is fragile. It may resemble sophisticated pattern matching more than true logical reasoning. We remind the reader that both GSM8K and GSM-Symbolic include relatively simple grade-school math questions, requiring only basic arithmetic operations at each step. Hence, the current limitations of these models are likely to be more pronounced in more challenging mathematical benchmarks.
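What "fragile" means in practice is easy to demonstrate. The GSM-Symbolic approach generates variants of a single grade-school problem that differ only in surface details like names and numbers, then checks whether accuracy holds across them. Here is a toy sketch of that perturbation idea in Python; this is our own illustration, not the Apple authors' code.

```python
# Toy illustration of GSM-Symbolic-style perturbation: generate
# variants of one grade-school problem by swapping names and numbers.
import random

TEMPLATE = "{name} picked {n} apples and gave away {k}. How many are left?"

def make_variants(count: int) -> list[tuple[str, int]]:
    """Return (question, answer) pairs that differ only in surface details."""
    names = ["Sophie", "Liam", "Priya", "Mateo"]
    variants = []
    for _ in range(count):
        n = random.randint(10, 50)
        k = random.randint(1, 9)
        question = TEMPLATE.format(name=random.choice(names), n=n, k=k)
        variants.append((question, n - k))  # the underlying logic never changes
    return variants

for q, a in make_variants(3):
    print(q, "->", a)
```

A system that genuinely reasons should score identically on every variant, since the logic never changes; the paper found large accuracy swings from exactly these surface edits.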
We advise our network not to accept claims to reasoning, understanding, or logic in any of these tools. There is no consensus among professionals or academics that these tools can perform high-level tasks like these. We decided to do our own quick-and-dirty research. What did we find?
First, how does it work?
Mechanically, it doesn't work much differently from what came before. It is prompt-based: given a research question, it performs a multi-step search and eventually produces a well-formatted, nicely laid-out report.
The claim is that by automating search and synthesis it can gather diverse information and replace hours of sometimes tedious work, with more depth and breadth than a single human researcher could manage.
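OpenAI hasn't published the internals, but the shape of the system is easy to picture: a loop of search, read, and re-plan, followed by a synthesis pass. Here is a minimal sketch in Python of that kind of agent loop; every function in it is a stub we invented for illustration, not OpenAI's code or API.

```python
# A minimal sketch of a prompt-driven, multi-step research agent.
# search_web, read_page, and llm are stubs invented for illustration;
# none of this is OpenAI's actual implementation.

def search_web(query: str) -> list[str]:
    """Stub: a real agent would call a search engine API here."""
    return [f"https://example.com/search?q={query}"]

def read_page(url: str) -> str:
    """Stub: a real agent would fetch and summarize the page."""
    return f"(summary of {url})"

def llm(prompt: str) -> str:
    """Stub: a real agent would call a language model here."""
    return "DONE"

def deep_research(question: str, max_steps: int = 10) -> str:
    """Loop: search, read, decide whether to keep digging, then synthesize."""
    notes: list[str] = []
    query = question
    for _ in range(max_steps):
        for url in search_web(query)[:3]:
            notes.append(read_page(url))
        # Ask the model whether the notes suffice or what to search next.
        plan = llm(f"Question: {question}\nNotes: {notes}\n"
                   "Reply DONE if sufficient, else give the next search query.")
        if plan.strip() == "DONE":
            break
        query = plan
    # Final pass: synthesize the accumulated notes into a cited report.
    return llm(f"Write a cited report answering: {question}\nSources: {notes}")

print(deep_research("What is 'deep research'?"))
```

Nothing in that loop is new; the "reasoning" claim rests entirely on the model's decisions about what to search next and how to synthesize what it finds.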
So what's the problem?
OpenAI presents it for a variety of academic and business use cases. For academic use cases (and, we'd argue, for business ones as well, since the standards should be the same), the problem is that ChatGPT is legally barred from accessing academic journals. No literature review it conducts will include research from the leading publications in the field.
We surveyed some people in the field who have tried to use it. They found the results opaque, and then turned to Grok and Mistral to see whether those tools could correct and explicate the results in more depth.
In our own experimentation, we found serious mis-citations and inaccuracies, which is what we have come to expect from all LLMs given the inherent "fragility" the paper cited above describes. Whatever time the tool saves in doing the work, it adds back in validating and correcting what it produces. As one of our own researchers, who asked not to be named, put it:
Discovery is the fun work for people who do research; editing someone else's questionable work isn't useful at all, except perhaps to augment your own traditionally built working bibliography. Epic fail.
So it may produce the report in under five minutes, but it will take an hour to read and several more hours to cross-reference and validate. And if you are an academic, you still have to do a formal literature review, because the journals in your field haven't been searched yet.
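Some of that cross-referencing can be scripted. As one small example, here is a sketch that checks whether the DOIs in a generated bibliography actually resolve, using the public Crossref API; the two DOIs below are our own examples (one real, one deliberately fake), not output from deep research.

```python
# Check whether the DOIs in a generated bibliography resolve,
# via the public Crossref REST API (no authentication required).
import json
import urllib.error
import urllib.request

def check_doi(doi: str) -> str | None:
    """Return the registered title for a DOI, or None if it doesn't resolve."""
    url = f"https://api.crossref.org/works/{doi}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            record = json.load(resp)
        return record["message"]["title"][0]
    except urllib.error.HTTPError:
        return None  # Crossref returns 404 for unknown DOIs

# One real DOI and one fabricated one, purely for illustration.
for doi in ["10.1038/nature14539", "10.0000/fabricated.citation"]:
    title = check_doi(doi)
    print(f"{doi}: {'OK - ' + title if title else 'NOT FOUND'}")
```

This only catches fabricated identifiers, though. Confirming that a real source actually supports the claim attached to it still takes a human reader, which is where the hours go.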
All of this comes at a cost of $200 a month, or $2,400 a year. The value doesn't appear to be there yet when you consider that its claims to reasoning are questionable at best, that validating and proofreading its output takes real work, and that it is unable to search the most relevant sources available.
"Deep research" might be a little too shallow for a 4 figure cost annually.
Singular XQ exists to provide open-access research and low-cost or pro bono training and career support for people facing barriers to careers in tech. Your subscriptions help us achieve our goals to upskill and educate the public and close the gap in technology. Consider upgrading today to help keep the public informed and educated, free of special interests.