AI text detectors: a stairway to heaven or hell?
The emergence of GPTZero, OpenAI’s text classifier and Turnitin’s AI detector brings a risk of over-reliance on AI classifiers. Are they a solution or a further problem to be solved?
It is claimed that artificial intelligence (AI) text classifiers are able to check if a text has been written by a human or by AI – and they are being developed by a variety of players, such as OpenAI, Turnitin and GPTZero. Clearly, these powerful tools are welcome given the current widespread use of ChatGPT, but, as with any AI tool, there are a few risks associated with their use. In particular, there is the risk of over-reliance – that is, of users blindly accepting an AI recommendation that might be wrong.
Can proper reliance be achieved?
The performance of AI text classifiers is reported using standard benchmarks, centred around the concepts of true positives and false positives, with which many of us became familiar during the pandemic when assessing the performance of Covid-19 tests. In the context of AI text classifiers, true positives are AI-written texts that are correctly flagged as such, while false positives are human-written texts that are wrongly assessed as written by AI. For example, OpenAI claims that:
“Our classifier correctly identifies 26% of AI-written text (true positives) as ‘likely AI-written’, while incorrectly labelling human-written text as AI-written 9% of the time (false positives).”
Another popular classifier, known as GPTZero, created by a senior at Princeton University, claims that it classifies “99% of the human-written articles correctly, and 85% of the AI-generated articles correctly”.
And more recently, Turnitin claims that its forthcoming classifier “identifies 97 per cent of ChatGPT and GPT3-authored writing, with a very low, less than 1/100 false-positive rate”.
Although these summary metrics are an excellent reference, it is important to appreciate what aspects they miss.
First, the reported performance, as measured by true-positive and false-positive rates, might change across strata. In other contexts where these benchmarks are used, such as in medical statistics, they are often reported over different subpopulations, because it is well known that the accuracy of a medical classifier can change, for example, along with a patient’s age.
Similarly, in the context of AI text classifiers, performance can vary across different fields, such as biology, philosophy and the like; it can also depend on the number of characters in the text, and so on. Proper reliance on AI text classifiers thus requires users to be aware of this caveat, noting that the reported metrics might not apply exactly to their field. Hopefully, edtech will provide additional information on this soon, but for the moment there is scant detail available.
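To make the point about strata concrete, here is a minimal sketch of how true-positive and false-positive rates could be reported per field. The function name `rates_by_stratum` and the records in `TOY_RECORDS` are hypothetical: the counts are made up purely for illustration and are not measurements of any real detector.

```python
from collections import defaultdict


def rates_by_stratum(records):
    """Compute true- and false-positive rates separately for each stratum.

    `records` is an iterable of (stratum, is_ai, flagged) tuples, where
    `is_ai` is the ground truth and `flagged` is the classifier's verdict.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for stratum, is_ai, flagged in records:
        if is_ai:
            key = "tp" if flagged else "fn"
        else:
            key = "fp" if flagged else "tn"
        counts[stratum][key] += 1

    rates = {}
    for stratum, c in counts.items():
        n_ai = c["tp"] + c["fn"]
        n_human = c["fp"] + c["tn"]
        rates[stratum] = {
            # Share of AI-written texts that were flagged as AI-written.
            "tpr": c["tp"] / n_ai if n_ai else None,
            # Share of human-written texts that were wrongly flagged.
            "fpr": c["fp"] / n_human if n_human else None,
        }
    return rates


# Made-up records (illustration only, not real detector results).
TOY_RECORDS = (
    [("biology", True, True)] * 8 + [("biology", True, False)] * 2
    + [("biology", False, True)] * 1 + [("biology", False, False)] * 9
    + [("philosophy", True, True)] * 5 + [("philosophy", True, False)] * 5
    + [("philosophy", False, True)] * 3 + [("philosophy", False, False)] * 7
)

for field, r in rates_by_stratum(TOY_RECORDS).items():
    print(f"{field}: TPR = {r['tpr']:.2f}, FPR = {r['fpr']:.2f}")
```

In this invented example, a single pooled figure would hide the fact that the classifier performs noticeably worse in one field than in the other, which is exactly the information a marker in that field would need.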
Second, it is important to emphasise that the reported performance tells us nothing about the probability that a specific text is AI-written, given that the classifier claims it was AI-written. On the contrary, the reported rates are informative only about the reverse – that is, about the probability that the classifier claims a text is AI-written, given that it was indeed AI-written. The former probability also depends on how common AI-written text is among the texts being checked, which the reported rates do not capture. In other words, the reported rates are not directly informative about the likelihood that a specific text was truly written by AI; rather, they are mainly informative about the general performance of the AI text classifiers.
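The distinction between these two probabilities can be made concrete with Bayes' theorem. The sketch below takes the true-positive and false-positive rates quoted earlier for OpenAI (26% and 9%) and GPTZero (85%, with 1% inferred from its claim of classifying 99% of human-written articles correctly). The prevalence values, meaning the share of submitted texts that are actually AI-written, are hypothetical assumptions chosen only for illustration, as is the function name `prob_ai_given_flagged`.

```python
def prob_ai_given_flagged(tpr, fpr, prevalence):
    """P(text is AI-written | classifier flags it), via Bayes' theorem.

    tpr:        P(flagged | AI-written), the reported true-positive rate
    fpr:        P(flagged | human-written), the reported false-positive rate
    prevalence: P(AI-written) among the texts being checked (an assumption)
    """
    flagged_and_ai = tpr * prevalence
    flagged_and_human = fpr * (1.0 - prevalence)
    return flagged_and_ai / (flagged_and_ai + flagged_and_human)


# Rates as quoted in the article; the GPTZero false-positive rate is
# inferred from "99% of the human-written articles correctly".
REPORTED_RATES = {
    "OpenAI": (0.26, 0.09),
    "GPTZero": (0.85, 0.01),
}

# Hypothetical prevalence values: nobody knows the true share.
ASSUMED_PREVALENCES = (0.01, 0.10, 0.50)

for name, (tpr, fpr) in REPORTED_RATES.items():
    for prevalence in ASSUMED_PREVALENCES:
        p = prob_ai_given_flagged(tpr, fpr, prevalence)
        print(f"{name}, prevalence {prevalence:.0%}: "
              f"P(AI | flagged) = {p:.1%}")
```

Under the assumption that 10% of texts are AI-written, OpenAI's reported rates imply that only about 24% of flagged texts would actually be AI-written; at an assumed 50%, the figure rises to about 74%. The same classifier therefore supports very different conclusions depending on a quantity that the reported rates do not contain.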
Seeing beyond reported performance measures
As noted above, GPTZero reports far superior performance to that of OpenAI’s classifier in terms of true and false positives. However, keeping in mind the disclaimers made earlier about what these summary measures miss, let’s run two simple experiments on GPTZero and OpenAI. Let’s start with Stairway to Heaven, a song by Led Zeppelin that was released in 1971, more than 50 years before the rise of ChatGPT. The outcome of GPTZero for this song is: “Your text may include parts written by AI”, and it highlights the following parts:
Ooh ooh ooh ooh ooh
And she’s buying a stairway to heaven
There’s a sign on the wall
But she wants to be sure
…
Oh whoa-whoa-whoa, oh-oh
If there’s a bustle in your hedgerow, don’t be alarmed now
It’s just a spring clean for the May Queen
Yes, there are two paths you can go by, but in the long run
And there’s still time to change the road you’re on
And it makes me wonder
Oh, whoa
Your head is humming and it won’t go
In case you don’t know
Robert Plant and Jimmy Page can rest assured that AI will not bring them another long copyright case such as the one the band has already faced over this song. Evidently, this is a false positive from GPTZero. Running the same lyrics through the OpenAI Text Classifier yields: “The classifier considers the text to be unlikely AI-generated.” Nevertheless, an important question that a marker would ask remains unanswered: How unlikely? What is the actual probability of that text being AI-generated?
Let’s now consider Bohemian Rhapsody by Queen. Again, GPTZero claims that “your text may include parts written by AI”, and it highlights the following parts:
He’s just a poor boy from a poor family,
Spare him his life from this monstrosity
Easy come, easy go, will you let me go
Bismillah!
No, we will not let you go
OpenAI’s classifier, on the other hand, reports that “the classifier considers the text to be very unlikely AI-generated”.
The much-needed AI text detectors are welcome, but we will need to keep a holistic view in mind whenever we are judging academic misconduct cases. When assessing cases of poor scholarship, educators must rely on human expertise and judgement and regard these classifiers as add-ons whose conclusions require critical analysis. We must remember that it is much more harmful to wrongly flag human-written text as AI-written than to miss AI-written text.
Regardless of their potential, these classifiers must never be accepted as error-free oracles that accurately classify all human or AI-written text. If universities are going to use these tools extensively for marking, training should be offered in order to mitigate the risk of inappropriate use. Edtech companies will benefit from offering further detail on the performance of their detectors, and from inviting scrutiny by external researchers. Proper reliance will require combined action from both edtech and its users.
*The advice in the article is that of the author, and it does not necessarily reflect the University of Edinburgh’s position on the subject.
Miguel de Carvalho is a reader in statistics at the University of Edinburgh.