ChatGPT Forces Rethink of Traditional Student Exams

The Exam That Writes Itself

In March 2023, a professor at a European university sat down to grade a stack of final essays. One submission stood out. It was polished, confident, and perfectly structured. It cited sources the professor had never seen. When he checked, those sources did not exist. The student had used ChatGPT. The essay read flawlessly. It was also built on invented evidence.

This is not a story about cheating. It is a story about what happens when the tool students use to cheat becomes indistinguishable from the tool they use to think.

The paper that sparked this reckoning, "ChatGPT: Bullshit spewer or the end of traditional assessments in higher education?" by Jürgen Rudolph, Samson Tan, and Shannon Tan (2023), published in the Journal of Applied Learning & Teaching, has already been cited over 1,600 times. It is one of the first peer-reviewed academic articles to take ChatGPT seriously as a threat to how we evaluate learning. And its central argument is both simple and terrifying: the traditional exam, as we know it, may already be dead.

What ChatGPT Actually Does to an Exam

Rudolph, Tan, and Tan (2023) did not just theorize about ChatGPT. They sat down and tested it. They fed it real exam questions from university courses. They asked it to write essays, solve problems, and produce arguments. The results were unsettling.

The chatbot could generate text that was indistinguishable from a human student's work. It could answer questions about literature, history, philosophy, and even some technical subjects. It could argue both sides of a debate. It could produce a passable essay on a topic it had never seen before, using citations that looked real but were often fabricated.

The authors describe ChatGPT as a "bullshit spewer" in the title of their paper, borrowing philosopher Harry Frankfurt's definition of bullshit: speech intended to persuade without regard for truth. ChatGPT does not care if what it says is true. It cares only that the output looks like something a human would write. And it is very, very good at that.

This matters because most traditional exams, especially take-home essays and open-book tests, rely on exactly the kind of work ChatGPT can fake: coherent prose, logical structure, and plausible claims. If a machine can produce that in seconds, what exactly are we testing?

The Three Kinds of Assessment That Now Fail

Rudolph, Tan, and Tan (2023) break down the problem into three categories, each corresponding to a type of assessment that ChatGPT can defeat.

The Take-Home Essay

This is the most obvious victim. A student assigned a 2,000-word essay on the causes of World War I can now paste the prompt into ChatGPT and receive a complete draft in under a minute. The essay will have an introduction, body paragraphs, and a conclusion. It will use academic language. It will cite sources, even if those sources are sometimes invented. The student can then edit it slightly and submit it. The professor will likely never know.

The authors note that ChatGPT does not just produce text. It can also revise it, rewrite it in different styles, and even argue against its own conclusions. This means a student can generate multiple drafts, pick the best one, and claim it as their own work.

The Open-Book Exam

Open-book exams were supposed to test higher-order thinking: analysis, synthesis, evaluation. The idea was that if students could look up facts, the exam should test what they could do with those facts. ChatGPT undermines this completely.

Rudolph, Tan, and Tan (2023) found that ChatGPT can analyze texts, compare arguments, and even generate original insights, at least at the level of a competent undergraduate. It can synthesize information from multiple sources. It can evaluate the strengths and weaknesses of an argument. These are exactly the skills open-book exams were designed to measure.

The In-Person Essay

Even the traditional in-person, timed essay is not safe. A student can memorize an outline generated by ChatGPT and then reproduce it in the exam room. Or, if the exam allows internet access, they can use ChatGPT in real time. The authors point out that many universities still allow laptops in exam halls, and ChatGPT is accessible through any web browser.

The assessment that currently seems most resistant is the oral exam, where a student must defend their ideas in real time. But even that is not foolproof. A student can memorize ChatGPT-generated answers and deliver them convincingly.

The One Thing ChatGPT Cannot Do

Here is the twist that Rudolph, Tan, and Tan (2023) emphasize repeatedly: ChatGPT is terrible at anything that requires actual understanding.

It cannot reason through novel problems. It cannot explain its own logic. It cannot admit when it is wrong. It cannot learn from its mistakes in any meaningful way. It is, in the authors' words, a "bullshit spewer" that produces plausible text without any underlying comprehension.

This is the key insight that most discussions of ChatGPT miss. The chatbot is not intelligent. It is fluent. And fluency is not the same as understanding.

But here is the problem: most of our exams test fluency, not understanding. They test whether a student can produce a coherent argument, cite sources correctly, and follow academic conventions. These are exactly the skills ChatGPT has mastered. The skills we actually care about, like critical thinking, creativity, and the ability to reason from first principles, are rarely tested directly.

How the Study Was Done

Rudolph, Tan, and Tan (2023) used a mixed-methods approach. They conducted an extensive literature review of existing research on AI in education, covering over 100 sources. They then experimented with ChatGPT directly, testing its ability to answer questions from various academic disciplines.

The authors do not report specific numbers of test questions or exact success rates. Instead, they describe qualitative findings: ChatGPT could produce passable answers to most standard exam questions, but it struggled with questions that required specific factual knowledge, logical reasoning, or awareness of its own limitations.

They also analyzed ChatGPT's output for style, structure, and plausibility. They found that the chatbot's writing was often indistinguishable from a human student's, especially in subjects where the expected answer follows a predictable format.

This methodology has limitations. The authors tested only a subset of possible questions. They did not conduct a controlled experiment comparing ChatGPT's output to human students' work. And the chatbot itself has been updated since the paper was published, potentially changing its capabilities. But the core finding, that ChatGPT can produce convincing exam answers, has been echoed by subsequent studies.

What This Means for Universities

Rudolph, Tan, and Tan (2023) do not just diagnose the problem. They offer recommendations, and those recommendations are surprisingly radical.

Stop Testing What ChatGPT Can Do

The authors argue that universities should stop using assessments that ChatGPT can easily fake. This means eliminating most take-home essays, open-book exams, and even some in-person written exams. These assessments, they argue, are no longer valid measures of student learning because they cannot distinguish between human and machine output.

Start Testing What ChatGPT Cannot Do

Instead, universities should focus on assessments that require skills ChatGPT lacks. These include:

▸Oral exams and presentations, where students must defend their ideas in real time
▸Project-based assessments that require hands-on work, such as lab reports, coding projects, or creative portfolios
▸Collaborative assignments that test teamwork and communication
▸Reflective writing that draws on personal experience, which ChatGPT cannot fake because it has no personal experience
▸In-person problem-solving sessions where students must work through novel problems without access to external tools

Redesign the Curriculum Around AI

The authors go further. They argue that universities should not just fight ChatGPT. They should embrace it. Students should be taught how to use AI tools effectively, how to evaluate their output critically, and how to combine human and machine intelligence.

This means teaching students to use ChatGPT as a research assistant, a brainstorming tool, or a writing coach, but also teaching them to recognize its limitations. It means designing assignments that require students to go beyond what ChatGPT can produce, to add their own analysis, creativity, and judgment.

What This Research Does Not Prove

The Rudolph, Tan, and Tan (2023) paper is not the final word on this topic. It is an early exploration, published in 2023 when ChatGPT was only months old. The authors themselves acknowledge several limitations.

First, the paper does not provide quantitative data on how often ChatGPT can pass specific exams. It describes qualitative findings but does not report success rates or statistical comparisons. This makes it hard to know exactly how big the threat is.

Second, the paper focuses on text-based assessments. It does not address other forms of evaluation, such as multiple-choice tests, practical exams, or performance-based assessments. ChatGPT may be less useful for these formats, but the authors do not explore this.

Third, the paper was published before many universities implemented AI detection tools. It is possible that these tools, while imperfect, can catch some ChatGPT-generated submissions. The authors do not discuss this possibility in detail.

Finally, the paper assumes that ChatGPT will remain roughly as capable as it was in early 2023. But the technology is evolving rapidly. Later versions of ChatGPT, as well as competing models like Google's Gemini and Anthropic's Claude, may be even better at generating convincing academic text. Or they may introduce new limitations. We do not know.

Despite these limitations, the paper's core argument, that ChatGPT forces a fundamental rethink of assessment, has held up remarkably well. It was one of the first to make this argument clearly and systematically, and it has shaped the conversation ever since.

The Deeper Problem: We Were Testing the Wrong Thing

The most unsettling implication of Rudolph, Tan, and Tan (2023) is not that ChatGPT can cheat. It is that our exams were already testing the wrong skills.

Think about what a typical university exam measures: the ability to recall information, organize it into a coherent argument, and present it in academic prose. These are useful skills, but they are not the skills that distinguish a great student from a mediocre one. They are not the skills that matter most in the real world.

What matters in the real world is the ability to ask the right questions, to evaluate conflicting sources of information, to think creatively about novel problems, to collaborate with others, to communicate complex ideas to different audiences. These are skills that ChatGPT cannot fake, at least not yet.

The authors suggest that ChatGPT is not the enemy. It is a mirror. It shows us how shallow our assessments have become. If a machine can pass our exams, maybe our exams were not testing much to begin with.

What This Actually Means

▸Stop assigning take-home essays as the primary form of assessment. They are no longer valid. If a student can generate a passing essay with ChatGPT, the essay does not measure what you think it measures. Replace them with oral exams, in-person problem-solving sessions, or project-based assessments that require hands-on work.

▸Teach students to use AI, not just to avoid it. The authors recommend integrating AI literacy into the curriculum. Students should learn how to use ChatGPT as a tool, how to evaluate its output, and how to combine it with their own thinking. This is not about giving up on academic integrity. It is about preparing students for a world where AI is everywhere.

▸Design assessments that require personal experience or original thinking. ChatGPT cannot write about what it has never experienced. It cannot describe the feeling of conducting a lab experiment, the frustration of debugging code, or the insight gained from a class discussion. Assessments that draw on these experiences are harder for AI to fake.

▸Use AI detection tools, but do not rely on them. The authors do not discuss detection tools in detail, but subsequent research has shown that they are unreliable. They produce false positives and false negatives, and they can be easily bypassed. The best defense against AI-generated work is to design assessments that AI cannot do, not to try to catch AI-generated work after the fact.

▸Rethink what you are actually testing. If ChatGPT can pass your exam, your exam is not testing what you think it is testing. The authors argue that this is an opportunity, not a crisis. It forces us to ask: what do we actually want students to learn? And how can we measure that? These are questions we should have been asking all along.

References

[1] Jürgen Rudolph, Samson Tan, Shannon Tan (2023). ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning & Teaching.