Medical Residency Selection Relies on Flawed Metrics

Medical residency selection relies on metrics like test scores and letters that poorly predict clinical performance. These flawed measures may worsen bias and miss key applicant qualities.

Karan Mehta

The USMLE Score You Sweated Over Probably Doesn’t Predict How Good a Doctor You’ll Be

Every spring, tens of thousands of medical students open an email from the National Resident Matching Program and learn where they will spend the next three to seven years of their lives. For many, that moment is the culmination of a process that felt less like a job interview and more like a gauntlet: four years of medical school, grueling clerkships, a personal statement rewritten seventeen times, and a standardized test—the United States Medical Licensing Examination (USMLE)—that can make or break a career.

But here’s the uncomfortable truth that a new systematic review has made impossible to ignore: many of the metrics residency programs use to select future doctors have shockingly weak evidence behind them. And some of the most heavily weighted factors—like USMLE scores and grades—should probably be playing a much smaller role.

The review, published in the Journal of Graduate Medical Education by J. Lipman, Colleen Y. Colbert, Rendell W. Ashton, and Judith C. French, examined 231 studies on the tools used to evaluate residency applicants. What they found is a system built on tradition, intuition, and hope rather than hard data (Lipman et al., 2023).

What Do Programs Actually Look At?

Residency selection in the United States is a high-stakes matching game. Programs receive thousands of applications for a handful of spots and need some way to winnow the pile. So they turn to the Electronic Residency Application Service (ERAS), which collects everything from test scores to volunteer work.

The review categorized these metrics into ten domains: research productivity, awards, USMLE scores, personal statements, letters of recommendation, medical school transcripts, work and volunteer experiences, medical school demographics, diversity/equity/inclusion factors, and the interview.

The authors then asked a simple question: Does the evidence actually support using these things to predict who will become a good resident?

The answer, it turns out, depends heavily on what you mean by “good.”

The Interview Is a Wildly Unreliable Bet

Let’s start with the interview, because it is the most universally used selection tool in residency. Almost every program interviews candidates. It feels personal. It feels human. It feels like the one part of the process where you can actually get to know someone.

The evidence says otherwise.

Lipman and colleagues found that the interview has “low quality” support for predicting future performance (Lipman et al., 2023). That does not mean interviews are useless. It means the way most programs conduct them—unstructured conversations where interviewers ask whatever comes to mind—produces results that are about as reliable as reading tea leaves.

Structured interviews, where every candidate is asked the same questions and scored on the same rubric, do better. But most programs do not use them. They rely on the gut feeling of a faculty member who might have met the applicant for twenty minutes over Zoom.

The review also found that interviews are particularly susceptible to bias. Applicants from underrepresented backgrounds may be judged more harshly or more leniently depending on the interviewer’s unconscious assumptions. The interview, in other words, is a tool that feels objective but is anything but.

USMLE Scores: The Sacred Cow

If there is one metric that residency programs treat as gospel, it is the USMLE score, from both Step 1 and Step 2. These three-digit numbers are used to screen out applicants before they even get a second look. Programs post minimum cutoffs. Students obsess over every point.

The review suggests this obsession is misplaced.

Lipman and colleagues found that USMLE scores have a “limited role” in predicting resident performance (Lipman et al., 2023). Yes, higher scores correlate slightly with better performance on in-training exams. But the correlation with clinical skills, patient outcomes, or professionalism is weak to nonexistent.

This matters because the USMLE is a known source of inequity. Students from wealthier backgrounds can afford expensive prep courses and take time off to study. Students who work during medical school or who have family obligations often cannot. The test measures test-taking ability, not doctoring ability.

In 2022, USMLE Step 1 changed from a three-digit score to a pass/fail format, partly in response to these concerns. But Step 2 still reports a numeric score, and many programs have simply shifted their weight to that exam. The review suggests this is a mistake.

Grades and Rankings: The Emperor’s New Clothes

Medical school grades and class rankings are another sacred cow. Programs want to know where an applicant stood relative to their peers. They treat honors grades as proof of excellence.

The evidence does not support this.

The review found that grades and national rankings should have a “limited role” in selection (Lipman et al., 2023). The problem is that grades are not standardized. An “honors” grade at one medical school might mean something very different at another. Some schools grade on a curve. Others do not. Some inflate grades aggressively. Others are stingy.

Worse, grades can reflect bias. Studies have shown that underrepresented minority students and women receive lower clinical grades on average, even when their performance is objectively similar to peers. Using grades as a primary filter can systematically exclude talented candidates.

The Personal Statement: A Masterpiece of Wishful Thinking

Every residency applicant writes a personal statement. It is supposed to reveal their character, their motivations, their unique journey. Programs read them hoping to find the diamond in the rough.

The review found that personal statements have “low quality” support for predicting future performance (Lipman et al., 2023). They are easy to fake, hard to verify, and often indistinguishable from one another. Most applicants write about the same things: a formative patient encounter, a family member’s illness, a desire to serve underserved communities. The statements are sincere, but they are not predictive.

Some programs have started using personal statements to screen for “red flags” like poor writing or unprofessional language. That is a reasonable use. But using them to rank candidates? The evidence says no.

Letters of Recommendation: Who You Know Matters

Letters of recommendation are one of the few metrics that the review found to have some positive evidence. But even here, the picture is complicated.

The quality of a letter depends heavily on who writes it. A letter from a well-known academic physician carries more weight than one from a community preceptor, regardless of what the letter actually says. This creates an advantage for students at elite institutions who have access to famous letter writers.

The review found that letters can predict performance, but only when they are structured and specific (Lipman et al., 2023). Vague letters that say “this student is excellent” are useless. Letters that describe concrete behaviors and skills are better. But most programs do not require structured letters, so the variability is enormous.

What About Diversity, Equity, and Inclusion?

One of the review’s most important contributions is that it explicitly examined the DEI implications of each metric. This is rare in the literature.

The authors found that many traditional metrics systematically disadvantage applicants from underrepresented backgrounds. USMLE scores, grades, and awards all show significant disparities by race, ethnicity, and socioeconomic status (Lipman et al., 2023). Using these metrics as primary filters reduces the diversity of the resident pool.

Metrics that showed less bias included research productivity, volunteer experience, and the Medical Student Performance Evaluation (MSPE), which is a summary letter from the medical school. The MSPE, in particular, was found to have moderate predictive value without the same degree of demographic bias.

The review recommends that programs explicitly consider DEI when designing their selection processes. That means weighting metrics that are less biased and being transparent about how decisions are made.

What This Research Does NOT Prove

It would be easy to read this review and conclude that residency selection is a complete mess with no reliable tools. That is not quite right.

The review does not say that all metrics are useless. It says that most metrics have low-quality evidence behind them. That is a different claim. It is possible that some of these tools work better than the studies show, but the studies are too weak or too small to prove it.

The review also does not say that programs should ignore all data. It says they should be more skeptical of the data they use and more deliberate about how they combine it.

Most importantly, the review does not offer a simple alternative. It does not say “just use the interview and ignore everything else.” It says the evidence is mixed and that programs need to think harder about what they value.

How the Study Was Done

This was a systematic review, meaning the authors searched multiple databases for every study that examined the link between an ERAS metric and a residency outcome. They found 2,599 unique studies, reviewed them, and included 231 that met their criteria. Two authors independently reviewed each study, and a third resolved disagreements. The agreement between reviewers was moderate (kappa=0.53), which is typical for this kind of work.
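
To make the kappa figure concrete: Cohen's kappa compares the agreement two raters actually reached (p_o) with the agreement they would reach by chance given how often each uses each label (p_e), as kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration only. The include/exclude decisions are invented for the example and are not the review's data, and the function name `cohens_kappa` is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, multiply the two raters' usage rates,
    # then sum over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, so returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical screening decisions for ten abstracts (not data from the review).
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude", "include"] + ["exclude"] * 5

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the two reviewers agree on 8 of 10 abstracts, but because chance alone would produce 52% agreement, kappa comes out near 0.58 rather than 0.80. That gap is why raw percent agreement tends to overstate reliability.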

The review is thorough and well conducted. It is limited by the quality of the underlying studies, many of which are small, single-institution, or poorly controlled. But that is precisely the point: the evidence base itself is weak.

What This Actually Means

  • Stop treating USMLE scores as a proxy for clinical ability. They measure test-taking, not doctoring. If you use them at all, use them as a modest minimum threshold, not a ranking tool. Programs that set high score cutoffs are likely excluding excellent candidates.
  • Restructure the interview. The review found only low-quality evidence that unstructured interviews predict performance. Switch to structured interviews with standardized questions and scoring rubrics. Train interviewers on bias. Measure interrater reliability.
  • Weight the MSPE more heavily. The Medical Student Performance Evaluation has moderate predictive value and less demographic bias than grades or test scores. It is not perfect, but it is better than the alternatives.
  • Be transparent about what you value. Programs should publish their selection criteria and explain why they use each metric. This allows applicants to make informed decisions and holds programs accountable.
  • Audit your own process. Track which metrics actually predict performance in your program. If USMLE scores do not correlate with resident evaluations, stop using them as a primary filter. The evidence is clear: tradition is not a good reason to keep doing something that does not work.
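
The audit in the last recommendation can start very simply: pair each resident's selection metric with a later performance measure and check how strongly the two move together. The sketch below assumes a program has Step 2 scores and mean faculty evaluation ratings for the same residents. The numbers, the variable names, and the helper `pearson_r` are all hypothetical and are not drawn from the review.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]

    # Sums of cross-products and squared deviations (the shared 1/n cancels).
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical records for six residents (illustrative values only).
step2_scores = [230, 245, 250, 238, 262, 255]
faculty_ratings = [3.8, 3.5, 4.1, 4.0, 3.7, 3.9]  # assumed 1-5 evaluation scale

r = pearson_r(step2_scores, faculty_ratings)
print(f"r = {r:.2f}, share of variance explained = {r * r:.2f}")
```

Two cautions apply to any local audit like this. A single program's cohort is small, so the estimate will be noisy. And a program only observes the applicants it accepted, which narrows the range of scores and tends to shrink the correlation. Treat the result as a prompt for discussion, not a verdict.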

References

  1. J. Lipman, Colleen Y. Colbert, Rendell W. Ashton, Judith C. French (2023). A Systematic Review of Metrics Utilized in the Selection and Prediction of Future Performance of Residents in the United States. Journal of Graduate Medical Education.

Karan Mehta

Business researcher and analyst covering technology disruption, market dynamics, and startup ecosystems.

Reader Comments (2)

Dr. Ananya Sharma★★★★★

Interesting, but you didn’t address the Indian NEET-PG cutoff obsession. We see rank lists prioritized over clinical aptitude. Did your data include any Asian programs? Would love a follow-up on how regional exam weightings distort selection.

Ravi Iyer★★★★★

As a former residency interviewer, I’ve seen stellar candidates rejected due to arbitrary publication thresholds. Metrics like h-index for fresh graduates? Nonsense. We need structured interviews and situational judgment tests, not just number games.
