
OpenAI o1 Series Model Outperforms Harvard Physicians in Beth Israel Deaconess ER Diagnoses
Key Takeaways
- OpenAI's o1 model outperformed physicians in emergency room diagnoses in Harvard/Beth Israel Deaconess study.
- The evaluation used real ER cases with disorganized data and missing information.
- Safety concerns and limitations accompany AI deployment, cautioning against replacing doctors.
Study finds AI beats doctors
A Harvard-led study published in Science found that an OpenAI reasoning model, identified as part of the “o1 series,” outperformed physicians across emergency-room diagnosis tasks that included “messy emergency room cases drawn from real records.”
In 76 emergency room records from Beth Israel Deaconess Medical Center (BIDMC) in Boston, the model faced “scattered notes, missing details, and early decisions made before a diagnosis was confirmed,” and still produced higher diagnostic accuracy than doctors at each stage of care.

At triage—the first sorting step—the model named an exact or very close diagnosis in 67.1 percent of cases, while attending physicians’ scores remained lower.
After an emergency physician gathered more information, the rate rose to 72.4 percent, and it reached 81.6 percent at admission, with the study noting that “Early uncertainty, not polished textbook cases, became the pressure point.”
The model’s advantage also persisted in later stages where more information was available, and the study framed the result as pushing medical AI “beyond exam success and toward the harder question of whether it can be tested safely in hospitals.”
The report also emphasized that the system came from OpenAI’s “o1 series,” and that it was tested as a reasoning model that “listed likely diagnoses and suggested the next move in care.”
How the test was run
The study compared AI and physicians using the same information at defined moments in emergency care, and it stressed that the evaluation did not involve live doctor-patient interaction.
Earth reported that BIDMC records “did not get cleaned before the model saw them,” and that the team “didn’t pre-process the data at all,” leaving the electronic health records “messy” in the way real charts often are.

In the BIDMC real-record test, reviewers were blinded so they “did not know whether a diagnosis came from a human or model,” and the report said this “reduced favoritism” even though it “could not show whether the tool improves live patient care.”
The Guardian described a similar setup in which an AI and “a pair of human doctors” were each given “the same standard electronic health record,” including “vital sign data, demographic information and a few sentences from a nurse about why the patient was there.”
The Guardian further said the AI identified the exact or very close diagnosis in 67% of cases, and that the model’s accuracy “rose to 82% when more detail was available.”
CBC described the research as using OpenAI’s “o1-preview” model at three points—“initial triage, doctor examination in the ER and admission to the medical floor or intensive care unit”—and stated that “None of the testing involved actual doctor-patient interactions and had no effect on real diagnoses or treatments.”
Across these accounts, the study’s scope was explicitly limited to text-based reasoning, with the Guardian warning that “The study only tested humans against AIs looking at patient data that can be communicated via text.”
Voices urge caution and trials
The study’s findings triggered a set of cautions and calls for prospective evaluation, with multiple named clinicians and researchers emphasizing both the promise and the limits of what was tested.
Earth quoted lead author Arjun K. Manrai saying, “We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” and it described the work as pointing toward “prospective clinical trials” to test whether AI assistance changes patient outcomes during real visits.
Earth also included a safety framing from Dr. Peter G. Brodeur: “A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm,” and it stressed that “Safety therefore depends on the whole recommendation – not only the first name on the diagnosis list.”
Straight Arrow News quoted Dr. Adam Rodman saying he gets “a little bit queasy about how some of these results might be used,” and it also quoted Rodman: “No one should look at this and say we do not need doctors.”
CBC added a practical boundary from Dr. Nour Khatib, who said, “It’s just another tool to help us give the patient the highest quality care possible,” and it quoted Rodman explaining that “a reasoning model is different from your standard large language model because it has been instructed to think out loud, to solve problems like humans.”
The Guardian echoed the insistence that the results do not mean replacement, quoting Arjun Manrai: “I don’t think our findings mean that AI replaces doctors,” and quoting Rodman that AI would join physicians in a “triadic care model … the doctor, the patient, and an artificial intelligence system.”
In parallel, the UDG TV report noted that “their authors warn that these results do not mean AI systems are ready to practice medicine by themselves,” and it quoted Brodeur in a Harvard statement: “A model could be right about the main diagnosis but also suggest unnecessary tests that could endanger the patient.”
Different outlets emphasize different stakes
While the core result—that an OpenAI reasoning model outperformed physicians in emergency-room diagnosis tasks—appeared across outlets, the emphasis shifted between what the study measured and what it implied for real-world deployment.
Earth framed the work as a benchmark problem, noting that “We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent and we can’t track progress anymore because we’re already at the ceiling,” and it argued that the real question was whether messy records could be used safely in hospitals.

Semafor highlighted a forward-looking timeline, saying “In 10 years, AI agents will likely be commonplace in emergency medicine,” and it added a CEO perspective that it “borders on ‘malpractice’ if doctors do not use frontier models for second opinions already.”
The Guardian, by contrast, focused on the clinical reasoning framing and the limitations of text-only evaluation, stating that “The study only tested humans against AIs looking at patient data that can be communicated via text” and that “That means the AI was performing more like a clinician producing a second opinion based on paperwork.”
CBC centered on the practical workflow boundary, quoting Dr. Amol Verma calling it a “false comparison” to claim AI is “better than doctors,” and it stressed that “It’s the physical examination — how someone looks, sounds and feels — that forms a diagnosis.”
PhonAndroid emphasized the reluctance to “get carried away,” stating that “the authors of the study are the first to temper the enthusiasm,” and it pointed to limitations such as the model’s struggle with multimodal data, adding that “A randomized clinical trial remains indispensable before any real deployment with patients.”
Meanwhile, Earth and the Guardian both described a specific example involving a blood clot to the lungs and worsening symptoms, where the AI noticed a history of lupus, but the Guardian explicitly said “The AI was proved correct,” while Earth used the broader framing of early uncertainty and messy records.
What happens next for care
The study’s results were presented as creating immediate pressure for hospitals and regulators to move from retrospective benchmarks to prospective evaluation, because the work did not measure live patient outcomes and did not test non-text signals.
Earth said “Strong benchmark scores now create a practical problem for hospitals, regulators, developers, and patients who need proof,” and it called for “Prospective clinical trials” to test whether AI assistance changes patient outcomes during real visits.

Earth also warned that “Clinical care runs on more than text, and this test did not measure everything doctors notice,” listing “Voices, breathing effort, posture, images, family concerns, and bedside changes” as examples of cues not captured by the evaluation.
The Guardian similarly stated that “The study only tested humans against AIs looking at patient data that can be communicated via text,” and it added that signals “such as the patient’s level of distress and their visual appearance, were not tested.”
PhonAndroid echoed the deployment barrier by saying the authors “nuance these conclusions” and that “A randomized clinical trial remains indispensable before any real deployment with patients.”
In addition to clinical uncertainty, the Guardian raised accountability concerns, quoting Rodman that “There is not a formal framework right now for accountability,” and it said patients “want humans to guide them through life or death decisions [and] to guide them through challenging treatment decisions.”
The report also described a potential behavioral risk: Dr. Wei Xing said “some of the other findings suggested doctors may unconsciously defer to the AI’s answer rather than thinking independently,” and he argued that “This tendency could grow more significant as AI becomes more routinely used in clinical settings.”