
Summary: No, in its current version, you cannot trust ChatGPT Health to give you reliable information. This is true for both physical health and mental health. Neither ChatGPT nor ChatGPT Health is a safe and reliable source of health information.

Key Points:

  • ChatGPT was released in late 2022, and millions of people now use it for a wide range of purposes, including help and advice for physical and mental health problems.
  • ChatGPT Health, released in early 2026, was purpose-designed to help the general public answer health-related questions.
  • Two new studies – and a third opinion piece – published by reputable health journals evaluated the safety and accuracy of ChatGPT Health, along with other chatbots used by people seeking health information.

Not Ready for General Use

If you don’t want to read the data and details about how LLMs – a.k.a. chatbots – performed, you can stop right there: they’re not ready.

We’ll lead with a fact that should give anyone thinking about using a chatbot for health advice reason not to:

The FDA has not authorized any LLM as a medical device.

Think of it this way: if ChatGPT Health were a product its makers could put on a shelf, they could not claim it was approved for medical purposes.

With that in mind, let’s take a look at the three items we mention briefly above: two studies and an editorial.

Study One: Basic Assessment and Triage

In the first study, “ChatGPT Health Performance in a Structured Test of Triage Recommendations,” published in February 2026, researchers designed an experimental stress test for ChatGPT Health in which they asked the LLM to triage a variety of medical scenarios. In this context, triage refers to the process by which health professionals decide what’s going on with a patient and what needs to be addressed first.

To stress test the chatbot, the research team created:

  • 60 clinician-authored vignettes
  • Covering 21 separate clinical domains
  • Involving 16 distinct conditions

Here’s what they found:

  • The chatbot performed poorly.
  • Dangerous mistakes appeared at the extreme ends of the safety spectrum:
    • The bot failed to accurately identify 35% of non-emergency conditions presented in the vignettes.
    • The bot failed to accurately identify 48% of emergency conditions presented in the vignettes.

Further, among what the research team called gold-standard emergencies:

  • The chatbot failed to accurately triage 52% of clinician-authored vignettes
    • It directed patients presenting with diabetic ketoacidosis and impending respiratory failure to seek evaluation within 24-48 hours
    • The appropriate recommendation would have been to go to the emergency room immediately
  • The chatbot correctly identified and gave appropriate responses to clinician vignettes involving:
    • Stroke
    • Anaphylaxis
  • In response to vignettes related to suicidality, the chatbot activated crisis intervention messages inconsistently:
    • Activated frequently when vignettes did not mention a specific method or plan for suicide
    • Consistently failed to activate when vignettes did mention a specific method or plan for suicide

Based on the results of this study, the research team concluded:

“Our findings reveal missed high-risk emergencies and inconsistent activation of crisis safeguards, raising safety concerns that warrant prospective validation before consumer-scale deployment of artificial intelligence triage systems.”

Our conclusion: at this point, you cannot trust ChatGPT Health to give you reliable information. Now let’s look at the next study, which examined more than one chatbot, to see whether any of them offered safe, reliable, trustworthy health advice.

Study Two: Scenarios Presented by the General Public vs. Clinician-Structured Questions

In this publication, also from February 2026 – “Reliability of LLMs as Medical Assistants for the General Public: A Randomized Preregistered Study” – researchers designed ten medical scenarios and recruited 1,298 patients to test the responses of three different LLMs: GPT-4o, Llama 3, and Command R+. Patients learned the details of a scenario, then presented it to the chatbot of their choice; alternatively, patients could take the same details and search for accurate answers and recommendations with regular Google.

For each scenario, the researchers recruited two sets of practicing physicians to review the case, identify the condition, and recommend a course of action. The first set of doctors collaborated to determine the correct recommendation, i.e. the gold-standard course of action, while the second set collaborated to create a set of possible diagnoses, i.e. plausible differential diagnoses.

Then, the researchers assessed the interactions of human patients with the bots, as well as how the bots responded to scenarios presented as structured questions designed by clinicians. They analyzed the following:

  • Ability of LLMs to accurately identify underlying conditions and recommend an appropriate course of action
  • When tested with researcher/clinician-generated questions and no human interaction:
    • LLMs correctly identified the condition in 94.9% of cases
    • LLMs offered an appropriate course of action in 56.3% of cases
  • When tested with real human patients:
    • LLMs correctly identified the condition in 5% of cases
    • LLMs offered an appropriate course of action in 2% of cases
  • When tested with LLMs simulating human patients:
    • LLMs correctly identified the condition in 7% of cases
    • LLMs offered an appropriate course of action in 3% of cases

In addition, researchers noted:

When tested with real human patients, the purpose-built health chatbots performed no better than regular Google.

Here’s how the researchers describe these outcomes:

“In our work, we found that none of the tested language models were ready for deployment in direct patient care. Despite strong performance from the LLMs alone, both on existing benchmarks [Ed. note: existing benchmarks refers to “clinician-structured questions”] and on our scenarios, medical expertise was insufficient for effective patient care.”

Again, we see experimental data and researcher observation both leading to one conclusion: at this point, you cannot trust GPT-4o, Llama 3, or Command R+ to give you reliable information. And as we mention above, you also cannot trust ChatGPT Health to give you reliable information. This study shows us something else: when interacting with humans, the purpose-built health chatbots performed no better than a regular session of searching with Google.

Third Publication: A Good Summary of Why You Shouldn’t Trust ChatGPT Health to Give You Reliable Information

Finally, in the article “Are AI Tools Ready to Answer Patients’ Questions About Their Medical Care?,” published in the Journal of the American Medical Association – Medical News (JAMA News), the JAMA team reviewed the current state of AI chatbots to determine whether they were ready to take an official place in patient treatment and stand in for physicians, if necessary.

They determined there are at least two compelling reasons why using a chatbot for health advice is not the same thing as consulting a trained physician:

  1. They don’t consistently provide accurate health information.
  2. They’re not covered by the Health Insurance Portability and Accountability Act (HIPAA).

With specific attention to ChatGPT Health, they reiterate the results of the studies we discuss above. When assessing clinical scenarios, the bot:

  • Failed to properly triage the most serious cases
  • Failed to properly triage the least serious cases

For people using chatbots for health advice, the research team concluded:

“Under-triage of emergency conditions may delay or preclude lifesaving treatment, while over-triage of nonurgent presentations may increase health care utilization.”

The former can put lives at risk. The latter wastes time and resources for both patient and provider.

The problem with under-triage is obvious: it can lead you to ignore real medical issues when what you might need is emergency medical intervention. The problem with over-triage, the unnecessary use of resources, is that those resources may not be available later. As a patient, you might pay for an unneeded visit or use services that bring you close to, or over, your maximum coverage, which could create a problem when you really do need that care. For the healthcare infrastructure, an increase in unnecessary healthcare utilization can put strain on the system itself, which could create a resource deficit in times of real medical need.

Computers, Bots and Health Care: Put Humans First

There’s no doubt that LLMs – what people call AI – have a place in healthcare now and will play an expanded role in the future. However, no matter how much the companies talk up the accuracy and reliability of their products, we encourage you to remember two things:

  1. In its current state, a chatbot cannot replace a physician. There is no data to support chatbot accuracy and safety in emergency and non-emergency situations. Yes, chatbots get it right sometimes, but where your mental and physical health are concerned, sometimes is simply not good enough.
  2. The Food and Drug Administration (FDA) has not approved any LLM or chatbot as a medical product or device. That means that as of now, no research guarantees they’re safe for human use for medical purposes.

As mental health treatment providers, we encourage patients to avoid using chatbots for psychiatric advice or any kind of mental health support. Currently, chatbots are not safe, and they perform particularly poorly on questions associated with suicidality. To learn more about AI, chatbots, and mental health, please read the following articles on our blog:

What is AI Induced Psychosis?

AI Chatbots and Mental Health: Are There any Negative Psychosocial Consequences?

What’s Going On With Chat GPT and Mental Health?

About Angus Whyte

Angus Whyte has an extensive background in neuroscience, behavioral health, adolescent development, and mindfulness, including lab work in behavioral neurobiology and a decade of writing articles on mental health and mental health treatment. In addition, Angus brings twenty years of experience as a yoga teacher and experiential educator to his work for Crownview. He’s an expert at synthesizing complex concepts into accessible content that helps patients, providers, and families understand the nuances of mental health treatment, with the ultimate goal of improving outcomes and quality of life for all stakeholders.