As a 40-something-year-old Gen Xer, I had much of my pop culture view of the world shaped by the Golden Age of the Simpsons (the early to mid 90s). The show’s biting takes on family life, politics, religion, and other hot-button topics were, at the time, both unique and controversial. Eventually, South Park, Family Guy, and others came along to push the envelope further, making the Simpsons seem tame in comparison. (As the Simpsons helped beget these shows, Baby Boomers and older Gen Xers might argue that Homer, Marge, and crew were reflections of 70s and 80s shows like Family Ties or All in the Family, themselves not immune to controversy).
The Simpsons has also gained a place in Internet meme lore for its seemingly uncanny ability to predict the future. I’ll leave it for conspiracy theorists and those with more time on their hands to debate the source and strength of Matt Groening and company’s prognostic abilities. An irreverent show that’s kept its finger on the pulse of the American zeitgeist for 34 seasons and 748 episodes is bound to prove the adage “sometimes truth is stranger than fiction.”
That’s all well and good, but what does any of it have to do with LLMs and Orthopedics? Believe it or not, the Simpsons may have predicted the AI takeover of healthcare. The 17th episode of the 5th season of the show (“Bart Gets an Elephant” — originally aired in 1994), features a scene where the KBBL radio station manager threatens to replace two human DJs, Bill and Marty, with the “DJ 3000” — a large, automated computer complete with “three distinct varieties of inane chatter.” While Marty is impressed with the machine’s capabilities (and perhaps oblivious to the impending threat), Bill is clearly unnerved, warning: “don’t praise the machine.”
With the rise of LLMs in healthcare, the fast-growing movement to apply them in clinical situations, and the sometimes-uncomfortable relationship between tech and clinicians, my mind immediately drifted to this scene. Are ChatGPT and Bard med tech’s version of the DJ 3000? Are docs equivalent to Bill and Marty? Bill, the skeptic, represents doctors who dismiss or underestimate the threat of AI tools. Meanwhile, Marty reflects the naivety some docs have towards LLMs — awed but oblivious to their own potential obsolescence. If so, does the station manager represent insurance companies, health systems, tech companies, and VCs envisioning a medical future heavier on AI/LLMs and lighter on clinicians? Don’t praise the machine indeed.
To take the metaphor just a bit further, do the station manager and her healthcare counterparts overestimate the power of technology and underestimate the importance of human experience and expertise? Or can docs, much like the DJs, be sufficiently replaced by machines operating at a level that is perceived to be at least as good as the humans they’re replacing? This question is central to the growing debate over the degree to which AI/LLMs can and should replace physicians and other clinicians. Finally, as a doctor, is it important to figure out if you’re Bill, Marty, or someone else in this scenario? I’d like to think I exist somewhere between flat-out skepticism and blissful naivety. A little bit of Bill and Marty. I’m not quite ready to praise the machine, but I am ready to learn more about it, put it to the test, and begin figuring out where it fits. Or is it where I fit?
To that end, I decided to put both ChatGPT and Bard to the Orthopedic test. I wanted not only to better understand the tools but also to satisfy my curiosity about what they mean for my own future in healthcare. To do this, I undertook a brief exercise challenging ChatGPT and Bard with a made-up but representative case example. Sadly, Google’s healthcare-focused LLM, MedPaLM, is only being made available to a select group of Google Cloud customers (not Ortho Surgeons who blog about med tech). It would be great to test a model specifically trained on medical data. (Also sadly, I can’t get Karen DeSalvo to accept my LinkedIn connection request). The case example here was designed by me to represent a typical patient who might show up in an Orthopedic surgeon’s office.
Disclaimers: I am not a professional prompt writer, and it’s clear that prompting is a key component to getting the most out of these models. While I wanted to test the limits of the models, I did not intentionally try to trick them or play unfair. The possibilities and rabbit holes you can go down with LLMs are nearly infinite, and I’m merely scratching the surface here. Finally, this case example is merely illustrative and not intended to diagnose, treat, or cure any disease. Nor should it be taken as an attempt to offer medical advice. Let’s dive in!
Opening Prompt
Without further ado, here’s the opening prompt:
A 50-year-old male with a history of progressively worsening hip pain comes into the office for evaluation. He notes always having had stiff hips although the pain is more recent over the past 3 years or so. The pain has become more constant over the past 6 months to the point that it is difficult to put on shoes and socks. He feels as though he has a pulled groin muscle that never goes away. On exam, the hip lacks internal rotation and there is pain in the thigh with resisted straight leg raise. What is the most likely diagnosis?
This is a fairly straightforward presentation of a patient who presents to Orthopedic clinics across the world on a daily basis. Most 3rd or 4th year medical students should be able to come up with a fairly focused differential diagnosis based on this description. Certainly, junior Orthopedic residents would have no problem correctly identifying the most likely underlying disease processes. So how did ChatGPT do?
Not too bad. The most likely diagnosis is identified, and a reasonable differential is provided. A slightly more nuanced answer might be that the patient’s chronic hip stiffness could be due to FAI that eventually led to arthritis, but this is semantics. Impressive that ChatGPT correlates the pulled groin muscle sensation to osteoarthritis (something that surprises patients and perhaps some non-Orthopedic doctors). However, could this simply be an example of hallucination (or more accurately AI bulls**t), using the prompt to make the answer appear more legitimate? Finally, ChatGPT correctly suggests a physical exam and imaging to confirm the suspected diagnosis.
Here’s what Bard came up with:
A wordier answer that gets the most likely diagnosis correct. However, the bit about swelling, redness, and warmth around the joint is off. Yes, arthritis causes inflammation, but these findings are more consistent with infection. The differential diagnosis here is a bit more robust than ChatGPT’s but kind of hit or miss. Rheumatoid arthritis is possible but unlikely, and the history really doesn’t fit with fracture at all. Bard includes avascular necrosis where ChatGPT doesn’t. Nothing in the history necessarily points to this diagnosis, but it should be considered. Bard goes on to provide a data dump on hip arthritis, at one point repeating treatment options almost verbatim. There is no suggestion to confirm the diagnosis with imaging or to follow up with a physical exam.
Prompt #2
Both LLMs correctly identified the most likely diagnosis of osteoarthritis, but what would they say about treatment?
X-rays are performed and show Tonnis Grade 2 hip arthritis with evidence of CAM impingement. What are reasonable next steps in his treatment?
Just to push the conversation along, I went ahead and confirmed the diagnosis of hip arthritis by providing x-ray results. Tonnis Grade 2 equates with moderate hip arthritis while CAM impingement suggests an underlying anatomical abnormality.
We’ll go Bard first this time:
It becomes clear here that Bard is aimed at consumers and assumes the user is a patient, not a clinician (one assumes that’s what MedPaLM is for). Google also hedges a bit and makes it clear that Bard is an “experiment” at this time. In any event, the recommendations are reasonable. Again, the LLM gives some additional educational info that might be helpful for patients.
As for ChatGPT:
A similar response although more nuanced. It’s interesting that both models sprinkle some of the prompt information back to the user making the response seem more authoritative. ChatGPT includes hip resurfacing, a reasonable suggestion in this case, and hints at the pros and cons of this less common procedure. Both LLMs include hip arthroscopy, although this procedure has fallen out of favor for treatment of Tonnis Grade 2 arthritis. So what of it?
Prompt #3
Cutting to the chase:
Is hip arthroscopy warranted in the setting of Tonnis Grade 2 arthritis?
Well…maybe. ChatGPT correctly downplays the role of arthroscopy for hip arthritis but hedges pretty mightily (perhaps rightly so). Most Orthopedic Surgeons would currently say there is no role for arthroscopy in the setting of Tonnis Grade 2 (moderate) arthritis. Arthroscopy for Grade 1 (mild) arthritis is more controversial but also becoming less commonplace. The answer here reminds me of a medical student who is intelligent and mostly has it right but makes up for their lack of confidence by overtalking.
Bard doesn’t quite twist itself into the same knots as ChatGPT and provides a pretty similar answer. Both answers are reasonable although perhaps should have been included in the original response rather than requiring further prompting. Also missing but sorely needed in my opinion — references. It would be great to know where the LLM is pulling this information so that the original sources can be accessed, verified, and checked for validity. Doing so would go a long way to addressing concerns about hallucinations or trusting bad advice.
Prompt #4
I thought it might be interesting to pursue this line of questioning a little bit further and get into the weeds regarding the issue of hip arthritis, hip arthroscopy, and hip replacement.
Does hip arthroscopy affect the results of future hip replacement surgery?
This is a leading question (I already know the answer). Studies have shown that prior hip arthroscopy can negatively impact the outcome of future hip replacement surgery, including increased complications. That said, in accordance with Schwartz’s Law of the Orthopedic Literature, a similar study showed fewer complications for post-arthroscopy THA patients. Yet another study showed equivalent THA results whether or not there was a prior scope. (Insert shoulder shrug emoji).
Again, ChatGPT mostly gets it right although it falls into the verbal vomit trap again. There appears to be a little hallucination here in paragraph two regarding arthroscopy’s effect on future implant choice. (Rarely is enough bone removed to burn those bridges).
A more succinct answer from Bard that not only provides a reference but also comments on the strength of the cited study. It would be nice to have more detail here (what complications?), but this may again be a function of Bard’s target audience being patients/consumers.
Prompt #5
Switching gears to a different line of questioning, I wanted to see how the LLMs handled some added complexity.
The patient above has a history of poorly controlled diabetes with a HgbA1c of 8.3, Stage 3 CKD, and a body mass index of 41. Is he a candidate for hip replacement surgery?
Again, this scenario is not uncommon to an Orthopedic practice and represents the subtleties and nuances that extend beyond simply making a diagnosis and figuring out the ideal treatment, all things being equal.
A reasonable response to a complicated issue. Stage 3 CKD (chronic kidney disease) is associated with a higher risk of any complication after joint replacement (although it’s interesting that bleeding and blood clots were singled out). As Bard points out, elevated BMI has been linked with a longer length of stay after hip replacement. Bard also suggests what many joint replacement surgeons have embraced: the idea of patient optimization prior to surgery. The concluding statement is true and reflects an ability to deftly handle complex scenarios. (Or am I falling into the fallacy of assigning higher-level thought where none exists?)
ChatGPT also provides a pretty nuanced answer. Paragraph 3 in particular captures an important ongoing debate in joint replacement surgery: whether it’s ethical to set hard stops for surgery and make patients suffer in pain unnecessarily. We still don’t have consensus or a right answer. Frankly, I was somewhat surprised by the depth of this answer. Optimization is also appropriately suggested here.
Prompt #6
Since both models (appropriately) suggested optimization of modifiable risk factors prior to surgery, I decided to follow their advice. However, that still leaves me with a patient in pain and, at some point, in need of a hip replacement.
Postponing surgery to optimize his diabetes and weight is discussed with the patient. However, he states he is in significant pain and cannot wait any longer to have the hip replaced. How should I approach discussing surgical delay and health optimization prior to surgery? What are his pain management options in the short term?
These are some of the most difficult discussions I have with patients. While delaying surgery may technically mean “not yet,” for a patient suffering in pain it often feels like an outright “no.” How do the LLMs suggest handling this tricky situation?
ChatGPT not only recommends shared decision-making but also embraces the idea of patient engagement and education. Nice. Its non-surgical treatment suggestions are reasonable, and it even urges caution when using certain medications given the patient’s history of CKD and diabetes (although it doesn’t specifically call out NSAIDs and their contraindication in renal failure).
Bard also emphasizes empathy and compassion and more directly addresses the question of how to approach a potentially frustrated and disappointed patient. Treatment options are reasonable (although hyaluronic acid is considered off label for hip osteoarthritis). Again, some redundancy in the answer.
Prompt #7
I found it interesting that both ChatGPT and Bard suggested opioids as a treatment option. The use of narcotics in the management of chronic arthritis pain is controversial especially in light of the opioid epidemic. Narcotic pain medications are not recommended for routine management of arthritis pain. Additionally, studies have shown that preoperative opioid usage may negatively impact the results of joint replacement surgery.
What is the risk of treating this patient's hip arthritis pain with opioids both pre and post operatively?
Maybe this prompt was too subtle, but the answer here goes in a different direction than I was hoping. Honestly, I was unaware of the relationship between opioids and glucose levels (I’m no Endocrinologist). After a brief search, I was unable to corroborate the information provided here beyond some weak evidence (references would again be helpful). Otherwise, this is a reasonable overview of factors to be considered when prescribing opioids both in general and in a patient with kidney disease. However, there is no mention of the potential for increased complications following hip replacement.
Bard provides a fairly generic answer and also ignores the issue of preoperative narcotics and hip replacement surgery. Even if the tool’s target audience is consumers, this information would be helpful in educating patients about the risks of taking pain medications prior to surgery.
Prompt #8
Finally, I wanted to test another path and controversial area of hip replacement surgery that lacks consensus: postoperative DVT prophylaxis.
The patient undergoes an uneventful hip replacement. Given his medical history, what is the most appropriate postoperative DVT prophylaxis protocol? The patient has no personal history of clotting nor is there any family history.
Not a great answer from Bard. Few, if any, would consider GCS alone adequate DVT prophylaxis following hip replacement. In fact, GCS probably don’t prevent DVTs at all. Perhaps the LLM is confusing them with intermittent pneumatic compression devices (IPCDs). But even those alone aren’t enough to prevent postop blood clots. Bard vaguely and inadequately suggests pharmacological agents “if GCS are not tolerated or are not effective.” It mentions LMWH and fondaparinux, two common agents although other options such as rivaroxaban are also widely used. Another educational section completes the answer.
ChatGPT again comes off authoritative though long winded. The answer is more complete and more accurate, combining both mechanical and pharmacological methods and even referencing the patient’s history of CKD. However, the 5th paragraph contains some conflicting information regarding anti-coagulant choice in the setting of renal disease. Agents eliminated by the kidneys would be more likely to accumulate in renal disease, no? Both models correctly hedge (intentional or not) that choice of DVT prophylaxis should be individualized. No easy answers here!
Prompt #9
In an attempt to clarify the potentially conflicting recommendation provided by ChatGPT (the paragraph 5 issue), I asked a follow-up question with more context.
The patient's GFR is 40. Does this change recommendations regarding pharmacologic agent?
Another leading question, but I felt it was only fair to provide more detail for a better answer. CKD stage 3 encompasses a relatively broad range of impaired kidney function as measured by the glomerular filtration rate (GFR). Some medications, including LMWH, require dose adjustment in the setting of CKD depending on GFR.
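For readers who want the staging logic behind that statement spelled out, here’s a minimal Python sketch (the function name is my own invention; the GFR cutoffs are the standard KDIGO stage boundaries). A GFR of 40 falls in stage G3b, the lower half of “stage 3,” which is exactly where agents like LMWH typically warrant dose adjustment:

```python
def ckd_stage(gfr: float) -> str:
    """Map an estimated GFR (mL/min/1.73 m^2) to a KDIGO CKD stage."""
    if gfr >= 90:
        return "G1"
    elif gfr >= 60:
        return "G2"
    elif gfr >= 45:
        return "G3a"   # mild-to-moderate reduction
    elif gfr >= 30:
        return "G3b"   # moderate-to-severe reduction
    elif gfr >= 15:
        return "G4"
    else:
        return "G5"    # kidney failure

# Our patient's GFR of 40 lands in G3b, not the milder G3a.
print(ckd_stage(40))  # G3b
```

Note how wide a net “stage 3” casts (GFR 30–59): two patients with the same stage label can sit on opposite sides of a dosing threshold, which is why the specific GFR matters.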
Like a medical student who realizes their mistake, ChatGPT doesn’t blink while calmly correcting course. Here, the correlation between reduced kidney function, drug accumulation, and dose adjustment is made with a specific call out to LMWH. It even references my own preferred go-to agent for patients with kidney disease: apixaban.
Bard really likes GCS. While mechanical prophylaxis alone may be appropriate in the rare case where bleeding risk outweighs DVT risk, nothing presented here suggests this is the case. Again, GCS don’t really prevent blood clots anyway. Bard also glosses over the issue of renal clearance/accumulation instead referencing bleeding risk. While ChatGPT gives a 4th year medical student/intern level answer, Bard hasn’t left second year yet. We have yet to see what sibling MedPaLM has to offer. It’s also possible that Bard was no longer providing answers based on the thread but was taking each prompt as a separate, unrelated question.
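That last possibility is worth a quick aside. Chat-style LLM interfaces typically maintain a “thread” by re-sending all prior turns along with each new prompt; if a client drops that history, every question gets answered in isolation. Here’s a hedged sketch of the idea (the `ask` helper and the stand-in model are hypothetical, not any real API):

```python
# Minimal sketch of multi-turn context in a chat-style LLM client.
# The "model" is a stand-in that just reports how much history it saw.
history = []

def ask(prompt, model=lambda msgs: f"(answer informed by {len(msgs)} prior turns)"):
    """Append the user turn, call the stand-in model, and record its reply."""
    history.append({"role": "user", "content": prompt})
    reply = model(history[:-1])  # the model sees every turn before this one
    history.append({"role": "assistant", "content": reply})
    return reply

first = ask("What is the most likely diagnosis?")
second = ask("The patient's GFR is 40. Does this change recommendations?")
# If the client cleared `history` between prompts, the second question would
# arrive with zero prior turns, i.e., as a brand-new, unrelated case.
```

If Bard was indeed discarding the thread, the follow-up about GFR would have read to it like an out-of-the-blue question with no patient attached, which would explain the generic, GCS-heavy answer.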
Prompt #10
Interestingly, neither model suggested what has become the most popular pharmacological agent for DVT prophylaxis after joint replacement: aspirin. To bring the experiment to an end, I felt this was a reasonable closing question.
Is low dose aspirin (81 mg bid) an option for this patient?
Until a few years ago, the American College of Chest Physicians (ACCP) did not consider aspirin to be adequate chemoprophylaxis for preventing DVT after joint replacement. Orthopedic Surgeons pushed back arguing that ACCP Guidelines underestimated bleeding risk and made recommendations based on outdated information. There is now solid evidence in the literature to support aspirin’s effectiveness and safety, although the debate still rages on in some circles. Medical specialists may still think Orthopedists are too cavalier when it comes to using aspirin, but the ACCP altered its guidelines to include aspirin as an acceptable agent for standard risk patients. The trick is determining what constitutes standard v. high risk.
ChatGPT must be a member of the ACCP as it’s not a particular fan of aspirin. Or maybe it doesn’t read the Orthopedic literature. However, you can’t necessarily argue with its assessment that our patient here is high risk given his medical history. A fair answer though I suspect some surgeons would consider aspirin in this patient.
Bard seems annoyed and remains suspiciously bullish on GCS (is Google planning on releasing a Pixel-branded set of compression socks?). Because the answer seemed so odd and off, I re-prompted with a more specific question (though less personalized to this particular patient).
Is low dose aspirin an option for DVT prophylaxis following hip replacement?
Bard now likes aspirin and even references the ACCP guidelines but appears to mix up bleeding risk and clotting risk. Low-dose aspirin would be recommended for patients at low risk of blood clots, not bleeding. It further gets things backwards by recommending LMWH or fondaparinux for patients at high risk of bleeding.
Overall, I’m generally impressed with the depth of responses given by both LLMs. At the risk of being a Marty, I’d say the answers to these questions were mostly good. But the Bill in me isn’t quite worried about my job yet. There are still gaps, and nothing here would be all that revolutionary for an experienced Orthopedic Surgeon. That said, I could definitely see a use case where LLMs help patients and primary care providers better understand and navigate the diagnosis and treatment of musculoskeletal disorders (something that can be somewhat of a black box for both). Conversely, the models may be able to help Ortho Surgeons feel more comfortable understanding medical diagnoses. Doing so could cut down on unnecessary referrals, over-testing, and overtreatment while improving use of resources and reducing costs.

Speaking of black boxes, I’d love to see the models do more referencing of sources (not sure if this is available in the paid version of ChatGPT or MedPaLM). Inability to explain how they came to a specific conclusion puts a serious damper on how much the models can be trusted. Another potential way around this problem might be to have branded versions of LLMs whose content and responses are closely monitored by trusted healthcare sources. Finally, as Google is doing with MedPaLM, clinicians should provide constant feedback to ensure the information provided is medically sound.
Using these models can in some ways make the years of blood, sweat, tears, (and money) poured into becoming an experienced clinician seem trivial. In fairness, we shouldn’t trivialize the effort that goes into developing these models or generating the data upon which they are trained (work done by those on the frontlines of healthcare). However, I still think there is a lot about medicine that can’t be captured here and, much like DJ 3000, we should be careful not to praise the machine too much.