Although popular AI models score highly on medical exams, their accuracy drops significantly when they must make a diagnosis based on a conversation with a simulated patient.