Lewis Goldberg was flabbergasted by the results of his experiment. How could a simple algorithm perform better than the excellent doctors who had trained it?
Goldberg was a psychologist who wished to study how experts made decisions. Back in the 1970s, he interviewed some radiologists to find out how they diagnosed stomach X-rays. The doctors looked for seven signs: the size of the ulcer, the shape of its borders, the width of the crater and so on. Drawing on years of training and practice, they studied how these signs appeared in different combinations and concluded whether the patient had cancer or not.
Goldberg wished to build an algorithm that mimicked the doctors’ clinical diagnosis. As a starting point, he gave equal weights to all seven signs, though he was skeptical that something so simple would work. He then gave a batch of X-rays to the doctors for their diagnosis and fed the details from these X-rays to the machine. He asked the doctors to rate each X-ray on a seven-point scale ranging from “definitely malignant” to “definitely benign”. Unknown to the doctors, he also slipped some duplicates into their piles.
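To make the idea of an equal-weight model concrete, here is a minimal sketch in Python. It is an illustration, not Goldberg’s actual procedure: the function names, the encoding of each sign as a number between 0 and 1, and the direction of the seven-point scale (1 for “definitely benign”, 7 for “definitely malignant”) are all assumptions made for this example, and the sample X-ray is invented.

```python
from statistics import mean

NUM_SIGNS = 7  # the seven signs the radiologists said they looked for


def equal_weight_score(signs):
    """Average the sign ratings, giving each sign the same weight (1/7).

    Assumption: each sign is encoded as a number in [0, 1], where higher
    means "looks more like cancer". The source does not say how the signs
    were actually coded.
    """
    if len(signs) != NUM_SIGNS:
        raise ValueError(f"expected {NUM_SIGNS} sign ratings, got {len(signs)}")
    return mean(signs)


def to_seven_point_scale(score):
    """Map a score in [0, 1] onto 1..7.

    Assumption: 1 = "definitely benign", 7 = "definitely malignant".
    """
    return 1 + round(score * 6)


# Hypothetical X-ray, invented purely for illustration.
xray = [1.0, 0.5, 0.0, 1.0, 0.5, 0.0, 0.5]
rating = to_seven_point_scale(equal_weight_score(xray))
print(rating)
```

Whatever its accuracy, a rule like this has one property no human rater has: given the same seven inputs, it always returns the same rating.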
The results? Goldberg’s rudimentary algorithm was extremely good at predicting the doctors’ diagnoses. In fact, it outperformed even the best doctor in the group. How could this be? How could an algorithm be better than the doctors who trained it?
A closer look at the doctors’ diagnoses revealed some inconsistencies. The physicians disagreed widely among themselves: where one doctor diagnosed an X-ray as malignant, another would produce the opposite diagnosis. More surprisingly, the doctors even disagreed with their own diagnoses on the duplicate X-rays about 20% of the time. The simple algorithm, on the other hand, was ruthlessly consistent.
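One simple way to put a number on this kind of inconsistency is to count how often a rater gives a duplicated X-ray a different rating the second time. The sketch below does that. It is only an assumed measure (the source does not say exactly how the disagreement was computed), and the ratings are invented so that the doctor changes their mind on 2 of 10 duplicates.

```python
def self_disagreement_rate(first_ratings, second_ratings):
    """Fraction of duplicated X-rays rated differently on the second viewing."""
    if len(first_ratings) != len(second_ratings):
        raise ValueError("both lists must cover the same duplicated X-rays")
    if not first_ratings:
        raise ValueError("need at least one duplicated X-ray")
    changed = sum(1 for a, b in zip(first_ratings, second_ratings) if a != b)
    return changed / len(first_ratings)


# Hypothetical ratings on a seven-point scale, invented for illustration.
doctor_first = [2, 5, 7, 3, 6, 1, 4, 4, 6, 2]
doctor_second = [2, 5, 6, 3, 6, 1, 4, 5, 6, 2]  # two ratings changed

# A deterministic rule returns the same rating for the same X-ray,
# so its two passes over the duplicates are identical by construction.
algorithm_first = [2, 5, 6, 3, 6, 1, 4, 4, 6, 2]
algorithm_second = list(algorithm_first)

print(self_disagreement_rate(doctor_first, doctor_second))        # 0.2
print(self_disagreement_rate(algorithm_first, algorithm_second))  # 0.0
```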
The doctors were not incompetent. They came up with the rules for diagnosing X-rays, rules that are documented in medical textbooks and were eventually fed to the algorithm. They were just not as consistent as the machine in applying those rules. In more than 200 similar experiments, clinical diagnosis (done using expert intuition and judgement) has been compared to statistical diagnosis (done by applying a procedure, like an algorithm). To date, there has been no study where clinicians have outperformed statistical methods.
The doctors’ only folly was that they were human. Human brains are creative, observant, and intelligent. But they are not consistent. In fact, one can argue that a completely consistent machine (like a computer) cannot possess intelligence. The very randomness that gives us our creativity and intelligence takes away our ability to make statistically correct decisions.
With subjective decisions, ones that need creative problem solving, our brains continue to be our best bet. But when it comes to objective, statistical decisions, where the same well-established rules ought to be applied with ruthless consistency, the algorithmic models that mimic human judgement are often superior to humans themselves.
Inspiration: The Undoing Project by Michael Lewis; Thinking, Fast and Slow by Daniel Kahneman