Why are deep learning technologists so overconfident?
Deep learning researchers have proved skeptics wrong before, but the past doesn't predict the future.
Deep learning researchers have been predicting for a while that the technology will make various professions obsolete and that self-driving cars are imminent. We’re still waiting. Some have even claimed that they are nearing artificial general intelligence, or AI capable of equalling or exceeding human performance at all tasks.
Hype is nothing new to machine learning, but this wave seems different. Billions of dollars in funding have been allocated based on this hype, and it has led to a massive amount of public confusion (which motivated our book).
Obviously, there are self-serving reasons for any field to hype itself. But that doesn’t explain all of it, and many deep learning people genuinely believe their overconfident predictions. We think there are a few cultural and historical reasons for this. We hope that understanding those reasons will help you resist the hype and push back the next time you meet a true believer of deep learning — while still acknowledging that it works well in a limited set of domains and tasks.
The dogma that all problems are the same
Every scientific field has a central dogma: a core belief that binds the group together and gives it its identity. The central dogma of deep learning is that the only thing you need in order to solve a new type of learning problem is to collect training examples (such as images and their corresponding descriptions). The thinking is that you can use the same generic neural architectures and training algorithms that have worked for other problems in the past.
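To make the dogma concrete, here is a minimal, hypothetical sketch of that generic recipe (the data is random and the architecture and hyperparameters are arbitrary placeholders, not anyone's actual system). The point is that whether the task is tagging images or scoring loan applicants, the code looks essentially the same, and that sameness is what the dogma reads as evidence that the problems themselves are the same.

```python
# A minimal sketch of the "generic recipe": collect (input, label) pairs,
# pick a standard architecture, and run a standard training loop.
# The data here is random and purely illustrative.
import torch
from torch import nn

# Pretend these are collected training examples: 1,000 inputs with
# 64 features each, each labeled with one of 10 classes.
inputs = torch.randn(1000, 64)
labels = torch.randint(0, 10, (1000,))

# A generic feed-forward network; nothing about it is specific to the problem.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
```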
Because of this belief, the deep learning community doesn’t pay nearly enough attention to the question of what makes some learning problems different from others. The fundamental insight in our book — namely that perception problems, judgment problems, and social prediction problems are all radically different from each other — isn’t appreciated by most deep learning technologists.1 Without recognizing this, it’s hard to build an intuition for whether improvements in one area translate to another, and it’s easy to get carried away by the success of deep learning at generating images or transcribing speech.
In our experience, even deep learning technologists at the top of their game are surprised when we point out that deep learning (or any other form of AI) struggles when tasked with predicting the future. This confusion is made worse by a careless vocabulary choice: in the machine learning community, the word “prediction” refers to all applications of machine learning. So deep learning is often billed as an extraordinary tool for prediction, when prediction is in fact the one thing it’s really bad at.2
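To see how ordinary the machine learning sense of the word is, consider this small illustrative example (using scikit-learn's built-in iris dataset): the classifier "predicts" labels for flowers that already exist, with nothing about the future involved.

```python
# In machine learning parlance, "predict" just means "produce an output for
# an input", even when nothing about the future is involved. Here a
# classifier "predicts" the species of irises already in the dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict(X[:5]))  # "predictions" about the present, not the future
```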
Long-simmering grudges
Deep learning entered the sphere of mainstream awareness less than a decade ago, but the science is ancient. This 1986 paper in Nature contains almost all of the core technical innovations that make it work well. (The term deep learning hadn’t been coined yet; the technique was simply called neural networks.)
However, the dataset sizes and compute resources that were available in the 1980s weren’t enough to demonstrate the effectiveness of deep learning, and other machine learning techniques like support vector machines took center stage.
Neural network researchers persevered through decades of skepticism. Many of today’s deep learning researchers spent years being told that neural networks can't do this or that, and have repeatedly proven the skeptics wrong. So when they hear about the impossibility of predicting crime or job performance, they tend to dismiss it as the view of uninformed outsiders who will soon be corrected.
This story of how deep learning was unfairly ignored has been told so often within that community that it seems to have led to an us-versus-them mentality.3 We suspect that many researchers see hype as fair game in the quest to convince the world that deep learning is amazing.
Neglect of domain expertise
One natural consequence of the central dogma of deep learning: the role of domain experts is seen as data labeling and nothing else. By domain experts we mean doctors and nurses in the case of medical AI, or social workers in the case of AI for welfare benefits automation. They are not seen as partners in designing the system. They are not seen as clients whom the AI system is meant to help by augmenting their abilities. They are seen as inefficiencies to be automated away, as obstacles standing in the way of progress.
As one study found: “[AI] developers conceived of workers as corrupt, lazy, non-compliant, and as datasets themselves, pursuing surveillance and gamification to discipline workers to collect better quality data.”
This contempt is also mixed with an ignorance of what domain experts actually do. When technologists proclaim that AI will make various professions obsolete, it is as if the inventor of the typewriter had proclaimed that it would make writers and journalists obsolete, failing to recognize that professional expertise is more than the externally visible activity. Of course, jobs and tasks have been successfully automated throughout history, but someone who doesn't work in a profession and doesn't understand its nuances is in a poor position to make predictions about how automation will impact it.
Admittedly, there are some famous cases of domain expertise proving much less helpful for AI development than originally thought. A frequently cited quip, usually attributed to speech recognition pioneer Frederick Jelinek, goes, "Every time I fire a linguist, the performance of the speech recognizer goes up." Noted AI researcher Rich Sutton wrote an essay in which he forcefully argued that attempts to add domain knowledge to AI systems actually hold back progress. This is the dominant view in the deep learning community. But what’s interesting is the four areas Sutton used to make his point: chess playing, Go playing, computer vision, and natural language processing.
Sutton’s argument breaks down quickly once we leave these highly circumscribed domains with clear ground truth, like chess or Go; or highly circumscribed tasks in computer vision and natural language processing, such as object recognition. Again, it is critical to disaggregate different kinds of learning problems. As just one example, consider car financing, a problem domain that is much messier than chess or Go. A recent ethnographic study described the mighty struggles of a team of data scientists to even define the target variable. Problem formulation is just one part of the machine learning pipeline in these cases where domain expertise can’t be reduced to data labeling.
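To see why problem formulation resists being reduced to labeling, here is a toy, entirely hypothetical version of the target-variable question in car financing. The column names, thresholds, and numbers below are invented for illustration; the real study involved far messier data and trade-offs.

```python
# A hypothetical illustration of why "what counts as a bad loan?" is a
# judgment call requiring domain expertise, not a labeling chore.
import pandas as pd

loans = pd.DataFrame({
    "missed_payments":   [0, 2, 5, 1],
    "months_delinquent": [0, 1, 6, 0],
    "repossessed":       [False, False, True, False],
    "net_loss":          [0.0, 150.0, 4200.0, 0.0],
})

# Formulation A: a "bad loan" is one where the car was repossessed.
label_a = loans["repossessed"]

# Formulation B: a "bad loan" means sustained delinquency, even without repossession.
label_b = loans["months_delinquent"] >= 3

# Formulation C: a "bad loan" is one where the lender lost money overall.
label_c = loans["net_loss"] > 0

# The three definitions disagree, and no accuracy metric can say which one
# is right; that depends on the business, the borrowers, and the consequences.
print(pd.DataFrame({"A": label_a, "B": label_b, "C": label_c}))
```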
From benchmarks to the real world
Sutton’s argument even breaks down for computer vision and natural language processing — if we care about real-world performance and not just performance on benchmark datasets defined via a one-dimensional accuracy metric. The language and vision systems developed without linguistic or cultural expertise excel at propagating the harmful stereotypes contained in their haphazardly constructed datasets.
Almost any engineering product encounters an unending variety of corner cases in the real world that weren’t anticipated during development. This is well known, and we accept that it often takes five or ten years after a successful prototype for a product to be ready for the mass market. Curiously, most machine learning technologists have been insulated from this frustration. In the kinds of areas where machine learning has found the most use, such as serving ads or recommending products online, failures are not costly, and so the “move fast and break things” culture has served the industry well. So this community is used to declaring a problem solved when good benchmark performance is reached in the lab. But this approach is a poor fit for healthcare or self-driving cars, where a single failure can be catastrophic. The last 10% is 90% of the effort.
The penumbra of AGI hype
The deep learning community is adjacent to the community of thinkers and researchers who focus on artificial general intelligence (AGI). For example, OpenAI and DeepMind are major players in both areas.
Unfortunately, the discourse on AGI has lost touch with reality, and includes claims that today’s neural nets are conscious or that AGI is imminent. Most significantly, AGI hypers have succeeded in directing a large amount of funding toward the fanciful goal of ensuring that this supposedly imminent AGI will be aligned with humanity’s interests. Crucially, much of this community believes that we already have all the innovations we need to reach AGI; we just need to throw more hardware at the neural networks we already have.
We won’t waste your time arguing with these claims. Our point is that this hype casts a long shadow into the deep learning community. When all this AGI talk is normalized, even those who don’t fully believe it will have a hard time staying grounded and realistic about the capabilities and limits of deep learning. And it shows.
Summary: errors of extrapolation
A lot of the reasons for the overconfidence of deep learning technologists involve a kernel of truth: deep learning is indeed quite general compared to previous machine learning methods; the deep learning community does have a history of proving skeptics wrong; in many domains, expertise is not in fact as valuable as once thought; deep learning does work well across a broad range of domains, including healthcare, if we only care about performance on benchmarks; and researchers have indeed gotten quite far by building bigger and bigger neural networks without fundamental changes to the architecture.
But going from those truths to the grander claims we’re seeing requires unjustified extrapolation: ignoring the differences between domains, conflating the technical barriers that deep learning once faced with the socio-technical barriers that it now faces, ignoring the yawning gap between the lab and the field, and assuming that past trends will continue into the future. That last assumption, fittingly, is also at the heart of how machine learning itself operates.
1. Not thinking about the differences between domains is fine for a researcher or developer who works in one narrow domain, but is problematic when researchers claim to have developed methods that work well across domains.
2. [EDITED TO ADD] Upon reflection, this paragraph is oversimplified and misrepresents our point. It’s not that deep learning researchers think it’s possible to predict the future accurately, but rather that they are surprised when we point out the recent research showing how low the state-of-the-art accuracy figures are for social prediction tasks, and the fact that it’s hard to beat linear/logistic regression. We also didn’t mean to imply that the machine learning community is confused about the two meanings of the word prediction, but rather that the community’s vocabulary choice tends to mislead those outside it (notably policy makers; the misunderstanding that AI is a tool for predicting the future seems to drive a lot of bad policy).
3. Some have argued that there was nothing unfair about the lesser level of interest in deep learning in the ‘80s and ‘90s, and that this was the prudent response to the evidence available at the time.
Thank you for this! A lot of what you say really resonates for me and I especially appreciate the point about the "wishful mnemonic" (<= Drew McDermott's wonderful term) "predict".
Three quick bits of feedback:
For our paper "AI and the Everything in the Whole Wide World Benchmark" (Raji et al. 2021, NeurIPS Datasets and Benchmarks track) it would be better to point to the published version instead of arXiv as you are now doing. You can find that here: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/084b6fbb10729ed4da8c3d3f5a3ae7c9-Abstract-round2.html
I think that the way arXiv is used is also a factor in the culture of hype and supposedly "fast progress" in deep learning/AI, and it is always valuable to point to peer-reviewed venues for papers that have actually been peer reviewed.
Second, I object to the assertion that NLP is a "highly circumscribed domain" like chess or Go. There are tasks within NLP that are highly circumscribed, but that doesn't go for the domain as a whole. I have no particular expertise in computer vision, but at the very least it also seems extremely ill-defined compared to chess and Go. If it's "highly circumscribed" it isn't in the same way the games are. You kind of get to this in the next paragraph (for both NLP and CV), but I think it would be better to avoid the assertion. These domains only look "highly circumscribed" if you look at them without any domain expertise. (Though again, for CV, it's a little unclear what the domain of expertise even is...)
Finally, I'd like to respond to this: "Noted AI researcher Rich Sutton wrote an essay in which he forcefully argued that attempts to add domain knowledge to AI systems actually hold back progress."
That framing suggests that the progress we should care about is progress in AI. But that is entirely at odds with what you say above regarding domain experts:
"They are not seen as partners in designing the system. They are not seen as clients whom the AI system is meant to help by augmenting their abilities. They are seen as inefficiencies to be automated away, and standing in the way of progress."
I think it is a mistake to cede the framing to those who think the point of this all is to build AI, rather than to build tools that support human flourishing. (<= That lovely turn of phrase comes from Batya Friedman's keynote at NAACL 2022.)
Thanks for this. The hype and hubris coming from some corners of the community is harmful in a lot of ways. Why would a student work hard on their education if they've been convinced human labour is about to become obsolete? Look at the steady flow of questions on Quora from people concerned all sorts of careers are about to become obsolete. The popular narrative on AI has to change.