AI companies are pivoting from creating gods to building products. Good.
Turning models into products runs into five challenges
AI companies are collectively planning to spend a trillion dollars on hardware and data centers, but there’s been relatively little to show for it so far. This has led to a chorus of concerns that generative AI is a bubble. We won’t offer any predictions on what’s about to happen. But we think we have a solid diagnosis of how things got to this point in the first place.
In this post, we explain the mistakes that AI companies have made and how they have been trying to correct them. Then we will talk about five barriers they still have to overcome in order to make generative AI commercially successful enough to justify the investment.
Product-market fit
When ChatGPT launched, people found a thousand unexpected uses for it. This got AI developers overexcited. They completely misunderstood the market, underestimating the huge gap between proofs of concept and reliable products. This misunderstanding led to two opposing but equally flawed approaches to commercializing LLMs.
OpenAI and Anthropic focused on building models and not worrying about products. For example, it took 6 months for OpenAI to bother to release a ChatGPT iOS app and 8 months for an Android app!
Google and Microsoft shoved AI into everything in a panicked race, without thinking about which products would actually benefit from AI and how they should be integrated.
Both groups of companies forgot the “make something people want” mantra. The generality of LLMs allowed developers to fool themselves into thinking that they were exempt from the need to find a product-market fit, as if prompting a model to perform a task is a replacement for carefully designed products or features.
OpenAI and Anthropic’s DIY approach meant that early adopters of LLMs disproportionately tended to be bad actors, since they are more invested in figuring out how to adapt new technologies for their purposes, whereas everyday users want easy-to-use products. This has contributed to a poor public perception of the technology.1
Meanwhile the AI-in-your-face approach by Microsoft and Google has led to features that are occasionally useful and more often annoying. It also led to many unforced errors due to inadequate testing like Microsoft's early Sydney chatbot and Google's Gemini image generator. This has also caused a backlash.
But companies are changing their ways. OpenAI seems to be transitioning from a research lab focused on a speculative future to something resembling a regular product company. If you take all the human-interest elements out of the OpenAI boardroom drama, it was fundamentally about the company's shift from creating gods to building products. Anthropic has been picking up many of the researchers and developers at OpenAI who cared more about artificial general intelligence and felt out of place at OpenAI, although Anthropic, too, has recognized the need to build products.
Google and Microsoft are slower to learn, but our guess is that Apple will force them to change. Last year Apple was seen as a laggard on AI, but it seems clear in retrospect that the slow and thoughtful approach that Apple showcased at WWDC, its developer conference, is more likely to resonate with users.2 Google seems to have put more thought into integrating AI in its upcoming Pixel phones and Android than it did into integrating it in search, but the phones aren’t out yet, so let’s see.
And then there’s Meta, whose vision is to use AI to create content and engagement on its ad-driven social media platforms. The societal implications of a world awash in AI-generated content are double-edged, but from a business perspective it makes sense.
The big five challenges for consumer AI
There are five limitations of LLMs that developers need to tackle in order to make compelling AI-based consumer products.3 (We will discuss many of these in our upcoming online workshop on building useful and reliable AI agents on August 29.)
1. Cost
There are many applications where capability is not the barrier, cost is. Even in a simple chat application, cost concerns dictate how much history a bot can keep track of — processing the entire history for every response quickly gets prohibitively expensive as the conversation grows longer.
There has been rapid progress on cost — in the last 18 months, cost-for-equivalent-capability has dropped by a factor of over 100.4 As a result, companies are claiming that LLMs are, or will soon be, “too cheap to meter”. Well, we’ll believe it when they make the API free.
More seriously, the reason we think cost will continue to be a concern is that in many applications, cost improvements directly translate to accuracy improvements. That’s because repeatedly retrying a task tens, thousands, or even millions of times turns out to be a good way to improve the chances of success, given the randomness of LLMs. So the cheaper the model, the more retries we can make with a given budget. We quantified this in our recent paper on agents; since then, many other papers have made similar points.
That said, it is plausible that we’ll soon get to a point where in most applications, cost optimization isn’t a serious concern.
2. Reliability
We see capability and reliability as somewhat orthogonal. If an AI system performs a task correctly 90% of the time, we can say that it is capable of performing the task but it cannot do so reliably. The techniques that get us to 90% are unlikely to get us to 100%.
With statistical learning based systems, perfect accuracy is intrinsically hard to achieve. If you think about the success stories of machine learning, like ad targeting or fraud detection or, more recently, weather forecasting, perfect accuracy isn’t the goal — as long as the system is better than the state of the art, it is useful. Even in medical diagnosis and other healthcare applications, we tolerate a lot of error.
But when developers put AI in consumer products, people expect it to behave like software, which means that it needs to work deterministically. If your AI travel agent books vacations to the correct destination only 90% of the time, it won’t be successful. As we’ve written before, reliability limitations partly explain the failures of recent AI-based gadgets.
AI developers have been slow to recognize this because as experts, we are used to conceptualizing AI as fundamentally different from traditional software. For example, the two of us are heavy users of chatbots and agents in our everyday work, and it has become almost automatic for us to work around the hallucinations and unreliability of these tools. A year ago, AI developers hoped or assumed that non-expert users would learn to adapt to AI, but it has gradually become clear that companies will have to adapt AI to user expectations instead, and make AI behave like traditional software.
Improving reliability is a research interest of our team at Princeton. For now, it’s fundamentally an open question whether it’s possible to build deterministic systems out of stochastic components (LLMs). Some companies have claimed to have solved reliability — for example, legal tech vendors have touted “hallucination-free” systems. But these claims were shown to be premature.
3. Privacy
Historically, machine learning has often relied on sensitive data sources such browsing histories for ad targeting or medical records for health tech. In this sense, LLMs are a bit of an anomaly, since they are primarily trained on public sources such as web pages and books.5
But with AI assistants, privacy concerns have come roaring back. To build useful assistants, companies have to train systems on user interactions. For example, to be good at composing emails, it would be very helpful if models were trained on emails. Companies’ privacy policies are vague about this and it is not clear to what extent this is happening.6 Emails, documents, screenshots, etc. are potentially much more sensitive than chat interactions.
There is a distinct type of privacy concern relating to inference rather than training. For assistants to do useful things for us, they must have access to our personal data. For example, Microsoft announced a controversial feature that would involve taking screenshots of users’ PCs every few seconds, in order to give its CoPilot AI a memory of your activities. But there was an outcry and the company backtracked.
We caution against purely technical interpretations of privacy such as “the data never leaves the device.” Meredith Whittaker argues that on-device fraud detection normalizes always-on surveillance and that the infrastructure can be repurposed for more oppressive purposes. That said, technical innovations can definitely help.
4. Safety and security
There is a cluster of related concerns when it comes to safety and security: unintentional failures such as the biases in Gemini’s image generation; misuses of AI such as voice cloning or deepfakes; and hacks such as prompt injection that can leak users’ data or harm the user in other ways.
We think accidental failures are fixable. As for most types of misuses, our view is that there is no way to create a model that can’t be misused and so the defenses must primarily be located downstream. Of course, not everyone agrees, so companies will keep getting bad press for inevitable misuses, but they seem to have absorbed this as a cost of doing business.
Let’s talk about the third category — hacking. From what we can tell, it is the one that companies seem to be paying the least attention to. At least theoretically, catastrophic hacks are possible, such as AI worms that spread from user to user, tricking those users’ AI assistants into doing harmful things including creating more copies of the worm.
Although there have been plenty of proof-of-concept demonstrations and bug bounties that uncovered these vulnerabilities in deployed products, we haven't seen this type of attack in the wild. We aren’t sure if this is because of the low adoption of AI assistants, or because the clumsy defenses that companies have pulled together have proven sufficient, or something else. Time will tell.
5. User interface
In many applications, the unreliability of LLMs means that there will have to be some way for the user to intervene if the bot goes off track. In a chatbot, it can be as simple as regenerating an answer or showing multiple versions and letting the user pick. But in applications where errors can be costly, such as flight booking, ensuring adequate supervision is more tricky, and the system must avoid annoying the user with too many interruptions.
The problem is even harder with natural language interfaces where the user speaks to the assistant and the assistant speaks back. This is where a lot of the potential of generative AI lies. As just one example, AI that disappeared into your glasses and spoke to you when you needed it, without even being asked — such as by detecting that you were staring at a sign in a foreign language — would be a whole different experience than what we have today. But the constrained user interface leaves very little room for incorrect or unexpected behavior.
Concluding thoughts
AI boosters often claim that due to the rapid pace of improvement in AI capabilities, we should see massive societal and economic effects soon. We are skeptical of the trend extrapolation and sloppy thinking that goes into those capability forecasts. More importantly, even if AI capability does improve rapidly, developers have to solve the challenges discussed above. These are sociotechnical and not purely technical, so progress will be slow. And even if those challenges are solved, organizations need to integrate AI into existing products and workflows and train people to use it productively while avoiding its pitfalls. We should expect this to happen on a timescale of a decade or more rather than a year or two.
Further reading
Benedict Evans has written about the importance of building single-purpose software using general-purpose language models.
To be clear, we don't think that reducing access to state-of-the-art models will reduce misuse. But when it comes to LLMs, misuse is easier than legitimate uses (which require thought), so it isn't a surprise that misuses have been widespread.
The pace of AI adoption is relative. Even Apple's approach to integrating AI into its products has been criticized as too fast-paced.
These are about factors that matter to the user experience; we are setting aside environmental costs, training on copyrighted data, etc.
For example, GPT-3.5 (text-davinci-003) in the API cost $20 per million tokens, whereas gpt-4o-mini, which is more powerful, costs only 15 cents.
To be clear, just because the data sources are public doesn’t mean there are no privacy concerns.
For example, Google says “we use publicly available information to help train Google’s AI models”. Elsewhere it says that it may use private data such as emails to provide services, maintain and improve services, personalize services, and develop new services. One approach that is consistent with these disclosures is that only public data is used for pre-training of models like Gemini, but private data is used to fine-tune those models to create, say, an email auto-response bot. Anthropic is the one exception we know of. It says: “We do not train our generative models on user-submitted data unless a user gives us explicit permission to do so. To date we have not used any customer or user-submitted data to train our generative models.” This commitment to privacy is admirable, though we predict that it will put the company at a disadvantage if it more fully embraces building products.
Subscribe to AI Snake Oil
What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference
The title is splendid, and the article delivers, well done!
I've been saying more or less the same (not so well written) for a long time in my own "The Skeptic AI Enthusiast" –for instance, in the last issue I include the quote "The antidote to hype is to focus on concrete value." While many get excited about AGI, I prefer useful AI products that deliver value to the user, like Grammarly, Udio, Midjourney and many others.
Great article with many profound points, especially the question of whether deterministic systems are feasible given stochastic components like LLMs. At Microsoft, we've made encouraging progress with constrained decoding via our Guidance OSS library. Guidance masks out token that would not match a user's expectations, thereby leading to much better aligned and more predictable generations. https://github.com/guidance-ai/guidance