What Apple's LLM Fumbles Say About LLMs (Rather Than About Apple)
But the OS owner is still in pole position, no matter what the vibes are.
A great deal has been written about what Apple’s fumbles with Apple Intelligence mean about Apple.1 But more interesting to me is what these fumbles mean about LLMs and how we reason about their progress. It may be true that Apple’s executives simply forgot the painfully learned lessons our industry has been taught by vaporware disasters, or were too keen to market under-differentiated iPhone iterations with “the new hotness,” or even that they succumbed to “market pressure” to keep up in an exploding field, though their stock price doesn’t suggest as much. But I think it’s more plausible that, like many others, these leaders simply misunderstood what rapid recent progress in LLMs meant for near-term future progress, and that they didn’t reckon on what LLMs might stay bad at for a while.
This would be somewhat surprising in itself. Nowhere are “last mile” challenges more clearly demonstrated than with the actual last miles to be driven by self-driving cars, which make use of many of the same software and process instrumentalities. It has been almost ten years since Elon Musk predicted that Teslas would be able to drive door-to-door, which they still cannot. Waymo’s self-reported progress has been impressive, but the YoY improvement has slowed dramatically, and the last time Cruise reported data it didn’t seem like they were anywhere near “fully automated driving.” Because Apple has futzed around repeatedly with making cars, I’d expect them to know that rapid gains on the first X% do not always indicate that the final (100-X)% will be covered quickly, or at all. But in frothy moments, anyone can lose their head, and because LLMs are often beyond the ken of even experienced technical thinkers, it’s easier than ever to look at a reductive chart of improvements and think: “Well, next year, these things will be able to do anything!” even when they still cannot drive as well as a median American.2
The main things missing from Apple Intelligence are:
a substantially improved Siri, that is: a Siri who responds to requests with approximately the success rate one has with ChatGPT (which is very high, of course)
a kind of integration they call “personal context,” in which the LLM draws from all it knows about you from your device and can therefore e.g. relate the contents of text messages and maps data to your calendar (“Schedule a meeting with David for 3pm on Tuesday.” “But Mills, you texted Abby that you’d do pickup that afternoon, and the drive back from school normally takes about 20 minutes.” “Oh, thank you, Siri, tell David we need to move it back.”).
actual awareness of and even control of your device, such that you can use conversation as a “front end” for the entirety of your apps, telling Siri / AI what to do and letting it actually “view” and manipulate the apps on your phone to accomplish whatever you say.
There’s naturally some overlap between these: a really conversationally-capable Siri who cannot control your apps and has no idea what’s going on with your life has limited utility (and, since other chat agents exist, no differentiation). And it’s reasonably alarming that Apple continues to let Siri underperform so badly; long before the announcement of Apple Intelligence, Siri was a catastrophe. But the latter two elements —personal context and awareness and control of devices— do not currently exist anywhere. OpenAI’s “Operator” is all right at operating a web browser, sort of, kind of; presumably the “computer use” mode of Claude, from Anthropic, is too. But I at least haven’t seen evidence that anyone has nailed the interface-to-use-all-interfaces yet.
I think expecting anyone to do so touches on my area of interest: where do LLMs struggle, and why? For personal context features, I assume the challenge is bounded, but still real: Apple Intelligence needs to have a somewhat large “context window,” and the process by which information is put into it and taken out needs to be figured out. Given that personal context information of import can reside in scores of places —dozens of apps, dozens of individual emails, threads, exchanges, images, data stores, etc.— this seems non-trivial, at least as hard as “knowing what an individual should know” often is. There are probably generalizable techniques that can work for a large share of instances, like having the LLM consider any information in the top N most-used or most-recently updated apps, but some sources, like Health or Maps, might not make the cut while still being crucial. And context windows are constrained in size, so Apple needs to push the envelope, intelligently meta-manage what goes into them, and keep it up to date / remove stale information effectively, all within the parameters they’ve set for privacy.3
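To make that meta-management problem a little more concrete, here is a minimal sketch of the kind of selection heuristic the paragraph above gestures at: score candidate pieces of personal context by recency and by how heavily their source app is used, give a floor to a couple of “crucial” sources, and greedily pack the winners into a fixed token budget. Everything here is hypothetical; the types and names are invented for illustration and correspond to no real Apple Intelligence API.

```swift
// Hypothetical sketch: choosing which personal-context items fit in a fixed
// token budget. None of these types correspond to real Apple APIs.
import Foundation

struct ContextItem {
    let source: String          // e.g. "Messages", "Calendar", "Maps", "Health"
    let text: String            // the snippet the model might see
    let lastUpdated: Date
    let sourceUsageRank: Int    // 1 = most-used app, larger = less used
    var tokenCount: Int { text.split(separator: " ").count }  // crude proxy
}

/// Score favors recently updated items from frequently used sources, but
/// "crucial" sources (Health, Maps) get a floor so they aren't dropped just
/// because they're rarely opened.
func score(_ item: ContextItem, now: Date = .init()) -> Double {
    let ageHours = now.timeIntervalSince(item.lastUpdated) / 3600
    let recency = 1.0 / (1.0 + ageHours)               // decays as items go stale
    let usage = 1.0 / Double(item.sourceUsageRank)      // top apps weigh more
    let crucialFloor = ["Health", "Maps"].contains(item.source) ? 0.3 : 0.0
    return max(0.6 * recency + 0.4 * usage, crucialFloor)
}

/// Greedily pack the highest-scoring items into the context window.
func selectContext(from items: [ContextItem], tokenBudget: Int) -> [ContextItem] {
    var remaining = tokenBudget
    var chosen: [ContextItem] = []
    for item in items.sorted(by: { score($0) > score($1) }) {
        guard item.tokenCount <= remaining else { continue }
        chosen.append(item)
        remaining -= item.tokenCount
    }
    return chosen
}
```

Even this toy version makes the difficulty visible: the weights, the floor, and the budget are all guesses about what will matter to a given person, and those guesses have to keep being right as that person’s life changes.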
But for awareness and control of the device, I think the challenge is even greater. AppleScript used to allow users to control apps through their interfaces easily, but that was very different from what’s being attempted with LLMs. AppleScript required a human who understood the software and interface to specify, “top-down,” as it were, what the computer should do; LLMs will need to infer functionality, UI elements’ meanings, flows, etc., from their training data. For some percentage of top apps, this may come easily, but the long tail —the last mile— will resist this. (And of course, hallucinations will persist; the impact of hallucination on these types of features is incredible to contemplate, and handling them well —e.g. how to manage when Apple Intelligence does inexplicably errant things with your apps or personal information— is wild to imagine designing for.)
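As one way of picturing what “handling hallucinations well” might even involve, here is a toy sketch of a single narrow guardrail: the model proposes an action on a named UI element, and the system refuses to execute anything that doesn’t resolve to an element the app actually exposes. Again, these types are invented for illustration; they are not real iOS APIs, and this is only a sketch of one possible defense, not a description of anything Apple has built.

```swift
// Hypothetical sketch: validating a model-proposed UI action against what the
// app actually exposes, so a hallucinated button never gets "tapped".
struct AppUIElement {
    let identifier: String          // e.g. "send_button"
    let actions: Set<String>        // e.g. ["tap"], ["type_text"]
}

struct ProposedAction {
    let elementIdentifier: String   // what the LLM claims exists
    let action: String              // what it wants to do
}

enum ActionError: Error {
    case noSuchElement(String)      // the model hallucinated the element
    case unsupportedAction(String)  // the element exists, the action doesn't
}

/// Only an action that resolves to a real element with a real capability gets
/// through; everything else is rejected before it touches the app.
func validate(_ proposal: ProposedAction, against elements: [AppUIElement]) throws -> AppUIElement {
    guard let element = elements.first(where: { $0.identifier == proposal.elementIdentifier }) else {
        throw ActionError.noSuchElement(proposal.elementIdentifier)
    }
    guard element.actions.contains(proposal.action) else {
        throw ActionError.unsupportedAction(proposal.action)
    }
    return element  // only now would the system actually perform the action
}
```

A check this blunt only catches the cheapest failure, a tap on an element that doesn’t exist; the harder case is an action that is perfectly valid and confidently wrong, and no validator like this can flag that.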
In both of these cases, it’s probably possible today to make something that sort of works: it has some of your personal context, some of the time, and it only errs occasionally; or it can use some apps well, most of the time, but many apps it cannot use well a lot of the time. And so on. If Apple is at a disadvantage, it’s that there’s a likely tension between, on the one hand, their relatively high standards of reliability and their concerns around personal data and, on the other, this sort of feature’s performance. But I think it’s actually not clear yet that they’re mistaken in erring on the side of caution! Enthusiasts want to see these tools at whatever level of reliability they’ve reached, but Apple’s also got institutional memory of e.g. Newton’s handwriting recognition and the press around it. And unreliable LLM “awareness and control” of iPhone, with unreliable “personal context,” could be the sort of thing people find appalling, enraging, laughable; and it could even do real damage, as iPhones have incredibly important information on them and are integrated with much of our lives.
It wouldn’t surprise me if Apple executives —like millions of onlookers— simply didn’t reckon on how long that last mile is: they saw rapid gains in the capacities of their LLMs and assumed that “given current rates of improvement, by early 2025 this will be able to do it all” and only discovered as they went that for many kinds of functionality, their LLM(s) aren’t close. They may have been as surprised as anyone that their ads turned out to be bullshit! Within the communities of people working on these products, there is no limit to the optimism, and almost no reckoning with what seem to be persistent problems and even, possibly, real theoretical boundaries to what we should expect. If Apple believes that other LLM companies are further ahead, they can simply buy one; they have the cash. But perhaps the reason they’re taking this on the chin is that they know these particular functions are not commodified, and certainly not in ways that can be instantiated on-device anyway. In other words: this marketing gaffe may have, as many speculate, brand consequences, but it will have no strategic technological consequences. If LLMs can be made that can do this, Apple will make or buy them and integrate them where it would matter; and if they can’t, the entire affair is a whiff, but a whiff duplicated by many of the smartest people in all companies and scenes: assuming that inference will lead to a flexible generalizability without limit.
Whatever the case, I think Apple remains better-positioned than any other company to make money off of LLMs: their devices are where all the information an LLM needs to be truly useful already resides. I don’t think they’re “falling behind” as much as people assume from this debacle; no product I can buy provides an LLM that understands me, my life, my communications, my locations, my software usage, and ties it all together and interacts on my behalf with it all as I want. And other LLM companies are in a rapidly commodifying space themselves, with no obvious means of getting access to that wealth of personalizing information, or of “taking over” your device usage, especially on iOS, where apps cannot use other apps. If these features are even possible to engineer, only Apple can engineer them; and if they are not, Apple is far from the only company fooled by rates of progress in LLM development, far from alone in forgetting that sometimes, the last mile is the mile that matters most.
1. A quick recap: Apple announced and heavily marketed a variety of “AI” features, the most important of which have not shipped and have now been significantly delayed; the gap between what they announced and what they’ve shipped, or seem likely to ship soon, is extraordinarily large; at present, Apple Intelligence is minimally useful or important to users, and is mostly gimmicky and uncompetitive, whereas the promised functionality would have been profound, changing how users interact with iPhones to an extent that’s hard to overstate and changing e.g. the commercial dynamics around many categories of apps.
2. The median American isn’t a very good driver, either! How people hollering about imminent AGI accommodate this fact is unknown to me; they may have a great answer, but why we’d expect LLMs to replace lawyers when they can’t replace me driving to my daughter’s school is mysterious.
3. This meta-management problem is also fairly philosophically rich to contemplate, incidentally. “What matters to a given person” is often hard for a spouse to fully grasp, let alone describe, and the idea that we’ll simply infer it seems to strain credulity.