In a recently published research article, Apple’s AI researchers describe a system that would give Siri many more skills. The researchers claim it scores better than GPT-4, the model behind tools such as ChatGPT. In the article, Apple describes a voice assistant that can perform many more useful tasks using a Large Language Model (LLM). They call the model ReALM, short for Reference Resolution As Language Modeling. ReALM takes into account what’s on your screen, but also looks at which tasks are active and which entities are relevant to a conversation.
Needs a lot of context
One of those factors is the set of entities that are relevant to a conversation. When people speak naturally, they sometimes use vague phrasing such as: “Can you grab this for me?” A human listener can work out what is meant from the situation (a hand extended, or the direction the speaker is looking). A computer has much more difficulty with this.
Apple may have found a solution by developing a model that takes into account all kinds of entities that are not explicitly named. These can be background tasks, data on a screen, and other things. If a user is cooking, a housemate will know that cooking utensils or an ingredient probably need to be grabbed. If an iPad shows the recipe for a cocktail on its screen, something like a cocktail shaker or a bottle of spirits will be needed. And if an alarm sounds in the background, there may be a greater need for a fire extinguisher or fire blanket. Traditional methods require much larger models and much more reference material, but Apple says it has streamlined the approach by recognizing the relevant factors and converting them into text.
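The article does not include code, but the core idea, recognizing contextual entities and serializing them into plain text that a language model can parse, can be sketched roughly like this. Note that the `Entity` class, the category names, and the example labels are invented for illustration; this is not Apple’s actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    kind: str   # hypothetical category, e.g. "on_screen", "background", "conversational"
    label: str  # human-readable description of the entity

def context_to_text(entities: list[Entity]) -> str:
    """Serialize contextual entities into a numbered plain-text block
    that can be placed in a language model's prompt."""
    lines = [f"{i}. [{e.kind}] {e.label}" for i, e in enumerate(entities, start=1)]
    return "Context entities:\n" + "\n".join(lines)

entities = [
    Entity("on_screen", "cocktail recipe: margarita"),
    Entity("background", "kitchen timer alarm ringing"),
    Entity("conversational", "the shopping list we discussed"),
]
print(context_to_text(entities))
```

The point of the sketch is only that once context is flattened into text, a single text-only model can reason about it, which is the streamlining the article describes.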
Smarter Siri: is it really going to happen?
In this way, Apple can build a smarter and more useful Siri. According to the researchers, this assistant can perform tasks much faster than the competition. As comparison material, they used the ChatGPT variants based on GPT-3.5 and GPT-4, the current standards. Since developments in AI move fast, it is worth noting that the comparative test was conducted on January 24, 2024; both models may have improved since then.
Both variants had to predict a list of entities from a set that had been provided in advance. GPT-3.5 accepts only text, while GPT-4 also accepts images. Apple supplied screenshots and noticed that the old saying “a picture is worth a thousand words” applies here too: according to Apple, the models performed considerably better with them.
Apple says the smallest ReALM model achieves performance comparable to GPT-4, while the larger ReALM models significantly outperform it. Apple’s system requires fewer parameters to parse contextual data. The model shows that giving commands to Siri can be faster and more efficient by converting the given context into text, which is easier for a Large Language Model to parse. According to the researchers, the reason Apple’s model performs better is that GPT-4 relies on processing images to understand the information on the screen.
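To make the “convert context to text, then parse the model’s answer” loop concrete, here is a minimal hypothetical sketch: candidate entities are numbered, the model is asked to reply with the numbers of the relevant ones, and its reply is mapped back onto the entities. The candidate list and the stubbed reply are made up for illustration; a real system would call an actual LLM at the marked step:

```python
import re

def parse_entity_answer(answer: str, candidates: list[str]) -> list[str]:
    """Map a model's numeric reply (e.g. "2" or "1 and 3") back onto
    the numbered candidate entities it refers to."""
    picked = []
    for match in re.findall(r"\d+", answer):
        idx = int(match) - 1
        if 0 <= idx < len(candidates):
            picked.append(candidates[idx])
    return picked

candidates = ["pizzeria on Main St", "pharmacy on 2nd Ave", "alarm at 7:00"]
# A real system would send the numbered candidates plus the user's
# request ("Call the second one") to an LLM here; we stub its reply.
model_answer = "2"
print(parse_entity_answer(model_answer, candidates))  # → ['pharmacy on 2nd Ave']
```

Because both the context and the answer are plain text, the whole exchange stays within what a compact text-only model can handle, which matches the article’s claim about needing fewer parameters.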
AI functions on the device itself
Apple wanted to demonstrate that it is possible to develop an on-device system that performs just as well as the online models. This is important for Apple to guarantee user privacy. For example, if you use the system for health questions, you do not want your medical details to end up in an online system, even if it is only used for training purposes.
Hopefully we will hear more about this when iOS 18 is announced on June 10, and ideally it will soon also be usable in Dutch rather than only in English and Chinese. It also remains to be seen whether Apple will immediately deploy the larger on-device models, or start with smaller models that deliver less impressive performance.