The Indian Express | May 19, 2024

AI’s ‘Her’ moment: OpenAI’s GPT-4o and Google’s Project Astra make real-life strides

“Well, you seem like a person, but you’re just a voice in a computer.” “I can understand how the limited perspective of an un-artificial mind would perceive it that way. You’ll get used to it.”

Theodore Twombly fell in love with Samantha, a digital assistant, in the 2013 hit ‘Her’. The lifelike assistant, voiced by Scarlett Johansson, displayed a sense of humour, intelligence, and empathy that made her seem human to Theodore (played by Joaquin Phoenix).

But last week, when OpenAI showcased the strides it has made with its new GPT-4o (the ‘o’ standing for ‘omni’), it signalled that such artificial intelligence (AI)-based assistants are no longer merely the stuff of science fiction films and literature.

And a day later, when Google showed the progress it has made on its virtual assistant, it marked a tangible direction that AI could take for end users: creating lifelike assistants that can be helpful in a number of real-life scenarios, from suggesting how one can comb their hair by looking at their picture to empathising with them.

Siri and Alexa never really managed to cement their place as useful digital assistants, primarily due to their inability to pick up on the nuances of conversation. But Google and OpenAI’s new announcements could change what it means to be a virtual assistant altogether.

For a large part of the population facing a loneliness crisis, it remains to be seen what shape such assistants take and what place they occupy in people’s lives. And of course, there are questions about the lens through which they are made: the assistants’ voices in the demos were those of women, lending to the idea of how technologies developed in patriarchal societies are likely to view women. These are issues that will need to be contended with as such assistants reach the phones and computers of more people in the coming years.

But that aside, OpenAI says that its new model accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which the company says is similar to human response time in a conversation.

“…we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations,” OpenAI said in a blog post.
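To give a sense of what this multimodality looks like in practice, here is a minimal, illustrative sketch of sending a combined text-and-image prompt to GPT-4o through OpenAI’s public Python SDK. The prompt text and image URL are placeholders, and the audio and video capabilities shown in the demos are not reflected in this basic text-and-image call.

```python
# Illustrative sketch only: a single request mixing text and an image,
# answered by GPT-4o as text. The image URL and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Text part of the prompt
                {"type": "text", "text": "How should I tidy my hair before an interview?"},
                # Image part of the prompt (placeholder URL)
                {"type": "image_url", "image_url": {"url": "https://example.com/selfie.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```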

In a demo video released by OpenAI, its assistant responded almost instantly to questions, could sing, and offered tips on how a person could comb their hair before an interview by looking at their face through the phone’s front camera.

Two different paths

At Google I/O, the company’s annual developer conference, Google showed that, contrary to common perception, it had not fallen behind OpenAI in the AI race. It presented a very early version of what it hopes could become a universal smartphone assistant.

Google is calling it Project Astra: a real-time, multimodal AI assistant that can see the world around it, remember where one has left things, and even tell whether a piece of computer code is correct by looking at it through the phone’s camera.

In a demo video shared by Google, an Astra user in Google’s London office asks the system to identify a part of a speaker, find their missing glasses, review code, and more. It all works practically in real time and in a very conversational way.

There are, however, some fundamental differences in the approach that OpenAI and Google have taken. OpenAI’s assistant displayed a wide range of emotions and tonalities in its voice, from slight giggles to subdued whispers, depending on what was being asked of it. In contrast, Google’s assistant was more straightforward, with no range of emotional diversity in its voice.

Early days

While the developments are fascinating because of how tangible they feel, it is still early days for the technology, and it is not without its share of limitations and challenges.

OpenAI, for instance, said that GPT-4o is still in the early stages of exploring the potential of unified multimodal interaction, meaning certain features like audio outputs are initially accessible in a limited form only, with preset voices.

The company said that further development and updates are necessary to fully realise its potential in handling complex multimodal tasks seamlessly. “GPT-4o has also undergone extensive external red teaming with 70+ external experts in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered,” OpenAI said.
