On Wednesday, Google unveiled its generative artificial intelligence platform Gemini, the latest entrant in a growing field of prompt-driven AI chatbots that includes the wildly popular ChatGPT, and the culmination of years of development and tens of billions of dollars in investment.
To show off Gemini, which Google and Alphabet CEO Sundar Pichai described as having “state-of-the-art performance across many leading benchmarks,” the search giant dropped a mind-blowing sizzle reel that features a human prompter having a real-time, video- and voice-enabled interaction with Gemini. The six-minute video appears to display the AI-driven chatbot’s abilities to “see” and describe drawings and objects, track attempts at sleight-of-hand tricks, pose potential projects for materials it’s shown and even recognize a video clip of someone acting out the motions from the famous “bullet dodge” scene in the film “The Matrix.”
It turns out, however, that there’s a good bit of artifice in this portrayal of the purported latest and greatest in artificial intelligence development.
The first clue that all may not be as it appears can be found in a disclaimer in the video’s description box, which reads, “For the purposes of this demo, latency has been reduced and Gemini outputs have been shortened for brevity.”
But a deeper investigation by Bloomberg columnist Parmy Olson revealed that the demo also wasn’t carried out in real time or by voice. Google told Olson that the video was actually assembled “using still image frames from the footage, and prompting via text,” and the company pointed to a site showing how others could interact with Gemini using photos of their hands, drawings or other objects.
“In other words, the voice in the demo was reading out human-made prompts they’d made to Gemini, and showing them still images,” Olson wrote. “That’s quite different from what Google seemed to be suggesting: that a person could have a smooth voice conversation with Gemini as it watched and responded in real time to the world around it.”
A report on the video by Ars Technica notes that even if Gemini’s capabilities were more accurately displayed in the video, the platform’s image recognition abilities are “nothing to sneeze at” and roughly on par with the capabilities of OpenAI’s multimodal GPT-4V, which also has “vision” capabilities.
In a Thursday post on X, formerly known as Twitter, Oriol Vinyals, vice president of research and deep learning lead at Google’s AI-focused DeepMind group, which helped develop the Gemini platform, wrote that the video is ultimately an accurate record of what Gemini can do and was meant to “inspire” developers to engage with the tool.
“All the user prompts and outputs in the video are real, shortened for brevity,” Vinyals tweeted. “The video illustrates what the multimodal user experiences built with Gemini could look like. We made it to inspire developers.”