Introduction to Gemini Robotics
On Wednesday, Google DeepMind announced two new AI models designed to control robots: Gemini Robotics and Gemini Robotics-ER. The company claims these models will help robots of many shapes and sizes understand and interact with the physical world more effectively and delicately than previous systems, paving the way for applications such as humanoid robot assistants.
The Challenge of Embodied AI
It’s worth noting that even though hardware for robot platforms appears to be advancing at a steady pace, creating a capable AI model that can pilot these robots autonomously through novel scenarios with safety and precision has proven elusive. What the industry calls "embodied AI" is a moonshot goal of Nvidia, for example, and it remains a holy grail that, if achieved, could turn robots into general-purpose laborers in the physical world.
How Gemini Robotics Works
Google’s new models build upon its Gemini 2.0 large language model foundation, adding capabilities specifically for robotic applications. Gemini Robotics includes what Google calls "vision-language-action" (VLA) abilities, allowing it to process visual information, understand language commands, and generate physical movements. By contrast, Gemini Robotics-ER focuses on "embodied reasoning" with enhanced spatial understanding, letting roboticists connect it to their existing robot control systems.
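To make the "vision-language-action" idea concrete, here is a minimal toy sketch of the three stages the article describes: perceive a scene, parse a language command, and emit an action. None of these function names or data structures come from Google's actual API; everything here is invented purely for illustration, with the camera scene mocked as a dictionary of object positions.

```python
from dataclasses import dataclass

@dataclass
class Action:
    gripper: str       # e.g. "open" or "close"
    target_xy: tuple   # planar position of the target object

def perceive(scene: dict) -> dict:
    """Vision stage: map a (mocked) camera scene to named object positions."""
    return dict(scene)

def parse_command(command: str, objects: dict) -> str:
    """Language stage: find which known object the command refers to."""
    for name in objects:
        if name in command:
            return name
    raise ValueError("no known object mentioned in command")

def plan_action(target: str, objects: dict) -> Action:
    """Action stage: emit a grasp aimed at the target's position."""
    return Action(gripper="close", target_xy=objects[target])

# Usage: a pick-up command against a mocked two-object scene.
scene = {"banana": (0.4, 0.2), "basket": (0.7, 0.5)}
objects = perceive(scene)
target = parse_command("pick up the banana and put it in the basket", objects)
action = plan_action(target, objects)
print(action.gripper, action.target_xy)  # close (0.4, 0.2)
```

The real system, of course, does all three stages with a single learned model rather than hand-written rules; the sketch only shows how the stages compose.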
Examples of Gemini Robotics in Action
For example, with Gemini Robotics, you can ask a robot to "pick up the banana and put it in the basket," and it will use a camera view of the scene to recognize the banana, guiding a robotic arm to perform the action successfully. Or you might say, "fold an origami fox," and it will use its knowledge of origami and how to fold paper carefully to perform the task.
Advancements Over Previous Models
In 2023, Google’s RT-2 represented a notable step toward more generalized robotic capabilities by using Internet data to help robots understand language commands and adapt to new scenarios, roughly doubling performance on unseen tasks compared to its predecessor. Two years later, Gemini Robotics appears to have made another substantial leap forward, not just in understanding what to do but in executing complex physical manipulations that RT-2 explicitly couldn’t handle.
Better Generalized Results
According to DeepMind, the new Gemini Robotics system demonstrates much stronger generalization, or the ability to perform novel tasks that it was not specifically trained to do, compared to its previous AI models. In its announcement, the company claims Gemini Robotics "more than doubles performance on a comprehensive generalization benchmark compared to other state-of-the-art vision-language-action models." Generalization matters because robots that can adapt to new scenarios without specific training for each situation could one day work in unpredictable real-world environments.
Partnerships and Future Applications
Google is attempting to make the real thing: a generalist robot brain. With that goal in mind, the company announced a partnership with Austin, Texas-based Apptronik to "build the next generation of humanoid robots with Gemini 2.0." While trained primarily on a bimanual robot platform called ALOHA 2, Google states that Gemini Robotics can control different robot types, from research-oriented Franka robotic arms to more complex humanoid systems like Apptronik’s Apollo robot.
Safety and Limitations
For safety considerations, Google mentions a "layered, holistic approach" that maintains traditional robot safety measures like collision avoidance and force limitations. The company describes developing a "Robot Constitution" framework inspired by Isaac Asimov’s Three Laws of Robotics and releasing a dataset called "ASIMOV" to help researchers evaluate safety implications of robotic actions.
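The "layered" approach can be pictured as a classic safety gate sitting between a model's proposed action and the hardware: the action only reaches the controller if it passes traditional checks such as workspace bounds and force limits. The sketch below is hypothetical; the thresholds, field names, and structure are invented for illustration and do not reflect Google's implementation.

```python
# Invented limits for illustration only.
MAX_FORCE_N = 15.0                     # cap on commanded force, in newtons
WORKSPACE = ((0.0, 1.0), (0.0, 1.0))   # allowed x/y bounds, in meters

def is_safe(action: dict) -> bool:
    """Classic rule-based layer: bounds check plus force limit."""
    x, y = action["target_xy"]
    (xmin, xmax), (ymin, ymax) = WORKSPACE
    in_bounds = xmin <= x <= xmax and ymin <= y <= ymax
    within_force = action["force_n"] <= MAX_FORCE_N
    return in_bounds and within_force

def execute(action: dict) -> str:
    """Forward the action to the (mocked) controller only if it is safe."""
    return "executed" if is_safe(action) else "rejected"

print(execute({"target_xy": (0.4, 0.2), "force_n": 5.0}))   # executed
print(execute({"target_xy": (0.4, 0.2), "force_n": 50.0}))  # rejected
```

The point of layering is that this gate runs regardless of what the learned model proposes, so a bad model output cannot bypass the hard limits.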
Conclusion
Google’s new Gemini Robotics and Gemini Robotics-ER models represent a significant step forward in the development of embodied AI. With their ability to understand and interact with the physical world, these models have the potential to enable a wide range of applications, from humanoid robot assistants to general-use laborers. While there are still challenges to overcome, Google’s advancements in this field are an exciting development for the future of robotics.
FAQs
Q: What are Gemini Robotics and Gemini Robotics-ER?
A: Gemini Robotics and Gemini Robotics-ER are two new AI models designed to control robots, developed by Google DeepMind.
Q: What is embodied AI?
A: Embodied AI refers to AI systems that can perceive, reason about, and act in the physical world through a robot body, rather than operating purely on text or images.
Q: What are the potential applications of Gemini Robotics?
A: The potential applications of Gemini Robotics include humanoid robot assistants, general-use laborers, and other tasks that require robots to understand and interact with the physical world.
Q: How does Gemini Robotics differ from previous AI models?
A: Gemini Robotics differs from previous AI models in its ability to execute complex physical manipulations and its stronger generalization capabilities.
Q: What is the "Robot Constitution" framework?
A: The "Robot Constitution" framework is a set of principles developed by Google that is inspired by Isaac Asimov’s Three Laws of Robotics, designed to ensure the safe and responsible development of robots.