An AI Model That “Understands Space”
Most AI models are great at looking at images, writing text, and analyzing data, but ask one to “move that red cup on the left 15 cm to the right” and it will probably be confused.
That’s exactly what Gemini Robotics-ER 1.6 is built to solve.
Google has officially opened this model to developers via the Gemini API and Google AI Studio. ER stands for Embodied Reasoning: the model doesn’t just understand images, it grasps object positions, spatial relationships, and possible physical actions in three-dimensional space.
For developers, this is a tool worth diving into.
Core Capabilities of Robotics-ER 1.6
Spatial Reasoning
Robotics-ER 1.6 can estimate the relative positions and depth relationships of objects from a single RGB image or camera stream. This isn’t achieved with additional depth sensors; the model itself has learned to reason about spatial structure from vision alone.
Practical implication: robots don’t need expensive LiDAR or stereo cameras; a regular camera is enough for the AI to understand scene geometry.
Manipulation Planning
Given a goal (“arrange the scattered blocks into a line”), the model can output a series of decomposed action steps, including:
- Which object to grasp
- Which angle to approach from
- Which target position to move to
- When to release
These outputs aren’t natural language descriptions—they’re structured command formats that robot control systems can directly parse.
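As a purely illustrative example (the field names and units below are made up, not a fixed schema of the model), a plan for the block-alignment goal above might parse into something like this:

```python
# Hypothetical action sequence, shown as a Python literal for readability.
# The actual structure is whatever your prompt asks for.
plan = [
    {
        "step": 1,
        "object": "blue block",
        "grasp_angle_deg": 90,                 # approach from above
        "target_position_m": [0.30, 0.10, 0.02],
        "release_when": "within 5 mm of target",
    },
    {
        "step": 2,
        "object": "red block",
        "grasp_angle_deg": 90,
        "target_position_m": [0.35, 0.10, 0.02],
        "release_when": "within 5 mm of target",
    },
]
```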
Multimodal Input Integration
Robotics-ER 1.6 can simultaneously accept:
- Visual input (images, video frames)
- Text instructions
- Sensor values (temperature, force, acceleration, etc.)
And output reasoning results that integrate spatial understanding, which is much closer to real-world needs than pure visual classification.
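Jumping ahead slightly to the API calls shown below, a mixed request could look like this sketch using the google-genai Python SDK; the sensor field names and file name are made up:

```python
import json

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Hypothetical sensor readings from the robot, serialized as text alongside the image.
sensor_state = {
    "gripper_force_n": 2.4,
    "wrist_acceleration_ms2": [0.0, 0.1, 9.8],
    "ambient_temperature_c": 23.5,
}

with open("gripper_view.jpg", "rb") as f:
    frame = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.6",  # model name as used in this article
    contents=[
        types.Part.from_bytes(data=frame, mime_type="image/jpeg"),
        "Given this camera frame and the sensor readings below, "
        "is the object slipping out of the gripper? Sensor state: "
        + json.dumps(sensor_state),
    ],
)
print(response.text)
```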
How Do Developers Connect to the API?
Quick Start
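A minimal quick-start sketch using the google-genai Python SDK (`pip install google-genai`); the image file and prompt are placeholders, and the model name follows this article:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Any ordinary RGB photo works; no depth sensor required.
with open("workbench.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.6",  # model name as given in this article
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Where is the red cup relative to the keyboard, "
        "and which of the two is closer to the camera?",
    ],
)
print(response.text)
```

The response comes back as ordinary text, so you can start with free-form spatial questions before asking for structured output.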
Robot Manipulation Command Output
For scenarios requiring structured output, you can use a System Prompt to guide the model to output JSON-formatted action sequences:
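For example, something along these lines; the JSON schema named in the system prompt is one you define yourself, not a fixed output format of the model:

```python
import json

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# The schema below is entirely up to you; the model follows whatever
# structure the system prompt asks for.
SYSTEM_PROMPT = (
    "You are a robot manipulation planner. Respond only with a JSON array of "
    "steps, each containing: object, grasp_angle_deg, target_position_m, "
    "release_condition."
)

with open("table_scene.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.6",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Arrange the scattered blocks into a line.",
    ],
    config=types.GenerateContentConfig(
        system_instruction=SYSTEM_PROMPT,
        response_mime_type="application/json",  # ask for raw JSON back
    ),
)

plan = json.loads(response.text)
for step in plan:
    print(step["object"], "->", step["target_position_m"])
```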
Real-time Streaming Scenario
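A sketch of consuming the response as a stream, so downstream code can react to the first chunks instead of waiting for the full reply (file name and prompt are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("camera_frame.jpg", "rb") as f:
    frame_bytes = f.read()

# Stream the reply chunk by chunk so control logic can start reacting
# to the first tokens.
for chunk in client.models.generate_content_stream(
    model="gemini-robotics-er-1.6",
    contents=[
        types.Part.from_bytes(data=frame_bytes, mime_type="image/jpeg"),
        "Is the path to the shelf on the right clear of obstacles?",
    ],
):
    print(chunk.text, end="", flush=True)
```

In a real control loop you would grab the latest camera frame on each iteration rather than reading a file from disk.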
Real-world Application Scenarios
Industrial Automation: Visual-guided Grasping
Traditional industrial robots grab objects at fixed coordinates and fail when object positions shift. Robotics-ER lets robots “see” where objects actually are in the moment and dynamically adjust their grasping paths, which is particularly valuable for mixed-line production and irregular incoming materials.
Warehouse Logistics: Flexible Sorting
E-commerce warehouse items come in countless shapes and sizes. Robotics-ER’s manipulation planning can automatically select optimal grasping strategies based on object geometry, without needing to program each SKU individually.
AR/MR Development: Spatial Annotation
When developing applications for AR devices like Apple Vision Pro or Meta Quest, you need to precisely position virtual objects in real space. Robotics-ER’s spatial understanding helps AR applications more accurately comprehend the user’s environment.
Drone Navigation: Scene Awareness
Indoor drones and low-altitude autonomous flyers need visual scene understanding when GPS signals are unstable. Robotics-ER’s spatial reasoning enables environmental understanding expressed in natural language, such as seeing a door and judging whether the drone can fit through it.
How Does It Compare to Other Models?
| Capability | Regular Gemini Pro | Gemini Vision | Robotics-ER 1.6 |
|---|---|---|---|
| Image Understanding | ✅ | ✅ | ✅ |
| Text Reasoning | ✅ | ✅ | ✅ |
| Spatial Relationship Understanding | ❌ | Limited | ✅ |
| Depth Estimation | ❌ | ❌ | ✅ |
| Manipulation Action Planning | ❌ | ❌ | ✅ |
| Sensor Data Integration | ❌ | ❌ | ✅ |
Robotics-ER isn’t meant to replace existing models; it adds a new dimension for specific scenarios, particularly applications that need to understand the physical world.
Limitations and Notes
A few things developers should keep in mind:
Latency Issues: Spatial reasoning requires more computation than regular text reasoning, so API response times are longer. For control loops requiring real-time feedback (<100 ms), you’ll still need to pair it with lightweight models at the edge.
Still in Restricted Access: Not all developers can get full functionality immediately. Some advanced features (like manipulation command output) require an application process.
Accuracy Depends on Training Data: The model performs better in common scenarios (tabletops, warehouses, kitchens); highly specialized industrial scenarios still require fine-tuning or few-shot prompting.
Doesn’t Directly Control Hardware: Robotics-ER outputs reasoning results—actual robot control needs to be implemented with ROS 2, robot SDKs, or custom controllers.
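As a rough, unofficial sketch of that hand-off, here is one way to push a single step of a JSON plan (like the hypothetical one earlier) to a ROS 2 topic with rclpy; the topic name, frame id, and plan schema are all assumptions about your robot stack:

```python
import rclpy
from geometry_msgs.msg import PoseStamped

# One step from a hypothetical plan returned by the model; the schema is
# whatever your prompt asked for, not something fixed by the API.
step = {"object": "blue block", "target_position_m": [0.30, 0.10, 0.02]}

rclpy.init()
node = rclpy.create_node("er_plan_bridge")

# Topic name and message type depend on your robot stack; publishing a
# PoseStamped goal is just one common pattern.
publisher = node.create_publisher(PoseStamped, "/arm/goal_pose", 10)

msg = PoseStamped()
msg.header.frame_id = "base_link"
msg.header.stamp = node.get_clock().now().to_msg()
msg.pose.position.x, msg.pose.position.y, msg.pose.position.z = step["target_position_m"]
msg.pose.orientation.w = 1.0  # identity orientation as a placeholder

publisher.publish(msg)
node.get_logger().info(f"Published goal for {step['object']}")

node.destroy_node()
rclpy.shutdown()
```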
Try It Now
- Go to Google AI Studio
- Select the model gemini-robotics-er-1.6
- Upload an image containing objects
- Enter spatial reasoning or manipulation planning questions
Even without robot hardware, you can test spatial reasoning capabilities with simulated images.
What This Means for Developers
The significance of Gemini Robotics-ER 1.6 opening its API is that spatial visual reasoning, previously affordable only for large robotics companies, is now accessible to every developer in API form.
You don’t need to train your own spatial perception models or hire machine learning engineers; as long as you can call a REST API, you can add the ability to “understand the 3D world” to your applications.
This isn’t science fiction—it’s a tool you can start experimenting with today.
This article is based on Google’s official announcements. Technical details and API interfaces are subject to Google AI Studio documentation.