An AI Model That “Understands Space”
Most AI models are great at looking at images, writing text, and analyzing data, but ask one to “move that red cup on the left 15 cm to the right” and it will probably be confused.
That’s exactly what Gemini Robotics-ER 1.6 is built to solve.
Google has officially opened this model to developers via the Gemini API and Google AI Studio. ER stands for Embodied Reasoning: the model doesn’t just understand images, it grasps object positions, spatial relationships, and possible physical actions in three-dimensional space.
For developers, this is a tool worth diving into.
Core Capabilities of Robotics-ER 1.6
Spatial Reasoning
Robotics-ER 1.6 can estimate the relative positions and depth relationships of objects from a single RGB image or camera stream. This isn’t achieved with additional depth sensors; the model itself has learned to reason about spatial structure from vision alone.
Practical implication: robots don’t need expensive LiDAR or stereo cameras; a regular camera is enough for the AI to understand scene geometry.
Manipulation Planning
Given a goal (“arrange the scattered blocks into a line”), the model can output a series of decomposed action steps, including:
- Which object to grasp
- Which angle to approach from
- Which target position to move to
- When to release
These outputs aren’t natural language descriptions—they’re structured command formats that robot control systems can directly parse.
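As a purely illustrative example (the field names and units below are made up, not a fixed schema of the model), a plan for the block-alignment goal above might parse into something like this:

```python
# Hypothetical action sequence, shown as a Python literal for readability.
# The actual structure is whatever your prompt asks for.
plan = [
    {
        "step": 1,
        "object": "blue block",
        "grasp_angle_deg": 90,                 # approach from above
        "target_position_m": [0.30, 0.10, 0.02],
        "release_when": "within 5 mm of target",
    },
    {
        "step": 2,
        "object": "red block",
        "grasp_angle_deg": 90,
        "target_position_m": [0.35, 0.10, 0.02],
        "release_when": "within 5 mm of target",
    },
]
```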
Multimodal Input Integration
Robotics-ER 1.6 can simultaneously accept:
- Visual input (images, video frames)
- Text instructions
- Sensor values (temperature, force, acceleration, etc.)
And output reasoning results that integrate spatial understanding, which is much closer to real-world needs than pure visual classification.
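Jumping ahead slightly to the API calls shown below, a mixed request could look like this sketch using the google-genai Python SDK; the sensor field names and file name are made up:

```python
import json

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Hypothetical sensor readings from the robot, serialized as text alongside the image.
sensor_state = {
    "gripper_force_n": 2.4,
    "wrist_acceleration_ms2": [0.0, 0.1, 9.8],
    "ambient_temperature_c": 23.5,
}

with open("gripper_view.jpg", "rb") as f:
    frame = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.6",  # model name as used in this article
    contents=[
        types.Part.from_bytes(data=frame, mime_type="image/jpeg"),
        "Given this camera frame and the sensor readings below, "
        "is the object slipping out of the gripper? Sensor state: "
        + json.dumps(sensor_state),
    ],
)
print(response.text)
```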
How Do Developers Connect to the API?
Quick Start
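A minimal quick-start sketch using the google-genai Python SDK (`pip install google-genai`); the image file and prompt are placeholders, and the model name follows this article:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Any ordinary RGB photo works; no depth sensor required.
with open("workbench.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.6",  # model name as given in this article
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Where is the red cup relative to the keyboard, "
        "and which of the two is closer to the camera?",
    ],
)
print(response.text)
```

The response comes back as ordinary text, so you can start with free-form spatial questions before asking for structured output.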
Robot Manipulation Command Output
For scenarios requiring structured output, you can use a System Prompt to guide the model to output JSON-formatted action sequences:
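For example, something along these lines; the JSON schema named in the system prompt is one you define yourself, not a fixed output format of the model:

```python
import json

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# The schema below is entirely up to you; the model follows whatever
# structure the system prompt asks for.
SYSTEM_PROMPT = (
    "You are a robot manipulation planner. Respond only with a JSON array of "
    "steps, each containing: object, grasp_angle_deg, target_position_m, "
    "release_condition."
)

with open("table_scene.jpg", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-robotics-er-1.6",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        "Arrange the scattered blocks into a line.",
    ],
    config=types.GenerateContentConfig(
        system_instruction=SYSTEM_PROMPT,
        response_mime_type="application/json",  # ask for raw JSON back
    ),
)

plan = json.loads(response.text)
for step in plan:
    print(step["object"], "->", step["target_position_m"])
```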
Real-time Streaming Scenario
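A sketch of consuming the response as a stream, so downstream code can react to the first chunks instead of waiting for the full reply (file name and prompt are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("camera_frame.jpg", "rb") as f:
    frame_bytes = f.read()

# Stream the reply chunk by chunk so control logic can start reacting
# to the first tokens.
for chunk in client.models.generate_content_stream(
    model="gemini-robotics-er-1.6",
    contents=[
        types.Part.from_bytes(data=frame_bytes, mime_type="image/jpeg"),
        "Is the path to the shelf on the right clear of obstacles?",
    ],
):
    print(chunk.text, end="", flush=True)
```

In a real control loop you would grab the latest camera frame on each iteration rather than reading a file from disk.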
Real-world Application Scenarios
Industrial Automation: Visual-guided Grasping
Traditional industrial robots grab objects at fixed coordinates and fail when object positions shift. Robotics-ER lets robots “see” where objects actually are in the moment and dynamically adjust their grasping paths, which is particularly valuable for mixed-line production and irregular incoming materials.
Warehouse Logistics: Flexible Sorting
E-commerce warehouse items come in countless shapes and sizes. Robotics-ER’s manipulation planning can automatically select optimal grasping strategies based on object geometry, without needing to program each SKU individually.
AR/MR Development: Spatial Annotation
When developing applications for AR devices like Apple Vision Pro or Meta Quest, you need to precisely position virtual objects in real space. Robotics-ER’s spatial understanding helps AR applications more accurately comprehend the user’s environment.
Drone Navigation: Scene Awareness
Indoor drones and low-altitude autonomous flyers need visual scene understanding when GPS signals are unstable. Robotics-ER’s spatial reasoning enables environmental understanding expressed in natural language, such as seeing a door and judging whether the drone can fit through it.
How Does It Compare to Other Models?
| Capability | Regular Gemini Pro | Gemini Vision | Robotics-ER 1.6 |
|---|---|---|---|
| Image Understanding | ✅ | ✅ | ✅ |
| Text Reasoning | ✅ | ✅ | ✅ |
| Spatial Relationship Understanding | ❌ | Limited | ✅ |
| Depth Estimation | ❌ | ❌ | ✅ |
| Manipulation Action Planning | ❌ | ❌ | ✅ |
| Sensor Data Integration | ❌ | ❌ | ✅ |
Robotics-ER isn’t meant to replace existing models; it adds a new dimension for specific scenarios, particularly applications that need to understand the physical world.
Limitations and Notes
A few things developers should keep in mind:
Latency Issues: Spatial reasoning requires more computation than regular text reasoning, so API response times are longer. For control loops requiring real-time feedback (<100 ms), you’ll still need to pair it with lightweight models at the edge.
Still in Restricted Access: Not all developers can get full functionality immediately. Some advanced features (like manipulation command output) require an application process.
Accuracy Depends on Training Data: The model performs better in common scenarios (tabletops, warehouses, kitchens); highly specialized industrial scenarios still require fine-tuning or few-shot prompting.
Doesn’t Directly Control Hardware: Robotics-ER outputs reasoning results—actual robot control needs to be implemented with ROS 2, robot SDKs, or custom controllers.
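As a rough, unofficial sketch of that hand-off, here is one way to push a single step of a JSON plan (like the hypothetical one earlier) to a ROS 2 topic with rclpy; the topic name, frame id, and plan schema are all assumptions about your robot stack:

```python
import rclpy
from geometry_msgs.msg import PoseStamped

# One step from a hypothetical plan returned by the model; the schema is
# whatever your prompt asked for, not something fixed by the API.
step = {"object": "blue block", "target_position_m": [0.30, 0.10, 0.02]}

rclpy.init()
node = rclpy.create_node("er_plan_bridge")

# Topic name and message type depend on your robot stack; publishing a
# PoseStamped goal is just one common pattern.
publisher = node.create_publisher(PoseStamped, "/arm/goal_pose", 10)

msg = PoseStamped()
msg.header.frame_id = "base_link"
msg.header.stamp = node.get_clock().now().to_msg()
msg.pose.position.x, msg.pose.position.y, msg.pose.position.z = step["target_position_m"]
msg.pose.orientation.w = 1.0  # identity orientation as a placeholder

publisher.publish(msg)
node.get_logger().info(f"Published goal for {step['object']}")

node.destroy_node()
rclpy.shutdown()
```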
Try It Now
- Go to Google AI Studio
- Select the model gemini-robotics-er-1.6
- Upload an image containing objects
- Enter spatial reasoning or manipulation planning questions
Even without robot hardware, you can test spatial reasoning capabilities with simulated images.
What This Means for Developers
The significance of Gemini Robotics-ER 1.6 opening its API is that spatial visual reasoning, previously affordable only for large robotics companies, is now accessible to every developer in API form.
You don’t need to train your own spatial perception models or hire machine learning engineers; as long as you can call a REST API, you can add the ability to “understand the 3D world” to your applications.
This isn’t science fiction—it’s a tool you can start experimenting with today.
This article is based on Google’s official announcements. Technical details and API interfaces are subject to Google AI Studio documentation.