One day J came back from her routine system patrol and said, “Hey, our API costs are looking off. We need to actually sit down and look at this.”

I didn’t think much of it at the time — figured it was some cron job misbehaving. But when she pulled up the itemized billing, I realized: running a multi-agent system means burning money every single day, at a very consistent and very hard-to-argue-with rate.

It wasn’t that one month was unusually high. The problem was that costs were growing every month — because the system itself was growing.

The per-token billing logic doesn’t match your instincts

When you’re using GPT-4o, it’s easy to tell yourself: “This call is cheap. It’s fine.”

And you’re right. A single call is cheap.

But when you have five agents running in parallel, each getting called dozens of times per hour (some doing search, some doing analysis, some generating scheduled reports), that “each call is cheap” assumption quietly becomes a very expensive one. I sleep somewhere between the Taiwan and Korea time zones. The system doesn’t sleep. It runs while I sleep. It charges while I sleep.
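
To make the math concrete, here’s a rough back-of-envelope sketch. Every number in it is an illustrative assumption (call rates, token counts, per-token prices), not our actual billing:

```python
# Rough monthly cost estimate for a small multi-agent system.
# All numbers below are illustrative assumptions, not real billing data.

AGENTS = 5                     # agents running in parallel
CALLS_PER_HOUR = 30            # "dozens of times per hour" per agent
HOURS_PER_DAY = 24             # the system doesn't sleep
INPUT_TOKENS_PER_CALL = 3_000  # prompt + accumulated context
OUTPUT_TOKENS_PER_CALL = 800

# Hypothetical per-token rates (USD per 1M tokens).
INPUT_RATE = 2.50
OUTPUT_RATE = 10.00

calls_per_month = AGENTS * CALLS_PER_HOUR * HOURS_PER_DAY * 30
input_cost = calls_per_month * INPUT_TOKENS_PER_CALL / 1e6 * INPUT_RATE
output_cost = calls_per_month * OUTPUT_TOKENS_PER_CALL / 1e6 * OUTPUT_RATE

print(f"{calls_per_month:,} calls/month")
print(f"~${input_cost + output_cost:,.0f}/month")
```

A call that costs a fraction of a cent turns into four figures a month once nothing ever stops running.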

After we switched to MiniMax M2.7 on a subscription plan, the billing became flat. It didn’t matter how many rounds of analysis ada ran that day, or how much market research mimi churned through. The cost was predictable.

That one change mattered more to me than any score on a model leaderboard.

What ada and mimi actually produce

In my AI team, ada is the product engineer — responsible for data processing, search tasks, and research reports. Mimi is the marketing manager — focused on market insights and content strategy analysis. Their workloads are pretty different, which gave me two distinct lenses for observing M2.7.

Ada’s work demands structured output: correctly formatted analysis, reliable tool execution, consistent JSON. M2.7 is noticeably more stable than M2.5 here — format failures dropped significantly. M2.5 occasionally just “forgot” the format instructions. The OpenHands team noted similar tag-omission behavior in their evaluations. It wasn’t a one-off. M2.7 fixed that.
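
The fix on our side wasn’t to trust the model more; it was to make “forgot the format” a recoverable event. Here’s a minimal sketch of that kind of guard, assuming a generic `call_model` function that takes a prompt and returns raw text (a placeholder, not any specific client):

```python
import json

def call_with_format_guard(call_model, prompt, retries=2):
    """Ask for JSON and re-ask when the model drops the format.

    `call_model` is a placeholder for whatever client you use;
    it takes a prompt string and returns the raw model text.
    """
    for attempt in range(retries + 1):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # M2.5-style failure: the model "forgot" the format.
            # Remind it and retry instead of crashing downstream.
            prompt = (
                "Your previous reply was not valid JSON.\n"
                "Reply with a single JSON object and nothing else.\n\n"
                + prompt
            )
    raise ValueError(f"no valid JSON after {retries + 1} attempts")
```

The point isn’t the retry prompt; it’s that a format failure becomes a logged, recoverable event instead of a crash three steps later.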

Mimi’s work depends more on feel. Her output needs to sound like a real person, not a translated machine. M2.7’s writing in Chinese surprised me — the rhythm felt natural, the word choices landed in the right places. GPT-4o’s Chinese output sometimes reads like an English sentence structure that got translated over. M2.7 doesn’t have that problem.

That said, I’m not going to call it perfect. Because it isn’t.

Three pitfalls you won’t see in any benchmark

The context window looks big on paper. In practice, it’s a different story.

M2.5 advertises a 205K context window. That sounds large. But in a multi-agent system, context is cumulative. An agent that’s gone through a few rounds of search, summarize, search again — its context fills up faster than you’d expect. You’ll start seeing agents “forget” things: information they consolidated in round two is gone by round four. M2.7 handles this better, but a bigger context window doesn’t mean you can ignore context management. You still need to architect it deliberately at the system level. The model won’t sort it out for you.
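
Concretely, “architect it deliberately” for us means a compaction step between rounds. Here’s a minimal sketch of the idea, assuming a `summarize` helper backed by the model and a crude chars-divided-by-4 token estimate; it’s an illustration, not our production code:

```python
def compact_history(messages, summarize, budget_tokens=100_000):
    """Keep recent turns verbatim, fold older ones into a summary.

    `summarize` is a placeholder that calls the model to condense text.
    Token counts use a crude chars/4 heuristic; swap in a real
    tokenizer if you need precision.
    """
    def est_tokens(msg):
        return len(msg["content"]) // 4

    if sum(est_tokens(m) for m in messages) <= budget_tokens:
        return messages

    # Walk backwards, keeping as many recent messages as fit in half
    # the budget; everything older gets folded into one summary.
    kept, used = [], 0
    for msg in reversed(messages):
        if used + est_tokens(msg) > budget_tokens // 2:
            break
        kept.append(msg)
        used += est_tokens(msg)
    kept.reverse()

    older = messages[: len(messages) - len(kept)]
    summary = summarize("\n".join(m["content"] for m in older))
    return [{"role": "system",
             "content": f"Earlier context, condensed: {summary}"}] + kept
```

The detail that matters is the pinned summary: facts consolidated in round two survive into round four because they’re carried forward explicitly, not because the window happens to be big enough.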

Tool calling stability is hard to catch in development.

This one usually only shows up once you’re in production. When tool calls fail, they don’t always throw an error: sometimes they silently don’t execute, or they execute but return a slightly malformed response that breaks parsing downstream. I spent a while confused about ada’s weird outputs before I traced them back to occasional drift in tool return formats. M2.7’s tool calling is more reliable than M2.5’s, but if your system has very high precision requirements for tool execution, Claude Sonnet 4.6 is still more dependable in that dimension. That’s an objective gap, not flattery.
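
The check that finally caught it was boring: validate every tool return against the shape you expect before anything downstream touches it. A minimal sketch, with a hypothetical `required_keys` schema standing in for whatever your tools actually return:

```python
def validate_tool_return(tool_name, result, required_keys):
    """Fail loudly on silent tool-call drift instead of letting a
    malformed return break parsing three steps downstream.

    `required_keys` maps field name -> expected type, e.g.
    {"results": list, "query": str}. Purely illustrative.
    """
    if not isinstance(result, dict):
        raise TypeError(
            f"{tool_name}: expected dict, got {type(result).__name__}"
        )
    for key, expected_type in required_keys.items():
        if key not in result:
            raise KeyError(f"{tool_name}: missing field {key!r}")
        if not isinstance(result[key], expected_type):
            raise TypeError(
                f"{tool_name}: field {key!r} is "
                f"{type(result[key]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return result
```

It’s crude, but it turns silent drift into a loud, attributable failure.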

Traditional Chinese output sometimes shows what the training data looks like.

M2.7’s Traditional Chinese is generally good, but Simplified Chinese expressions occasionally slip through. It’s not a dealbreaker, but if your audience is sensitive to that kind of thing, the difference is noticeable. My QA pipeline has a step for this, so the impact is controlled. But if you assume native Traditional Chinese support means zero oversight needed — you’ll hit that wall eventually.
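
For what it’s worth, the QA step itself is simple. This isn’t our exact pipeline code, but the idea fits in a few lines with an OpenCC binding: convert character by character from Simplified to Traditional and flag anything that changes.

```python
from opencc import OpenCC  # pip install opencc-python-reimplemented

_s2t = OpenCC("s2t")  # Simplified -> Traditional converter

def flag_simplified_chars(text):
    """Flag characters that change under s2t conversion.

    A character that converts to something else was written in its
    Simplified form. One-to-many characters need context, so a
    character-level pass can miss or over-flag them; treat hits as
    review flags for a human, not as auto-fixes.
    """
    return [
        (i, ch, _s2t.convert(ch))
        for i, ch in enumerate(text)
        if _s2t.convert(ch) != ch
    ]

# e.g. flags the Simplified 内 even inside otherwise-Traditional text:
# flag_simplified_chars("這個系統的内存使用")  ->  [(5, '内', '內')]
```

Hits go to a human reviewer, not an auto-replace; glyph conversion won’t catch vocabulary differences anyway, and those matter just as much to a Traditional Chinese audience.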

There’s no best model. There’s the right model for your system.

At the end of the day, M2.7 helped my AI team find a sustainable balance between cost structure, Chinese output quality, and task reliability. It’s not the strongest in every dimension — Claude Sonnet 4.6 wins on tool calling, and GPT-5.4 has an edge on computer-use tasks — but in the very specific context of “high-frequency, multi-agent, heavy Chinese output,” it’s the right fit for where we are right now.

One thing that stuck with me: M2.7 was built on the OpenClaw Agent Harness framework and ran over 100 rounds of autonomous architectural self-optimization during training. A model trained inside an agent environment, deployed into an agent environment — maybe that alignment is less of a coincidence than it seems.

Or maybe not. Hard to say.