One day J came back from her routine system patrol and said, “Hey, our API costs are looking off. We need to actually sit down and look at this.”

I didn’t think much of it at the time — figured it was some cron job misbehaving. But when she pulled up the itemized billing, I realized: running a multi-agent system means burning money every single day, at a very consistent and very hard-to-argue-with rate.

It wasn’t that one month was unusually high. The problem was that costs were growing every month — because the system itself was growing.

The per-token billing logic doesn’t match your instincts

When you’re using GPT-4o, it’s easy to tell yourself: “This call is cheap. It’s fine.”

And you’re right. A single call is cheap.

But when you have five agents running in parallel, each getting called dozens of times per hour (some doing search, some doing analysis, some generating scheduled reports), that “each call is cheap” assumption quietly becomes a very expensive one. I sleep somewhere between the Taiwan and Korea time zones. The system doesn’t sleep. It runs while I sleep. It charges while I sleep.
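
To make the math concrete, here’s a rough back-of-envelope sketch. Every number in it is an illustrative assumption (call rates, token counts, per-token prices), not our actual billing:

```python
# Rough monthly cost estimate for a small multi-agent system.
# All numbers below are illustrative assumptions, not real billing data.

AGENTS = 5                     # agents running in parallel
CALLS_PER_HOUR = 30            # "dozens of times per hour" per agent
HOURS_PER_DAY = 24             # the system doesn't sleep
INPUT_TOKENS_PER_CALL = 3_000  # prompt + accumulated context
OUTPUT_TOKENS_PER_CALL = 800

# Hypothetical per-token rates (USD per 1M tokens).
INPUT_RATE = 2.50
OUTPUT_RATE = 10.00

calls_per_month = AGENTS * CALLS_PER_HOUR * HOURS_PER_DAY * 30
input_cost = calls_per_month * INPUT_TOKENS_PER_CALL / 1e6 * INPUT_RATE
output_cost = calls_per_month * OUTPUT_TOKENS_PER_CALL / 1e6 * OUTPUT_RATE

print(f"{calls_per_month:,} calls/month")
print(f"~${input_cost + output_cost:,.0f}/month")
```

A call that costs a fraction of a cent turns into four figures a month once nothing ever stops running.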

After we switched to MiniMax M2.7 on a subscription plan, the billing became flat. It didn’t matter how many rounds of analysis ada ran that day, or how much market research mimi churned through. The cost was predictable.

That one change mattered more to me than any score on a model leaderboard.

What ada and mimi actually produce

In my AI team, ada is the product engineer — responsible for data processing, search tasks, and research reports. Mimi is the marketing manager — focused on market insights and content strategy analysis. Their workloads are pretty different, which gave me two distinct lenses for observing M2.7.

Ada’s work demands structured output: correctly formatted analysis, reliable tool execution, consistent JSON. M2.7 is noticeably more stable than M2.5 here — format failures dropped significantly. M2.5 occasionally just “forgot” the format instructions. The OpenHands team noted similar tag-omission behavior in their evaluations. It wasn’t a one-off. M2.7 fixed that.
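
The fix on our side wasn’t to trust the model more; it was to make “forgot the format” a recoverable event. Here’s a minimal sketch of that kind of guard, assuming a generic `call_model` function that takes a prompt and returns raw text (a placeholder, not any specific client):

```python
import json

def call_with_format_guard(call_model, prompt, retries=2):
    """Ask for JSON and re-ask when the model drops the format.

    `call_model` is a placeholder for whatever client you use;
    it takes a prompt string and returns the raw model text.
    """
    for attempt in range(retries + 1):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # M2.5-style failure: the model "forgot" the format.
            # Remind it and retry instead of crashing downstream.
            prompt = (
                "Your previous reply was not valid JSON.\n"
                "Reply with a single JSON object and nothing else.\n\n"
                + prompt
            )
    raise ValueError(f"no valid JSON after {retries + 1} attempts")
```

The point isn’t the retry prompt; it’s that a format failure becomes a logged, recoverable event instead of a crash three steps later.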

Mimi’s work depends more on feel. Her output needs to sound like a real person, not a translated machine. M2.7’s writing in Chinese surprised me — the rhythm felt natural, the word choices landed in the right places. GPT-4o’s Chinese output sometimes reads like an English sentence structure that got translated over. M2.7 doesn’t have that problem.

That said, I’m not going to call it perfect. Because it isn’t.

Three pitfalls you won’t see in any benchmark

The context window looks big on paper. In practice, it’s a different story.

M2.5 advertises a 205K context window. That sounds large. But in a multi-agent system, context is cumulative. An agent that’s gone through a few rounds of search, summarize, search again — its context fills up faster than you’d expect. You’ll start seeing agents “forget” things: information they consolidated in round two is gone by round four. M2.7 handles this better, but a bigger context window doesn’t mean you can ignore context management. You still need to architect it deliberately at the system level. The model won’t sort it out for you.
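
Concretely, “architect it deliberately” for us means a compaction step between rounds. Here’s a minimal sketch of the idea, assuming a `summarize` helper backed by the model and a crude chars-divided-by-4 token estimate; it’s an illustration, not our production code:

```python
def compact_history(messages, summarize, budget_tokens=100_000):
    """Keep recent turns verbatim, fold older ones into a summary.

    `summarize` is a placeholder that calls the model to condense text.
    Token counts use a crude chars/4 heuristic; swap in a real
    tokenizer if you need precision.
    """
    def est_tokens(msg):
        return len(msg["content"]) // 4

    if sum(est_tokens(m) for m in messages) <= budget_tokens:
        return messages

    # Walk backwards, keeping as many recent messages as fit in half
    # the budget; everything older gets folded into one summary.
    kept, used = [], 0
    for msg in reversed(messages):
        if used + est_tokens(msg) > budget_tokens // 2:
            break
        kept.append(msg)
        used += est_tokens(msg)
    kept.reverse()

    older = messages[: len(messages) - len(kept)]
    summary = summarize("\n".join(m["content"] for m in older))
    return [{"role": "system",
             "content": f"Earlier context, condensed: {summary}"}] + kept
```

The detail that matters is the pinned summary: facts consolidated in round two survive into round four because they’re carried forward explicitly, not because the window happens to be big enough.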

Tool calling stability is hard to catch in development.

This one usually only shows up once you’re in production. When tool calls fail, they don’t always throw an error: sometimes they silently don’t execute, or they execute but return a slightly malformed response that breaks parsing downstream. I spent a while confused about ada’s weird outputs before I traced them back to occasional drift in tool return formats. M2.7’s tool calling is more reliable than M2.5’s, but if your system has very high precision requirements for tool execution, Claude Sonnet 4.6 is still more dependable in that dimension. That’s an objective gap, not flattery.
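
The check that finally caught it was boring: validate every tool return against the shape you expect before anything downstream touches it. A minimal sketch, with a hypothetical `required_keys` schema standing in for whatever your tools actually return:

```python
def validate_tool_return(tool_name, result, required_keys):
    """Fail loudly on silent tool-call drift instead of letting a
    malformed return break parsing three steps downstream.

    `required_keys` maps field name -> expected type, e.g.
    {"results": list, "query": str}. Purely illustrative.
    """
    if not isinstance(result, dict):
        raise TypeError(
            f"{tool_name}: expected dict, got {type(result).__name__}"
        )
    for key, expected_type in required_keys.items():
        if key not in result:
            raise KeyError(f"{tool_name}: missing field {key!r}")
        if not isinstance(result[key], expected_type):
            raise TypeError(
                f"{tool_name}: field {key!r} is "
                f"{type(result[key]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return result
```

It’s crude, but it turns silent drift into a loud, attributable failure.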

Traditional Chinese output sometimes shows what the training data looks like.

M2.7’s Traditional Chinese is generally good, but Simplified Chinese expressions occasionally slip through. It’s not a dealbreaker, but if your audience is sensitive to that kind of thing, the difference is noticeable. My QA pipeline has a step for this, so the impact is controlled. But if you assume native Traditional Chinese support means zero oversight needed — you’ll hit that wall eventually.
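
For what it’s worth, the QA step itself is simple. This isn’t our exact pipeline code, but the idea fits in a few lines with an OpenCC binding: convert character by character from Simplified to Traditional and flag anything that changes.

```python
from opencc import OpenCC  # pip install opencc-python-reimplemented

_s2t = OpenCC("s2t")  # Simplified -> Traditional converter

def flag_simplified_chars(text):
    """Flag characters that change under s2t conversion.

    A character that converts to something else was written in its
    Simplified form. One-to-many characters need context, so a
    character-level pass can miss or over-flag them; treat hits as
    review flags for a human, not as auto-fixes.
    """
    return [
        (i, ch, _s2t.convert(ch))
        for i, ch in enumerate(text)
        if _s2t.convert(ch) != ch
    ]

# e.g. flags the Simplified 内 even inside otherwise-Traditional text:
# flag_simplified_chars("這個系統的内存使用")  ->  [(5, '内', '內')]
```

Hits go to a human reviewer, not an auto-replace; glyph conversion won’t catch vocabulary differences anyway, and those matter just as much to a Traditional Chinese audience.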

There’s no best model. There’s the right model for your system.

At the end of the day, M2.7 helped my AI team find a sustainable balance between cost structure, Chinese output quality, and task reliability. It’s not the strongest in every dimension — Claude Sonnet 4.6 wins on tool calling, and GPT-5.4 has an edge on computer-use tasks — but in the very specific context of “high-frequency, multi-agent, heavy Chinese output,” it’s the right fit for where we are right now.

One thing that stuck with me: M2.7 was built on the OpenClaw Agent Harness framework and ran over 100 rounds of autonomous architectural self-optimization during training. A model trained inside an agent environment, deployed into an agent environment — maybe that alignment is less of a coincidence than it seems.

Or maybe not. Hard to say.