Last month, Ada finished a feature module, and J reported to me, "It's done, ready to merge into main." I was a bit nervous right away. It's not that I don't trust Ada, but we've learned exactly how much that statement can cost.

As expected, when J checked, the core logic was correct, but edge cases weren't handled, error messages were in English (our product has a Chinese interface), and, most importantly, there wasn't a single line of tests.

This wasn't an isolated case. In the early days, when our Agent system first started running, we ran into this almost every week: tasks were handed out, came back reported as "done," and then problems surfaced the moment we tried to use them.

The issue isn’t that Agents aren’t capable enough — it’s that “done” was never consistently defined.

When an Agent Says “Done,” That’s When I Get Most Nervous

Our worst incident was a scheduled task script.

Ada said it was ready, J said it was tested, but it ran for three days online before someone discovered — under certain conditions, it would fail silently. No error, no notification, just quietly doing nothing.

Three days.

That time we spent almost twice as long tracking down where the problem was, because there were no error messages, no traces at all — just wrong results. If someone had just taken an extra look at edge cases before delivery, we never would have gotten to this point.
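For illustration, here is a minimal, hypothetical sketch of the failure mode described above. A job wrapped in a bare `except` swallows every error and "quietly does nothing," while the variant that surfaces failures at least leaves a trace to debug from. The function names are invented for this example, not taken from our codebase.

```python
# Hypothetical sketch: how a scheduled job can fail silently.
# A bare except produces "no error, no notification, just quietly
# doing nothing" -- the failure mode described above.

def run_job_silently(task):
    try:
        return task()
    except Exception:
        return None  # failure swallowed: nothing logged, nobody alerted

def run_job_loudly(task, alert):
    try:
        return task()
    except Exception as exc:
        alert(f"scheduled job failed: {exc!r}")  # leave a trace first
        raise  # then fail loudly so the scheduler notices
```

The difference between the two is exactly the difference between finding the bug in an hour and finding it after three days.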

But “taking an extra look” wasn’t defined as part of the task, so the Agent wouldn’t do it.

“It Works” and “It’s Done” Are Very Different

This made me realize something: people have the same problem when writing code — “the feature runs” doesn’t mean “it’s ready for production.” It’s just that people, based on past experience with pitfalls, automatically fill in those “things that should just be done”: checking edge cases, writing tests, glancing to see if anything was missed.

Agents don’t have this instinct. Their objective function is “complete the task” — if the task definition isn’t clear, they’ll take the shortest path to what they think is the endpoint.

So the problem isn’t that the Agent isn’t smart enough — it’s that our rules have gaps. It just faithfully follows the rules we gave it.

This realization made me stop and think: instead of constantly going back and asking the Agent to fix things, why not make “what counts as done” clear from the start?

So J Designed a Closed Loop

Here’s J’s approach: no Agent output can be directly marked as “done” — it has to pass through five gates first.

Spec Confirmation is the first gate. Before the task starts, J aligns the acceptance criteria with the delivering Agent — not just the feature description, but also “what counts as a failure,” “what edge cases exist,” and “what proof it needs to show it’s complete.” After adding this gate, we realized how vague our previous task descriptions were.

The second gate is Implementation. The Agent builds according to the spec. With the spec in place, the Agent knows what to reference while building, no guessing required.

The third gate is Code Review. After completion, J brings in another Agent to review the code. Its job is to find bugs, edge case issues, places where the spec wasn’t met. Not a subjective “how well did you do” evaluation — it’s a checklist: “Does the spec say this should be done? Did you do it?”

The fourth gate is Fix. If the code-reviewer finds issues, the work goes back for fixes, then review runs again. This loop can run several rounds, until no new issues surface.

The fifth gate is Xiaoyue QA. Xiaoyue is our QA researcher; she verifies from a user perspective whether this feature actually solves the problem, gives a score, and anything below 8.5 gets sent back.

Only after passing all five gates is it considered complete.
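The five gates can be sketched as a small driver loop. This is an illustrative sketch under assumed interfaces, not our actual tooling; every name here (`Spec`, `implement`, `review`, `fix`, `qa_score`) is hypothetical. Note that gate one, spec confirmation, happens before the loop runs: the agreed `Spec` is the input, not something the loop produces.

```python
# Illustrative sketch of the five-gate loop. All names are hypothetical.
from dataclasses import dataclass

@dataclass
class Spec:
    """Gate 1 output: acceptance criteria agreed before work starts."""
    criteria: list       # what counts as done
    failure_modes: list  # what counts as a failure
    edge_cases: list     # edge cases that must be covered

def deliver(spec, implement, review, fix, qa_score,
            qa_threshold=8.5, max_review_rounds=5):
    """Run one task through gates 2-5; return the accepted artifact."""
    artifact = implement(spec)              # Gate 2: build to spec
    for _ in range(max_review_rounds):      # Gates 3-4: review/fix loop
        issues = review(spec, artifact)     # checklist against the spec
        if not issues:
            break
        artifact = fix(artifact, issues)
    else:
        raise RuntimeError("review loop did not converge")
    score = qa_score(artifact)              # Gate 5: user-perspective QA
    if score < qa_threshold:
        raise RuntimeError(f"QA score {score} below {qa_threshold}; sent back")
    return artifact
```

The design point is that "done" is whatever makes `deliver` return without raising; nothing the implementing Agent says on its own can short-circuit the gates.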

J said this was too slow at first. I said, let’s calculate how much time we’re spending fixing what the Agent messed up.

After calculating, she didn’t say anything. (That’s a management moment, lol)

The Numbers Speak

This process started running steadily about a month and a half ago.

The most obvious change: among tasks J marks as "done," the percentage that Judy or I send back after delivery dropped from around 40% to under 10%. The ones that do get sent back are almost all cases where the spec was unclear from the start, not quality issues.

Another unexpected finding was that Ada actually got faster. I thought adding all these gates would slow her down. But because the spec is clear from the start, she doesn't guess while building, and she doesn't discover halfway through that the direction went off track and everything needs redoing.

Once, the code-reviewer caught a logic error at the third gate. If it hadn’t been caught until going live, J estimated it would take 4-5x the time to handle. The time saved — that’s roughly enough to build three new features.

I don’t know if that multiplier is always that accurate, but measuring it is better than not measuring it.
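As a back-of-envelope illustration of that calculation (every number here is a hypothetical placeholder, not our real measurement):

```python
# Hypothetical back-of-envelope: expected hours per batch of tasks,
# counting post-delivery rework. Numbers are placeholders.

def expected_hours(tasks, send_back_rate, fix_multiplier, base_hours):
    """Base effort plus rework on the fraction of tasks sent back."""
    rework = tasks * send_back_rate * base_hours * fix_multiplier
    return tasks * base_hours + rework

before = expected_hours(tasks=20, send_back_rate=0.40,
                        fix_multiplier=4, base_hours=2)
after = expected_hours(tasks=20, send_back_rate=0.10,
                       fix_multiplier=4, base_hours=2)
saved = before - after
```

Whatever the exact multiplier, the gap between `before` and `after` is the budget the five gates have to beat to pay for themselves.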

When an Agent Says It’s Done, That’s Not the End

Now every time I see J report “delivery complete,” my first reaction is still to check if all five gates were passed. This habit’s a bit paranoid, but it gives me more peace of mind.

Lately I’ve been thinking about this — the core issue isn’t really an AI problem, it’s an “acceptance criteria” problem. Before, we let Agents define what “done” means for themselves, and of course there were gaps every time. Now we define it first, and the Agent achieves it.

Get the order right, and the results will follow.

That’s what I was thinking when organizing our team processes recently.
