📰 Key Takeaways

NVIDIA developed a five-stage Task-Seeded Synthetic Data Generation (Task-Seeded SDG) process for the Nemotron series, selecting roughly 70 public tasks (~700 subtasks) from lm-eval-harness. These were divided into two seed categories: knowledge-intensive (39 tasks, ~3M samples) and reasoning-intensive (34 tasks, ~1.5M samples). LLMs then generated different but similarly capable Q&A pairs, with reasoning chains and domain knowledge attached before unified filtering and packaging. In ablation experiments, the version with context dominated: GPQA-Diamond CoT jumped from 34.85 to 45.96 (+11.11), AGIEval-en CoT +6.16, MMLU-Pro 5-shot +2.44. When this synthetic data was mixed into Nemotron-3 Nano’s post-training (100B token scale), final GPQA rose from 30.8 to 41.9 (+11.1), MMLU-Pro +1.8, coding ability +1.9, commonsense understanding +1.6 — multiple dimensions grew in sync, proving that broad task coverage effectively prevents overfitting to any single evaluation style. Key design principles include: storing semantic text instead of option codes for answers, and carefully balancing task ratios when mixing datasets to ensure stable, comprehensive improvement across knowledge, reasoning, and coding abilities.


💬 JudyAI Lab Viewpoint

The five-stage Task-Seeded Synthetic Data Generation process NVIDIA developed for Nemotron is the first concrete demonstration of how to scale training data production using a structured method — getting small models to improve across multiple benchmarks simultaneously, rather than just padding scores on a single task.

What deserves our attention most is how it deliberately separates “knowledge-intensive” from “reasoning-intensive” seed tasks and carefully balances their ratios when mixing into post-training. The ablation实验 clearly shows: the version with context pushed GPQA-Diamond CoT from 34.85 to 45.96, a gap of over 11 percentage points. This tells us: the quality of synthetic data depends not just on generation volume, but on structural design — covering ~70 public tasks and ~700 subtasks is key to preventing models from overfitting to specific evaluation styles. The fact that coding ability, commonsense understanding, and reasoning ability all improved together across multiple dimensions shows that breadth of task coverage itself is an anti-overfitting design. Another detail worth remembering: store semantic text instead of option codes for answers, so the model truly learns semantic understanding rather than memorizing option positions.

If you’re supplementing synthetic training data for your own model or application, ask yourself first: are my task seeds diverse enough, or am I putting all my eggs in one ability dimension?


📅 Source Information


🔗 Further Reading