From Scarcity to Scale: How Synthetic Personas Can Bootstrap Japanese AI Development

Community Article Published February 19, 2026

AI could write the next chapter of Japan's economic story, with forecasts suggesting the technology could unlock over ¥100 trillion ($650 billion USD) in economic value. But realizing that potential depends on one thing most AI projects lack: usable training data.

This challenge is especially acute for developers building AI systems that understand Japanese language and culture. While English-language training data is abundant, Japanese developers face a persistent scarcity problem: not enough task-specific, culturally grounded data to bootstrap high-performing models. Collecting, cleaning, and labeling new examples is slow, expensive, and rarely keeps pace with iteration cycles.

The result is a data wall that blocks innovation before it starts.

A New Path Forward

New research from leading IT firm NTT DATA demonstrates how synthetic data can dismantle this wall—turning minimal proprietary data into production-scale training sets while preserving privacy and performance.

Using Nemotron-Personas-Japan (NVIDIA's open synthetic dataset of 6 million culturally grounded Japanese personas generated with NeMo Data Designer), NTT DATA achieved a massive boost in model accuracy on a legal Q&A task, increasing from 15.3% to 79.3%, along with a similar jump in answer consistency.

That's a 64-point improvement achieved without exposing sensitive data to the training pipeline.

For readers interested in the full experimental methodology and evaluation framework, NTT DATA’s detailed technical write-up (available in Japanese) provides a deeper dive into the study design and results.

The takeaway: enterprises can bootstrap domain-specific intelligence with minimal proprietary data using entirely open-source infrastructure. Open personas enable both better models and more agile data operations.

The Experiment

To rigorously test the approach, NTT DATA created a controlled evaluation using fictional legal documents, ensuring the model had to acquire genuinely new knowledge. They leveraged the following for training:

- Base Model: tsuzumi 2 (NTT's proprietary LLM)
- Data Expansion Model: GPT-OSS-120b
- Seed Data: Nemotron-Personas-Japan
- Judge Model: GPT-5 (LLM-as-a-judge method)
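The study does not publish its judge prompt, but the LLM-as-a-judge pattern itself is simple: ask a strong model to grade each answer against a reference, then aggregate its verdicts. The sketch below is a minimal illustration only; the template wording, the CORRECT/INCORRECT verdict format, and all function names are assumptions, not NTT DATA's actual setup.

```python
# Minimal LLM-as-a-judge sketch (illustrative; the real judge prompt
# and parsing logic used in the NTT DATA study are not published).

JUDGE_TEMPLATE = """You are a strict grader for a legal Q&A task.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with exactly CORRECT or INCORRECT."""

def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Fill the grading template for one (question, answer) pair."""
    return JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )

def parse_verdict(judge_reply: str) -> bool:
    """Count the answer as correct only if the judge replies CORRECT."""
    return judge_reply.strip().upper().startswith("CORRECT")

def accuracy(verdicts: list[bool]) -> float:
    """Fraction of answers the judge marked correct, as a percentage."""
    return 100.0 * sum(verdicts) / len(verdicts) if verdicts else 0.0
```

In practice `build_judge_prompt` output would be sent to the judge model (GPT-5 in the study) and each reply fed through `parse_verdict`; the accuracy figures reported above are the aggregate of such verdicts.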

Using 500 personas from Nemotron-Personas-Japan to expand just 450 raw seed samples, they generated over 138,000 training examples—a synthetic set 300x larger than the manual equivalent—and boosted model accuracy from 15.3% to 79.3%.
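The exact expansion prompts are not public, but the core move is conditioning the expansion model on a persona so each seed example yields many stylistically and contextually distinct variants. The sketch below is hypothetical in every detail (function names, prompt wording, pairing strategy); it only illustrates how 450 seeds can fan out into roughly 138,000 generation prompts, about 300 per seed.

```python
import random

def expand_with_personas(seeds, personas, pairs_per_seed, rng=None):
    """Pair each seed example with a random subset of personas to produce
    persona-conditioned generation prompts (one prompt per pair).

    The real pipeline would send each prompt to an expansion model
    (GPT-OSS-120b in the study); here we just return the prompts."""
    rng = rng or random.Random(0)
    prompts = []
    for seed in seeds:
        for persona in rng.sample(personas, pairs_per_seed):
            prompts.append(
                f"Persona: {persona}\n"
                f"Rewrite and answer the following legal question in this "
                f"persona's voice and context:\n{seed}"
            )
    return prompts
```

At toy scale, 3 seeds with 4 personas each gives 12 prompts; at the study's scale (450 seeds, 500 personas, roughly 300 pairings per seed) the same loop lands on the order of the ~138,000 examples reported.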

The results speak directly to the data scarcity challenge enterprises face:

| Configuration | Seed Data | Synthetic Expansion | Accuracy |
|---|---|---|---|
| Baseline (no training) | N/A | N/A | 15.3% |
| SFT with synthetic data | 450 samples | 138,000 examples | 79.3% |

Beyond raw accuracy, the synthetic training data eliminated hallucinations that plagued the baseline model. Where the untrained model invented plausible but incorrect legal classifications, the fine-tuned version learned to extract precise terminology without adding noise.

Perhaps most valuable for enterprise deployment: NTT DATA found that Continued Pre-training (CPT) became optional for some use cases, even when the model had to acquire new knowledge, provided sufficient synthetic fine-tuning data was available. This suggests developers can adopt a more cost-effective training pipeline that skips the resource-intensive CPT phase entirely and instead iterates on synthetic data generation for supervised fine-tuning (SFT).
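In an SFT-only pipeline like this, the main data-engineering step is shaping the synthetic Q&A pairs into whatever chat format the trainer expects. A minimal sketch, assuming the common OpenAI-style message schema that most SFT trainers accept; tsuzumi 2's actual chat template may differ.

```python
def to_sft_records(qa_pairs, system_prompt="You are a legal assistant."):
    """Convert synthetic (question, answer) pairs into chat-message
    records in the widely used OpenAI-style schema.

    This is a generic format, not necessarily the one used in the
    NTT DATA study."""
    return [
        {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
        for question, answer in qa_pairs
    ]
```

Each record is then one supervised example: the model learns to produce the assistant turn given the system and user turns.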

This efficiency gain translates directly to reduced compute costs and faster iteration cycles.

Minimal proprietary data. Maximum domain lift.

“By expanding a small proprietary dataset with Nemotron Personas, we can effectively build task-specific models even when data availability is limited,” says Shinya Higuchi, Senior Manager of AI Technology Department at NTT DATA's Technology and Innovation General Headquarters. “This approach shows strong potential to improve outcomes for pre-research, customer support, and marketing applications where proprietary data is scarce.”


Privacy by Design

The accuracy gains are compelling, but they also raise a deeper question: what about the data that never makes it into the pipeline at all?

Over 90% of valuable enterprise data remains untapped due to privacy regulations, security risks, and licensing constraints. In Japan, frameworks like the Act on the Protection of Personal Information (APPI) and the country's innovation-first AI governance guidelines (published September 2025) reinforce this reality: responsible data handling isn't optional, even as AI advancement accelerates.

Synthetic data offers a path through this tension. By generating training examples that capture authentic patterns without personally identifiable information (PII), organizations can achieve data minimization and model performance simultaneously. The pattern is simple: expose only minimal proprietary data to bootstrap, then expand synthetically to production scale.

So synthetic data isn't just a training optimization: it's a privacy-enhancing technology (PET) that creates a goldilocks zone where data compliance and AI capability coexist. And because synthetic pipelines are reproducible and auditable, they also support the trust and transparency requirements that governance teams and regulators increasingly demand.

Sovereign Data Spaces

For Japanese enterprises building sovereign AI, data sovereignty is a prerequisite. But sovereignty alone isn't enough; models also need grounded intelligence: behavior shaped by local norms and domain constraints rather than statistical exposure to Western-centric corpora. Nemotron-Personas-Japan functions as a data primitive for this grounding: 6 million personas rooted in official Japanese demographic and labor statistics, covering 1,500+ occupation categories and regional distribution.
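One practical consequence of statistically grounded personas is that they can be drawn in proportion to real demographic weights, so downstream synthetic data inherits the population's regional and occupational mix. A minimal sketch of such weighted sampling; the weight field name is hypothetical and may not match Nemotron-Personas-Japan's actual schema.

```python
import random
from collections import Counter

def sample_personas(personas, weight_key, k, rng=None):
    """Draw k personas with probability proportional to a demographic
    weight field, so the sample tracks an official population
    distribution. The field name is caller-supplied and hypothetical."""
    rng = rng or random.Random(0)
    weights = [p[weight_key] for p in personas]
    return rng.choices(personas, weights=weights, k=k)
```

For example, sampling from two regions weighted 3:1 yields roughly three times as many personas from the heavier region, mirroring the underlying statistics rather than a uniform draw.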

But the implications extend beyond individual organizations. NTT DATA and other leaders are actively developing "data spaces": collaborative environments where governments and companies can exchange AI-ready, synthesized data products under shared governance and privacy guarantees. Federated learning and other privacy-preserving technologies enable this decentralized approach. Synthetic data serves as a complementary enabler, allowing organizations to contribute synthetic representations of their data patterns without exposing underlying sensitive information.

This shifts data risk management from a defensive posture to a collaborative one that’s aligned with Japan's vision for innovation-led AI governance. This approach also challenges the assumption that AI progress must flow from a small number of globally trained models. Instead, it points toward a future where many sovereign, interoperable AI systems are built locally on open, privacy-preserving foundations.

Start Building

The data wall is real. But as NTT DATA's research demonstrates, the tools to overcome it are now open and accessible. Synthetic data isn't a future capability—it's a present-day solution that developers can deploy today to build sovereign, culturally grounded AI systems without sacrificing privacy or performance.

Ready to get started? Explore the open-source NeMo Data Designer library, or dive into the full Nemotron-Personas-Japan dataset on Hugging Face. For a deeper technical dive, NTT DATA's full write-up (available in Japanese) covers the methodology and experimental design.

Nemotron-Personas-Japan is available under CC BY 4.0 for commercial and non-commercial use.
