Human-1: A Full-Duplex Conversational Model for Hindi

🎙️ Try the live demo → | 📄 Paper →

Human-1 by Josh Talks is the first full-duplex spoken dialogue model for Hindi, built by adapting Kyutai's Moshi architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking. It was trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.

[Figure: Hindi-Moshi architecture diagram (hindi_moshi_architecture.svg)]

Model Details


Developed by	Bhaskar Singh, Shobhit Banga, Pranav Sharma — JoshTalks
Base model	kyutai/moshiko-pytorch-bf16
Language	Hindi (hi)
Model type	Full-duplex speech-to-speech dialogue
Format	SafeTensors (fp32)
Tokenizer	Custom Hindi SentencePiece (32,000 vocabulary)
Audio codec	Mimi (frozen, 12.5 Hz, 1.1 kbps)
License	CC-BY-4.0

What was changed from base Moshi

The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:

text_emb — text token embedding in the Temporal Transformer
depformer.emb.0 — text token embedding in the Depth Transformer
text_linear — text output projection layer

All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).
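As a rough illustration of the split described above, the sketch below partitions state-dict keys into the three reinitialised groups and everything else. It assumes the group names listed above appear as key prefixes in the checkpoint; the example keys and shapes are hypothetical placeholders, not values read from the released weights.

```python
# Prefixes taken from the list above. That they match the checkpoint's
# state-dict keys exactly is an assumption of this sketch.
REINIT_PREFIXES = ("text_emb", "depformer.emb.0", "text_linear")


def needs_reinit(key: str) -> bool:
    """Return True if a state-dict key belongs to a vocabulary-dependent group."""
    return any(key == p or key.startswith(p + ".") for p in REINIT_PREFIXES)


def split_state_dict(state: dict) -> tuple[dict, dict]:
    """Split a state dict into (kept, reinitialised) entries."""
    kept = {k: v for k, v in state.items() if not needs_reinit(k)}
    reinit = {k: v for k, v in state.items() if needs_reinit(k)}
    return kept, reinit


# Hypothetical keys; tuples stand in for real tensors and are not real shapes.
example = {
    "text_emb.weight": ("vocab", "dim"),
    "depformer.emb.0.weight": ("vocab", "dim"),
    "text_linear.weight": ("vocab", "dim"),
    "transformer.layers.0.norm1.alpha": ("dim",),
}
kept, reinit = split_state_dict(example)
print("reinitialised:", sorted(reinit))
print("kept from base Moshi:", sorted(kept))
```

In a real adaptation, the entries in `reinit` would be dropped or resized to the new 32,000-token vocabulary before loading, while `kept` would be loaded unchanged.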

For full architecture details, see the Moshi paper.

Training

Data

The model was trained on a purpose-built corpus of 26,000 hours of real Hindi spontaneous conversations — to our knowledge, the largest conversational speech corpus for any Indian language.

Characteristic	Value
Total duration	26,000 hours
Unique speakers	14,695
Recording type	Spontaneous, unscripted conversations
Channels	Stereo (separate per speaker)
Quality control	Trained annotators + manual checks

The stereo recording format with separate speaker channels enables direct learning of turn-taking, overlaps, and backchannels from natural interactions — without requiring artificial speaker diarisation.
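To illustrate why separate channels help, the sketch below measures overlapping speech directly from per-channel activity intervals. It assumes each channel has already been reduced to sorted, non-overlapping (start, end) speech intervals, for example by a voice activity detector; that preprocessing step and the sample intervals are hypothetical and not described in the source.

```python
def total_overlap(a, b):
    """Total time (seconds) during which both speakers are active.

    a and b are sorted, non-overlapping lists of (start, end) speech
    intervals, one list per stereo channel.
    """
    i = j = 0
    overlap = 0.0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if end > start:
            overlap += end - start
        # Advance whichever interval finishes first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return overlap


# Hypothetical speech intervals for the two channels, in seconds.
left = [(0.0, 2.0), (5.0, 7.0)]
right = [(1.5, 4.0), (6.5, 8.0)]
print(total_overlap(left, right))
```

With a mono recording, the same measurement would first require diarisation to decide who is speaking when, which is the step the stereo format avoids.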

Two-stage training recipe

Stage 1 — Pre-training on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
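The Stage 1 figures can be cross-checked with some quick arithmetic. The sketch below uses only the numbers quoted above. The per-sequence duration is derived here and is not stated in the source.

```python
TOTAL_HOURS = 26_000      # corpus size (from the text)
HOURS_PER_UPDATE = 2.9    # approximate audio per optimizer step (from the text)
BATCH_SIZE = 64           # effective batch size (from the text)

# One epoch means every hour of audio is seen once.
steps_per_epoch = TOTAL_HOURS / HOURS_PER_UPDATE

# Average audio per sequence in a batch (derived, not stated in the source).
seconds_per_sequence = HOURS_PER_UPDATE * 3600 / BATCH_SIZE

print(f"steps per epoch ~ {steps_per_epoch:.0f}")
print(f"audio per sequence ~ {seconds_per_sequence:.0f} s")
```

This gives roughly 9,000 steps per epoch, the same order as the approximate 10,000 quoted above; the difference is presumably rounding in the quoted figures.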

Stage 2 — Fine-tuning on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
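One way to express the split learning rates is as per-group optimizer settings. The sketch below builds the two groups. It assumes that parameter names beginning with `depformer` belong to the Depth Transformer and that everything else belongs to the Temporal Transformer; this naming rule and the example parameter names are hypothetical.

```python
TEMPORAL_LR = 2e-6  # Temporal Transformer (from the text)
DEPTH_LR = 4e-6     # Depth Transformer (from the text)


def build_param_groups(named_params):
    """Group (name, param) pairs into two optimizer groups with separate LRs.

    Assumption: names starting with "depformer" belong to the Depth
    Transformer; all other parameters are treated as Temporal Transformer.
    """
    temporal, depth = [], []
    for name, param in named_params:
        (depth if name.startswith("depformer") else temporal).append(param)
    return [
        {"params": temporal, "lr": TEMPORAL_LR},
        {"params": depth, "lr": DEPTH_LR},
    ]


# Placeholder strings stand in for real tensors so the sketch runs anywhere.
groups = build_param_groups([
    ("transformer.layers.0.weight", "p0"),
    ("depformer.layers.0.weight", "p1"),
])
print(groups)
```

A list of dictionaries in this shape is what PyTorch optimizers such as `torch.optim.AdamW` accept as their first argument when real tensors are supplied.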

Training infrastructure

8× NVIDIA H100 80GB GPUs with bf16 mixed precision.

Evaluation

Perplexity

Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech.

System	PPL ↓
Ground-truth	237.1
Human-1 (τ=0.8)	356.9
Human-1 (τ=0.9)	467.1
Human-1 (τ=1.0)	640.6
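For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows that standard definition on a toy input. It assumes the scoring language model supplies natural-log token probabilities for each transcript; the exact evaluation pipeline used here is not specified beyond the description above.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that is uniformly unsure between 4 tokens at every
# step has a perplexity of about 4.
uniform = [math.log(1 / 4)] * 10
print(perplexity(uniform))
```

Lower values mean the scoring model finds the transcribed text more predictable, which is why the ground-truth row serves as the reference point.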

Human Evaluation

130 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity.

Perceptual quality:

Metric	Human Score	Model Score	Human Preferred	Model Preferred	Tie
Naturalness	4.55	4.10	30.0%	3.1%	66.9%
Clarity	4.05	3.04	—	—	—

Naturalness scores for generated speech approach those of human speech (4.10 vs. 4.55), and 66.9% of pairwise naturalness comparisons resulted in ties. Clarity shows a larger gap (3.04 vs. 4.05).

Conversational rubric evaluation:

Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.

Rubric	Pass Rate
Human-like interaction	≈85%
Appropriateness (response follows prompt)	≈53%
Completion (response forms a complete reply)	≈42%

While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remain ongoing challenges.

Turn-Taking Analysis

Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.

Model	τ	IPU/min	Pause	Gap	Overlap
Ground-truth	—	35.30	10.49	8.51	3.03
Human-1	0.8	23.12	9.16	6.77	1.67
Human-1	0.9	29.14	9.24	8.54	4.30
Human-1	1.0	38.90	11.67	8.10	9.68
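The claim that τ=0.9 is closest to ground truth can be checked against the table with a simple distance measure. The sketch below uses mean relative error across the four statistics. This metric is an assumption of the sketch, since the source does not say how closeness was judged.

```python
# Values copied from the table above.
GROUND_TRUTH = {"ipu_per_min": 35.30, "pause": 10.49, "gap": 8.51, "overlap": 3.03}
MODEL = {
    0.8: {"ipu_per_min": 23.12, "pause": 9.16, "gap": 6.77, "overlap": 1.67},
    0.9: {"ipu_per_min": 29.14, "pause": 9.24, "gap": 8.54, "overlap": 4.30},
    1.0: {"ipu_per_min": 38.90, "pause": 11.67, "gap": 8.10, "overlap": 9.68},
}


def mean_relative_error(stats, reference):
    """Average of |model - reference| / reference over all statistics."""
    return sum(
        abs(stats[k] - reference[k]) / reference[k] for k in reference
    ) / len(reference)


errors = {tau: mean_relative_error(s, GROUND_TRUTH) for tau, s in MODEL.items()}
best_tau = min(errors, key=errors.get)

for tau, err in sorted(errors.items()):
    print(f"tau={tau}: mean relative error {err:.3f}")
print("closest to ground truth:", best_tau)
```

Under this metric τ=0.9 has the smallest error. τ=1.0 matches IPU/min more closely but roughly triples the overlap statistic, which dominates its error.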

Conversation Style

Human-1 is trained on topic-driven conversations: real dialogues in which two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.

After an initial introduction, the model typically proposes a topic and steers the conversation toward it, preferring structured discussion over open-ended chitchat. Users can also introduce their own topic, which the model picks up and discusses in a focused way. This is an intentional design choice: the training data consists of real conversations where speakers hold focused, in-depth discussions on assigned topics.

This makes the model particularly well suited to domain-specific conversational applications. Our key finding is that the model's ability to stay on topic emerges from the structure of the training data alone, without any explicit prompting, reward shaping, or guardrails. This suggests that, with sufficient hours of domain-specific conversational data, the same approach could produce models that learn the conversational norms of virtually any domain (customer support, healthcare consultations, language tutoring, sales, therapy, and more), opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work.

Files

├── model.safetensors                              # Human-1 LM weights
├── tokenizer-e351c8d8-checkpoint125.safetensors   # Mimi audio codec (frozen, from Moshi)
├── tokenizer_hindi.model                          # Hindi SentencePiece tokenizer
├── tokenizer_hindi.vocab                          # Vocabulary reference
├── hindi_moshi_architecture.svg                   # Architecture diagram
└── README.md

Quick Start

1. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

2. Create project and install dependencies

uv init human-1 && cd human-1
uv python install 3.12
uv python pin 3.12
uv add moshi huggingface_hub

3. Download the model

uv run huggingface-cli download JoshTalksAI/Human-1 --local-dir ./weights
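Before starting the server, it can help to confirm that the download produced the files the server needs. The sketch below is an optional check, not part of the official setup; it assumes the `./weights` directory used in the download command and the file names from the Files section.

```python
from pathlib import Path

# File names taken from the Files section of this README.
EXPECTED_FILES = (
    "model.safetensors",
    "tokenizer-e351c8d8-checkpoint125.safetensors",
    "tokenizer_hindi.model",
)


def missing_files(weights_dir):
    """Return the expected files that are not present in weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means all three files were found.
print(missing_files("./weights"))
```

Saved as a script inside the project (the name `check_weights.py` is only a suggestion), it can be run with `uv run python check_weights.py`.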

4. Run the server

uv run -m moshi.server \
    --moshi-weight ./weights/model.safetensors \
    --mimi-weight ./weights/tokenizer-e351c8d8-checkpoint125.safetensors \
    --tokenizer ./weights/tokenizer_hindi.model

Intended Use

The model is intended for research in full-duplex spoken dialogue systems for Hindi and Indian languages. It can be used as a conversational agent for casual Hindi conversations.

Limitations

Trained primarily on Hindi conversational speech. Performance on other languages or domains is not guaranteed.
Inherits limitations from the base Moshi architecture regarding audio quality at 1.1 kbps bitrate.
Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. 65% in English) due to Devanagari encoding more phonemic content per token.
Not intended for impersonation or any malicious use.
This model is for research purposes. We do not recommend it for providing advice or performing any professional duty.
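The PAD ratio mentioned above is the share of text-stream positions filled with padding rather than a real text token. The sketch below shows the calculation on a toy sequence. It assumes one text slot per audio frame; the padding id and the token ids are hypothetical placeholders.

```python
PAD = -1  # hypothetical padding id, used only in this sketch


def pad_ratio(frame_tokens, pad_id=PAD):
    """Fraction of text-stream frames that hold padding, not a text token."""
    if not frame_tokens:
        raise ValueError("need at least one frame")
    padded = sum(1 for token in frame_tokens if token == pad_id)
    return padded / len(frame_tokens)


# Toy text stream for 8 audio frames: 2 real tokens, 6 padding slots.
stream = [101, PAD, PAD, PAD, 202, PAD, PAD, PAD]
print(pad_ratio(stream))
```

A higher ratio means fewer text tokens accompany the same amount of audio, so the text stream gives the model a sparser signal.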

Citation

@misc{singh2026human1,
  title   = {Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
  author  = {Bhaskar Singh and Shobhit Banga and Pranav Sharma},
  year    = {2026},
  eprint  = {2604.23295},
  archivePrefix = {arXiv},
  institution = {JoshTalks}
}

Acknowledgments

Built on Moshi by Kyutai. We thank the 14,695 speakers who contributed to the Hindi conversational corpus.

Papers

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations. arXiv:2604.23295.
Moshi: a speech-text foundation model for real-time dialogue. arXiv:2410.00037, published Sep 17, 2024.