How to use JoshTalksAI/Human-1 with Moshi:
# pip install moshi
# Run the interactive web server
python -m moshi.server --hf-repo "JoshTalksAI/Human-1"
# Then open https://localhost:8998 in your browser

# pip install moshi
import torch
from moshi.models import loaders

# Load checkpoint info from HuggingFace
checkpoint = loaders.CheckpointInfo.from_hf_repo("JoshTalksAI/Human-1")

# Load the Mimi audio codec
mimi = checkpoint.get_mimi(device="cuda")
mimi.set_num_codebooks(8)

# Encode audio (24kHz, mono)
wav = torch.randn(1, 1, 24000 * 10)  # [batch, channels, samples]
with torch.no_grad():
    codes = mimi.encode(wav.cuda())
    decoded = mimi.decode(codes)
Human-1: A Full-Duplex Conversational Model for Hindi
🎙️ Try the live demo → | 📄 Paper →
Human-1 by Josh Talks is the first full-duplex spoken dialogue model for Hindi, built by adapting Kyutai's Moshi architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking — trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.
Model Details
| Field | Value |
|---|---|
| Developed by | Bhaskar Singh, Shobhit Banga, Pranav Sharma (JoshTalks) |
| Base model | kyutai/moshiko-pytorch-bf16 |
| Language | Hindi (hi) |
| Model type | Full-duplex speech-to-speech dialogue |
| Format | SafeTensors (fp32) |
| Tokenizer | Custom Hindi SentencePiece (32,000 vocabulary) |
| Audio codec | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
| License | CC-BY-4.0 |
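The codec figures in the table above can be cross-checked with a little arithmetic. The sketch below assumes 8 codebooks (the value passed to `set_num_codebooks` in the usage snippet) and 2,048 entries per codebook, i.e. 11 bits per code, as described in the Moshi paper. The 24 kHz sample rate comes from the usage snippet. The helper names are illustrative and are not part of the `moshi` package.

```python
import math

SAMPLE_RATE_HZ = 24_000   # Mimi input sample rate (from the usage snippet)
FRAME_RATE_HZ = 12.5      # Mimi frame rate (from the Model Details table)
NUM_CODEBOOKS = 8         # assumption: matches mimi.set_num_codebooks(8)
CODEBOOK_SIZE = 2048      # assumption: per the Moshi paper (11 bits per code)


def samples_per_frame() -> int:
    """Number of waveform samples summarised by one codec frame."""
    return int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def num_frames(duration_s: float) -> int:
    """Number of codec frames needed to cover `duration_s` seconds."""
    return math.ceil(duration_s * FRAME_RATE_HZ)


def bitrate_bps() -> float:
    """Bits per second = frames/s * codebooks * bits per code."""
    bits_per_code = math.log2(CODEBOOK_SIZE)
    return FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_code


print(samples_per_frame())  # 1920
print(num_frames(10))       # 125
print(bitrate_bps())        # 1100.0
```

Under these assumptions, a 10-second clip corresponds to 125 frames of 8 codes each, and 1,100 bits per second matches the 1.1 kbps listed for Mimi.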
What was changed from base Moshi
The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:
- text_emb: text token embedding in the Temporal Transformer
- depformer.emb.0: text token embedding in the Depth Transformer
- text_linear: text output projection layer
All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).
For full architecture details, see the Moshi paper.
Training
Data
The model was trained on a purpose-built corpus of 26,000 hours of real Hindi spontaneous conversations — to our knowledge, the largest conversational speech corpus for any Indian language.
| Characteristic | Value |
|---|---|
| Total duration | 26,000 hours |
| Unique speakers | 14,695 |
| Recording type | Spontaneous, unscripted conversations |
| Channels | Stereo (separate per speaker) |
| Quality control | Trained annotators + manual checks |
The stereo recording format with separate speaker channels enables direct learning of turn-taking, overlaps, and backchannels from natural interactions — without requiring artificial speaker diarisation.
Two-stage training recipe
Stage 1 — Pre-training on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (~2.9 hours of audio per update). Trained for 1 epoch (~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
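The Stage 1 figures can be related to each other with simple arithmetic. The sketch below uses only the numbers quoted above and assumes that every update consumes previously unseen audio. Because "~2.9 hours" and "~10,000 steps" are rounded in the text, the results should only be expected to agree approximately.

```python
CORPUS_HOURS = 26_000      # total training audio
HOURS_PER_UPDATE = 2.9     # approximate audio per optimizer update
BATCH_SIZE = 64            # effective batch size
WALL_CLOCK_HOURS = 13      # approximate Stage 1 training time
NUM_GPUS = 8               # H100 80GB GPUs

# Average audio duration of a single sequence in the batch.
seconds_per_sample = HOURS_PER_UPDATE * 3600 / BATCH_SIZE

# Updates needed to see the whole corpus once.
updates_per_epoch = CORPUS_HOURS / HOURS_PER_UPDATE

# Hours of audio processed per GPU per wall-clock hour.
audio_hours_per_gpu_hour = CORPUS_HOURS / (WALL_CLOCK_HOURS * NUM_GPUS)

print(round(seconds_per_sample))   # 163
print(round(updates_per_epoch))    # 8966
print(audio_hours_per_gpu_hour)    # 250.0
```

Roughly 9,000 updates per epoch is the same order of magnitude as the quoted ~10,000 steps. The gap is most likely explained by rounding in the reported figures.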
Stage 2 — Fine-tuning on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
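One common way to apply split learning rates like those in Stage 2 is to use optimizer parameter groups. The following is a minimal sketch, not the authors' training code. It assumes that Depth Transformer parameters can be identified by a `depformer` name prefix and that everything else is trained at the Temporal Transformer rate. Both the prefix and the example parameter names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

TEMPORAL_LR = 2e-6          # Stage 2 rate for the Temporal Transformer
DEPTH_LR = 4e-6             # Stage 2 rate for the Depth Transformer
DEPTH_PREFIX = "depformer"  # hypothetical naming convention


def build_param_groups(named_params: Iterable[Tuple[str, object]]) -> List[Dict]:
    """Split (name, parameter) pairs into two groups with different learning rates."""
    temporal, depth = [], []
    for name, param in named_params:
        if name.startswith(DEPTH_PREFIX):
            depth.append(param)
        else:
            temporal.append(param)
    return [
        {"params": temporal, "lr": TEMPORAL_LR},
        {"params": depth, "lr": DEPTH_LR},
    ]


# Toy demonstration with placeholder names and string stand-ins for tensors.
demo = [
    ("transformer.layers.0.weight", "t0"),
    ("depformer.layers.0.weight", "d0"),
    ("text_emb.weight", "e0"),
]
for group in build_param_groups(demo):
    print(group["lr"], group["params"])
```

With PyTorch, the returned list has the shape that `torch.optim.AdamW` accepts as its first argument, for example `torch.optim.AdamW(build_param_groups(model.named_parameters()))`. The betas and weight decay for Stage 2 are not stated in this document, so they are left at whatever the caller chooses.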
Training infrastructure
8× NVIDIA H100 80GB GPUs with bf16 mixed precision.
Evaluation
Perplexity
Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech.
| System | PPL ↓ |
|---|---|
| Ground-truth | 237.1 |
| Human-1 (τ=0.8) | 356.9 |
| Human-1 (τ=0.9) | 467.1 |
| Human-1 (τ=1.0) | 640.6 |
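For reference, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below shows that standard definition on toy data. In the evaluation above, the per-token log-probabilities would come from the scoring language model applied to the transcripts. How scores are aggregated across utterances is not stated here, so the single-sequence form below is an assumption.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: four tokens, each assigned probability 0.25, give perplexity 4.
toy_log_probs = [math.log(0.25)] * 4
print(round(perplexity(toy_log_probs), 6))  # 4.0
```

Lower values mean the scoring model finds the transcript more predictable, which is why the table marks PPL with a downward arrow.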
Human Evaluation
130 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity.
Perceptual quality:
| Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
|---|---|---|---|---|---|
| Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
| Clarity | 4.05 | 3.04 | — | — | — |
Generated speech achieves high perceived naturalness: the model's score approaches that of human speech (4.10 vs. 4.55), and most pairwise naturalness comparisons result in ties (66.9%). Clarity shows a larger gap (3.04 vs. 4.05).
Conversational rubric evaluation:
Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.
| Rubric | Pass Rate |
|---|---|
| Human-like interaction | ≈85% |
| Appropriateness (response follows prompt) | ≈53% |
| Completion (response forms a complete reply) | ≈42% |
While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remains an ongoing challenge.
Turn-Taking Analysis
Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.
| Model | τ | IPU/min | Pause | Gap | Overlap |
|---|---|---|---|---|---|
| Ground-truth | — | 35.30 | 10.49 | 8.51 | 3.03 |
| Human-1 | 0.8 | 23.12 | 9.16 | 6.77 | 1.67 |
| Human-1 | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
| Human-1 | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |
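The claim that τ=0.9 is closest to ground-truth can be checked against the table. The sketch below sums the relative error of each statistic with respect to the ground-truth row. The source does not say which distance was used, so this metric is an illustrative assumption.

```python
GROUND_TRUTH = {"ipu_per_min": 35.30, "pause": 10.49, "gap": 8.51, "overlap": 3.03}

# Rows of the turn-taking table, keyed by sampling temperature.
MODEL_STATS = {
    0.8: {"ipu_per_min": 23.12, "pause": 9.16, "gap": 6.77, "overlap": 1.67},
    0.9: {"ipu_per_min": 29.14, "pause": 9.24, "gap": 8.54, "overlap": 4.30},
    1.0: {"ipu_per_min": 38.90, "pause": 11.67, "gap": 8.10, "overlap": 9.68},
}


def relative_error_sum(stats: dict, reference: dict) -> float:
    """Sum of |model - reference| / reference over all statistics."""
    return sum(abs(stats[k] - reference[k]) / reference[k] for k in reference)


scores = {tau: relative_error_sum(s, GROUND_TRUTH) for tau, s in MODEL_STATS.items()}
best_tau = min(scores, key=scores.get)

for tau, score in sorted(scores.items()):
    print(f"tau={tau}: summed relative error = {score:.3f}")
print(best_tau)  # 0.9
```

Under this metric τ=0.9 has the smallest total deviation, mainly because its gap statistic nearly matches ground-truth while τ=1.0 produces roughly three times the ground-truth overlap.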
Conversation Style
Human-1 is trained on topic-driven conversations - real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.
After an initial introduction, the model will typically propose a topic and steer the conversation toward it, preferring structured discussion over open-ended chitchat. Users can also introduce their own topic - the model will pick it up and engage in a focused discussion around it. This is an intentional design choice - the training data consists of real conversations where speakers engage in focused, in-depth discussions on assigned topics.
This makes the model particularly well-suited for domain-specific conversational applications. Our key finding is that the model's ability to stay on-topic emerges naturally from the structure of the training data alone - without any explicit prompting, reward shaping, or guardrails. This suggests that with sufficient hours of domain-specific conversational data, this approach can produce models that learn the conversational norms of virtually any domain - customer support, healthcare consultations, language tutoring, sales, therapy, and more - opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work.
Files
├── model.safetensors # Human-1 LM weights
├── tokenizer-e351c8d8-checkpoint125.safetensors # Mimi audio codec (frozen, from Moshi)
├── tokenizer_hindi.model # Hindi SentencePiece tokenizer
├── tokenizer_hindi.vocab # Vocabulary reference
├── hindi_moshi_architecture.svg # Architecture diagram
└── README.md
Quick Start
1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
2. Create project and install dependencies
uv init human-1 && cd human-1
uv python install 3.12
uv python pin 3.12
uv add moshi huggingface_hub
3. Download the model
uv run huggingface-cli download JoshTalksAI/Human-1 --local-dir ./weights
4. Run the server
uv run -m moshi.server \
--moshi-weight ./weights/model.safetensors \
--mimi-weight ./weights/tokenizer-e351c8d8-checkpoint125.safetensors \
--tokenizer ./weights/tokenizer_hindi.model
Intended Use
The model is intended for research in full-duplex spoken dialogue systems for Hindi and Indian languages. It can be used as a conversational agent for casual Hindi conversations.
Limitations
- Trained primarily on Hindi conversational speech. Performance on other languages or domains is not guaranteed.
- Inherits limitations from the base Moshi architecture regarding audio quality at 1.1 kbps bitrate.
- Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. 65% in English) due to Devanagari encoding more phonemic content per token.
- Not intended for impersonation or any malicious use.
- This model is for research purposes. We do not recommend it for providing advice or performing any professional duty.
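The PAD ratio mentioned in the limitations can be illustrated with a simplified count. The sketch assumes the text stream has one slot per codec frame (12.5 frames per second) and that every slot not holding a text token is padding. This is a simplification of the actual Moshi text stream, and the token counts in the example are hypothetical values chosen to reproduce the quoted ratios.

```python
FRAME_RATE_HZ = 12.5  # one text slot per Mimi frame (assumed alignment)


def pad_ratio(duration_s: float, num_text_tokens: int) -> float:
    """Fraction of text-stream slots that hold padding rather than a text token."""
    total_frames = int(duration_s * FRAME_RATE_HZ)
    if total_frames <= 0:
        raise ValueError("duration too short for a single frame")
    if num_text_tokens > total_frames:
        raise ValueError("more text tokens than available frames")
    return (total_frames - num_text_tokens) / total_frames


# An 80-second clip spans 1000 frames.
print(pad_ratio(80, 250))  # 0.75 (hypothetical Hindi token count)
print(pad_ratio(80, 350))  # 0.65 (hypothetical English token count)
```

In this toy setting, the same 80 seconds of speech carries 250 text tokens at a 75% PAD ratio and 350 at 65%, which shows how fewer tokens per second translate into a sparser text stream.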
Citation
@article{singh2026human1,
title = {Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
author = {Bhaskar Singh and Shobhit Banga and Pranav Sharma},
year = {2026},
institution = {JoshTalks}
}
Acknowledgments
Built on Moshi by Kyutai. We thank the 14,695 speakers who contributed to the Hindi conversational corpus.