---
library_name: transformers
license: mit
language:
- en
tags:
- tokenizer
- json
- bpe
- structured-data
- llm
pipeline_tag: text-generation
---

# json-tokenizer: Structure-Aware Tokenization for JSON

A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.

**Paper:** [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.18879110)

**Code:** [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)

## Key Results

| Metric | Value |
|--------|-------|
| Token savings vs cl100k_base | **5-15%** on schema-repetitive JSON |
| Vocabulary size | **4,251 tokens** (vs 100,256 for cl100k_base) |
| Vocab reduction | **~90x smaller** |
| Roundtrip fidelity | **100% lossless** across 4,200+ test objects |
| Crossover point | Beats cl100k_base at just **558 tokens** |

## Architecture

Three-tier vocabulary:

1. **Structural tokens** (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
2. **Key vocabulary** (IDs 32-N): Learned single-token keys from training data (125 keys)
3. **BPE subwords** (IDs N+1 to N+B): Byte-pair encoding trained on JSON value strings (4,096 tokens)

IDs 16-31 are reserved.

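
The three tiers above can be illustrated with a toy encoder. This is a minimal sketch and not the library's implementation: the specific structural IDs, the three-entry `KEY_VOCAB`, and the `VALUE_BASE` offset are all hypothetical, and value content is encoded as raw UTF-8 bytes as a stand-in for the trained BPE tier. It only encodes; it makes no attempt at the lossless decoding the real tokenizer provides.

```python
import json

# Tier 1: fixed IDs for JSON grammar (hypothetical assignment within IDs 0-15).
STRUCTURAL = {"{": 0, "}": 1, "[": 2, "]": 3, ":": 4, ",": 5,
              "true": 6, "false": 7, "null": 8}

# Tier 2: hypothetical learned key vocabulary, starting at ID 32.
KEY_VOCAB = {"name": 32, "age": 33, "active": 34}

# Tier 3 stand-in: one ID per UTF-8 byte, offset past the key range (NOT real BPE).
VALUE_BASE = 32 + len(KEY_VOCAB)


def encode_value_text(text):
    """Map value text to one ID per UTF-8 byte (placeholder for BPE)."""
    return [VALUE_BASE + b for b in text.encode("utf-8")]


def encode(obj):
    """Walk a parsed JSON value and emit token IDs tier by tier."""
    if obj is True:
        return [STRUCTURAL["true"]]
    if obj is False:
        return [STRUCTURAL["false"]]
    if obj is None:
        return [STRUCTURAL["null"]]
    if isinstance(obj, dict):
        ids = [STRUCTURAL["{"]]
        for i, (key, value) in enumerate(obj.items()):
            if i:
                ids.append(STRUCTURAL[","])
            # Known keys cost one token; unknown keys fall back to value encoding.
            ids += [KEY_VOCAB[key]] if key in KEY_VOCAB else encode_value_text(key)
            ids.append(STRUCTURAL[":"])
            ids += encode(value)
        return ids + [STRUCTURAL["}"]]
    if isinstance(obj, list):
        ids = [STRUCTURAL["["]]
        for i, item in enumerate(obj):
            if i:
                ids.append(STRUCTURAL[","])
            ids += encode(item)
        return ids + [STRUCTURAL["]"]]
    # Strings and numbers: serialize, then encode as value content.
    return encode_value_text(json.dumps(obj))


ids = encode(json.loads('{"name": "Al", "active": true}'))
print(ids)
```

In this sketch the braces, colon, comma, `true`, and both keys each cost exactly one token, so only the string value `"Al"` is spent on the subword tier. That is the effect the design relies on for schema-repetitive data.
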
## This Model

This pretrained tokenizer was trained on structured JSON datasets:

- GeoJSON city features (geographic data)
- Observability telemetry logs (monitoring data)
- Kubernetes manifests (infrastructure config)
- Structured API outputs
- Synthetic training corpus (700 objects)

**Total training objects:** 72,991

**Vocabulary:** 4,251 tokens (16 structural + 16 reserved + 125 keys + 4,096 BPE)

## Usage

### With HuggingFace Transformers

```python
# Requires: pip install "json-tokenizer[huggingface]"
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer

tokenizer = JSONPreTrainedTokenizer.from_pretrained("anthonym21/json-tokenizer-structured")

# Encode JSON
output = tokenizer('{"name": "Alice", "age": 30, "active": true}')
print(output["input_ids"])

# Decode back to JSON (lossless)
decoded = tokenizer.decode(output["input_ids"])
print(decoded)  # {"name": "Alice", "age": 30, "active": true}
```

### With Core Library

```python
# Requires: pip install json-tokenizer
from json_tokenizer import JSONTokenizer

tokenizer = JSONTokenizer.load("./path/to/saved/tokenizer")

# Encode (accepts Python dicts, lists, or JSON strings)
ids = tokenizer.encode({"name": "Alice", "age": 30})

# Decode (lossless roundtrip)
json_str = tokenizer.decode(ids)
```

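
To spot-check the lossless roundtrip on your own data, a small helper along these lines may be useful. It is a sketch that assumes `decode` returns a JSON string, as in the example above. The helper name `roundtrips` and the byte-level `toy_encode` / `toy_decode` stand-ins are hypothetical and exist only so the snippet runs without the library installed.

```python
import json


def roundtrips(encode, decode, obj):
    """Return True if decode(encode(obj)) parses back to an equal Python value."""
    return json.loads(decode(encode(obj))) == obj


# Stand-in codec so the sketch runs without the library: UTF-8 bytes as "token IDs".
def toy_encode(obj):
    return list(json.dumps(obj).encode("utf-8"))


def toy_decode(ids):
    return bytes(ids).decode("utf-8")


samples = [{"name": "Alice", "age": 30}, [1, 2.5, None, True], {"nested": {"k": []}}]
results = [roundtrips(toy_encode, toy_decode, s) for s in samples]
print(all(results))
```

With the real tokenizer you would pass `tokenizer.encode` and `tokenizer.decode` in place of the stand-ins. The comparison is on parsed values, so differences in whitespace between the input and decoded strings are not flagged.
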
## Training Your Own

```python
from json_tokenizer import JSONTokenizer

tok = JSONTokenizer(bpe_vocab_size=4096, max_key_vocab=512)
tok.train_from_json_files(["your_data.jsonl"])
tok.save("./my_tokenizer")

# Convert to HF format
from json_tokenizer.hf_compat import JSONPreTrainedTokenizer

hf_tok = JSONPreTrainedTokenizer.from_json_tokenizer(tok)
hf_tok.save_pretrained("./my_hf_tokenizer")
```

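
For intuition about what a cap like `max_key_vocab` might do, here is a sketch of one plausible way to learn a key vocabulary: count every object key in a JSONL corpus and keep the most frequent ones. The selection rule is an assumption, since this card does not describe how the library chooses keys. The function names and the `first_id=32` default are hypothetical, with the default chosen only to match the ID range listed under Architecture.

```python
import json
from collections import Counter


def iter_keys(value):
    """Yield every object key in a parsed JSON value, at any nesting depth."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield key
            yield from iter_keys(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_keys(child)


def learn_key_vocab(jsonl_lines, max_key_vocab, first_id=32):
    """Assign IDs to the most frequent keys, capped at max_key_vocab entries."""
    counts = Counter()
    for line in jsonl_lines:
        if line.strip():
            counts.update(iter_keys(json.loads(line)))
    # Sort by descending count, then alphabetically, so IDs are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {key: first_id + i for i, (key, _) in enumerate(ranked[:max_key_vocab])}


lines = [
    '{"type": "Feature", "properties": {"name": "Oslo"}}',
    '{"type": "Feature", "properties": {"name": "Lima", "pop": 1}}',
]
vocab = learn_key_vocab(lines, max_key_vocab=3)
print(vocab)  # {'name': 32, 'properties': 33, 'type': 34}
```

Here the rare key `pop` falls outside the cap, so in a scheme like this it would be encoded through the subword tier instead of a dedicated token.
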
## Where It Wins / Where It Loses

| Scenario | json-tokenizer | cl100k_base |
|----------|---------------|-------------|
| GeoJSON (schema-repetitive) | **+7.8% savings** | baseline |
| Telemetry logs | **+5.5% savings** | baseline |
| Batch JSON arrays | **+9.3% savings** | baseline |
| Config objects | **+12.3% savings** | baseline |
| Prose-heavy JSON (Alpaca) | -26.2% | **wins** |
| K8s manifests (deep nesting) | break-even | break-even |

**Best for:** API responses, observability logs, function calling, structured outputs

**Not for:** Prose-heavy JSON, general-purpose text

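
When reproducing a comparison like the one above on your own data, the savings figure can be computed from two token counts. The sketch below assumes savings are measured relative to the baseline count, which this card does not state explicitly. The function name and the example counts are hypothetical and chosen only to show the sign convention.

```python
def token_savings_pct(baseline_tokens, candidate_tokens):
    """Percent fewer tokens than the baseline; negative means the candidate is longer."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


# Hypothetical counts chosen only to illustrate the sign convention.
print(round(token_savings_pct(1000, 922), 1))   # 7.8
print(round(token_savings_pct(1000, 1262), 1))  # -26.2
```
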
## Citation

```bibtex
@software{maio2026jsontokenizer,
  author = {Maio, Anthony},
  title = {Structure-Aware Tokenization for {JSON}: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
  year = {2026},
  url = {https://github.com/anthony-maio/json-tokenizer},
  doi = {10.5281/zenodo.18879110},
  version = {0.2.0},
  license = {MIT}
}
```

## License

MIT