anthonym21

Update tokenizer: 73K training objects, 125 keys, DOI 10.5281/zenodo.18879110

f876778 verified 2 months ago

	---
	library_name: transformers
	license: mit
	language:
	- en
	tags:
	- tokenizer
	- json
	- bpe
	- structured-data
	- llm
	pipeline_tag: text-generation
	---

	# json-tokenizer: Structure-Aware Tokenization for JSON

	A structure-aware tokenizer that assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies byte-pair encoding (BPE) only to value content.

	Paper: [Structure-Aware Tokenization for JSON: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary](https://doi.org/10.5281/zenodo.18879110)

	Code: [github.com/anthony-maio/json-tokenizer](https://github.com/anthony-maio/json-tokenizer)

	## Key Results

	| Metric | Value |
	|--------|-------|
	| Token savings vs cl100k_base | 5-15% on schema-repetitive JSON |
	| Vocabulary size | 4,251 tokens (vs 100,256 for cl100k_base) |
	| Vocab reduction | ~24x smaller at this vocabulary size (100,256 / 4,251); the paper title cites 90x |
	| Roundtrip fidelity | 100% lossless across 4,200+ test objects |
	| Crossover point | Beats cl100k_base at just 558 tokens |

	## Architecture

	Three-tier vocabulary:
	1. Structural tokens (IDs 0-15): `{`, `}`, `[`, `]`, `:`, `,`, `true`, `false`, `null`, type markers
	2. Key vocabulary (IDs 32-N): Learned single-token keys from training data (125 keys); IDs 16-31 are reserved
	3. BPE subwords (IDs N+1 to N+B): Byte-pair encoding trained on JSON value strings (4,096 tokens)
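The three tiers above can be illustrated with a toy encoder. This is a minimal sketch, not the library's implementation: `toy_encode` and the tier labels are hypothetical names, type markers are omitted, and each value is emitted as a single `"bpe"` chunk where a real tokenizer would split it into subwords.

```python
import json
from typing import Any, List, Set, Tuple

Token = Tuple[str, str]  # (tier, text); tier is "struct", "key", or "bpe"


def toy_encode(obj: Any, key_vocab: Set[str]) -> List[Token]:
    """Walk a parsed JSON value and label each piece with its vocabulary tier."""
    if isinstance(obj, dict):
        out: List[Token] = [("struct", "{")]
        for i, (key, value) in enumerate(obj.items()):
            if i:
                out.append(("struct", ","))
            # Known keys get one dedicated token; unknown keys fall back to BPE.
            out.append(("key" if key in key_vocab else "bpe", key))
            out.append(("struct", ":"))
            out.extend(toy_encode(value, key_vocab))
        out.append(("struct", "}"))
        return out
    if isinstance(obj, list):
        out = [("struct", "[")]
        for i, item in enumerate(obj):
            if i:
                out.append(("struct", ","))
            out.extend(toy_encode(item, key_vocab))
        out.append(("struct", "]"))
        return out
    if obj is None or isinstance(obj, bool):
        # true / false / null are structural tokens, not BPE content.
        return [("struct", json.dumps(obj))]
    # Strings and numbers: stand-in for BPE over the value content.
    return [("bpe", json.dumps(obj))]


tokens = toy_encode({"name": "Alice", "active": True}, key_vocab={"name", "active"})
for tier, text in tokens:
    print(f"{tier:6} {text}")
```

In this sketch only the `"Alice"` value would reach the BPE tier; the braces, colon, comma, `true`, and both keys each cost one token.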

	## This Model

	This pretrained tokenizer was trained on structured JSON datasets:
	- GeoJSON city features (geographic data)
	- Observability telemetry logs (monitoring data)
	- Kubernetes manifests (infrastructure config)
	- Structured API outputs
	- Synthetic training corpus (700 objects)

	Total training objects: 72,991
	Vocabulary: 4,251 tokens (16 structural + 16 reserved + 125 keys + 4,096 BPE)
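The vocabulary breakdown above implies contiguous ID ranges per tier. The sketch below maps a token ID to its tier under that assumption; `VocabLayout` is a hypothetical helper, the defaults are the nominal tier sizes listed in this README, and the exact boundaries in the shipped vocabulary may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabLayout:
    """Assumed contiguous layout: structural, reserved, keys, then BPE."""

    n_structural: int = 16
    n_reserved: int = 16
    n_keys: int = 125
    n_bpe: int = 4096

    @property
    def key_start(self) -> int:
        # Keys begin after the structural and reserved ranges (ID 32 by default).
        return self.n_structural + self.n_reserved

    @property
    def bpe_start(self) -> int:
        return self.key_start + self.n_keys

    @property
    def end(self) -> int:
        # One past the last valid ID.
        return self.bpe_start + self.n_bpe

    def tier_of(self, token_id: int) -> str:
        """Return the tier name for a token ID, or raise if it is out of range."""
        if not 0 <= token_id < self.end:
            raise ValueError(f"token id {token_id} is outside the vocabulary")
        if token_id < self.n_structural:
            return "structural"
        if token_id < self.key_start:
            return "reserved"
        if token_id < self.bpe_start:
            return "key"
        return "bpe"


layout = VocabLayout()
for token_id in (0, 31, 32, 200):
    print(token_id, layout.tier_of(token_id))
```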

	## Usage

	### With HuggingFace Transformers

	```python
	# Requires: pip install "json-tokenizer[huggingface]"  (quotes keep shells such as zsh from expanding the brackets)
	from json_tokenizer.hf_compat import JSONPreTrainedTokenizer

	tokenizer = JSONPreTrainedTokenizer.from_pretrained("anthonym21/json-tokenizer-structured")

	# Encode JSON
	output = tokenizer('{"name": "Alice", "age": 30, "active": true}')
	print(output["input_ids"])

	# Decode back to JSON (lossless)
	decoded = tokenizer.decode(output["input_ids"])
	print(decoded) # {"name": "Alice", "age": 30, "active": true}
	```

	### With Core Library

	```python
	# Requires: pip install json-tokenizer
	from json_tokenizer import JSONTokenizer

	tokenizer = JSONTokenizer.load("./path/to/saved/tokenizer")

	# Encode (accepts Python dicts, lists, or JSON strings)
	ids = tokenizer.encode({"name": "Alice", "age": 30})

	# Decode (lossless roundtrip)
	json_str = tokenizer.decode(ids)
	```
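One way to check the lossless claim on your own data is to compare parsed values before and after a roundtrip. The helper below is a sketch: `roundtrips` is a hypothetical name, and it takes the encode and decode steps as plain callables so it does not assume anything about the tokenizer beyond what the examples above show. A UTF-8 byte codec stands in for the tokenizer so the block runs on its own.

```python
import json
from typing import Callable, Sequence


def roundtrips(
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
    text: str,
) -> bool:
    """Return True if decode(encode(text)) parses to the same JSON value as text."""
    restored = decode(encode(text))
    # Compare parsed values so that whitespace differences do not count as loss.
    return json.loads(restored) == json.loads(text)


# Stand-in codec for demonstration only: UTF-8 bytes as token IDs.
def byte_encode(text: str) -> Sequence[int]:
    return list(text.encode("utf-8"))


def byte_decode(ids: Sequence[int]) -> str:
    return bytes(ids).decode("utf-8")


sample = '{"name": "Alice", "age": 30, "active": true}'
print(roundtrips(byte_encode, byte_decode, sample))
```

With the tokenizer loaded as in the Transformers example, the equivalent call would be `roundtrips(lambda s: tokenizer(s)["input_ids"], tokenizer.decode, sample)`.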

	## Training Your Own

	```python
	from json_tokenizer import JSONTokenizer

	tok = JSONTokenizer(bpe_vocab_size=4096, max_key_vocab=512)
	tok.train_from_json_files(["your_data.jsonl"])
	tok.save("./my_tokenizer")

	# Convert to HF format
	from json_tokenizer.hf_compat import JSONPreTrainedTokenizer
	hf_tok = JSONPreTrainedTokenizer.from_json_tokenizer(tok)
	hf_tok.save_pretrained("./my_hf_tokenizer")
	```
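The training call above reads `your_data.jsonl`. Assuming that file is in JSON Lines form (one JSON value per line, which the extension suggests but this README does not state), a small helper like the following could produce it from Python objects. `write_jsonl` is a hypothetical name, not part of the library.

```python
import json
from typing import Any, Iterable


def write_jsonl(objects: Iterable[Any], path: str) -> int:
    """Write one compact JSON document per line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for obj in objects:
            # json.dumps never emits raw newlines, so each object stays on one line.
            handle.write(json.dumps(obj, ensure_ascii=False, separators=(",", ":")))
            handle.write("\n")
            count += 1
    return count


if __name__ == "__main__":
    records = [
        {"name": "Alice", "age": 30, "active": True},
        {"name": "Bob", "age": 41, "active": False},
    ]
    print(write_jsonl(records, "your_data.jsonl"))
```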

	## Where It Wins / Where It Loses

	| Scenario | json-tokenizer | cl100k_base |
	|----------|----------------|-------------|
	| GeoJSON (schema-repetitive) | +7.8% savings | baseline |
	| Telemetry logs | +5.5% savings | baseline |
	| Batch JSON arrays | +9.3% savings | baseline |
	| Config objects | +12.3% savings | baseline |
	| Prose-heavy JSON (Alpaca) | -26.2% | wins |
	| K8s manifests (deep nesting) | break-even | break-even |

	Best for: API responses, observability logs, function calling, structured outputs
	Not for: Prose-heavy JSON, general-purpose text
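To run the same kind of comparison on your own corpus, a sketch like the one below may help. It assumes "savings" means the relative reduction in token count against the baseline, which this README does not define explicitly; `savings_pct` and `corpus_savings` are hypothetical helpers, and the token counters are passed in as callables.

```python
from typing import Callable, Iterable


def savings_pct(baseline_tokens: int, candidate_tokens: int) -> float:
    """Relative token reduction in percent; negative means the candidate uses more."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - candidate_tokens) / baseline_tokens


def corpus_savings(
    texts: Iterable[str],
    count_baseline: Callable[[str], int],
    count_candidate: Callable[[str], int],
) -> float:
    """Sum token counts over a corpus, then compute the overall savings."""
    baseline_total = 0
    candidate_total = 0
    for text in texts:
        baseline_total += count_baseline(text)
        candidate_total += count_candidate(text)
    return savings_pct(baseline_total, candidate_total)


# Illustrative counts only: 1,000 baseline tokens vs 922 candidate tokens.
print(round(savings_pct(1000, 922), 1))
# A candidate that needs more tokens yields a negative value.
print(round(savings_pct(1000, 1262), 1))
```

For a real comparison, `count_baseline` could wrap `len(tiktoken.get_encoding("cl100k_base").encode(text))` and `count_candidate` could wrap `len(tokenizer.encode(text))` from the core library example.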

	## Citation

	```bibtex
	@software{maio2026jsontokenizer,
	author = {Maio, Anthony},
	title = {Structure-Aware Tokenization for {JSON}: Exploiting Schema Repetition for Compact Token Sequences with a 90x Smaller Vocabulary},
	year = {2026},
	url = {https://github.com/anthony-maio/json-tokenizer},
	doi = {10.5281/zenodo.18879110},
	version = {0.2.0},
	license = {MIT}
	}
	```

	## License

	MIT