Instructions to use PeymanHosseini/Hummingbird with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use PeymanHosseini/Hummingbird with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="PeymanHosseini/Hummingbird")# Load model directly from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("PeymanHosseini/Hummingbird", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use PeymanHosseini/Hummingbird with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "PeymanHosseini/Hummingbird" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "PeymanHosseini/Hummingbird", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/PeymanHosseini/Hummingbird
- SGLang
How to use PeymanHosseini/Hummingbird with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "PeymanHosseini/Hummingbird" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "PeymanHosseini/Hummingbird", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "PeymanHosseini/Hummingbird" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "PeymanHosseini/Hummingbird", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use PeymanHosseini/Hummingbird with Docker Model Runner:
docker model run hf.co/PeymanHosseini/Hummingbird
Hummingbird 0.0 Release
This is Hummingbird 0.0, a 1B proof-of-concept causal language model based on Efficient Attention, which reduces the number of parameters in the attention layer by 50% compared to standard Multi-Head Attention.
This version of Hummingbird is only meant to demonstrate Efficient Attention for use in causal language modelling. It has been trained on only 15 Billion tokens and is not safeguarded. Therefore, we do not recommend using it as a chatbot.
Model Details
The model consists of 1.1 Billion parameters with the following specifications:
| Parameter | size |
|---|---|
| # Transformer Blocks | 10 |
| Model Dimension | 3072 |
| # Heads | 1 |
The Attention Mechanism used is based on our newly proposed Efficient Attention from our paper, You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism (arXiv:2403.01643). We have chosen the number of heads to be 1 as an interesting case study since all current LMs use multiple heads.
The loss plot below illustrates the model's performance during training. For comparison, when trained on 15 billion tokens, Hummingbird achieves a slightly lower loss than TinyLlama, a model of similar size.
Team
The design and training of Hummingbird has been done jointly by Mehran Hosseini and Peyman Hosseini.
If you use Efficient Attention or Hummingbird, please cite our paper:
@article{Hosseinis24BetterAttention,
title = {You Need to Pay Better Attention: Rethinking the Mathematics of Attention Mechanism},
author = {Hosseini, Mehran and Hosseini, Peyman},
journal = {arXiv preprint arXiv:2403.01643},
year = {2024}
}
- Downloads last month
- 5