Instructions to use bigcode/starcoder with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bigcode/starcoder with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoder")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigcode/starcoder with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigcode/starcoder"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker

```shell
docker model run hf.co/bigcode/starcoder
```
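If you prefer Python over curl, the same OpenAI-compatible endpoint can be called with only the standard library. This is a minimal sketch that assumes the vLLM server above is already running on `localhost:8000`; the helper names (`build_payload`, `complete`) are illustrative, not part of any library:

```python
import json
import urllib.request

# vLLM's OpenAI-compatible completions endpoint (assumes the server above is running)
API_URL = "http://localhost:8000/v1/completions"

def build_payload(prompt, max_tokens=512, temperature=0.5):
    """Serialize the JSON body expected by /v1/completions."""
    return json.dumps({
        "model": "bigcode/starcoder",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")

def complete(prompt):
    """POST the request and return the generated text (needs a running server)."""
    req = urllib.request.Request(
        API_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# complete("def fibonacci(n):")  # only works while the vLLM server is up
```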
- SGLang
How to use bigcode/starcoder with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "bigcode/starcoder" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "bigcode/starcoder" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use bigcode/starcoder with Docker Model Runner:
```shell
docker model run hf.co/bigcode/starcoder
```
May I ask if there are plans to provide 8-bit or 4-bit quantized versions?
Thank you for creating the StarCoder model. However, it is estimated that only GPUs like the A100 will be able to perform inference with this model. May I ask if there are plans to provide 8-bit or 4-bit quantized versions?
Hi @intelligencegear ,
You can already use the 8bit model out of the box, by installing bitsandbytes and accelerate. Just run the following:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", load_in_8bit=True, device_map="auto")
...
```
You can take a deeper look at what it uses under the hood here
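For the 4-bit case asked about above, more recent versions of transformers and bitsandbytes also support load-time 4-bit quantization via `BitsAndBytesConfig`. A sketch, assuming a recent transformers release, bitsandbytes installed, and a CUDA GPU; the specific NF4 and fp16 settings here are just one reasonable choice, not tested on this model:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load-time 4-bit quantization: NF4 weights with fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    quantization_config=bnb_config,
    device_map="auto",
)
```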
load_in_8bit != GPTQ quantized. The performance is vastly different, with GPTQ being superior.
These are still a work in progress, but you can get early versions here: https://huggingface.co/mayank31398
I believe running time is not as good as we'd like, but quality of generated code seems to be good. A full evaluation is pending.
The work is awesome, and I have tested it myself. However, a directly usable checkpoint may not be available due to licensing reasons; instead, the model needs to be converted in machine memory. In other words, a machine with possibly 64 GB of RAM is required, so the hardware requirements are relatively high.
> load_in_8bit != GPTQ quantized. The performance is vastly different, with GPTQ being superior.
Yes, that's exactly what I wanted to say.
yes, LLM.int8() works quite differently from GPTQ.
LLM.int8(), which is enabled by `load_in_8bit=True`, does quantization at load time.
GPTQ, however, uses Optimal Brain Quantization for GPT-like models, and this requires calibration samples for quantization.
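To make the difference concrete: load-time quantization in the LLM.int8() style is essentially per-row absmax rounding and needs no data, whereas GPTQ additionally minimizes reconstruction error on calibration samples. Below is a pure-Python sketch of an absmax int8 round-trip for illustration only; it is not the actual bitsandbytes kernel (which also handles outlier columns in fp16) nor GPTQ:

```python
def absmax_quantize(weights):
    """Quantize a row of float weights to int8 using per-row absmax scaling."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # all values land in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

row = [0.12, -0.83, 0.47, -0.05]
q, scale = absmax_quantize(row)
approx = dequantize(q, scale)
# rounding error stays within one scale step per weight
assert all(abs(a - b) <= scale for a, b in zip(row, approx))
```

No calibration data appears anywhere above; a GPTQ-style method would instead adjust the remaining weights after each rounding step to minimize error on sample activations.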
@mayank31398 Do you think there will be a front end or interactive mode that works with your repo? Santa coder is great but without a chat like interface that can maintain context, Starcoder pretty much becomes unusable except for very specific situations. Thanks!
hey @syntaxing
there is already a model called starchat. Demo here: https://huggingface.co/spaces/HuggingFaceH4/starchat-playground
> yes, LLM.int8() works quite differently to GPTQ.
> LLM.int8(), which is achieved by `load_in_8bit=True`, does quantization at load time.
> However, GPTQ uses Optimal Brain Quantization for GPT-like models and this requires samples for quantization.
@mayank31398 @syntaxing it is indeed a different process, but wasn't it shown that LLM.int8() does not result in statistically significant performance degradation in this blog post? In that case, what's the benefit of going with a process that requires data samples and GPU hours (talking about the 8bit case, not the 4bit case)?
Mainly asking to validate if I'm leaving performance on the table by using bitsandbytes.
@JacopoBandoni sorry for the late reply.
GPTQ and LLM.int8() are completely different quantization algorithms.
Please refer to their respective papers for details.