Instructions to use bigcode/starcoder with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bigcode/starcoder with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoder")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigcode/starcoder with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigcode/starcoder"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker

```shell
docker model run hf.co/bigcode/starcoder
```
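If you prefer Python over curl, the same OpenAI-compatible endpoint can be called with only the standard library. This is a minimal sketch that assumes the vLLM server above is already running on `localhost:8000`; the helper names (`build_payload`, `complete`) are illustrative, not part of any library:

```python
import json
import urllib.request

# vLLM's OpenAI-compatible completions endpoint (assumes the server above is running)
API_URL = "http://localhost:8000/v1/completions"

def build_payload(prompt, max_tokens=512, temperature=0.5):
    """Serialize the JSON body expected by /v1/completions."""
    return json.dumps({
        "model": "bigcode/starcoder",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")

def complete(prompt):
    """POST the request and return the generated text (needs a running server)."""
    req = urllib.request.Request(
        API_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# complete("def fibonacci(n):")  # only works while the vLLM server is up
```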
- SGLang
How to use bigcode/starcoder with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "bigcode/starcoder" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "bigcode/starcoder" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

- Docker Model Runner
How to use bigcode/starcoder with Docker Model Runner:
```shell
docker model run hf.co/bigcode/starcoder
```
May I ask if there are plans to provide 8-bit or 4-bit quantized versions?
Thank you for creating the StarCoder model. However, it is estimated that only GPUs like the A100 will be able to perform inference with this model. May I ask if there are plans to provide 8-bit or 4-bit quantized versions?
Hi @intelligencegear ,
You can already use the 8bit model out of the box, by installing bitsandbytes and accelerate. Just run the following:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", load_in_8bit=True, device_map="auto")
...
```
You can take a deeper look at what it uses under the hood here
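For the 4-bit case asked about above, more recent versions of transformers and bitsandbytes also support load-time 4-bit quantization via `BitsAndBytesConfig`. A sketch, assuming a recent transformers release, bitsandbytes installed, and a CUDA GPU; the specific NF4 and fp16 settings here are just one reasonable choice, not tested on this model:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load-time 4-bit quantization: NF4 weights with fp16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    quantization_config=bnb_config,
    device_map="auto",
)
```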
load_in_8bit != GPTQ quantized. The performance is vastly different, with GPTQ being superior.
These are still a work in progress, but you can get early versions here: https://huggingface.co/mayank31398
I believe running time is not as good as we'd like, but quality of generated code seems to be good. A full evaluation is pending.
The work is awesome, and I have tested it myself. However, a directly usable checkpoint may not be available due to licensing reasons; instead, the model needs to be converted in machine memory. In other words, a machine with possibly 64 GB of RAM is required, so the hardware requirements are relatively high.
> load_in_8bit != GPTQ quantized. The performance is vastly different, with GPTQ being superior.
Yes, that's exactly what I wanted to say.
yes, LLM.int8() works quite differently from GPTQ.
LLM.int8(), which is enabled by `load_in_8bit=True`, does quantization at load time.
GPTQ, however, uses Optimal Brain Quantization for GPT-like models, and this requires calibration samples for quantization.
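To make the difference concrete: load-time quantization in the LLM.int8() style is essentially per-row absmax rounding and needs no data, whereas GPTQ additionally minimizes reconstruction error on calibration samples. Below is a pure-Python sketch of an absmax int8 round-trip for illustration only; it is not the actual bitsandbytes kernel (which also handles outlier columns in fp16) nor GPTQ:

```python
def absmax_quantize(weights):
    """Quantize a row of float weights to int8 using per-row absmax scaling."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]  # all values land in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

row = [0.12, -0.83, 0.47, -0.05]
q, scale = absmax_quantize(row)
approx = dequantize(q, scale)
# rounding error stays within one scale step per weight
assert all(abs(a - b) <= scale for a, b in zip(row, approx))
```

No calibration data appears anywhere above; a GPTQ-style method would instead adjust the remaining weights after each rounding step to minimize error on sample activations.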
@mayank31398 Do you think there will be a front end or interactive mode that works with your repo? Santa coder is great but without a chat like interface that can maintain context, Starcoder pretty much becomes unusable except for very specific situations. Thanks!
hey @syntaxing
there is already a model called starchat. Demo here: https://huggingface.co/spaces/HuggingFaceH4/starchat-playground
> yes, LLM.int8() works quite differently to GPTQ.
> LLM.int8(), which is achieved by `load_in_8bit=True`, does quantization at load time.
> However, GPTQ uses Optimal Brain Quantization for GPT-like models and this requires samples for quantization.
@mayank31398 @syntaxing it is indeed a different process, but wasn't it shown that LLM.int8() does not result in statistically significant performance degradation in this blog post? In that case, what's the benefit of going with a process that requires data samples and GPU hours (talking about the 8bit case, not the 4bit case)?
Mainly asking to validate if I'm leaving performance on the table by using bitsandbytes.
@JacopoBandoni sorry for the late reply.
GPTQ and LLM.int8() are completely different quantization algorithms.
Please refer to their respective papers for details.