bigcode
/

santacoder

@@ -195,15 +195,18 @@ model-index:
 2. [Use](#use)
 3. [Limitations](#limitations)
 4. [Training](#training)
-5. [Citation](#citation)
 # Model Summary
-The SantaCoder models are a series of 1B parameter models trained on Python, Java, and JavaScript. They were trained on datasets with different filter parameters and with architecture and objective variations. The main model uses multi-query attention, was trained using near-deduplication and commnent-to-code ratio as filtering criteria and using the Fill-in-the-Middle objective.
 - **Repository:** [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
-- **Project Website:** [bigcode-project.org]www.bigcode-project.org)
-- **Paper:** [Coming soon]()
 - **Point of Contact:** [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
 - **Languages:** Python, Java, and JavaScript
@@ -224,7 +227,8 @@ The `dedup-alt-comments` model is the best performing model and was trained twic
 ## Intended use
 **Feel free to share your generations in the Community tab!**
@@ -269,7 +273,7 @@ model = AutoModelForCausalLM.from_pretrained(
 ### Attribution
-The pretraining dataset of the model was filtered for permissive licenses only. Nevertheless, the model can generate source code verbatim from the dataset which requires attribution. We provide a [search index](TODO) that let's you search through the pretraining data to identify where generated code came from and apply the proper attribution to your code.
 # Limitations
@@ -296,6 +300,8 @@ The model has been trained on source code in Python, Java, and JavaScript. The p
 - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
 - **FP16 if applicable:** [apex](https://github.com/NVIDIA/apex)
 # Citation
 **TODO**

 2. [Use](#use)
 3. [Limitations](#limitations)
 4. [Training](#training)
+5. [License](#license)
+6. [Citation](#citation)
 # Model Summary
+The SantaCoder models are a series of 1B parameter models trained on the Python, Java, and JavaScript subset of [The Stack (v1.1)](https://huggingface.co/datasets/bigcode/the-stack) (which excluded opt-out requests).
+The main model uses multi-query attention and was trained with the Fill-in-the-Middle objective on data filtered by near-deduplication and comment-to-code ratio.
+In addition, there are several models that were trained on datasets with different filter parameters and with architecture and objective variations.
 - **Repository:** [bigcode/Megatron-LM](https://github.com/bigcode-project/Megatron-LM)
+- **Project Website:** [bigcode-project.org](https://www.bigcode-project.org)
+- **Paper:** [🎅SantaCoder: Don't reach for the stars!🌟]()
 - **Point of Contact:** [contact@bigcode-project.org](mailto:contact@bigcode-project.org)
 - **Languages:** Python, Java, and JavaScript
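
The Fill-in-the-Middle objective mentioned above means the model can be asked to produce the code between a given prefix and suffix. The following is a minimal sketch of how such a prompt might be assembled. The sentinel token spellings (`<fim-prefix>`, `<fim-suffix>`, `<fim-middle>`) and the helper name `fim_prompt` are assumptions for illustration, not taken from this diff; check the tokenizer's special tokens before relying on them.

```python
# Assumed sentinel tokens; verify them against the tokenizer's special tokens.
FIM_PREFIX = "<fim-prefix>"
FIM_SUFFIX = "<fim-suffix>"
FIM_MIDDLE = "<fim-middle>"


def fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix so the model generates the missing middle.

    Hypothetical helper: the prompt ends with the middle sentinel, so the
    tokens the model generates next are the infilled code.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = fim_prompt(
    prefix="def print_hello_world():\n    ",
    suffix="\n    print('Hello world!')",
)
```

The resulting string would be passed to the tokenizer like any other prompt; the text generated after the middle sentinel is the infill.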
 ## Intended use
+The model was trained on GitHub code. As such, it is _not_ an instruction-following model, and commands like "Write a function that computes the square root." do not work well.
+Instead, phrase requests the way they would appear in source code: as a comment (e.g. `# the following function computes the sqrt`), or as a function signature and docstring whose body the model then completes.
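
As a rough illustration of the two prompting styles described above, the sketch below builds a comment-style prompt and a signature-plus-docstring prompt. The helper names (`comment_prompt`, `signature_prompt`, `complete`) are hypothetical, and the generation settings in `complete` (the `trust_remote_code=True` flag, greedy decoding, 48 new tokens) are assumptions rather than recommendations from the model authors.

```python
def comment_prompt(description: str) -> str:
    """Phrase a request as a Python comment, as it might appear in source code."""
    return f"# {description}\n"


def signature_prompt(signature: str, docstring: str) -> str:
    """Build a function signature plus docstring; the model continues with the body."""
    return f'{signature}\n    """{docstring}"""\n'


def complete(prompt: str, checkpoint: str = "bigcode/santacoder",
             max_new_tokens: int = 48) -> str:
    """Complete `prompt` with the model (downloads the checkpoint on first call)."""
    # Imported here so the prompt helpers work without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Assumption: the checkpoint ships custom model code and needs this flag.
    model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])


p1 = comment_prompt("the following function computes the sqrt")
p2 = signature_prompt("def sqrt(x: float) -> float:", "Return the square root of x.")
```

Calling `complete(p2)` would then ask the model to continue from the docstring with the function body; an instruction such as "Write a function that computes the square root." gives the model nothing code-like to continue.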
 **Feel free to share your generations in the Community tab!**
 ### Attribution
+The pretraining dataset of the model was filtered for permissive licenses only. Nevertheless, the model can generate source code verbatim from the dataset, and such code requires attribution. We provide a [search index](https://huggingface.co/spaces/bigcode/santacoder-search) that lets you search through the pretraining data to identify where generated code came from and apply the proper attribution to your code.
 # Limitations
 - **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
 - **FP16 if applicable:** [apex](https://github.com/NVIDIA/apex)
+# License
+The model is licensed under the CodeML Open RAIL-M v0.1 license. You can find the full license [here](https://huggingface.co/spaces/bigcode/license).
 # Citation
 **TODO**