HuggingFaceFW/finephrase-rephrased
6.94 TB
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team releasing large-scale pre-training datasets to accelerate open LLM development.
Explore synthetic data experiments on a virtual bookshelf
Viewer to explore the finewiki dataset
Explore and download the FineWeb web-text dataset
Evaluate multilingual models using FineTasks