HPLT download web site

HPLT project data download site

Monolingual data

Data release 3 (July 2025)

Here, we publish the third version of our monolingual corpora, compiled from large web crawls provided by the Internet Archive and CommonCrawl projects. See more details at the HPLT project website.

There are 198 languages in this release (50 Terabytes in total size). The corpora are provided as JSONL files compressed with zstd. For convenience, data is split into multiple shards. The number of shards per language depends on the size of the specific corpus.
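Since the shards are zstd-compressed JSONL, a downloaded shard can be inspected without unpacking it to disk. The following is a minimal sketch, assuming the zstd command-line tool is installed and that each line of a shard holds one JSON record; the function names and the shard path in the comments are hypothetical.

```shell
# Print the first N records (lines) of a zstd-compressed JSONL shard.
# Defaults to one record if N is not given.
peek_shard() {
  shard=$1
  n=${2:-1}
  # -d decompresses, -c writes to standard output and leaves the file untouched.
  zstd -dc "$shard" | head -n "$n"
}

# Count the records in a shard (assumes one JSON object per line).
count_records() {
  zstd -dc "$1" | wc -l
}

# Example usage (hypothetical shard path):
#   peek_shard crh_Latn/example.jsonl.zst 3
#   count_records crh_Latn/example.jsonl.zst
```

Streaming through a pipe avoids writing the decompressed text to disk, which matters for the larger corpora.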

The simplest way to download the data is to use wget -i with the language-specific mapping files containing full download addresses for all shards of this particular language, for example (for Crimean Tatar in Latin script):
wget -O - https://data.hplt-project.org/three/sorted/crh_Latn.map | wget -x -nH --cut-dirs=2 -i -

The above command retrieves the map for crh_Latn and feeds it as a list of download addresses into a second wget invocation, requesting the creation of local directories (-x), but cutting off the host name and the first two directory components (-nH --cut-dirs=2).
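To make the effect of -nH --cut-dirs=2 concrete, the sketch below approximates in plain shell how a download address is mapped to a local path. It only illustrates the naming rule and is not how wget is implemented; the helper name local_path and the shard file name example.jsonl.zst are hypothetical.

```shell
# Approximate the local path that wget -x -nH --cut-dirs=N would create.
local_path() {
  url=$1
  cut=$2
  # Drop the scheme and the host name (the effect of -nH).
  path=${url#*://}
  path=${path#*/}
  # Drop the first N directory components (the effect of --cut-dirs=N).
  i=0
  while [ "$i" -lt "$cut" ]; do
    path=${path#*/}
    i=$((i + 1))
  done
  printf '%s\n' "$path"
}

# Hypothetical shard address; real file names come from the mapping file.
local_path "https://data.hplt-project.org/three/sorted/crh_Latn/example.jsonl.zst" 2
# prints: crh_Latn/example.jsonl.zst
```

With these options, each language ends up in its own directory directly under the current working directory.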

To download all available data, there is a larger mapping file for the full multilingual (excluding English) portion, amounting to a download of around 20 terabytes. The complete English data comprises some 30 terabytes and can be downloaded using its per-language mapping file. These can be retrieved using e.g. wget, and used as input directives for larger downloads, much like in the example above:

wget https://data.hplt-project.org/three/sorted/multilingual.map
wget https://data.hplt-project.org/three/sorted/eng_Latn.map
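For downloads of this size, it may help to check how many shards a mapping file lists before starting, and to make the download resumable. The following is a sketch under the assumption that a mapping file contains one download address per line; the function names are hypothetical, and -c is wget's standard option for continuing partially downloaded files.

```shell
# Count the download addresses in a mapping file.
# Assumption: one URL per line.
count_shards() {
  grep -c -E '^https?://' "$1"
}

# Download every shard listed in a mapping file.
# -c continues partially downloaded files instead of starting over,
# so the command can be rerun after an interruption.
fetch_map() {
  wget -c -x -nH --cut-dirs=2 -i "$1"
}

# Example usage (not run here; the multilingual portion is around 20 terabytes):
#   count_shards multilingual.map
#   fetch_map multilingual.map
```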

See https://huggingface.co/datasets/HPLT/HPLT3.0 for instructions on how to load the monolingual portion of the HPLT v3.0 dataset in the Hugging Face Datasets format.

Language-specific mapping files

License

These datasets are released under this licensing scheme:

We do not own any of the text from which these datasets have been extracted.
We make available the actual packaging of the HPLT datasets under the Creative Commons CC0 license.

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
Clearly identify the copyrighted work claimed to be infringed.
Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
You can reach us at hplt-datasets@ufal.mff.cuni.cz.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.

High-Performance Language Technology (HPLT)

This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (grant number 10052546).