Here, we publish monolingual corpora compiled from large web crawls provided by the Internet Archive and CommonCrawl projects. See more details at the HPLT project website.
There are 75 languages in this release (22 terabytes in total). The corpora are provided as JSONL files compressed with zstd. For convenience, the data is split into multiple shards of a few gigabytes each; the number of shards per language depends on the size of that language's corpus.
The simplest way to download the data is to use wget -i with the language-specific mapping file that lists the full URLs of all shards for a given language, for example:
wget -i https://data.hplt-project.org/one/monotext/eu_map.txt
If you want to download all 22 terabytes in one go, use
wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/one/monotext/hplt_monolingual_map_all_1.txt
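Once a shard has been downloaded, it can be streamed record by record without decompressing it to disk first. The sketch below is a minimal example using the third-party zstandard package; the shard filename and the field names ("text", "url") are illustrative assumptions, so check one record of your own shard to confirm which keys are present.

```python
# Minimal sketch: iterate over a zstd-compressed JSONL shard.
# Requires: pip install zstandard
import io
import json

import zstandard


def iter_records(path):
    """Yield one JSON object per line from a zstd-compressed JSONL file."""
    with open(path, "rb") as fh:
        dctx = zstandard.ZstdDecompressor()
        with dctx.stream_reader(fh) as reader:
            text = io.TextIOWrapper(reader, encoding="utf-8")
            for line in text:
                yield json.loads(line)


# "eu_1.jsonl.zst" is an assumed example filename for one Basque shard.
for record in iter_records("eu_1.jsonl.zst"):
    # Assumed keys: "text" for document content, "url" for its source.
    print(record.get("url"), len(record.get("text", "")))
    break
```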
Here, we publish bilingual corpora compiled from large web crawls provided by the Internet Archive and CommonCrawl projects. See more details at the HPLT project website.
There are 14 language pairs in this release (25 gigabytes in total). The corpora are provided as gzip-compressed raw TSV files. "Raw" means the data comes directly from the sentence alignment step, with only obvious noise filtered out, so it is likely to contain a substantial proportion of non-parallel sentences.
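A gzip-compressed TSV shard can be inspected directly from Python's standard library. In the sketch below the filename and the column layout (source sentence, target sentence, plus possible score columns) are assumptions, so look at the first few rows of your own file to confirm the actual format.

```python
# Minimal sketch: print the first few rows of a gzip-compressed TSV shard.
import csv
import gzip

# "en-eu.raw.tsv.gz" is an assumed example filename for one language pair.
with gzip.open("en-eu.raw.tsv.gz", "rt", encoding="utf-8", newline="") as fh:
    reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        print(row)   # one tab-separated aligned sentence pair per line
        if i >= 4:   # show only the first five rows
            break
```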
These data are released under this licensing scheme:
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.
High-Performance Language Technology (HPLT)
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].