Here we publish the second version of our monolingual corpora, compiled from large web crawls provided by the Internet Archive and Common Crawl projects. See the HPLT project website for more details.
This release covers 193 languages, with the "cleaned" version totalling 15 terabytes. The corpora are provided as JSONL files compressed with zstd. For convenience, the data is split into multiple shards of a few gigabytes each; the number of shards per language depends on the size of that language's corpus.
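Once a shard is downloaded (see below), each one is a stream of JSON objects, one document per line, so it can be inspected without decompressing to disk. A minimal sketch, assuming zstd and jq are installed; the shard file name is illustrative, and the "text" field is an assumption that may differ from the actual schema:

# List the fields of the first document in a shard
zstdcat 1.jsonl.zst | head -n 1 | jq 'keys'
# Stream out document text, assuming a "text" field
zstdcat 1.jsonl.zst | jq -r '.text' | less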
The simplest way to download the data is to use wget -i with the language-specific mapping files, which contain the full URLs for all shards of a given language. For example:
wget -i https://data.hplt-project.org/two/cleaned/epo_Latn_map.txt
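For the larger corpora, wget's -c flag resumes interrupted transfers instead of restarting them. A sketch reusing the map-file pattern above, with the language code substituted as an illustrative assumption:

# Resume partially downloaded shards for one language
wget -c -i https://data.hplt-project.org/two/cleaned/fin_Latn_map.txt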
If you want to download all 15 terabytes in one go, use
wget -nH -x --cut-dirs=2 -i https://data.hplt-project.org/two/cleaned/hplt_monolingual_map_cleaned_2.0.txt
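Here -x recreates the remote directory layout locally, -nH drops the host name, and --cut-dirs=2 strips the leading two/cleaned path components, so shards land in per-language directories. If you only need a subset of languages, looping over the per-language map files is a lighter-weight alternative; a sketch, where the language codes are assumptions following the naming pattern of the example above:

# Download shards for selected languages only
for lang in epo_Latn fin_Latn; do
  wget -c -i "https://data.hplt-project.org/two/cleaned/${lang}_map.txt"
done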
These datasets are released under the following notice and take-down policy:
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please notify us.
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.
High-Performance Language Technology (HPLT)
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].