DHPLT is an open collection of diachronic corpora in 41 diverse languages. It is based on the web-crawled HPLT datasets; we use web crawl timestamps as an approximate signal of document creation time.
The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words; other researchers remain free to select their own target words from the same datasets.
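One common way to use embeddings from different time periods is to compare the vectors of the same word and treat a larger cosine distance as a hint of semantic change. The following is a minimal sketch of that idea, not the procedure used by the DHPLT authors: the function name and the toy vectors are hypothetical, and it assumes the two vectors come from spaces that are already comparable (for example, aligned models).

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy vectors invented for illustration; real vectors would be loaded
    # from the embeddings of two different time periods.
    vec_early = [1.0, 0.0]
    vec_late = [0.0, 1.0]
    print(cosine_distance(vec_early, vec_late))  # orthogonal vectors -> 1.0
```

A distance near 0 suggests the word occupies a similar position in both periods, while larger values suggest its usage has shifted. Any threshold for "changed" would have to be chosen empirically.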
DHPLT is introduced in "DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling" by Mariia Fedorova, Andrey Kutuzov and Khonzoda Umarova. All our code is on GitHub; please raise an issue in the repository if you have any questions.
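The per-period corpora are distributed as Zstandard-compressed JSON Lines files (the .jsonl.zst items in each language directory). The sketch below shows one way such a file might be streamed document by document without decompressing it to disk. It is not taken from the project repository: the helper names are hypothetical, the keys in the demo are invented, and the actual document schema should be checked against the paper or the data. It assumes the third-party `zstandard` package for decompression.

```python
import io
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_zst_text(path: str) -> io.TextIOWrapper:
    """Open a .zst file as a UTF-8 text stream.

    Requires the third-party `zstandard` package, imported lazily so that
    iter_jsonl can be used (and tested) without it.
    """
    import zstandard

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


if __name__ == "__main__":
    # In-memory demo; the keys "id" and "body" are made up for illustration
    # and are NOT the documented DHPLT schema.
    sample = io.StringIO('{"id": 1, "body": "first"}\n\n{"id": 2, "body": "second"}\n')
    for doc in iter_jsonl(sample):
        print(sorted(doc.keys()))
```

For a real file, the two helpers would be combined along the lines of `with open_zst_text("2011_2015.jsonl.zst") as fh: for doc in iter_jsonl(fh): ...`, which keeps memory use flat because lines are decompressed and parsed one at a time.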
Each language directory below contains the following items (more details in the paper):
2011_2015.jsonl.zst: one million full-text documents sampled from time period 2011-2015
2020_2021.jsonl.zst: one million full text documents sampled from time period 2020-2021
2024_2025.jsonl.zst: one million full text documents sampled from time period 2024
bert_embeddings/: BERT token embeddings for a set of target words, sorted by time period and the first character
bert_substitutions/: BERT lexical substitutions for a set of target words, sorted by time period and the first character
t5_embeddings/: T5 token embeddings for a set of target words, sorted by time period and the first character
word2vec/: word2vec models trained on each time period and aligned to the 2024 model
xlmr_embeddings/: XLM-R token embeddings for a set of target words, sorted by time period and the first character
xlmr_substitutions/: XLM-R lexical substitutions for a set of target words, sorted by time period and the first character

These datasets are released under this licensing scheme:
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpora.
High-Performance Language Technology (HPLT)
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee (grant number 10052546).