HPLT 2.0 deduplicated datasets by language family and number of characters