HPLT 3.0 cleaned datasets by language family and number of characters