Corpus (language) | Size (millions of tokens) |
---|---|
NEW: English (ClueWeb09) ¹ | 82,581 |
Russian | 20,162 |
English | 12,968 |
French | 12,369 |
Japanese | 11,113 |
Polish | 9,567 |
Spanish (American) | 8,719 |
Arabic | 6,637 |
Czech | 5,818 |
Turkish | 4,125 |
Hungarian | 3,184 |
Italian | 3,077 |
German | 2,844 |
Spanish (European) | 2,459 |
Chinese | 2,107 |
Portuguese (European) | 948 |
Slovak | 876 |
Bulgarian | 849 |
Norwegian | 770 |
Korean | 561 |
czes (Czech) | 465 |
Estonian | 324 |
Kazakh | 139 |
Azerbaijani | 115 |
Tajik | 52 |
Uzbek | 25 |
Kyrgyz | 24 |
Turkmen | 2 |
DESAM (Czech) | 1 |
¹ To get access to the English ClueWeb collection from 2009, first acquire the (free) license from Carnegie Mellon, then contact us to be granted access.