| Corpus (language) | Size (millions of tokens) |
|---|---|
| NEW: English (ClueWeb09)¹ | 82,581 |
| Russian | 20,162 |
| English | 12,968 |
| French | 12,369 |
| Japanese | 11,113 |
| Polish | 9,567 |
| Spanish (American) | 8,719 |
| Arabic | 6,637 |
| Czech | 5,818 |
| Turkish | 4,125 |
| Hungarian | 3,184 |
| Italian | 3,077 |
| German | 2,844 |
| Spanish (European) | 2,459 |
| Chinese | 2,107 |
| Portuguese (European) | 948 |
| Slovak | 876 |
| Bulgarian | 849 |
| Norwegian | 770 |
| Korean | 561 |
| czes (Czech) | 465 |
| Estonian | 324 |
| Kazakh | 139 |
| Azerbaijani | 115 |
| Tajik | 52 |
| Uzbek | 25 |
| Kyrgyz | 24 |
| Turkmen | 2 |
| DESAM (Czech) | 1 |

¹ To access the English ClueWeb collection from 2009, first acquire the (free) license from Carnegie Mellon, then contact us to be granted access.