OSCAR 2019
OSCAR 2019 is the original 2019 release of the OSCAR corpus. It was generated from the Common Crawl corpus using the goclassy architecture.
Features
OSCAR 2019 is shuffled at the line level and comes with no metadata, so it is mainly intended for training unsupervised language models for NLP.
Data is distributed by language in both original and deduplicated form.
If you need the unshuffled version of OSCAR, please contact us using the contact form. Please include your name, affiliation, contact details, the languages you need and a brief description of how you intend to use OSCAR. You can also download it using Hugging Face’s datasets library.
Even though OSCAR is not Postcardware, we do appreciate it when our users send us a postcard. If you want to send us one, you can find the address in the contact section below.
Citing OSCAR
If you use OSCAR to train a language model, a text generation model or any other ML model, please consider citing our latest paper:
@inproceedings{ortiz-suarez-etal-2020-monolingual,
title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
author = "Ortiz Su{\'a}rez, Pedro Javier and
Romary, Laurent and
Sagot, Beno{\^\i}t",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.156",
pages = "1703--1714",
abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}
The Unshuffled OSCAR
If you need a copy of any of the unshuffled sub-corpora, please contact us using the contact form below. Please include your name, affiliation, contact details, the languages you need and a brief description of how you intend to use OSCAR. We will evaluate your request and answer accordingly.
{{% callout note %}} The unshuffled OSCAR is now available in Hugging Face’s datasets library {{% /callout %}} Hugging Face has obtained our permission to redistribute the unshuffled OSCAR, and its library lets users download a corpus all at once rather than file by file. For more information about downloading OSCAR with the library, see OSCAR's dataset card.
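As a minimal sketch, one unshuffled sub-corpus could be loaded through the datasets library roughly as follows. It assumes the corpus is published on the Hugging Face Hub under the identifier `oscar`, with configuration names of the form `unshuffled_deduplicated_<code>` and `unshuffled_original_<code>`, and that each record exposes a `text` field; please check the dataset card for the current names. The helper names `oscar_config_name` and `load_oscar` are hypothetical, and recent releases of the library may no longer support script-based datasets, in which case an older version may be needed.

```python
def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build the configuration name for one language (assumed naming scheme)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = True, streaming: bool = True):
    """Load one sub-corpus; streaming avoids downloading every file up front."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "oscar",  # assumed Hub identifier, see the dataset card
        oscar_config_name(lang, deduplicated),
        split="train",
        streaming=streaming,
    )
```

Calling `load_oscar("af")` and iterating over the result should then yield one record at a time, with the document text under `record["text"]` if the assumptions above hold.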
Downloading OSCAR
All the data is distributed by language, and both the original and the deduplicated versions are available. To download a file, click the desired link in the table below. Each language is split into standalone shards of around 700MB. A plain text file with checksums is also provided.
The OSCAR corpus has not yet been filtered, so please be careful when using it, especially for text generation tasks! To see which sub-corpora have been audited, please refer to the list of publications above.
You will be asked to create a HumanID account in order to download a corpus. This is intentional: it lets us limit traffic and reduce abuse of the infrastructure. The OSCAR corpus is hosted by Huma-Num; you can read more about them on their website.
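Downloaded shards can be checked against the checksum file with a short script. The sketch below is an illustration under stated assumptions: it assumes the checksums are SHA-256 digests stored in the usual `sha256sum` layout (`<hex digest>  <file name>` per line) and that the checksum file sits in the same folder as the shards. The function names are hypothetical; adjust the hash algorithm if the provided file uses a different one.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a ~700MB shard never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_shards(checksum_file: Path) -> dict[str, bool]:
    """Compare each listed file with its expected digest (assumed SHA-256)."""
    results = {}
    for line in checksum_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # sha256sum marks binary mode with '*'
        shard = checksum_file.parent / name
        # A missing shard counts as a failed check rather than raising.
        results[name] = shard.is_file() and sha256_of(shard) == expected.lower()
    return results
```

`verify_shards` returns a mapping from file name to `True` or `False`, so any shard that is missing or corrupted can be downloaded again on its own, since the shards are standalone.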
All sizes are for the uncompressed files.
Language | Words original | Size original | File original | Words deduplicated | Size deduplicated | File deduplicated |
---|---|---|---|---|---|---|
Afrikaans | 43,482,801 | 241M | af | 29,533,437 | 163M | af |
Albanian | 374,196,110 | 2.3G | sq | 186,856,699 | 1.2G | sq |
Alemannic | 841,750 | 5.0M | als | 459,001 | 2.8M | als |
Amharic | 28,301,601 | 360M | am | 16,086,628 | 206M | am |
Arabic | 8,117,162,828 | 82G | ar | 3,171,221,354 | 32G | ar |
Aragonese | 52,896 | 1.3M | an | 45,669 | 801K | an |
Armenian | 273,919,388 | 3.7G | hy | 110,196,043 | 1.5G | hy |
Assamese | 6,956,663 | 113M | as | 4,366,570 | 71M | as |
Asturian | 381,005 | 2.4M | ast | 325,237 | 2.0M | ast |
Avaric | 24,720 | 409K | av | 19,478 | 324K | av |
Azerbaijani | 322,641,710 | 2.8G | az | 167,742,296 | 1.5G | az |
Bashkir | 9,796,764 | 128M | ba | 6,922,589 | 90M | ba |
Basque | 120,456,652 | 848M | eu | 45,359,710 | 342M | eu |
Bavarian | 399 | 503 | bar | 399 | 503 | bar |
Belarusian | 144,579,630 | 1.8G | be | 83,499,037 | 1.1G | be |
Bengali | 623,575,733 | 11G | bn | 363,766,143 | 5.8G | bn |
Bihari | 8,848 | 110K | bh | 2,875 | 34K | bh |
Bishnupriya | 198,286 | 4.1M | bpy | 96,940 | 1.7M | bpy |
Bosnian | 106,448 | 447K | bs | 20,485 | 116K | bs |
Breton | 5,013,241 | 29M | br | 2,890,384 | 16M | br |
Bulgarian | 2,947,648,106 | 32G | bg | 1,268,114,977 | 14G | bg |
Burmese | 56,111,184 | 1.9G | my | 30,102,173 | 1.1G | my |
Catalan | 1,360,212,450 | 8.0G | ca | 729,333,440 | 4.3G | ca |
Cebuano | 6,603,567 | 39M | ceb | 3,675,024 | 24M | ceb |
Central Bikol | 312 | 885 | bcl | 312 | 885 | bcl |
Central Khmer | 20,690,610 | 1.1G | km | 10,082,245 | 581M | km |
Central Kurdish | 48,478,334 | 487M | ckb | 18,726,721 | 226M | ckb |
Chavacano | 130 | 520 | cbk | 130 | 520 | cbk |
Chechen | 711,051 | 8.3M | ce | 568,146 | 6.7M | ce |
Chinese | 14,986,424,850 | 508G | zh | 6,350,215,113 | 249G | zh |
Chuvash | 3,041,614 | 39M | cv | 2,054,810 | 26M | cv |
Cornish | 8,329 | 44K | kw | 2,704 | 14K | kw |
Croatian | 34,232,765 | 226M | hr | 16,727,640 | 110M | hr |
Czech | 7,715,977,441 | 53G | cs | 3,540,997,509 | 24G | cs |
Danish | 2,637,463,889 | 16G | da | 1,620,091,317 | 9.5G | da |
Dhivehi | 7,559,472 | 126M | dv | 4,726,660 | 79M | dv |
Dimli | 19 | 146 | diq | 19 | 146 | diq |
Dutch | 13,020,136,373 | 78G | nl | 6,598,786,137 | 39G | nl |
Eastern Mari | 565,992 | 7.2M | mhr | 469,297 | 6.0M | mhr |
Egyptian Arabic | 7,305,151 | 66M | arz | 3,659,419 | 33M | arz |
Emilian-Romagnol | 6,376 | 25K | eml | 6,121 | 24K | eml |
English | 418,187,793,408 | 2.3T | en | 215,841,256,971 | 1.2T | en |
Erzya | 90 | 1.4K | myv | 78 | 1.2K | myv |
Esperanto | 48,486,161 | 299M | eo | 37,324,446 | 228M | eo |
Estonian | 643,163,730 | 4.8G | et | 309,931,463 | 2.3G | et |
Finnish | 3,196,666,419 | 27G | fi | 1,597,855,468 | 13G | fi |
French | 46,896,036,417 | 282G | fr | 23,206,776,649 | 138G | fr |
Galician | 102,011,291 | 620M | gl | 63,600,602 | 384M | gl |
Georgian | 171,950,621 | 3.6G | ka | 91,569,739 | 1.9G | ka |
German | 44,878,908,446 | 308G | de | 21,529,164,172 | 145G | de |
Goan Konkani | 124,277 | 2.2M | gom | 102,306 | 1.8M | gom |
Guarani | 7,382 | 36K | gn | 4,680 | 24K | gn |
Gujarati | 72,045,701 | 1.1G | gu | 50,023,432 | 722M | gu |
Haitian | 1,014 | 3.9K | ht | 832 | 3.3K | ht |
Hebrew | 2,067,753,528 | 20G | he | 1,032,018,056 | 9.8G | he |
Hindi | 1,372,234,782 | 17G | hi | 745,774,934 | 8.9G | hi |
Hungarian | 5,163,936,345 | 40G | hu | 2,339,127,555 | 18G | hu |
Icelandic | 219,900,094 | 1.5G | is | 129,818,331 | 846M | is |
Ido | 25,702 | 147K | io | 22,773 | 130K | io |
Iloko | 142,942 | 874K | ilo | 105,564 | 636K | ilo |
Indonesian | 4,574,692,265 | 30G | id | 2,394,957,629 | 16G | id |
Interlingua | 180,231 | 662K | ia | 100,019 | 360K | ia |
Interlingue | 5,352 | 24K | ie | 602 | 1.6K | ie |
Irish | 14,483,593 | 88M | ga | 10,017,303 | 60M | ga |
Italian | 22,248,707,341 | 137G | it | 11,250,012,896 | 69G | it |
Japanese | 4,962,979,182 | 216G | ja | 1,123,067,063 | 106G | ja |
Javanese | 104,896 | 659K | jv | 86,654 | 583K | jv |
Kalmyk | 10,277 | 113K | xal | 10,155 | 112K | xal |
Kannada | 81,186,863 | 1.7G | kn | 49,343,462 | 1.1G | kn |
Karachay-Balkar | 185,436 | 2.6M | krc | 166,496 | 2.3M | krc |
Kazakh | 191,126,469 | 2.7G | kk | 108,388,743 | 1.5G | kk |
Kirghiz | 44,194,823 | 600M | ky | 28,982,620 | 388M | ky |
Komi | 201,404 | 2.3M | kv | 95,243 | 1.2M | kv |
Korean | 2,368,765,142 | 24G | ko | 1,120,375,149 | 12G | ko |
Kurdish | 15,561,003 | 94M | ku | 9,946,440 | 60M | ku |
Lao | 4,133,311 | 174M | lo | 2,583,342 | 114M | lo |
Latin | 4,122,201 | 26M | la | 1,328,038 | 8.3M | la |
Latvian | 520,761,977 | 4.0G | lv | 236,428,905 | 1.8G | lv |
Lezghian | 247,646 | 3.3M | lez | 224,871 | 3.0M | lez |
Limburgan | 4,730 | 29K | li | 4,283 | 27K | li |
Lithuanian | 1,159,661,742 | 8.8G | lt | 516,183,525 | 3.9G | lt |
Lojban | 154,330 | 736K | jbo | 141,973 | 678K | jbo |
Lombard | 75,229 | 443K | lmo | 73,665 | 433K | lmo |
Low German | 2,906,347 | 18M | nds | 2,146,417 | 13M | nds |
Lower Sorbian | 1,787 | 13K | dsb | 966 | 7.1K | dsb |
Luxembourgish | 4,403,577 | 29M | lb | 3,087,650 | 21M | lb |
Macedonian | 189,289,873 | 2.1G | mk | 102,849,595 | 1.2G | mk |
Maithili | 69,161 | 317K | mai | 874 | 11K | mai |
Malagasy | 3,068,360 | 21M | mg | 1,872,044 | 13M | mg |
Malay | 16,696,882 | 111M | ms | 6,045,753 | 42M | ms |
Malayalam | 189,534,472 | 4.9G | ml | 95,892,551 | 2.5G | ml |
Maltese | 2,995,654 | 24M | mt | 2,163,358 | 17M | mt |
Marathi | 162,609,404 | 2.7G | mr | 82,130,803 | 1.4G | mr |
Mazanderani | 73,870 | 691K | mzn | 64,481 | 602K | mzn |
Minangkabau | 5,682 | 608K | min | 4,825 | 310K | min |
Mingrelian | 299,098 | 5.8M | xmf | 228,629 | 4.4M | xmf |
Mirandese | 171 | 1.2K | mwl | 152 | 1.1K | mwl |
Modern Greek | 5,479,180,137 | 62G | el | 2,412,419,435 | 27G | el |
Mongolian | 181,307,167 | 2.2G | mn | 68,362,013 | 838M | mn |
Nahuatl languages | 1,234 | 12K | nah | 1,193 | 11K | nah |
Neapolitan | 5,282 | 17K | nap | 4,147 | 13K | nap |
Nepali | 107,448,208 | 1.8G | ne | 71,628,317 | 1.2G | ne |
Newari | 564,697 | 5.5M | new | 288,995 | 4.1M | new |
Northern Frisian | 1,516 | 4.4K | frr | 1,516 | 4.4K | frr |
Northern Luri | 8,022 | 76K | lrc | 6,740 | 63K | lrc |
Norwegian | 1,344,326,388 | 8.0G | no | 804,894,377 | 4.7G | no |
Norwegian Nynorsk | 14,764,980 | 85M | nn | 9,435,139 | 54M | nn |
Occitan | 750,301 | 5.8M | oc | 512,678 | 3.7M | oc |
Oriya | 14,938,567 | 248M | or | 11,321,740 | 188M | or |
Ossetian | 1,031,268 | 13M | os | 878,765 | 11M | os |
Pampanga | 130 | 760 | pam | 52 | 304 | pam |
Panjabi | 61,847,806 | 763M | pa | 37,555,835 | 460M | pa |
Persian | 9,096,554,121 | 79G | fa | 4,363,505,319 | 38G | fa |
Piemontese | 362,013 | 2.1M | pms | 337,246 | 1.9M | pms |
Polish | 15,277,255,137 | 109G | pl | 6,708,709,674 | 47G | pl |
Portuguese | 20,641,903,898 | 124G | pt | 10,751,156,918 | 64G | pt |
Pushto | 46,559,441 | 361M | ps | 31,347,348 | 242M | ps |
Quechua | 10,186 | 78K | qu | 8,691 | 67K | qu |
Romanian | 3,984,317,058 | 25G | ro | 1,741,794,069 | 11G | ro |
Romansh | 1,093 | 7.4K | rm | 960 | 6.5K | rm |
Russia Buriat | 963 | 13K | bxr | 809 | 11K | bxr |
Russian | 92,522,407,837 | 1.2T | ru | 46,692,691,520 | 568G | ru |
Sanskrit | 4,331,569 | 93M | sa | 1,713,930 | 37M | sa |
Scottish Gaelic | 310,689 | 1.9M | gd | 207,110 | 1.3M | gd |
Serbian | 364,395,411 | 3.9G | sr | 207,561,168 | 2.2G | sr |
Serbo-Croatian | 5,292,184 | 25M | sh | 1,040,573 | 5.8M | sh |
Sicilian | 554 | 3.3K | scn | 468 | 2.8K | scn |
Sindhi | 43,530,158 | 347M | sd | 33,028,015 | 263M | sd |
Sinhala | 93,053,465 | 1.4G | si | 50,864,857 | 802M | si |
Slovak | 1,322,247,763 | 9.1G | sk | 656,346,179 | 4.5G | sk |
Slovenian | 387,399,700 | 2.5G | sl | 193,926,684 | 1.3G | sl |
Somali | 1,202 | 61K | so | 472 | 16K | so |
South Azerbaijani | 2,175,054 | 27M | azb | 1,528,709 | 19M | azb |
Spanish | 47,545,122,279 | 278G | es | 25,928,290,729 | 149G | es |
Sundanese | 30,321 | 211K | su | 20,278 | 141K | su |
Swahili | 2,211,927 | 13M | sw | 1,376,963 | 8.1M | sw |
Swedish | 7,155,994,312 | 44G | sv | 4,106,120,608 | 25G | sv |
Tagalog | 98,949,299 | 573M | tl | 70,121,601 | 407M | tl |
Tajik | 31,758,142 | 379M | tg | 21,029,893 | 249M | tg |
Tamil | 420,537,132 | 9.3G | ta | 226,013,330 | 5.1G | ta |
Tatar | 51,034,893 | 670M | tt | 23,825,695 | 305M | tt |
Telugu | 123,711,517 | 2.5G | te | 79,094,167 | 1.6G | te |
Thai | 951,743,087 | 36G | th | 368,965,202 | 16G | th |
Tibetan | 1,483,589 | 187M | bo | 936,556 | 138M | bo |
Turkish | 7,577,388,700 | 60G | tr | 3,365,734,289 | 27G | tr |
Turkmen | 1,113,869 | 11M | tk | 752,326 | 6.8M | tk |
Tuvinian | 759 | 12K | tyv | 540 | 7.9K | tyv |
Uighur | 8,657,141 | 122M | ug | 5,852,225 | 83M | ug |
Ukrainian | 4,204,381,276 | 53G | uk | 2,252,380,351 | 28G | uk |
Upper Sorbian | 545,351 | 4.2M | hsb | 236,867 | 1.8M | hsb |
Urdu | 331,817,982 | 2.7G | ur | 218,030,228 | 1.7G | ur |
Uzbek | 2,450,256 | 21M | uz | 1,381,644 | 12M | uz |
Venetian | 3,492 | 18K | vec | 3,199 | 17K | vec |
Vietnamese | 12,036,845,359 | 68G | vi | 5,577,159,843 | 32G | vi |
Volapük | 321,121 | 2.0M | vo | 318,568 | 2.0M | vo |
Walloon | 50,720 | 273K | wa | 37,543 | 203K | wa |
Waray | 397,315 | 2.5M | war | 336,311 | 2.2M | war |
Welsh | 37,422,441 | 213M | cy | 23,574,673 | 133M | cy |
Western Frisian | 5,691,077 | 35M | fy | 4,223,816 | 26M | fy |
Western Mari | 93,338 | 1.2M | mrj | 87,780 | 1.1M | mrj |
Western Panjabi | 1,426,986 | 12M | pnb | 1,111,112 | 9.0M | pnb |
Wu Chinese | 11,189 | 109K | wuu | 4,333 | 32K | wuu |
Yakut | 2,547,623 | 42M | sah | 1,789,174 | 26M | sah |
Yiddish | 13,834,320 | 141M | yi | 8,212,970 | 84M | yi |
Yoruba | 8,906 | 55K | yo | 3,518 | 27K | yo |
Yue Chinese | 186 | 3.7K | yue | 128 | 2.2K | yue |
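
The table above can also be read programmatically, for example to see what share of a language's words survives deduplication. The following is a small sketch that assumes rows are copied verbatim from the table in the pipe-separated layout shown here; `parse_row` and `retained_fraction` are hypothetical helper names.

```python
def parse_row(row: str) -> dict:
    """Split one '|'-separated table row into named fields."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    language, words_orig, size_orig, code, words_dedup, size_dedup, _ = cells
    return {
        "language": language,
        "code": code,
        "words_original": int(words_orig.replace(",", "")),
        "words_deduplicated": int(words_dedup.replace(",", "")),
        "size_original": size_orig,  # kept as text, e.g. "241M"
        "size_deduplicated": size_dedup,
    }


def retained_fraction(row: str) -> float:
    """Fraction of the original words still present after deduplication."""
    fields = parse_row(row)
    return fields["words_deduplicated"] / fields["words_original"]


row = "Afrikaans | 43,482,801 | 241M | af | 29,533,437 | 163M | af |"
print(f"{parse_row(row)['language']}: {retained_fraction(row):.1%} of words kept")
```

A fraction of exactly 1.0, as for the smallest sub-corpora, means deduplication removed nothing.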
License
These data are released under this licensing scheme:
- We do not own any of the text from which these data have been extracted.
- We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved").
- To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR.
- This work is published from: France.
Notice and take down policy
Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
- And use the contact form below.
Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Models
Here is a list of some language models that have been trained using the OSCAR corpus or that are part of the OSCAR project:
Featured Models
Here is a list of language models trained by the community:
Model | Language | Cased | Corpus | Authors | Paper | Website | Files | License |
---|---|---|---|---|---|---|---|---|
AraBERT | Arabic | Cased | OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir | Wissam Antoun, Fady Baly and Hazem Hajj | ACL Anthology | GitHub | Hugging Face | N/A |
Arabic-BERT | Arabic | Cased | OSCAR and Wikipedia | Ali Safaya | arXiv | GitHub | Hugging Face | MIT |
AraELECTRA | Arabic | Cased | OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir | Wissam Antoun, Fady Baly and Hazem Hajj | arXiv | GitHub | Hugging Face | N/A |
AraGPT2 | Arabic | Cased | OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir | Wissam Antoun, Fady Baly and Hazem Hajj | arXiv | GitHub | Hugging Face | N/A |
CamemBERT | French | Cased | OSCAR | Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot | ACL 2020 | camembert-model.fr | camembert-base.tar.gz | MIT |
CamemBERT | French | Cased | Subsample of OSCAR (4 GB of text) | Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot | ACL 2020 | camembert-model.fr | camembert-base-oscar-4gb.tar.gz | MIT |
LePetit | French | Cased | Subsample of OSCAR (2 GB of text) | Vincent Micheli, Martin d'Hoffschmidt, Quentin Heinrich | Medium blog | illuin.tech | Hugging Face | MIT |
GigaBERT | Arabic | Cased and Uncased | OSCAR, Wikipedia, Gigaword | Wuwei Lan, Yang Chen, Wei Xu, Alan Ritter | EMNLP 2020 | GitHub | Hugging Face | MIT |
ELECTRA | Norwegian | Cased | OSCAR and OPUS | Viktor Alm | N/A | Hugging Face | Hugging Face | N/A |
BERT | Romanian | Cased | OSCAR, Wikipedia and OPUS | Dumitrescu Stefan and Andrei Avram | SOON | GitHub | Hugging Face | MIT |
BERT | Romanian | Uncased | OSCAR, Wikipedia and OPUS | Dumitrescu Stefan and Andrei Avram | SOON | GitHub | Hugging Face | MIT |
RoBERTa | Sinhala | N/A | OSCAR | Keshan Sodimana | N/A | Hugging Face | Hugging Face | N/A |
BERT | Turkish | Cased and Uncased | OSCAR, Wikipedia and OPUS | Stefan Schweter | Zenodo | GitHub | Hugging Face | MIT |
ELECTRA | Turkish | Cased | OSCAR, Wikipedia and OPUS | Stefan Schweter | Zenodo | GitHub | Hugging Face | MIT |
XLMIndic | Hindi, Bengali, Gujarati, Panjabi, Marathi, Oriya, Assamese, Sinhala, Nepali, Bihari, Bishnupriya, Maithili, Goan Konkani, Sanskrit | Cased | OSCAR | Ibraheem Muhammad Moosa, Mahmud Shimul and Ashfia Binte Habib | arXiv | GitHub | Hugging Face | MIT |
If you have trained a model using the OSCAR corpus and would like to have it featured here, please open a pull request in our GitHub repo. Help us grow the community!