Skip to content

mOSCAR

mOSCAR is a large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality.

Access

Access to the mOSCAR is granted via the Hugging Face Hub.

All data is available at https://huggingface.co/datasets/oscar-corpus/mOSCAR.

Paper link: https://arxiv.org/abs/2406.08707

Layout

{
    'images': [{'img_idx': '#000002',
                'sha512': '65c1e5605d48f8753256f758bd442cbdd43e6987691227b1ea6b81430ff36609f46d448c8171546232fe0c258d9e44ce4378f32e8ada5c43c314df5a5e230de2',
                'url': 'https://actuconsommation.fr/wp-content/uploads/2020/05/Disneylands-Japon-1068x712.jpg'}],
    'metadata': [{'node_order': 'img_#000002|txt_#000000|txt_#000001|txt_#000002|txt_#000003|txt_#000004|txt_#000005|txt_#000006|txt_#000009',
                  'url': 'https://actuconsommation.fr/2020/05/11/disneyland-une-reouverture-sous-haute-securite-a-shanghai-ce-lundi/'}],
    'text': [{'text': 'Disneyland : une réouverture sous haute sécurité à Shanghai ce lundi', 'text_idx': '#000000'},
             {'text': 'Des milliers de visiteurs ont pu pénétrer lundi dans le Disneyland de Shanghai, le premier des six parcs de [...]', text_idx': '#000001'},
             [...] ]
}

Language table

Lang. name Code Family Script #documents #images # tokens
Acehnese ace_Latn Austronesian Latin 7,803 32,461 2,889,134
Mesopotamian Arabic acm_Arab Afro-Asiatic Arabic 2,274 10,620 1,047,748
Tunisian Arabic aeb_Arab Afro-Asiatic Arabic 7,640 41,570 2,715,187
Afrikaans afr_Latn Indo-European Latin 54,895 247,774 39,956,585
South Levantine Arabic ajp_Arab Afro-Asiatic Arabic 12,098 87,837 5,167,813
Tosk Albanian als_Latn Indo-European Latin 861,678 2,569,164 452,737,251
Amharic amh_Ethi Afro-Asiatic Ge‘ez 39,588 152,646 35,089,019
North Levantine Arabic apc_Arab Afro-Asiatic Arabic 19,904 128,966 9,560,701
Modern Standard Arabic arb_Arab Afro-Asiatic Arabic 3,936,851 15,126,931 3,401,919,964
Najdi Arabic ars_Arab Afro-Asiatic Arabic 60,229 296,741 43,610,873
Moroccan Arabic ary_Arab Afro-Asiatic Arabic 142,386 698,051 204,723,454
Egyptian Arabic arz_Arab Afro-Asiatic Arabic 835,529 4,054,632 653,626,387
Assamese asm_Beng Indo-European Bengali 3,948 9,210 640,390
Asturian ast_Latn Indo-European Latin 165,745 962,723 37,547,944
Awadhi awa_Deva Indo-European Devanagari 29,324 107,483 4,961,635
Central Aymara ayr_Latn Aymaran Latin 27,384 151,889 5,148,970
South Azerbaijani azb_Arab Turkic Arabic 8,274 38,233 5,256,693
North Azerbaijani azj_Latn Turkic Latin 516,021 1,808,060 257,825,849
Bashkir bak_Cyrl Turkic Cyrillic 4,532 17,174 3,038,766
Bambara bam_Latn Manding Latin 7,674 39,190 1,243,332
Balinese ban_Latn Austronesian Latin 1,886 11,266 542,015
Belarusian bel_Cyrl Indo-European Cyrillic 63,309 287,539 72,976,520
Bemba bem_Latn Atlantic–Congo Latin 1,096 7,479 1,340,471
Bengali ben_Beng Indo-European Bengali 270,406 947,035 35,858,814
Bhojpuri bho_Deva Indo-European Devanagari 6,366 28,131 875,463
Banjar bjn_Latn Austronesian Latin 5,427 27,803 1,898,526
Bosnian bos_Latn Indo-European Latin 1,960,599 7,633,049 1,255,000,505
Buginese bug_Latn Austronesian Latin 3,312 18,648 588,678
Bulgarian bul_Cyrl Indo-European Cyrillic 2,591,998 11,670,028 1,760,971,620
Catalan cat_Latn Indo-European Latin 1,153,864 4,736,634 606,447,390
Cebuano ceb_Latn Austronesian Latin 16,990 91,234 10,748,818
Czech ces_Latn Indo-European Latin 3,918,837 13,291,309 2,823,172,996
Central Kurdish ckb_Arab Indo-European Arabic 36,725 136,566 22,322,689
Crimean Tatar crh_Latn Turkic Latin 6,376 24,124 1,742,727
Welsh cym_Latn Indo-European Latin 40,408 165,897 27,748,345
Danish dan_Latn Indo-European Latin 2,076,298 9,559,600 1,238,277,499
German deu_Latn Indo-European Latin 20,662,696 87,976,200 8,544,986,218
Southwestern Dinka dik_Latn Nilo-Saharan Latin 1,712 6,635 1,319,943
Greek ell_Grek Indo-European Greek 4,916,081 15,209,058 2,923,201,041
English eng_Latn Indo-European Latin 52,215,013 207,904,315 33,570,108,782
Esperanto epo_Latn Artificial Latin 25,157 124,996 28,586,195
Estonian est_Latn Uralic Latin 1,040,368 5,217,366 619,215,048
Basque eus_Latn Isolate Latin 849,043 3,445,539 277,145,498
Faroese fao_Latn Indo-European Latin 15,411 60,340 6,691,327
Fijian fij_Latn Austronesian Latin 1,528 8,776 487,388
Finnish fin_Latn Uralic Latin 2,396,033 10,365,333 1,781,044,864
French fra_Latn Indo-European Latin 20,305,739 78,179,601 14,362,579,829
Friulian fur_Latn Indo-European Latin 37,290 256,456 5,949,600
Nigerian Fulfulde fuv_Latn Atlantic-Congo Latin 1,568 7,124 401,852
West Central Oromo gaz_Latn Afro-Asiatic Latin 4,058 11,763 1,786,093
Scottish Gaelic gla_Latn Indo-European Latin 29,710 153,249 14,605,090
Irish gle_Latn Indo-European Latin 68,858 315,132 47,438,400
Galician glg_Latn Indo-European Latin 518,973 2,381,475 217,063,180
Guarani grn_Latn Tupian Latin 490,945 2,416,633 89,921,114
Gujarati guj_Gujr Indo-European Gujarati 23,062 91,320 3,324,866
Haitian Creole hat_Latn Indo-European Latin 257,745 1,570,699 62,847,106
Hausa hau_Latn Afro-Asiatic Latin 25,364 104,934 13,089,932
Hebrew heb_Hebr Afro-Asiatic Hebrew 1,109,591 4,766,483 893,327,320
Hindi hin_Deva Indo-European Devanagari 579,430 1,830,667 122,558,353
Chhattisgarhi hne_Deva Indo-European Devanagari 1,581 7,263 273,174
Croatian hrv_Latn Indo-European Latin 1,719,617 8,425,510 1,010,674,096
Hungarian hun_Latn Uralic Latin 3,534,506 15,390,083 2,831,715,050
Armenian hye_Armn Indo-European Armenian 339,962 1,141,885 205,635,952
Igbo ibo_Latn Atlantic-Congo Latin 11,529 68,049 8,701,070
Ilocano ilo_Latn Austronesian Latin 78,872 523,195 8,116,113
Indonesian ind_Latn Austronesian Latin 7,016,291 17,324,777 3,981,843,468
Icelandic isl_Latn Indo-European Latin 244,676 1,027,465 137,015,973
Italian ita_Latn Indo-European Latin 12,937,153 47,476,971 8,311,790,842
Javanese jav_Latn Austronesian Latin 24,785 135,583 16,908,805
Japanese jpn_Jpan Japonic Kanji 14,415,292 23,893,768 8,923,348,944
Kabyle kab_Latn Afro-Asiatic Latin 18,508 106,730 4,079,553
Kannada kan_Knda Dravidian Kannada 12,978 42,621 1,442,776
Kashmiri kas_Arab Indo-European Arabic 3,109 11,408 5,731,910
Georgian kat_Geor Kartvelian Georgian 354,436 1,304,281 275,223,026
Kazakh kaz_Cyrl Turkic Cyrillic 252,242 732,648 140,049,214
Halh Mongolian khk_Cyrl Mongolic Cyrillic 124,412 508,217 84,535,241
Khmer khm_Khmr Austroasiatic Kher 24,495 122,243 3,043,925
Kinyarwanda kin_Latn Atlantic-Congo Latin 30,401 172,201 12,049,616
Kyrgyz kir_Cyrl Uralic Cyrillic 53,010 199,713 34,404,281
Northern Kurdish kmr_Latn Indo-European Latin 39,262 164,666 23,834,960
Korean kor_Hang Koreanic Hanja 2,614,089 13,563,283 2,006,080,705
Lao lao_Laoo Kra-Dai Lao 50,611 208,768 31,029,380
Ligurian lij_Latn Indo-European Latin 8,751 56,266 2,958,179
Limburgish lim_Latn Indo-European Latin 189,547 1,076,047 42,534,327
Lingala lin_Latn Atlantic-Congo Latin 24,614 152,132 4,053,459
Lithuanian lit_Latn Indo-European Latin 1,688,811 8,869,443 1,161,476,040
Lombard lmo_Latn Indo-European Latin 30,506 151,855 9,058,614
Latgalian ltg_Latn Indo-European Latin 11,948 61,624 4,148,492
Luxembourgish ltz_Latn Indo-European Latin 44,987 246,346 16,676,872
Ganda lug_Latn Afro-Asiatic Latin 1,878 7,215 789,917
Mizo lus_Latn Sino-Tibetan Latin 7,880 26,817 4,978,472
Standard Latvian lvs_Latn Indo-European Latin 896,243 4,141,648 587,653,855
Magahi mag_Deva Indo-European Devanagari 1,097 3,847 205,763
Malayalam mal_Mlym Dravidian Malayalam 14,140 52,679 1,689,010
Marathi mar_Deva Indo-European Devanagari 50,391 163,868 6,689,250
Minangkabau min_Latn Austronesian Latin 9,341 35,309 1,256,931
Macedonian mkd_Cyrl Indo-European Cyrillic 542,250 1,853,070 307,232,151
Maltese mlt_Latn Afro-Asiatic Latin 120,888 709,242 36,097,957
Maori mri_Latn Austronesian Latin 24,322 130,137 24,957,914
Burmese mya_Mymr Sino-Tibetan Mon 8,144 44,188 539,527
Dutch nld_Latn Indo-European Latin 17,096,727 65,606,013 9,670,041,731
Norwegian Nynorsk nno_Latn Indo-European Latin 199,355 1,012,313 67,799,774
Norwegian Bokmål nob_Latn Indo-European Latin 2,229,702 9,698,128 1,294,178,095
Nepali npi_Deva Indo-European Devanagari 31,239 127,193 3,138,539
Nyanja nya_Latn Atlantic-Congo Latin 12,047 67,192 8,596,769
Occitan oci_Latn Indo-European Latin 164,852 671,881 59,309,549
Odia ory_Orya Indo-European Odia 4,319 15,574 378,635
Pangasinan pag_Latn Austronesian Latin 4,214 32,287 546,071
Eastern Panjabi pan_Guru Indo-European Gurmukhi 11,497 46,168 1,887,991
Papiamento pap_Latn Indo-European Latin 55,224 363,015 10,002,655
Southern Pashto pbt_Arab Indo-European Arabic 32,604 110,807 29,170,322
Western Persian pes_Arab Indo-European Arabic 7,048,946 25,200,571 6,210,479,015
Plateau Malgasy plt_Latn Austronesian Latin 32,521 120,673 29,263,848
Polish pol_Latn Indo-European Latin 14,549,605 60,639,244 11,104,144,109
Portuguese por_Latn Indo-European Latin 8,145,664 26,530,423 4,760,063,083
Dari prs_Arab Indo-European Arabic 515,041 2,589,859 517,053,967
Ayacucho Quechua quy_Latn Quechuan Latin 1,578 11,817 362,690
Romanian ron_Latn Indo-European Latin 5,180,171 17,964,048 3,548,291,261
Rundi run_Latn Atlantic-Congo Latin 20,001 67,096 8,686,054
Russian rus_Cyrl Indo-European Cyrillic 15,913,845 69,542,828 18,909,213,208
Sango sag_Latn Atlantic-Congo Latin 2,124 13,556 454,455
Sicilian scn_Latn Indo-European Latin 73,199 424,362 27,110,743
Sinhala sin_Sinh Indo-European Sinhalese 58,767 221,183 14,270,972
Slovak slk_Latn Indo-European Latin 3,008,599 15,067,234 1,963,804,563
Slovenian slv_Latn Indo-European Latin 1,472,025 7,210,285 935,834,754
Samoan smo_Latn Austronesian Latin 12,346 71,359 14,954,824
Shona sna_Latn Atlantic-Congo Latin 12,698 68,782 6,112,600
Sindhi snd_Arab Indo-European Arabic 21,095 74,289 17,647,825
Somali som_Latn Afro-Asiatic Latin 77,343 301,429 34,554,975
Southern Sotho sot_Latn Atlantic-Congo Latin 7,718 43,146 6,156,450
Spanish spa_Latn Indo-European Latin 22,713,366 78,361,087 14,616,773,475
Sardinian srd_Latn Indo-European Latin 675,539 4,059,493 106,159,957
Serbian srp_Cyrl Indo-European Cyrillic 604,557 2,286,171 401,223,741
Sundanese sun_Latn Austronesian Latin 44,310 236,025 13,627,832
Swedish swe_Latn Indo-European Latin 3,302,730 10,860,518 1,779,284,152
Swahili swh_Latn Atlantic-Congo Latin 137,134 593,418 59,454,896
Silesian szl_Latn Indo-European Latin 23,535 132,459 5,996,972
Tamil tam_Taml Dravidian Tamil 36,196 167,669 4,834,946
Tatar tat_Cyrl Turkic Cyrillic 37,188 143,842 22,831,350
Telugu tel_Telu Dravidian Telugu 22,974 81,033 2,273,772
Tajik tgk_Cyrl Turkic Cyrillic 125,236 417,591 90,503,778
Tagalog tgl_Latn Austronesian Latin 151,437 673,814 97,708,639
Thai tha_Thai Kra-Dai Thai 2,983,837 11,621,786 2,839,211,104
Tigrinya tir_Ethi Afro-Asiatic Ge‘ez 2,657 8,707 1,725,422
Tok Pisin tpi_Latn Indo-European Latin 5,063 35,169 460,853
Turkmen tuk_Latn Turkic Latin 13,024 57,354 9,766,999
Turkish tur_Latn Turkic Latin 4,478,700 12,401,091 2,394,669,068
Twi twi_Latn Atlantic-Congo Latin 3,305 13,634 495,220
Uyghur uig_Arab Turkic Arabic 10,713 41,709 6,785,318
Ukrainian ukr_Cyrl Indo-European Cyrillic 2,721,424 10,929,796 1,928,351,595
Urdu urd_Arab Indo-European Arabic 407,098 1,239,125 242,007,283
Northern Uzbek uzn_Latn Turkic Latin 156,632 798,155 89,022,562
Venetian vec_Latn Indo-European Latin 330,611 1,830,777 71,077,531
Vietnamese vie_Latn Viet-Muong Latin 12,621,521 47,411,488 11,616,191,199
Wolof wol_Latn Atlantic-Congo Latin 4,658 20,380 1,596,432
Xhosa xho_Latn Atlantic-Congo Latin 25,950 142,387 15,809,823
Eastern Yiddish ydd_Hebr Indo-European Hebrew 12,486 57,510 17,369,727
Yoruba yor_Latn Atlantic-Congo Latin 56,700 286,933 32,614,558
Yue Chinese yue_Hant Sino-Tibetan Hant 33,671 203,513 24,172,441
Chinese (Simplified) zho_Hans Sino-Tibetan Hanzi 9,861,262 36,152,754 8,078,842,701
Chinese (Traditional) zho_Hant Sino-Tibetan Hant 3,967,966 16,307,258 2,962,854,441
Standard Malay zsm_Latn Austronesian Latin 1,179,744 5,488,632 432,667,199
Zulu zul_Latn Atlantic-Congo Latin 30,717 156,639 11,345,288

License

These data are released under this licensing scheme:

  • We do not own any of the text from which these data has been extracted.
  • We license the actual packaging of these data under the Creative Commons CC BY 4.0 license.
  • To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR.
  • This work is published from: France.

Please also refer to Common Crawl's Terms of Use

Citation

@article{futeral2024moscar,
  title={mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus},
  author={Futeral, Matthieu and Zebaze, Armel and Suarez, Pedro Ortiz and Abadji, Julien and Lacroix, R{\'e}mi and Schmid, Cordelia and Bawden, Rachel and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:2406.08707},
  year={2024}
}