OSCAR 22.01
OSCAR 22.01 is the January 2022 release of OSCAR, built from the November/December 2021 dump of Common Crawl. It uses a different file layout, which makes it incompatible with code written for previous OSCAR versions.
Summary
OSCAR 22.01 is document-oriented: rather than extracting lines and sorting them into language subcorpora, we identify the language of each document as a whole. The main consequences are that the sentences of a document stay contiguous and should make sense one after another, but individual sentences are not guaranteed to be in the subcorpus' language.
Note
As an example, the English Wikipedia page about La Marseillaise contains sentences in French (the anthem's lyrics). In line-oriented corpora, these sentences would have been put in the French subcorpus. In OSCAR 22.01, they stay with the rest of the article, in a document classified as English.
Layout
As in previous corpora, there is one subcorpus per language, plus one new subcorpus for multilingual documents. Subcorpora are distributed as JSONLines files, split into 1 GB chunks, then gzipped.
Note
Splits are completely independent and self-contained: it is possible to download only `en_meta_134.jsonl.gz` and process it on its own.
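Because each split is self-contained, a single file can be streamed without the rest of the subcorpus. The following is a minimal sketch using only the Python standard library, assuming the split has already been downloaded locally. `iter_documents` is a hypothetical helper name, not part of any OSCAR tooling.

```python
import gzip
import json
from typing import Iterator


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one document (a dict) per non-empty line of a gzipped JSONLines split."""
    # Mode "rt" decompresses on the fly and decodes to text, so the split
    # is read line by line instead of being loaded into memory at once.
    with gzip.open(path, "rt", encoding="utf-8") as split:
        for line in split:
            if line.strip():
                yield json.loads(line)
```

Iterating over `iter_documents("en_meta_134.jsonl.gz")` then gives access to fields such as `doc["warc_headers"]["warc-target-uri"]` for each document in turn.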
Example document
{
"content":"newline\nseparaaaaaaaaaaated\ncontent", // (1)
"warc_headers":{ // (2)
"warc-refers-to":"<urn:uuid:83f2e1d4-5ed3-41db-86ff-f7826c4c20f9>",
"warc-date":"2021-09-16T11:07:14Z",
"warc-block-digest":"sha1:X3OWP47FG2O5LBNMFSNB44FJF2SSRC26",
"warc-type":"conversion",
"warc-identified-content-language":"eng",
"content-length":"1694",
"warc-target-uri":"https://foo.bar",
"warc-record-id":"<urn:uuid:3304bc27-17d0-4ffd-a692-340381478a5f>",
"content-type":"text/plain"
},
"metadata":{
// (3)
"identification":{
"label":"en",
"prob":0.6268374
},
// (4)
"annotation":[
"short_sentences",
"footer"
],
// (5)
"sentence_identifications":[
{
"label":"en",
"prob":0.93925816
},
null,
{
"label":"en",
"prob":0.9606543
}
]
}
}
- (1) Content. Lines are separated by `\n`.
- (2) Headers from the crawler. Note that nothing is changed, so the content length may be incorrect.
- (3) Document-wide identification. `prob` is the weighted average of the confidence of identified lines.
- (4) Annotations of the document. `null` if no annotation.
- (5) Line-by-line identifications. `null` for each line that has no identification.
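To make the relationship between the content and the line-by-line identifications concrete, here is a small sketch that pairs each content line with its identification. It assumes the two sequences are aligned one-to-one by position, which is how the example document reads. `iter_identified_lines` is a hypothetical helper name.

```python
from typing import Iterator, Optional, Tuple


def iter_identified_lines(doc: dict) -> Iterator[Tuple[str, Optional[dict]]]:
    """Pair each content line with its line-level identification (or None)."""
    lines = doc["content"].split("\n")
    identifications = doc["metadata"]["sentence_identifications"]
    # Assumption: entry i of sentence_identifications describes line i of content.
    yield from zip(lines, identifications)


# Trimmed-down copy of the example document, keeping only the fields used here.
example = {
    "content": "newline\nseparaaaaaaaaaaated\ncontent",
    "metadata": {
        "sentence_identifications": [
            {"label": "en", "prob": 0.93925816},
            None,
            {"label": "en", "prob": 0.9606543},
        ]
    },
}

for line, ident in iter_identified_lines(example):
    label = ident["label"] if ident is not None else "unidentified"
    print(f"{label}\t{line}")
```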
Annotations
- `tiny`: The document has a low (<5) number of lines.
- `short_sentences`: The document has a high number (>50%) of short lines (<400 bytes).
- `header`: The document has a high number of short lines at its head, suggesting the presence of low quality content.
- `footer`: The document has a high number of short lines at its tail, suggesting the presence of low quality content.
- `noisy`: The document has a high percentage of punctuation (>50%).
- `adult`: The document contains adult content. This annotation uses a blocklist and labels a tiny part of the corpus: it does not catch most of the adult content.
More information about the thresholds and annotators is available in our paper.
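As an illustration of the two fully specified thresholds, the sketch below re-derives `tiny` and `short_sentences` from a document's content. It is not the pipeline's actual annotator: the exact comparison operators, the use of UTF-8 byte length, and the function name `sketch_annotations` are assumptions made for this example.

```python
from typing import List


def sketch_annotations(content: str) -> List[str]:
    """Approximate the `tiny` and `short_sentences` annotations for one document."""
    lines = content.split("\n")  # always at least one element
    annotations = []

    # tiny: fewer than 5 lines (assumed strict comparison).
    if len(lines) < 5:
        annotations.append("tiny")

    # short_sentences: more than 50% of lines are shorter than 400 bytes.
    # Assumption: line length is measured on the UTF-8 encoding.
    short = sum(1 for line in lines if len(line.encode("utf-8")) < 400)
    if short / len(lines) > 0.5:
        annotations.append("short_sentences")

    return annotations
```

The remaining annotations (`header`, `footer`, `noisy`, `adult`) depend on thresholds or resources that are not given here, so they are left out of the sketch.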
Filtering
Tip
Filtering can be done using oscar-tools, a high-performance toolkit that provides rapid and efficient ways of transforming corpora into what you need.
Filtering can also be done using classic Python tools, such as `ujson`.
While we don't supply a Python library enabling easy filtering/transformation for OSCAR 22.01, we provide some filtering examples that you can change to better suit your needs.
Getting documents that come from Wikipedia only
Filtering on `warc_headers.warc-target-uri` makes it easy to select documents by URL.
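One possible way to write such a URL filter in Python is sketched below. It works on any iterable of already-parsed documents. The function name `from_wikipedia` and the decision to match `wikipedia.org` and all of its subdomains are assumptions made for this example.

```python
from typing import Iterable, Iterator
from urllib.parse import urlsplit


def from_wikipedia(docs: Iterable[dict]) -> Iterator[dict]:
    """Keep documents whose target URI is hosted on wikipedia.org or a subdomain."""
    for doc in docs:
        uri = doc["warc_headers"].get("warc-target-uri", "")
        try:
            host = urlsplit(uri).hostname or ""
        except ValueError:
            # Malformed URI: treat it as a non-match rather than failing.
            continue
        # Compare on the parsed hostname, so that "notwikipedia.org" or a
        # "wikipedia.org" appearing in the path does not match by accident.
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            yield doc
```

Matching on the parsed hostname rather than on a substring of the URI avoids false positives at the cost of a slightly slower check.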
Extracting lines from non-annotated documents
Non-annotated documents are expected to be cleaner than annotated ones, so their content is a natural candidate for extraction. We extract lines from documents where `metadata.annotation` is `null`.
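A minimal Python sketch of this extraction follows. It assumes the field is spelled `annotation`, as in the example document; adjust the key if your files differ. `lines_from_clean_documents` is a hypothetical name.

```python
from typing import Iterable, Iterator


def lines_from_clean_documents(docs: Iterable[dict]) -> Iterator[str]:
    """Yield every content line of documents that carry no annotation."""
    for doc in docs:
        # .get() returns None both for an explicit null and for a missing key.
        # Assumption: the key is "annotation", as in the example document.
        if doc["metadata"].get("annotation") is None:
            yield from doc["content"].split("\n")
```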
Getting Alemannic lines from the German corpus
As detailed in our paper, we found that the German corpus contains a significant amount of Alemannic relative to the size of the Alemannic corpus. We use a filter on `metadata.sentence_identifications` to extract those sentences.
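The line-level filter could look like the sketch below. The label `"als"` for Alemannic is an assumption (check the labels actually present in your files), and `lines_with_label` and its `min_prob` parameter are hypothetical names introduced for this example.

```python
from typing import Iterable, Iterator

# Assumption: Alemannic lines carry the label "als"; verify against your data.
ALEMANNIC_LABEL = "als"


def lines_with_label(
    docs: Iterable[dict], label: str = ALEMANNIC_LABEL, min_prob: float = 0.0
) -> Iterator[str]:
    """Yield content lines whose line-level identification matches `label`."""
    for doc in docs:
        lines = doc["content"].split("\n")
        identifications = doc["metadata"]["sentence_identifications"]
        # Assumption: identifications are aligned with content lines by position.
        for line, ident in zip(lines, identifications):
            if ident is None:
                continue  # line had no identification
            if ident["label"] == label and ident["prob"] >= min_prob:
                yield line
```

Raising `min_prob` trades recall for precision, which can help when closely related languages are easily confused by the identifier.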
Languages
OSCAR 22.01 has subcorpora for 152 languages (counting the Multilingual corpus). The following table lists the size, number of documents and number of words for each of them.
Note that the size accounts for the raw uncompressed file size, counting metadata.
Language table
Language | Size | # Documents | # Words |
---|---|---|---|
Multilingual | 12.1 GB | 1,210,685 | 936,187,711 |
Afrikaans | 47.0 MB | 12,393 | 6,227,310 |
Albanian | 3.0 GB | 437,287 | 326,325,149 |
Alemannic / Swiss German | 363.6 kB | 139 | 37,381 |
Amharic | 461.0 MB | 37,513 | 30,481,153 |
Arabic | 84.2 GB | 8,718,929 | 6,103,711,887 |
Aragonese | 10.6 kB | 12 | 51 |
Armenian | 4.7 GB | 379,267 | 268,031,270 |
Assamese | 221.2 MB | 17,084 | 11,109,557 |
Asturian | 73.6 kB | 77 | 3,919 |
Avaric | 18.6 kB | 14 | 582 |
Azerbaijani | 3.5 GB | 491,847 | 291,927,692 |
Bangla | 15.1 GB | 1,171,501 | 751,877,226 |
Bashkir | 95.5 MB | 11,198 | 5,418,474 |
Basque | 1.1 GB | 233,658 | 97,092,942 |
Belarusian | 1.8 GB | 180,046 | 107,227,860 |
Bihari languages | 24.2 kB | 27 | 569 |
Bishnupriya | 2.0 MB | 271 | 98,419 |
Bosnian | 10.3 kB | 10 | 422 |
Breton | 33.7 MB | 16,119 | 3,111,619 |
Bulgarian | 35.1 GB | 2,887,115 | 2,405,981,285 |
Burmese | 1.9 GB | 158,733 | 44,835,970 |
Catalan | 13.9 GB | 2,627,307 | 1,508,919,864 |
Cebuano | 44.6 MB | 5,742 | 5,253,785 |
Central Kurdish | 716.4 MB | 84,950 | 43,913,025 |
Chechen | 14.0 MB | 4,086 | 798,766 |
Chinese | 900.9 GB | 56,524,518 | 23,149,203,886 |
Chuvash | 41.8 MB | 4,750 | 2,465,782 |
Cornish | 1.4 kB | 2 | 55 |
Croatian | 11.2 MB | 11,462 | 505,369 |
Czech | 58.6 GB | 10,381,916 | 5,452,724,456 |
Danish | 12.6 GB | 2,265,479 | 1,454,439,292 |
Dimli (individual language) | 706 Bytes | 1 | 19 |
Divehi | 217.2 MB | 24,067 | 10,112,205 |
Dutch | 114.0 GB | 20,206,532 | 12,329,127,151 |
Eastern Mari | 11.3 MB | 1,612 | 641,525 |
Egyptian Arabic | 2.8 MB | 1,256 | 176,096 |
English | 3.2 TB | 431,992,659 | 377,376,402,775 |
Esperanto | 558.3 MB | 111,932 | 58,416,628 |
Estonian | 9.2 GB | 1,362,524 | 820,975,443 |
Filipino | 646.5 MB | 70,394 | 81,881,278 |
Finnish | 37.8 GB | 4,948,961 | 2,900,615,928 |
French | 382.2 GB | 52,037,098 | 41,713,990,658 |
Galician | 255.2 MB | 88,803 | 27,051,212 |
Georgian | 7.1 GB | 488,588 | 281,430,479 |
German | 496.7 GB | 70,075,424 | 46,826,676,844 |
Goan Konkani | 787.2 kB | 46 | 38,831 |
Greek | 78.3 GB | 6,738,546 | 5,031,242,803 |
Guarani | 9.0 kB | 10 | 374 |
Gujarati | 4.8 GB | 136,467 | 301,170,777 |
Hebrew | 30.3 GB | 3,132,396 | 2,249,377,984 |
Hindi | 23.3 GB | 1,529,907 | 1,534,799,198 |
Hungarian | 53.9 GB | 6,866,062 | 4,598,787,907 |
Icelandic | 2.0 GB | 396,183 | 210,365,124 |
Ido | 77.3 kB | 105 | 2,690 |
Iloko | 97.9 kB | 75 | 8,592 |
Indonesian | 17.4 GB | 2,244,622 | 1,984,195,207 |
Interlingua | 40.2 kB | 6 | 10,125 |
Irish | 45.6 MB | 12,233 | 4,877,850 |
Italian | 229.3 GB | 28,502,092 | 24,294,684,830 |
Japanese | 258.7 GB | 36,328,931 | 5,592,948,356 |
Javanese | 152.7 kB | 70 | 10,441 |
Kalmyk | 9.3 kB | 9 | 250 |
Kannada | 2.6 GB | 150,850 | 108,450,571 |
Karachay-Balkar | 119.6 kB | 91 | 4,089 |
Kazakh | 2.9 GB | 261,085 | 157,267,307 |
Khmer | 1.9 GB | 121,910 | 30,564,131 |
Komi | 119.9 kB | 127 | 3,335 |
Korean | 51.8 GB | 5,881,481 | 3,854,968,649 |
Kurdish | 150.3 MB | 29,906 | 17,390,759 |
Kyrgyz | 518.6 MB | 62,244 | 28,028,986 |
Lao | 337.1 MB | 28,914 | 6,682,982 |
Latin | 4.1 MB | 4,397 | 187,446 |
Latvian | 8.2 GB | 1,032,987 | 707,361,898 |
Lezghian | 375.5 kB | 124 | 19,250 |
Limburgish | 1.4 kB | 2 | 41 |
Lithuanian | 20.0 GB | 2,303,070 | 1,712,802,056 |
Lojban | 1.9 MB | 570 | 260,542 |
Lombard | 2.6 kB | 2 | 225 |
Low German | 9.0 MB | 1,938 | 1,012,561 |
Lower Sorbian | 707 Bytes | 1 | 17 |
Luxembourgish | 15.8 MB | 5,108 | 1,545,946 |
Macedonian | 3.6 GB | 341,775 | 244,058,579 |
Maithili | 21.6 kB | 23 | 483 |
Malagasy | 57.3 MB | 3,028 | 7,279,056 |
Malay | 5.3 MB | 5,228 | 217,818 |
Malayalam | 4.1 GB | 250,972 | 137,831,247 |
Maltese | 2.5 MB | 2,208 | 118,190 |
Marathi | 3.3 GB | 250,376 | 160,179,233 |
Mazanderani | 128.2 kB | 76 | 7,337 |
Minangkabau | 6.0 MB | 585 | 614,613 |
Mingrelian | 7.6 MB | 2,550 | 253,333 |
Mongolian | 2.8 GB | 237,719 | 176,405,432 |
Nahuatl languages | 8.7 kB | 12 | 179 |
Nepali | 3.7 GB | 391,947 | 177,885,116 |
Newari | 5.7 MB | 1,134 | 273,837 |
Norwegian | 2.8 GB | 973,188 | 279,182,902 |
Norwegian Nynorsk | 6.8 MB | 5,835 | 459,183 |
Occitan | 2.1 MB | 373 | 31,061 |
Odia | 487.9 MB | 52,942 | 23,755,902 |
Ossetic | 13.9 MB | 3,560 | 800,430 |
Pashto | 490.3 MB | 50,312 | 46,293,249 |
Persian | 77.4 GB | 7,665,871 | 6,430,164,396 |
Piedmontese | 1.7 MB | 698 | 188,270 |
Polish | 139.0 GB | 19,301,137 | 12,584,498,906 |
Portuguese | 170.3 GB | 23,735,707 | 18,441,864,893 |
Punjabi | 1.1 GB | 68,094 | 70,068,604 |
Quechua | 744 Bytes | 1 | 14 |
Romanian | 49.2 GB | 4,624,764 | 5,261,803,995 |
Russia Buriat | 32.9 kB | 39 | 785 |
Russian | 1.1 TB | 76,060,844 | 62,811,122,663 |
Sakha | 65.6 MB | 6,284 | 3,473,813 |
Sanskrit | 136.0 MB | 4,472 | 5,671,369 |
Scottish Gaelic | 137.7 kB | 136 | 7,769 |
Serbian | 6.9 GB | 577,472 | 482,932,670 |
Serbian (Latin) | 931.8 kB | 738 | 92,875 |
Sicilian | 1.5 kB | 2 | 50 |
Sindhi | 117.1 MB | 15,516 | 10,685,611 |
Sinhala | 2.0 GB | 108,593 | 113,179,741 |
Slovak | 16.5 GB | 2,409,555 | 1,619,121,944 |
Slovenian | 1.2 GB | 351,894 | 118,400,246 |
Somali | 2.1 kB | 3 | 109 |
South Azerbaijani | 14.1 MB | 5,381 | 693,746 |
Spanish | 381.9 GB | 51,386,247 | 42,829,835,316 |
Sundanese | 5.0 MB | 263 | 547,145 |
Swahili | 1.3 MB | 462 | 123,050 |
Swedish | 48.0 GB | 7,541,278 | 5,078,331,128 |
Tajik | 870.9 MB | 46,366 | 56,627,727 |
Tamil | 11.4 GB | 556,772 | 452,343,748 |
Tatar | 915.3 MB | 76,398 | 51,875,265 |
Telugu | 3.4 GB | 249,756 | 137,752,065 |
Thai | 66.1 GB | 5,030,254 | 1,626,779,846 |
Tibetan | 234.5 MB | 18,683 | 2,286,269 |
Turkish | 75.1 GB | 10,826,031 | 6,421,221,358 |
Turkmen | 4.4 MB | 2,485 | 276,632 |
Ukrainian | 48.8 GB | 4,558,214 | 2,879,585,992 |
Emiliano-Romagnolo[eml] | 901 Bytes | 1 | 53 |
Upper Sorbian | 132.8 kB | 110 | 8,825 |
Urdu | 3.4 GB | 336,994 | 332,816,354 |
Uyghur | 201.9 MB | 18,556 | 11,240,889 |
Uzbek | 19.9 MB | 9,526 | 1,370,842 |
Vietnamese | 98.9 GB | 9,587,233 | 12,283,185,482 |
Volapük | 825.9 kB | 661 | 57,039 |
Walloon | 105.7 kB | 138 | 4,386 |
Waray | 7.6 MB | 933 | 830,872 |
Welsh | 409.3 MB | 90,378 | 49,488,495 |
Western Frisian | 75.3 MB | 21,946 | 6,357,929 |
Western Mari | 743.5 kB | 155 | 43,916 |
Western Panjabi | 46.7 MB | 6,790 | 4,060,419 |
Wu Chinese | 137.2 kB | 88 | 3,056 |
Yiddish | 232.5 MB | 23,418 | 15,809,780 |
Yoruba | 24.7 kB | 26 | 1,042 |