OSCAR 22.01

OSCAR 2201 is the OSCAR version from January, 2022, the November/December 2021 dump of Common Crawl. It features a different file layout that makes it not backward compatible with code designed to run with previous OSCAR versions.

Request access 🤗 Datasets Read the paper

Summary

OSCAR 22.01 is document-oriented, which means that rather than extracting lines and sorting them in language subcorpora, we identify documents as a whole. The main differences are that sentences in a document are contiguous and should make sense one after another, but sentences are not guaranteed to be of the subcorpus' language.

Note

As an example, the English Wikipedia page about La Marseillaise contains sentences in French (The anthem's lyrics). In line-oriented corpora, these sentences would have been put in the French subcorpus. In OSCAR 22.01, they should be along with the article, in a document classified as English.

Layout

As previous corpora, there is one subcorpus per language, plus one new subcorpus for multilingual documents. Subcorpora are distributed in JSONLines, split into 1GB chunks, then gzipped.

Note

Splits are completely independent and self-contained: It is possible to only download en_meta_134.jsonl.gz and to do processing on it.

Example document

{
  "content":"newline\nseparaaaaaaaaaaated\ncontent", // (1)
  "warc_headers":{ // (2) 
    "warc-refers-to":"<urn:uuid:83f2e1d4-5ed3-41db-86ff-f7826c4c20f9>", 
    "warc-date":"2021-09-16T11:07:14Z",
    "warc-block-digest":"sha1:X3OWP47FG2O5LBNMFSNB44FJF2SSRC26",
    "warc-type":"conversion",
    "warc-identified-content-language":"eng",
    "content-length":"1694",
    "warc-target-uri":"https://foo.bar",
    "warc-record-id":"<urn:uuid:3304bc27-17d0-4ffd-a692-340381478a5f>",
    "content-type":"text/plain"
  },
  "metadata":{
    // (3)
    "identification":{
      "label":"en",
      "prob":0.6268374
    },

    // (4)
    "annotation":[
      "short_sentences",
      "footer"
    ],

    // (5)
    "sentence_identifications":[
      {
        "label":"en",
        "prob":0.93925816
      },
      null,
      {
        "label":"en",
        "prob":0.9606543
      }
    ]
  }
}

Content. Lines are separated by \n.
Headers from the crawler. Note that nothing is changed, so the content length may be incorrect.
Document-wide identification. prob is the weighted average of the confidence of identified lines.
Annotations of the document. null if no annotation.
Line-by-line identifications. null for each line that has no identification.

Annotations

tiny: The document has a low (<5) number of lines.
short_sentences: The document has a high number (>50%) of short lines (<400 bytes)
header: The document has a high number of short lines at its head, suggesting the presence of low quality content.
footer: The document has a high number of short lines at its tail, suggesting the presence of low quality content.
noisy: The document has a high percentage of punctuation (>50%)
adult: The document contains adult content. This annotation uses a blocklist and labels a tiny part of the corpus: It does not catch most of the adult content.

More information about the thresholds and annotators are present in our paper.

Filtering

Tip

Filtering can be done using oscar-tools, a high performance toolkit that provides rapid and efficient ways of transforming corpora into what you need. More info here.

Filtering can be done using classic Python tools, such as ujson. While we don't supply a Python library enabling easy filtering/transformation for OSCAR 22.01, we provide some filtering examples that you can change to better suit your needs.

Getting documents that come from Wikipedia only

Using filters on warc_headers.warc-target-uri makes filtering on URLs easy.

TODO

Extracting lines from non-annotated documents

Non-annotated documents are suspected to be cleaner than annotated ones, so extracting their content should be interesting to do. We extract lines from documents where metadata.annotations == null.

TODO

Getting Alemannic lines from the German corpus

As detailed in our paper, we found that the German corpus has a (relative to the Alemannic corpus size) important amount of Alemannic. We use a filter on metadata.sentence_identifications to extract those sentences.

TODO

Languages

OSCAR 22.01 has subcorpora for 142 languages (counting the Multilingual corpus). The following table exhibits the size, number of documents and number of words for each of them.

Note that the size accounts for the raw uncompressed file size, counting metadata.

Language table

Language	Size	# Documents	# Words
Multilingual	12.1 GB	1,210,685	936,187,711
Afrikaans	47.0 MB	12,393	6,227,310
Albanian	3.0 GB	437,287	326,325,149
Alemannic / Swiss German	363.6 kB	139	37,381
Amharic	461.0 MB	37,513	30,481,153
Arabic	84.2 GB	8,718,929	6,103,711,887
Aragonese	10.6 kB	12	51
Armenian	4.7 GB	379,267	268,031,270
Assamese	221.2 MB	17,084	11,109,557
Asturian	73.6 kB	77	3,919
Avaric	18.6 kB	14	582
Azerbaijani	3.5 GB	491,847	291,927,692
Bangla	15.1 GB	1,171,501	751,877,226
Bashkir	95.5 MB	11,198	5,418,474
Basque	1.1 GB	233,658	97,092,942
Belarusian	1.8 GB	180,046	107,227,860
Bihari languages	24.2 kB	27	569
Bishnupriya	2.0 MB	271	98,419
Bosnian	10.3 kB	10	422
Breton	33.7 MB	16,119	3,111,619
Bulgarian	35.1 GB	2,887,115	2,405,981,285
Burmese	1.9 GB	158,733	44,835,970
Catalan	13.9 GB	2,627,307	1,508,919,864
Cebuano	44.6 MB	5,742	5,253,785
Central Kurdish	716.4 MB	84,950	43,913,025
Chechen	14.0 MB	4,086	798,766
Chinese	900.9 GB	56,524,518	23,149,203,886
Chuvash	41.8 MB	4,750	2,465,782
Cornish	1.4 kB	2	55
Croatian	11.2 MB	11,462	505,369
Czech	58.6 GB	10,381,916	5,452,724,456
Danish	12.6 GB	2,265,479	1,454,439,292
Dimli (individual language)	706 Bytes	1	19
Divehi	217.2 MB	24,067	10,112,205
Dutch	114.0 GB	20,206,532	12,329,127,151
Eastern Mari	11.3 MB	1,612	641,525
Egyptian Arabic	2.8 MB	1,256	176,096
English	3.2 TB	431,992,659	377,376,402,775
Esperanto	558.3 MB	111,932	58,416,628
Estonian	9.2 GB	1,362,524	820,975,443
Filipino	646.5 MB	70,394	81,881,278
Finnish	37.8 GB	4,948,961	2,900,615,928
French	382.2 GB	52,037,098	41,713,990,658
Galician	255.2 MB	88,803	27,051,212
Georgian	7.1 GB	488,588	281,430,479
German	496.7 GB	70,075,424	46,826,676,844
Goan Konkani	787.2 kB	46	38,831
Greek	78.3 GB	6,738,546	5,031,242,803
Guarani	9.0 kB	10	374
Gujarati	4.8 GB	136,467	301,170,777
Hebrew	30.3 GB	3,132,396	2,249,377,984
Hindi	23.3 GB	1,529,907	1,534,799,198
Hungarian	53.9 GB	6,866,062	4,598,787,907
Icelandic	2.0 GB	396,183	210,365,124
Ido	77.3 kB	105	2,690
Iloko	97.9 kB	75	8,592
Indonesian	17.4 GB	2,244,622	1,984,195,207
Interlingua	40.2 kB	6	10,125
Irish	45.6 MB	12,233	4,877,850
Italian	229.3 GB	28,502,092	24,294,684,830
Japanese	258.7 GB	36,328,931	5,592,948,356
Javanese	152.7 kB	70	10,441
Kalmyk	9.3 kB	9	250
Kannada	2.6 GB	150,850	108,450,571
Karachay-Balkar	119.6 kB	91	4,089
Kazakh	2.9 GB	261,085	157,267,307
Khmer	1.9 GB	121,910	30,564,131
Komi	119.9 kB	127	3,335
Korean	51.8 GB	5,881,481	3,854,968,649
Kurdish	150.3 MB	29,906	17,390,759
Kyrgyz	518.6 MB	62,244	28,028,986
Lao	337.1 MB	28,914	6,682,982
Latin	4.1 MB	4,397	187,446
Latvian	8.2 GB	1,032,987	707,361,898
Lezghian	375.5 kB	124	19,250
Limburgish	1.4 kB	2	41
Lithuanian	20.0 GB	2,303,070	1,712,802,056
Lojban	1.9 MB	570	260,542
Lombard	2.6 kB	2	225
Low German	9.0 MB	1,938	1,012,561
Lower Sorbian	707 Bytes	1	17
Luxembourgish	15.8 MB	5,108	1,545,946
Macedonian	3.6 GB	341,775	244,058,579
Maithili	21.6 kB	23	483
Malagasy	57.3 MB	3,028	7,279,056
Malay	5.3 MB	5,228	217,818
Malayalam	4.1 GB	250,972	137,831,247
Maltese	2.5 MB	2,208	118,190
Marathi	3.3 GB	250,376	160,179,233
Mazanderani	128.2 kB	76	7,337
Minangkabau	6.0 MB	585	614,613
Mingrelian	7.6 MB	2,550	253,333
Mongolian	2.8 GB	237,719	176,405,432
Nahuatl languages	8.7 kB	12	179
Nepali	3.7 GB	391,947	177,885,116
Newari	5.7 MB	1,134	273,837
Norwegian	2.8 GB	973,188	279,182,902
Norwegian Nynorsk	6.8 MB	5,835	459,183
Occitan	2.1 MB	373	31,061
Odia	487.9 MB	52,942	23,755,902
Ossetic	13.9 MB	3,560	800,430
Pashto	490.3 MB	50,312	46,293,249
Persian	77.4 GB	7,665,871	6,430,164,396
Piedmontese	1.7 MB	698	188,270
Polish	139.0 GB	19,301,137	12,584,498,906
Portuguese	170.3 GB	23,735,707	18,441,864,893
Punjabi	1.1 GB	68,094	70,068,604
Quechua	744 Bytes	1	14
Romanian	49.2 GB	4,624,764	5,261,803,995
Russia Buriat	32.9 kB	39	785
Russian	1.1 TB	76,060,844	62,811,122,663
Sakha	65.6 MB	6,284	3,473,813
Sanskrit	136.0 MB	4,472	5,671,369
Scottish Gaelic	137.7 kB	136	7,769
Serbian	6.9 GB	577,472	482,932,670
Serbian (Latin)	931.8 kB	738	92,875
Sicilian	1.5 kB	2	50
Sindhi	117.1 MB	15,516	10,685,611
Sinhala	2.0 GB	108,593	113,179,741
Slovak	16.5 GB	2,409,555	1,619,121,944
Slovenian	1.2 GB	351,894	118,400,246
Somali	2.1 kB	3	109
South Azerbaijani	14.1 MB	5,381	693,746
Spanish	381.9 GB	51,386,247	42,829,835,316
Sundanese	5.0 MB	263	547,145
Swahili	1.3 MB	462	123,050
Swedish	48.0 GB	7,541,278	5,078,331,128
Tajik	870.9 MB	46,366	56,627,727
Tamil	11.4 GB	556,772	452,343,748
Tatar	915.3 MB	76,398	51,875,265
Telugu	3.4 GB	249,756	137,752,065
Thai	66.1 GB	5,030,254	1,626,779,846
Tibetan	234.5 MB	18,683	2,286,269
Turkish	75.1 GB	10,826,031	6,421,221,358
Turkmen	4.4 MB	2,485	276,632
Ukrainian	48.8 GB	4,558,214	2,879,585,992
Emiliano-Romagnolo[eml]	901 Bytes	1	53
Upper Sorbian	132.8 kB	110	8,825
Urdu	3.4 GB	336,994	332,816,354
Uyghur	201.9 MB	18,556	11,240,889
Uzbek	19.9 MB	9,526	1,370,842
Vietnamese	98.9 GB	9,587,233	12,283,185,482
Volapük	825.9 kB	661	57,039
Walloon	105.7 kB	138	4,386
Waray	7.6 MB	933	830,872
Welsh	409.3 MB	90,378	49,488,495
Western Frisian	75.3 MB	21,946	6,357,929
Western Mari	743.5 kB	155	43,916
Western Panjabi	46.7 MB	6,790	4,060,419
Wu Chinese	137.2 kB	88	3,056
Yiddish	232.5 MB	23,418	15,809,780
Yoruba	24.7 kB	26	1,042
Multilingual	12.1 GB	1,210,685	936,187,711