OSCAR 21.09

Features

These are the versions of the tooling, schema, and data used for this release:

  • CommonCrawl version: February/March 2021 (2021.10)
  • OSCAR Schema version: v1.1 : Incorporates metadata in a backward compatible manner.
  • Ungoliant version: v1 : New-generation tool, faster and better documented and tested than its predecessor, goclassy.

Changes

  • As per OSCAR Schema v1.1, each document/record has associated metadata.
  • New languages: Manx, Rusyn, Scots and West Flemish. Their size and quality still have to be assessed.
  • Removed languages: Central Bikol and Cantonese. The Cantonese corpus was of very low quality. The Central Bikol corpus is still available in OSCAR 2019.

Table

Code Language OSCAR 2019 OSCAR 2019 deduplicated OSCAR 21.09 OSCAR 21.09 deduplicated Issues
af Afrikaans 251MB 170MB 258MB 157MB
sq Albanian 2GB 1GB 3GB 1GB
am Amharic 377MB 215MB 405MB 241MB
ar Arabic 87GB 33GB 69GB 35GB
an Aragonese 1MB 822KB 1MB 608KB
hy Armenian 3GB 1GB 4GB 1GB
as Assamese 117MB 73MB 135MB 95MB
ast Asturian 2MB 2MB 7MB 4MB
av Avaric 418KB 331KB 421KB 325KB
az Azerbaijani 2GB 1GB 3GB 1GB
bn Bangla 10GB 6GB 14GB 7GB
ba Bashkir 133MB 93MB 110MB 77MB
eu Basque 889MB 358MB 900MB 503MB
bar Bavarian 507B 507B 2KB 1KB
be Belarusian 1GB 1GB 2GB 1GB
bh Bihari languages 112KB 34KB 579KB 120KB
bpy Bishnupriya 4MB 1MB 11MB 4MB
bs Bosnian 459KB 120KB 310KB 175KB
br Breton 29MB 16MB 49MB 23MB
bg Bulgarian 33GB 14GB 34GB 15GB
my Burmese 2GB 1GB 2GB 1GB
yue Cantonese 3KB 2KB - -
ca Catalan 8GB 4GB 13GB 6GB
ceb Cebuano 40MB 24MB 81MB 58MB
bcl Central Bikol 886B 886B - -
ckb Central Kurdish 509MB 236MB 784MB 367MB
cbk Chavacano 521B 521B 168B 168B {{< issue cbk >}}
ce Chechen 8MB 6MB 29MB 20MB
zh Chinese 544GB 267GB 500GB 266GB
cv Chuvash 40MB 27MB 60MB 41MB
kw Cornish 44KB 14KB 119KB 72KB
hr Croatian 237MB 115MB 361MB 169MB
cs Czech 56GB 25GB 72GB 33GB
da Danish 16GB 10GB 18GB 10GB
diq Dimli (individual language) 147B 147B 294B 147B
dv Divehi 131MB 81MB 143MB 111MB
nl Dutch 82GB 41GB 97GB 47GB
mhr Eastern Mari 7MB 6MB 15MB 10MB
arz Egyptian Arabic 68MB 34MB 48MB 21MB
en English 2520GB 1294GB 2936GB 1342GB
myv Erzya 1KB 1KB 29KB 2KB
eo Esperanto 312MB 238MB 560MB 390MB
et Estonian 5GB 2GB 7GB 3GB
tl Filipino 601MB 426MB 699MB 383MB
fi Finnish 28GB 13GB 35GB 20GB
fr French 302GB 147GB 340GB 161GB
gl Galician 650MB 402MB 989MB 549MB
ka Georgian 3GB 1GB 6GB 2GB
de German 330GB 155GB 433GB 184GB
gom Goan Konkani 2MB 1MB 3MB 2MB
el Greek 66GB 28GB 72GB 30GB
gn Guarani 36KB 23KB 32KB 25KB
gu Gujarati 1GB 756MB 1GB 950MB
ht Haitian Creole 3KB 3KB 2KB 1KB
he Hebrew 21GB 10GB 29GB 11GB
hi Hindi 17GB 9GB 26GB 13GB
hu Hungarian 42GB 18GB 60GB 29GB
is Icelandic 1GB 887MB 2GB 1GB
io Ido 151KB 133KB 276KB 221KB
ilo Iloko 896KB 653KB 1MB 857KB
id Indonesian 32GB 16GB 40GB 22GB
ia Interlingua 678KB 368KB 291KB 172KB
ie Interlingue 24KB 1KB 7KB 2KB
ga Irish 91MB 62MB 131MB 69MB
it Italian 146GB 73GB 192GB 94GB
ja Japanese 231GB 112GB 208GB 96GB
jv Javanese 675KB 598KB 858KB 728KB
xal Kalmyk 115KB 114KB 62KB 62KB
kn Kannada 1GB 1GB 2GB 1GB
krc Karachay-Balkar 2MB 2MB 2MB 2MB
kk Kazakh 2GB 1GB 3GB 1GB
km Khmer 1GB 608MB 1GB 860MB
kv Komi 2MB 1MB 1MB 588KB
ko Korean 25GB 11GB 35GB 15GB
ku Kurdish 98MB 62MB 152MB 108MB
ky Kyrgyz 629MB 406MB 485MB 334MB
lo Lao 181MB 118MB 287MB 163MB
la Latin 26MB 8MB 103MB 9MB
lv Latvian 4GB 1GB 6GB 2GB
lez Lezghian 3MB 3MB 2MB 2MB
li Limburgish 29KB 27KB 76KB 54KB
lt Lithuanian 9GB 4GB 12GB 5GB
jbo Lojban 753KB 694KB 929KB 731KB
lmo Lombard 454KB 444KB 1MB 1MB
nds Low German 18MB 13MB 25MB 17MB
dsb Lower Sorbian 13KB 7KB 31KB 14KB
lb Luxembourgish 30MB 21MB 54MB 37MB
mk Macedonian 2GB 1GB 3GB 1GB
mai Maithili 324KB 10KB 685KB 24KB
mg Malagasy 21MB 13MB 59MB 38MB
ms Malay 116MB 43MB 146MB 60MB
ml Malayalam 5GB 2GB 4GB 2GB
mt Maltese 24MB 17MB 51MB 26MB
gv Manx - - 1KB 907B
mr Marathi 2GB 1GB 3GB 1GB
mzn Mazanderani 708KB 617KB 1MB 1MB
min Minangkabau 622KB 317KB 8MB 1MB
xmf Mingrelian 6MB 4MB 16MB 10MB
mwl Mirandese 1KB 1KB 3KB 2KB
mn Mongolian 2GB 879MB 1GB 912MB
nah Nahuatl languages 11KB 10KB 34KB 21KB
nap Neapolitan 17KB 13KB 1KB 1KB {{< issue nap >}}
ne Nepali 1GB 1GB 3GB 2GB
new Newari 5MB 4MB 6MB 4MB
frr Northern Frisian 4KB 4KB 7KB 5KB {{< issue frr >}}
lrc Northern Luri 77KB 64KB 183B 183B
no Norwegian Bokmål 8GB 5GB 9GB 4GB
nn Norwegian Nynorsk 88MB 56MB 123MB 66MB
oc Occitan 6MB 3MB 12MB 5MB
or Odia 259MB 196MB 538MB 357MB
os Ossetic 12MB 10MB 11MB 6MB
pam Pampanga 763B 307B 3KB 3KB
ps Pashto 378MB 253MB 404MB 286MB
fa Persian 84GB 39GB 79GB 35GB
pms Piedmontese 2MB 1MB 4MB 3MB
pl Polish 116GB 50GB 122GB 48GB
pt Portuguese 132GB 67GB 159GB 71GB
pa Punjabi 799MB 481MB 769MB 430MB
qu Quechua 80KB 68KB 322KB 230KB
ro Romanian 26GB 11GB 37GB 15GB
rm Romansh 7KB 6KB 3KB 3KB
bxr Russia Buriat 12KB 10KB 22KB 18KB
ru Russian 1239GB 609GB 1201GB 542GB
rue Rusyn - - 247B 247B
sah Sakha 43MB 27MB 57MB 39MB
sa Sanskrit 96MB 38MB 72MB 43MB
sco Scots - - 1KB 1KB {{< issue sco >}}
gd Scottish Gaelic 1MB 1MB 2MB 1MB
sr Serbian 4GB 2GB 6GB 3GB
sh Serbian (Latin) 25MB 6MB 13MB 9MB
scn Sicilian 3KB 2KB 4KB 3KB
sd Sindhi 363MB 274MB 75MB 50MB
si Sinhala 1GB 840MB 1GB 791MB
sk Slovak 9GB 4GB 14GB 6GB
sl Slovenian 2GB 1GB 4GB 1GB
so Somali 62KB 15KB 15KB 13KB {{< issue so >}}
azb South Azerbaijani 28MB 19MB 47MB 29MB
es Spanish 297GB 159GB 342GB 160GB
su Sundanese 216KB 145KB 397KB 274KB
sw Swahili 13MB 8MB 11MB 7MB
sv Swedish 46GB 26GB 43GB 19GB
tg Tajik 396MB 260MB 985MB 321MB {{< issue tg >}}
ta Tamil 9GB 5GB 10GB 5GB
tt Tatar 701MB 319MB 947MB 424MB
te Telugu 2GB 1GB 3GB 1GB
th Thai 38GB 17GB 62GB 26GB
bo Tibetan 195MB 144MB 439MB 358MB
gsw[^1] Alemannic German 5MB 2MB 7MB 5MB
tr Turkish 63GB 28GB 73GB 33GB {{< issue tr >}}
tk Turkmen 10MB 7MB 25MB 20MB
tyv Tuvinian 11KB 8KB 9KB 7KB
uk Ukrainian 56GB 29GB 53GB 28GB
eml Emiliano-Romagnolo[^2] 25KB 23KB 22KB 20KB
hsb Upper Sorbian 4MB 1MB 2MB 1MB
ur Urdu 2GB 1GB 2GB 1GB
ug Uyghur 127MB 86MB 187MB 123MB
uz Uzbek 21MB 11MB 56MB 28MB
vec Venetian 18KB 16KB 37KB 28KB
vi Vietnamese 72GB 33GB 87GB 42GB
vo Volapük 2MB 2MB 2MB 2MB
wa Walloon 280KB 207KB 511KB 329KB
war Waray 2MB 2MB 4MB 4MB
cy Welsh 223MB 139MB 307MB 180MB
vls West Flemish - - 134B 134B {{< issue vls >}}
fy Western Frisian 35MB 26MB 82MB 57MB
mrj Western Mari 1MB 1MB 645KB 521KB
pnb Western Panjabi 11MB 9MB 68MB 45MB
wuu Wu Chinese 111KB 32KB 145KB 69KB {{< issue wuu >}}
yi Yiddish 146MB 87MB 199MB 93MB
yo Yoruba 56KB 26KB 229KB 120KB

OSCAR Schema v1.1.0

The new OSCAR schema incorporates backward-compatible changes.

Changes

The old OSCAR Schema v1.0 featured the following file hierarchy, in an uncompressed form:

/
├── af
│   ├── af_sha256.txt
│   └── af.txt.gz
├── de
│   ├── de_sha256.txt    # Checksum file
│   └── de.txt.gz        # Textual content
├── en
│   ├── en_part_1.txt.gz        # Multipart example
│   ├── en_part_2.txt.gz
│   └── en_sha256.txt
├── yi
│   ├── yi_sha256.txt
│   └── yi.txt.gz
└── zh
    ├── zh_sha256.txt
    └── zh.txt.gz

The new OSCAR Schema v1.1 features the following file hierarchy (some languages omitted):

/
├── af
│   ├── af_meta.jsonl.gz
│   ├── af_sha256.txt
│   └── af.txt.gz
├── de
│   ├── de_meta.jsonl.gz # Metadata, in JSONLines format
│   ├── de_sha256.txt    # Checksum file
│   └── de.txt.gz        # Textual content
├── en
│   ├── en_meta_part_1.jsonl.gz # Multipart example
│   ├── en_meta_part_2.jsonl.gz # Each part is independent: a text part pairs with its metadata part,
│   ├── en_part_1.txt.gz        # e.g. en_part_2.txt.gz and en_meta_part_2.jsonl.gz
│   ├── en_part_2.txt.gz
│   └── en_sha256.txt
├── yi
│   ├── yi_meta.jsonl.gz
│   ├── yi_sha256.txt
│   └── yi.txt.gz
└── zh
    ├── zh_meta.jsonl.gz
    ├── zh_sha256.txt
    └── zh.txt.gz
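
As an illustration of the naming convention in the tree above, here is a minimal Python sketch that maps a text file name to the name of its metadata file. The helper `meta_name_for` and its regular expression are hypothetical: they are inferred from the example hierarchy, not taken from the OSCAR tooling.

```python
import re

# Hypothetical pattern, inferred from the example tree:
# "<lang>.txt.gz" or "<lang>_part_<n>.txt.gz".
_TEXT_NAME = re.compile(r"^(?P<lang>[a-z]+)(?:_part_(?P<part>\d+))?\.txt\.gz$")


def meta_name_for(text_name: str) -> str:
    """Return the metadata file name that pairs with a text file name."""
    match = _TEXT_NAME.match(text_name)
    if match is None:
        raise ValueError(f"not an OSCAR text file name: {text_name!r}")
    lang, part = match.group("lang"), match.group("part")
    if part is None:
        # Single-file language: de.txt.gz -> de_meta.jsonl.gz
        return f"{lang}_meta.jsonl.gz"
    # Multipart language: en_part_2.txt.gz -> en_meta_part_2.jsonl.gz
    return f"{lang}_meta_part_{part}.jsonl.gz"


print(meta_name_for("de.txt.gz"))         # de_meta.jsonl.gz
print(meta_name_for("en_part_2.txt.gz"))  # en_meta_part_2.jsonl.gz
```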

File formats

.txt files

Lines are newline-separated, and documents are double-newline separated. In other words, a blank line separates consecutive documents.
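
The format described above can be read with a few lines of Python. This is only a sketch: `iter_documents` is a hypothetical helper name, and the sample text is made up for the demonstration.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each document as a list of lines, splitting on blank lines."""
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            # A blank line closes the current document, if there is one.
            if current:
                yield current
                current = []
        else:
            current.append(line)
    if current:
        # The last document may not be followed by a blank line.
        yield current


# Made-up sample: two documents separated by one blank line.
sample = "first doc, line 1\nfirst doc, line 2\n\nsecond doc, line 1\n"
docs = list(iter_documents(sample.splitlines(keepends=True)))
print(len(docs))  # 2
```

On a real corpus file, the same function should work on a handle opened with `gzip.open(path, "rt", encoding="utf-8")`, which streams the file line by line instead of loading it into memory.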

.jsonl files

These are the metadata, in JSONLines format.

Each line conforms to the following JSON Schema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Metadata",
  "description": "Holds record headers.\n\nEach metadata is linked to a specific paragraph/text zone",
  "type": "object",
  "required": [
    "headers",
    "nb_sentences",
    "offset"
  ],
  "properties": {
    "headers": {
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    },
    "nb_sentences": {
      "type": "integer",
      "format": "uint",
      "minimum": 0.0
    },
    "offset": {
      "type": "integer",
      "format": "uint",
      "minimum": 0.0
    }
  }
}

Example:

{
   "headers":{                  // these header keys are *almost* always present.
      "content-length":"11062", // the content length is not changed and reflects the 
                                // length before filtering and possible deduplication.
      "warc-target-uri":"...",
      "warc-type":"conversion",
      "content-type":"text/plain",
      "warc-date":"2021-02-24T17:55:29Z", // Following WARC specification, it is the crawl date.
      "warc-identified-content-language":"eng,zho",
      "warc-refers-to":"<urn:uuid:c649de0e-42a3-4e69-b675-98e28e084698>",
      "warc-block-digest":"sha1:V4PYYGYA6ZYA2WACDKSNL6NXGDN6XK6X",
      "warc-record-id":"<urn:uuid:121a822f-5362-4559-8891-d085415cdd90>"
   },
   "offset":0, // The related text is in the text file, from line offset+1 to line offset+nb_sentences.
   "nb_sentences":9
}
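
To show how `offset` and `nb_sentences` point into the text file, here is a minimal Python sketch that parses one metadata line, checks the required keys from the schema, and slices the matching lines. The names `parse_metadata` and `document_lines` are hypothetical, and the sample data is made up. It assumes that the blank separator lines count toward `offset`, and that real metadata lines carry no `//` comments (those in the example above are annotations and are not valid JSON).

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("headers", "nb_sentences", "offset")


def parse_metadata(line: str) -> Dict[str, Any]:
    """Parse one JSONLines record and check the fields the schema requires."""
    record = json.loads(line)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    if not isinstance(record["headers"], dict):
        raise ValueError("headers must be an object")
    for key in ("nb_sentences", "offset"):
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer")
    return record


def document_lines(text_lines: List[str], record: Dict[str, Any]) -> List[str]:
    """Return lines offset+1 .. offset+nb_sentences (1-indexed) of the text."""
    start = record["offset"]
    return text_lines[start:start + record["nb_sentences"]]


# Made-up text file content: two documents separated by a blank line.
text_lines = ["a1", "a2", "", "b1", "b2", "b3"]
# Assumption: the blank line counts, so the second document starts at offset 3.
meta_line = '{"headers": {"warc-type": "conversion"}, "offset": 3, "nb_sentences": 3}'
record = parse_metadata(meta_line)
print(document_lines(text_lines, record))  # ['b1', 'b2', 'b3']
```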

<lang>_sha256.txt files

These are used to check for possible corruption during download. To verify the files, run sha256sum -c <lang>_sha256.txt.
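
Where `sha256sum` is not available, the same check can be sketched in Python. This assumes the checksum files use the usual `sha256sum` line format (`<hex digest>  <file name>`), which the `sha256sum -c` command implies. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def parse_checksums(text: str) -> Dict[str, str]:
    """Map file names to expected digests from sha256sum-style lines."""
    expected: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum prefixes the name with '*' when it hashes in binary mode.
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large corpus files need not fit in memory."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify(directory: Path, checksum_file: str) -> Dict[str, bool]:
    """Return, per listed file, whether it exists and matches its digest."""
    expected = parse_checksums((directory / checksum_file).read_text())
    return {
        name: (directory / name).is_file() and sha256_of(directory / name) == digest
        for name, digest in expected.items()
    }
```

For example, `verify(Path("af"), "af_sha256.txt")` would return a dictionary whose values are all `True` when the download is intact.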

[^1]: gsw is the ISO 639-2 code for Alemannic German. Earlier OSCAR versions identified it as als, due to a bug in fasttext.

[^2]: The eml identification tag is deprecated and corresponds to the rgn and egl tags in ISO 639-3.