OSCAR 21.09
Features
These are the versions of the tooling, schema and data used for this release:
- CommonCrawl version: February/March 2021 (2021.10)
- OSCAR Schema version: v1.1 : Incorporates metadata in a backward compatible manner.
- Ungoliant version: v1 : New generation tool, faster and better documented/tested than its predecessor, goclassy.
Changes
- As per OSCAR Schema v1.1, each document/record has associated metadata.
- New languages: Manx, Rusyn, Scots and West Flemish. Their size and quality still have to be assessed.
- Removed languages: Central Bikol and Cantonese. The Cantonese corpus was of very low quality. The Central Bikol corpus is still available in OSCAR 2019.
Table
Code | Language | OSCAR 2019 | OSCAR 2019 deduplicated | OSCAR 21.09 | OSCAR 21.09 deduplicated | Issues |
---|---|---|---|---|---|---|
af | Afrikaans | 251MB | 170MB | 258MB | 157MB | |
sq | Albanian | 2GB | 1GB | 3GB | 1GB | |
am | Amharic | 377MB | 215MB | 405MB | 241MB | |
ar | Arabic | 87GB | 33GB | 69GB | 35GB | |
an | Aragonese | 1MB | 822KB | 1MB | 608KB | |
hy | Armenian | 3GB | 1GB | 4GB | 1GB | |
as | Assamese | 117MB | 73MB | 135MB | 95MB | |
ast | Asturian | 2MB | 2MB | 7MB | 4MB | |
av | Avaric | 418KB | 331KB | 421KB | 325KB | |
az | Azerbaijani | 2GB | 1GB | 3GB | 1GB | |
bn | Bangla | 10GB | 6GB | 14GB | 7GB | |
ba | Bashkir | 133MB | 93MB | 110MB | 77MB | |
eu | Basque | 889MB | 358MB | 900MB | 503MB | |
bar | Bavarian | 507B | 507B | 2KB | 1KB | |
be | Belarusian | 1GB | 1GB | 2GB | 1GB | |
bh | Bihari languages | 112KB | 34KB | 579KB | 120KB | |
bpy | Bishnupriya | 4MB | 1MB | 11MB | 4MB | |
bs | Bosnian | 459KB | 120KB | 310KB | 175KB | |
br | Breton | 29MB | 16MB | 49MB | 23MB | |
bg | Bulgarian | 33GB | 14GB | 34GB | 15GB | |
my | Burmese | 2GB | 1GB | 2GB | 1GB | |
yue | Cantonese | 3KB | 2KB | - | - | |
ca | Catalan | 8GB | 4GB | 13GB | 6GB | |
ceb | Cebuano | 40MB | 24MB | 81MB | 58MB | |
bcl | Central Bikol | 886B | 886B | - | - | |
ckb | Central Kurdish | 509MB | 236MB | 784MB | 367MB | |
cbk | Chavacano | 521B | 521B | 168B | 168B | {{< issue cbk >}} |
ce | Chechen | 8MB | 6MB | 29MB | 20MB | |
zh | Chinese | 544GB | 267GB | 500GB | 266GB | |
cv | Chuvash | 40MB | 27MB | 60MB | 41MB | |
kw | Cornish | 44KB | 14KB | 119KB | 72KB | |
hr | Croatian | 237MB | 115MB | 361MB | 169MB | |
cs | Czech | 56GB | 25GB | 72GB | 33GB | |
da | Danish | 16GB | 10GB | 18GB | 10GB | |
diq | Dimli (individual language) | 147B | 147B | 294B | 147B | |
dv | Divehi | 131MB | 81MB | 143MB | 111MB | |
nl | Dutch | 82GB | 41GB | 97GB | 47GB | |
mhr | Eastern Mari | 7MB | 6MB | 15MB | 10MB | |
arz | Egyptian Arabic | 68MB | 34MB | 48MB | 21MB | |
en | English | 2520GB | 1294GB | 2936GB | 1342GB | |
myv | Erzya | 1KB | 1KB | 29KB | 2KB | |
eo | Esperanto | 312MB | 238MB | 560MB | 390MB | |
et | Estonian | 5GB | 2GB | 7GB | 3GB | |
tl | Filipino | 601MB | 426MB | 699MB | 383MB | |
fi | Finnish | 28GB | 13GB | 35GB | 20GB | |
fr | French | 302GB | 147GB | 340GB | 161GB | |
gl | Galician | 650MB | 402MB | 989MB | 549MB | |
ka | Georgian | 3GB | 1GB | 6GB | 2GB | |
de | German | 330GB | 155GB | 433GB | 184GB | |
gom | Goan Konkani | 2MB | 1MB | 3MB | 2MB | |
el | Greek | 66GB | 28GB | 72GB | 30GB | |
gn | Guarani | 36KB | 23KB | 32KB | 25KB | |
gu | Gujarati | 1GB | 756MB | 1GB | 950MB | |
ht | Haitian Creole | 3KB | 3KB | 2KB | 1KB | |
he | Hebrew | 21GB | 10GB | 29GB | 11GB | |
hi | Hindi | 17GB | 9GB | 26GB | 13GB | |
hu | Hungarian | 42GB | 18GB | 60GB | 29GB | |
is | Icelandic | 1GB | 887MB | 2GB | 1GB | |
io | Ido | 151KB | 133KB | 276KB | 221KB | |
ilo | Iloko | 896KB | 653KB | 1MB | 857KB | |
id | Indonesian | 32GB | 16GB | 40GB | 22GB | |
ia | Interlingua | 678KB | 368KB | 291KB | 172KB | |
ie | Interlingue | 24KB | 1KB | 7KB | 2KB | |
ga | Irish | 91MB | 62MB | 131MB | 69MB | |
it | Italian | 146GB | 73GB | 192GB | 94GB | |
ja | Japanese | 231GB | 112GB | 208GB | 96GB | |
jv | Javanese | 675KB | 598KB | 858KB | 728KB | |
xal | Kalmyk | 115KB | 114KB | 62KB | 62KB | |
kn | Kannada | 1GB | 1GB | 2GB | 1GB | |
krc | Karachay-Balkar | 2MB | 2MB | 2MB | 2MB | |
kk | Kazakh | 2GB | 1GB | 3GB | 1GB | |
km | Khmer | 1GB | 608MB | 1GB | 860MB | |
kv | Komi | 2MB | 1MB | 1MB | 588KB | |
ko | Korean | 25GB | 11GB | 35GB | 15GB | |
ku | Kurdish | 98MB | 62MB | 152MB | 108MB | |
ky | Kyrgyz | 629MB | 406MB | 485MB | 334MB | |
lo | Lao | 181MB | 118MB | 287MB | 163MB | |
la | Latin | 26MB | 8MB | 103MB | 9MB | |
lv | Latvian | 4GB | 1GB | 6GB | 2GB | |
lez | Lezghian | 3MB | 3MB | 2MB | 2MB | |
li | Limburgish | 29KB | 27KB | 76KB | 54KB | |
lt | Lithuanian | 9GB | 4GB | 12GB | 5GB | |
jbo | Lojban | 753KB | 694KB | 929KB | 731KB | |
lmo | Lombard | 454KB | 444KB | 1MB | 1MB | |
nds | Low German | 18MB | 13MB | 25MB | 17MB | |
dsb | Lower Sorbian | 13KB | 7KB | 31KB | 14KB | |
lb | Luxembourgish | 30MB | 21MB | 54MB | 37MB | |
mk | Macedonian | 2GB | 1GB | 3GB | 1GB | |
mai | Maithili | 324KB | 10KB | 685KB | 24KB | |
mg | Malagasy | 21MB | 13MB | 59MB | 38MB | |
ms | Malay | 116MB | 43MB | 146MB | 60MB | |
ml | Malayalam | 5GB | 2GB | 4GB | 2GB | |
mt | Maltese | 24MB | 17MB | 51MB | 26MB | |
gv | Manx | - | - | 1KB | 907B | |
mr | Marathi | 2GB | 1GB | 3GB | 1GB | |
mzn | Mazanderani | 708KB | 617KB | 1MB | 1MB | |
min | Minangkabau | 622KB | 317KB | 8MB | 1MB | |
xmf | Mingrelian | 6MB | 4MB | 16MB | 10MB | |
mwl | Mirandese | 1KB | 1KB | 3KB | 2KB | |
mn | Mongolian | 2GB | 879MB | 1GB | 912MB | |
nah | Nahuatl languages | 11KB | 10KB | 34KB | 21KB | |
nap | Neapolitan | 17KB | 13KB | 1KB | 1KB | {{< issue nap >}} |
ne | Nepali | 1GB | 1GB | 3GB | 2GB | |
new | Newari | 5MB | 4MB | 6MB | 4MB | |
frr | Northern Frisian | 4KB | 4KB | 7KB | 5KB | {{< issue frr >}} |
lrc | Northern Luri | 77KB | 64KB | 183B | 183B | |
no | Norwegian Bokmål | 8GB | 5GB | 9GB | 4GB | |
nn | Norwegian Nynorsk | 88MB | 56MB | 123MB | 66MB | |
oc | Occitan | 6MB | 3MB | 12MB | 5MB | |
or | Odia | 259MB | 196MB | 538MB | 357MB | |
os | Ossetic | 12MB | 10MB | 11MB | 6MB | |
pam | Pampanga | 763B | 307B | 3KB | 3KB | |
ps | Pashto | 378MB | 253MB | 404MB | 286MB | |
fa | Persian | 84GB | 39GB | 79GB | 35GB | |
pms | Piedmontese | 2MB | 1MB | 4MB | 3MB | |
pl | Polish | 116GB | 50GB | 122GB | 48GB | |
pt | Portuguese | 132GB | 67GB | 159GB | 71GB | |
pa | Punjabi | 799MB | 481MB | 769MB | 430MB | |
qu | Quechua | 80KB | 68KB | 322KB | 230KB | |
ro | Romanian | 26GB | 11GB | 37GB | 15GB | |
rm | Romansh | 7KB | 6KB | 3KB | 3KB | |
bxr | Russia Buriat | 12KB | 10KB | 22KB | 18KB | |
ru | Russian | 1239GB | 609GB | 1201GB | 542GB | |
rue | Rusyn | - | - | 247B | 247B | |
sah | Sakha | 43MB | 27MB | 57MB | 39MB | |
sa | Sanskrit | 96MB | 38MB | 72MB | 43MB | |
sco | Scots | - | - | 1KB | 1KB | {{< issue sco >}} |
gd | Scottish Gaelic | 1MB | 1MB | 2MB | 1MB | |
sr | Serbian | 4GB | 2GB | 6GB | 3GB | |
sh | Serbian (Latin) | 25MB | 6MB | 13MB | 9MB | |
scn | Sicilian | 3KB | 2KB | 4KB | 3KB | |
sd | Sindhi | 363MB | 274MB | 75MB | 50MB | |
si | Sinhala | 1GB | 840MB | 1GB | 791MB | |
sk | Slovak | 9GB | 4GB | 14GB | 6GB | |
sl | Slovenian | 2GB | 1GB | 4GB | 1GB | |
so | Somali | 62KB | 15KB | 15KB | 13KB | {{< issue so >}} |
azb | South Azerbaijani | 28MB | 19MB | 47MB | 29MB | |
es | Spanish | 297GB | 159GB | 342GB | 160GB | |
su | Sundanese | 216KB | 145KB | 397KB | 274KB | |
sw | Swahili | 13MB | 8MB | 11MB | 7MB | |
sv | Swedish | 46GB | 26GB | 43GB | 19GB | |
tg | Tajik | 396MB | 260MB | 985MB | 321MB | {{< issue tg >}} |
ta | Tamil | 9GB | 5GB | 10GB | 5GB | |
tt | Tatar | 701MB | 319MB | 947MB | 424MB | |
te | Telugu | 2GB | 1GB | 3GB | 1GB | |
th | Thai | 38GB | 17GB | 62GB | 26GB | |
bo | Tibetan | 195MB | 144MB | 439MB | 358MB | |
gsw[^1] | Alemannic German | 5MB | 2MB | 7MB | 5MB | |
tr | Turkish | 63GB | 28GB | 73GB | 33GB | {{< issue tr >}} |
tk | Turkmen | 10MB | 7MB | 25MB | 20MB | |
tyv | Tuvinian | 11KB | 8KB | 9KB | 7KB | |
uk | Ukrainian | 56GB | 29GB | 53GB | 28GB | |
eml | Emiliano-Romagnolo[^2] | 25KB | 23KB | 22KB | 20KB | |
hsb | Upper Sorbian | 4MB | 1MB | 2MB | 1MB | |
ur | Urdu | 2GB | 1GB | 2GB | 1GB | |
ug | Uyghur | 127MB | 86MB | 187MB | 123MB | |
uz | Uzbek | 21MB | 11MB | 56MB | 28MB | |
vec | Venetian | 18KB | 16KB | 37KB | 28KB | |
vi | Vietnamese | 72GB | 33GB | 87GB | 42GB | |
vo | Volapük | 2MB | 2MB | 2MB | 2MB | |
wa | Walloon | 280KB | 207KB | 511KB | 329KB | |
war | Waray | 2MB | 2MB | 4MB | 4MB | |
cy | Welsh | 223MB | 139MB | 307MB | 180MB | |
vls | West Flemish | - | - | 134B | 134B | {{< issue vls >}} |
fy | Western Frisian | 35MB | 26MB | 82MB | 57MB | |
mrj | Western Mari | 1MB | 1MB | 645KB | 521KB | |
pnb | Western Panjabi | 11MB | 9MB | 68MB | 45MB | |
wuu | Wu Chinese | 111KB | 32KB | 145KB | 69KB | {{< issue wuu >}} |
yi | Yiddish | 146MB | 87MB | 199MB | 93MB | |
yo | Yoruba | 56KB | 26KB | 229KB | 120KB | |
OSCAR Schema v1.1.0
The new OSCAR schema incorporates backward-compatible changes.
Changes
The old OSCAR Schema v1.0 featured the following file hierarchy, in an uncompressed form:
/
├── af
│ ├── af_sha256.txt
│ └── af.txt.gz
├── de
│ ├── de_sha256.txt # Checksum file
│ └── de.txt.gz # Textual content
├── en
│ ├── en_part_1.txt.gz # Multipart example
│ ├── en_part_2.txt.gz
│ └── en_sha256.txt
├── yi
│ ├── yi_sha256.txt
│ └── yi.txt.gz
└── zh
  ├── zh_sha256.txt
  └── zh.txt.gz
The new OSCAR Schema v1.1 features the following file hierarchy (some languages omitted):
/
├── af
│ ├── af_meta.jsonl.gz
│ ├── af_sha256.txt
│ └── af.txt.gz
├── de
│ ├── de_meta.jsonl.gz # Metadata, in JSONLines format
│ ├── de_sha256.txt # Checksum file
│ └── de.txt.gz # Textual content
├── en
│ ├── en_meta_part_1.jsonl.gz # Multipart example
│ ├── en_meta_part_2.jsonl.gz # Each part is independent:
│ ├── en_part_1.txt.gz # e.g. en_part_2.txt.gz goes with en_meta_part_2.jsonl.gz
│ ├── en_part_2.txt.gz
│ └── en_sha256.txt
├── yi
│ ├── yi_meta.jsonl.gz
│ ├── yi_sha256.txt
│ └── yi.txt.gz
└── zh
  ├── zh_meta.jsonl.gz
  ├── zh_sha256.txt
  └── zh.txt.gz
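The naming convention in the hierarchy above can be sketched in Python as follows. This is a minimal illustration, not part of the OSCAR tooling: the helper names `meta_name_for` and `pair_files` are hypothetical, and the pattern assumes language codes are made of lowercase letters only, as in the table above.

```python
import re
from pathlib import Path

# Matches "<lang>.txt.gz" or "<lang>_part_<n>.txt.gz" (assumed naming pattern).
TEXT_RE = re.compile(r"^(?P<lang>[a-z]+)(?:_part_(?P<part>\d+))?\.txt\.gz$")


def meta_name_for(text_name: str) -> str:
    """Return the metadata file name that goes with a text file name."""
    match = TEXT_RE.match(text_name)
    if match is None:
        raise ValueError(f"not an OSCAR text file name: {text_name}")
    lang, part = match.group("lang"), match.group("part")
    if part is None:
        return f"{lang}_meta.jsonl.gz"
    return f"{lang}_meta_part_{part}.jsonl.gz"


def pair_files(lang_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every text file in a language directory with its metadata file."""
    return [
        (text_path, lang_dir / meta_name_for(text_path.name))
        for text_path in sorted(lang_dir.glob("*.txt.gz"))
    ]
```

For instance, `meta_name_for("en_part_2.txt.gz")` gives `"en_meta_part_2.jsonl.gz"`, and `meta_name_for("de.txt.gz")` gives `"de_meta.jsonl.gz"`.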
File formats
.txt files
Lines are newline-separated, and documents are double-newline separated. In other words, there is a blank line between each document.
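As an illustration of this layout, the following Python sketch splits the decompressed content of a text file into documents. The function name `split_documents` is hypothetical, and the sketch assumes the whole file fits in memory.

```python
def split_documents(text: str) -> list[list[str]]:
    """Split file content into documents, each a list of its lines.

    A blank line marks the boundary between two documents.
    """
    documents: list[list[str]] = []
    current: list[str] = []
    for line in text.split("\n"):
        if line == "":
            # Blank line: close the current document, if any.
            if current:
                documents.append(current)
                current = []
        else:
            current.append(line)
    if current:
        documents.append(current)
    return documents
```

With the input `"a\nb\n\nc\n"`, this yields two documents: `["a", "b"]` and `["c"]`.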
.jsonl files
These are the metadata, in JSONLines format.
Each line conforms to the following JSON Schema:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Metadata",
"description": "Holds record headers.\n\nEach metadata is linked to a specific paragraph/text zone",
"type": "object",
"required": [
"headers",
"nb_sentences",
"offset"
],
"properties": {
"headers": {
"type": "object",
"additionalProperties": {
"type": "string"
}
},
"nb_sentences": {
"type": "integer",
"format": "uint",
"minimum": 0.0
},
"offset": {
"type": "integer",
"format": "uint",
"minimum": 0.0
}
}
}
Example:
{
"headers":{ // these header keys are *almost* always present.
"content-length":"11062", // the content length is not changed and reflects the
// length before filtering and possible deduplication.
"warc-target-uri":"...",
"warc-type":"conversion",
"content-type":"text/plain",
"warc-date":"2021-02-24T17:55:29Z", // Following WARC specification, it is the crawl date.
"warc-identified-content-language":"eng,zho",
"warc-refers-to":"<urn:uuid:c649de0e-42a3-4e69-b675-98e28e084698>",
"warc-block-digest":"sha1:V4PYYGYA6ZYA2WACDKSNL6NXGDN6XK6X",
"warc-record-id":"<urn:uuid:121a822f-5362-4559-8891-d085415cdd90>"
},
"offset":0, // Related text is in the text file, from line offset+1 to line offset+nb_sentences.
"nb_sentences":9
}
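The link between a metadata line and its text can be sketched in Python as below. This is a minimal example under stated assumptions: the function name `records` is hypothetical, the text file is already decompressed and split into lines, and `offset` counts every line of the text file, including the blank separator lines. Real metadata lines are plain JSON, without the explanatory comments shown in the example above.

```python
import json
from typing import Iterable, Iterator


def records(
    text_lines: list[str], meta_lines: Iterable[str]
) -> Iterator[tuple[dict, list[str]]]:
    """Yield (headers, sentences) for each metadata line.

    Lines offset+1 to offset+nb_sentences (1-indexed) are the 0-indexed
    slice text_lines[offset : offset + nb_sentences].
    """
    for raw in meta_lines:
        meta = json.loads(raw)
        start = meta["offset"]
        end = start + meta["nb_sentences"]
        yield meta["headers"], text_lines[start:end]
```

With the example above (`"offset":0`, `"nb_sentences":9`), the record covers the first nine lines of the text file.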
<lang>_sha256.txt files
These are used to check for possible corruption during download.
To verify the files, run `sha256sum -c <lang>_sha256.txt`.
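Where `sha256sum` is not available, the same check can be approximated in Python. This is a sketch, not an official tool: the function names `sha256_of` and `verify` are hypothetical, and the checksum file is assumed to use the usual `sha256sum` output format, with one `<hash>  <filename>` entry per line.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(checksum_file: Path) -> dict[str, bool]:
    """Map each file listed in the checksum file to True if its hash matches."""
    results: dict[str, bool] = {}
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading "*"
        actual = sha256_of(checksum_file.parent / name)
        results[name] = actual == expected.lower()
    return results
```

Files are resolved relative to the directory that contains the checksum file, so `verify(Path("af/af_sha256.txt"))` checks the files listed for `af` inside `af/`.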
[^1]: `gsw` is the ISO 639-2 code for Alemannic German. Earlier OSCAR versions identified it as `als`, due to a bug in fasttext.
[^2]: The `eml` identification tag is deprecated and corresponds to the `rgn` and `egl` tags in ISO 639-3.