OSCAR 21.09

Features

These are the versions of tooling, schemes and data

CommonCrawl version: February/March 2021 (2021.10)
OSCAR Schema version: v1.1 : Incorporates metadata in a backward compatible manner.
Ungoliant version: v1 : New generation tool, faster and better documented/tested than the previous one: goclassy.

Changes

As per OSCAR Schema v1.1, each document/record has associated metadata.
New languages: Manx, Rusyn, Scots and West Flemish. Their size and quality still has to be assessed.
Removed languages: Central Bikol and Cantonese. Cantonsese was of a very low quality. Central Bikol corpus is still available on OSCAR 2019.

Table

	Language	OSCAR 2019	OSCAR 2019 deduplicated	OSCAR 21.09	OSCAR 21.09 deduplicated	Issues
af	Afrikaans	251MB	170MB	258MB	157MB
sq	Albanian	2GB	1GB	3GB	1GB
am	Amharic	377MB	215MB	405MB	241MB
ar	Arabic	87GB	33GB	69GB	35GB
an	Aragonese	1MB	822KB	1MB	608KB
hy	Armenian	3GB	1GB	4GB	1GB
as	Assamese	117MB	73MB	135MB	95MB
ast	Asturian	2MB	2MB	7MB	4MB
av	Avaric	418KB	331KB	421KB	325KB
az	Azerbaijani	2GB	1GB	3GB	1GB
bn	Bangla	10GB	6GB	14GB	7GB
ba	Bashkir	133MB	93MB	110MB	77MB
eu	Basque	889MB	358MB	900MB	503MB
bar	Bavarian	507B	507B	2KB	1KB
be	Belarusian	1GB	1GB	2GB	1GB
bh	Bihari languages	112KB	34KB	579KB	120KB
bpy	Bishnupriya	4MB	1MB	11MB	4MB
bs	Bosnian	459KB	120KB	310KB	175KB
br	Breton	29MB	16MB	49MB	23MB
bg	Bulgarian	33GB	14GB	34GB	15GB
my	Burmese	2GB	1GB	2GB	1GB
yue	Cantonese	3KB	2KB	-	-
ca	Catalan	8GB	4GB	13GB	6GB
ceb	Cebuano	40MB	24MB	81MB	58MB
bcl	Central Bikol	886B	886B	-	-
ckb	Central Kurdish	509MB	236MB	784MB	367MB
cbk	Chavacano	521B	521B	168B	168B	{{< issue cbk >}}
ce	Chechen	8MB	6MB	29MB	20MB
zh	Chinese	544GB	267GB	500GB	266GB
cv	Chuvash	40MB	27MB	60MB	41MB
kw	Cornish	44KB	14KB	119KB	72KB
hr	Croatian	237MB	115MB	361MB	169MB
cs	Czech	56GB	25GB	72GB	33GB
da	Danish	16GB	10GB	18GB	10GB
diq	Dimli (individual language)	147B	147B	294B	147B
dv	Divehi	131MB	81MB	143MB	111MB
nl	Dutch	82GB	41GB	97GB	47GB
mhr	Eastern Mari	7MB	6MB	15MB	10MB
arz	Egyptian Arabic	68MB	34MB	48MB	21MB
en	English	2520GB	1294GB	2936GB	1342GB
myv	Erzya	1KB	1KB	29KB	2KB
eo	Esperanto	312MB	238MB	560MB	390MB
et	Estonian	5GB	2GB	7GB	3GB
tl	Filipino	601MB	426MB	699MB	383MB
fi	Finnish	28GB	13GB	35GB	20GB
fr	French	302GB	147GB	340GB	161GB
gl	Galician	650MB	402MB	989MB	549MB
ka	Georgian	3GB	1GB	6GB	2GB
de	German	330GB	155GB	433GB	184GB
gom	Goan Konkani	2MB	1MB	3MB	2MB
el	Greek	66GB	28GB	72GB	30GB
gn	Guarani	36KB	23KB	32KB	25KB
gu	Gujarati	1GB	756MB	1GB	950MB
ht	Haitian Creole	3KB	3KB	2KB	1KB
he	Hebrew	21GB	10GB	29GB	11GB
hi	Hindi	17GB	9GB	26GB	13GB
hu	Hungarian	42GB	18GB	60GB	29GB
is	Icelandic	1GB	887MB	2GB	1GB
io	Ido	151KB	133KB	276KB	221KB
ilo	Iloko	896KB	653KB	1MB	857KB
id	Indonesian	32GB	16GB	40GB	22GB
ia	Interlingua	678KB	368KB	291KB	172KB
ie	Interlingue	24KB	1KB	7KB	2KB
ga	Irish	91MB	62MB	131MB	69MB
it	Italian	146GB	73GB	192GB	94GB
ja	Japanese	231GB	112GB	208GB	96GB
jv	Javanese	675KB	598KB	858KB	728KB
xal	Kalmyk	115KB	114KB	62KB	62KB
kn	Kannada	1GB	1GB	2GB	1GB
krc	Karachay-Balkar	2MB	2MB	2MB	2MB
kk	Kazakh	2GB	1GB	3GB	1GB
km	Khmer	1GB	608MB	1GB	860MB
kv	Komi	2MB	1MB	1MB	588KB
ko	Korean	25GB	11GB	35GB	15GB
ku	Kurdish	98MB	62MB	152MB	108MB
ky	Kyrgyz	629MB	406MB	485MB	334MB
lo	Lao	181MB	118MB	287MB	163MB
la	Latin	26MB	8MB	103MB	9MB
lv	Latvian	4GB	1GB	6GB	2GB
lez	Lezghian	3MB	3MB	2MB	2MB
li	Limburgish	29KB	27KB	76KB	54KB
lt	Lithuanian	9GB	4GB	12GB	5GB
jbo	Lojban	753KB	694KB	929KB	731KB
lmo	Lombard	454KB	444KB	1MB	1MB
nds	Low German	18MB	13MB	25MB	17MB
dsb	Lower Sorbian	13KB	7KB	31KB	14KB
lb	Luxembourgish	30MB	21MB	54MB	37MB
mk	Macedonian	2GB	1GB	3GB	1GB
mai	Maithili	324KB	10KB	685KB	24KB
mg	Malagasy	21MB	13MB	59MB	38MB
ms	Malay	116MB	43MB	146MB	60MB
ml	Malayalam	5GB	2GB	4GB	2GB
mt	Maltese	24MB	17MB	51MB	26MB
gv	Manx	-	-	1KB	907B
mr	Marathi	2GB	1GB	3GB	1GB
mzn	Mazanderani	708KB	617KB	1MB	1MB
min	Minangkabau	622KB	317KB	8MB	1MB
xmf	Mingrelian	6MB	4MB	16MB	10MB
mwl	Mirandese	1KB	1KB	3KB	2KB
mn	Mongolian	2GB	879MB	1GB	912MB
nah	Nahuatl languages	11KB	10KB	34KB	21KB
nap	Neapolitan	17KB	13KB	1KB	1KB	{{< issue nap >}}
ne	Nepali	1GB	1GB	3GB	2GB
new	Newari	5MB	4MB	6MB	4MB
frr	Northern Frisian	4KB	4KB	7KB	5KB	{{< issue frr >}}
lrc	Northern Luri	77KB	64KB	183B	183B
no	Norwegian Bokmål	8GB	5GB	9GB	4GB
nn	Norwegian Nynorsk	88MB	56MB	123MB	66MB
oc	Occitan	6MB	3MB	12MB	5MB
or	Odia	259MB	196MB	538MB	357MB
os	Ossetic	12MB	10MB	11MB	6MB
pam	Pampanga	763B	307B	3KB	3KB
ps	Pashto	378MB	253MB	404MB	286MB
fa	Persian	84GB	39GB	79GB	35GB
pms	Piedmontese	2MB	1MB	4MB	3MB
pl	Polish	116GB	50GB	122GB	48GB
pt	Portuguese	132GB	67GB	159GB	71GB
pa	Punjabi	799MB	481MB	769MB	430MB
qu	Quechua	80KB	68KB	322KB	230KB
ro	Romanian	26GB	11GB	37GB	15GB
rm	Romansh	7KB	6KB	3KB	3KB
bxr	Russia Buriat	12KB	10KB	22KB	18KB
ru	Russian	1239GB	609GB	1201GB	542GB
rue	Rusyn	-	-	247B	247B
sah	Sakha	43MB	27MB	57MB	39MB
sa	Sanskrit	96MB	38MB	72MB	43MB
sco	Scots	-	-	1KB	1KB	{{< issue sco >}}
gd	Scottish Gaelic	1MB	1MB	2MB	1MB
sr	Serbian	4GB	2GB	6GB	3GB
sh	Serbian (Latin)	25MB	6MB	13MB	9MB
scn	Sicilian	3KB	2KB	4KB	3KB
sd	Sindhi	363MB	274MB	75MB	50MB
si	Sinhala	1GB	840MB	1GB	791MB
sk	Slovak	9GB	4GB	14GB	6GB
sl	Slovenian	2GB	1GB	4GB	1GB
so	Somali	62KB	15KB	15KB	13KB	{{< issue so >}}
azb	South Azerbaijani	28MB	19MB	47MB	29MB
es	Spanish	297GB	159GB	342GB	160GB
su	Sundanese	216KB	145KB	397KB	274KB
sw	Swahili	13MB	8MB	11MB	7MB
sv	Swedish	46GB	26GB	43GB	19GB
tg	Tajik	396MB	260MB	985MB	321MB	{{< issue tg >}}
ta	Tamil	9GB	5GB	10GB	5GB
tt	Tatar	701MB	319MB	947MB	424MB
te	Telugu	2GB	1GB	3GB	1GB
th	Thai	38GB	17GB	62GB	26GB
bo	Tibetan	195MB	144MB	439MB	358MB
gsw[^1]	Alemannic German	5MB	2MB	7MB	5MB
tr	Turkish	63GB	28GB	73GB	33GB	{{< issue tr >}}
tk	Turkmen	10MB	7MB	25MB	20MB
tyv	Tuvinian	11KB	8KB	9KB	7KB
uk	Ukrainian	56GB	29GB	53GB	28GB
eml	Emiliano-Romagnolo[^2]	25KB	23KB	22KB	20KB
hsb	Upper Sorbian	4MB	1MB	2MB	1MB
ur	Urdu	2GB	1GB	2GB	1GB
ug	Uyghur	127MB	86MB	187MB	123MB
uz	Uzbek	21MB	11MB	56MB	28MB
vec	Venetian	18KB	16KB	37KB	28KB
vi	Vietnamese	72GB	33GB	87GB	42GB
vo	Volapük	2MB	2MB	2MB	2MB
wa	Walloon	280KB	207KB	511KB	329KB
war	Waray	2MB	2MB	4MB	4MB
cy	Welsh	223MB	139MB	307MB	180MB
vls	West Flemish	-	-	134B	134B	{{< issue vls >}}
fy	Western Frisian	35MB	26MB	82MB	57MB
mrj	Western Mari	1MB	1MB	645KB	521KB
pnb	Western Panjabi	11MB	9MB	68MB	45MB
wuu	Wu Chinese	111KB	32KB	145KB	69KB	{{< issue wuu >}}
yi	Yiddish	146MB	87MB	199MB	93MB
yo	Yoruba	56KB	26KB	229KB	120KB

OSCAR Schema v1.1.0

The new OSCAR schema incorporates backward-compatible changes.

Changes

The old OSCAR Schema v1.0 featured the following file hierarchy, in an uncompressed form:

/
├── af
│   ├── af_sha256.txt
│   └── af.txt.gz
├── de
│   ├── de_sha256.txt    # Checksum file 
│   └── de.txt.gz        # Textual content
├── en
│   ├── en_part_1.txt.gz        # Multipart example
│   ├── en_part_2.txt.gz
│   └── en_sha256.txt
├── yi
│   ├── yi_sha256.txt
│   └── yi.txt.gz
└── zh
    ├── zh_sha256.txt
    └── zh.txt.gz

The new OSCAR Schema v1.1 features the following file hierarchy (some languages omitted):

/
├── af
│   ├── af_meta.jsonl.gz
│   ├── af_sha256.txt
│   └── af.txt.gz
├── de
│   ├── de_meta.jsonl.gz # Metadata, in JSONLines format
│   ├── de_sha256.txt    # Checksum file 
│   └── de.txt.gz        # Textual content
├── en
│   ├── en_meta_part_1.jsonl.gz # Multipart example
│   ├── en_meta_part_2.jsonl.gz # Each part is independent,
│   ├── en_part_1.txt.gz        # Ex: en_part_2.txt.gz and en_meta_part_2.jsonl.gz
│   ├── en_part_2.txt.gz
│   └── en_sha256.txt
├── yi
│   ├── yi_meta.jsonl.gz
│   ├── yi_sha256.txt
│   └── yi.txt.gz
└── zh
    ├── zh_meta.jsonl.gz
    ├── zh_sha256.txt
    └── zh.txt.gz

File formats

`.txt` files

Lines are newline-separated, and documents are double-newline separated. In other terms, there is a blank line between each document.

`.jsonl` files

These are the metadata, in JSONLines format.

Each line follows the following JSON Scheme:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Metadata",
  "description": "Holds record headers.\n\nEach metadata is linked to a specific paragraph/text zone",
  "type": "object",
  "required": [
    "headers",
    "nb_sentences",
    "offset"
  ],
  "properties": {
    "headers": {
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    },
    "nb_sentences": {
      "type": "integer",
      "format": "uint",
      "minimum": 0.0
    },
    "offset": {
      "type": "integer",
      "format": "uint",
      "minimum": 0.0
    }
  }
}

Example:

{
   "headers":{                  // these headers keys are *almost* always present.
      "content-length":"11062", // the content length is not changed and reflects the 
                                // length before filtering and eventual deduplication.
      "warc-target-uri":"...",
      "warc-type":"conversion",
      "content-type":"text/plain",
      "warc-date":"2021-02-24T17:55:29Z", // Following WARC specification, it is the crawl date.
      "warc-identified-content-language":"eng,zho",
      "warc-refers-to":"<urn:uuid:c649de0e-42a3-4e69-b675-98e28e084698>",
      "warc-block-digest":"sha1:V4PYYGYA6ZYA2WACDKSNL6NXGDN6XK6X",
      "warc-record-id":"<urn:uuid:121a822f-5362-4559-8891-d085415cdd90>"
   },
   "offset":0, // Related text is in the text file, from lines offset+1 to lines offset+nb_sentences.
   "nb_sentences":9
}

`<lang>_sha256.txt` files

These are used to check for eventual corruption during download. They can be used by running sha256sum -c <lang>_sha256.txt.

[^1]: gsw is ISO 639-2 for Alemannic German. It was previously identified as als in previous OSCAR versions, due to a bug in fasttext. [^2]: eml identification tag is deprecated and corresponds to rgn and egl tags in ISO 639-3

OSCAR 21.09

Features

Changes

Table

OSCAR Schema v1.1.0

Changes

File formats

.txt files

.jsonl files

<lang>_sha256.txt files

`.txt` files

`.jsonl` files

`<lang>_sha256.txt` files