OSCAR 23.01

OSCAR 23.01 is the January 2023 version of the OSCAR Corpus based on the November/December 2022 dump of Common Crawl. While being quite similar to OSCAR 22.01, it contains several new features, including KenLM-based adult content detection, precomputed Locality-Sensitive Hashes for near deduplication, and blocklist-based categories. OSCAR 23.01 has also moved from gzip to Zstandard compression. You might already have zstd installed on your system, but if not, please check the Zstandard website for installation instructions.

Tip

OSCAR 23.01 is similar to OSCAR 22.01. As such, please also check out the documentation for OSCAR 22.01 if you need detailed information about metadata.

Access

Note

If you already have access to the corpus, there's nothing to do! Go up in the file hierarchy on the link you've been given, and you should find the new corpus.

Access to the OSCAR Corpus changes depending on your status. More info on our dedicated page.

Getting access

New Features

KenLM-based Adult Content Filtering

For a select number of subcorpora, a measure of perplexity has been added. This perplexity comes from a KenLM model trained on harmful content, previously gathered by using the adult annotation in OSCAR 22.01. In other terms, the lower it is, the more likely a given document contains harmful/adult content.

Danger

This feature can be considered as unstable/unsafe, since we also want to evaluate its impact on particular issues.

As such, we do not provide a boolean value indicating if a given document can be harmful/adult content, but rather the raw perplexity. We have found a threshold that works well in English, but encourage you to experiment with it and to report back your findings.

Locality Sensitive Hashing

We use TLSH to compute a hash for each document.

Locality sensitive hashing is a hashing method that computes similar hashes for similar documents.

This can be used to do both exact- and near- deduplication. Same documents have same hashes (the reverse might not be true). So you only need to check for identity amongst documents with identical hashes. TLSH hashes can be compared to yield a distance metric. According to the original paper, a cutoff of < 40 yields a false positive rate of 0.07% and a detect rate of 49.6%, while a cutoff of < 100 yields a FP rate of 6.43% and detect rate of 94.5%. You should choose a value that meets your purposes.

The above is true for the default version of TLSH which is used in packages such as py-tlsh. OSCAR 23.01 uses a TLSH with a hyperparameter of 256 buckets (Full hash), and 3 byte checksums (collision rate : 1 in 5800) instead of 1 byte checksums (collision rate : 1 in 24).

If you would like to use py-tlsh, follow these instructions (You need CMake installed to perform the necessary modifications and build):

# download py-tlsh source package
pip download python-tlsh
# unpack the source tar.gz and enter the directory
tar -xvf python-tlsh-4.5.0.tar.gz && cd python-tlsh-4.5.0
# run the following command to implement the changes
# alternatively, you can use vi or a text editor
# change TLSH_BUCKETS_128 into TLSH_BUCKETS_256 and change TLSH_CHECKSUM_1B into TLSH_CHECKSUM_3B
sed -i 's/set(TLSH_BUCKETS_128 1)/set(TLSH_BUCKETS_256 1)/g; s/set(TLSH_CHECKSUM_1B 1)/set(TLSH_CHECKSUM_3B 1)/g' CMakeLists.txt

# build and activate pip venv if not already done
# python3 -m venv ~/.venv
source ~/.venv/bin/activate
# build and install the new py-tlsh
python3 setup.py install

Hashes are at metadata.tlsh.

Minor changes

metadata.annotations has been renamed metadata.quality_warnings, and only contains length based quality warnings (see the OSCAR 2201 documentation for details).
Some language tags have changed to better respect the BCP47:
- als has become gsw. Previously, als was erroneously used as the tag for Alemannic/Swiss German, whereas it is the tag for Tosk Albanian.
- eml has become x-eml. The eml tag is deprecated and as such has been replaced by a private tag (x-eml).

Layout

{
   "content":"English sentence\nphrase en français\n????????????", // (1)
   "warc_headers":{ // (2)
      "warc-identified-content-language":"fra,eng",
      "warc-target-uri":"https://fr.wikipedia.org/wiki/...",
      "warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>",
      "warc-type":"conversion",
      "content-length":"35298", // (3)
      "warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>",
      "warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3)
      "warc-date":"2022-11-26T09:45:47Z",
      "content-type":"text/plain"
   },
   "metadata":{
      "identification":{ // (4)
         "label":"fr",
         "prob":0.8938327
      },
      "harmful_pp":4063.1814, // (5)
      "tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6)
      "quality_warnings":[ // (7)
         "short_sentences",
         "header",
         "footer"
      ],
      "categories":[ // (8)
         "examen_pix",
         "liste_bu"
      ],
      "sentence_identifications":[ // (9)
         {
            "label":"fr",
            "prob":0.99837273
         },
         {
            "label":"en",
            "prob":0.9992377
         },
         null
      ]
   }
}

Some important notes:

Newline-separated content.
Headers from the crawled dumps are left untouched. See the WARC specification for more info.
Since warc_headers are copied and content can be altered by Ungoliant at generation stage, content-length and warc-block-digest can be different from actual values.
Document-level identification. Computation details can be found on the OSCAR 22.01 paper.
Perplexity of the document, computed using a KenLM model trained on harmful content. See this pre-print for more info. The lower this number is, the higher the probability that it will contain harmful or adult content. This annotation will be changed from harmful_pp to harmful_pplin future releases.
Locality Sensitive Hash of the documents' content, using TLSH. Useful for both exact and near deduplication.
(Corresponds to annotations pre-23.01) Potential quality warnings. Based on content/sentence length. See [OSCAR 22.01 paper for more info.
Blocklist-based categories. Uses the UT1 Blocklist, plus custom additions. Please refer to the UT1 website for categories description. Note that the categories are in French.
Sentence-level identifications. A null value means no identification with a good enough threshold (>0.8 on 23.01).

Language table

	Code	Language	# docs	# words	Content Length :
0	af	Afrikaans	23,994	6,217,024	37.2 MB
1	sq	Albanian	1,342,790	462,694,599	3.2 GB
2	am	Amharic	119,434	40,262,809	512.9 MB
3	ar	Arabic	25,012,116	10,081,452,882	110.7 GB
4	an	Aragonese	34	264	11.0 kB
5	hy	Armenian	1,056,974	336,045,041	4.9 GB
6	as	Assamese	89,542	24,395,215	412.1 MB
7	ast	Asturian	440	10,917	74.1 kB
8	av	Avaric	44	1,073	18.6 kB
9	az	Azerbaijani	1,159,994	316,850,330	3.0 GB
10	bn	Bangla	3,474,086	1,092,983,765	19.1 GB
11	ba	Bashkir	128,248	26,036,637	363.7 MB
12	eu	Basque	678,474	136,672,615	1.2 GB
13	be	Belarusian	445,612	164,729,607	2.3 GB
14	bh	Bihari languages	48	507	6.8 kB
15	bpy	Bishnupriya	2,346	346,947	5.4 MB
16	bs	Bosnian	20	395	3.0 kB
17	br	Breton	36,338	4,759,407	31.4 MB
18	bg	Bulgarian	8,933,998	3,635,273,738	44.1 GB
19	my	Burmese	430,276	82,433,836	3.0 GB
20	ca	Catalan	6,953,898	2,240,460,836	15.3 GB
21	ceb	Cebuano	16,174	6,263,404	41.1 MB
22	ckb	Central Kurdish	182,508	61,334,746	772.9 MB
23	ce	Chechen	11,686	1,051,752	13.9 MB
24	zh	Chinese	138,478,270	44,378,380,161	1.4 TB
25	cv	Chuvash	16,652	3,039,925	42.3 MB
26	kw	Cornish	8	80	432 Bytes
27	hr	Croatian	31,808	3,542,961	26.5 MB
28	cs	Czech	34,859,632	9,717,378,559	77.0 GB
29	da	Danish	7,214,338	2,217,634,340	14.8 GB
30	dv	Divehi	77,060	10,655,359	200.1 MB
31	nl	Dutch	72,552,688	19,564,553,306	135.0 GB
32	mhr	Eastern Mari	9,502	1,615,215	22.9 MB
33	arz	Egyptian Arabic	3,958	385,511	3.7 MB
34	en	English	1,235,510,986	523,869,288,690	3.4 TB
35	eo	Esperanto	226,924	67,774,923	474.8 MB
36	et	Estonian	3,601,904	938,296,892	8.0 GB
37	tl	Filipino	250,558	110,560,444	719.2 MB
38	fi	Finnish	14,471,710	4,198,143,883	41.1 GB
39	fr	French	158,334,998	62,127,088,294	430.5 GB
40	gl	Galician	248,762	38,345,625	255.7 MB
41	ka	Georgian	1,343,036	373,935,158	8.4 GB
42	de	German	206,598,430	73,848,586,648	594.7 GB
43	gom	Goan Konkani	398	121,035	2.3 MB
44	el	Greek	20,282,864	7,691,622,692	95.7 GB
45	gn	Guarani	14	260	2.2 kB
46	gu	Gujarati	425,552	417,001,705	5.6 GB
47	ht	Haitian Creole	2	20,671	93.1 kB
48	he	Hebrew	3,997,888	1,697,158,891	18.0 GB
49	hi	Hindi	5,514,454	2,475,605,444	32.6 GB
50	hu	Hungarian	21,349,372	16,013,364,289	150.1 GB
51	is	Icelandic	1,210,232	294,471,539	2.2 GB
52	io	Ido	224	2,598	16.1 kB
53	ilo	Iloko	144	4,411	28.0 kB
54	id	Indonesian	7,109,778	3,228,020,221	23.4 GB
55	ia	Interlingua	34	9,384	33.5 kB
56	ie	Interlingue	2	0	881 Bytes
57	ga	Irish	29,894	9,054,923	63.2 MB
58	it	Italian	89,021,606	36,327,274,203	259.4 GB
59	ja	Japanese	94,236,404	4,401,059,165	181.2 GB
60	jv	Javanese	172	3,286	25.7 kB
61	xal	Kalmyk	2	27	315 Bytes
62	kn	Kannada	448,500	124,924,350	2.6 GB
63	krc	Karachay-Balkar	496	8,385	122.4 kB
64	kk	Kazakh	677,622	214,679,857	3.3 GB
65	km	Khmer	450,660	59,880,231	3.2 GB
66	kv	Komi	460	5,909	70.3 kB
67	ko	Korean	15,147,698	3,435,866,935	38.1 GB
68	ku	Kurdish	80,338	25,921,607	174.1 MB
69	ky	Kyrgyz	144,288	32,062,783	489.3 MB
70	lo	Lao	118,374	10,659,203	472.1 MB
71	la	Latin	14,384	307,865	2.0 MB
72	lv	Latvian	2,435,882	845,459,899	7.4 GB
73	lez	Lezghian	676	60,634	856.6 kB
74	li	Limburgish	6	169	1.4 kB
75	lt	Lithuanian	5,182,028	1,674,362,574	14.5 GB
76	jbo	Lojban	572	312,315	1.5 MB
77	lmo	Lombard	112	3,269	21.0 kB
78	nds	Low German	5,248	1,612,175	10.7 MB
79	dsb	Lower Sorbian	8	84	664 Bytes
80	lb	Luxembourgish	18,090	2,514,838	18.4 MB
81	mk	Macedonian	1,063,298	389,344,425	4.7 GB
82	mai	Maithili	46	467	6.8 kB
83	mg	Malagasy	10,830	1,416,430	11.2 MB
84	ms	Malay	11,500	238,477	2.6 MB
85	ml	Malayalam	800,936	236,597,838	5.8 GB
86	mt	Maltese	5,180	149,886	1.3 MB
87	mr	Marathi	729,578	252,706,331	4.5 GB
88	mzn	Mazanderani	384	16,115	169.2 kB
89	min	Minangkabau	2,436	305,589	3.8 MB
90	xmf	Mingrelian	7,318	283,316	6.1 MB
91	mwl	Mirandese	4	54	423 Bytes
92	mn	Mongolian	1,061,710	454,350,415	5.8 GB
93	multi	Multilingual	2,948,202	1,251,676,406	11.9 GB
94	nah	Nahuatl languages	38	279	2.4 kB
95	ne	Nepali	1,152,156	278,901,036	4.9 GB
96	new	Newari	1,996	229,703	4.0 MB
97	no	Norwegian	2,797,378	373,160,033	2.6 GB
98	nn	Norwegian Nynorsk	19,470	575,518	3.7 MB
99	oc	Occitan	920	34,701	405.0 kB
100	or	Odia	158,426	31,963,340	543.1 MB
101	os	Ossetic	8,628	3,935,964	50.7 MB
102	ps	Pashto	87,408	30,196,179	261.6 MB
103	fa	Persian	23,813,882	9,609,206,698	93.2 GB
104	pms	Piedmontese	2,524	510,087	3.1 MB
105	pl	Polish	57,184,826	18,073,705,588	147.1 GB
106	pt	Portuguese	36,062,800	15,172,557,311	105.0 GB
107	pa	Punjabi	222,058	104,235,418	1.4 GB
108	qu	Quechua	2	13	143 Bytes
109	ro	Romanian	11,985,668	6,302,600,833	45.6 GB
110	bxr	Russia Buriat	72	698	8.2 kB
111	ru	Russian	194,143,422	78,032,029,344	1.1 TB
112	sah	Sakha	17,566	4,288,051	68.8 MB
113	sa	Sanskrit	16,802	2,479,345	56.3 MB
114	gd	Scottish Gaelic	776	18,458	146.1 kB
115	sr	Serbian	1,677,896	632,781,822	7.7 GB
116	sh	Serbian (Latin)	3,214	166,517	816.4 kB
117	sd	Sindhi	48,566	14,667,207	131.6 MB
118	si	Sinhala	301,066	172,755,385	2.6 GB
119	sk	Slovak	8,931,784	2,704,716,280	21.5 GB
120	sl	Slovenian	1,112,560	192,816,743	1.4 GB
121	so	Somali	6	51	503 Bytes
122	azb	South Azerbaijani	26,364	2,029,729	28.4 MB
123	es	Spanish	153,574,556	63,388,237,965	429.9 GB
124	su	Sundanese	18	258	2.0 kB
125	sw	Swahili	1,664	164,459	1.0 MB
126	sv	Swedish	21,891,348	6,993,719,601	50.0 GB
127	gsw	Swiss German	342	34,328	232.7 kB
128	tg	Tajik	144,932	76,987,285	1.0 GB
129	ta	Tamil	1,638,238	738,824,392	15.8 GB
130	tt	Tatar	262,654	59,253,765	833.8 MB
131	te	Telugu	644,712	201,575,815	3.9 GB
132	th	Thai	14,845,900	2,224,483,018	92.0 GB
133	bo	Tibetan	62,352	6,062,558	531.6 MB
134	tr	Turkish	26,654,330	8,290,890,087	73.7 GB
135	tk	Turkmen	4,576	325,786	3.3 MB
136	uk	Ukrainian	10,059,992	3,183,842,018	44.7 GB
137	x-eml	Emiliano-Romagnol	4	329	1.8 kB
138	hsb	Upper Sorbian	402	15,827	123.2 kB
139	ur	Urdu	887,004	434,023,273	3.8 GB
140	ug	Uyghur	51,304	14,659,554	219.8 MB
141	uz	Uzbek	15,806	1,665,960	15.3 MB
142	vi	Vietnamese	33,933,994	22,424,984,210	140.8 GB
143	vo	Volapük	896	49,968	371.9 kB
144	wa	Walloon	390	6,347	34.3 kB
145	war	Waray	1,494	19,665	126.8 kB
146	cy	Welsh	151,512	52,250,043	333.0 MB
147	fy	Western Frisian	45,458	9,885,788	70.4 MB
148	mrj	Western Mari	496	60,180	765.8 kB
149	pnb	Western Panjabi	12,904	11,844,695	105.8 MB
150	wuu	Wu Chinese	136	1,199	26.8 kB
151	yi	Yiddish	47,438	14,287,370	171.7 MB
152	yo	Yoruba	128	2,396	16.6 kB