OSCAR 23.01
OSCAR 23.01 is the January 2023 version of the OSCAR Corpus based on the November/December 2022 dump of Common Crawl. While being quite similar to OSCAR 22.01, it contains several new features, including KenLM-based adult content detection, precomputed Locality-Sensitive Hashes for near deduplication, and blocklist-based categories. OSCAR 23.01 has also moved from gzip to Zstandard compression. You might already have zstd
installed on your system, but if not, please check the Zstandard website for installation instructions.
Tip
OSCAR 23.01 is similar to OSCAR 22.01. As such, please also check out the documentation for OSCAR 22.01 if you need detailed information about metadata.
Access
Note
If you already have access to the corpus, there's nothing to do! Go up in the file hierarchy on the link you've been given, and you should find the new corpus.
Access to the OSCAR Corpus changes depending on your status. More info on our dedicated page.
New Features
Categories
OSCAR 22.01 leveraged the UT1 Blocklists project to attempt to classify some adult content present in OSCAR. The OSCAR 23.01 pipeline iterated on this to include all of the blocklists provided by UT1.
Warning
The UT1 Blocklists page lists all the categories along with a short description. We strongly encourage you to read the descriptions if you plan on using them. Please also note that these descriptions are in French. We're working on an English translation of them.
Note
A document can belong to multiple categories.
These categories are in a field that is at this path: metadata.categories
.
KenLM-based Adult Content Filtering
For a select number of subcorpora, a measure of perplexity has been added. This perplexity comes from a KenLM model trained on harmful content, previously gathered by using the adult
annotation in OSCAR 22.01.
In other terms, the lower it is, the more likely a given document contains harmful/adult content.
Danger
This feature can be considered as unstable/unsafe, since we also want to evaluate its impact on particular issues.
As such, we do not provide a boolean value indicating if a given document can be harmful/adult content, but rather the raw perplexity. We have found a threshold that works well in English, but encourage you to experiment with it and to report back your findings.
Locality Sensitive Hashing
We use TLSH to compute a hash for each document.
Locality sensitive hashing is a hashing method that computes similar hashes for similar documents.
This can be used to do both exact- and near- deduplication. Same documents have same hashes (the reverse might not be true). So you only need to check for identity amongst documents with identical hashes. TLSH hashes can be compared to yield a distance metric. According to the original paper, a cutoff of < 40 yields a false positive rate of 0.07% and a detect rate of 49.6%, while a cutoff of < 100 yields a FP rate of 6.43% and detect rate of 94.5%. You should choose a value that meets your purposes.
The above is true for the default version of TLSH which is used in packages such as py-tlsh
. OSCAR 23.01 uses a TLSH with a hyperparameter of 256 buckets (Full hash), and 3 byte checksums (collision rate : 1 in 5800) instead of 1 byte checksums (collision rate : 1 in 24).
If you would like to use py-tlsh
, follow these instructions (You need CMake
installed to perform the necessary modifications and build):
# download py-tlsh source package
pip download python-tlsh
# unpack the source tar.gz and enter the directory
tar -xvf python-tlsh-4.5.0.tar.gz && cd python-tlsh-4.5.0
# run the following command to implement the changes
# alternatively, you can use vi or a text editor
# change TLSH_BUCKETS_128 into TLSH_BUCKETS_256 and change TLSH_CHECKSUM_1B into TLSH_CHECKSUM_3B
sed -i 's/set(TLSH_BUCKETS_128 1)/set(TLSH_BUCKETS_256 1)/g; s/set(TLSH_CHECKSUM_1B 1)/set(TLSH_CHECKSUM_3B 1)/g' CMakeLists.txt
# build and activate pip venv if not already done
# python3 -m venv ~/.venv
source ~/.venv/bin/activate
# build and install the new py-tlsh
python3 setup.py install
Hashes are at metadata.tlsh
.
Minor changes
metadata.annotations
has been renamedmetadata.quality_warnings
, and only contains length based quality warnings (see the OSCAR 2201 documentation for details).- Some language tags have changed to better respect the BCP47:
als
has becomegsw
. Previously,als
was erroneously used as the tag for Alemannic/Swiss German, whereas it is the tag for Tosk Albanian.eml
has becomex-eml
. Theeml
tag is deprecated and as such has been replaced by a private tag (x-eml
).
Layout
{
"content":"English sentence\nphrase en français\n????????????", // (1)
"warc_headers":{ // (2)
"warc-identified-content-language":"fra,eng",
"warc-target-uri":"https://fr.wikipedia.org/wiki/...",
"warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>",
"warc-type":"conversion",
"content-length":"35298", // (3)
"warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>",
"warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3)
"warc-date":"2022-11-26T09:45:47Z",
"content-type":"text/plain"
},
"metadata":{
"identification":{ // (4)
"label":"fr",
"prob":0.8938327
},
"harmful_pp":4063.1814, // (5)
"tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6)
"quality_warnings":[ // (7)
"short_sentences",
"header",
"footer"
],
"categories":[ // (8)
"examen_pix",
"liste_bu"
],
"sentence_identifications":[ // (9)
{
"label":"fr",
"prob":0.99837273
},
{
"label":"en",
"prob":0.9992377
},
null
]
}
}
Some important notes:
- Newline-separated content.
- Headers from the crawled dumps are left untouched. See the WARC specification for more info.
- Since
warc_headers
are copied and content can be altered by Ungoliant at generation stage,content-length
andwarc-block-digest
can be different from actual values. - Document-level identification. Computation details can be found on the OSCAR 22.01 paper.
- Perplexity of the document, computed using a KenLM model trained on harmful content. See this pre-print for more info. The lower this number is, the higher the probability that it will contain harmful or adult content. This annotation will be changed from
harmful_pp
toharmful_ppl
in future releases. - Locality Sensitive Hash of the documents' content, using TLSH. Useful for both exact and near deduplication.
- (Corresponds to
annotations
pre-23.01) Potential quality warnings. Based on content/sentence length. See [OSCAR 22.01 paper for more info. - Blocklist-based categories. Uses the UT1 Blocklist, plus custom additions. Please refer to the UT1 website for categories description. Note that the categories are in French.
- Sentence-level identifications. A
null
value means no identification with a good enough threshold (>0.8 on 23.01).
Language table
Code | Language | # docs | # words | Content Length : | |
---|---|---|---|---|---|
0 | af | Afrikaans | 23,994 | 6,217,024 | 37.2 MB |
1 | sq | Albanian | 1,342,790 | 462,694,599 | 3.2 GB |
2 | am | Amharic | 119,434 | 40,262,809 | 512.9 MB |
3 | ar | Arabic | 25,012,116 | 10,081,452,882 | 110.7 GB |
4 | an | Aragonese | 34 | 264 | 11.0 kB |
5 | hy | Armenian | 1,056,974 | 336,045,041 | 4.9 GB |
6 | as | Assamese | 89,542 | 24,395,215 | 412.1 MB |
7 | ast | Asturian | 440 | 10,917 | 74.1 kB |
8 | av | Avaric | 44 | 1,073 | 18.6 kB |
9 | az | Azerbaijani | 1,159,994 | 316,850,330 | 3.0 GB |
10 | bn | Bangla | 3,474,086 | 1,092,983,765 | 19.1 GB |
11 | ba | Bashkir | 128,248 | 26,036,637 | 363.7 MB |
12 | eu | Basque | 678,474 | 136,672,615 | 1.2 GB |
13 | be | Belarusian | 445,612 | 164,729,607 | 2.3 GB |
14 | bh | Bihari languages | 48 | 507 | 6.8 kB |
15 | bpy | Bishnupriya | 2,346 | 346,947 | 5.4 MB |
16 | bs | Bosnian | 20 | 395 | 3.0 kB |
17 | br | Breton | 36,338 | 4,759,407 | 31.4 MB |
18 | bg | Bulgarian | 8,933,998 | 3,635,273,738 | 44.1 GB |
19 | my | Burmese | 430,276 | 82,433,836 | 3.0 GB |
20 | ca | Catalan | 6,953,898 | 2,240,460,836 | 15.3 GB |
21 | ceb | Cebuano | 16,174 | 6,263,404 | 41.1 MB |
22 | ckb | Central Kurdish | 182,508 | 61,334,746 | 772.9 MB |
23 | ce | Chechen | 11,686 | 1,051,752 | 13.9 MB |
24 | zh | Chinese | 138,478,270 | 44,378,380,161 | 1.4 TB |
25 | cv | Chuvash | 16,652 | 3,039,925 | 42.3 MB |
26 | kw | Cornish | 8 | 80 | 432 Bytes |
27 | hr | Croatian | 31,808 | 3,542,961 | 26.5 MB |
28 | cs | Czech | 34,859,632 | 9,717,378,559 | 77.0 GB |
29 | da | Danish | 7,214,338 | 2,217,634,340 | 14.8 GB |
30 | dv | Divehi | 77,060 | 10,655,359 | 200.1 MB |
31 | nl | Dutch | 72,552,688 | 19,564,553,306 | 135.0 GB |
32 | mhr | Eastern Mari | 9,502 | 1,615,215 | 22.9 MB |
33 | arz | Egyptian Arabic | 3,958 | 385,511 | 3.7 MB |
34 | en | English | 1,235,510,986 | 523,869,288,690 | 3.4 TB |
35 | eo | Esperanto | 226,924 | 67,774,923 | 474.8 MB |
36 | et | Estonian | 3,601,904 | 938,296,892 | 8.0 GB |
37 | tl | Filipino | 250,558 | 110,560,444 | 719.2 MB |
38 | fi | Finnish | 14,471,710 | 4,198,143,883 | 41.1 GB |
39 | fr | French | 158,334,998 | 62,127,088,294 | 430.5 GB |
40 | gl | Galician | 248,762 | 38,345,625 | 255.7 MB |
41 | ka | Georgian | 1,343,036 | 373,935,158 | 8.4 GB |
42 | de | German | 206,598,430 | 73,848,586,648 | 594.7 GB |
43 | gom | Goan Konkani | 398 | 121,035 | 2.3 MB |
44 | el | Greek | 20,282,864 | 7,691,622,692 | 95.7 GB |
45 | gn | Guarani | 14 | 260 | 2.2 kB |
46 | gu | Gujarati | 425,552 | 417,001,705 | 5.6 GB |
47 | ht | Haitian Creole | 2 | 20,671 | 93.1 kB |
48 | he | Hebrew | 3,997,888 | 1,697,158,891 | 18.0 GB |
49 | hi | Hindi | 5,514,454 | 2,475,605,444 | 32.6 GB |
50 | hu | Hungarian | 21,349,372 | 16,013,364,289 | 150.1 GB |
51 | is | Icelandic | 1,210,232 | 294,471,539 | 2.2 GB |
52 | io | Ido | 224 | 2,598 | 16.1 kB |
53 | ilo | Iloko | 144 | 4,411 | 28.0 kB |
54 | id | Indonesian | 7,109,778 | 3,228,020,221 | 23.4 GB |
55 | ia | Interlingua | 34 | 9,384 | 33.5 kB |
56 | ie | Interlingue | 2 | 0 | 881 Bytes |
57 | ga | Irish | 29,894 | 9,054,923 | 63.2 MB |
58 | it | Italian | 89,021,606 | 36,327,274,203 | 259.4 GB |
59 | ja | Japanese | 94,236,404 | 4,401,059,165 | 181.2 GB |
60 | jv | Javanese | 172 | 3,286 | 25.7 kB |
61 | xal | Kalmyk | 2 | 27 | 315 Bytes |
62 | kn | Kannada | 448,500 | 124,924,350 | 2.6 GB |
63 | krc | Karachay-Balkar | 496 | 8,385 | 122.4 kB |
64 | kk | Kazakh | 677,622 | 214,679,857 | 3.3 GB |
65 | km | Khmer | 450,660 | 59,880,231 | 3.2 GB |
66 | kv | Komi | 460 | 5,909 | 70.3 kB |
67 | ko | Korean | 15,147,698 | 3,435,866,935 | 38.1 GB |
68 | ku | Kurdish | 80,338 | 25,921,607 | 174.1 MB |
69 | ky | Kyrgyz | 144,288 | 32,062,783 | 489.3 MB |
70 | lo | Lao | 118,374 | 10,659,203 | 472.1 MB |
71 | la | Latin | 14,384 | 307,865 | 2.0 MB |
72 | lv | Latvian | 2,435,882 | 845,459,899 | 7.4 GB |
73 | lez | Lezghian | 676 | 60,634 | 856.6 kB |
74 | li | Limburgish | 6 | 169 | 1.4 kB |
75 | lt | Lithuanian | 5,182,028 | 1,674,362,574 | 14.5 GB |
76 | jbo | Lojban | 572 | 312,315 | 1.5 MB |
77 | lmo | Lombard | 112 | 3,269 | 21.0 kB |
78 | nds | Low German | 5,248 | 1,612,175 | 10.7 MB |
79 | dsb | Lower Sorbian | 8 | 84 | 664 Bytes |
80 | lb | Luxembourgish | 18,090 | 2,514,838 | 18.4 MB |
81 | mk | Macedonian | 1,063,298 | 389,344,425 | 4.7 GB |
82 | mai | Maithili | 46 | 467 | 6.8 kB |
83 | mg | Malagasy | 10,830 | 1,416,430 | 11.2 MB |
84 | ms | Malay | 11,500 | 238,477 | 2.6 MB |
85 | ml | Malayalam | 800,936 | 236,597,838 | 5.8 GB |
86 | mt | Maltese | 5,180 | 149,886 | 1.3 MB |
87 | mr | Marathi | 729,578 | 252,706,331 | 4.5 GB |
88 | mzn | Mazanderani | 384 | 16,115 | 169.2 kB |
89 | min | Minangkabau | 2,436 | 305,589 | 3.8 MB |
90 | xmf | Mingrelian | 7,318 | 283,316 | 6.1 MB |
91 | mwl | Mirandese | 4 | 54 | 423 Bytes |
92 | mn | Mongolian | 1,061,710 | 454,350,415 | 5.8 GB |
93 | multi | Multilingual | 2,948,202 | 1,251,676,406 | 11.9 GB |
94 | nah | Nahuatl languages | 38 | 279 | 2.4 kB |
95 | ne | Nepali | 1,152,156 | 278,901,036 | 4.9 GB |
96 | new | Newari | 1,996 | 229,703 | 4.0 MB |
97 | no | Norwegian | 2,797,378 | 373,160,033 | 2.6 GB |
98 | nn | Norwegian Nynorsk | 19,470 | 575,518 | 3.7 MB |
99 | oc | Occitan | 920 | 34,701 | 405.0 kB |
100 | or | Odia | 158,426 | 31,963,340 | 543.1 MB |
101 | os | Ossetic | 8,628 | 3,935,964 | 50.7 MB |
102 | ps | Pashto | 87,408 | 30,196,179 | 261.6 MB |
103 | fa | Persian | 23,813,882 | 9,609,206,698 | 93.2 GB |
104 | pms | Piedmontese | 2,524 | 510,087 | 3.1 MB |
105 | pl | Polish | 57,184,826 | 18,073,705,588 | 147.1 GB |
106 | pt | Portuguese | 36,062,800 | 15,172,557,311 | 105.0 GB |
107 | pa | Punjabi | 222,058 | 104,235,418 | 1.4 GB |
108 | qu | Quechua | 2 | 13 | 143 Bytes |
109 | ro | Romanian | 11,985,668 | 6,302,600,833 | 45.6 GB |
110 | bxr | Russia Buriat | 72 | 698 | 8.2 kB |
111 | ru | Russian | 194,143,422 | 78,032,029,344 | 1.1 TB |
112 | sah | Sakha | 17,566 | 4,288,051 | 68.8 MB |
113 | sa | Sanskrit | 16,802 | 2,479,345 | 56.3 MB |
114 | gd | Scottish Gaelic | 776 | 18,458 | 146.1 kB |
115 | sr | Serbian | 1,677,896 | 632,781,822 | 7.7 GB |
116 | sh | Serbian (Latin) | 3,214 | 166,517 | 816.4 kB |
117 | sd | Sindhi | 48,566 | 14,667,207 | 131.6 MB |
118 | si | Sinhala | 301,066 | 172,755,385 | 2.6 GB |
119 | sk | Slovak | 8,931,784 | 2,704,716,280 | 21.5 GB |
120 | sl | Slovenian | 1,112,560 | 192,816,743 | 1.4 GB |
121 | so | Somali | 6 | 51 | 503 Bytes |
122 | azb | South Azerbaijani | 26,364 | 2,029,729 | 28.4 MB |
123 | es | Spanish | 153,574,556 | 63,388,237,965 | 429.9 GB |
124 | su | Sundanese | 18 | 258 | 2.0 kB |
125 | sw | Swahili | 1,664 | 164,459 | 1.0 MB |
126 | sv | Swedish | 21,891,348 | 6,993,719,601 | 50.0 GB |
127 | gsw | Swiss German | 342 | 34,328 | 232.7 kB |
128 | tg | Tajik | 144,932 | 76,987,285 | 1.0 GB |
129 | ta | Tamil | 1,638,238 | 738,824,392 | 15.8 GB |
130 | tt | Tatar | 262,654 | 59,253,765 | 833.8 MB |
131 | te | Telugu | 644,712 | 201,575,815 | 3.9 GB |
132 | th | Thai | 14,845,900 | 2,224,483,018 | 92.0 GB |
133 | bo | Tibetan | 62,352 | 6,062,558 | 531.6 MB |
134 | tr | Turkish | 26,654,330 | 8,290,890,087 | 73.7 GB |
135 | tk | Turkmen | 4,576 | 325,786 | 3.3 MB |
136 | uk | Ukrainian | 10,059,992 | 3,183,842,018 | 44.7 GB |
137 | x-eml | Emiliano-Romagnol | 4 | 329 | 1.8 kB |
138 | hsb | Upper Sorbian | 402 | 15,827 | 123.2 kB |
139 | ur | Urdu | 887,004 | 434,023,273 | 3.8 GB |
140 | ug | Uyghur | 51,304 | 14,659,554 | 219.8 MB |
141 | uz | Uzbek | 15,806 | 1,665,960 | 15.3 MB |
142 | vi | Vietnamese | 33,933,994 | 22,424,984,210 | 140.8 GB |
143 | vo | Volapük | 896 | 49,968 | 371.9 kB |
144 | wa | Walloon | 390 | 6,347 | 34.3 kB |
145 | war | Waray | 1,494 | 19,665 | 126.8 kB |
146 | cy | Welsh | 151,512 | 52,250,043 | 333.0 MB |
147 | fy | Western Frisian | 45,458 | 9,885,788 | 70.4 MB |
148 | mrj | Western Mari | 496 | 60,180 | 765.8 kB |
149 | pnb | Western Panjabi | 12,904 | 11,844,695 | 105.8 MB |
150 | wuu | Wu Chinese | 136 | 1,199 | 26.8 kB |
151 | yi | Yiddish | 47,438 | 14,287,370 | 171.7 MB |
152 | yo | Yoruba | 128 | 2,396 | 16.6 kB |