OSCAR Quickstart

What is OSCAR?

OSCAR is a web-based multilingual corpus of several terabytes, containing subcorpora in more than 150 languages.

Each OSCAR Corpus has a version name that tells you its approximate generation time, which usually coincides with the source crawl time. The latest OSCAR Corpus is OSCAR 2301. We advise you to always use the latest version, as we incrementally include new features that enable new ways of filtering the corpus for your applications.

Basic data layout

Since OSCAR 2109, OSCAR has been document-oriented, which means that subcorpora are composed of documents rather than individual lines.

This has important implications as to how to preprocess the data:

You can (and will) find sentences in languages other than the one you're interested in. For example, you should expect to encounter English sentences in documents from the French subcorpus.

Example

The English Wikipedia article about the French anthem, La Marseillaise, contains its lyrics in French. As such, this article is expected to be present in the English subcorpus with those French lyrics.

The good news is that you can easily remove those sentences if you are not interested in them, thanks to the metadata provided alongside the main content.

OSCAR is distributed in JSONLines files, usually compressed (gzip or zstd, depending on the version).
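
As a minimal sketch of how such a file can be read, assuming a gzip-compressed file in which each line holds one JSON document, the following uses only the Python standard library. The function name iter_documents is illustrative and not part of OSCAR. zstd-compressed files would need a third-party decompressor in place of gzip, which is not shown here.

```python
import gzip
import json
from typing import Iterator


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one document (as a dict) per non-empty line of a gzip-compressed JSONLines file."""
    # "rt" opens the compressed stream in text mode, so lines are decoded on the fly
    # and the whole file never has to fit in memory.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at end of file
                yield json.loads(line)
```

Iterating lazily matters here because subcorpora can be very large: a loop such as `for document in iter_documents(path): ...` processes one document at a time, where path points to whichever file you downloaded.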

Each line of a file is a JSON Object representing a single document. Here is an example from OSCAR 2301:

{
   "content":"English sentence\nphrase en français\n????????????", // (1)
   "warc_headers":{ // (2)
      "warc-identified-content-language":"fra,eng",
      "warc-target-uri":"https://fr.wikipedia.org/wiki/...",
      "warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>",
      "warc-type":"conversion",
      "content-length":"35298", // (3)
      "warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>",
      "warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3)
      "warc-date":"2022-11-26T09:45:47Z",
      "content-type":"text/plain"
   },
   "metadata":{
      "identification":{ // (4)
         "label":"fr",
         "prob":0.8938327
      },
      "harmful_pp":4063.1814, // (5)
      "tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6)
      "quality_warnings":[ // (7)
         "short_sentences",
         "header",
         "footer"
      ],
      "categories":[ // (8)
         "examen_pix",
         "liste_bu"
      ],
      "sentence_identifications":[ // (9)
         {
            "label":"fr",
            "prob":0.99837273
         },
         {
            "label":"en",
            "prob":0.9992377
         },
         null
      ]
   }
}
  1. Newline-separated content.
  2. Headers from the crawled dumps are left untouched. See the WARC specification for more info.
  3. Since warc_headers are copied and content can be altered by Ungoliant at generation stage, content-length and warc-block-digest can be different from actual values.
  4. Document-level identification. Computation details can be found in the OSCAR 22.01 paper.
  5. Perplexity of the document, computed using a KenLM model trained on harmful content. See this pre-print for more info. The lower this number is, the higher the probability that the document contains harmful or adult content. This annotation will be renamed from harmful_pp to harmful_ppl in future releases.
  6. Locality Sensitive Hash of the document's content, using TLSH. Useful for both exact and near deduplication.
  7. (Corresponds to annotations pre-23.01) Potential quality warnings. Based on content/sentence length. See the OSCAR 22.01 paper for more info.
  8. Blocklist-based categories. Uses the UT1 Blocklist, plus custom additions. Please refer to the UT1 website for categories description. Note that the categories are in French.
  9. Sentence-level identifications. A null value means no identification with a good enough threshold (>0.8 on 23.01).
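
The metadata described above can be used to filter documents and lines. The following is a minimal sketch, not an official OSCAR tool. It assumes that the i-th entry of sentence_identifications describes the i-th newline-separated line of content, and that harmful_pp, quality_warnings and categories may be missing or null. The function names, the min_prob parameter and the min_harmful_pp threshold are hypothetical and must be chosen for your own application.

```python
def keep_lines_in_language(document: dict, language: str, min_prob: float = 0.0) -> str:
    """Return the content restricted to lines identified as `language`.

    Assumption: sentence_identifications is aligned position by position
    with the newline-separated lines of content.
    """
    lines = document["content"].split("\n")
    identifications = document["metadata"]["sentence_identifications"]
    kept = []
    for line, identification in zip(lines, identifications):
        if identification is None:
            continue  # null means no identification above the threshold
        if identification["label"] == language and identification["prob"] >= min_prob:
            kept.append(line)
    return "\n".join(kept)


def passes_document_filters(
    document: dict,
    min_harmful_pp: float,
    banned_warnings: frozenset = frozenset(),
    banned_categories: frozenset = frozenset(),
) -> bool:
    """Return True if the document passes the (hypothetical) filters below."""
    metadata = document["metadata"]
    # A lower perplexity means the document is more likely to be harmful,
    # so documents below the chosen threshold are rejected.
    harmful_pp = metadata.get("harmful_pp")
    if harmful_pp is not None and harmful_pp < min_harmful_pp:
        return False
    warnings = set(metadata.get("quality_warnings") or [])
    categories = set(metadata.get("categories") or [])
    return not (warnings & banned_warnings) and not (categories & banned_categories)
```

For instance, calling keep_lines_in_language(document, "fr") on a document from the French subcorpus drops the lines identified as another language as well as the unidentified ones, which is the clean-up described in the data layout section.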

Getting access

There are different ways of getting access to OSCAR, depending on your status. Head over to our dedicated page.

Using the corpus

TODO