
oscar-tools

oscar-tools is a toolkit created alongside OSCAR 22.01 to make operations on the corpus easy and fast.

At its core, oscar-tools provides a set of operations, each targeted at a given OSCAR version. As such, you shouldn't expect every operation to be available for every OSCAR version. For example, at the time of writing, deduplicate is not available for OSCAR 22.01-like corpora.

The oscar-tools CLI is still a bit messy and can be confusing, because it is under active development while we implement essential features.

Installation

From releases

Note

Binaries are not available yet.

From cargo

Note

cargo install oscar-tools is not available yet.

From repository

Note

This could evolve rapidly.

Right now the latest version sits on the dev-oscario branch, where we're gradually replacing inline IO blocks with our corpus IO library, oscar-io.

$> git clone https://github.com/oscar-corpus/oscar-tools #Clone the repository
$> cd oscar-tools
$> git checkout dev-oscario #Change branch
$> cargo b --release #Build the project. 
$> # Building might take some time because of 
$> # the parquet dependency that will soon be optional.
$> ls -l target/release/oscar-tools #The binary is here and is self-contained.

Usage

Run oscar-tools --help to list the available operations and their parameters.

Note

In the tool, v1 corresponds to 2019-like corpora, whereas v2 corresponds to 22.01-like corpora.

Each operation has different parameters.

v1 / OSCAR 2019

At the time of writing, the only operation available is dedup. It uses runiq to deduplicate corpora.

oscar-tools-v1-dedup 
line deduplication

USAGE:
    oscar-tools v1 dedup [ARGS]

ARGS:
    <SOURCE>         Corpus source file.
    <DESTINATION>    Corpus destination file. Should not exist.

OPTIONS:
    -h, --help    Print help information
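
Since the help text says the destination should not exist, it can be convenient to check for that before launching a long deduplication run. The following is a minimal sketch, not part of oscar-tools: the dedup_corpus function name and the OSCAR_TOOLS variable (used to point at the binary, for instance target/release/oscar-tools) are hypothetical. Only the v1 dedup <SOURCE> <DESTINATION> invocation comes from the help output above.

```shell
# Hypothetical helper, not provided by oscar-tools.
# Usage: dedup_corpus SOURCE DESTINATION
# OSCAR_TOOLS (assumption) may hold the path to the built binary;
# it falls back to "oscar-tools" found on PATH.
dedup_corpus() {
    src="$1"
    dst="$2"
    # The help text states that the destination should not exist,
    # so refuse to continue if something is already there.
    if [ -e "$dst" ]; then
        echo "refusing to overwrite existing destination: $dst" >&2
        return 1
    fi
    "${OSCAR_TOOLS:-oscar-tools}" v1 dedup "$src" "$dst"
}
```

With the function sourced in your shell, dedup_corpus fr.txt fr_dedup.txt would run the deduplication only if fr_dedup.txt is absent (both file names are placeholders).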

v2 / OSCAR 22.01

Many more operations are implemented for OSCAR 22.01-like corpora.

extract-tags

extract-tags extracts documents that meet certain annotation constraints.

oscar-tools-v2-extract-tags 
Extracts a OSCAR v2 corpus restricting tags. Included tags must be present and excluded ones must be
absent. Use --clean to extract documents with no annotation only

USAGE:
    oscar-tools v2 extract-tags [OPTIONS] [--] [ARGS]

ARGS:
    <SOURCE>         Corpus source file/folder. If folder, splits corpus files in provided
                     folder
    <DESTINATION>    Corpus source file/folder. If folder, splits corpus files in provided
                     folder

OPTIONS:
        --clean                only return documents with no tags. include and exclude will be
                               ignored
    -e, --exclude <tags>...    space separated tags to exclude.
    -h, --help                 Print help information
    -i, --include <tags>...    space separated tags to include.
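
Because --include and --exclude each accept several space-separated tags, the [--] shown in the usage line is presumably what separates the tag lists from the positional arguments. The sketch below assembles such a command line. The extract_tags function and the OSCAR_TOOLS variable are hypothetical, and the tag names used in the usage note (tiny, noisy, adult) are assumed to be valid annotations in your corpus.

```shell
# Hypothetical helper, not provided by oscar-tools.
# Usage: extract_tags SOURCE DESTINATION "INCLUDE TAGS" "EXCLUDE TAGS"
# Both tag arguments are space-separated strings and may be empty.
extract_tags() {
    src="$1"
    dst="$2"
    include="$3"
    exclude="$4"
    # Rebuild the positional parameters as the option list.
    set --
    if [ -n "$include" ]; then
        # Unquoted on purpose: each tag becomes its own argument.
        set -- "$@" --include $include
    fi
    if [ -n "$exclude" ]; then
        set -- "$@" --exclude $exclude
    fi
    # "--" ends the option list so SOURCE and DESTINATION
    # are not read as additional tags.
    "${OSCAR_TOOLS:-oscar-tools}" v2 extract-tags "$@" -- "$src" "$dst"
}
```

For example, extract_tags corpus/ filtered/ "tiny" "noisy adult" would request documents that carry the tiny tag and carry neither noisy nor adult. To keep only documents with no annotation at all, the help text indicates that --clean should be used instead, in which case include and exclude are ignored.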

extract-text

extract-text "converts" a 22.01-like corpus into a 2019-like corpus by removing all metadata and storing only the sentences. Keep in mind that while the format will be similar to that of 2019-like corpora, the filtering is a bit different: lines from other languages won't be stripped.

Extract text from documents. The output will be a OSCAR v1 (2019)-compatible corpus.

USAGE:
    oscar-tools v2 extract-text [OPTIONS] <SOURCE> <DESTINATION>

ARGS:
    <SOURCE>         Corpus source file.
    <DESTINATION>    Corpus destination file (OSCAR v1 (2019)-like)

OPTIONS:
        --del_src    If set, deletes source files as they are being extracted.
    -h, --help       Print help information
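
The help text describes SOURCE as a single corpus file, so converting a whole folder takes one call per file. Here is a minimal sketch of such a loop. The extract_text_dir function, the OSCAR_TOOLS variable, and the .jsonl and .txt file extensions are all assumptions made for illustration; adapt them to how your corpus is actually laid out.

```shell
# Hypothetical helper, not provided by oscar-tools.
# Usage: extract_text_dir SOURCE_DIR DESTINATION_DIR
# Runs extract-text once per SOURCE_DIR/*.jsonl file (extension assumed)
# and writes DESTINATION_DIR/<name>.txt for each of them.
extract_text_dir() {
    src_dir="$1"
    dst_dir="$2"
    mkdir -p "$dst_dir"
    for f in "$src_dir"/*.jsonl; do
        # Skip the literal pattern when no file matches the glob.
        [ -e "$f" ] || continue
        name=$(basename "$f" .jsonl)
        # --del_src is deliberately left out, so source files are kept.
        "${OSCAR_TOOLS:-oscar-tools}" v2 extract-text "$f" "$dst_dir/$name.txt"
    done
}
```

Leaving out --del_src means the original files stay in place. Add it only once you are sure the extracted output is what you expect.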