Generating a new OSCAR Version on Jean Zay
Compile the latest version of Ungoliant
This is currently preferred to just getting it from `cargo install ungoliant`.

- Get the source: `git clone https://github.com/oscar-project/ungoliant`
- Open an interactive session on a `compil` node: `srun --partition=compil -A <GROUP ID>@cpu --pty bash`
- Run `module load llvm boost cargo` (`boost` and `llvm` are necessary for compiling KenLM and FastText)
- Run `cd ungoliant`
- Run `cargo b --release --features kenlm`
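If disk quota in your home directory is a concern, the cargo cache can live outside of `~` during the build, as is done later for `oscar-tools` (the `$SCRATCH/.cargo` path below is only an example):

```bash
# Keep the cargo cache out of $HOME while building (path is an example)
CARGO_HOME=$SCRATCH/.cargo cargo b --release --features kenlm
# The resulting binary sits at ./target/release/ungoliant
```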
Download the data from CommonCrawl
We advise using the `prepost` partition for downloading the data from Common Crawl. However, please bear in mind that jobs are limited to 20 hours on the `prepost` partition, meaning that you'll likely run out of time before completing the download of a whole Common Crawl dump.
- Download the `wet.paths.gz` file for the latest release from the Common Crawl website, then decompress it: `gzip -d wet.paths.gz`
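As an illustration, fetching and decompressing the paths file looks like this (the `CC-MAIN-2022-49` dump identifier is only an example; substitute the dump you are targeting):

```bash
# Example for the November/December 2022 dump; adjust the CC-MAIN identifier
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-49/wet.paths.gz
gzip -d wet.paths.gz   # produces wet.paths, one WET shard path per line
```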
Create a `dl_corpus.slurm` file with the following content:
```bash
#! /bin/bash
#SBATCH --partition=prepost
#SBATCH --job-name=get_cc # create a short name for your job
#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<YOUR MAIL> # Where to send mail
#SBATCH --nodes="1" # Number of nodes
#SBATCH --ntasks-per-node="1" # One task per node
#SBATCH --cpus-per-task="64" # Number of cores to reserve per task
#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
#SBATCH -A <GROUP ID>@cpu

export CARGO_HOME=<CARGO HOME PATH (in $SCRATCH if you can)>
export PATHS_FILE=<PATH TO wet.paths>
export DST=<DESTINATION>

./target/release/ungoliant download $PATHS_FILE $DST
```
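Submit the job from the root of the `ungoliant` repository, since the script calls the binary through the relative path `./target/release/ungoliant`:

```bash
sbatch dl_corpus.slurm
```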
When the time runs out, you have to make sure that the last downloaded shards weren't corrupted (by a potential kill in the middle of a download). Then, after removing any faulty shards, run the following SLURM job. The only difference with the previous one is the `-o n` parameter on `ungoliant download`, which skips the first `n` lines of `wet.paths`. You can (and probably should) also use another `DESTINATION` folder, and then do the merge by hand.
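One quick way to spot truncated downloads is to test the gzip integrity of the most recently written shards, for example (a sketch; adjust the destination path and the number of files to inspect):

```bash
# Test the gzip integrity of the 20 most recently modified shards
SHARDS_DIR=<DESTINATION>
for f in $(ls -t "$SHARDS_DIR"/*.txt.gz | head -n 20); do
    gzip -t "$f" || echo "corrupted: $f"
done
```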
```bash
#! /bin/bash
#SBATCH --partition=prepost
#SBATCH --job-name=get_cc # create a short name for your job
#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<YOUR MAIL> # Where to send mail
#SBATCH --nodes="1" # Number of nodes
#SBATCH --ntasks-per-node="1" # One task per node
#SBATCH --cpus-per-task="64" # Number of cores to reserve per task
#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
#SBATCH -A <GROUP ID>@cpu

export CARGO_HOME=<CARGO HOME PATH (in $SCRATCH if you can)>
export PATHS_FILE=<PATH TO wet.paths>
export DST=<DESTINATION>

./target/release/ungoliant download -o <NB_DOWNLOADED> $PATHS_FILE $DST
```
You can then check that no shards are missing:
```python
import os
import sys

shards_dir = "./shards"  # where ungoliant downloaded the shards
paths_file = "wet.paths"  # the decompressed wet.paths file
cc_rooturl = "https://data.commoncrawl.org/"

# One line per shard in wet.paths (~88,000 for the Nov./Dec. 2022 dump)
with open(paths_file) as f:
    shard_paths = [line.strip() for line in f]

# Shards are saved as <shard number>.txt.gz, numbered by their line in wet.paths
missing_shards = [
    i
    for i in range(len(shard_paths))
    if not os.path.isfile(f"{shards_dir}/{i}.txt.gz")
]

# Report the count on stderr so stdout can be piped straight into a shell
print(f"missing {len(missing_shards)} shards", file=sys.stderr)

# Print the wget commands needed to fetch the missing shards
for missing_shard_number in missing_shards:
    print(
        f"wget -nc {cc_rooturl}{shard_paths[missing_shard_number]} -O {missing_shard_number}.txt.gz"
    )
```
This prints the `wget` commands needed to fetch the missing shards, with the `-nc` flag to avoid overwriting files that already exist.
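For example, assuming the script above is saved as `check_shards.py` (the filename is arbitrary), its output can be turned into a small download script; run it from inside the shards folder, since the `-O` targets are relative paths:

```bash
python check_shards.py > get_missing.sh   # wget commands on stdout, count on stderr
bash get_missing.sh                       # run from the shards folder
```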
Generate OSCAR
When your shards are ready, create a new SLURM file with the content below. We use the `qos_cpu-t4` QoS because we can only use a single node and corpus generation is likely to take more than 20 hours, so we need the 100-hour limit. Other strategies could be tested, for example splitting the Common Crawl data into 4 buckets and launching 4 `ungoliant` jobs, then merging the resulting corpora back together (see the sketch at the end of this section). Note that in that case, rebuild files will be less efficient (since there would be 4 of them).
```bash
#! /bin/bash
#SBATCH --partition=cpu_p1
#SBATCH --job-name=gen_oscar # create a short name for your job
#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<YOUR MAIL> # Where to send mail
#SBATCH --nodes="1" # Number of nodes
#SBATCH --ntasks-per-node="1" # One task per node
#SBATCH --cpus-per-task="40" # Number of cores to reserve per task
#SBATCH --time="100:00:00" # Maximum requested runtime (HH:MM:SS)
#SBATCH --qos=qos_cpu-t4
#SBATCH -A <GROUP ID>@cpu

export CARGO_HOME=<CARGO HOME PATH>
export CC_FOLDER=<SHARDS PATH>
export KENLM_FOLDER=<PATH TO KENLM MODELS IF APPLICABLE>
export CORPUS=<DESTINATION FOLDER>
export BLOCKLIST=<BLOCKLIST FOLDER (must contain subfolders named after categories)>
export LID_PATH=<PATH TO FASTTEXT LangID>
export UNGOLIANT_PATH=<PATH TO UNGOLIANT BINARY>

$UNGOLIANT_PATH pipeline $CC_FOLDER $CORPUS --blocklist-path $BLOCKLIST --kenlms-path $KENLM_FOLDER --lid-path $LID_PATH
```
As of Jan. 2023, using ungoliant 1.3.0 ([c14acc8](https://github.com/oscar-project/ungoliant/tree/c14acc8c6a87913d138a022cf4819024d66b3e06)), with an 88,000-shard dump of CommonCrawl (November/December 2022, ~9.5TB compressed), this process took around 20 hours and yielded a corpus weighing around 12TB (uncompressed).
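As a sketch of the bucket strategy mentioned above (the bucket count, paths, and file names are illustrative, not part of the documented pipeline):

```bash
# Split wet.paths into 4 equal buckets without cutting lines (GNU coreutils split)
split -n l/4 -d wet.paths wet.paths.bucket_   # -> wet.paths.bucket_00 .. wet.paths.bucket_03

# Each bucket then gets its own download destination and its own pipeline job, e.g.
#   ungoliant download wet.paths.bucket_00 $DST/bucket_00
#   ungoliant pipeline $DST/bucket_00 $CORPUS/bucket_00 ...
# and the per-language outputs are merged by hand afterwards.
```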
Move OSCAR
Files in `$SCRATCH` are deleted after 30 days if they are not read or written to. You should move your files to `$STORE` if you plan on keeping them. Unfortunately, due to the file sizes, you'll need to launch another job to do the copying.

Warning

`rsync -n` performs a dry run, letting you see which files would be moved and where. Remove the `-n` parameter when you want to perform the actual copy.
```bash
#! /bin/bash
#SBATCH --partition=prepost
#SBATCH --job-name=copy_oscar # create a short name for your job
#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=julien.abadji@inria.fr # Where to send mail
#SBATCH --nodes="1" # Number of nodes
#SBATCH --ntasks-per-node="1" # One task per node
#SBATCH --cpus-per-task="4" # Number of cores to reserve per task
#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
#SBATCH -A <GROUP ID>@cpu

export SRC=<CORPUS SOURCE>
export DST=<CORPUS DESTINATION>

rsync -anvP $SRC $DST
```
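Once the dry-run output looks right, the same command without `-n` performs the actual copy:

```bash
rsync -avP $SRC $DST
```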
On the same example as before, copying took around 9 hours.
Preparing for release
Splitting
We use `oscar-tools` to split the corpus.

Note

At the time of writing, `oscar-tools` is not available via crates.io/`cargo install`, so you have to compile it from source. Luckily, it's easy.
Compiling oscar-tools
- Get the source: `git clone https://github.com/oscar-project/oscar-tools`
- Go inside a `compil` node: `srun --partition=compil -A <GROUP ID>@cpu --pty bash`
- Run `cd oscar-tools`
- Run `CARGO_HOME=<somewhere not in your ~, like $SCRATCH/.cargo> cargo b --features zstd --release`
- Wait ~some hours~
- That's it! Your binary sits at `target/release/oscar-tools`.
Splitting is then a matter of running `oscar-tools v2 split` on the corpus, via a SLURM job like this one:

```bash
#! /bin/bash
#SBATCH --partition=prepost
#SBATCH --job-name=split_oscar # create a short name for your job
#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<Your email address> # Where to send mail
#SBATCH --nodes="1" # Number of nodes
#SBATCH --ntasks-per-node="1" # One task per node
#SBATCH --cpus-per-task="10" # Number of cores to reserve per task
#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
#SBATCH -A <group id>@cpu

export OSCAR_TOOLS_BIN=<path to oscar-tools binary>
export CORPUS=<path to corpus>
export DST=<where the split corpus will be put>

$OSCAR_TOOLS_BIN v2 split $CORPUS $DST -s 2000
```
This step took around 3 hours (assuming both `CORPUS` and `DST` are on `$SCRATCH`).
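If you want a rough sanity check of the output, you can list any split file that ended up larger than expected (the 2000 MB threshold assumes `-s 2000` is a size in megabytes; adjust if the unit differs):

```bash
# List any split file larger than ~2000 MB
find $DST -type f -size +2000M -exec ls -lh {} +
```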
Compressing
Compression is done with `oscar-tools v2 compress`:

```bash
#! /bin/bash
#SBATCH --partition=prepost
#SBATCH --job-name=compress_oscar # create a short name for your job
#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email address> # Where to send mail
#SBATCH --nodes="1" # Number of nodes
#SBATCH --ntasks-per-node="1" # One task per node
#SBATCH --cpus-per-task="48" # Number of cores to reserve per task
#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
#SBATCH -A <group id>@cpu

export OSCAR_TOOLS_BIN=<path to oscar-tools binary>
export CORPUS=<path to the split corpus>
export DST=<where the compressed corpus will be saved>

$OSCAR_TOOLS_BIN v2 compress $CORPUS $DST
```
This step took around 2 hours, going from 12TB down to 3.3TB.
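If you want a quick sanity check of the compressed output, `zstd -t` tests archive integrity without extracting anything (a sketch; the `*.zst` glob assumes the default zstd extension):

```bash
# Test every compressed file under $DST; -q limits the output to errors
find $DST -type f -name "*.zst" -exec zstd -tq {} +
```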
Checksumming
The last step is to create `checksum` files for each language, so that people can check that their downloads have been successful. The checksum file also acts as a split list for `download-oscar`.
```bash
#! /bin/bash
#SBATCH --partition=prepost
#SBATCH --job-name=checksum_oscar # create a short name for your job
#SBATCH --mail-type=BEGIN,END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email address> # Where to send mail
#SBATCH --nodes="1" # Number of nodes
#SBATCH --ntasks-per-node="1" # One task per node
#SBATCH --cpus-per-task="48" # Number of cores to reserve per task
#SBATCH --time="20:00:00" # Maximum requested runtime (HH:MM:SS)
#SBATCH -A <group id>@cpu

export OSCAR_TOOLS_BIN=<path to oscar-tools binary>
export CORPUS=<path to the compressed corpus>

$OSCAR_TOOLS_BIN v2 checksum $CORPUS
```
The process took around 2 hours.
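On the consumer side, a downloaded language can then be verified against its checksum file with standard tools. The exact file name and digest produced by `oscar-tools` may differ from this sketch, which assumes a `sha256sum`-compatible listing:

```bash
# From inside the downloaded language folder, check every file against the list
sha256sum -c checksum.sha256
```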