Getting access to OSCAR
To access the dataset, please go to HuggingFace.
Using datasets
The following assumes that you have already installed the Python datasets library.
- Create an account on HuggingFace.
- Create a user access token.
- Open the OSCAR Team page.
- Open the page of your corpus of choice and follow the access instructions given there.
Once these steps are complete, you should be able to use OSCAR data with the datasets library:
# example with OSCAR 2201
from datasets import load_dataset

dataset = load_dataset(
    "oscar-corpus/OSCAR-2201",
    use_auth_token=True,  # required (recent datasets releases name this argument `token`)
    language="ar",
    streaming=True,  # optional: iterate without downloading the whole corpus first
    split="train",  # optional
)

for d in dataset:
    print(d)  # prints documents
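With streaming=True, the loop above keeps going until the whole split has been read, which is rarely what you want for a first look. A minimal sketch of previewing only the first few documents is shown below. The helper name preview is hypothetical, and the stand-in stream and its id and text fields are assumptions made so the example runs without network access or a token; in practice you would pass the dataset object returned by load_dataset instead.

```python
from itertools import islice


def preview(dataset, n=3):
    """Return the first n items of any iterable, e.g. a streamed dataset.

    Hypothetical helper, not part of the datasets library.
    """
    return list(islice(dataset, n))


# Stand-in for a streamed corpus: a lazy generator of dict records.
# The "id" and "text" field names are assumptions for illustration.
fake_stream = ({"id": i, "text": f"document {i}"} for i in range(1000))

first_docs = preview(fake_stream, n=3)
for doc in first_docs:
    print(doc["id"], doc["text"])
```

Because islice stops pulling from the iterator after n items, only those documents are fetched; the rest of the stream is never read.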
Using Git LFS
You can also get the raw data from HuggingFace using Git LFS.
The following steps assume you have git and git-lfs installed and are on a UNIX system. The procedure should be roughly the same on Windows, but this has not been tested.
This will download the Basque corpus from OSCAR 2109.