Getting access to OSCAR
OSCAR can be accessed in two ways: through Huma-Num or through Hugging Face. Depending on your status, only one of them may be available to you.
| | Research/Academic | Individual |
|---|---|---|
| Huma-Num | | |
| Hugging Face | | |
You can request access by sending us an email.
Warning
Follow the instructions below carefully, as incorrect submissions might significantly delay your access.
Danger
Do not create an account yourself, as it could delay your access by weeks! We will create an account for you.
Send us an email at contact at oscar-project.org, with "OSCAR Access Request" as the subject and the following (completed) as the body:
Warning
Please send your email using your institutional/academic address when possible. Otherwise, your access might be delayed/refused.
- First name:
- Last name:
- Affiliation:
- Contact details:
- Corpus version:
- Languages:
- A short description of your use case:
Note
Access requests can take some days to be answered, sometimes more.
We post updates about exceptional delays on our Discord server, and you can always contact us there to ask about your request.
After some time, you should get an email back from us with access instructions!
Using datasets
The following assumes that you have already installed the Python datasets library.
- Create an account on HuggingFace.
- Create a user access token.
- Open the OSCAR Team page.
- Open your corpus of choice. Instructions should be on the corpus page.
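For `use_auth_token=True` to work, the access token has to be stored on your machine first. Below is a minimal sketch of one way to do that, assuming the `huggingface_hub` package is installed and that the token has been exported in an environment variable. The variable name `OSCAR_HF_TOKEN` and the helper `login_from_env` are made up for this example.

```python
import os


def login_from_env(var_name="OSCAR_HF_TOKEN"):
    """Store the Hugging Face access token found in an environment variable.

    `var_name` is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {var_name} to your Hugging Face user access token first."
        )

    # Imported lazily so the check above runs even without the package installed.
    from huggingface_hub import login

    login(token=token)  # saves the token locally for later calls
```

Calling `login_from_env()` once before loading a corpus should be enough. Running `huggingface-cli login` in a terminal is an interactive alternative.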
After all of this, you should be able to use OSCAR data with the datasets library:
# example with OSCAR 2201
from datasets import load_dataset

dataset = load_dataset(
    "oscar-corpus/OSCAR-2201",
    use_auth_token=True,  # required (recent releases of datasets use token=True instead)
    language="ar",
    streaming=True,  # optional
    split="train",  # optional
)

for d in dataset:
    print(d)  # prints documents
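With `streaming=True`, the loop above walks through the whole corpus, which can take a long time. To peek at only a few documents, one option is to cut the iteration short with `itertools.islice` from the standard library. The sketch below uses a stand-in generator instead of the real dataset so that it runs offline. The names `head` and `fake_stream` and the fields of the stand-in records are illustrative only.

```python
from itertools import islice


def head(documents, n=3):
    """Return the first n items of any iterable, reading no more than n of them."""
    return list(islice(documents, n))


def fake_stream():
    """Stand-in for a streamed dataset: yields dict records one at a time."""
    i = 0
    while True:
        yield {"id": i, "text": f"document {i}"}
        i += 1


first_docs = head(fake_stream(), n=3)
for doc in first_docs:
    print(doc)
```

With the real corpus, `head(dataset, n=3)` should behave the same way, since a streamed dataset can be iterated like any other iterable.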
Using Git LFS
You can also get the raw data from HuggingFace using Git LFS.
The following steps assume you have git and git-lfs installed and are on a UNIX system. The procedure should be roughly the same on Windows, but has not been tested there.
This will download the Basque corpus from OSCAR 2109.
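As a rough sketch of such a download driven from Python, the helper below builds a `git clone` that skips the large files, followed by a `git lfs pull` restricted to one language. The repository URL (modelled on the `oscar-corpus/OSCAR-2201` identifier used earlier) and the `packaged/eu/*` include pattern are assumptions, not confirmed paths, so check the corpus page for the actual layout.

```python
import os
import subprocess

# Assumed values: modelled on the "oscar-corpus/OSCAR-2201" identifier, not confirmed.
REPO_URL = "https://huggingface.co/datasets/oscar-corpus/OSCAR-2109"
INCLUDE_PATTERN = "packaged/eu/*"  # hypothetical path to the Basque files


def build_commands(repo_url=REPO_URL, include=INCLUDE_PATTERN):
    """Return (clone_cmd, pull_cmd, target_dir) without running anything."""
    target_dir = repo_url.rstrip("/").rsplit("/", 1)[-1]
    clone_cmd = ["git", "clone", repo_url, target_dir]
    pull_cmd = ["git", "lfs", "pull", "--include", include]
    return clone_cmd, pull_cmd, target_dir


def download(repo_url=REPO_URL, include=INCLUDE_PATTERN):
    """Clone with LFS pointers only, then fetch just the selected files."""
    clone_cmd, pull_cmd, target_dir = build_commands(repo_url, include)
    # GIT_LFS_SKIP_SMUDGE=1 makes the clone skip downloading LFS file contents.
    env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
    subprocess.run(clone_cmd, check=True, env=env)
    subprocess.run(pull_cmd, check=True, cwd=target_dir)


# Only print the commands here; call download() to actually run them.
clone_cmd, pull_cmd, target_dir = build_commands()
print(" ".join(clone_cmd))
print(" ".join(pull_cmd))
```

Git will likely prompt for your Hugging Face username and access token during the clone, since access to the corpus is restricted.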