Name		Name	Last commit message	Last commit date
parent directory ..
tokenizer		tokenizer
README.md		README.md
arguments.py		arguments.py
contributor_notebook.ipynb		contributor_notebook.ipynb
dataset_streaming.py		dataset_streaming.py
huggingface_auth.py		huggingface_auth.py
metrics_utils.py		metrics_utils.py
requirements.txt		requirements.txt
run_first_peer.py		run_first_peer.py
run_trainer.py		run_trainer.py
tokenization_albert_bengali_fast.py		tokenization_albert_bengali_fast.py
train_ncc.py		train_ncc.py
train_ner.py		train_ner.py

README.md

sahajBERT: A Collaboratively Trained Bengali Language Model

This section contains the code and instructions that were used when training sahajBERT in a collaborative setup.

Preparation

Install hivemind (see main README)
Dependencies: pip install -r requirements.txt
Preprocess the data by following the instructions in the tokenizer_training folder
Upload the archive with preprocessed data to somewhere your peers can reach

Running an experiment

Run the first DHT peer to welcome trainers and record training statistics (e.g. loss, performance):

In this example, we use wandb.ai to plot training metrics; If you're unfamiliar with Weights & Biases, here's a tutorial.
Run python run_first_peer.py --listen_on '[::]:*' --experiment_prefix NAME_YOUR_EXPERIMENT --wandb_project WANDB_PROJECT_HERE
NAME_YOUR_EXPERIMENT must be a unique name of this training run, e.g. Bengali-albert. It cannot contain . due to naming conventions.
WANDB_PROJECT_HERE is a name of wandb project used to track training metrics. Multiple experiments can have the same project name.
This peer will run a DHT node on a certain IP/port (Running DHT root at ...). You will need this address for next steps

To join the collaboration as a participant, run

HIVEMIND_THREADS=128 python run_trainer.py \
--output_dir ./outputs_trainer --overwrite_output_dir  --logging_dir ./logs_trainer \
--logging_first_step --logging_steps 100   --initial_peers COORDINATOR_IP:COORDINATOR_PORT \
--experiment_prefix NAME_YOUR_EXPERIMENT --seed 42 --averaging_timeout 120  --bandwidth 1000

Instead of COORDINATOR_IP:COORDINATOR_PORT, you can specify any existing trainers in the same format.

Collaborator notebook and authentication

We also provide an example of the Jupyter Notebook that was shared with the participants as an easy way to join the experiment. Note that it is meant as a demonstration and not as a ready-to use solution; as a result, it might require adaptation to the specifics of the experiment and the community that you intend to share it with.

One possible solution for managing experiment access that was used for sahajBERT training is the authentication mechanism. In huggingface_auth.py, you can see the client side of the authentication API that can be used as an authorizer in Hivemind. For the server side, please refer to the collaborative-training-auth repo by Hugging Face.

Downstream evaluation

We use scripts to evaluate on downstream tasks, based on the original finetuning scripts from the transformers repository.

Datasets:

NER: "wikiann", "bn"
NCC: "indic_glue", "sna.bn"

Models:

"neuropark/sahajBERT"
"xlm-roberta-large"
"ai4bharat/indic-bert"
"neuralspace-reverie/indic-transformers-bn-roberta"

Required arguments:

--model_name_or_path
--output_dir

Examples

NER

python train_ner.py \
  --model_name_or_path xlm-roberta-large \
  --output_dir sahajbert/ner \
  --learning_rate 1e-5 \
  --max_seq_length 128 \
  --num_train_epochs 20 \
  --per_device_train_batch_size 128 \
  --per_device_eval_batch_size 128 \
  --early_stopping_patience 3 \
  --early_stopping_threshold 0.01

NCC

python train_ncc.py \
  --model_name_or_path xlm-roberta-large \
  --output_dir sahajbert/ncc \
  --learning_rate 1e-5 \
  --max_seq_length 128 \
  --num_train_epochs 20 \
  --per_device_train_batch_size 128 \
  --per_device_eval_batch_size 128 \
  --early_stopping_patience 3 \
  --early_stopping_threshold 0.01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sahajbert

sahajbert

README.md

sahajBERT: A Collaboratively Trained Bengali Language Model

Preparation

Running an experiment

Collaborator notebook and authentication

Downstream evaluation

Examples

Files

sahajbert

Directory actions

More options

Directory actions

More options

Latest commit

History

sahajbert

Folders and files

parent directory

README.md

sahajBERT: A Collaboratively Trained Bengali Language Model

Preparation

Running an experiment

Collaborator notebook and authentication

Downstream evaluation

Examples