X2Static Training
Last time we used our university's HPC cluster, we had to run our scripts inside Docker images. This meant that each script had to work end-to-end: download the data, clone the repositories, train the model, and run the evaluation scripts from start to finish. Here is the script we ended up with:
#!/bin/bash
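# SLURM configuration: run inside the bouncmpe/cuda-python3 container image
# from ghcr.io (enroot-style image reference; the '#' separating registry and
# path is escaped), with 2 GPUs, 16 CPUs per GPU, 100 GB of RAM, and a
# 90-hour time limit.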
#SBATCH --container-image ghcr.io\#bouncmpe/cuda-python3
#SBATCH --gpus=2
#SBATCH --cpus-per-gpu=16
#SBATCH --mem=100G
#SBATCH --time=90:00:00
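# Activate the Python virtual environment provided by the container image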
source /opt/python3/venv/base/bin/activate
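# Download the BOUN web corpus from TULAP and extract it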
echo "bounweb corpus"
pip install wget
python3 -c "import wget; wget.download('https://tulap.cmpe.boun.edu.tr/server/api/core/bitstreams/150f2e37-1dd3-4229-a37e-111f8a365edf/content')"
python3 -c "import zipfile; zipfile.ZipFile('bounwebcorpus.txt.zip', 'r').extractall()"
rm bounwebcorpus.txt.zip
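# Download the Huawei corpus (turkish-texts-tokenized.txt) from Google Drive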
echo "huwaei"
pip install gdown
python3 -m gdown 1PTytZ7yGIl9QvxRxCsfWLlHycU4z1_Vp
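# Download the vocabulary file, also from Google Drive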
echo "vocab"
python3 -m gdown 1Sjfh9c7gMa6lvsjprcnbtVyKYEJ8r2fH
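# Merge the two corpora into combined.txt; the commented-out mv below would
# instead train on the BOUN web corpus alone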
echo "comb"
python3 -c "open('combined.txt', 'w', encoding='utf-8').writelines([line for file in ['bounwebcorpus.txt', 'turkish-texts-tokenized.txt'] for line in open(file, 'r', encoding='utf-8')])"
# mv bounwebcorpus.txt combined.txt
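# Name of the evaluation repository, cloned and used further below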
export REPO=Word-Embeddings-Repository-for-Turkish
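# Fetch X2Static and install the dependencies for training and evaluation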
git clone https://github.com/epfml/X2Static.git
pip install tqdm torch transformers nltk gensim "tensorflow-gpu==2.8.0" scikit-learn torchmetrics matplotlib
# python X2Static/src/make_vocab_dataset.py --dataset_location combined.txt --location_save_vocab_dataset processed_data/
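# Build the vocabulary and the processed dataset: keep words occurring at
# least 10 times, capping the vocabulary at 750k entries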
python X2Static/src/make_vocab_dataset.py --dataset_location combined.txt --min_count 10 --max_vocab_size 750000 --location_save_vocab_dataset processed_data/
ls -l processed_data
# python X2Static/src/learn_from_bert_ver2_paragraph.py --pretrained_bert_model "dbmdz/bert-base-turkish-128k-cased" --location_dataset processed_data/ --model_folder x2static_model --gpu_id 0 --num_epochs 1 --lr 0.001 --algo SparseAdam --t 5e-6 --word_emb_size 768 --num_negatives 10
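# Learn static embeddings from the Turkish BERT model: 1 epoch, SparseAdam
# with lr 0.001, 768-dimensional vectors, 10 negative samples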
python X2Static/src/learn_from_bert_ver2_paragraph.py --gpu_id 0 --num_epochs 1 --lr 0.001 --algo SparseAdam --t 5e-6 --word_emb_size 768 --location_dataset processed_data/ --model_folder model/ --num_negatives 10 --pretrained_bert_model dbmdz/bert-base-turkish-128k-cased
ls -l
echo "model"
ls -l /model
mv /model/vectors_final.txt /bert-decontextualized-static.wv
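# Pin protobuf to 3.20.x, since protobuf 4.x is incompatible with
# tensorflow 2.8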
pip install protobuf==3.20.*
# Clone the evaluation repository and run the downstream benchmarks
git clone https://github.com/Turkish-Word-Embeddings/Word-Embeddings-Repository-for-Turkish.git
echo "NER"
cd /$REPO/NLP/NER
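# In each task directory, the two sed edits point the script's model folder
# at / (where the .wv file was placed) and load the vectors without a
# word2vec header line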
sed -i "s@FOLDER = \"C:/Users/karab/Desktop/Models\"@FOLDER = '/'@g" ner.py
sed -i 's/"no_header": False/"no_header": True/g' ner.py
python3 ner.py -w dc_bert
echo "PoS"
cd /$REPO/NLP/PoS
sed -i "s@FOLDER = \"C:/Users/karab/Desktop/Models\"@FOLDER = '/'@g" pos.py
sed -i 's/"no_header": False/"no_header": True/g' pos.py
python3 pos.py -w dc_bert
echo "SENTIMENT"
cd "/$REPO/NLP/Sentiment Analysis/"
sed -i 's@"C:/Users/karab/Desktop/Models"@"/"@g' sentiment.py
sed -i 's/"no_header": False/"no_header": True/g' sentiment.py
python3 sentiment.py -w dc_bert -d 1 -s 7
python3 sentiment.py -w dc_bert -d 1 -s 24
python3 sentiment.py -w dc_bert -d 1 -s 30
python3 sentiment.py -w dc_bert -d 2 -s 7
python3 sentiment.py -w dc_bert -d 2 -s 24
python3 sentiment.py -w dc_bert -d 2 -s 30
python3 sentiment.py -w dc_bert -d 3 -hs 196 -s 7
python3 sentiment.py -w dc_bert -d 3 -hs 196 -s 24
python3 sentiment.py -w dc_bert -d 3 -hs 196 -s 30