Attention is all you need.
- einops starter
- attentions
- multi-head causal attention
- multi-head cross attention
- multi-head grouped query attention (torch + einops)
- positional embeddings
- Low-Rank Adaptation (LoRA)
- implementing LoRA based on this wonderful tutorial by Sebastian Raschka
- finetuning LoRA-adapted deberta-v3-base on the IMDb dataset
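A minimal NumPy sketch of the LoRA idea listed above, kept dependency-light rather than copied from the notebook's torch code: the pretrained weight stays frozen and only a low-rank update `B @ A`, scaled by `alpha / r`, is trained. The class name `LoRALinear`, the shapes, and the values of `r` and `alpha` are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        self.weight = weight                                   # frozen pretrained weight, (out, in)
        self.A = rng.normal(0.0, 0.01, size=(r, in_features))  # trainable, small random init
        self.B = np.zeros((out_features, r))                   # trainable, zero init: update starts at 0
        self.scale = alpha / r

    def __call__(self, x):
        # x: (batch, in_features) -> (batch, out_features)
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32))
layer = LoRALinear(W, r=4, alpha=8)
x = rng.normal(size=(2, 32))
y = layer(x)

trainable = layer.A.size + layer.B.size   # 4*32 + 16*4
frozen = layer.weight.size                # 16*32
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` would receive gradients during finetuning.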
- LLaMA
- for process, check building_llama_complete.ipynb
- model implementation
- inference (used SmolLM2-135M-Instruct, which is based on the LLaMA architecture but is super small): code, kaggle
- super cool resource: LLMs From Scratch by Sebastian Raschka
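One building block of LLaMA-style models is RMSNorm, which rescales each vector by its root mean square instead of using LayerNorm's mean and variance. A small NumPy sketch, where the function name `rms_norm` and the `eps` value are assumptions rather than the notebook's code:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Divide each vector by its root mean square, then apply a learned gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([[3.0, 4.0], [1.0, -1.0]])
gain = np.ones(2)          # learned per-feature gain, initialised to 1
out = rms_norm(x, gain)
```

With a gain of 1, every output row has a mean square of approximately 1.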
- simple Vision Transformer
- for process, check building_ViT.ipynb
- model implementation
- used mean pooling instead of [class] token
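The mean-pooling choice above can be sketched in NumPy as follows. The helper `patchify`, the image and patch sizes, and the single matrix `W_embed` standing in for the whole encoder are assumptions for illustration, not the notebook's code.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into (num_patches, patch_size * patch_size * C) rows."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)            # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
tokens = patchify(image, patch_size=8)                    # (16, 192)

# Stand-in for the transformer encoder: any map that keeps the (num_patches, dim) layout.
W_embed = rng.normal(size=(192, 64))
encoded = tokens @ W_embed                                # (16, 64)

# Mean pooling: average over the patch axis instead of reading a [class] token.
pooled = encoded.mean(axis=0)                             # (64,)
```

The pooled vector is what a classification head would consume, so no extra token has to be prepended to the sequence.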
- GPT2
- for process, check buildingGPT2.ipynb
- model implementation
- built to support loading pretrained OpenAI/Hugging Face weights: gpt2-load-via-hf.ipynb
- for my own custom-trained causal LM, check out shakespeareGPT, although it is a bit more like GPT-1.
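What makes a GPT-style model a causal LM is the attention mask: each position may attend only to itself and earlier positions. A single-head NumPy sketch under that assumption (the function names are illustrative, and the repo's multi-head torch version will differ in detail):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.
    q, k, v: (seq_len, head_dim)."""
    seq_len, head_dim = q.shape
    scores = q @ k.T / np.sqrt(head_dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)                      # hide future positions
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(5, 8)) for _ in range(3))
out, weights = causal_attention(q, k, v)
```

The first position can only see itself, so its output equals `v[0]`; every weight above the diagonal is exactly zero.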
- OpenAI CLIP
- implemented ViT-B/32 variant
- for process, check building_clip.ipynb
- inference req: install clip for tokenization and preprocessing:
pip install git+https://github.com/openai/CLIP.git
- model implementation
- zero-shot inference code
- built in such a way that it supports loading pretrained openAI weights and IT WORKS!!!
- My lighter implementation of this, using existing image and language models and trained on the Flickr8k dataset, is available here: liteCLIP
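The zero-shot step boils down to comparing one image embedding against several prompt embeddings. A NumPy sketch with hand-made vectors standing in for the encoder outputs; the function name `zero_shot_probs`, the prompts, and the `logit_scale` of 100.0 (a stand-in for the learned temperature) are assumptions:

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Cosine similarity between one image and N prompts, turned into probabilities."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    logits = logit_scale * (text_embs @ image_emb)     # (N,)
    logits = logits - logits.max()                     # stabilise the softmax
    probs = np.exp(logits)
    return probs / probs.sum()

# Hand-made embeddings standing in for the text and image encoder outputs.
text_embs = np.array([[1.0, 0.0, 0.0],    # "a photo of a cat"
                      [0.0, 1.0, 0.0],    # "a photo of a dog"
                      [0.0, 0.0, 1.0]])   # "a photo of a car"
image_emb = np.array([0.1, 0.9, 0.2])
probs = zero_shot_probs(image_emb, text_embs)
best = int(np.argmax(probs))
```

Here the image vector points mostly along the second prompt, so `best` selects index 1.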
- Encoder Decoder Transformer
- for process, check building_encoder-decoder.ipynb
- model implementation
- src_mask for the encoder is optional but nice to have: it masks out the pad tokens so that attention ignores them.
- used learned positional embeddings instead of the sin/cos encodings used in the OG paper.
- I trained a model for multilingual machine translation.
- Translates English to Hindi and Telugu.
- change: single encoder & decoder embedding layer since I used a single tokenizer.
- for the code and results check: shreydan/multilingual-translation
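The role of `src_mask` described above can be sketched in NumPy: padded key positions are set to negative infinity before the softmax, so they receive zero attention weight. `PAD_ID`, the helper names, and the mask shape are assumptions, and the sketch assumes every sequence contains at least one real token.

```python
import numpy as np

PAD_ID = 0  # hypothetical pad token id

def make_src_mask(token_ids, pad_id=PAD_ID):
    """Return a (batch, 1, src_len) boolean mask: True for real tokens, False for padding."""
    return (token_ids != pad_id)[:, None, :]

def masked_attention_weights(scores, src_mask):
    """scores: (batch, tgt_len, src_len). Padded key positions get zero weight."""
    scores = np.where(src_mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

src = np.array([[5, 7, 9, 0, 0],
                [4, 2, 0, 0, 0]])
rng = np.random.default_rng(0)
scores = rng.normal(size=(2, 5, 5))
weights = masked_attention_weights(scores, make_src_mask(src))
```

The mask broadcasts over the query axis, so one mask per source sentence is enough for both encoder self-attention and decoder cross-attention.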
- BERT - MLM
- for process of masked language modeling, check masked-language-modeling.ipynb
- model implementation
- simplification: for pre-training no use of [CLS] & [SEP] tokens since I only built the model for masked language modeling and not for next sentence prediction.
- I trained an entire model on the Wikipedia dataset; more info in the shreydan/masked-language-modeling repo.
- once pretrained, the MLM head can be replaced with any other downstream task head.
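For readers new to masked language modeling, here is a NumPy sketch of the masking recipe from the original BERT paper (select 15% of tokens, then replace 80% of those with [MASK], 10% with a random token, and keep 10% unchanged). The repo's notebook may use a different scheme; `MASK_ID`, `VOCAB_SIZE`, and the `IGNORE` label value are hypothetical.

```python
import numpy as np

MASK_ID = 103      # hypothetical [MASK] id
VOCAB_SIZE = 1000  # hypothetical vocabulary size
IGNORE = -100      # label for positions that do not contribute to the loss

def mask_tokens(token_ids, rng, mask_prob=0.15):
    """Return (inputs, labels) following the 80/10/10 masking recipe."""
    inputs = token_ids.copy()
    labels = np.full_like(token_ids, IGNORE)

    selected = rng.random(token_ids.shape) < mask_prob
    labels[selected] = token_ids[selected]                 # predict only the selected positions

    roll = rng.random(token_ids.shape)
    to_mask = selected & (roll < 0.8)                      # 80%: replace with [MASK]
    to_random = selected & (roll >= 0.8) & (roll < 0.9)    # 10%: replace with a random token
    # remaining 10% of selected positions keep the original token

    inputs[to_mask] = MASK_ID
    inputs[to_random] = rng.integers(0, VOCAB_SIZE, size=int(to_random.sum()))
    return inputs, labels

rng = np.random.default_rng(0)
tokens = rng.integers(200, VOCAB_SIZE, size=(4, 64))
inputs, labels = mask_tokens(tokens, rng)
```

Only the selected positions carry a label, which is why the MLM head can later be swapped for a downstream head without touching the encoder.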
- ViT MAE
- Paper: Masked autoencoders are scalable vision learners
- model implementation
- for process, check: building-vitmae.ipynb
- Quite reliant on the original code released by authors.
- Only simplification: No [CLS] token so used mean pooling
- The trained model can be used in 2 ways:
- For pretraining: the decoder can be thrown away and the encoder can be used for downstream tasks
- For visualization: can be used to reconstruct masked images.
- I trained a smaller model for reconstruction visualization: ViTMAE on Animals Dataset
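The core of MAE is random patch masking: shuffle the patches, encode only a small visible subset, then restore the original order for the decoder. A NumPy sketch modelled on the argsort trick from the authors' released code; the 0.75 mask ratio follows the paper's default, while the shapes and the zero-valued `mask_token` are assumptions.

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """tokens: (num_patches, dim). Keep a random subset and return what is needed to unshuffle."""
    num_patches = tokens.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))

    noise = rng.random(num_patches)
    ids_shuffle = np.argsort(noise)           # random permutation of patch indices
    ids_restore = np.argsort(ids_shuffle)     # inverse permutation

    ids_keep = ids_shuffle[:num_keep]
    visible = tokens[ids_keep]                # only these go through the encoder

    mask = np.ones(num_patches)               # 1 = masked, 0 = visible
    mask[:num_keep] = 0
    mask = mask[ids_restore]                  # back to the original patch order
    return visible, mask, ids_restore

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
visible, mask, ids_restore = random_masking(tokens, mask_ratio=0.75, rng=rng)

# Decoder input: visible tokens plus placeholder mask tokens, unshuffled to the original order.
mask_token = np.zeros((1, 8))                 # stands in for a learned mask token
padded = np.concatenate([visible, np.repeat(mask_token, 16 - len(visible), axis=0)])
decoder_in = padded[ids_restore]
```

For pretraining only the encoder path matters; for visualization the decoder fills in the positions where `mask` is 1.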
- UNETR
- 3D segmentation model for medical domain
- Transformer based architecture, more info
- process: building_unetr
einops
torch
torchvision
numpy
matplotlib
pandas
God is our refuge and strength, a very present help in trouble.
Psalm 46:1