diff --git a/.asset/applications/detection.png b/.asset/applications/detection.png new file mode 100644 index 0000000..bc7fa0a Binary files /dev/null and b/.asset/applications/detection.png differ diff --git a/.asset/applications/spotting.png b/.asset/applications/spotting.png new file mode 100644 index 0000000..b53c956 Binary files /dev/null and b/.asset/applications/spotting.png differ diff --git a/README.md b/README.md index f9e8739..1c04b08 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,10 @@

-This is the official repository of the paper [Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation](https://arxiv.org/abs/2401.17904). +> [Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation](https://arxiv.org/abs/2401.17904). +> [arXiv preprint] + +This is the official repository for Hi-SAM, a unified hierarchical text segmentation model. Refer to our paper for more details. ## :sparkles: Highlight @@ -14,21 +17,201 @@ This is the official repository of the paper [Hi-SAM: Marrying Segment Anything - **High-Quality Text Stroke Segmentation & Stroke Labeling Assistant.** High-quality text stroke segmentation by introducing mask feature of 1024×1024 resolution with minimal modification in SAM's original mask decoder. - **Automatic and Interactive.** Hi-SAM supports both automatic mask generation and interactive promptable mode. Given a single-point prompt, Hi-SAM provides word, text-line, and paragraph masks. + ## :bulb: Overview of Hi-SAM ![Hi-SAM](.asset/Hi-SAM.png) + +## :fire: News +- **[`2024/02/0x`]**: Inference and evaluation codes are released. Checkpoints are available. Some applications are provided. + + +## :hammer_and_wrench: Install + +**Recommended**: `Linux` `Python 3.8` `Pytorch 1.10` `CUDA 11.1` + +``` +conda create --name hi_sam python=3.8 -y +conda activate hi_sam +pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html +git clone https://github.com/ymy-k/Hi-SAM.git +cd Hi-SAM +pip install -r requirements.txt +``` + +## :pushpin: Checkpoints + +You can download the following model weights and put them in `pretrained_checkpoint/`. + +- **SAM-TSS (only for text stroke segmentation)** + +|Model|Used Dataset|Weights|fgIOU|F-score| +|:------:|:------:|:------:|:------:|:------:| +|SAM-TSS-B|Total-Text|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgcoycYfJS3jn8Zi5aQ?e=qsGFu4)|80.93|86.25| +|SAM-TSS-L|Total-Text|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgco3-gIk1_LtCjaUZg?e=RgYfYu)|84.59|88.69| +|SAM-TSS-H|Total-Text|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgco0WBDWOp4Va6mS0w?e=QPvtGL)|84.86|89.68| + +|Model|Used Dataset|Weights|fgIOU|F-score| +|:------:|:------:|:------:|:------:|:------:| +|SAM-TSS-B|TextSeg|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgcow1U3IUTVOZoQPgQ?e=XGmher)|87.15|92.81| +|SAM-TSS-L|TextSeg|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgco2ogP59MnXtR6bSw?e=5iq3Rt)|88.77|93.79| +|SAM-TSS-H|TextSeg|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgco1z9sdUi1vXCsKgA?e=U3WPJy)|88.96|93.87| + +|Model|Used Dataset|Weights|fgIOU|F-score| +|:------:|:------:|:------:|:------:|:------:| +|SAM-TSS-B|HierText|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgcouDoj3oDJtpDKxtQ?e=teOm4L)|73.39|81.34| +|SAM-TSS-L|HierText|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgco4WC_2I3pMGOCcKg?e=y4XcNi)|78.37|84.99| +|SAM-TSS-H|HierText|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgcozVtoOKCQL_HCmsQ?e=fI6e1o)|79.27|85.63| + +- **Hi-SAM** + +|Model|Used Dataset|Weights|Stroke F-score|Word F-score|Text-Line F-score|Paragraph F-score| +|:------:|:------:|:------:|:------:|:------:|:------:|:------:| +|Hi-SAM-B|HierText|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgcosk3ZK1dImhxaW9g?e=xTsegH)|79.78|78.34|82.15|71.15| +|Hi-SAM-L|HierText|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgcovMjJKfH6baFBTGw?e=T3IrUf)|82.90|81.83|84.85|74.49| +|Hi-SAM-H|HierText|[OneDrive](https://1drv.ms/u/s!AimBgYV7JjTlgcoxoNjp1IG7xitzrg?e=0z4QhJ)|83.36|82.86|85.30|75.97| + +The results of Hi-SAM on the test set are reported 
here.
+
+:star: **`Note`:** To speed up downloading and save storage, **the above checkpoints do not contain the parameters of SAM's ViT image encoder**. Please follow [segment-anything](https://github.com/facebookresearch/segment-anything) to obtain `sam_vit_b_01ec64.pth`, `sam_vit_l_0b3195.pth`, and `sam_vit_h_4b8939.pth`, and put them in `pretrained_checkpoint/` so that the frozen ViT image encoder parameters can be loaded.
+
+## :arrow_forward: Usage
+
+### **1. Visualization Demo**
+
+**1.1 Text Stroke Segmentation (for SAM-TSS & Hi-SAM):**
+
+```
+python demo_hisam.py --checkpoint pretrained_checkpoint/sam_tss_l_hiertext.pth --model-type vit_l --input demo/2e0cb33320757201.jpg --output demo/
+```
+
+- `--checkpoint`: the model path.
+- `--model-type`: the backbone type. Use `vit_b` for the ViT-Base backbone, `vit_l` for ViT-Large, `vit_h` for ViT-Huge.
+- `--input`: the input image path.
+- `--output`: the output image path or folder.
+
+To get better quality on small text with sliding-window inference, run the following script:
+
+```
+python demo_hisam.py --checkpoint pretrained_checkpoint/sam_tss_l_hiertext.pth --model-type vit_l --input demo/2e0cb33320757201.jpg --output demo/2e0cb33320757201_sliding.png --patch_mode
+```
+
+- `--patch_mode`: enables sliding-window inference. The default patch size is 512×512 and the stride is 384 (set for HierText). You can adjust these settings for better results on your data.
+
+**1.2 Word, Text-Line, and Paragraph Segmentation (for Hi-SAM)**
+
+Run the following script for promptable segmentation on `demo/img293.jpg`:
+
+```
+python demo_hisam.py --checkpoint pretrained_checkpoint/hi_sam_l.pth --model-type vit_l --input demo/img293.jpg --output demo/ --hier_det
+```
+
+- `--hier_det`: enables hierarchical segmentation. Given a single-point prompt, Hi-SAM predicts a word mask, a text-line mask, and a paragraph mask. See `demo_hisam.py` for the point position and details.
+
+### **2. Evaluation**
+
+Please follow [data_preparation.md](datasets/data_preparation.md) to prepare the datasets first.
+
+**2.1 Text Stroke Segmentation (for SAM-TSS & Hi-SAM)**
+
+To evaluate only the text stroke segmentation performance, run the following script:
+
+```
+python -m torch.distributed.launch --nproc_per_node=8 train.py --checkpoint --model-type --val_datasets hiertext_test --eval
+```
+
+- `--nproc_per_node`: you can use multiple GPUs for faster evaluation.
+- `--val_datasets`: the dataset used for evaluation. Use `totaltext_test` for the test set of Total-Text, `textseg_test` for the test set of TextSeg, and `hiertext_test` for the test set of HierText.
+- `--eval`: use evaluation mode.
+
+To evaluate on HierText with sliding-window inference, run the following scripts:
+
+```
+mkdir img_eval
+python demo_hisam.py --checkpoint --model-type --input datasets/HierText/test/ --output img_eval/ --patch_mode
+python eval_img.py
+```
+
+*Sliding-window inference takes a relatively long time. To speed it up, you can split the test images into multiple folders and run inference on each folder with a separate GPU (a folder-splitting sketch is given below, before Step 1).*
+
+**2.2 Hierarchical Text Segmentation (for Hi-SAM)**
+
+For stroke-level performance, please follow **section 2.1**. For word, text-line, and paragraph level performance on HierText, please follow the steps below.
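+
+*Both the sliding-window evaluation above and Step 1 below suggest splitting the images into multiple folders for multi-GPU inference. The snippet below is a minimal, illustrative sketch of such a split; the part-folder naming (`datasets/HierText/test_part0`, ...) and the number of parts are assumptions rather than part of the official pipeline.*
+
+```python
+# Hypothetical helper: split datasets/HierText/test/ into several part folders so
+# that each part can be fed to demo_hisam.py / demo_amg.py on a separate GPU.
+import os
+import shutil
+
+src_dir = "datasets/HierText/test"      # original image folder
+num_parts = 4                           # e.g., one part per GPU
+images = sorted(os.listdir(src_dir))
+
+for part in range(num_parts):
+    part_dir = f"{src_dir}_part{part}"  # assumed naming, adjust as needed
+    os.makedirs(part_dir, exist_ok=True)
+    for name in images[part::num_parts]:  # round-robin assignment
+        shutil.copy(os.path.join(src_dir, name), os.path.join(part_dir, name))
+```
+
+Each part folder can then be passed to `--input` in a separate process (e.g., selecting the GPU with `CUDA_VISIBLE_DEVICES`), and the per-image outputs or jsonl files can be merged afterwards.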
+
+**Step 1:** run the following scripts to get the required jsonl file:
+
+```
+python demo_amg.py --checkpoint --model-type --input datasets/HierText/test/ --total_points 1500 --batch_points 100 --eval
+cd hiertext_eval
+python collect_results.py --saved_name res_1500pts.jsonl
+```
+
+*For faster inference, you can split the test or validation images into multiple folders and run inference on each folder with a separate GPU.*
+
+- `--input`: the test or validation image folder of HierText.
+- `--total_points`: the number of foreground points sampled per image. The default is 1500.
+- `--batch_points`: the number of points processed by the H-Decoder per batch. Adjust it according to your GPU memory.
+- `--eval`: use evaluation mode. For each image, the prediction results are saved as a jsonl file in `hiertext_eval/res_per_img/`.
+- `--saved_name`: the name of the merged jsonl file used for website submission or offline evaluation. The per-image jsonl files are merged into this single file.
+
+**Step 2:** if you run inference on the test set of HierText, submit the final jsonl file to [the official website](https://rrc.cvc.uab.es/?ch=18&com=mymethods&task=1) to obtain the evaluation metrics. If you run inference on the validation set: (1) follow the [HierText repo](https://github.com/google-research-datasets/hiertext) to download the validation ground truth `validation.jsonl` and put it in `hiertext_eval/gt/`. (2) Run the following script, borrowed from the [HierText repo](https://github.com/google-research-datasets/hiertext), to get the evaluation metrics:
+
+```
+python eval.py --gt=gt/validation.jsonl --result=res_1500pts.jsonl --output=score.txt --mask_stride=1 --eval_lines --eval_paragraphs
+cd ..
+```
+
+The evaluation takes about 20 minutes. The metrics are saved in the txt file specified by `--output`.
+
+
+## :eye: Applications
+
+### **1. Promptable Multi-granularity Text Erasing and Inpainting**
+
+Hi-SAM can be combined with [Stable-Diffusion-inpainting](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/inpaint) for interactive text erasing and inpainting: click a single point to erase and inpaint a word, text line, or paragraph. See [this project](https://github.com/yeungchenwa/OCR-SAM) for an implementation that combines Hi-SAM with Stable Diffusion.
+
+### **2. Text Detection**
+
+Hi-SAM can be adapted to word-level or text-line-level text detection (one level per model). It directly segments the intact text instance region instead of a shrunk text kernel region.
+
+![detection](.asset/applications/detection.png)
+
+Two demo models are provided: [word_detection_totaltext.pth](https://1drv.ms/u/s!AimBgYV7JjTlgco6PgIiYeItOjffnA?e=qb6G0s) (trained on Total-Text, word detection only) and [line_detection_ctw1500.pth](https://1drv.ms/u/s!AimBgYV7JjTlgco5llba2msYi3eWXg?e=zKLX4n) (trained on CTW1500, text-line detection only). Put them in `pretrained_checkpoint/`. Then, for example, run the following script for word detection:
+
+```
+python demo_text_detection.py --checkpoint pretrained_checkpoint/word_detection_totaltext.pth --model-type vit_h --input demo/img643.jpg --output demo/ --dataset totaltext
+```
+
+For text-line detection:
+
+```
+python demo_text_detection.py --checkpoint pretrained_checkpoint/line_detection_ctw1500.pth --model-type vit_h --input demo/1165.jpg --output demo/ --dataset ctw1500
+```
+
+### **3. 
Promptable Scene Text Spotting** + +Combination with a single-point scene text spotter, [SPTSv2](https://github.com/bytedance/SPTSv2). SPTSv2 can recognize scene texts but only predicts a single-point position for one instance. Providing the point position as prompt to Hi-SAM, the intact text mask can be achieved. Some demo figures are provided bellow, the green stars indicate the point prompts. The masks are generated by the word detection model in section **2. Text Detection**. + +![spotting](.asset/applications/spotting.png) + ## :label: TODO -- [ ] Release demo. -- [ ] Release training and inference codes. +- [x] Release inference and evaluation codes. +- [x] Release model weights. +- [ ] Release training codes +- [ ] Release online demo. + ## 💗 Acknowledgement - [segment-anything](https://github.com/facebookresearch/segment-anything) - [HierText](https://github.com/google-research-datasets/hiertext), [Total-Text](https://github.com/cs-chan/Total-Text-Dataset), [TextSeg](https://github.com/SHI-Labs/Rethinking-Text-Segmentation) + ## :black_nib: Citation +If you find Hi-SAM helpful in your research, please consider giving this repository a :star: and citing: + ```bibtex @article{ye2024hi-sam, title={Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation}, diff --git a/datasets/HierText/example.gif b/datasets/HierText/example.gif new file mode 100644 index 0000000..229910a Binary files /dev/null and b/datasets/HierText/example.gif differ diff --git a/datasets/TextSeg/process_textseg.py b/datasets/TextSeg/process_textseg.py new file mode 100644 index 0000000..ad60030 --- /dev/null +++ b/datasets/TextSeg/process_textseg.py @@ -0,0 +1,49 @@ +import copy +import math +import time + +import skimage +import torch +import io +from pycocotools.coco import COCO +from pycocotools import mask as maskutils +from glob import glob +from tqdm import tqdm +import numpy as np +import torch +import sys, os +import matplotlib.pyplot as plt +import cv2 +import random +import json +from PIL import Image, TarIO +import contextlib +import torch.nn.functional as F +import pyclipper +from shapely.geometry import Polygon +import shutil +import skimage + + +os.makedirs('train_images', exist_ok=True) +os.makedirs('train_gt', exist_ok=True) +os.makedirs('val_images', exist_ok=True) +os.makedirs('val_gt', exist_ok=True) +os.makedirs('test_images', exist_ok=True) +os.makedirs('test_gt', exist_ok=True) + +with open('split.json') as f_json: + split_info = json.load(f_json) +train_img_list = split_info['train'] +val_img_list = split_info['val'] +test_img_list = split_info['test'] + +for im_id in train_img_list: + shutil.copy(os.path.join('image', im_id+'.jpg'), os.path.join('train_images', im_id+'.jpg')) + shutil.copy(os.path.join('semantic_label', im_id + '_maskfg.png'), os.path.join('train_gt', im_id + '.png')) +for im_id in val_img_list: + shutil.copy(os.path.join('image', im_id+'.jpg'), os.path.join('val_images', im_id+'.jpg')) + shutil.copy(os.path.join('semantic_label', im_id + '_maskfg.png'), os.path.join('val_gt', im_id + '.png')) +for im_id in test_img_list: + shutil.copy(os.path.join('image', im_id+'.jpg'), os.path.join('test_images', im_id+'.jpg')) + shutil.copy(os.path.join('semantic_label', im_id + '_maskfg.png'), os.path.join('test_gt', im_id + '.png')) diff --git a/datasets/data_preparation.md b/datasets/data_preparation.md new file mode 100644 index 0000000..dc8721a --- /dev/null +++ b/datasets/data_preparation.md @@ -0,0 +1,42 @@ +## Data Preparation + +### Step1. 
Download + +- **HierText**. Follow [the official repo of HierText](https://github.com/google-research-datasets/hiertext) to download the dataset images. I label and provide the text stroke segmentation ground-truths (png format, binary, 0 for background, 255 for text foreground), which can be downloaded with the following OneDrive links: [train_gt (131MB)](https://1drv.ms/u/s!AimBgYV7JjTlgcorK9fmoBp7QImvww?e=zRiNKL), [validation_gt (26MB)](https://1drv.ms/u/s!AimBgYV7JjTlgcooQOfgKDidqWvrAw?e=7NCHiC), [test_gt (25MB)](https://1drv.ms/u/s!AimBgYV7JjTlgcopZbsovlW6JVjomA?e=qw8Dht). + +![example](HierText/example.gif) + +- **Total-Text**. Follow [the official repo of Total-Text](https://github.com/cs-chan/Total-Text-Dataset) to download the dataset. For text stroke segmentation, please download [the character level mask ground-truths](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Pixel/Character%20Level%20Mask). +- **TextSeg**. Follow [the official repo of TextSeg](https://github.com/SHI-Labs/Rethinking-Text-Segmentation) to apply for the dataset. + +### Step2. Process & Organization + +(1) For Total-Text, rename` groundtruth_pixel/Train/img61.JPG` to ` groundtruth_pixel/Train/img61.jpg` . + +(2) For TextSeg, see ` TextSeg/process_textseg.py` and use it to split the original data. + +(3) Organize the datasets as the following structure: + +``` +|- HierText +| |- train +| |- train_gt +| |- validation +| |- validation_gt +| |- test +| └ test_gt +|- TotalText +| |- groundtruth_pixel +| |- Test +| └ Train +| └ Images +| |- Test +| └ Train +|- TextSeg +| |- train_images +| |- train_gt +| |- val_images +| |- val_gt +| |- test_images +| └ test_gt +``` \ No newline at end of file diff --git a/demo/1165.jpg b/demo/1165.jpg new file mode 100644 index 0000000..e331904 Binary files /dev/null and b/demo/1165.jpg differ diff --git a/demo/2e0cb33320757201.jpg b/demo/2e0cb33320757201.jpg new file mode 100644 index 0000000..2c99e54 Binary files /dev/null and b/demo/2e0cb33320757201.jpg differ diff --git a/demo/img293.jpg b/demo/img293.jpg new file mode 100644 index 0000000..db50da4 Binary files /dev/null and b/demo/img293.jpg differ diff --git a/demo/img643.jpg b/demo/img643.jpg new file mode 100644 index 0000000..fac5272 Binary files /dev/null and b/demo/img643.jpg differ diff --git a/demo_amg.py b/demo_amg.py new file mode 100644 index 0000000..8c8269d --- /dev/null +++ b/demo_amg.py @@ -0,0 +1,215 @@ +import json +import sys + +import numpy as np +import torch +import matplotlib.pyplot as plt +import cv2 +import skimage +import os +import argparse +from hi_sam.modeling.build import model_registry +from hi_sam.modeling.auto_mask_generator import AutoMaskGenerator +import glob +from tqdm import tqdm +from PIL import Image +import random +from utils import utilities +from shapely.geometry import Polygon +import pyclipper +import datetime +import warnings +warnings.filterwarnings("ignore") + + +def get_args_parser(): + parser = argparse.ArgumentParser('Hi-SAM', add_help=False) + + parser.add_argument("--input", type=str, required=True, nargs="+", + help="Path to the input image") + parser.add_argument("--output", type=str, default='./demo', + help="A file or directory to save output visualizations.") + parser.add_argument("--existing_fgmask_input", type=str, default='./datasets/HierText/val_fgmask/', + help="A file or directory of foreground masks.") + parser.add_argument("--model-type", type=str, default="vit_l", + help="The type of model to load, in ['vit_h', 'vit_l', 'vit_b']") + 
parser.add_argument("--checkpoint", type=str, required=True, + help="The path to the SAM checkpoint to use for mask generation.") + parser.add_argument("--device", type=str, default="cuda", + help="The device to run generation on.") + parser.add_argument("--hier_det", default=True) + parser.add_argument("--use_fgmask", action='store_true') + parser.add_argument("--vis", action='store_true') + parser.add_argument("--eval", action='store_true') + parser.add_argument("--eval_out_file", type=str, default='./hiertext_eval/res_per_img', + help="A file or directory to save results per image.") + parser.add_argument('--total_points', default=1500, type=int, help='The number of foreground points') + parser.add_argument('--batch_points', default=100, type=int, help='The number of points per batch') + + parser.add_argument('--seed', default=42, type=int) + parser.add_argument('--input_size', default=[1024, 1024], type=list) + + # self-prompting + parser.add_argument('--attn_layers', default=1, type=int, + help='The number of image to token cross attention layers in model_aligner') + parser.add_argument('--prompt_len', default=12, type=int, help='The number of prompt token') + parser.add_argument('--layout_thresh', type=float, default=0.5) + return parser.parse_args() + + +def show_points(coords, ax, marker_size=40): + ax.scatter(coords[:, 0], coords[:, 1], color='green', marker='*', s=marker_size, edgecolor='white', linewidth=0.25) + + +def show_mask(mask, ax, random_color=False, color=None): + if random_color: + color = np.concatenate([np.random.random(3), np.array([0.5])], axis=0) + else: + color = color if color is not None else np.array([30/255, 144/255, 255/255, 0.5]) + h, w = mask.shape[-2:] + mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1) + ax.imshow(mask_image) + + +def show_hi_masks(masks, filename, image): + plt.figure(figsize=(15, 15), dpi=200) + plt.imshow(image) + + for i, hi_mask in enumerate(masks): + hi_mask = hi_mask[0] + show_mask(hi_mask, plt.gca(), random_color=True) + plt.axis('off') + plt.savefig(filename, bbox_inches='tight') + plt.close() + + +def save_binary_mask(mask: np.array, filename): + if len(mask.shape) == 3: + assert mask.shape[0] == 1 + mask = mask[0].astype(np.uint8)*255 + elif len(mask.shape) == 2: + mask = mask.astype(np.uint8)*255 + else: + raise NotImplementedError + mask = Image.fromarray(mask) + mask.save(filename) + + +def unclip(p, unclip_ratio=2.0): + poly = Polygon(p) + distance = poly.area * unclip_ratio / poly.length + offset = pyclipper.PyclipperOffset() + offset.AddPath(p, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON) + expanded = np.array(offset.Execute(distance)) + return expanded + + +if __name__ == '__main__': + args = get_args_parser() + seed = args.seed + torch.manual_seed(seed) + np.random.seed(seed) + random.seed(seed) + torch.cuda.manual_seed(seed) + torch.cuda.manual_seed_all(seed) + hisam = model_registry[args.model_type](args) + hisam.eval() + hisam.to(args.device) + print("Loaded model") + amg = AutoMaskGenerator(hisam) + none_num = 0 + + if args.eval: + os.makedirs(args.eval_out_file, exist_ok=True) + if os.path.isdir(args.input[0]): + args.input = [os.path.join(args.input[0], fname) for fname in os.listdir(args.input[0])] + elif len(args.input) == 1: + args.input = glob.glob(os.path.expanduser(args.input[0])) + assert args.input, "The input path(s) was not found" + for path in tqdm(args.input, disable=not args.output): + img_id = os.path.basename(path).split('.')[0] + if os.path.isdir(args.output): + assert 
os.path.isdir(args.output), args.output + img_name = img_id + '.png' + out_filename = os.path.join(args.output, img_name) + else: + assert len(args.input) == 1 + out_filename = args.output + + image = cv2.imread(path) + image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # h, w, 3 + img_h, img_w = image.shape[:2] + + if args.use_fgmask: + fgmask_path = os.path.join(args.existing_fgmask_input, img_id+'.png') + fgmask = skimage.io.imread(fgmask_path) + amg.set_fgmask(fgmask) + + amg.set_image(image) + + masks, scores, affinity = amg.predict( + from_low_res=False, + fg_points_num=args.total_points, + batch_points_num=args.batch_points, + score_thresh=0.5, + nms_thresh=0.5, + ) # only return word masks here for evaluation + + if args.eval: + if masks is None: + lines = [{'words': [{'text': '', 'vertices': [[0,0],[1,0],[1,1],[0,1]]}], 'text': ''}] + paragraphs = [{'lines': lines}] + result = { + 'image_id': img_id, + "paragraphs": paragraphs + } + none_num += 1 + else: + masks = (masks[:, 0, :, :]).astype(np.uint8) # word masks, (n, h, w) + lines = [] + line_indices = [] + for index, mask in enumerate(masks): + line = {'words': [], 'text': ''} + contours, _ = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE) + for cont in contours: + epsilon = 0.002 * cv2.arcLength(cont, True) + approx = cv2.approxPolyDP(cont, epsilon, True) + points = approx.reshape((-1, 2)) + if points.shape[0] < 4: + continue + pts = unclip(points) + if len(pts) != 1: + continue + pts = pts[0].astype(np.int32) + if Polygon(pts).area < 32: + continue + pts[:, 0] = np.clip(pts[:, 0], 0, img_w) + pts[:, 1] = np.clip(pts[:, 1], 0, img_h) + cnt_list = pts.tolist() + line['words'].append({'text': '', 'vertices': cnt_list}) + if line['words']: + lines.append(line) + line_indices.append(index) + + line_grouping = utilities.DisjointSet(len(line_indices)) + affinity = affinity[line_indices][:, line_indices] + for i1, i2 in zip(*np.where(affinity > args.layout_thresh)): + line_grouping.union(i1, i2) + line_groups = line_grouping.to_group() + paragraphs = [] + for line_group in line_groups: + paragraph = {'lines': []} + for id_ in line_group: + paragraph['lines'].append(lines[id_]) + if paragraph: + paragraphs.append(paragraph) + result = { + 'image_id': img_id, + "paragraphs": paragraphs + } + with open(os.path.join(args.eval_out_file, img_id+'.jsonl'), 'w', encoding='utf-8') as fw: + json.dump(result, fw) + fw.close() + + if args.eval: + print(f'{none_num} images without predictions.') diff --git a/demo_hisam.py b/demo_hisam.py new file mode 100644 index 0000000..45c8562 --- /dev/null +++ b/demo_hisam.py @@ -0,0 +1,245 @@ +import sys + +import numpy as np +import torch +import matplotlib.pyplot as plt +import cv2 +import os +import argparse +from hi_sam.modeling.build import model_registry +from hi_sam.modeling.predictor import SamPredictor +import glob +from tqdm import tqdm +from PIL import Image +from shapely.geometry import Polygon +import pyclipper +import warnings +warnings.filterwarnings("ignore") + + +def get_args_parser(): + parser = argparse.ArgumentParser('Hi-SAM', add_help=False) + + parser.add_argument("--input", type=str, required=True, nargs="+", + help="Path to the input image") + parser.add_argument("--output", type=str, default='./demo', + help="A file or directory to save output visualizations.") + parser.add_argument("--model-type", type=str, default="vit_l", + help="The type of model to load, in ['vit_h', 'vit_l', 'vit_b']") + parser.add_argument("--checkpoint", type=str, required=True, + help="The path to 
the SAM checkpoint to use for mask generation.") + parser.add_argument("--device", type=str, default="cuda", + help="The device to run generation on.") + parser.add_argument("--hier_det", action='store_true', + help="If False, only text stroke segmentation.") + + parser.add_argument('--input_size', default=[1024,1024], type=list) + parser.add_argument('--patch_mode', action='store_true') + + # self-prompting + parser.add_argument('--attn_layers', default=1, type=int, + help='The number of image to token cross attention layers in model_aligner') + parser.add_argument('--prompt_len', default=12, type=int, help='The number of prompt token') + + return parser.parse_args() + + +def patchify(image: np.array, patch_size: int=256): + h, w = image.shape[:2] + patch_list = [] + h_num, w_num = h//patch_size, w//patch_size + h_remain, w_remain = h%patch_size, w%patch_size + row, col = h_num + int(h_remain>0), w_num + int(w_remain>0) + h_slices = [[r * patch_size, (r + 1) * patch_size] for r in range(h_num)] + if h_remain: + h_slices = h_slices + [[h - h_remain, h]] + h_slices = np.tile(h_slices, (1, col)).reshape(-1, 2).tolist() + w_slices = [[i * patch_size, (i + 1) * patch_size] for i in range(w_num)] + if w_remain: + w_slices = w_slices + [[w-w_remain, w]] + w_slices = w_slices * row + assert len(w_slices) == len(h_slices) + for idx in range(0, len(w_slices)): + # from left to right, then from top to bottom + patch_list.append(image[h_slices[idx][0]:h_slices[idx][1], w_slices[idx][0]:w_slices[idx][1], :]) + return patch_list, row, col + + +def unpatchify(patches, row, col): + # return np.array + whole = [np.concatenate(patches[r*col : (r+1)*col], axis=1) for r in range(row)] + whole = np.concatenate(whole, axis=0) + return whole + + +def patchify_sliding(image: np.array, patch_size: int=512, stride: int=256): + h, w = image.shape[:2] + patch_list = [] + h_slice_list = [] + w_slice_list = [] + for j in range(0, h, stride): + start_h, end_h = j, j+patch_size + if end_h > h: + start_h = max(h - patch_size, 0) + end_h = h + for i in range(0, w, stride): + start_w, end_w = i, i+patch_size + if end_w > w: + start_w = max(w - patch_size, 0) + end_w = w + h_slice = slice(start_h, end_h) + h_slice_list.append(h_slice) + w_slice = slice(start_w, end_w) + w_slice_list.append(w_slice) + patch_list.append(image[h_slice, w_slice]) + + return patch_list, h_slice_list, w_slice_list + + +def unpatchify_sliding(patch_list, h_slice_list, w_slice_list, ori_size): + assert len(ori_size) == 2 # (h, w) + whole_logits = np.zeros(ori_size) + assert len(patch_list) == len(h_slice_list) + assert len(h_slice_list) == len(w_slice_list) + for idx in range(len(patch_list)): + h_slice = h_slice_list[idx] + w_slice = w_slice_list[idx] + whole_logits[h_slice, w_slice] += patch_list[idx] + + return whole_logits + + +def show_points(coords, ax, marker_size=200): + ax.scatter(coords[0], coords[1], color='green', marker='*', s=marker_size, edgecolor='white', linewidth=0.25) + + +def show_mask(mask, ax, random_color=False, color=None): + if random_color: + color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0) + else: + color = color if color is not None else np.array([30/255, 144/255, 255/255, 0.5]) + h, w = mask.shape[-2:] + mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1) + ax.imshow(mask_image) + + +def show_res(masks, scores, filename, image): + for i, (mask, score) in enumerate(zip(masks, scores)): + plt.figure(figsize=(10, 10)) + plt.imshow(image) + show_mask(mask, plt.gca()) + + print(f"Score: 
{score:.3f}") + plt.axis('off') + plt.savefig(filename, bbox_inches='tight', pad_inches=-0.1) + plt.close() + + +def show_hi_masks(masks, word_masks, input_points, filename, image, scores): + plt.figure(figsize=(15, 15)) + plt.imshow(image) + for i, (line_para_masks, word_mask, hi_score, point) in enumerate(zip(masks, word_masks, scores, input_points)): + line_mask = line_para_masks[0] + para_mask = line_para_masks[1] + show_mask(para_mask, plt.gca(), color=np.array([255 / 255, 144 / 255, 30 / 255, 0.5])) + show_mask(line_mask, plt.gca()) + word_mask = word_mask[0].astype(np.uint8) + contours, _ = cv2.findContours(word_mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE) + select_word = None + for cont in contours: + epsilon = 0.002 * cv2.arcLength(cont, True) + approx = cv2.approxPolyDP(cont, epsilon, True) + points = approx.reshape((-1, 2)) + if points.shape[0] < 4: + continue + pts = unclip(points) + if len(pts) != 1: + continue + pts = pts[0].astype(np.int32) + if cv2.pointPolygonTest(pts, (int(point[0]), int(point[1])), False) >= 0: + select_word = pts + break + if select_word is not None: + word_mask = cv2.fillPoly(np.zeros(word_mask.shape), [select_word], 1) + show_mask(word_mask, plt.gca(), color=np.array([30 / 255, 255 / 255, 144 / 255, 0.5])) + show_points(point, plt.gca()) + print(f'point {i}: line {hi_score[1]}, para {hi_score[2]}') + + plt.axis('off') + plt.savefig(filename, bbox_inches='tight', pad_inches=0) + plt.close() + + +def save_binary_mask(mask: np.array, filename): + if len(mask.shape) == 3: + assert mask.shape[0] == 1 + mask = mask[0].astype(np.uint8)*255 + elif len(mask.shape) == 2: + mask = mask.astype(np.uint8)*255 + else: + raise NotImplementedError + mask = Image.fromarray(mask) + mask.save(filename) + +def unclip(p, unclip_ratio=2.0): + poly = Polygon(p) + distance = poly.area * unclip_ratio / poly.length + offset = pyclipper.PyclipperOffset() + offset.AddPath(p, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON) + expanded = np.array(offset.Execute(distance)) + return expanded + + +if __name__ == '__main__': + args = get_args_parser() + hisam = model_registry[args.model_type](args) + hisam.eval() + hisam.to(args.device) + predictor = SamPredictor(hisam) + + if os.path.isdir(args.input[0]): + args.input = [os.path.join(args.input[0], fname) for fname in os.listdir(args.input[0])] + elif len(args.input) == 1: + args.input = glob.glob(os.path.expanduser(args.input[0])) + assert args.input, "The input path(s) was not found" + for path in tqdm(args.input, disable=not args.output): + if os.path.isdir(args.output): + assert os.path.isdir(args.output), args.output + img_name = os.path.basename(path).split('.')[0] + '.png' + out_filename = os.path.join(args.output, img_name) + else: + assert len(args.input) == 1 + out_filename = args.output + + image = cv2.imread(path) + image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) + + if args.patch_mode: + ori_size = image.shape[:2] + patch_list, h_slice_list, w_slice_list = patchify_sliding(image, 512, 384) # sliding window config + mask_512 = [] + for patch in tqdm(patch_list): + predictor.set_image(patch) + m, hr_m, score, hr_score = predictor.predict(multimask_output=False, return_logits=True) + assert hr_m.shape[0] == 1 # high-res mask + mask_512.append(hr_m[0]) + mask_512 = unpatchify_sliding(mask_512, h_slice_list, w_slice_list, ori_size) + assert mask_512.shape[-2:] == ori_size + mask = mask_512 + mask = mask > predictor.model.mask_threshold + save_binary_mask(mask, out_filename) + else: + predictor.set_image(image) + if 
args.hier_det: + input_point = np.array([[125, 275]]) # for demo/img293.jpg + input_label = np.ones(input_point.shape[0]) + mask, hr_mask, score, hr_score, hi_mask, hi_iou, word_mask = predictor.predict( + multimask_output=False, + hier_det=True, + point_coords=input_point, + point_labels=input_label, + ) + show_hi_masks(hi_mask, word_mask, input_point, out_filename, image, hi_iou) + else: + mask, hr_mask, score, hr_score = predictor.predict(multimask_output=False) + save_binary_mask(hr_mask, out_filename) \ No newline at end of file diff --git a/demo_text_detection.py b/demo_text_detection.py new file mode 100644 index 0000000..6a761b0 --- /dev/null +++ b/demo_text_detection.py @@ -0,0 +1,206 @@ +import json +import sys + +import numpy as np +import torch +import matplotlib.pyplot as plt +import cv2 +import skimage +import os +import argparse +from hi_sam.modeling.build import model_registry +from hi_sam.modeling.auto_mask_generator import AutoMaskGenerator +import glob +from tqdm import tqdm +from PIL import Image +import random +from utils import utilities +from shapely.geometry import Polygon +import pyclipper +import datetime +import warnings +warnings.filterwarnings("ignore") + + +def get_args_parser(): + parser = argparse.ArgumentParser('Hi-SAM', add_help=False) + + parser.add_argument("--input", type=str, required=True, nargs="+", + help="Path to the input image") + parser.add_argument("--output", type=str, default='./demo', + help="A file or directory to save output visualizations.") + parser.add_argument("--model-type", type=str, default="vit_l", + help="The type of model to load, in ['vit_h', 'vit_l', 'vit_b']") + parser.add_argument("--checkpoint", type=str, required=True, + help="The path to the SAM checkpoint to use for mask generation.") + parser.add_argument("--device", type=str, default="cuda", + help="The device to run generation on.") + parser.add_argument("--hier_det", default=True) + parser.add_argument("--dataset", type=str, required=True, default='totaltext', + help="'totaltext' or 'ctw1500', or 'ic15'.") + parser.add_argument("--vis", action='store_true') + parser.add_argument("--zero_shot", action='store_true') + + parser.add_argument('--seed', default=42, type=int) + parser.add_argument('--input_size', default=[1024, 1024], type=list) + + # self-prompting + parser.add_argument('--attn_layers', default=1, type=int, + help='The number of image to token cross attention layers in model_aligner') + parser.add_argument('--prompt_len', default=12, type=int, help='The number of prompt token') + parser.add_argument('--layout_thresh', type=float, default=0.5) + return parser.parse_args() + + +def unclip(p, unclip_ratio=2.0): + poly = Polygon(p) + distance = poly.area * unclip_ratio / poly.length + offset = pyclipper.PyclipperOffset() + offset.AddPath(p, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON) + expanded = np.array(offset.Execute(distance)) + return expanded + + +def polygon2rbox(polygon, image_height, image_width): + rect = cv2.minAreaRect(polygon) + corners = cv2.boxPoints(rect) + corners = np.array(corners, dtype="int") + pts = get_tight_rect(corners, 0, 0, image_height, image_width, 1) + pts = np.array(pts).reshape(-1, 2) + return pts + + +def get_tight_rect(points, start_x, start_y, image_height, image_width, scale): + points = list(points) + ps = sorted(points, key=lambda x: x[0]) + + if ps[1][1] > ps[0][1]: + px1 = ps[0][0] * scale + start_x + py1 = ps[0][1] * scale + start_y + px4 = ps[1][0] * scale + start_x + py4 = ps[1][1] * scale + start_y + else: + px1 = 
ps[1][0] * scale + start_x + py1 = ps[1][1] * scale + start_y + px4 = ps[0][0] * scale + start_x + py4 = ps[0][1] * scale + start_y + if ps[3][1] > ps[2][1]: + px2 = ps[2][0] * scale + start_x + py2 = ps[2][1] * scale + start_y + px3 = ps[3][0] * scale + start_x + py3 = ps[3][1] * scale + start_y + else: + px2 = ps[3][0] * scale + start_x + py2 = ps[3][1] * scale + start_y + px3 = ps[2][0] * scale + start_x + py3 = ps[2][1] * scale + start_y + + px1 = min(max(px1, 1), image_width - 1) + px2 = min(max(px2, 1), image_width - 1) + px3 = min(max(px3, 1), image_width - 1) + px4 = min(max(px4, 1), image_width - 1) + py1 = min(max(py1, 1), image_height - 1) + py2 = min(max(py2, 1), image_height - 1) + py3 = min(max(py3, 1), image_height - 1) + py4 = min(max(py4, 1), image_height - 1) + return [px1, py1, px2, py2, px3, py3, px4, py4] + + +def show_mask(mask, ax, random_color=False, color=None): + if random_color: + color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0) + else: + color = color if color is not None else np.array([30/255, 144/255, 255/255, 0.5]) + h, w = mask.shape[-2:] + mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1) + ax.imshow(mask_image) + + +def show_masks(masks, filename, image): + plt.figure(figsize=(15, 15)) + plt.imshow(image) + for i, mask in enumerate(masks): + mask = mask[0].astype(np.uint8) + # contours, _ = cv2.findContours(mask, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE) + # for cont in contours: + # epsilon = 0.002 * cv2.arcLength(cont, True) + # approx = cv2.approxPolyDP(cont, epsilon, True) + # pts = approx.reshape((-1, 2)) + # if pts.shape[0] < 4: + # continue + # pts = pts.astype(np.int32) + # mask = cv2.fillPoly(np.zeros(mask.shape), [pts], 1) + show_mask(mask, plt.gca(), random_color=True) + plt.axis('off') + plt.savefig(filename, bbox_inches='tight', pad_inches=0) + plt.close() + + +if __name__ == '__main__': + args = get_args_parser() + seed = args.seed + torch.manual_seed(seed) + np.random.seed(seed) + random.seed(seed) + torch.cuda.manual_seed(seed) + torch.cuda.manual_seed_all(seed) + hisam = model_registry[args.model_type](args) + hisam.eval() + hisam.to(args.device) + print("Loaded model") + amg = AutoMaskGenerator(hisam) + + if args.dataset == 'totaltext': + if args.zero_shot: + fg_points_num = 50 # assemble text kernel + score_thresh = 0.3 + unclip_ratio = 1.5 + else: + fg_points_num = 500 + score_thresh = 0.95 + elif args.dataset == 'ctw1500': + if args.zero_shot: + fg_points_num = 100 + score_thresh = 0.6 + else: + fg_points_num = 300 + score_thresh = 0.7 + else: + raise ValueError + + if os.path.isdir(args.input[0]): + args.input = [os.path.join(args.input[0], fname) for fname in os.listdir(args.input[0])] + elif len(args.input) == 1: + args.input = glob.glob(os.path.expanduser(args.input[0])) + assert args.input, "The input path(s) was not found" + for path in tqdm(args.input): + img_id = os.path.basename(path).split('.')[0] + + if os.path.isdir(args.output): + assert os.path.isdir(args.output), args.output + img_name = os.path.basename(path).split('.')[0] + '.png' + out_filename = os.path.join(args.output, img_name) + else: + assert len(args.input) == 1 + out_filename = args.output + + image = cv2.imread(path) + image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # h, w, 3 + img_h, img_w = image.shape[:2] + + amg.set_image(image) + masks, scores = amg.predict_text_detection( + from_low_res=False, + fg_points_num=fg_points_num, + batch_points_num=min(fg_points_num, 100), + score_thresh=score_thresh, + nms_thresh=score_thresh, + 
zero_shot=args.zero_shot, + dataset=args.dataset + ) + + if masks is not None: + print('Inference done. Start plotting masks.') + show_masks(masks, out_filename, image) + else: + print('no prediction') diff --git a/eval_img.py b/eval_img.py new file mode 100644 index 0000000..a38c399 --- /dev/null +++ b/eval_img.py @@ -0,0 +1,121 @@ +import os.path +import sys + +import numpy as np +import torch +from skimage import io +from glob import glob +from tqdm import tqdm +from collections import OrderedDict +import copy + + +def get_IandU(pred, gt, m=None): + if pred.dtype not in [bool, np.uint8, np.int32, np.int64, int]: + raise ValueError + if gt.dtype not in [bool, np.uint8, np.int32, np.int64, int]: + raise ValueError + pred = pred.copy().astype(int) + gt = (gt.copy() > 127).astype(int) + max_index = 2 + + if m is None: + m = np.ones(pred.shape, dtype=np.uint8) + mgt = m * (gt >= 0) & (gt < max_index) + gt[mgt == 0] = max_index + # here set it with a new "ignore" index + + mx = mgt * ((pred >= 0) & (pred < max_index)) + # all gt ignores will be ignored, but if x predict + # ignore on non-ignore pixel, that will count as an error. + pred[mx == 0] = max_index + + bdd_index = max_index + 1 # 3 + # include the ignore + + cmp = np.bincount((pred + gt * bdd_index).flatten()) + cm = np.zeros((bdd_index * bdd_index)).astype(int) + cm[0:len(cmp)] = cmp + cm = cm.reshape(bdd_index, bdd_index) + pdn = cm.sum(axis=0) + gtn = cm.sum(axis=1) + tp = np.diag(cm) + intersection = tp[:max_index].tolist() # remove ignore + union = pdn + gtn - tp + union = union[:max_index].tolist() + return {"intersection": intersection, "union": union} + + +def label_count(x, m=None): + if x.dtype not in [bool, np.uint8, np.int32, np.int64, int]: + raise ValueError + x = x.copy().astype(int) + max_index = 2 + + if m is None: + m = np.ones(x.shape, dtype=int) + else: + m = (m > 0).astype(int) + + mx = m * (x >= 0) & (x < max_index) + # here set it with a new "ignore" index + x[mx == 0] = max_index + + counti = np.bincount(x.flatten()) + counti = counti[0:max_index] + counti = list(counti) + if len(counti) < max_index: + counti += [0] * (max_index - len(counti)) + return counti + + +result_folder = "img_eval" +gt_folder = "datasets/HierText/test_gt" +gts = glob(gt_folder+'/*') +semantic_classname = {0: 'background', 1: 'text'} +predictions = [] +for gt_name in tqdm(gts): + img_id = os.path.basename(gt_name).split('.')[0] + gt_data = io.imread(gt_name) + gt_data = (gt_data > 127).astype(np.uint8) * 255 + if len(gt_data.shape) > 2: + gt_data = gt_data[:, :, 0] + + res_data = io.imread(os.path.join(result_folder, img_id+'.png')) + res_data = res_data > 127 + + img_result_dict = get_IandU(res_data, gt_data) + gn = label_count((gt_data>127).astype(int)) + pn = label_count(res_data) + img_result_dict.update(pred_num=pn, gt_num=gn) + predictions.append(img_result_dict) + +results = OrderedDict() +final_i = [x["intersection"] for x in predictions] +final_i = np.array(final_i) +final_u = [x["union"] for x in predictions] +final_u = np.array(final_u) +final_pn = [x["pred_num"] for x in predictions] +final_pn = np.array(final_pn) +final_gn = [x["gt_num"] for x in predictions] +final_gn = np.array(final_gn) +ii = final_i.sum(axis=0) +uu = final_u.sum(axis=0) +iou = ii.astype(float) / (uu.astype(float)) +miou = np.nanmean(iou) +pn_save = copy.deepcopy(final_pn) +gn_save = copy.deepcopy(final_gn) +pn_save[final_pn == 0] = 1 +gn_save[final_gn == 0] = 1 +prec_imwise = final_i.astype(float) / (pn_save.astype(float)) +recl_imwise = 
final_i.astype(float) / (gn_save.astype(float)) +prec_imwise[final_pn == 0] = 0 +recl_imwise[final_gn == 0] = 0 +prec_imwise = prec_imwise.mean(axis=0) +recl_imwise = recl_imwise.mean(axis=0) +fscore_imwise = 2 * prec_imwise * recl_imwise / (prec_imwise + recl_imwise) +results['---mIOU'] = float(miou) +for idx in range(1, len(iou)): # ignore background + results[str(idx).zfill(3)+'-' + semantic_classname[idx]+'-IOU'] = float(iou[idx]) + results[str(idx).zfill(3) + '-' + semantic_classname[idx] + '-Fscore'] = float(fscore_imwise[idx]) +print(results) \ No newline at end of file diff --git a/hi_sam/__init__.py b/hi_sam/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/hi_sam/data/__init__.py b/hi_sam/data/__init__.py new file mode 100644 index 0000000..5277f46 --- /dev/null +++ b/hi_sam/data/__init__.py @@ -0,0 +1,5 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. diff --git a/hi_sam/data/dataloader.py b/hi_sam/data/dataloader.py new file mode 100644 index 0000000..74072fa --- /dev/null +++ b/hi_sam/data/dataloader.py @@ -0,0 +1,247 @@ +# Copyright by HQ-SAM team +# All rights reserved. + +## data loader +from __future__ import print_function, division + +import copy +import numbers +import sys + +import numpy as np +import random +from copy import deepcopy + +import torchvision.transforms +from skimage import io +import os +from glob import glob +from typing import Tuple, List, Optional +from collections.abc import Sequence +import json +from pycocotools import mask as mask_utils +import time +from tqdm import tqdm +import cv2 +from PIL import Image + + +import torch +from torch.utils.data import Dataset, DataLoader, ConcatDataset +from torchvision import transforms, utils +from torchvision.transforms.functional import normalize, InterpolationMode +from torchvision.transforms.functional import resize, to_pil_image # type: ignore +from torchvision.transforms.functional import adjust_brightness, adjust_contrast, adjust_saturation, adjust_hue, rotate, gaussian_blur +import torch.nn.functional as F +from torch.utils.data.distributed import DistributedSampler + + +#### --------------------- dataloader online ---------------------#### + +def get_im_gt_name_dict(datasets, flag='valid'): + print("------------------------------", flag, "--------------------------------") + name_im_gt_list = [] + + for i in range(len(datasets)): + print("--->>>",flag," dataset ",i+1,"/",len(datasets)," ",datasets[i]["name"],"<<<---") + tmp_im_list = glob(datasets[i]["im_dir"]+os.sep+'*'+datasets[i]["im_ext"]) + print('-im-',datasets[i]["name"],datasets[i]["im_dir"],': ',len(tmp_im_list)) + + if(datasets[i]["gt_dir"]==""): + print('-gt-', datasets[i]["name"], datasets[i]["gt_dir"], ': ', 'No Ground Truth Found') + tmp_gt_list = [] + else: + tmp_gt_list = [ + datasets[i]["gt_dir"]+os.sep+x.split(os.sep)[-1].split(datasets[i]["im_ext"])[0]+datasets[i]["gt_ext"] + for x in tmp_im_list + ] + print('-gt-',datasets[i]["name"],datasets[i]["gt_dir"],': ',len(tmp_gt_list)) + + + name_im_gt_list.append({"dataset_name":datasets[i]["name"], + "im_path":tmp_im_list, + "gt_path":tmp_gt_list, + "im_ext":datasets[i]["im_ext"], + "gt_ext":datasets[i]["gt_ext"], + "json_path": datasets[i].get('json_dir', None) + }) + + return name_im_gt_list + +def custom_collate_fn(batch_samples): + imidx, image, label, shape = [], [], [], [] + para_masks, line_masks, word_masks, 
line2para_idx = [], [], [], [] + for sample in batch_samples: + imidx.append(sample['imidx']) + image.append(sample['image'].unsqueeze(0)) + label.append(sample['label'].unsqueeze(0)) + shape.append(sample['shape'].unsqueeze(0)) + line_masks.append(sample['line_masks']) + if sample.get('paragraph_masks', None) is not None: + para_masks.append(sample['paragraph_masks']) + word_masks.append(sample['word_masks']) + line2para_idx.append(sample['line2paragraph_index']) + return { + 'imidx': torch.as_tensor(imidx), 'image': torch.cat(image), 'label': torch.cat(label), 'shape': torch.cat(shape), + 'paragraph_masks': para_masks, 'line_masks': line_masks, 'word_masks': word_masks, 'line2paragraph_index': line2para_idx + } + +def create_dataloaders( + name_im_gt_list, my_transforms=[], batch_size=1, training=False, hier_det=False, collate_fn=None +): + gos_dataloaders = [] + gos_datasets = [] + + if(len(name_im_gt_list)==0): + return gos_dataloaders, gos_datasets + + num_workers_ = 1 + if batch_size > 1: + num_workers_ = 2 + if batch_size >= 4: + num_workers_ = 4 + if batch_size >= 8: + num_workers_ = 8 + + if training: + for i in range(len(name_im_gt_list)): + gos_dataset = OnlineDataset( + [name_im_gt_list[i]], + transform = transforms.Compose(my_transforms), + hier_det=hier_det + ) + gos_datasets.append(gos_dataset) + + gos_dataset = ConcatDataset(gos_datasets) + sampler = DistributedSampler(gos_dataset) + batch_sampler_train = torch.utils.data.BatchSampler(sampler, batch_size, drop_last=True) + dataloader = DataLoader( + gos_dataset, + batch_sampler=batch_sampler_train, + num_workers=num_workers_, + pin_memory=True, + prefetch_factor=6, + collate_fn=collate_fn if hier_det else None + ) + gos_dataloaders = dataloader + gos_datasets = gos_dataset + else: + for i in range(len(name_im_gt_list)): + gos_dataset = OnlineDataset( + [name_im_gt_list[i]], + transform=transforms.Compose(my_transforms), + eval_ori_resolution=True + ) + sampler = DistributedSampler(gos_dataset, shuffle=False) + dataloader = DataLoader(gos_dataset, batch_size, sampler=sampler, drop_last=False, num_workers=num_workers_) + gos_dataloaders.append(dataloader) + gos_datasets.append(gos_dataset) + + return gos_dataloaders, gos_datasets + + +class ResizeLongestSide_ToTensor(object): + def __init__(self, target_length=1024): + self.target_length = target_length + + @staticmethod + def get_preprocess_shape(oldh: int, oldw: int, long_side_length: int): + scale = long_side_length * 1.0 / max(oldh, oldw) + newh, neww = oldh * scale, oldw * scale + neww = int(neww + 0.5) + newh = int(newh + 0.5) + return [newh, neww] + + def __call__(self, sample): + # image: np.array, [h, w, c] + image, shape = sample['image'], sample['shape'] + image = torch.as_tensor(image, dtype=torch.float32) + image = image.permute(2, 0, 1) + target_size = self.get_preprocess_shape(image.shape[1], image.shape[2], self.target_length) + image = torch.squeeze(F.interpolate(torch.unsqueeze(image, 0), target_size, mode='bilinear'), dim=0) + return {'image':image, 'shape':torch.tensor(target_size)} + + +class OnlineDataset(Dataset): + def __init__(self, name_im_gt_list, transform=None, eval_ori_resolution=False, hier_det=False): + # if hier_det is True, the json file for word, line, paragraph + # detection should be loaded. 
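+        # name_im_gt_list holds a single dataset dict built by get_im_gt_name_dict(),
+        # with keys "dataset_name", "im_path", "gt_path", "im_ext", "gt_ext", and
+        # optionally "json_path". Stroke ground truths are binary mask images; when
+        # hier_det is True, "json_path" must point to a json with an "annotations"
+        # list (HierText format), whose entries are re-indexed by image_id below.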
+ self.transform = transform + self.dataset = {} + im_name_list = [] # image name + im_path_list = [] # im path + gt_path_list = [] # gt path + + assert len(name_im_gt_list) == 1 + name_im_gt_list = name_im_gt_list[0] + im_name_list.extend([x.split(os.sep)[-1].split(name_im_gt_list["im_ext"])[0] for x in name_im_gt_list["im_path"]]) + im_path_list.extend(name_im_gt_list["im_path"]) + gt_path_list.extend(name_im_gt_list["gt_path"]) + + self.dataset["im_name"] = im_name_list + self.dataset["im_path"] = im_path_list + self.dataset["ori_im_path"] = deepcopy(im_path_list) + self.dataset["gt_path"] = gt_path_list + self.dataset["ori_gt_path"] = deepcopy(gt_path_list) + self.dataset_name = name_im_gt_list["dataset_name"] + self.hier_det = hier_det + if hier_det: + json_path = name_im_gt_list.get('json_path', None) + assert json_path is not None, "Please check settings." + load_start = time.time() + with open(json_path, 'r') as fw: + annotations = json.load(fw)['annotations'] + self.dataset["annotations"] = {anns['image_id']: anns for anns in annotations} + print(f"processed {len(self.dataset['annotations'])} samples from {json_path} in {time.time()-load_start:.2f} seconds.") + del annotations + + self.eval_ori_resolution = eval_ori_resolution + + def __len__(self): + return len(self.dataset["im_path"]) + + def __getitem__(self, idx): + im_path = self.dataset["im_path"][idx] + gt_path = self.dataset["gt_path"][idx] + im_name = self.dataset["im_name"][idx] + im = io.imread(im_path) + gt_ori = io.imread(gt_path) + if 'TextSeg' in self.dataset_name: + gt = (gt_ori == 100).astype(np.uint8) * 255 + elif 'COCO_TS' in self.dataset_name: + gt = (gt_ori > 0).astype(np.uint8) * 255 + else: + gt = (gt_ori > 127).astype(np.uint8) * 255 # for TotalText, HierText: 0 or 255 + if len(gt.shape) > 2: + gt = gt[:, :, 0] + if len(im.shape) < 3: + im = im[:, :, np.newaxis] + if im.shape[2] == 1: + im = np.repeat(im, 3, axis=2) + sample = { + "imidx": torch.from_numpy(np.array(idx)), + "image": im, + "label": gt.astype(np.float32), + "shape": torch.tensor(im.shape[:2]), + } + + if self.hier_det: + raise NotImplementedError + + if self.transform: + sample = self.transform(sample) + + if self.eval_ori_resolution: + sample["ori_label"] = torch.unsqueeze(torch.from_numpy(gt), 0) + if 'TextSeg' in self.dataset_name: + ignore_mask = (gt_ori == 255).astype(np.uint8) * 255 + sample["ignore_mask"] = torch.unsqueeze(torch.from_numpy(ignore_mask), 0) + sample['ori_im_path'] = self.dataset["im_path"][idx] + sample['ori_gt_path'] = self.dataset["gt_path"][idx] + + return sample + + +eval_transforms = [ + ResizeLongestSide_ToTensor(), +] diff --git a/hi_sam/data/transforms.py b/hi_sam/data/transforms.py new file mode 100644 index 0000000..3ad3466 --- /dev/null +++ b/hi_sam/data/transforms.py @@ -0,0 +1,102 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. + +import numpy as np +import torch +from torch.nn import functional as F +from torchvision.transforms.functional import resize, to_pil_image # type: ignore + +from copy import deepcopy +from typing import Tuple + + +class ResizeLongestSide: + """ + Resizes images to longest side 'target_length', as well as provides + methods for resizing coordinates and boxes. Provides methods for + transforming both numpy array and batched torch tensors. 
+ """ + + def __init__(self, target_length: int) -> None: + self.target_length = target_length + + def apply_image(self, image: np.ndarray) -> np.ndarray: + """ + Expects a numpy array with shape HxWxC in uint8 format. + """ + target_size = self.get_preprocess_shape(image.shape[0], image.shape[1], self.target_length) + return np.array(resize(to_pil_image(image), target_size)) + + def apply_coords(self, coords: np.ndarray, original_size: Tuple[int, ...]) -> np.ndarray: + """ + Expects a numpy array of length 2 in the final dimension. Requires the + original image size in (H, W) format. + """ + old_h, old_w = original_size + new_h, new_w = self.get_preprocess_shape( + original_size[0], original_size[1], self.target_length + ) + coords = deepcopy(coords).astype(float) + coords[..., 0] = coords[..., 0] * (new_w / old_w) + coords[..., 1] = coords[..., 1] * (new_h / old_h) + return coords + + def apply_boxes(self, boxes: np.ndarray, original_size: Tuple[int, ...]) -> np.ndarray: + """ + Expects a numpy array shape Bx4. Requires the original image size + in (H, W) format. + """ + boxes = self.apply_coords(boxes.reshape(-1, 2, 2), original_size) + return boxes.reshape(-1, 4) + + def apply_image_torch(self, image: torch.Tensor) -> torch.Tensor: + """ + Expects batched images with shape BxCxHxW and float format. This + transformation may not exactly match apply_image. apply_image is + the transformation expected by the model. + """ + # Expects an image in BCHW format. May not exactly match apply_image. + target_size = self.get_preprocess_shape(image.shape[0], image.shape[1], self.target_length) + return F.interpolate( + image, target_size, mode="bilinear", align_corners=False, antialias=True + ) + + def apply_coords_torch( + self, coords: torch.Tensor, original_size: Tuple[int, ...] + ) -> torch.Tensor: + """ + Expects a torch tensor with length 2 in the last dimension. Requires the + original image size in (H, W) format. + """ + old_h, old_w = original_size + new_h, new_w = self.get_preprocess_shape( + original_size[0], original_size[1], self.target_length + ) + coords = deepcopy(coords).to(torch.float) + coords[..., 0] = coords[..., 0] * (new_w / old_w) + coords[..., 1] = coords[..., 1] * (new_h / old_h) + return coords + + def apply_boxes_torch( + self, boxes: torch.Tensor, original_size: Tuple[int, ...] + ) -> torch.Tensor: + """ + Expects a torch tensor with shape Bx4. Requires the original image + size in (H, W) format. + """ + boxes = self.apply_coords_torch(boxes.reshape(-1, 2, 2), original_size) + return boxes.reshape(-1, 4) + + @staticmethod + def get_preprocess_shape(oldh: int, oldw: int, long_side_length: int) -> Tuple[int, int]: + """ + Compute the output size given input size and target long side length. 
+ """ + scale = long_side_length * 1.0 / max(oldh, oldw) + newh, neww = oldh * scale, oldw * scale + neww = int(neww + 0.5) + newh = int(newh + 0.5) + return (newh, neww) diff --git a/hi_sam/evaluation/__init__.py b/hi_sam/evaluation/__init__.py new file mode 100644 index 0000000..e6fcb53 --- /dev/null +++ b/hi_sam/evaluation/__init__.py @@ -0,0 +1 @@ +from .evaluator import Evaluator \ No newline at end of file diff --git a/hi_sam/evaluation/evaluator.py b/hi_sam/evaluation/evaluator.py new file mode 100644 index 0000000..057aea0 --- /dev/null +++ b/hi_sam/evaluation/evaluator.py @@ -0,0 +1,205 @@ +import copy +import io +import itertools +import json +import numpy as np +import os +import torch +from collections import OrderedDict +from utils import misc + + +class Evaluator(): + def __init__(self, dataset_name, args, distributed, output_dir=None): + self._distributed = distributed + self._output_dir = output_dir + self._cpu_device = torch.device("cpu") + + if "TotalText" in dataset_name: + self.dataset_name = "totaltext" + self.class_num = 2 + self.semantic_classname = { + 0: 'background', + 1: 'text', + } + elif "HierText" in dataset_name: + self.dataset_name = "hiertext" + self.class_num = 2 + self.semantic_classname = { + 0: 'background', + 1: 'text', + } + elif "TextSeg" in dataset_name: + self.dataset_name = "textseg" + self.class_num = 2 + self.semantic_classname = { + 0: 'background', + 1: 'text', + } + else: + raise NotImplementedError + + def reset(self): + self._predictions = [] + self._hr_predictions = [] + + def process(self, predictions, hr_predictions, gts, ignore_mask=None): + assert predictions.shape[0] == 1, "Only support one batch per GPU now." + assert predictions.shape == gts.shape + assert hr_predictions.shape == gts.shape + if ignore_mask is not None: + assert ignore_mask.shape == gts.shape + predictions[ignore_mask==255] = False + hr_predictions[ignore_mask==255] = False + for pred, hr_pred, gt in zip(predictions, hr_predictions, gts): + pred = pred.squeeze(0).to(self._cpu_device).detach().numpy() # h, w + hr_pred = hr_pred.squeeze(0).to(self._cpu_device).detach().numpy() # h, w + gt = gt.squeeze(0).to(self._cpu_device).detach().numpy() + + img_result_dict = self.get_IandU(pred, gt) + pn = self.label_count(pred) + gn = self.label_count((gt>127).astype(int)) + img_result_dict.update(pred_num=pn, gt_num=gn) + self._predictions.append(img_result_dict) + + img_hr_result_dict = self.get_IandU(hr_pred, gt) + hr_pn = self.label_count(hr_pred) + img_hr_result_dict.update(pred_num=hr_pn, gt_num=gn) + self._hr_predictions.append(img_hr_result_dict) + + def get_IandU(self, pred, gt, m=None): + if pred.dtype not in [bool, np.uint8, np.int32, np.int64, int]: + raise ValueError + if gt.dtype not in [bool, np.uint8, np.int32, np.int64, int]: + raise ValueError + pred = pred.copy().astype(int) + gt = (gt.copy() > 127).astype(int) + max_index = self.class_num # 2 for Total-Text + + if m is None: + m = np.ones(pred.shape, dtype=np.uint8) + mgt = m * (gt >= 0) & (gt < max_index) + gt[mgt == 0] = max_index + # here set it with a new "ignore" index + + mx = mgt * ((pred >= 0) & (pred < max_index)) + # all gt ignores will be ignored, but if x predict + # ignore on non-ignore pixel, that will count as an error. 
+ pred[mx == 0] = max_index + + bdd_index = max_index + 1 # 3 + # include the ignore + + cmp = np.bincount((pred + gt * bdd_index).flatten()) + cm = np.zeros((bdd_index * bdd_index)).astype(int) + cm[0:len(cmp)] = cmp + cm = cm.reshape(bdd_index, bdd_index) + pdn = cm.sum(axis=0) + gtn = cm.sum(axis=1) + tp = np.diag(cm) + intersection = tp[:max_index].tolist() # remove ignore + union = pdn + gtn - tp + union = union[:max_index].tolist() + return {"intersection": intersection, "union": union} + + def label_count(self, x, m=None): + if x.dtype not in [bool, np.uint8, np.int32, np.int64, int]: + raise ValueError + x = x.copy().astype(int) + max_index = self.class_num + + if m is None: + m = np.ones(x.shape, dtype=int) + else: + m = (m > 0).astype(int) + + mx = m * (x >= 0) & (x < max_index) + # here set it with a new "ignore" index + x[mx == 0] = max_index + + counti = np.bincount(x.flatten()) + counti = counti[0:max_index] + counti = list(counti) + if len(counti) < max_index: + counti += [0] * (max_index - len(counti)) + return counti + + def evaluate(self): + if self._distributed: + misc.synchronize() + predictions = misc.gather(self._predictions, dst=0) + predictions = list(itertools.chain(*predictions)) + hr_predictions = misc.gather(self._hr_predictions, dst=0) + hr_predictions = list(itertools.chain(*hr_predictions)) + + if not misc.is_main_process(): + return {} + else: + predictions = self._predictions + hr_predictions = self._hr_predictions + + if len(predictions) == 0: + print("Did not receice valid predictions") + return {} + + self._results = OrderedDict() + + final_i = [x["intersection"] for x in predictions] + final_i = np.array(final_i) # (300,2) + final_i_hr = [x["intersection"] for x in hr_predictions] + final_i_hr = np.array(final_i_hr) + + final_u = [x["union"] for x in predictions] + final_u = np.array(final_u) + final_u_hr = [x["union"] for x in hr_predictions] + final_u_hr = np.array(final_u_hr) + + final_pn = [x["pred_num"] for x in predictions] + final_pn = np.array(final_pn) + final_pn_hr = [x["pred_num"] for x in hr_predictions] + final_pn_hr = np.array(final_pn_hr) + + final_gn = [x["gt_num"] for x in predictions] + final_gn = np.array(final_gn) + + ii = final_i.sum(axis=0) + uu = final_u.sum(axis=0) + iou = ii.astype(float) / (uu.astype(float)) + miou = np.nanmean(iou) + ii_hr = final_i_hr.sum(axis=0) + uu_hr = final_u_hr.sum(axis=0) + iou_hr = ii_hr.astype(float) / (uu_hr.astype(float)) + miou_hr = np.nanmean(iou_hr) + + pn_save = copy.deepcopy(final_pn) + gn_save = copy.deepcopy(final_gn) + '''image-wise fscore''' + pn_save[final_pn == 0] = 1 + gn_save[final_gn == 0] = 1 + prec_imwise = final_i.astype(float) / (pn_save.astype(float)) + recl_imwise = final_i.astype(float) / (gn_save.astype(float)) + prec_imwise[final_pn == 0] = 0 + recl_imwise[final_gn == 0] = 0 + prec_imwise = prec_imwise.mean(axis=0) + recl_imwise = recl_imwise.mean(axis=0) + fscore_imwise = 2 * prec_imwise * recl_imwise / (prec_imwise + recl_imwise) + + pn_save_hr = copy.deepcopy(final_pn_hr) + pn_save_hr[final_pn_hr == 0] = 1 + prec_imwise_hr = final_i_hr.astype(float) / (pn_save_hr.astype(float)) + recl_imwise_hr = final_i_hr.astype(float) / (gn_save.astype(float)) + prec_imwise_hr[final_pn_hr == 0] = 0 + recl_imwise_hr[final_gn == 0] = 0 + prec_imwise_hr = prec_imwise_hr.mean(axis=0) + recl_imwise_hr = recl_imwise_hr.mean(axis=0) + fscore_imwise_hr = 2 * prec_imwise_hr * recl_imwise_hr / (prec_imwise_hr + recl_imwise_hr) + + self._results['---mIOU'] = float(miou) + 
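
The intersection/union bookkeeping in `get_IandU` relies on a common trick: encode each (gt, pred) pair as a single integer and histogram it with `np.bincount`, which yields the confusion matrix in one pass. A small standalone illustration with two classes and toy arrays (not repository code):

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    # Each pixel pair (gt, pred) maps to the unique index pred + gt * num_classes;
    # bincount over those indices is the flattened confusion matrix.
    idx = pred.astype(int) + gt.astype(int) * num_classes
    counts = np.bincount(idx.ravel(), minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)   # rows: gt, cols: pred

pred = np.array([[0, 1, 1],
                 [0, 1, 0]])
gt   = np.array([[0, 1, 0],
                 [1, 1, 0]])
cm = confusion_matrix(pred, gt, num_classes=2)
tp = np.diag(cm)                        # per-class intersection
union = cm.sum(0) + cm.sum(1) - tp      # pred count + gt count - intersection
print(tp, union, tp / union)            # foreground IoU = 2 / 4 = 0.5
```
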
self._results['---mIOU_hr'] = float(miou_hr) + for idx in range(1, len(iou)): # ignore background + self._results[str(idx).zfill(3)+'-'+self.semantic_classname[idx]+'-IOU'] = float(iou[idx]) + self._results[str(idx).zfill(3) + '-' + self.semantic_classname[idx] + '-Fscore'] = float(fscore_imwise[idx]) + self._results[str(idx).zfill(3) + '-' + self.semantic_classname[idx] + '-IOU_hr'] = float(iou_hr[idx]) + self._results[str(idx).zfill(3) + '-' + self.semantic_classname[idx] + '-Fscore_hr'] = float(fscore_imwise_hr[idx]) + + return copy.deepcopy(self._results) diff --git a/hi_sam/modeling/__init__.py b/hi_sam/modeling/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/hi_sam/modeling/auto_mask_generator.py b/hi_sam/modeling/auto_mask_generator.py new file mode 100644 index 0000000..1535d1a --- /dev/null +++ b/hi_sam/modeling/auto_mask_generator.py @@ -0,0 +1,399 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. +import sys + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. + +import numpy as np +import torch +import torch.nn.functional as F + +from .hi_sam import HiSam + +from typing import Optional, Tuple +from hi_sam.data.transforms import ResizeLongestSide + + +class AutoMaskGenerator: + def __init__( + self, + sam_model: HiSam, + ) -> None: + super().__init__() + self.model = sam_model + self.transform = ResizeLongestSide(sam_model.image_encoder.img_size) + self.reset_image() + self.reset_fgmask() + + def set_image( + self, + image: np.ndarray, + image_format: str = "RGB", + ) -> None: + """ + Calculates the image embeddings for the provided image, allowing + masks to be predicted with the 'predict' method. + + Arguments: + image (np.ndarray): The image for calculating masks. Expects an + image in HWC uint8 format, with pixel values in [0, 255]. + image_format (str): The color format of the image, in ['RGB', 'BGR']. + """ + assert image_format in [ + "RGB", + "BGR", + ], f"image_format must be in ['RGB', 'BGR'], is {image_format}." + # import pdb;pdb.set_trace() + if image_format != self.model.image_format: + image = image[..., ::-1] + + # Transform the image to the form expected by the model + # import pdb;pdb.set_trace() + input_image = self.transform.apply_image(image) + input_image_torch = torch.as_tensor(input_image, device=self.device) + input_image_torch = input_image_torch.permute(2, 0, 1).contiguous()[None, :, :, :] + + self.set_torch_image(input_image_torch, image.shape[:2]) + + @torch.no_grad() + def set_torch_image( + self, + transformed_image: torch.Tensor, + original_image_size: Tuple[int, ...], + ) -> None: + """ + Calculates the image embeddings for the provided image, allowing + masks to be predicted with the 'predict' method. Expects the input + image to be already transformed to the format expected by the model. + + Arguments: + transformed_image (torch.Tensor): The input image, with shape + 1x3xHxW, which has been transformed with ResizeLongestSide. + original_image_size (tuple(int, int)): The size of the image + before transformation, in (H, W) format. + """ + assert ( + len(transformed_image.shape) == 4 + and transformed_image.shape[1] == 3 + and max(*transformed_image.shape[2:]) == self.model.image_encoder.img_size + ), f"set_torch_image input must be BCHW with long side {self.model.image_encoder.img_size}." 
+ self.reset_image() + + self.original_size = original_image_size + self.input_size = tuple(transformed_image.shape[-2:]) + input_image = self.model.preprocess(transformed_image) + self.features = self.model.image_encoder(input_image) + self.is_image_set = True + + def set_fgmask( + self, + fgmask: np.ndarray, + ) -> None: + assert (len(fgmask.shape) == 2 or len(fgmask.shape) == 3) + if len(fgmask.shape) > 2: + fgmask = fgmask[:, :, 0] + self.reset_fgmask() + fgmask = self.transform.apply_image(fgmask) + fgmask = fgmask > (fgmask.max() / 2) + self.fgmask = torch.as_tensor(fgmask, device=self.device) + self.is_fgmask_set = True + + @torch.no_grad() + def forward_foreground_points( + self, + from_low_res: bool = False, + fg_points_num: int = 600 + ): + if not self.is_image_set: + raise RuntimeError("An image must be set with .set_image(...) before mask prediction.") + if self.is_fgmask_set: + fg_mask = self.fgmask + else: + sparse_emb = self.model.modal_aligner(self.features) + low_res_mask, high_res_mask, iou_pred, iou_pred_hr = self.model.mask_decoder( + image_embeddings=self.features, + image_pe=self.model.prompt_encoder.get_dense_pe(), + sparse_prompt_embeddings=sparse_emb, + multimask_output=False, + ) + assert low_res_mask.shape[0] == 1, "only support one image per batch now" + + if from_low_res: + fg_mask = (low_res_mask > self.model.mask_threshold).squeeze(1)[0] # (256, 256) + else: + fg_mask = (high_res_mask > self.model.mask_threshold).squeeze(1)[0] # (1024, 1024) + del low_res_mask + del high_res_mask + + # sample fg_points randomly + y_idx, x_idx = torch.where(fg_mask > 0) + + p_n = x_idx.size(0) + perm = torch.randperm(p_n) + idx = perm[:min(p_n, fg_points_num)] + + y_idx, x_idx = y_idx[idx][:, None], x_idx[idx][:, None] + fg_points = torch.cat((x_idx, y_idx), dim=1)[:, None, :] # (k, 1, 2) + if from_low_res and (not self.is_fgmask_set): + fg_points = fg_points * 4 # from 256 to 1024 + return fg_points + + @torch.no_grad() + def forward_hi_decoder( + self, + point_coords, + point_labels + ): + point_embeddings, _ = self.model.prompt_encoder( + points=(point_coords, point_labels), boxes=None, masks=None + ) + hi_masks_logits, hi_iou_preds, word_masks_logits = self.model.hi_decoder( + image_embeddings=self.features, + image_pe=self.model.prompt_encoder.get_dense_pe(), + sparse_prompt_embeddings=point_embeddings, + multimask_output=True + ) + return hi_masks_logits, hi_iou_preds, word_masks_logits + + def predict( + self, + from_low_res: bool = False, + fg_points_num: int = 1500, + batch_points_num: int = 100, + score_thresh: float = 0.5, + nms_thresh: float = 0.5, + return_logits: bool = False, + oracle_point_prompts: np.ndarray = None, + ): + if not self.is_image_set: + raise RuntimeError("An image must be set with .set_image(...) 
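
`forward_foreground_points` reduces the predicted stroke mask to a set of random foreground point prompts. A minimal sketch of that sampling step, with a toy mask and an illustrative function name:

```python
import torch

def sample_foreground_points(fg_mask: torch.Tensor, max_points: int) -> torch.Tensor:
    """Randomly pick up to max_points (x, y) coordinates from a boolean mask.

    Mirrors the sampling in forward_foreground_points above; the toy mask
    below is illustrative only.
    """
    y_idx, x_idx = torch.where(fg_mask)
    perm = torch.randperm(x_idx.numel())[: min(x_idx.numel(), max_points)]
    points = torch.stack((x_idx[perm], y_idx[perm]), dim=1)  # (k, 2) as (x, y)
    return points[:, None, :]                                # (k, 1, 2) for the prompt encoder

mask = torch.zeros(8, 8, dtype=torch.bool)
mask[2:5, 3:6] = True
print(sample_foreground_points(mask, max_points=4).shape)    # torch.Size([4, 1, 2])
```
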
before mask prediction.") + assert batch_points_num <= fg_points_num + + if oracle_point_prompts is not None: + # resize point prompts + oracle_point_prompts = self.transform.apply_coords(oracle_point_prompts, self.original_size) + fg_points = torch.tensor(oracle_point_prompts, dtype=torch.int64, device=self.features.device) + fg_points = fg_points[:, None, :] + else: + fg_points = self.forward_foreground_points(from_low_res, fg_points_num) + fg_points_num = fg_points.shape[0] + if fg_points_num == 0: + return None, None, None + + masks, scores, word_masks = [], [], [] + for start_idx in range(0, fg_points_num, batch_points_num): + end_idx = min(start_idx + batch_points_num, fg_points_num) + hi_masks_logits, hi_iou_preds, word_masks_logits = self.forward_hi_decoder( + fg_points[start_idx:end_idx, :, :], + torch.ones((end_idx-start_idx, 1), device=fg_points.device) + ) + hi_masks_logits = hi_masks_logits[:, 1:, :, :] + masks.append(hi_masks_logits) + scores.append(hi_iou_preds) + word_masks.append(word_masks_logits) + del hi_masks_logits + del word_masks_logits + masks = torch.cat(masks, dim=0) # (fg_points_num, x, 256, 256) + scores = torch.cat(scores, dim=0) # (fg_points_num, x) + word_masks = torch.cat(word_masks, dim=0) + + # filter low quality lines + keep = scores[:, 1] > score_thresh + if keep.sum() == 0: + return None, None, None + masks = masks[keep] + scores = scores[keep] + word_masks = word_masks[keep] + + # conduct mask nms, use 256x256 mask for nms to save memory and time + updated_scores = matrix_nms( + seg_masks=(masks[:, -2, :, :] > self.model.mask_threshold), + scores=scores[:, 1] + ) + keep = updated_scores > nms_thresh + if keep.sum() == 0: + return None, None, None + masks = masks[keep] # line and paragraph masks + scores = scores[keep] + word_masks = word_masks[keep] + affinity = get_para_iou(para_masks=(masks[:, -1, :, :] > self.model.mask_threshold)) + + del masks # only return word masks here for evaluation on HierText + word_masks = self.model.postprocess_masks(word_masks, self.input_size, self.original_size) + word_masks = word_masks > self.model.mask_threshold + masks_np = word_masks.cpu().numpy() + scores_np = scores.cpu().numpy() + affinity_np = affinity.cpu().numpy() + return masks_np, scores_np, affinity_np + + def predict_text_detection( + self, + from_low_res: bool = False, + fg_points_num: int = 600, + batch_points_num: int = 100, + score_thresh: float = 0.5, + nms_thresh: float = 0.5, + return_logits: bool = False, + oracle_point_prompts: np.ndarray = None, + zero_shot: bool = True, + dataset: str = 'totaltext' + ): + if not self.is_image_set: + raise RuntimeError("An image must be set with .set_image(...) 
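
For orientation, automatic hierarchical prediction follows a set-then-predict flow. A hedged usage sketch: the checkpoint path and the `Args` namespace below are placeholders, with only the fields read by `_build_sam` later in this diff filled in, and the random array stands in for a real RGB photo.

```python
import numpy as np
import torch
from hi_sam.modeling.build import model_registry
from hi_sam.modeling.auto_mask_generator import AutoMaskGenerator

class Args:                      # stand-in for the launcher's argparse namespace
    checkpoint = "pretrained_checkpoint/hi_sam_h.pth"   # placeholder path
    model_type = "vit_h"
    attn_layers = 1              # illustrative values
    prompt_len = 12
    hier_det = True

model = model_registry[Args.model_type](Args).eval()
amg = AutoMaskGenerator(model)

image = (np.random.rand(720, 960, 3) * 255).astype(np.uint8)   # HWC uint8 RGB
amg.set_image(image)
with torch.no_grad():
    word_masks, scores, affinity = amg.predict(fg_points_num=1500, score_thresh=0.5)
if word_masks is not None:
    print(word_masks.shape, scores.shape)   # per-word masks at the original resolution
```
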
before mask prediction.") + assert batch_points_num <= fg_points_num + + if oracle_point_prompts is not None: + # resize point prompts + oracle_point_prompts = self.transform.apply_coords(oracle_point_prompts, self.original_size) + fg_points = torch.tensor(oracle_point_prompts, dtype=torch.int64, device=self.features.device) + fg_points = fg_points[:, None, :] + else: + fg_points = self.forward_foreground_points(from_low_res, fg_points_num) + fg_points_num = fg_points.shape[0] + if fg_points_num == 0: + return None, None + + masks, scores, word_masks = [], [], [] + for start_idx in range(0, fg_points_num, batch_points_num): + end_idx = min(start_idx + batch_points_num, fg_points_num) + hi_masks_logits, hi_iou_preds, word_masks_logits = self.forward_hi_decoder( + fg_points[start_idx:end_idx, :, :], + torch.ones((end_idx-start_idx, 1), device=fg_points.device) + ) + hi_masks_logits = hi_masks_logits[:, 1:, :, :] + masks.append(hi_masks_logits) + scores.append(hi_iou_preds) + word_masks.append(word_masks_logits) + del hi_masks_logits + del word_masks_logits + masks = torch.cat(masks, dim=0) # (fg_points_num, x, 256, 256) + scores = torch.cat(scores, dim=0) # (fg_points_num, x) + word_masks = torch.cat(word_masks, dim=0) + + # filter low quality lines + if dataset != 'ctw1500' and not zero_shot: + scores[:, 1] = 1.0 # since no iou score prediction + keep = scores[:, 1] > score_thresh + if keep.sum() == 0: + return None, None + masks = masks[keep] + scores = scores[keep] + word_masks = word_masks[keep] + + if dataset == 'totaltext' and zero_shot: + word_masks = word_masks > self.model.mask_threshold + word_masks = (word_masks.sum(dim=0)[None, ...] > 0).type(torch.float32) + word_masks = self.model.postprocess_masks(word_masks, self.input_size, self.original_size) > 0 + masks_np = word_masks.cpu().numpy() + return masks_np, None + else: + if dataset != 'ctw1500' and not zero_shot: + updated_scores = matrix_nms( + seg_masks=(word_masks[:, 0, :, :] > self.model.mask_threshold), + scores=scores[:, 1] + ) + else: + updated_scores = matrix_nms( + seg_masks=(masks[:, -2, :, :] > self.model.mask_threshold), + scores=scores[:, 1] + ) + keep = updated_scores > nms_thresh + if keep.sum() == 0: + return None, None + if dataset == 'ctw1500': + masks = masks[keep][:, 0:1, :, :] + else: + masks = word_masks[keep][:, 0:1, :, :] + scores = scores[keep][:, 1] + masks = self.model.postprocess_masks(masks, self.input_size, self.original_size) + masks = masks > self.model.mask_threshold + masks_np = masks.cpu().numpy() + scores_np = scores.cpu().numpy() + return masks_np, scores_np + + def get_image_embedding(self) -> torch.Tensor: + """ + Returns the image embeddings for the currently set image, with + shape 1xCxHxW, where C is the embedding dimension and (H,W) are + the embedding spatial dimension of SAM (typically C=256, H=W=64). + """ + if not self.is_image_set: + raise RuntimeError( + "An image must be set with .set_image(...) to generate an embedding." + ) + assert self.features is not None, "Features must exist if an image has been set." 
+ return self.features + + @property + def device(self) -> torch.device: + return self.model.device + + def reset_image(self) -> None: + """Resets the currently set image.""" + self.is_image_set = False + self.features = None + self.orig_h = None + self.orig_w = None + self.input_h = None + self.input_w = None + + def reset_fgmask(self) -> None: + self.is_fgmask_set = False + self.fgmask = None + + +def matrix_nms(seg_masks, scores, kernel='gaussian', sigma=2.0, sum_masks=None): + """Matrix NMS from SOLOv2 + + Args: + seg_masks (Tensor): shape (n, h, w) + scores (Tensor): shape (n) + kernel (str): 'linear' or 'gaussian' + sigma (float): std in gaussian method + sum_masks (Tensor): the sum of seg_masks + """ + n_samples = len(seg_masks) + if sum_masks is None: + sum_masks = seg_masks.sum((1, 2)).float() + seg_masks = seg_masks.reshape(n_samples, -1).float() + # inter + inter_matrix = torch.mm(seg_masks, seg_masks.transpose(1, 0)) + del seg_masks + # union + sum_masks = sum_masks.expand(n_samples, n_samples) + # iou + iou_matrix = (inter_matrix / (sum_masks + sum_masks.transpose(1, 0) - inter_matrix)).triu(diagonal=1) + # IOU compensation + compensate_iou, _ = iou_matrix.max(0) + compensate_iou = compensate_iou.expand(n_samples, n_samples).transpose(1, 0) + # IOU decay + decay_iou = iou_matrix # no label matrix because there is only one foreground class + + if kernel == 'gaussian': + decay_matrix = torch.exp(-1 * sigma * (decay_iou ** 2)) + compensate_matrix = torch.exp(-1 * sigma * (compensate_iou ** 2)) + decay_coef, _ = (decay_matrix / compensate_matrix).min(0) + elif kernel == 'linear': + decay_matrix = (1 - decay_iou) / (1 - compensate_iou) + decay_coef, _ = decay_matrix.min(0) + else: + raise NotImplementedError + updated_score = scores * decay_coef + return updated_score + + +def get_para_iou(para_masks): + """ + Args: + para_masks (Tensor): shape (n, h, w) + """ + n_samples = len(para_masks) + sum_masks = para_masks.sum((1, 2)).float() + para_masks = para_masks.reshape(n_samples, -1).float() + inter_matrix = torch.mm(para_masks, para_masks.transpose(1, 0)) + # del para_masks + sum_masks = sum_masks.expand(n_samples, n_samples) + iou_matrix = (inter_matrix / (sum_masks + sum_masks.transpose(1, 0) - inter_matrix)) + + return iou_matrix diff --git a/hi_sam/modeling/build.py b/hi_sam/modeling/build.py new file mode 100644 index 0000000..6cfff92 --- /dev/null +++ b/hi_sam/modeling/build.py @@ -0,0 +1,168 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. +import pdb + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. 
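
Matrix NMS (from SOLOv2) suppresses duplicates by decaying scores rather than hard-discarding masks: each score is multiplied by a factor driven by its IoU with higher-scored masks. A toy illustration calling the `matrix_nms` helper defined above; the masks are ordered by descending score, which the decay computation assumes:

```python
import torch
from hi_sam.modeling.auto_mask_generator import matrix_nms

masks = torch.zeros(3, 16, 16, dtype=torch.bool)
masks[0, 2:10, 2:10] = True      # mask A
masks[1, 12:15, 12:15] = True    # unrelated mask
masks[2, 3:10, 2:10] = True      # near-duplicate of A (IoU ≈ 0.875)
scores = torch.tensor([0.90, 0.85, 0.80])

updated = matrix_nms(masks, scores)   # gaussian kernel, sigma=2.0 by default
print(updated)                        # ≈ [0.90, 0.85, 0.17]: the duplicate is decayed
print(updated > 0.5)                  # same style of threshold as nms_thresh above
```
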
+ +import torch + +from functools import partial +import copy +import os + +from .image_encoder import ImageEncoderViT +from .mask_decoder import MaskDecoder, HiDecoder +from .prompt_encoder import PromptEncoder +from .hi_sam import HiSam +from .transformer import TwoWayTransformer +from .modal_aligner import ModalAligner + + +def build_sam_vit_h(args): + return _build_sam( + encoder_embed_dim=1280, + encoder_depth=32, + encoder_num_heads=16, + encoder_global_attn_indexes=[7, 15, 23, 31], + args=args, + ) + + +def build_sam_vit_l(args): + return _build_sam( + encoder_embed_dim=1024, + encoder_depth=24, + encoder_num_heads=16, + encoder_global_attn_indexes=[5, 11, 17, 23], + args=args, + ) + + +def build_sam_vit_b(args): + return _build_sam( + encoder_embed_dim=768, + encoder_depth=12, + encoder_num_heads=12, + encoder_global_attn_indexes=[2, 5, 8, 11], + args=args, + ) + + +model_registry = { + "vit_h": build_sam_vit_h, + "vit_l": build_sam_vit_l, + "vit_b": build_sam_vit_b, +} + + +def _build_sam( + encoder_embed_dim, + encoder_depth, + encoder_num_heads, + encoder_global_attn_indexes, + args, +): + prompt_embed_dim = 256 + image_size = 1024 + vit_patch_size = 16 + image_embedding_size = image_size // vit_patch_size # 64 + + checkpoint = args.checkpoint + + model = HiSam( + image_encoder=ImageEncoderViT( + depth=encoder_depth, + embed_dim=encoder_embed_dim, + img_size=image_size, + mlp_ratio=4, + norm_layer=partial(torch.nn.LayerNorm, eps=1e-6), + num_heads=encoder_num_heads, + patch_size=vit_patch_size, + qkv_bias=True, + use_rel_pos=True, + global_attn_indexes=encoder_global_attn_indexes, + window_size=14, + out_chans=prompt_embed_dim, + ), + modal_aligner=ModalAligner( + prompt_embed_dim, + attn_layers=args.attn_layers, + prompt_len=args.prompt_len + ), + prompt_encoder=PromptEncoder( + embed_dim=prompt_embed_dim, + image_embedding_size=(image_embedding_size, image_embedding_size), + input_image_size=(image_size, image_size), + mask_in_chans=16, + ), + mask_decoder=MaskDecoder( + num_multimask_outputs=3, + transformer=TwoWayTransformer( + depth=2, + embedding_dim=prompt_embed_dim, + mlp_dim=2048, + num_heads=8, + ), + transformer_dim=prompt_embed_dim, + iou_head_depth=3, + iou_head_hidden_dim=256, + ), + pixel_mean=[123.675, 116.28, 103.53], + pixel_std=[58.395, 57.12, 57.375], + ) + + if args.hier_det: + model.hier_det = True + model.hi_decoder = HiDecoder( + num_multimask_outputs=3, + transformer=TwoWayTransformer( + depth=2, + embedding_dim=prompt_embed_dim, + mlp_dim=2048, + num_heads=8, + ), + transformer_dim=prompt_embed_dim, + ) + + if checkpoint is not None: + with open(checkpoint, "rb") as f: + state_dict = torch.load(f, map_location="cpu") + dict_keys = state_dict.keys() + if 'optimizer' in dict_keys or 'lr_scheduler' in dict_keys or 'epoch' in dict_keys: + state_dict = state_dict['model'] + dict_keys = state_dict.keys() + + # check whether to use the paras. of SAM's mask decoder to initialize H-Decoder + if args.hier_det: + contain_hi_decoder = False + for key in dict_keys: + if 'hi_decoder' in key: + contain_hi_decoder = True + break + if not contain_hi_decoder: + ckpt_dir = os.path.dirname(checkpoint) + with open(os.path.join('pretrained_checkpoint', args.model_type+'_maskdecoder.pth'), "rb") as f: + mask_decoder_dict = torch.load(f) + for key, value in mask_decoder_dict.items(): + new_key = key.replace('mask_decoder', 'hi_decoder') + state_dict[new_key] = value + + # load SAM's ViT backbone paras. 
+ if args.model_type == 'vit_b': + sam_path = os.path.join('pretrained_checkpoint', 'sam_vit_b_01ec64.pth') + elif args.model_type == 'vit_l': + sam_path = os.path.join('pretrained_checkpoint', 'sam_vit_l_0b3195.pth') + elif args.model_type == 'vit_h': + sam_path = os.path.join('pretrained_checkpoint', 'sam_vit_h_4b8939.pth') + with open(sam_path, "rb") as f: + sam_dict = torch.load(f) + for key, value in sam_dict.items(): + if key not in dict_keys: + state_dict[key] = value + del sam_dict + + info = model.load_state_dict(state_dict, strict=False) + print(info) + + return model diff --git a/hi_sam/modeling/common.py b/hi_sam/modeling/common.py new file mode 100644 index 0000000..945d4c8 --- /dev/null +++ b/hi_sam/modeling/common.py @@ -0,0 +1,65 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. + +import torch +import torch.nn as nn +import torch.nn.functional as F + +from typing import Type + + +class MLPBlock(nn.Module): + def __init__( + self, + embedding_dim: int, + mlp_dim: int, + act: Type[nn.Module] = nn.GELU, + ) -> None: + super().__init__() + self.lin1 = nn.Linear(embedding_dim, mlp_dim) + self.lin2 = nn.Linear(mlp_dim, embedding_dim) + self.act = act() + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return self.lin2(self.act(self.lin1(x))) + + +# From https://github.com/facebookresearch/detectron2/blob/main/detectron2/layers/batch_norm.py # noqa +# Itself from https://github.com/facebookresearch/ConvNeXt/blob/d1fa8f6fef0a165b27399986cc2bdacc92777e40/models/convnext.py#L119 # noqa +class LayerNorm2d(nn.Module): + def __init__(self, num_channels: int, eps: float = 1e-6) -> None: + super().__init__() + self.weight = nn.Parameter(torch.ones(num_channels)) + self.bias = nn.Parameter(torch.zeros(num_channels)) + self.eps = eps + + def forward(self, x: torch.Tensor) -> torch.Tensor: + u = x.mean(1, keepdim=True) + s = (x - u).pow(2).mean(1, keepdim=True) + x = (x - u) / torch.sqrt(s + self.eps) + x = self.weight[:, None, None] * x + self.bias[:, None, None] + return x + + +class Adapter(nn.Module): + def __init__(self, D_features, mlp_ratio=0.25, act_layer=nn.GELU, skip_connect=True): + super().__init__() + self.skip_connect = skip_connect + D_hidden_features = int(D_features * mlp_ratio) + self.act = act_layer() + self.D_fc1 = nn.Linear(D_features, D_hidden_features) + self.D_fc2 = nn.Linear(D_hidden_features, D_features) + + def forward(self, x): + # x is (BT, HW+1, D) + xs = self.D_fc1(x) + xs = self.act(xs) + xs = self.D_fc2(xs) + if self.skip_connect: + x = x + xs + else: + x = xs + return x diff --git a/hi_sam/modeling/hi_sam.py b/hi_sam/modeling/hi_sam.py new file mode 100644 index 0000000..f4bf4d2 --- /dev/null +++ b/hi_sam/modeling/hi_sam.py @@ -0,0 +1,172 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. 
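
The checkpoint logic above back-fills any keys missing from a fine-tuned checkpoint (for example the frozen SAM backbone) from the original SAM weights and then loads non-strictly. The same merge pattern on a toy module, just to make the mechanics concrete:

```python
import torch
import torch.nn as nn

# Keys present in the "fine-tuned" dict win; anything missing is copied from
# the base weights, and strict=False reports rather than raises on mismatches.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

base_ckpt = model.state_dict()                                   # stands in for SAM weights
finetuned = {"0.weight": torch.zeros(4, 4), "0.bias": torch.zeros(4)}  # partial fine-tune

merged = dict(finetuned)
for key, value in base_ckpt.items():
    merged.setdefault(key, value)                                # back-fill missing keys

info = model.load_state_dict(merged, strict=False)
print(info)   # all keys matched after the merge
```
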
+ +import torch +from torch import nn +from torch.nn import functional as F + +from typing import Any, Dict, List, Tuple + +from .image_encoder import ImageEncoderViT +from .mask_decoder import MaskDecoder, HiDecoder +from .prompt_encoder import PromptEncoder +from .modal_aligner import ModalAligner + + +class HiSam(nn.Module): + mask_threshold: float = 0.0 + image_format: str = "RGB" + + def __init__( + self, + image_encoder: ImageEncoderViT, + modal_aligner: ModalAligner, + prompt_encoder: PromptEncoder, + mask_decoder: MaskDecoder, + pixel_mean: List[float] = [123.675, 116.28, 103.53], + pixel_std: List[float] = [58.395, 57.12, 57.375], + ) -> None: + super().__init__() + self.image_encoder = image_encoder + for n, p in self.image_encoder.named_parameters(): + if "Adapter" not in n: + p.requires_grad = False + print("Freeze image encoder.") + + self.prompt_encoder = prompt_encoder + for p in self.prompt_encoder.parameters(): + p.requires_grad = False + + self.modal_aligner = modal_aligner + self.mask_decoder = mask_decoder + self.register_buffer("pixel_mean", torch.Tensor(pixel_mean).view(-1, 1, 1), False) + self.register_buffer("pixel_std", torch.Tensor(pixel_std).view(-1, 1, 1), False) + + self.hier_det = False + self.hi_decoder = None + + @property + def device(self) -> Any: + return self.pixel_mean.device + + def forward( + self, + batched_input: List[Dict[str, Any]], + multimask_output: bool, + ): + input_images = torch.stack([self.preprocess(x["image"]) for x in batched_input], dim=0) + + image_embeddings = self.image_encoder(input_images) + sparse_emb = self.modal_aligner(image_embeddings) + + up_masks_logits = [] + up_masks = [] + iou_preds = [] + hr_masks_logits = [] + hr_masks = [] + iou_preds_hr = [] + if self.hier_det: + hi_masks_logits = [] + hi_iou_preds = [] + word_masks_logits = [] + + for image_record, curr_embedding, sparse_embeddings in zip(batched_input, image_embeddings, sparse_emb): + low_res_masks, high_res_masks, iou_pred, iou_pred_hr = self.mask_decoder( + image_embeddings=curr_embedding.unsqueeze(0), + image_pe=self.prompt_encoder.get_dense_pe(), + sparse_prompt_embeddings=sparse_embeddings.unsqueeze(0), + multimask_output=multimask_output + ) + iou_preds.append(iou_pred) + iou_preds_hr.append(iou_pred_hr) + upscaled_masks = self.postprocess_masks( + low_res_masks, + input_size=image_record["image"].shape[-2:], + original_size=image_record["original_size"], + ) + high_res_masks = self.postprocess_masks( + high_res_masks, + input_size=image_record["image"].shape[-2:], + original_size=image_record["original_size"], + ) + up_masks_logits.append(upscaled_masks) + up_masks.append(upscaled_masks > self.mask_threshold) + hr_masks_logits.append(high_res_masks) + hr_masks.append(high_res_masks > self.mask_threshold) + + if self.hier_det: + points = (image_record["point_coords"], image_record["point_labels"]) + point_embeddings, _ = self.prompt_encoder( + points=points, boxes=None, masks=None + ) + hi_masks, hi_iou_pred, word_masks = self.hi_decoder( + image_embeddings=curr_embedding.unsqueeze(0), + image_pe=self.prompt_encoder.get_dense_pe(), + sparse_prompt_embeddings=point_embeddings, + multimask_output=True + ) + hi_masks_logits.append(hi_masks) + hi_iou_preds.append(hi_iou_pred) + word_masks_logits.append(word_masks) + + up_masks_logits = torch.cat(up_masks_logits, dim=0) + up_masks = torch.cat(up_masks, dim=0) + iou_preds = torch.cat(iou_preds, dim=0) + hr_masks_logits = torch.cat(hr_masks_logits, dim=0) + hr_masks = torch.cat(hr_masks, dim=0) + iou_preds_hr = 
torch.cat(iou_preds_hr, dim=0) + + if self.hier_det: + hi_masks_logits = torch.cat(hi_masks_logits, dim=0) + hi_iou_preds = torch.cat(hi_iou_preds, dim=0) + word_masks_logits = torch.cat(word_masks_logits, dim=0) + return (up_masks_logits, up_masks, iou_preds, hr_masks_logits, hr_masks, iou_preds_hr, + hi_masks_logits, hi_iou_preds, word_masks_logits) + else: + return up_masks_logits, up_masks, iou_preds, hr_masks_logits, hr_masks, iou_preds_hr + + def postprocess_masks( + self, + masks: torch.Tensor, + input_size: Tuple[int, ...], + original_size: Tuple[int, ...], + ) -> torch.Tensor: + """ + Remove padding and upscale masks to the original image size. + + Arguments: + masks (torch.Tensor): Batched masks from the mask_decoder, + in BxCxHxW format. + input_size (tuple(int, int)): The size of the image input to the + model, in (H, W) format. Used to remove padding. + original_size (tuple(int, int)): The original size of the image + before resizing for input to the model, in (H, W) format. + + Returns: + (torch.Tensor): Batched masks in BxCxHxW format, where (H, W) + is given by original_size. + """ + masks = F.interpolate( + masks, + (self.image_encoder.img_size, self.image_encoder.img_size), + mode="bilinear", + align_corners=False, + ) + masks = masks[..., : input_size[0], : input_size[1]] + masks = F.interpolate(masks, original_size, mode="bilinear", align_corners=False) + return masks + + def preprocess(self, x: torch.Tensor) -> torch.Tensor: + """Normalize pixel values and pad to a square input.""" + # Normalize colors + x = (x - self.pixel_mean) / self.pixel_std + + # Pad + h, w = x.shape[-2:] + padh = self.image_encoder.img_size - h + padw = self.image_encoder.img_size - w + x = F.pad(x, (0, padw, 0, padh)) + return x \ No newline at end of file diff --git a/hi_sam/modeling/image_encoder.py b/hi_sam/modeling/image_encoder.py new file mode 100644 index 0000000..7aeed3c --- /dev/null +++ b/hi_sam/modeling/image_encoder.py @@ -0,0 +1,402 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. + +import torch +import torch.nn as nn +import torch.nn.functional as F +from typing import Optional, Tuple, Type +from .common import LayerNorm2d, MLPBlock, Adapter +import torch.utils.checkpoint as cp + + +# This class and its supporting functions below lightly adapted from the ViTDet backbone available at: https://github.com/facebookresearch/detectron2/blob/main/detectron2/modeling/backbone/vit.py # noqa +class ImageEncoderViT(nn.Module): + def __init__( + self, + img_size: int = 1024, + patch_size: int = 16, + in_chans: int = 3, + embed_dim: int = 768, + depth: int = 12, + num_heads: int = 12, + mlp_ratio: float = 4.0, + out_chans: int = 256, + qkv_bias: bool = True, + norm_layer: Type[nn.Module] = nn.LayerNorm, + act_layer: Type[nn.Module] = nn.GELU, + use_abs_pos: bool = True, + use_rel_pos: bool = False, + rel_pos_zero_init: bool = True, + window_size: int = 0, + global_attn_indexes: Tuple[int, ...] = (), + ) -> None: + """ + Args: + img_size (int): Input image size. + patch_size (int): Patch size. + in_chans (int): Number of input image channels. + embed_dim (int): Patch embedding dimension. + depth (int): Depth of ViT. + num_heads (int): Number of attention heads in each ViT block. + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. + qkv_bias (bool): If True, add a learnable bias to query, key, value. 
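
`HiSam.forward` consumes a list of per-image records rather than a plain tensor batch. A hedged sketch of one record with toy values; the keys follow the code above, and the point prompts are only required once `hier_det` is enabled:

```python
import torch

H, W = 720, 960                                          # original image size
image = torch.randint(0, 256, (3, 768, 1024)).float()    # resized so the long side is 1024

record = {
    "image": image,                 # 3xHxW RGB values; preprocess() normalizes and pads to 1024x1024
    "original_size": (H, W),        # used by postprocess_masks to restore the input resolution
    # only needed when model.hier_det is True:
    "point_coords": torch.tensor([[[512.0, 300.0]]]),    # (num_prompts, 1, 2), resized-image pixels
    "point_labels": torch.ones(1, 1),                    # foreground label per point
}
# outputs = model([record], multimask_output=False)      # model: a HiSam instance
```
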
+ norm_layer (nn.Module): Normalization layer. + act_layer (nn.Module): Activation layer. + use_abs_pos (bool): If True, use absolute positional embeddings. + use_rel_pos (bool): If True, add relative positional embeddings to the attention map. + rel_pos_zero_init (bool): If True, zero initialize relative positional parameters. + window_size (int): Window size for window attention blocks. + global_attn_indexes (list): Indexes for blocks using global attention. + """ + super().__init__() + self.img_size = img_size + + self.patch_embed = PatchEmbed( + kernel_size=(patch_size, patch_size), + stride=(patch_size, patch_size), + in_chans=in_chans, + embed_dim=embed_dim, + ) + + self.pos_embed: Optional[nn.Parameter] = None + if use_abs_pos: + # Initialize absolute positional embedding with pretrain image size. + self.pos_embed = nn.Parameter( + torch.zeros(1, img_size // patch_size, img_size // patch_size, embed_dim) + ) + + self.blocks = nn.ModuleList() + + for i in range(depth): + block = Block( + dim=embed_dim, + num_heads=num_heads, + mlp_ratio=mlp_ratio, + qkv_bias=qkv_bias, + norm_layer=norm_layer, + act_layer=act_layer, + use_rel_pos=use_rel_pos, + rel_pos_zero_init=rel_pos_zero_init, + window_size=window_size if i not in global_attn_indexes else 0, + input_size=(img_size // patch_size, img_size // patch_size), + ) + self.blocks.append(block) + + self.neck = nn.Sequential( + nn.Conv2d( + embed_dim, + out_chans, + kernel_size=1, + bias=False, + ), + LayerNorm2d(out_chans), + nn.Conv2d( + out_chans, + out_chans, + kernel_size=3, + padding=1, + bias=False, + ), + LayerNorm2d(out_chans), + ) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + # with torch.no_grad(): + x = self.patch_embed(x) + if self.pos_embed is not None: + x = x + self.pos_embed + for blk in self.blocks: + x = blk(x) + x = self.neck(x.permute(0, 3, 1, 2)) + + return x + + +class Block(nn.Module): + """Transformer blocks with support of window attention and residual propagation blocks""" + + def __init__( + self, + dim: int, + num_heads: int, + mlp_ratio: float = 4.0, + qkv_bias: bool = True, + norm_layer: Type[nn.Module] = nn.LayerNorm, + act_layer: Type[nn.Module] = nn.GELU, + use_rel_pos: bool = False, + rel_pos_zero_init: bool = True, + window_size: int = 0, + input_size: Optional[Tuple[int, int]] = None, + scale: float = 0.5, # new + ) -> None: + """ + Args: + dim (int): Number of input channels. + num_heads (int): Number of attention heads in each ViT block. + mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. + qkv_bias (bool): If True, add a learnable bias to query, key, value. + norm_layer (nn.Module): Normalization layer. + act_layer (nn.Module): Activation layer. + use_rel_pos (bool): If True, add relative positional embeddings to the attention map. + rel_pos_zero_init (bool): If True, zero initialize relative positional parameters. + window_size (int): Window size for window attention blocks. If it equals 0, then + use global attention. + input_size (int or None): Input resolution for calculating the relative positional + parameter size. 
+ """ + super().__init__() + self.norm1 = norm_layer(dim) + self.attn = Attention( + dim, + num_heads=num_heads, + qkv_bias=qkv_bias, + use_rel_pos=use_rel_pos, + rel_pos_zero_init=rel_pos_zero_init, + input_size=input_size if window_size == 0 else (window_size, window_size), + ) + + self.Space_Adapter = Adapter(dim) + self.scale = scale + self.MLP_Adapter = Adapter(dim, skip_connect=False) + + self.norm2 = norm_layer(dim) + self.mlp = MLPBlock(embedding_dim=dim, mlp_dim=int(dim * mlp_ratio), act=act_layer) + + self.window_size = window_size + + def forward(self, x: torch.Tensor) -> torch.Tensor: + shortcut = x + x = self.norm1(x) + # Window partition + if self.window_size > 0: + H, W = x.shape[1], x.shape[2] + x, pad_hw = window_partition(x, self.window_size) + + x = cp.checkpoint(self.attn, x) + x = self.Space_Adapter(x) + + # Reverse window partition + if self.window_size > 0: + x = window_unpartition(x, self.window_size, pad_hw, (H, W)) + x = shortcut + x + + shortcut = x + x = self.norm2(x) + x = shortcut + cp.checkpoint(self.mlp, x) + self.scale * self.MLP_Adapter(x) + + return x + +class Attention(nn.Module): + """Multi-head Attention block with relative position embeddings.""" + + def __init__( + self, + dim: int, + num_heads: int = 8, + qkv_bias: bool = True, + use_rel_pos: bool = False, + rel_pos_zero_init: bool = True, + input_size: Optional[Tuple[int, int]] = None, + ) -> None: + """ + Args: + dim (int): Number of input channels. + num_heads (int): Number of attention heads. + qkv_bias (bool: If True, add a learnable bias to query, key, value. + rel_pos (bool): If True, add relative positional embeddings to the attention map. + rel_pos_zero_init (bool): If True, zero initialize relative positional parameters. + input_size (int or None): Input resolution for calculating the relative positional + parameter size. + """ + super().__init__() + self.num_heads = num_heads + head_dim = dim // num_heads + self.scale = head_dim**-0.5 + + self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias) + self.proj = nn.Linear(dim, dim) + + self.use_rel_pos = use_rel_pos + if self.use_rel_pos: + assert ( + input_size is not None + ), "Input size must be provided if using relative positional encoding." + # initialize relative positional embeddings + self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim)) + self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim)) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + B, H, W, _ = x.shape + # qkv with shape (3, B, nHead, H * W, C) + qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4) + # q, k, v with shape (B * nHead, H * W, C) + q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0) + + attn = (q * self.scale) @ k.transpose(-2, -1) + + if self.use_rel_pos: + attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W)) + + attn = attn.softmax(dim=-1) + x = (attn @ v).view(B, self.num_heads, H, W, -1).permute(0, 2, 3, 1, 4).reshape(B, H, W, -1) + x = self.proj(x) + return x + + + +def window_partition(x: torch.Tensor, window_size: int) -> Tuple[torch.Tensor, Tuple[int, int]]: + """ + Partition into non-overlapping windows with padding if needed. + Args: + x (tensor): input tokens with [B, H, W, C]. + window_size (int): window size. + + Returns: + windows: windows after partition with [B * num_windows, window_size, window_size, C]. 
+ (Hp, Wp): padded height and width before partition + """ + B, H, W, C = x.shape + + pad_h = (window_size - H % window_size) % window_size + pad_w = (window_size - W % window_size) % window_size + if pad_h > 0 or pad_w > 0: + x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h)) + Hp, Wp = H + pad_h, W + pad_w + + x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C) + windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C) + return windows, (Hp, Wp) + + +def window_unpartition( + windows: torch.Tensor, window_size: int, pad_hw: Tuple[int, int], hw: Tuple[int, int] +) -> torch.Tensor: + """ + Window unpartition into original sequences and removing padding. + Args: + x (tensor): input tokens with [B * num_windows, window_size, window_size, C]. + window_size (int): window size. + pad_hw (Tuple): padded height and width (Hp, Wp). + hw (Tuple): original height and width (H, W) before padding. + + Returns: + x: unpartitioned sequences with [B, H, W, C]. + """ + Hp, Wp = pad_hw + H, W = hw + B = windows.shape[0] // (Hp * Wp // window_size // window_size) + x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1) + x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1) + + if Hp > H or Wp > W: + x = x[:, :H, :W, :].contiguous() + return x + + +def get_rel_pos(q_size: int, k_size: int, rel_pos: torch.Tensor) -> torch.Tensor: + """ + Get relative positional embeddings according to the relative positions of + query and key sizes. + Args: + q_size (int): size of query q. + k_size (int): size of key k. + rel_pos (Tensor): relative position embeddings (L, C). + + Returns: + Extracted positional embeddings according to relative positions. + """ + max_rel_dist = int(2 * max(q_size, k_size) - 1) + # Interpolate rel pos if needed. + if rel_pos.shape[0] != max_rel_dist: + # Interpolate rel pos. + rel_pos_resized = F.interpolate( + rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1), + size=max_rel_dist, + mode="linear", + ) + rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0) + else: + rel_pos_resized = rel_pos + + # Scale the coords with short length if shapes for q and k are different. + q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0) + k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0) + relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0) + + return rel_pos_resized[relative_coords.long()] + + +def add_decomposed_rel_pos( + attn: torch.Tensor, + q: torch.Tensor, + rel_pos_h: torch.Tensor, + rel_pos_w: torch.Tensor, + q_size: Tuple[int, int], + k_size: Tuple[int, int], +) -> torch.Tensor: + """ + Calculate decomposed Relative Positional Embeddings from :paper:`mvitv2`. + https://github.com/facebookresearch/mvit/blob/19786631e330df9f3622e5402b4a419a263a2c80/mvit/models/attention.py # noqa B950 + Args: + attn (Tensor): attention map. + q (Tensor): query q in the attention layer with shape (B, q_h * q_w, C). + rel_pos_h (Tensor): relative position embeddings (Lh, C) for height axis. + rel_pos_w (Tensor): relative position embeddings (Lw, C) for width axis. + q_size (Tuple): spatial sequence size of query q with (q_h, q_w). + k_size (Tuple): spatial sequence size of key k with (k_h, k_w). + + Returns: + attn (Tensor): attention map with added relative positional embeddings. 
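
`window_partition` and `window_unpartition` are exact inverses up to the padding they add and later crop. A round-trip check with the 64×64 token grid and window size 14 used by the encoder:

```python
import torch
from hi_sam.modeling.image_encoder import window_partition, window_unpartition

# A 64x64 token map with window_size=14 is padded to 70x70, split into 25
# windows, then stitched back and cropped to the original size.
x = torch.randn(1, 64, 64, 768)
windows, pad_hw = window_partition(x, window_size=14)
print(windows.shape, pad_hw)                 # torch.Size([25, 14, 14, 768]) (70, 70)
restored = window_unpartition(windows, 14, pad_hw, (64, 64))
print(torch.allclose(restored, x))           # True
```
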
+ """ + q_h, q_w = q_size + k_h, k_w = k_size + Rh = get_rel_pos(q_h, k_h, rel_pos_h) + Rw = get_rel_pos(q_w, k_w, rel_pos_w) + + B, _, dim = q.shape + r_q = q.reshape(B, q_h, q_w, dim) + rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh) + rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw) + + attn = ( + attn.view(B, q_h, q_w, k_h, k_w) + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :] + ).view(B, q_h * q_w, k_h * k_w) + + return attn + + +class PatchEmbed(nn.Module): + """ + Image to Patch Embedding. + """ + + def __init__( + self, + kernel_size: Tuple[int, int] = (16, 16), + stride: Tuple[int, int] = (16, 16), + padding: Tuple[int, int] = (0, 0), + in_chans: int = 3, + embed_dim: int = 768, + ) -> None: + """ + Args: + kernel_size (Tuple): kernel size of the projection layer. + stride (Tuple): stride of the projection layer. + padding (Tuple): padding size of the projection layer. + in_chans (int): Number of input image channels. + embed_dim (int): embed_dim (int): Patch embedding dimension. + """ + super().__init__() + + self.proj = nn.Conv2d( + in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding + ) + + def forward(self, x: torch.Tensor) -> torch.Tensor: + x = self.proj(x) + # B C H W -> B H W C + x = x.permute(0, 2, 3, 1) + return x diff --git a/hi_sam/modeling/loss.py b/hi_sam/modeling/loss.py new file mode 100644 index 0000000..d77c212 --- /dev/null +++ b/hi_sam/modeling/loss.py @@ -0,0 +1,280 @@ +import sys + +import torch +from torch.nn import functional as F +from typing import List, Optional +import utils.misc as misc + + +def mean_square_loss( + inputs: torch.Tensor, + targets: torch.Tensor +): + loss = F.mse_loss(inputs, targets) + return loss + + +mean_square_loss_jit = torch.jit.script( + mean_square_loss +) # type: torch.jit.ScriptModule + + +def loss_iou_mse(pred_iou, src_masks, target_masks): + """ + pred_iou.shape: (bs, 1) + + (2) get mse loss with pred_iou & gt_iou + """ + with torch.no_grad(): + src_masks = src_masks.type(torch.int).flatten(1) # (bs,1024*1024) + target_masks = (target_masks > 0.5).type(torch.int).flatten(1) + area_src = src_masks.sum(dim=1) + area_target = target_masks.sum(dim=1) + intersection = (src_masks * target_masks).sum(dim=1) + gt_iou = intersection / (area_src + area_target - intersection + 1e-6) + gt_iou = gt_iou.reshape(-1, 1) + loss_mse = mean_square_loss_jit(pred_iou, gt_iou) + del src_masks + del target_masks + return loss_mse + + +def loss_hi_iou_mse(pred_iou, src_masks, mask_thresh, target_masks): + # src_masks: logits + with torch.no_grad(): + target_masks = F.interpolate(target_masks, src_masks.shape[-2:], mode="bilinear", align_corners=False) + target_masks = (target_masks > 0.5).type(torch.int).flatten(1) + src_masks = (src_masks > mask_thresh).type(torch.int).flatten(1) + area_src = src_masks.sum(dim=1) + area_target = target_masks.sum(dim=1) + intersection = (src_masks * target_masks).sum(dim=1) + gt_iou = intersection / (area_src + area_target - intersection + 1e-6) + gt_iou = gt_iou.reshape(-1, 1) + loss_mse = mean_square_loss_jit(pred_iou, gt_iou) + del src_masks + del target_masks + return loss_mse + + +def point_sample(input, point_coords, **kwargs): + """ + A wrapper around :function:`torch.nn.functional.grid_sample` to support 3D point_coords tensors. + Unlike :function:`torch.nn.functional.grid_sample` it assumes `point_coords` to lie inside + [0, 1] x [0, 1] square. + Args: + input (Tensor): A tensor of shape (N, C, H, W) that contains features map on a H x W grid. 
+ point_coords (Tensor): A tensor of shape (N, P, 2) or (N, Hgrid, Wgrid, 2) that contains + [0, 1] x [0, 1] normalized point coordinates. + Returns: + output (Tensor): A tensor of shape (N, C, P) or (N, C, Hgrid, Wgrid) that contains + features for points in `point_coords`. The features are obtained via bilinear + interplation from `input` the same way as :function:`torch.nn.functional.grid_sample`. + """ + add_dim = False + if point_coords.dim() == 3: + add_dim = True + point_coords = point_coords.unsqueeze(2) + output = F.grid_sample(input, 2.0 * point_coords - 1.0, **kwargs) + if add_dim: + output = output.squeeze(3) + return output + + +def cat(tensors: List[torch.Tensor], dim: int = 0): + """ + Efficient version of torch.cat that avoids a copy if there is only a single element in a list + """ + assert isinstance(tensors, (list, tuple)) + if len(tensors) == 1: + return tensors[0] + return torch.cat(tensors, dim) + + +def get_uncertain_point_coords_with_randomness( + coarse_logits, uncertainty_func, num_points, oversample_ratio, importance_sample_ratio +): + """ + Sample points in [0, 1] x [0, 1] coordinate space based on their uncertainty. The unceratinties + are calculated for each point using 'uncertainty_func' function that takes point's logit + prediction as input. + See PointRend paper for details. + Args: + coarse_logits (Tensor): A tensor of shape (N, C, Hmask, Wmask) or (N, 1, Hmask, Wmask) for + class-specific or class-agnostic prediction. + uncertainty_func: A function that takes a Tensor of shape (N, C, P) or (N, 1, P) that + contains logit predictions for P points and returns their uncertainties as a Tensor of + shape (N, 1, P). + num_points (int): The number of points P to sample. + oversample_ratio (int): Oversampling parameter. + importance_sample_ratio (float): Ratio of points that are sampled via importnace sampling. + Returns: + point_coords (Tensor): A tensor of shape (N, P, 2) that contains the coordinates of P + sampled points. + """ + assert oversample_ratio >= 1 + assert importance_sample_ratio <= 1 and importance_sample_ratio >= 0 + num_boxes = coarse_logits.shape[0] + num_sampled = int(num_points * oversample_ratio) + point_coords = torch.rand(num_boxes, num_sampled, 2, device=coarse_logits.device) + point_logits = point_sample(coarse_logits, point_coords, align_corners=False) + # It is crucial to calculate uncertainty based on the sampled prediction value for the points. + # Calculating uncertainties of the coarse predictions first and sampling them for points leads + # to incorrect results. + # To illustrate this: assume uncertainty_func(logits)=-abs(logits), a sampled point between + # two coarse predictions with -1 and 1 logits has 0 logits, and therefore 0 uncertainty value. + # However, if we calculate uncertainties for the coarse predictions first, + # both will have -1 uncertainty, and the sampled point will get -1 uncertainty. 
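
`point_sample` simply remaps coordinates from [0, 1] to the [-1, 1] convention of `torch.nn.functional.grid_sample`. Sampling the centre of a 2×2 feature map makes the bilinear behaviour explicit:

```python
import torch
import torch.nn.functional as F

feat = torch.tensor([[[[0.0, 1.0],
                       [2.0, 3.0]]]])                 # (N=1, C=1, H=2, W=2)
coords = torch.tensor([[[[0.5, 0.5]]]])               # (N, 1, 1, 2), (x, y) in [0, 1]
sampled = F.grid_sample(feat, 2.0 * coords - 1.0, align_corners=False)
print(sampled)   # tensor([[[[1.5]]]]): bilinear blend of the four pixel values
```
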
+ point_uncertainties = uncertainty_func(point_logits) + num_uncertain_points = int(importance_sample_ratio * num_points) + num_random_points = num_points - num_uncertain_points + idx = torch.topk(point_uncertainties[:, 0, :], k=num_uncertain_points, dim=1)[1] + shift = num_sampled * torch.arange(num_boxes, dtype=torch.long, device=coarse_logits.device) + idx += shift[:, None] + point_coords = point_coords.view(-1, 2)[idx.view(-1), :].view( + num_boxes, num_uncertain_points, 2 + ) + if num_random_points > 0: + point_coords = cat( + [ + point_coords, + torch.rand(num_boxes, num_random_points, 2, device=coarse_logits.device), + ], + dim=1, + ) + return point_coords + + +def dice_loss( + inputs: torch.Tensor, + targets: torch.Tensor, + num_masks: float, + ): + """ + Compute the DICE loss, similar to generalized IOU for masks + Args: + inputs: A float tensor of arbitrary shape. + The predictions for each example. + targets: A float tensor with the same shape as inputs. Stores the binary + classification label for each element in inputs + (0 for the negative class and 1 for the positive class). + """ + inputs = inputs.sigmoid() + inputs = inputs.flatten(1) + numerator = 2 * (inputs * targets).sum(-1) + denominator = inputs.sum(-1) + targets.sum(-1) + loss = 1 - (numerator + 1) / (denominator + 1) + return loss.sum() / num_masks + + +dice_loss_jit = torch.jit.script( + dice_loss +) # type: torch.jit.ScriptModule + + +def sigmoid_ce_loss( + inputs: torch.Tensor, + targets: torch.Tensor, + num_masks: float, + ): + """ + Args: + inputs: A float tensor of arbitrary shape. + The predictions for each example. + targets: A float tensor with the same shape as inputs. Stores the binary + classification label for each element in inputs + (0 for the negative class and 1 for the positive class). + Returns: + Loss tensor + """ + loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none") + + return loss.mean(1).sum() / num_masks + + +sigmoid_ce_loss_jit = torch.jit.script( + sigmoid_ce_loss +) # type: torch.jit.ScriptModule + + +def sigmoid_focal_loss( + inputs: torch.Tensor, + targets: torch.Tensor, + num_masks: float, + alpha: float = 0.25, + gamma: float = 2, + mask_threshold: float = 0.0 + ): + prob = inputs.sigmoid() + # prob = (inputs > mask_threshold).type(torch.float32) + ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none") + p_t = prob * targets + (1 - prob) * (1 - targets) + loss = ce_loss * ((1 - p_t) ** gamma) + + if alpha >= 0: + alpha_t = alpha * targets + (1 - alpha) * (1 - targets) + loss = alpha_t * loss + + return loss.mean(1).sum() / num_masks + + +sigmoid_focal_loss_jit = torch.jit.script( + sigmoid_focal_loss +) # type: torch.jit.ScriptModule + + +def calculate_uncertainty(logits): + """ + We estimate uncerainty as L1 distance between 0.0 and the logit prediction in 'logits' for the + foreground class in `classes`. + Args: + logits (Tensor): A tensor of shape (R, 1, ...) for class-specific or + class-agnostic, where R is the total number of predicted masks in all images and C is + the number of foreground classes. The values are logits. + Returns: + scores (Tensor): A tensor of shape (R, 1, ...) that contains uncertainty scores with + the most uncertain locations having the highest uncertainty score. 
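
For intuition about the dice term used above, here is a reference reimplementation (mean over masks instead of the explicit `num_masks` division) applied to an over-segmented toy prediction:

```python
import torch

# Dice loss as defined above: 1 - (2*|P∩G| + 1) / (|P| + |G| + 1), averaged over masks.
def dice_loss_ref(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    probs = logits.sigmoid().flatten(1)
    targets = targets.flatten(1)
    num = 2 * (probs * targets).sum(-1)
    den = probs.sum(-1) + targets.sum(-1)
    return (1 - (num + 1) / (den + 1)).mean()

logits = torch.full((1, 1, 4, 4), 10.0)    # confidently predicts everything as text
target = torch.zeros(1, 1, 4, 4)
target[..., :2, :] = 1.0                   # only the top half is text
print(dice_loss_ref(logits, target))       # ≈ 0.32: penalises the over-segmentation
```
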
+ """ + assert logits.shape[1] == 1 + gt_class_logits = logits.clone() + return -(torch.abs(gt_class_logits)) + + +def loss_masks(src_masks, target_masks, num_masks): + src_masks = src_masks.flatten(1) + target_masks = target_masks.flatten(1) + loss_focal = sigmoid_focal_loss_jit(src_masks, target_masks, num_masks) + loss_dice = dice_loss_jit(src_masks, target_masks, num_masks) + del src_masks + del target_masks + return loss_focal, loss_dice + + +def loss_hi_masks(src_masks, target_masks, num_masks, oversample_ratio=3.0, num_points=128 * 128): + with torch.no_grad(): + # sample point_coords + point_coords = get_uncertain_point_coords_with_randomness( + src_masks, + lambda logits: calculate_uncertainty(logits), + num_points, + oversample_ratio, + 0.75, + ) + # get gt labels + point_labels = point_sample( + target_masks, + point_coords, + align_corners=False, + ).squeeze(1) + + point_logits = point_sample( + src_masks, + point_coords, + align_corners=False, + ).squeeze(1) + + loss_ce = sigmoid_ce_loss_jit(point_logits, point_labels, num_masks) + loss_dice = dice_loss_jit(point_logits, point_labels, num_masks) + del src_masks + del target_masks + + return loss_ce, loss_dice diff --git a/hi_sam/modeling/mask_decoder.py b/hi_sam/modeling/mask_decoder.py new file mode 100644 index 0000000..9192244 --- /dev/null +++ b/hi_sam/modeling/mask_decoder.py @@ -0,0 +1,345 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. +import sys + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. + +import torch +from torch import nn +from torch.nn import functional as F +import torch.utils.checkpoint as cp + +from typing import List, Tuple, Type + +from .common import LayerNorm2d + + +class MaskDecoder(nn.Module): + def __init__( + self, + *, + transformer_dim: int, + transformer: nn.Module, + num_multimask_outputs: int = 3, + activation: Type[nn.Module] = nn.GELU, + iou_head_depth: int = 3, + iou_head_hidden_dim: int = 256, + ) -> None: + """ + Predicts masks given an image and prompt embeddings, using a + tranformer architecture. 
+ + Arguments: + transformer_dim (int): the channel dimension of the transformer + transformer (nn.Module): the transformer used to predict masks + num_multimask_outputs (int): the number of masks to predict + when disambiguating masks + activation (nn.Module): the type of activation to use when + upscaling masks + iou_head_depth (int): the depth of the MLP used to predict + mask quality + iou_head_hidden_dim (int): the hidden dimension of the MLP + used to predict mask quality + """ + super().__init__() + self.transformer_dim = transformer_dim + self.transformer = transformer + + self.num_multimask_outputs = num_multimask_outputs + + self.iou_token = nn.Embedding(1, transformer_dim) + self.num_mask_tokens = num_multimask_outputs + 1 + self.mask_tokens = nn.Embedding(self.num_mask_tokens, transformer_dim) + + self.output_upscaling = nn.Sequential( + nn.ConvTranspose2d(transformer_dim, transformer_dim // 4, kernel_size=2, stride=2), + LayerNorm2d(transformer_dim // 4), + activation(), + nn.ConvTranspose2d(transformer_dim // 4, transformer_dim // 8, kernel_size=2, stride=2), + activation(), + ) + self.output_hypernetworks_mlps = nn.ModuleList( + [ + MLP(transformer_dim, transformer_dim, transformer_dim // 8, 3) + for i in range(self.num_mask_tokens) + ] + ) + self.iou_prediction_head = MLP( + transformer_dim, iou_head_hidden_dim, self.num_mask_tokens, iou_head_depth + ) + + # add for high res mask + self.output_upscaling_hr = nn.Sequential( + nn.ConvTranspose2d(transformer_dim // 8, transformer_dim // 16, kernel_size=2, stride=2), + LayerNorm2d(transformer_dim // 16), + activation(), + nn.ConvTranspose2d(transformer_dim // 16, transformer_dim // 16, kernel_size=2, stride=2), + LayerNorm2d(transformer_dim // 16), + activation(), + nn.Conv2d(transformer_dim // 16, transformer_dim // 16, kernel_size=3, padding=1), + LayerNorm2d(transformer_dim // 16), + activation(), + nn.Conv2d(transformer_dim // 16, transformer_dim // 16, kernel_size=3, padding=1), + LayerNorm2d(transformer_dim // 16), + activation(), + nn.Conv2d(transformer_dim // 16, transformer_dim // 16, kernel_size=3, padding=1), + LayerNorm2d(transformer_dim // 16), + activation(), + nn.Conv2d(transformer_dim // 16, transformer_dim // 16, kernel_size=3, padding=1), + activation() + ) + self.output_hypernetworks_mlps_hr = MLP(transformer_dim, transformer_dim, transformer_dim // 16, 3) + self.iou_prediction_head_hr = MLP( + transformer_dim, iou_head_hidden_dim, 1, iou_head_depth + ) + + def forward( + self, + image_embeddings: torch.Tensor, + image_pe: torch.Tensor, + sparse_prompt_embeddings: torch.Tensor, + dense_prompt_embeddings: torch.Tensor = None, + multimask_output: bool = False, + ): # -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: + """ + Predict masks given image and prompt embeddings. 
+ """ + masks, hr_masks, iou_pred, iou_pred_hr = self.predict_masks( + image_embeddings=image_embeddings, + image_pe=image_pe, + sparse_prompt_embeddings=sparse_prompt_embeddings, + dense_prompt_embeddings=dense_prompt_embeddings, + ) + if multimask_output: + mask_slice = slice(1, None) + else: + mask_slice = slice(0, 1) + masks = masks[:, mask_slice, :, :] + iou_pred = iou_pred[:, mask_slice] + hr_masks = hr_masks[:, mask_slice, :, :] + return masks, hr_masks, iou_pred, iou_pred_hr + + def predict_masks( + self, + image_embeddings: torch.Tensor, + image_pe: torch.Tensor, + sparse_prompt_embeddings: torch.Tensor, + dense_prompt_embeddings: torch.Tensor = None, + ): # -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: + """Predicts masks. See 'forward' for more details.""" + # Concatenate output tokens + output_tokens = torch.cat( + [self.iou_token.weight, self.mask_tokens.weight], + dim=0 + ) + output_tokens = output_tokens.unsqueeze(0).expand(sparse_prompt_embeddings.size(0), -1, -1) + tokens = torch.cat((output_tokens, sparse_prompt_embeddings), dim=1) + + # Expand per-image data in batch direction to be per-mask + src = torch.repeat_interleave(image_embeddings, tokens.shape[0], dim=0) # (1, 256, 64, 64) + if dense_prompt_embeddings is not None: + src = src + dense_prompt_embeddings + pos_src = torch.repeat_interleave(image_pe, tokens.shape[0], dim=0) + b, c, h, w = src.shape + + # Run the transformer + hs, src = self.transformer(src, pos_src, tokens) + iou_token_out = hs[:, 0, :] # (1, 256) + mask_tokens_out = hs[:, 1: (1 + self.num_mask_tokens), :] # (1, 4, 256) + + # Upscale mask embeddings and predict masks using the mask tokens + src = src.transpose(1, 2).view(b, c, h, w) + upscaled_embedding = self.output_upscaling(src) + hyper_in_list: List[torch.Tensor] = [] + for i in range(self.num_mask_tokens): + hyper_in_list.append(self.output_hypernetworks_mlps[i](mask_tokens_out[:, i, :])) + hyper_in = torch.stack(hyper_in_list, dim=1) + b, c, h, w = upscaled_embedding.shape + masks = (hyper_in @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w) # (1, 4, 256, 256) + + # add for high res mask + upscaled_embedding = self.output_upscaling_hr(upscaled_embedding) + hyper_in_hr = self.output_hypernetworks_mlps_hr(mask_tokens_out[:, 0, :]) + b, c, h, w = upscaled_embedding.shape + hr_masks = (hyper_in_hr @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w) # (1,1,1024,1024) + + # Generate mask quality predictions + iou_pred = self.iou_prediction_head(iou_token_out) # (1, 4) + iou_pred_hr = self.iou_prediction_head_hr(iou_token_out) # (1, 1) + + return masks, hr_masks, iou_pred, iou_pred_hr + + +class HiDecoder(nn.Module): + def __init__( + self, + *, + transformer_dim: int, + transformer: nn.Module, + num_multimask_outputs: int = 3, + activation: Type[nn.Module] = nn.GELU, + iou_head_depth: int = 3, + iou_head_hidden_dim: int = 256, + ) -> None: + """ + Predicts masks given an image and prompt embeddings, using a + tranformer architecture. 
+ + Arguments: + transformer_dim (int): the channel dimension of the transformer + transformer (nn.Module): the transformer used to predict masks + num_multimask_outputs (int): the number of masks to predict + when disambiguating masks + activation (nn.Module): the type of activation to use when + upscaling masks + iou_head_depth (int): the depth of the MLP used to predict + mask quality + iou_head_hidden_dim (int): the hidden dimension of the MLP + used to predict mask quality + """ + super().__init__() + self.transformer_dim = transformer_dim + self.transformer = transformer + + self.num_multimask_outputs = num_multimask_outputs + + self.iou_token = nn.Embedding(1, transformer_dim) + self.num_mask_tokens = num_multimask_outputs + 1 + self.mask_tokens = nn.Embedding(self.num_mask_tokens, transformer_dim) + + self.output_upscaling = nn.Sequential( + nn.ConvTranspose2d(transformer_dim, transformer_dim // 4, kernel_size=2, stride=2), + LayerNorm2d(transformer_dim // 4), + activation(), + nn.ConvTranspose2d(transformer_dim // 4, transformer_dim // 8, kernel_size=2, stride=2), + activation(), + ) + self.output_hypernetworks_mlps = nn.ModuleList( + [ + MLP(transformer_dim, transformer_dim, transformer_dim // 8, 3) + for i in range(self.num_mask_tokens) + ] + ) + self.iou_prediction_head = MLP( + transformer_dim, iou_head_hidden_dim, self.num_mask_tokens, iou_head_depth + ) + + self.word_mask_dc = nn.Sequential( + nn.Conv2d(transformer_dim // 8, transformer_dim // 16, kernel_size=1), + LayerNorm2d(transformer_dim // 16), + activation(), + ) + self.word_mask_refine = nn.Sequential( + nn.Conv2d(transformer_dim // 16, transformer_dim // 16, kernel_size=3, padding=1), + LayerNorm2d(transformer_dim // 16), + activation(), + nn.Conv2d(transformer_dim // 16, transformer_dim // 16, kernel_size=3, padding=1), + LayerNorm2d(transformer_dim // 16), + activation(), + nn.Conv2d(transformer_dim // 16, transformer_dim // 16, kernel_size=3, padding=1), + LayerNorm2d(transformer_dim // 16), + activation(), + nn.Conv2d(transformer_dim // 16, transformer_dim // 16, kernel_size=3, padding=1), + activation(), + ) + self.output_word_mlp = MLP(transformer_dim, transformer_dim, transformer_dim // 16, 3) + + def forward( + self, + image_embeddings: torch.Tensor, + image_pe: torch.Tensor, + sparse_prompt_embeddings: torch.Tensor, + dense_prompt_embeddings: torch.Tensor = None, + multimask_output: bool = False, + ): # -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: + """ + Predict masks given image and prompt embeddings. + """ + masks, iou_pred, word_masks = self.predict_masks( + image_embeddings=image_embeddings, + image_pe=image_pe, + sparse_prompt_embeddings=sparse_prompt_embeddings, + dense_prompt_embeddings=dense_prompt_embeddings, + ) + if multimask_output: + mask_slice = slice(1, None) + else: + mask_slice = slice(0, 1) + masks = masks[:, mask_slice, :, :] + iou_pred = iou_pred[:, mask_slice] + return masks, iou_pred, word_masks + + def predict_masks( + self, + image_embeddings: torch.Tensor, + image_pe: torch.Tensor, + sparse_prompt_embeddings: torch.Tensor, + dense_prompt_embeddings: torch.Tensor = None, + ): # -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: + """Predicts masks. 
See 'forward' for more details.""" + output_tokens = torch.cat( + [self.iou_token.weight, self.mask_tokens.weight], + dim=0 + ) + output_tokens = output_tokens.unsqueeze(0).expand(sparse_prompt_embeddings.size(0), -1, -1) + tokens = torch.cat((output_tokens, sparse_prompt_embeddings), dim=1) + + # Expand per-image data in batch direction to be per-mask + src = torch.repeat_interleave(image_embeddings, tokens.shape[0], dim=0) # (1, 256, 64, 64) + if dense_prompt_embeddings is not None: + src = src + dense_prompt_embeddings + pos_src = torch.repeat_interleave(image_pe, tokens.shape[0], dim=0) + b, c, h, w = src.shape + + # Run the transformer + hs, src = self.transformer(src, pos_src, tokens) + iou_token_out = hs[:, 0, :] # (1, 256) + mask_tokens_out = hs[:, 1: (1 + self.num_mask_tokens), :] + + # Upscale mask embeddings and predict masks using the mask tokens + src = src.transpose(1, 2).view(b, c, h, w) + upscaled_embedding = self.output_upscaling(src) + hyper_in_list: List[torch.Tensor] = [] + for i in range(self.num_mask_tokens): + hyper_in_list.append(self.output_hypernetworks_mlps[i](mask_tokens_out[:, i, :])) + hyper_in = torch.stack(hyper_in_list, dim=1) + b, c, h, w = upscaled_embedding.shape + masks = (hyper_in @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w) + iou_pred = self.iou_prediction_head(iou_token_out) + + upscaled_embedding = self.word_mask_dc(upscaled_embedding) + upscaled_embedding = F.interpolate(upscaled_embedding, (384, 384), mode="bilinear", align_corners=False) + upscaled_embedding = self.word_mask_refine(upscaled_embedding) + hyper_in_word = self.output_word_mlp(mask_tokens_out[:, 1:2, :]) + b, c, h, w = upscaled_embedding.shape + word_masks = (hyper_in_word @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w) + + return masks, iou_pred, word_masks + + +# Lightly adapted from +# https://github.com/facebookresearch/MaskFormer/blob/main/mask_former/modeling/transformer/transformer_predictor.py # noqa +class MLP(nn.Module): + def __init__( + self, + input_dim: int, + hidden_dim: int, + output_dim: int, + num_layers: int, + sigmoid_output: bool = False, + ) -> None: + super().__init__() + self.num_layers = num_layers + h = [hidden_dim] * (num_layers - 1) + self.layers = nn.ModuleList( + nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]) + ) + self.sigmoid_output = sigmoid_output + + def forward(self, x): + for i, layer in enumerate(self.layers): + x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x) + if self.sigmoid_output: + x = F.sigmoid(x) + return x diff --git a/hi_sam/modeling/modal_aligner.py b/hi_sam/modeling/modal_aligner.py new file mode 100644 index 0000000..6d84d05 --- /dev/null +++ b/hi_sam/modeling/modal_aligner.py @@ -0,0 +1,115 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. 
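Compared with the 1024×1024 stroke branch, the `HiDecoder` word branch above upsamples differently: a 1×1 convolution halves the channels, the 256×256 embedding is bilinearly resized to a fixed 384×384, a small conv stack refines it, and a single word mask is read out from the second mask token. An illustrative shape trace, with the refinement stack collapsed to one 3×3 conv and `transformer_dim=256` assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 256
upscaled = torch.randn(1, dim // 8, 256, 256)            # output of output_upscaling

word_mask_dc = nn.Conv2d(dim // 8, dim // 16, kernel_size=1)                  # 32 -> 16 channels
word_mask_refine = nn.Conv2d(dim // 16, dim // 16, kernel_size=3, padding=1)  # stands in for the conv stack

x = word_mask_dc(upscaled)                                               # (1, 16, 256, 256)
x = F.interpolate(x, (384, 384), mode="bilinear", align_corners=False)   # fixed word-mask resolution
x = word_mask_refine(x)                                                  # (1, 16, 384, 384)

hyper_in_word = torch.randn(1, 1, dim // 16)    # output_word_mlp applied to mask_tokens_out[:, 1:2, :]
b, c, h, w = x.shape
word_masks = (hyper_in_word @ x.view(b, c, h * w)).view(b, -1, h, w)
print(word_masks.shape)                          # torch.Size([1, 1, 384, 384])
```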
+ +import torch +import torch.nn as nn +import torch.nn.functional as F +from typing import Type, Tuple +from einops import rearrange +from .common import LayerNorm2d +import math + + +class CrossModalMLP(nn.Module): + def __init__( + self, + input_dim: int, + hidden_dim: int, + output_dim: int, + num_layers: int, + act: Type[nn.Module] = nn.GELU, + ) -> None: + super().__init__() + self.num_layers = num_layers + h = [hidden_dim] *(num_layers - 1) + self.layers = nn.ModuleList( + nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]) + ) + self.act = act() + + def forward(self, x): + for i, layer in enumerate(self.layers): + x = self.act(layer(x)) if i < self.num_layers - 1 else layer(x) + return x + + +class AttentionBlock(nn.Module): + def __init__(self, output_dim, nhead, dropout: float= 0.1): + super().__init__() + self.self_attn = nn.MultiheadAttention(output_dim, nhead, dropout=dropout, batch_first=True) + self.dropout1 = nn.Dropout(dropout) + self.norm1 = nn.LayerNorm(output_dim) + + self.cross_attn = nn.MultiheadAttention(output_dim, nhead, dropout=dropout, batch_first=True) + self.dropout2 = nn.Dropout(dropout) + self.norm2 = nn.LayerNorm(output_dim) + + self.ffn = nn.Sequential( + nn.Linear(output_dim, output_dim * 4), + nn.GELU(), + nn.Linear(output_dim * 4, output_dim) + ) + self.dropout3 = nn.Dropout(dropout) + + def with_pos_embed(self, tensor, pos): + return tensor if pos is None else tensor + pos + + def forward(self, sparse_embeddings, image_embeddings): + sparse_embeddings = sparse_embeddings + self.dropout1(self.self_attn( + query=sparse_embeddings, + key=sparse_embeddings, + value=sparse_embeddings)[0]) + sparse_embeddings = self.norm1(sparse_embeddings) + sparse_embeddings = sparse_embeddings + self.dropout2(self.cross_attn( + query=sparse_embeddings, + key=rearrange(image_embeddings, "b c h w -> b (h w) c"), + value=rearrange(image_embeddings, "b c h w -> b (h w) c"))[0]) + sparse_embeddings = sparse_embeddings + self.dropout3(self.ffn(self.norm2(sparse_embeddings))) + + return sparse_embeddings + + +class ModalAligner(nn.Module): + def __init__( + self, + transformer_dim: int, + act: Type[nn.Module] = nn.GELU, + nhead: int = 8, + dropout: float = 0.1, + attn_layers: int = 1, + prompt_len: int = 12, + ) -> None: + super().__init__() + + self.prompt_len = prompt_len + self.conv = nn.Sequential( + nn.Conv2d(transformer_dim, self.prompt_len, kernel_size=3, padding=1, bias=False), + act(), + nn.Conv2d(self.prompt_len, self.prompt_len, kernel_size=3, padding=1, bias=False), + act(), + nn.Conv2d(self.prompt_len, self.prompt_len, kernel_size=3, padding=1, bias=False), + act(), + nn.Conv2d(self.prompt_len, self.prompt_len, kernel_size=3, padding=1, bias=False), + ) + for m in self.modules(): + if isinstance(m, nn.Conv2d): + nn.init.kaiming_normal_(m.weight) + self.transformer_layers = nn.ModuleList([ + AttentionBlock(transformer_dim, nhead, dropout) for _ in range(attn_layers) + ]) + + def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor: + bs, c, h, w = image_embeddings.shape + spatial_attention = self.conv(image_embeddings) # bs, len, 64, 64 + spatial_attention = spatial_attention.reshape(bs, self.prompt_len, -1) + spatial_attention = F.sigmoid(spatial_attention)[..., None] # bs, len, h*w, 1 + feat = image_embeddings.reshape(bs, c, -1).permute(0, 2, 1)[:, None, ...] 
# bs, 1, h*w, c + sparse_embeddings = (feat * spatial_attention).mean(dim=2) # bs, len, c + + for layer in self.transformer_layers: + sparse_embeddings = layer(sparse_embeddings, image_embeddings) + + return sparse_embeddings diff --git a/hi_sam/modeling/predictor.py b/hi_sam/modeling/predictor.py new file mode 100644 index 0000000..67da971 --- /dev/null +++ b/hi_sam/modeling/predictor.py @@ -0,0 +1,260 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. + +import numpy as np +import torch + +from .hi_sam import HiSam + +from typing import Optional, Tuple +from hi_sam.data.transforms import ResizeLongestSide + + +class SamPredictor: + def __init__( + self, + sam_model: HiSam, + ) -> None: + """ + Uses SAM to calculate the image embedding for an image, and then + allow repeated, efficient mask prediction given prompts. + + Arguments: + sam_model (Sam): The model to use for mask prediction. + """ + super().__init__() + self.model = sam_model + self.transform = ResizeLongestSide(sam_model.image_encoder.img_size) + self.reset_image() + + def set_image( + self, + image: np.ndarray, + image_format: str = "RGB", + ) -> None: + """ + Calculates the image embeddings for the provided image, allowing + masks to be predicted with the 'predict' method. + + Arguments: + image (np.ndarray): The image for calculating masks. Expects an + image in HWC uint8 format, with pixel values in [0, 255]. + image_format (str): The color format of the image, in ['RGB', 'BGR']. + """ + assert image_format in [ + "RGB", + "BGR", + ], f"image_format must be in ['RGB', 'BGR'], is {image_format}." + # import pdb;pdb.set_trace() + if image_format != self.model.image_format: + image = image[..., ::-1] + + # Transform the image to the form expected by the model + # import pdb;pdb.set_trace() + input_image = self.transform.apply_image(image) + input_image_torch = torch.as_tensor(input_image, device=self.device) + input_image_torch = input_image_torch.permute(2, 0, 1).contiguous()[None, :, :, :] + + self.set_torch_image(input_image_torch, image.shape[:2]) + + @torch.no_grad() + def set_torch_image( + self, + transformed_image: torch.Tensor, + original_image_size: Tuple[int, ...], + ) -> None: + """ + Calculates the image embeddings for the provided image, allowing + masks to be predicted with the 'predict' method. Expects the input + image to be already transformed to the format expected by the model. + + Arguments: + transformed_image (torch.Tensor): The input image, with shape + 1x3xHxW, which has been transformed with ResizeLongestSide. + original_image_size (tuple(int, int)): The size of the image + before transformation, in (H, W) format. + """ + assert ( + len(transformed_image.shape) == 4 + and transformed_image.shape[1] == 3 + and max(*transformed_image.shape[2:]) == self.model.image_encoder.img_size + ), f"set_torch_image input must be BCHW with long side {self.model.image_encoder.img_size}." 
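The `ModalAligner` defined in `modal_aligner.py` above is what makes the automatic mode self-prompting: it converts the frozen image embedding into `prompt_len` sparse prompt tokens by predicting one spatial attention map per token and attention-pooling the image features. The sketch below keeps only that pooling step (the deeper conv stack and the self/cross-attention refinement layers are omitted); shapes assume the usual 64×64, 256-channel SAM embedding:

```python
import torch
import torch.nn as nn

bs, c, h, w = 1, 256, 64, 64          # assumed SAM image-embedding shape
prompt_len = 12                        # default number of learned prompt tokens

image_embeddings = torch.randn(bs, c, h, w)

# Stand-in for self.conv: one spatial attention map per prompt token.
to_attention = nn.Conv2d(c, prompt_len, kernel_size=3, padding=1, bias=False)

attn = torch.sigmoid(to_attention(image_embeddings).reshape(bs, prompt_len, -1))[..., None]  # (B, L, H*W, 1)
feat = image_embeddings.reshape(bs, c, -1).permute(0, 2, 1)[:, None, ...]                    # (B, 1, H*W, C)

# Attention-weighted average pooling: one 256-d token per prompt slot.
sparse_embeddings = (feat * attn).mean(dim=2)                                                 # (B, L, C)
print(sparse_embeddings.shape)        # torch.Size([1, 12, 256])
```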
+ self.reset_image() + + self.original_size = original_image_size + self.input_size = tuple(transformed_image.shape[-2:]) + input_image = self.model.preprocess(transformed_image) + self.features = self.model.image_encoder(input_image) + self.is_image_set = True + + def predict( + self, + multimask_output: bool = False, + return_logits: bool = False, + hier_det: bool = False, + point_coords: Optional[np.ndarray] = None, + point_labels: Optional[np.ndarray] = None, + ): + """ + Predict masks for the given input prompts, using the currently set image. + + Arguments: + point_coords (np.ndarray or None): A Nx2 array of point prompts to the + model. Each point is in (X,Y) in pixels. + point_labels (np.ndarray or None): A length N array of labels for the + point prompts. 1 indicates a foreground point and 0 indicates a + background point. + box (np.ndarray or None): A length 4 array given a box prompt to the + model, in XYXY format. + mask_input (np.ndarray): A low resolution mask input to the model, typically + coming from a previous prediction iteration. Has form 1xHxW, where + for SAM, H=W=256. + multimask_output (bool): If true, the model will return three masks. + For ambiguous input prompts (such as a single click), this will often + produce better masks than a single prediction. If only a single + mask is needed, the model's predicted quality score can be used + to select the best mask. For non-ambiguous prompts, such as multiple + input prompts, multimask_output=False can give better results. + return_logits (bool): If true, returns un-thresholded masks logits + instead of a binary mask. + """ + if not self.is_image_set: + raise RuntimeError("An image must be set with .set_image(...) before mask prediction.") + + if hier_det: + assert point_coords is not None and point_labels is not None, \ + "point_coords and point_labels must be supplied" + point_coords = self.transform.apply_coords(point_coords, self.original_size) + coords_torch = torch.as_tensor(point_coords, dtype=torch.float, device=self.device) + labels_torch = torch.as_tensor(point_labels, dtype=torch.int, device=self.device) + coords_torch, labels_torch = coords_torch[:, None, :], labels_torch[:, None] + masks, hr_masks, iou_predictions, iou_predictions_hr, hi_masks, hi_iou, word_masks = self.predict_torch( + multimask_output, + return_logits=return_logits, + hier_det=True, + point_coords=coords_torch, + point_labels=labels_torch + ) + masks_np = masks[0].detach().cpu().numpy() + hr_masks_np = hr_masks[0].detach().cpu().numpy() + iou_predictions_np = iou_predictions[0].detach().cpu().numpy() + iou_predictions_hr_np = iou_predictions_hr[0].detach().cpu().numpy() + hi_masks_np = hi_masks.detach().cpu().numpy() + hi_iou_np = hi_iou.detach().cpu().numpy() + word_masks_np = word_masks.detach().cpu().numpy() + return (masks_np, hr_masks_np, iou_predictions_np, iou_predictions_hr_np, + hi_masks_np, hi_iou_np, word_masks_np) + else: + masks, hr_masks, iou_predictions, iou_predictions_hr = self.predict_torch( + multimask_output, + return_logits=return_logits, + ) + masks_np = masks[0].detach().cpu().numpy() + hr_masks_np = hr_masks[0].detach().cpu().numpy() + iou_predictions_np = iou_predictions[0].detach().cpu().numpy() + iou_predictions_hr_np = iou_predictions_hr[0].detach().cpu().numpy() + return masks_np, hr_masks_np, iou_predictions_np, iou_predictions_hr_np + + @torch.no_grad() + def predict_torch( + self, + multimask_output: bool = False, + return_logits: bool = False, + hier_det: bool = False, + point_coords: Optional[np.ndarray] 
= None, + point_labels: Optional[np.ndarray] = None, + ): + """ + Predict masks for the given input prompts, using the currently set image. + Input prompts are batched torch tensors and are expected to already be + transformed to the input frame using ResizeLongestSide. + + Arguments: + point_coords (torch.Tensor or None): A BxNx2 array of point prompts to the + model. Each point is in (X,Y) in pixels. + point_labels (torch.Tensor or None): A BxN array of labels for the + point prompts. 1 indicates a foreground point and 0 indicates a + background point. + boxes (np.ndarray or None): A Bx4 array given a box prompt to the + model, in XYXY format. + mask_input (np.ndarray): A low resolution mask input to the model, typically + coming from a previous prediction iteration. Has form Bx1xHxW, where + for SAM, H=W=256. Masks returned by a previous iteration of the + predict method do not need further transformation. + multimask_output (bool): If true, the model will return three masks. + For ambiguous input prompts (such as a single click), this will often + produce better masks than a single prediction. If only a single + mask is needed, the model's predicted quality score can be used + to select the best mask. For non-ambiguous prompts, such as multiple + input prompts, multimask_output=False can give better results. + return_logits (bool): If true, returns un-thresholded masks logits + instead of a binary mask. + """ + if not self.is_image_set: + raise RuntimeError("An image must be set with .set_image(...) before mask prediction.") + + sparse_emb = self.model.modal_aligner(self.features) + + low_res_masks, high_res_masks, iou_pred, iou_pred_hr = self.model.mask_decoder( + image_embeddings=self.features, + image_pe=self.model.prompt_encoder.get_dense_pe(), + sparse_prompt_embeddings=sparse_emb, + multimask_output=multimask_output, + ) + + # Upscale the masks to the original image resolution + masks = self.model.postprocess_masks(low_res_masks, self.input_size, self.original_size) + hr_masks = self.model.postprocess_masks(high_res_masks, self.input_size, self.original_size) + + if not hier_det: + if not return_logits: + masks = masks > self.model.mask_threshold + hr_masks = hr_masks > self.model.mask_threshold + return masks, hr_masks, iou_pred, iou_pred_hr + + else: + points = (point_coords, point_labels) + point_embeddings, _ = self.model.prompt_encoder( + points=points, boxes=None, masks=None + ) + hi_masks, iou_pred_hi, word_masks = self.model.hi_decoder( + image_embeddings=self.features, + image_pe=self.model.prompt_encoder.get_dense_pe(), + sparse_prompt_embeddings=point_embeddings, + multimask_output=True + ) + hi_masks = self.model.postprocess_masks(hi_masks[:, 1:, :, :], self.input_size, self.original_size) + word_masks = self.model.postprocess_masks(word_masks, self.input_size, self.original_size) + if not return_logits: + hi_masks = hi_masks > self.model.mask_threshold + word_masks = word_masks > self.model.mask_threshold + return masks, hr_masks, iou_pred, iou_pred_hr, hi_masks, iou_pred_hi, word_masks + + def get_image_embedding(self) -> torch.Tensor: + """ + Returns the image embeddings for the currently set image, with + shape 1xCxHxW, where C is the embedding dimension and (H,W) are + the embedding spatial dimension of SAM (typically C=256, H=W=64). + """ + if not self.is_image_set: + raise RuntimeError( + "An image must be set with .set_image(...) to generate an embedding." + ) + assert self.features is not None, "Features must exist if an image has been set." 
+ return self.features + + @property + def device(self) -> torch.device: + return self.model.device + + def reset_image(self) -> None: + """Resets the currently set image.""" + self.is_image_set = False + self.features = None + self.orig_h = None + self.orig_w = None + self.input_h = None + self.input_w = None diff --git a/hi_sam/modeling/prompt_encoder.py b/hi_sam/modeling/prompt_encoder.py new file mode 100644 index 0000000..fa96422 --- /dev/null +++ b/hi_sam/modeling/prompt_encoder.py @@ -0,0 +1,215 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. + +import numpy as np +import torch +from torch import nn + +from typing import Any, Optional, Tuple, Type + +from .common import LayerNorm2d + + +class PromptEncoder(nn.Module): + def __init__( + self, + embed_dim: int, + image_embedding_size: Tuple[int, int], + input_image_size: Tuple[int, int], + mask_in_chans: int, + activation: Type[nn.Module] = nn.GELU, + ) -> None: + """ + Encodes prompts for input to SAM's mask decoder. + + Arguments: + embed_dim (int): The prompts' embedding dimension + image_embedding_size (tuple(int, int)): The spatial size of the + image embedding, as (H, W). + input_image_size (int): The padded size of the image as input + to the image encoder, as (H, W). + mask_in_chans (int): The number of hidden channels used for + encoding input masks. + activation (nn.Module): The activation to use when encoding + input masks. + """ + super().__init__() + self.embed_dim = embed_dim + self.input_image_size = input_image_size + self.image_embedding_size = image_embedding_size + self.pe_layer = PositionEmbeddingRandom(embed_dim // 2) + + self.num_point_embeddings: int = 4 # pos/neg point + 2 box corners + point_embeddings = [nn.Embedding(1, embed_dim) for i in range(self.num_point_embeddings)] + self.point_embeddings = nn.ModuleList(point_embeddings) + self.not_a_point_embed = nn.Embedding(1, embed_dim) + + self.mask_input_size = (4 * image_embedding_size[0], 4 * image_embedding_size[1]) + self.mask_downscaling = nn.Sequential( + nn.Conv2d(1, mask_in_chans // 4, kernel_size=2, stride=2), + LayerNorm2d(mask_in_chans // 4), + activation(), + nn.Conv2d(mask_in_chans // 4, mask_in_chans, kernel_size=2, stride=2), + LayerNorm2d(mask_in_chans), + activation(), + nn.Conv2d(mask_in_chans, embed_dim, kernel_size=1), + ) + self.no_mask_embed = nn.Embedding(1, embed_dim) + + def get_dense_pe(self) -> torch.Tensor: + """ + Returns the positional encoding used to encode point prompts, + applied to a dense set of points the shape of the image encoding. 
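Stepping back to the `SamPredictor` defined in `hi_sam/modeling/predictor.py` above, the intended workflow mirrors the original SAM predictor: set an image once, then query it, optionally adding a point prompt to obtain the hierarchical (word / text-line / paragraph) outputs. The sketch below is illustrative only: `build_hi_sam` is a hypothetical stand-in for however the repo's demo scripts construct a `HiSam` model from a checkpoint, and the image path is a placeholder:

```python
import cv2
import numpy as np

from hi_sam.modeling.predictor import SamPredictor

# Hypothetical helper: build a HiSam model and load weights (see the repo's demo/eval scripts).
model = build_hi_sam(checkpoint="pretrained_checkpoint/hi_sam_h.pth")
predictor = SamPredictor(model)

image = cv2.imread("example.jpg")              # HWC uint8, BGR from OpenCV
predictor.set_image(image, image_format="BGR")

# Automatic mode: self-prompted text-stroke masks from both decoder branches,
# resized back to the original image resolution.
masks, hr_masks, iou, iou_hr = predictor.predict(multimask_output=False)

# Interactive mode: one foreground click also yields word / line / paragraph masks.
points = np.array([[500.0, 300.0]])            # (N, 2), (x, y) in original-image pixels
labels = np.ones(len(points), dtype=np.int32)  # 1 = foreground
(masks, hr_masks, iou, iou_hr,
 hi_masks, hi_iou, word_masks) = predictor.predict(
    multimask_output=True, hier_det=True,
    point_coords=points, point_labels=labels)
```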
+ + Returns: + torch.Tensor: Positional encoding with shape + 1x(embed_dim)x(embedding_h)x(embedding_w) + """ + return self.pe_layer(self.image_embedding_size).unsqueeze(0) + + def _embed_points( + self, + points: torch.Tensor, + labels: torch.Tensor, + pad: bool, + ) -> torch.Tensor: + """Embeds point prompts.""" + points = points + 0.5 # Shift to center of pixel + if pad: + padding_point = torch.zeros((points.shape[0], 1, 2), device=points.device) + padding_label = -torch.ones((labels.shape[0], 1), device=labels.device) + points = torch.cat([points, padding_point], dim=1) + labels = torch.cat([labels, padding_label], dim=1) + point_embedding = self.pe_layer.forward_with_coords(points, self.input_image_size) + point_embedding[labels == -1] = 0.0 + point_embedding[labels == -1] += self.not_a_point_embed.weight + point_embedding[labels == 0] += self.point_embeddings[0].weight + point_embedding[labels == 1] += self.point_embeddings[1].weight + return point_embedding + + def _embed_boxes(self, boxes: torch.Tensor) -> torch.Tensor: + """Embeds box prompts.""" + boxes = boxes + 0.5 # Shift to center of pixel + coords = boxes.reshape(-1, 2, 2) + corner_embedding = self.pe_layer.forward_with_coords(coords, self.input_image_size) + corner_embedding[:, 0, :] += self.point_embeddings[2].weight + corner_embedding[:, 1, :] += self.point_embeddings[3].weight + return corner_embedding + + def _embed_masks(self, masks: torch.Tensor) -> torch.Tensor: + """Embeds mask inputs.""" + mask_embedding = self.mask_downscaling(masks) + return mask_embedding + + def _get_batch_size( + self, + points: Optional[Tuple[torch.Tensor, torch.Tensor]], + boxes: Optional[torch.Tensor], + masks: Optional[torch.Tensor], + ) -> int: + """ + Gets the batch size of the output given the batch size of the input prompts. + """ + if points is not None: + return points[0].shape[0] + elif boxes is not None: + return boxes.shape[0] + elif masks is not None: + return masks.shape[0] + else: + return 1 + + def _get_device(self) -> torch.device: + return self.point_embeddings[0].weight.device + + def forward( + self, + points: Optional[Tuple[torch.Tensor, torch.Tensor]], + boxes: Optional[torch.Tensor], + masks: Optional[torch.Tensor], + ) -> Tuple[torch.Tensor, torch.Tensor]: + """ + Embeds different types of prompts, returning both sparse and dense + embeddings. + + Arguments: + points (tuple(torch.Tensor, torch.Tensor) or none): point coordinates + and labels to embed. + boxes (torch.Tensor or none): boxes to embed + masks (torch.Tensor or none): masks to embed + + Returns: + torch.Tensor: sparse embeddings for the points and boxes, with shape + BxNx(embed_dim), where N is determined by the number of input points + and boxes. 
+ torch.Tensor: dense embeddings for the masks, in the shape + Bx(embed_dim)x(embed_H)x(embed_W) + """ + with torch.no_grad(): + bs = self._get_batch_size(points, boxes, masks) + sparse_embeddings = torch.empty((bs, 0, self.embed_dim), device=self._get_device()) + if points is not None: + coords, labels = points + point_embeddings = self._embed_points(coords, labels, pad=(boxes is None)) + sparse_embeddings = torch.cat([sparse_embeddings, point_embeddings], dim=1) + if boxes is not None: + box_embeddings = self._embed_boxes(boxes) + sparse_embeddings = torch.cat([sparse_embeddings, box_embeddings], dim=1) + + if masks is not None: + dense_embeddings = self._embed_masks(masks) + else: + dense_embeddings = self.no_mask_embed.weight.reshape(1, -1, 1, 1).expand( + bs, -1, self.image_embedding_size[0], self.image_embedding_size[1] + ) + + return sparse_embeddings, dense_embeddings + + +class PositionEmbeddingRandom(nn.Module): + """ + Positional encoding using random spatial frequencies. + """ + + def __init__(self, num_pos_feats: int = 64, scale: Optional[float] = None) -> None: + super().__init__() + if scale is None or scale <= 0.0: + scale = 1.0 + self.register_buffer( + "positional_encoding_gaussian_matrix", + scale * torch.randn((2, num_pos_feats)), + ) + + def _pe_encoding(self, coords: torch.Tensor) -> torch.Tensor: + """Positionally encode points that are normalized to [0,1].""" + # assuming coords are in [0, 1]^2 square and have d_1 x ... x d_n x 2 shape + coords = 2 * coords - 1 + coords = coords @ self.positional_encoding_gaussian_matrix + coords = 2 * np.pi * coords + # outputs d_1 x ... x d_n x C shape + return torch.cat([torch.sin(coords), torch.cos(coords)], dim=-1) + + def forward(self, size: Tuple[int, int]) -> torch.Tensor: + """Generate positional encoding for a grid of the specified size.""" + h, w = size + device: Any = self.positional_encoding_gaussian_matrix.device + grid = torch.ones((h, w), device=device, dtype=torch.float32) + y_embed = grid.cumsum(dim=0) - 0.5 + x_embed = grid.cumsum(dim=1) - 0.5 + y_embed = y_embed / h + x_embed = x_embed / w + + pe = self._pe_encoding(torch.stack([x_embed, y_embed], dim=-1)) + return pe.permute(2, 0, 1) # C x H x W + + def forward_with_coords( + self, coords_input: torch.Tensor, image_size: Tuple[int, int] + ) -> torch.Tensor: + """Positionally encode points that are not normalized to [0,1].""" + coords = coords_input.clone() + coords[:, :, 0] = coords[:, :, 0] / image_size[1] + coords[:, :, 1] = coords[:, :, 1] / image_size[0] + return self._pe_encoding(coords.to(torch.float)) # B x N x C diff --git a/hi_sam/modeling/transformer.py b/hi_sam/modeling/transformer.py new file mode 100644 index 0000000..f1a2812 --- /dev/null +++ b/hi_sam/modeling/transformer.py @@ -0,0 +1,240 @@ +# Copyright (c) Meta Platforms, Inc. and affiliates. +# All rights reserved. + +# This source code is licensed under the license found in the +# LICENSE file in the root directory of this source tree. + +import torch +from torch import Tensor, nn + +import math +from typing import Tuple, Type + +from .common import MLPBlock + + +class TwoWayTransformer(nn.Module): + def __init__( + self, + depth: int, + embedding_dim: int, + num_heads: int, + mlp_dim: int, + activation: Type[nn.Module] = nn.ReLU, + attention_downsample_rate: int = 2, + ) -> None: + """ + A transformer decoder that attends to an input image using + queries whose positional embedding is supplied. 
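The `PositionEmbeddingRandom` module above is a random-Fourier-feature encoding: coordinates normalised to [0, 1] are projected through a fixed Gaussian matrix and passed through sin and cos. A compact, self-contained illustration; a tiny `num_pos_feats=4` is assumed just to keep the tensors small, whereas the model uses `embed_dim // 2`:

```python
import math
import torch

num_pos_feats = 4
gauss = torch.randn(2, num_pos_feats)            # analogue of positional_encoding_gaussian_matrix


def pe(coords: torch.Tensor) -> torch.Tensor:    # coords in [0, 1], shape (..., 2)
    coords = 2 * coords - 1                      # -> [-1, 1]
    coords = coords @ gauss                      # random projection
    coords = 2 * math.pi * coords
    return torch.cat([torch.sin(coords), torch.cos(coords)], dim=-1)   # (..., 2 * num_pos_feats)


h, w = 3, 3
ys = (torch.arange(h, dtype=torch.float32) + 0.5) / h   # pixel-centre coords, like the cumsum trick
xs = (torch.arange(w, dtype=torch.float32) + 0.5) / w
grid = torch.stack([xs[None, :].expand(h, w), ys[:, None].expand(h, w)], dim=-1)  # (h, w, 2) as (x, y)
print(pe(grid).shape)    # torch.Size([3, 3, 8]); the module then permutes this to C x H x W
```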
+ + Args: + depth (int): number of layers in the transformer + embedding_dim (int): the channel dimension for the input embeddings + num_heads (int): the number of heads for multihead attention. Must + divide embedding_dim + mlp_dim (int): the channel dimension internal to the MLP block + activation (nn.Module): the activation to use in the MLP block + """ + super().__init__() + self.depth = depth + self.embedding_dim = embedding_dim + self.num_heads = num_heads + self.mlp_dim = mlp_dim + self.layers = nn.ModuleList() + + for i in range(depth): + self.layers.append( + TwoWayAttentionBlock( + embedding_dim=embedding_dim, + num_heads=num_heads, + mlp_dim=mlp_dim, + activation=activation, + attention_downsample_rate=attention_downsample_rate, + skip_first_layer_pe=(i == 0), + ) + ) + + self.final_attn_token_to_image = Attention( + embedding_dim, num_heads, downsample_rate=attention_downsample_rate + ) + self.norm_final_attn = nn.LayerNorm(embedding_dim) + + def forward( + self, + image_embedding: Tensor, + image_pe: Tensor, + point_embedding: Tensor, + ) -> Tuple[Tensor, Tensor]: + """ + Args: + image_embedding (torch.Tensor): image to attend to. Should be shape + B x embedding_dim x h x w for any h and w. + image_pe (torch.Tensor): the positional encoding to add to the image. Must + have the same shape as image_embedding. + point_embedding (torch.Tensor): the embedding to add to the query points. + Must have shape B x N_points x embedding_dim for any N_points. + + Returns: + torch.Tensor: the processed point_embedding + torch.Tensor: the processed image_embedding + """ + # BxCxHxW -> BxHWxC == B x N_image_tokens x C + bs, c, h, w = image_embedding.shape + image_embedding = image_embedding.flatten(2).permute(0, 2, 1) + image_pe = image_pe.flatten(2).permute(0, 2, 1) + + # Prepare queries + queries = point_embedding + keys = image_embedding + + # Apply transformer blocks and final layernorm + for layer in self.layers: + queries, keys = layer( + queries=queries, + keys=keys, + query_pe=point_embedding, + key_pe=image_pe, + ) + + # Apply the final attenion layer from the points to the image + q = queries + point_embedding + k = keys + image_pe + attn_out = self.final_attn_token_to_image(q=q, k=k, v=keys) + queries = queries + attn_out + queries = self.norm_final_attn(queries) + + return queries, keys + + +class TwoWayAttentionBlock(nn.Module): + def __init__( + self, + embedding_dim: int, + num_heads: int, + mlp_dim: int = 2048, + activation: Type[nn.Module] = nn.ReLU, + attention_downsample_rate: int = 2, + skip_first_layer_pe: bool = False, + ) -> None: + """ + A transformer block with four layers: (1) self-attention of sparse + inputs, (2) cross attention of sparse inputs to dense inputs, (3) mlp + block on sparse inputs, and (4) cross attention of dense inputs to sparse + inputs. 
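For orientation, the `TwoWayTransformer` defined above is invoked with a spatial image embedding, its positional encoding, and a small set of tokens; it returns the refined tokens and the flattened, refined image features. A usage sketch with SAM-style default sizes (the specific numbers are assumptions, not values read from a config):

```python
import torch

from hi_sam.modeling.transformer import TwoWayTransformer

transformer = TwoWayTransformer(depth=2, embedding_dim=256, num_heads=8, mlp_dim=2048)

image_embedding = torch.randn(1, 256, 64, 64)   # B x C x H x W
image_pe = torch.randn(1, 256, 64, 64)          # same shape as the embedding
tokens = torch.randn(1, 6, 256)                 # e.g. IoU token + mask tokens (+ prompt tokens)

queries, keys = transformer(image_embedding, image_pe, tokens)
print(queries.shape, keys.shape)                # torch.Size([1, 6, 256]) torch.Size([1, 4096, 256])
```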
+ + Arguments: + embedding_dim (int): the channel dimension of the embeddings + num_heads (int): the number of heads in the attention layers + mlp_dim (int): the hidden dimension of the mlp block + activation (nn.Module): the activation of the mlp block + skip_first_layer_pe (bool): skip the PE on the first layer + """ + super().__init__() + self.self_attn = Attention(embedding_dim, num_heads) + self.norm1 = nn.LayerNorm(embedding_dim) + + self.cross_attn_token_to_image = Attention( + embedding_dim, num_heads, downsample_rate=attention_downsample_rate + ) + self.norm2 = nn.LayerNorm(embedding_dim) + + self.mlp = MLPBlock(embedding_dim, mlp_dim, activation) + self.norm3 = nn.LayerNorm(embedding_dim) + + self.norm4 = nn.LayerNorm(embedding_dim) + self.cross_attn_image_to_token = Attention( + embedding_dim, num_heads, downsample_rate=attention_downsample_rate + ) + + self.skip_first_layer_pe = skip_first_layer_pe + + def forward( + self, queries: Tensor, keys: Tensor, query_pe: Tensor, key_pe: Tensor + ) -> Tuple[Tensor, Tensor]: + # Self attention block + if self.skip_first_layer_pe: + queries = self.self_attn(q=queries, k=queries, v=queries) + else: + q = queries + query_pe + attn_out = self.self_attn(q=q, k=q, v=queries) + queries = queries + attn_out + queries = self.norm1(queries) + + # Cross attention block, tokens attending to image embedding + q = queries + query_pe + k = keys + key_pe + attn_out = self.cross_attn_token_to_image(q=q, k=k, v=keys) + queries = queries + attn_out + queries = self.norm2(queries) + + # MLP block + mlp_out = self.mlp(queries) + queries = queries + mlp_out + queries = self.norm3(queries) + + # Cross attention block, image embedding attending to tokens + q = queries + query_pe + k = keys + key_pe + attn_out = self.cross_attn_image_to_token(q=k, k=q, v=queries) + keys = keys + attn_out + keys = self.norm4(keys) + + return queries, keys + + +class Attention(nn.Module): + """ + An attention layer that allows for downscaling the size of the embedding + after projection to queries, keys, and values. + """ + + def __init__( + self, + embedding_dim: int, + num_heads: int, + downsample_rate: int = 1, + ) -> None: + super().__init__() + self.embedding_dim = embedding_dim + self.internal_dim = embedding_dim // downsample_rate + self.num_heads = num_heads + assert self.internal_dim % num_heads == 0, "num_heads must divide embedding_dim." 
+ + self.q_proj = nn.Linear(embedding_dim, self.internal_dim) + self.k_proj = nn.Linear(embedding_dim, self.internal_dim) + self.v_proj = nn.Linear(embedding_dim, self.internal_dim) + self.out_proj = nn.Linear(self.internal_dim, embedding_dim) + + def _separate_heads(self, x: Tensor, num_heads: int) -> Tensor: + b, n, c = x.shape + x = x.reshape(b, n, num_heads, c // num_heads) + return x.transpose(1, 2) # B x N_heads x N_tokens x C_per_head + + def _recombine_heads(self, x: Tensor) -> Tensor: + b, n_heads, n_tokens, c_per_head = x.shape + x = x.transpose(1, 2) + return x.reshape(b, n_tokens, n_heads * c_per_head) # B x N_tokens x C + + def forward(self, q: Tensor, k: Tensor, v: Tensor) -> Tensor: + # Input projections + q = self.q_proj(q) + k = self.k_proj(k) + v = self.v_proj(v) + + # Separate into heads + q = self._separate_heads(q, self.num_heads) + k = self._separate_heads(k, self.num_heads) + v = self._separate_heads(v, self.num_heads) + + # Attention + _, _, _, c_per_head = q.shape + attn = q @ k.permute(0, 1, 3, 2) # B x N_heads x N_tokens x N_tokens + attn = attn / math.sqrt(c_per_head) + attn = torch.softmax(attn, dim=-1) + + # Get output + out = attn @ v + out = self._recombine_heads(out) + out = self.out_proj(out) + + return out diff --git a/hiertext_eval/collect_results.py b/hiertext_eval/collect_results.py new file mode 100644 index 0000000..4aa81f6 --- /dev/null +++ b/hiertext_eval/collect_results.py @@ -0,0 +1,25 @@ +from glob import glob +from tqdm import tqdm +import sys, os +import json +import argparse + + +def get_args_parser(): + parser = argparse.ArgumentParser('Hi-SAM', add_help=False) + parser.add_argument("--saved_name", type=str, required=True, default='res_1500pts.jsonl', + help="Saved name.") + return parser.parse_args() + + +args = get_args_parser() +jsonl_list = glob('./res_per_img/*.jsonl') +final_results = {"annotations": []} +for jsonl_name in tqdm(jsonl_list): + with open(jsonl_name, 'r') as fr: + res = json.load(fr) + fr.close() + final_results['annotations'].append(res) + +with open(args.saved_name, 'w') as fw: + json.dump(final_results, fw, ensure_ascii=False) diff --git a/hiertext_eval/eval.py b/hiertext_eval/eval.py new file mode 100644 index 0000000..c12b2c6 --- /dev/null +++ b/hiertext_eval/eval.py @@ -0,0 +1,355 @@ +"""PQ metrics for HierText.""" + +import json +import time +from typing import Sequence + +from absl import app +from absl import flags +import apache_beam as beam +from apache_beam.options.pipeline_options import PipelineOptions +import cv2 +import numpy as np + +from evaluator import evaluator + + +_GT = flags.DEFINE_string('gt', None, 'Groundtruth JSON file.') +_RESULT = flags.DEFINE_string('result', None, 'Prediction JSON file.') +_OUTPUT = flags.DEFINE_string( + 'output', None, 'The output text file containing evaluation results.') + +_EVAL_LINES = flags.DEFINE_bool( + 'eval_lines', False, 'Whether to perform line-level evaluation.') +_EVAL_PARAGRAPHS = flags.DEFINE_bool( + 'eval_paragraphs', False, 'Whether to perform paragraph-level evaluation.') +_E2E = flags.DEFINE_bool( + 'e2e', False, 'Whether to perform end-to-end evaluation.') +_MASK_STRIDE = flags.DEFINE_integer( + 'mask_stride', 1, + ('Downsample the masks for faster but less accuracte results. ' + 'Note that when reporting results for a paper, users should use ' + 'mask_stirde=1 i.e. no downsampling!')) + +_NUM_WORKERS = flags.DEFINE_integer( + 'num_workers', 8, 'The number of workers. 
Set to 0 to use all the cores.') + + +def load_annotations(gt_path: str, result_path: str): + """Loading results and ground truths, and then pairing them.""" + outputs = [] + index = {} + + print('Loading ground-truth annotations.') + gt_annos = json.load(open(gt_path, encoding='utf-8'))['annotations'] + for anno in gt_annos: + outputs.append(anno) + index[anno['image_id']] = anno + + print('Loading predictions.') + results = json.load(open(result_path, encoding='utf-8'))['annotations'] + for anno in results: + gt_dict = index[anno['image_id']] + gt_dict['output_paragraphs'] = anno['paragraphs'] + + print('Finished loading.') + return outputs + + +def draw_mask(vertices: np.ndarray, w: int, h: int, s: int = 1): + mask = np.zeros((h, w), dtype=np.float32) + return cv2.fillPoly(mask, [vertices], [1.])[::s, ::s] > 0 + + +def parse_annotation_dict(anno, eval_lines, eval_paragraphs, mask_stride): + """Parse the data for the evaluator.""" + t_start = time.time() + image_id = anno['image_id'] + + w = anno['image_width'] + h = anno['image_height'] + + gt_word_polygons = [] + gt_word_weights = [] + gt_word_texts = [] + if eval_lines: + gt_line_masks = [] + gt_line_boxes = [] + gt_line_weights = [] + gt_line_texts = [] + if eval_paragraphs: + gt_paragraph_masks = [] + gt_paragraph_boxes = [] + gt_paragraph_weights = [] + + for paragraph in anno['paragraphs']: + gt_paragraph_mask = [] + gt_paragraph_box = [] + for line in paragraph['lines']: + gt_line_mask = [] + gt_line_box = [] + for word in line['words']: + vertices = np.array(word['vertices']) + gt_word_polygons.append(vertices) + gt_word_weights.append(1.0 if word['legible'] else 0.0) + gt_word_texts.append(word['text']) + if eval_lines or eval_paragraphs: + gt_word_mask = draw_mask(vertices, w, h, mask_stride) + if eval_lines: + gt_line_mask.append(gt_word_mask) + gt_line_box.append(vertices) + if eval_paragraphs: + gt_paragraph_mask.append(gt_word_mask) + gt_paragraph_box.append(vertices) + if eval_lines: + if not gt_line_mask: + gt_line_mask = [ + draw_mask(np.array(line['vertices']), w, h, mask_stride)] + gt_line_box.append(np.array([[0, 0], [w, 0], [w, h], [0, h]])) + gt_line_masks.append( + np.any(np.stack(gt_line_mask, axis=0), axis=0).astype(np.float32)) + gt_line_box = np.concatenate(gt_line_box, axis=0) + gt_line_boxes.append( + [np.min(gt_line_box[:, 1]), + np.min(gt_line_box[:, 0]), + np.max(gt_line_box[:, 1]), + np.max(gt_line_box[:, 0])]) + gt_line_weights.append(1.0 if line['legible'] else 0.0) + gt_line_texts.append(line['text']) + if eval_paragraphs: + if not gt_paragraph_mask or not paragraph['legible']: + gt_paragraph_mask = [draw_mask( + np.array(paragraph['vertices']), w, h, mask_stride)] + gt_paragraph_box.append(np.array([[0, 0], [w, 0], [w, h], [0, h]])) + gt_paragraph_masks.append( + np.any(np.stack(gt_paragraph_mask, axis=0), axis=0).astype(np.float32)) + gt_paragraph_box = np.concatenate(gt_paragraph_box, axis=0) + gt_paragraph_boxes.append([ + np.min(gt_paragraph_box[:, 1]), + np.min(gt_paragraph_box[:, 0]), + np.max(gt_paragraph_box[:, 1]), + np.max(gt_paragraph_box[:, 0]) + ]) + gt_paragraph_weights.append(1.0 if paragraph['legible'] else 0.0) + + word_polygons = [] + word_texts = [] + if eval_lines: + line_masks = [] + line_boxes = [] + line_texts = [] + if eval_paragraphs: + paragraph_masks = [] + paragraph_boxes = [] + + for paragraph in anno['output_paragraphs']: + paragraph_mask = [] + paragraph_box = [] + for line in paragraph['lines']: + line_mask = [] + line_box = [] + for word in line['words']: + vertices = 
np.array(word['vertices']) + word_polygons.append(vertices) + word_texts.append(word['text']) + if eval_lines or eval_paragraphs: + word_mask = draw_mask(vertices, w, h, mask_stride) + if eval_lines: + line_mask.append(word_mask) + line_box.append(vertices) + if eval_paragraphs: + paragraph_mask.append(word_mask) + paragraph_box.append(vertices) + if eval_lines: + if not line_mask: + raise ValueError('Line does not contain words: %s' % line) + line_masks.append( + np.any(np.stack(line_mask, axis=0), axis=0).astype(np.float32)) + line_box = np.concatenate(line_box, axis=0) + line_boxes.append( + [np.min(line_box[:, 1]), + np.min(line_box[:, 0]), + np.max(line_box[:, 1]), + np.max(line_box[:, 0])]) + line_texts.append(line['text']) + if eval_paragraphs: + if not paragraph_mask: + raise ValueError('Paragraph does not contain lines: %s' % + paragraph) + paragraph_masks.append( + np.any(np.stack(paragraph_mask, axis=0), axis=0).astype(np.float32)) + paragraph_box = np.concatenate(paragraph_box, axis=0) + paragraph_boxes.append([ + np.min(paragraph_box[:, 1]), + np.min(paragraph_box[:, 0]), + np.max(paragraph_box[:, 1]), + np.max(paragraph_box[:, 0]) + ]) + + num_gt_words = len(gt_word_polygons) + num_pred_words = len(word_polygons) + + word_dict = { + 'gt_weights': (np.array(gt_word_weights) if num_gt_words else np.zeros( + (0,), np.float32)), + 'gt_boxes': + gt_word_polygons, + 'gt_texts': (np.array(gt_word_texts) if num_gt_words else np.zeros( + (0,), str)), + 'detection_boxes': + word_polygons, + 'pred_texts': (np.array(word_texts) if num_pred_words else np.zeros( + (0,), str)), + } + + line_dict = {} + if eval_lines: + num_gt_lines = len(gt_line_masks) + num_pred_lines = len(line_masks) + line_dict = { + 'gt_weights': (np.array(gt_line_weights) if num_gt_lines else np.zeros( + (0,), np.float32)), + 'gt_masks': (np.stack(gt_line_masks, 0) if num_gt_lines else np.zeros( + (0, (h + 1) // 2, (w + 1) // 2), np.float32)), + 'gt_boxes': (np.array(gt_line_boxes, np.float32) + if num_gt_lines else np.zeros((0, 4), np.float32)), + 'gt_texts': (np.array(gt_line_texts) if num_gt_lines else np.zeros( + (0,), str)), + 'detection_boxes': + (np.array(line_boxes, np.float32) + if num_pred_lines else np.zeros((0, 4), np.float32)), + 'detection_masks': + (np.stack(line_masks, 0) if num_pred_lines else np.zeros( + (0, (h + 1) // 2, (w + 1) // 2), np.float32)), + 'pred_texts': (np.array(line_texts) if num_pred_lines else np.zeros( + (0,), str)), + } + + paragraph_dict = {} + if eval_paragraphs: + num_gt_paragraphs = len(gt_paragraph_masks) + num_pred_paragraphs = len(paragraph_masks) + paragraph_dict = { + 'gt_weights': + (np.array(gt_paragraph_weights) if num_gt_paragraphs else np.zeros( + (0,), np.float32)), + 'gt_masks': + (np.stack(gt_paragraph_masks, 0) if num_gt_paragraphs else np.zeros( + (0, (h + 1) // 2, (w + 1) // 2), np.float32)), + 'gt_boxes': + (np.array(gt_paragraph_boxes, np.float32) + if num_gt_paragraphs else np.zeros((0, 4), np.float32)), + 'detection_boxes': + (np.array(paragraph_boxes, np.float32) + if num_pred_paragraphs else np.zeros((0, 4), np.float32)), + 'detection_masks': + (np.stack(paragraph_masks, 0) if num_pred_paragraphs else np.zeros( + (0, (h + 1) // 2, (w + 1) // 2), np.float32)), + } + t_end = time.time() + print(f'Parsing {image_id} takes {t_end - t_start} secs') + return image_id, word_dict, line_dict, paragraph_dict + + +def evaluate_one_image(image_id, word_dict, line_dict, paragraph_dict, + word_evaluator, line_evaluator, paragraph_evaluator): + t_start = time.time() + 
word_stats = word_evaluator.evaluate_one_image(word_dict) + line_stats = {} + if line_evaluator is not None: + line_stats = line_evaluator.evaluate_one_image(line_dict) + paragraph_stats = {} + if paragraph_evaluator is not None: + paragraph_stats = paragraph_evaluator.evaluate_one_image(paragraph_dict) + t_end = time.time() + print(f'Evaluating {image_id} takes {t_end - t_start} secs') + return [word_stats, line_stats, paragraph_stats] + + +def compute_eval_metrics(word_sum, line_sum, paragraph_sum, + word_evaluator, line_evaluator, paragraph_evaluator): + word_metrics = word_evaluator.evaluate(word_sum) + if line_evaluator is not None: + line_metrics = line_evaluator.evaluate(line_sum) + else: + line_metrics = {} + if paragraph_evaluator is not None: + paragraph_metrics = paragraph_evaluator.evaluate(paragraph_sum) + else: + paragraph_metrics = {} + return (['word', word_metrics], ['line', line_metrics], + ['paragraph', paragraph_metrics]) + + +def dict_add(input_tuples): + """Aggregating dictionary accumulators by summing up respective fields.""" + new_dicts = [{}, {}, {}] + for dicts in input_tuples: + for i, ent_dict in enumerate(dicts): + for k, v in ent_dict.items(): + new_dicts[i][k] = new_dicts[i].get(k, 0) + v + return new_dicts + + +def metric_format(metric_groups): + """Formatting the metrics in dict for printing.""" + outputs = [] + for ent, metrics in metric_groups: + if metrics: + output = '========= ' + ent + ' =========\n' + kv_pairs = list(metrics.items()) + kv_pairs.sort() + output += '\n'.join(f'{k}: {v}' for k, v in kv_pairs) + outputs.append(output) + return '\n\n'.join(outputs) + + +def main(argv: Sequence[str]) -> None: + eval_start_time = time.time() + + if len(argv) > 1: + raise app.UsageError('Too many command-line arguments.') + + word_evaluator = evaluator.HierTextEvaluator( + text_box_type=evaluator.TextBoxRep.POLY, + evaluate_text=_E2E.value) + if _EVAL_LINES.value: + line_evaluator = evaluator.HierTextEvaluator( + evaluate_text=_E2E.value) + else: + line_evaluator = None + if _EVAL_PARAGRAPHS.value: + paragraph_evaluator = evaluator.HierTextEvaluator() + else: + paragraph_evaluator = None + + running_mode = 'in_memory' if _NUM_WORKERS.value == 1 else 'multi_threading' + options = PipelineOptions([ + '--runner=DirectRunner', + f'--direct_num_workers={_NUM_WORKERS.value}', + f'--direct_running_mode={running_mode}', + ]) + with beam.Pipeline(options=options) as pipeline: + _ = ( + pipeline + | 'Read' >> beam.Create(load_annotations(_GT.value, _RESULT.value)) + | 'Parse' >> beam.Map(lambda x: parse_annotation_dict( + x, _EVAL_LINES.value, _EVAL_PARAGRAPHS.value, _MASK_STRIDE.value)) + | 'Eval' >> beam.Map(lambda x: evaluate_one_image( + *x, word_evaluator, line_evaluator, paragraph_evaluator)) + | 'Sum' >> beam.CombineGlobally(dict_add) + | 'Compute-metrics' >> beam.Map(lambda x: compute_eval_metrics( + *x, word_evaluator, line_evaluator, paragraph_evaluator)) + | 'Format-metrics' >> beam.Map(lambda x: metric_format(x)) + | 'Write' >> beam.io.WriteToText(_OUTPUT.value)) + + eval_end_time = time.time() + print(f'Evaluation took {eval_end_time - eval_start_time} secs.') + +if __name__ == '__main__': + flags.mark_flag_as_required('gt') + flags.mark_flag_as_required('result') + flags.mark_flag_as_required('output') + app.run(main) + + # python3 eval.py --gt=validation.jsonl --result=sample.jsonl --output=score.txt --mask_stride=1 --eval_lines --eval_paragraphs diff --git a/hiertext_eval/evaluator/__init__.py b/hiertext_eval/evaluator/__init__.py new file mode 100644 
index 0000000..1561fb0 --- /dev/null +++ b/hiertext_eval/evaluator/__init__.py @@ -0,0 +1 @@ +from . import evaluator diff --git a/hiertext_eval/evaluator/evaluator.py b/hiertext_eval/evaluator/evaluator.py new file mode 100644 index 0000000..5d0c4e1 --- /dev/null +++ b/hiertext_eval/evaluator/evaluator.py @@ -0,0 +1,260 @@ +"""HierText's PQ-like evaluator class.""" + +import enum +from typing import Dict, List, Union + +import numpy as np + +from . import np_box_mask_list +from . import np_box_mask_list_ops +from . import polygon_list +from . import polygon_ops + + +class TextBoxRep(enum.Enum): + MASK = 0 + POLY = 1 + +EPSILON = 1e-5 +TextEntityListType = Union[ + polygon_list.PolygonList, + np_box_mask_list.BoxMaskList] +NumpyDict = Dict[str, np.ndarray] + + +class HierTextEvaluator(object): + """Evaluator class. + + Attributes: + iou_threshold: IoU threshold to use for matching groundtruth boxes/masks + to detection boxes/masks. + text_box_type: The type of the detection representation. + evaluate_text: Whether to evaluate predicted text. + filter_invalid_thr: Predictions will be removed if the pixel-level + precision with any groundtruths that are marked as `dont't care` is larger + than this valule. + """ + + def __init__(self, + iou_threshold: float = 0.5, + text_box_type: TextBoxRep = TextBoxRep.MASK, + evaluate_text: bool = False, + filter_invalid_thr: float = 0.5): + """Constructor.""" + self._iou_threshold = iou_threshold + self._text_box_type = text_box_type + self._evaluate_text = evaluate_text + self._filter_invalid_thr = filter_invalid_thr + + def _init_new_accumulators(self): + """Initialize the counter dict.""" + # For detection: + accumulators = { + 'tp_det_cnt': 0., + 'num_prediction': 0., + 'num_groundtruth': 0., + 'tp_det_tightness_sum': 0., + } + # For E2E: + if self._evaluate_text: + accumulators_for_e2e = { + 'tp_e2e_cnt': 0., + 'tp_e2e_tightness_sum': 0., + } + accumulators.update(accumulators_for_e2e) + return accumulators + + def _get_text(self, + prediction_box_list: TextEntityListType, + gt_box_list: TextEntityListType): + """This function decodes the text content from the text box containers. + + Args: + prediction_box_list: The predicted text entities. + gt_box_list: The GT text entities. + + Returns: + pred_text: A list of str for the predicted text. + gt_text: A list of str for the ground-truth text. 
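These accumulators feed the PQ-style metrics computed in `evaluate` further below in this file: recall and precision over mutually-best matches with IoU ≥ 0.5 give an F-score, the mean IoU over the true positives gives a "tightness" term, and their product is the PQ value. A toy calculation with made-up counts, just to show how the pieces combine:

```python
# Made-up accumulator values for one run (not results from any dataset).
tp_det_cnt = 80.0
num_groundtruth = 100.0
num_prediction = 90.0
tp_det_tightness_sum = 68.0                     # sum of IoUs over the matched (TP) pairs

recall = tp_det_cnt / num_groundtruth           # 0.800
precision = tp_det_cnt / num_prediction         # ~0.889
fscore = 2 * recall * precision / (recall + precision)   # ~0.842
tightness = tp_det_tightness_sum / tp_det_cnt   # 0.850
pq = tightness * fscore                         # ~0.716

print(f"Det-Fscore {fscore:.3f}, Det-Tightness {tightness:.3f}, Det-PQ {pq:.3f}")
```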
+ """ + pred_text = [] + gt_text = [] + + if self._evaluate_text: + pred_text = [str(text) for text in prediction_box_list.get_field( + 'pred_texts')] + gt_text = [str(text) for text in gt_box_list.get_field( + 'gt_texts')] + return pred_text, gt_text + + def _count_tp(self, + num_gt: int, + iou: np.ndarray, + max_iou_ids_for_gts: np.ndarray, + max_iou_ids_for_preds: np.ndarray, + pred_text: List[str], + gt_text: List[str], + accumulators: Dict[str, Union[float, int]]): + """Matching groundtruths and predictions.""" + for i in range(num_gt): + # get the id of prediction that has highest IoU with i-th GT + max_prediction_id = max_iou_ids_for_gts[i] + + box_score = iou[i, max_prediction_id] + + if (box_score >= self._iou_threshold and + # mutually best match + i == max_iou_ids_for_preds[max_prediction_id]): + accumulators['tp_det_cnt'] += 1 + accumulators['tp_det_tightness_sum'] += box_score + + # For E2E: + if self._evaluate_text: + if pred_text[max_prediction_id] == gt_text[i]: + accumulators['tp_e2e_cnt'] += 1 + accumulators['tp_e2e_tightness_sum'] += box_score + + def _filter_dont_care_region(self, + eval_dict: NumpyDict, + prediction_box_list: TextEntityListType, + gt_box_list: TextEntityListType): + """Remove pred if intersection(pred, dont' care gt) / area(pred) >= filter_invalid_thr.""" + valid_gt_indices = np.where(eval_dict['gt_weights'] >= 0.5)[0] + invalid_gt_indices = np.where(eval_dict['gt_weights'] < 0.5)[0] + + if invalid_gt_indices.size and prediction_box_list.num_boxes(): + invalid_gt_box_list = self._gather(gt_box_list, invalid_gt_indices) + precision, _ = self._precision_recall( + prediction_box_list, invalid_gt_box_list) + valid_prediction_indices = np.where( + np.max(precision, axis=1) < self._filter_invalid_thr)[0] + prediction_box_list = self._gather(prediction_box_list, + valid_prediction_indices) + + gt_box_list = self._gather(gt_box_list, valid_gt_indices) + + return prediction_box_list, gt_box_list + + def evaluate_one_image(self, eval_dict: NumpyDict): + """Evaluation code.""" + accumulators = self._init_new_accumulators() + + # step 1: convert to box list objects + gt_box_list, prediction_box_list = self._get_box_lists(eval_dict) + + # step 2: filter out don't care regions + prediction_box_list, gt_box_list = self._filter_dont_care_region( + eval_dict, prediction_box_list, gt_box_list) + + # step 3: evaluate + num_gt = gt_box_list.num_boxes() + num_prediction = prediction_box_list.num_boxes() + + if num_gt and num_prediction: + iou = self._iou(gt_box_list, prediction_box_list) + max_iou_ids_for_gts = np.argmax(iou, axis=1) # for GT + max_iou_ids_for_preds = np.argmax(iou, axis=0) # for pred + pred_text, gt_text = self._get_text(prediction_box_list, gt_box_list) + + self._count_tp(num_gt, iou, + max_iou_ids_for_gts, max_iou_ids_for_preds, + pred_text, gt_text, + accumulators) + + accumulators['num_prediction'] += num_prediction + accumulators['num_groundtruth'] += num_gt + + return accumulators + + def _get_box_lists(self, eval_dict: NumpyDict): + """Get ground truth and prediction box list.""" + if self._text_box_type == TextBoxRep.MASK: + gt_box_list = np_box_mask_list.BoxMaskList( + eval_dict['gt_boxes'], eval_dict['gt_masks'].astype(np.uint8)) + prediction_box_list = np_box_mask_list.BoxMaskList( + eval_dict['detection_boxes'], + eval_dict['detection_masks'].astype(np.uint8)) + elif self._text_box_type == TextBoxRep.POLY: + gt_box_list = polygon_list.PolygonList(eval_dict['gt_boxes']) + prediction_box_list = polygon_list.PolygonList( + 
eval_dict['detection_boxes']) + else: + raise NotImplementedError() + + for key in ['gt_weights', 'gt_texts']: + if key in eval_dict: + gt_box_list.add_field(key, eval_dict[key]) + if 'pred_texts' in eval_dict: + prediction_box_list.add_field('pred_texts', eval_dict['pred_texts']) + + return gt_box_list, prediction_box_list + + def _gather(self, box_list, indices): + """Gather regions by indices.""" + if self._text_box_type == TextBoxRep.MASK: + return np_box_mask_list_ops.gather(box_list, indices) + elif self._text_box_type == TextBoxRep.POLY: + return polygon_ops.gather(box_list, indices) + else: + raise NotImplementedError() + + def _iou(self, boxlist1, boxlist2): + """Computes intersection-over-union.""" + if self._text_box_type == TextBoxRep.MASK: + return np_box_mask_list_ops.iou(boxlist1, boxlist2) + elif self._text_box_type == TextBoxRep.POLY: + return polygon_ops.iou(boxlist1, boxlist2) + else: + raise NotImplementedError() + + def _precision_recall(self, boxlist1, boxlist2): + """Computes pixel precision and recall matrices.""" + + if self._text_box_type == TextBoxRep.MASK: + # intersection: ndarray[N x M] + intersection = np_box_mask_list_ops.intersection(boxlist1, boxlist2) + # box_1_area: ndarray[N x 1], box_2_area: ndarray[1 x M] + box_1_area = np_box_mask_list_ops.area(boxlist1).reshape(-1, 1) + box_2_area = np_box_mask_list_ops.area(boxlist2).reshape(1, -1) + elif self._text_box_type == TextBoxRep.POLY: + # intersection: ndarray[N x M] + # box_1_area: ndarray[N x 1], box_2_area: ndarray[1 x M] + intersection, box_1_area, box_2_area = polygon_ops.intersection_and_area( + boxlist1, boxlist2) + else: + raise NotImplementedError() + precision = intersection / (box_1_area + EPSILON) + recall = intersection / (box_2_area + EPSILON) + return precision, recall + + def evaluate(self, accumulators): + """Compute evaluation metrics.""" + metrics = {} + total_gt = accumulators['num_groundtruth'] + total_prediction = accumulators['num_prediction'] + total_tp_det = accumulators['tp_det_cnt'] + r_det = total_tp_det / total_gt if total_gt else 1.0 + p_det = total_tp_det / total_prediction if total_prediction else 1.0 + metrics['Det-Recall'] = r_det + metrics['Det-Precision'] = p_det + metrics['Det-Fscore'] = (2 * r_det * p_det / (r_det + p_det) if + (r_det + p_det) else 0.0) + metrics['Det-Tightness'] = ( + (accumulators['tp_det_tightness_sum'] / + total_tp_det) if total_tp_det else 1.0) + metrics['Det-PQ'] = metrics['Det-Tightness'] * metrics['Det-Fscore'] + + if self._evaluate_text: + total_tp_e2e = accumulators['tp_e2e_cnt'] + r_e2e = total_tp_e2e / total_gt if total_gt else 1.0 + p_e2e = total_tp_e2e / total_prediction if total_prediction else 1.0 + metrics['E2E-Recall'] = r_e2e + metrics['E2E-Precision'] = p_e2e + metrics['E2E-Fscore'] = (2 * r_e2e * p_e2e / (r_e2e + p_e2e) if + (r_e2e + p_e2e) else 0.0) + metrics['E2E-Tightness'] = ( + accumulators['tp_e2e_tightness_sum'] / + total_tp_e2e if total_tp_e2e else 1.0) + metrics['E2E-PQ'] = metrics['E2E-Tightness'] * metrics['E2E-Fscore'] + + return metrics diff --git a/hiertext_eval/evaluator/np_box_list.py b/hiertext_eval/evaluator/np_box_list.py new file mode 100644 index 0000000..07aec96 --- /dev/null +++ b/hiertext_eval/evaluator/np_box_list.py @@ -0,0 +1,140 @@ +# Lint as: python2, python3 +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +# copied from: https://github.com/tensorflow/models/tree/master/research/object_detection/utils + +"""Numpy BoxList classes and functions.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function +import numpy as np +from six.moves import range + + +class BoxList(object): + """Box collection. + + BoxList represents a list of bounding boxes as numpy array, where each + bounding box is represented as a row of 4 numbers, + [y_min, x_min, y_max, x_max]. It is assumed that all bounding boxes within a + given list correspond to a single image. + + Optionally, users can add additional related fields (such as + objectness/classification scores). + """ + + def __init__(self, data): + """Constructs box collection. + + Args: + data: a numpy array of shape [N, 4] representing box coordinates + + Raises: + ValueError: if bbox data is not a numpy array + ValueError: if invalid dimensions for bbox data + """ + if not isinstance(data, np.ndarray): + raise ValueError('data must be a numpy array.') + if len(data.shape) != 2 or data.shape[1] != 4: + raise ValueError('Invalid dimensions for box data.') + if data.dtype != np.float32 and data.dtype != np.float64: + raise ValueError('Invalid data type for box data: float is required.') + if not self._is_valid_boxes(data): + raise ValueError('Invalid box data. data must be a numpy array of ' + 'N*[y_min, x_min, y_max, x_max]') + self.data = {'boxes': data} + + def num_boxes(self): + """Return number of boxes held in collections.""" + return self.data['boxes'].shape[0] + + def get_extra_fields(self): + """Return all non-box fields.""" + return [k for k in self.data.keys() if k != 'boxes'] + + def has_field(self, field): + return field in self.data + + def add_field(self, field, field_data): + """Add data to a specified field. + + Args: + field: a string parameter used to speficy a related field to be accessed. + field_data: a numpy array of [N, ...] representing the data associated + with the field. + Raises: + ValueError: if the field is already exist or the dimension of the field + data does not matches the number of boxes. + """ + if self.has_field(field): + raise ValueError('Field ' + field + 'already exists') + if len(field_data.shape) < 1 or field_data.shape[0] != self.num_boxes(): + raise ValueError('Invalid dimensions for field data') + self.data[field] = field_data + + def get(self): + """Convenience function for accesssing box coordinates. + + Returns: + a numpy array of shape [N, 4] representing box corners + """ + return self.get_field('boxes') + + def get_field(self, field): + """Accesses data associated with the specified field in the box collection. + + Args: + field: a string parameter used to speficy a related field to be accessed. 
+ + Returns: + a numpy 1-d array representing data of an associated field + + Raises: + ValueError: if invalid field + """ + if not self.has_field(field): + raise ValueError('field {} does not exist'.format(field)) + return self.data[field] + + def get_coordinates(self): + """Get corner coordinates of boxes. + + Returns: + a list of 4 1-d numpy arrays [y_min, x_min, y_max, x_max] + """ + box_coordinates = self.get() + y_min = box_coordinates[:, 0] + x_min = box_coordinates[:, 1] + y_max = box_coordinates[:, 2] + x_max = box_coordinates[:, 3] + return [y_min, x_min, y_max, x_max] + + def _is_valid_boxes(self, data): + """Check whether data fullfills the format of N*[ymin, xmin, ymax, xmin]. + + Args: + data: a numpy array of shape [N, 4] representing box coordinates + + Returns: + a boolean indicating whether all ymax of boxes are equal or greater than + ymin, and all xmax of boxes are equal or greater than xmin. + """ + if data.shape[0] > 0: + for i in range(data.shape[0]): + if data[i, 0] > data[i, 2] or data[i, 1] > data[i, 3]: + return False + return True diff --git a/hiertext_eval/evaluator/np_box_list_ops.py b/hiertext_eval/evaluator/np_box_list_ops.py new file mode 100644 index 0000000..d34e227 --- /dev/null +++ b/hiertext_eval/evaluator/np_box_list_ops.py @@ -0,0 +1,566 @@ +# Lint as: python2, python3 +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +# copied from: https://github.com/tensorflow/models/tree/master/research/object_detection/utils + +"""Bounding Box List operations for Numpy BoxLists. + +Example box operations that are supported: + * Areas: compute bounding box areas + * IOU: pairwise intersection-over-union scores +""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +from six.moves import range + +from . import np_box_list +from . import np_box_ops + + +class SortOrder(object): + """Enum class for sort order. + + Attributes: + ascend: ascend order. + descend: descend order. + """ + ASCEND = 1 + DESCEND = 2 + + +def area(boxlist): + """Computes area of boxes. + + Args: + boxlist: BoxList holding N boxes + + Returns: + a numpy array with shape [N*1] representing box areas + """ + y_min, x_min, y_max, x_max = boxlist.get_coordinates() + return (y_max - y_min) * (x_max - x_min) + + +def intersection(boxlist1, boxlist2): + """Compute pairwise intersection areas between boxes. + + Args: + boxlist1: BoxList holding N boxes + boxlist2: BoxList holding M boxes + + Returns: + a numpy array with shape [N*M] representing pairwise intersection area + """ + return np_box_ops.intersection(boxlist1.get(), boxlist2.get()) + + +def iou(boxlist1, boxlist2): + """Computes pairwise intersection-over-union between box collections. 
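+
+  A minimal illustrative example (assumed values, boxes given as
+  [y_min, x_min, y_max, x_max]):
+
+    boxlist1 = np_box_list.BoxList(np.array([[0., 0., 2., 2.]]))
+    boxlist2 = np_box_list.BoxList(np.array([[1., 1., 3., 3.]]))
+    iou(boxlist1, boxlist2)  # -> [[1/7]] ~= 0.143 (intersection 1, union 7)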
+ + Args: + boxlist1: BoxList holding N boxes + boxlist2: BoxList holding M boxes + + Returns: + a numpy array with shape [N, M] representing pairwise iou scores. + """ + return np_box_ops.iou(boxlist1.get(), boxlist2.get()) + + +def ioa(boxlist1, boxlist2): + """Computes pairwise intersection-over-area between box collections. + + Intersection-over-area (ioa) between two boxes box1 and box2 is defined as + their intersection area over box2's area. Note that ioa is not symmetric, + that is, IOA(box1, box2) != IOA(box2, box1). + + Args: + boxlist1: BoxList holding N boxes + boxlist2: BoxList holding M boxes + + Returns: + a numpy array with shape [N, M] representing pairwise ioa scores. + """ + return np_box_ops.ioa(boxlist1.get(), boxlist2.get()) + + +def gather(boxlist, indices, fields=None): + """Gather boxes from BoxList according to indices and return new BoxList. + + By default, gather returns boxes corresponding to the input index list, as + well as all additional fields stored in the boxlist (indexing into the + first dimension). However one can optionally only gather from a + subset of fields. + + Args: + boxlist: BoxList holding N boxes + indices: a 1-d numpy array of type int_ + fields: (optional) list of fields to also gather from. If None (default), + all fields are gathered from. Pass an empty fields list to only gather + the box coordinates. + + Returns: + subboxlist: a BoxList corresponding to the subset of the input BoxList + specified by indices + + Raises: + ValueError: if specified field is not contained in boxlist or if the + indices are not of type int_ + """ + if indices.size: + if np.amax(indices) >= boxlist.num_boxes() or np.amin(indices) < 0: + raise ValueError('indices are out of valid range.') + subboxlist = np_box_list.BoxList(boxlist.get()[indices, :]) + if fields is None: + fields = boxlist.get_extra_fields() + for field in fields: + extra_field_data = boxlist.get_field(field) + subboxlist.add_field(field, extra_field_data[indices, ...]) + return subboxlist + + +def sort_by_field(boxlist, field, order=SortOrder.DESCEND): + """Sort boxes and associated fields according to a scalar field. + + A common use case is reordering the boxes according to descending scores. + + Args: + boxlist: BoxList holding N boxes. + field: A BoxList field for sorting and reordering the BoxList. + order: (Optional) 'descend' or 'ascend'. Default is descend. + + Returns: + sorted_boxlist: A sorted BoxList with the field in the specified order. + + Raises: + ValueError: if specified field does not exist or is not of single dimension. + ValueError: if the order is not either descend or ascend. + """ + if not boxlist.has_field(field): + raise ValueError('Field ' + field + ' does not exist') + if len(boxlist.get_field(field).shape) != 1: + raise ValueError('Field ' + field + 'should be single dimension.') + if order != SortOrder.DESCEND and order != SortOrder.ASCEND: + raise ValueError('Invalid sort order') + + field_to_sort = boxlist.get_field(field) + sorted_indices = np.argsort(field_to_sort) + if order == SortOrder.DESCEND: + sorted_indices = sorted_indices[::-1] + return gather(boxlist, sorted_indices) + + +def non_max_suppression(boxlist, + max_output_size=10000, + iou_threshold=1.0, + score_threshold=-10.0): + """Non maximum suppression. + + This op greedily selects a subset of detection bounding boxes, pruning + away boxes that have high IOU (intersection over union) overlap (> thresh) + with already selected boxes. 
In each iteration, the detected bounding box with + highest score in the available pool is selected. + + Args: + boxlist: BoxList holding N boxes. Must contain a 'scores' field + representing detection scores. All scores belong to the same class. + max_output_size: maximum number of retained boxes + iou_threshold: intersection over union threshold. + score_threshold: minimum score threshold. Remove the boxes with scores + less than this value. Default value is set to -10. A very + low threshold to pass pretty much all the boxes, unless + the user sets a different score threshold. + + Returns: + a BoxList holding M boxes where M <= max_output_size + Raises: + ValueError: if 'scores' field does not exist + ValueError: if threshold is not in [0, 1] + ValueError: if max_output_size < 0 + """ + if not boxlist.has_field('scores'): + raise ValueError('Field scores does not exist') + if iou_threshold < 0. or iou_threshold > 1.0: + raise ValueError('IOU threshold must be in [0, 1]') + if max_output_size < 0: + raise ValueError('max_output_size must be bigger than 0.') + + boxlist = filter_scores_greater_than(boxlist, score_threshold) + if boxlist.num_boxes() == 0: + return boxlist + + boxlist = sort_by_field(boxlist, 'scores') + + # Prevent further computation if NMS is disabled. + if iou_threshold == 1.0: + if boxlist.num_boxes() > max_output_size: + selected_indices = np.arange(max_output_size) + return gather(boxlist, selected_indices) + else: + return boxlist + + boxes = boxlist.get() + num_boxes = boxlist.num_boxes() + # is_index_valid is True only for all remaining valid boxes, + is_index_valid = np.full(num_boxes, 1, dtype=bool) + selected_indices = [] + num_output = 0 + for i in range(num_boxes): + if num_output < max_output_size: + if is_index_valid[i]: + num_output += 1 + selected_indices.append(i) + is_index_valid[i] = False + valid_indices = np.where(is_index_valid)[0] + if valid_indices.size == 0: + break + + intersect_over_union = np_box_ops.iou( + np.expand_dims(boxes[i, :], axis=0), boxes[valid_indices, :]) + intersect_over_union = np.squeeze(intersect_over_union, axis=0) + is_index_valid[valid_indices] = np.logical_and( + is_index_valid[valid_indices], + intersect_over_union <= iou_threshold) + return gather(boxlist, np.array(selected_indices)) + + +def multi_class_non_max_suppression(boxlist, score_thresh, iou_thresh, + max_output_size): + """Multi-class version of non maximum suppression. + + This op greedily selects a subset of detection bounding boxes, pruning + away boxes that have high IOU (intersection over union) overlap (> thresh) + with already selected boxes. It operates independently for each class for + which scores are provided (via the scores field of the input box_list), + pruning boxes with score less than a provided threshold prior to + applying NMS. + + Args: + boxlist: BoxList holding N boxes. Must contain a 'scores' field + representing detection scores. This scores field is a tensor that can + be 1 dimensional (in the case of a single class) or 2-dimensional, which + which case we assume that it takes the shape [num_boxes, num_classes]. + We further assume that this rank is known statically and that + scores.shape[1] is also known (i.e., the number of classes is fixed + and known at graph construction time). + score_thresh: scalar threshold for score (low scoring boxes are removed). + iou_thresh: scalar threshold for IOU (boxes that that high IOU overlap + with previously selected boxes are removed). 
+ max_output_size: maximum number of retained boxes per class. + + Returns: + a BoxList holding M boxes with a rank-1 scores field representing + corresponding scores for each box with scores sorted in decreasing order + and a rank-1 classes field representing a class label for each box. + Raises: + ValueError: if iou_thresh is not in [0, 1] or if input boxlist does not have + a valid scores field. + """ + if not 0 <= iou_thresh <= 1.0: + raise ValueError('thresh must be between 0 and 1') + if not isinstance(boxlist, np_box_list.BoxList): + raise ValueError('boxlist must be a BoxList') + if not boxlist.has_field('scores'): + raise ValueError('input boxlist must have \'scores\' field') + scores = boxlist.get_field('scores') + if len(scores.shape) == 1: + scores = np.reshape(scores, [-1, 1]) + elif len(scores.shape) == 2: + if scores.shape[1] is None: + raise ValueError('scores field must have statically defined second ' + 'dimension') + else: + raise ValueError('scores field must be of rank 1 or 2') + num_boxes = boxlist.num_boxes() + num_scores = scores.shape[0] + num_classes = scores.shape[1] + + if num_boxes != num_scores: + raise ValueError('Incorrect scores field length: actual vs expected.') + + selected_boxes_list = [] + for class_idx in range(num_classes): + boxlist_and_class_scores = np_box_list.BoxList(boxlist.get()) + class_scores = np.reshape(scores[0:num_scores, class_idx], [-1]) + boxlist_and_class_scores.add_field('scores', class_scores) + boxlist_filt = filter_scores_greater_than(boxlist_and_class_scores, + score_thresh) + nms_result = non_max_suppression(boxlist_filt, + max_output_size=max_output_size, + iou_threshold=iou_thresh, + score_threshold=score_thresh) + nms_result.add_field( + 'classes', np.zeros_like(nms_result.get_field('scores')) + class_idx) + selected_boxes_list.append(nms_result) + selected_boxes = concatenate(selected_boxes_list) + sorted_boxes = sort_by_field(selected_boxes, 'scores') + return sorted_boxes + + +def scale(boxlist, y_scale, x_scale): + """Scale box coordinates in x and y dimensions. + + Args: + boxlist: BoxList holding N boxes + y_scale: float + x_scale: float + + Returns: + boxlist: BoxList holding N boxes + """ + y_min, x_min, y_max, x_max = np.array_split(boxlist.get(), 4, axis=1) + y_min = y_scale * y_min + y_max = y_scale * y_max + x_min = x_scale * x_min + x_max = x_scale * x_max + scaled_boxlist = np_box_list.BoxList(np.hstack([y_min, x_min, y_max, x_max])) + + fields = boxlist.get_extra_fields() + for field in fields: + extra_field_data = boxlist.get_field(field) + scaled_boxlist.add_field(field, extra_field_data) + + return scaled_boxlist + + +def clip_to_window(boxlist, window, filter_nonoverlapping=True): + """Clip bounding boxes to a window. + + This op clips input bounding boxes (represented by bounding box + corners) to a window, optionally filtering out boxes that do not + overlap at all with the window. + + Args: + boxlist: BoxList holding M_in boxes + window: a numpy array of shape [4] representing the + [y_min, x_min, y_max, x_max] window to which the op + should clip boxes. + filter_nonoverlapping: whether to filter out boxes that do not overlap at + all with the window. 
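+
+  A minimal illustrative example (assumed values): the first box is clipped to
+  the window, the second lies entirely outside it and is filtered out.
+
+    window = np.array([0., 0., 10., 10.])
+    boxes = np_box_list.BoxList(np.array([[-5., -5., 5., 5.],
+                                          [20., 20., 30., 30.]]))
+    clip_to_window(boxes, window).get()  # -> [[0., 0., 5., 5.]]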
+ + Returns: + a BoxList holding M_out boxes where M_out <= M_in + """ + y_min, x_min, y_max, x_max = np.array_split(boxlist.get(), 4, axis=1) + win_y_min = window[0] + win_x_min = window[1] + win_y_max = window[2] + win_x_max = window[3] + y_min_clipped = np.fmax(np.fmin(y_min, win_y_max), win_y_min) + y_max_clipped = np.fmax(np.fmin(y_max, win_y_max), win_y_min) + x_min_clipped = np.fmax(np.fmin(x_min, win_x_max), win_x_min) + x_max_clipped = np.fmax(np.fmin(x_max, win_x_max), win_x_min) + clipped = np_box_list.BoxList( + np.hstack([y_min_clipped, x_min_clipped, y_max_clipped, x_max_clipped])) + clipped = _copy_extra_fields(clipped, boxlist) + if filter_nonoverlapping: + areas = area(clipped) + nonzero_area_indices = np.reshape( + np.nonzero(np.greater(areas, 0.0)), [-1]).astype(np.int32) + clipped = gather(clipped, nonzero_area_indices) + return clipped + + +def prune_non_overlapping_boxes(boxlist1, boxlist2, minoverlap=0.0): + """Prunes the boxes in boxlist1 that overlap less than thresh with boxlist2. + + For each box in boxlist1, we want its IOA to be more than minoverlap with + at least one of the boxes in boxlist2. If it does not, we remove it. + + Args: + boxlist1: BoxList holding N boxes. + boxlist2: BoxList holding M boxes. + minoverlap: Minimum required overlap between boxes, to count them as + overlapping. + + Returns: + A pruned boxlist with size [N', 4]. + """ + intersection_over_area = ioa(boxlist2, boxlist1) # [M, N] tensor + intersection_over_area = np.amax(intersection_over_area, axis=0) # [N] tensor + keep_bool = np.greater_equal(intersection_over_area, np.array(minoverlap)) + keep_inds = np.nonzero(keep_bool)[0] + new_boxlist1 = gather(boxlist1, keep_inds) + return new_boxlist1 + + +def prune_outside_window(boxlist, window): + """Prunes bounding boxes that fall outside a given window. + + This function prunes bounding boxes that even partially fall outside the given + window. See also ClipToWindow which only prunes bounding boxes that fall + completely outside the window, and clips any bounding boxes that partially + overflow. + + Args: + boxlist: a BoxList holding M_in boxes. + window: a numpy array of size 4, representing [ymin, xmin, ymax, xmax] + of the window. + + Returns: + pruned_corners: a tensor with shape [M_out, 4] where M_out <= M_in. + valid_indices: a tensor with shape [M_out] indexing the valid bounding boxes + in the input tensor. + """ + + y_min, x_min, y_max, x_max = np.array_split(boxlist.get(), 4, axis=1) + win_y_min = window[0] + win_x_min = window[1] + win_y_max = window[2] + win_x_max = window[3] + coordinate_violations = np.hstack([np.less(y_min, win_y_min), + np.less(x_min, win_x_min), + np.greater(y_max, win_y_max), + np.greater(x_max, win_x_max)]) + valid_indices = np.reshape( + np.where(np.logical_not(np.max(coordinate_violations, axis=1))), [-1]) + return gather(boxlist, valid_indices), valid_indices + + +def concatenate(boxlists, fields=None): + """Concatenate list of BoxLists. + + This op concatenates a list of input BoxLists into a larger BoxList. It also + handles concatenation of BoxList fields as long as the field tensor shapes + are equal except for the first dimension. + + Args: + boxlists: list of BoxList objects + fields: optional list of fields to also concatenate. By default, all + fields from the first BoxList in the list are included in the + concatenation. 
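+
+  A minimal illustrative example (assumed values), showing that extra fields
+  such as 'scores' are concatenated along with the boxes:
+
+    a = np_box_list.BoxList(np.array([[0., 0., 1., 1.]]))
+    a.add_field('scores', np.array([0.9]))
+    b = np_box_list.BoxList(np.array([[1., 1., 2., 2.]]))
+    b.add_field('scores', np.array([0.5]))
+    concatenate([a, b]).get_field('scores')  # -> [0.9, 0.5]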
+ + Returns: + a BoxList with number of boxes equal to + sum([boxlist.num_boxes() for boxlist in BoxList]) + Raises: + ValueError: if boxlists is invalid (i.e., is not a list, is empty, or + contains non BoxList objects), or if requested fields are not contained in + all boxlists + """ + if not isinstance(boxlists, list): + raise ValueError('boxlists should be a list') + if not boxlists: + raise ValueError('boxlists should have nonzero length') + for boxlist in boxlists: + if not isinstance(boxlist, np_box_list.BoxList): + raise ValueError('all elements of boxlists should be BoxList objects') + concatenated = np_box_list.BoxList( + np.vstack([boxlist.get() for boxlist in boxlists])) + if fields is None: + fields = boxlists[0].get_extra_fields() + for field in fields: + first_field_shape = boxlists[0].get_field(field).shape + first_field_shape = first_field_shape[1:] + for boxlist in boxlists: + if not boxlist.has_field(field): + raise ValueError('boxlist must contain all requested fields') + field_shape = boxlist.get_field(field).shape + field_shape = field_shape[1:] + if field_shape != first_field_shape: + raise ValueError('field %s must have same shape for all boxlists ' + 'except for the 0th dimension.' % field) + concatenated_field = np.concatenate( + [boxlist.get_field(field) for boxlist in boxlists], axis=0) + concatenated.add_field(field, concatenated_field) + return concatenated + + +def filter_scores_greater_than(boxlist, thresh): + """Filter to keep only boxes with score exceeding a given threshold. + + This op keeps the collection of boxes whose corresponding scores are + greater than the input threshold. + + Args: + boxlist: BoxList holding N boxes. Must contain a 'scores' field + representing detection scores. + thresh: scalar threshold + + Returns: + a BoxList holding M boxes where M <= N + + Raises: + ValueError: if boxlist not a BoxList object or if it does not + have a scores field + """ + if not isinstance(boxlist, np_box_list.BoxList): + raise ValueError('boxlist must be a BoxList') + if not boxlist.has_field('scores'): + raise ValueError('input boxlist must have \'scores\' field') + scores = boxlist.get_field('scores') + if len(scores.shape) > 2: + raise ValueError('Scores should have rank 1 or 2') + if len(scores.shape) == 2 and scores.shape[1] != 1: + raise ValueError('Scores should have rank 1 or have shape ' + 'consistent with [None, 1]') + high_score_indices = np.reshape(np.where(np.greater(scores, thresh)), + [-1]).astype(np.int32) + return gather(boxlist, high_score_indices) + + +def change_coordinate_frame(boxlist, window): + """Change coordinate frame of the boxlist to be relative to window's frame. + + Given a window of the form [ymin, xmin, ymax, xmax], + changes bounding box coordinates from boxlist to be relative to this window + (e.g., the min corner maps to (0,0) and the max corner maps to (1,1)). + + An example use case is data augmentation: where we are given groundtruth + boxes (boxlist) and would like to randomly crop the image to some + window (window). In this case we need to change the coordinate frame of + each groundtruth box to be relative to this new window. + + Args: + boxlist: A BoxList object holding N boxes. + window: a size 4 1-D numpy array. + + Returns: + Returns a BoxList object with N boxes. 
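+
+  A minimal illustrative example (assumed values): a box covering the top-left
+  quarter of a 10x10 window maps to [0, 0, 0.5, 0.5] in window coordinates.
+
+    window = np.array([10., 10., 20., 20.])
+    boxes = np_box_list.BoxList(np.array([[10., 10., 15., 15.]]))
+    change_coordinate_frame(boxes, window).get()  # -> [[0., 0., 0.5, 0.5]]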
+ """ + win_height = window[2] - window[0] + win_width = window[3] - window[1] + boxlist_new = scale( + np_box_list.BoxList(boxlist.get() - + [window[0], window[1], window[0], window[1]]), + 1.0 / win_height, 1.0 / win_width) + _copy_extra_fields(boxlist_new, boxlist) + + return boxlist_new + + +def _copy_extra_fields(boxlist_to_copy_to, boxlist_to_copy_from): + """Copies the extra fields of boxlist_to_copy_from to boxlist_to_copy_to. + + Args: + boxlist_to_copy_to: BoxList to which extra fields are copied. + boxlist_to_copy_from: BoxList from which fields are copied. + + Returns: + boxlist_to_copy_to with extra fields. + """ + for field in boxlist_to_copy_from.get_extra_fields(): + boxlist_to_copy_to.add_field(field, boxlist_to_copy_from.get_field(field)) + return boxlist_to_copy_to + + +def _update_valid_indices_by_removing_high_iou_boxes( + selected_indices, is_index_valid, intersect_over_union, threshold): + max_iou = np.max(intersect_over_union[:, selected_indices], axis=1) + return np.logical_and(is_index_valid, max_iou <= threshold) diff --git a/hiertext_eval/evaluator/np_box_mask_list.py b/hiertext_eval/evaluator/np_box_mask_list.py new file mode 100644 index 0000000..3cc705f --- /dev/null +++ b/hiertext_eval/evaluator/np_box_mask_list.py @@ -0,0 +1,69 @@ +# Lint as: python2, python3 +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +# copied from: https://github.com/tensorflow/models/tree/master/research/object_detection/utils + +"""Numpy BoxMaskList classes and functions.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +from . import np_box_list + + +class BoxMaskList(np_box_list.BoxList): + """Convenience wrapper for BoxList with masks. + + BoxMaskList extends the np_box_list.BoxList to contain masks as well. + In particular, its constructor receives both boxes and masks. Note that the + masks correspond to the full image. + """ + + def __init__(self, box_data, mask_data): + """Constructs box collection. + + Args: + box_data: a numpy array of shape [N, 4] representing box coordinates + mask_data: a numpy array of shape [N, height, width] representing masks + with values are in {0,1}. The masks correspond to the full + image. The height and the width will be equal to image height and width. 
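+
+    A minimal illustrative example (assumed values): one box with a 2x2
+    foreground patch inside a 4x4 mask.
+
+      boxes = np.array([[0., 0., 2., 2.]])
+      masks = np.zeros((1, 4, 4), dtype=np.uint8)
+      masks[0, :2, :2] = 1
+      BoxMaskList(boxes, masks).get_masks().sum()  # -> 4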
+ + Raises: + ValueError: if bbox data is not a numpy array + ValueError: if invalid dimensions for bbox data + ValueError: if mask data is not a numpy array + ValueError: if invalid dimension for mask data + """ + super(BoxMaskList, self).__init__(box_data) + if not isinstance(mask_data, np.ndarray): + raise ValueError('Mask data must be a numpy array.') + if len(mask_data.shape) != 3: + raise ValueError('Invalid dimensions for mask data.') + if mask_data.dtype != np.uint8: + raise ValueError('Invalid data type for mask data: uint8 is required.') + if mask_data.shape[0] != box_data.shape[0]: + raise ValueError('There should be the same number of boxes and masks.') + self.data['masks'] = mask_data + + def get_masks(self): + """Convenience function for accessing masks. + + Returns: + a numpy array of shape [N, height, width] representing masks + """ + return self.get_field('masks') diff --git a/hiertext_eval/evaluator/np_box_mask_list_ops.py b/hiertext_eval/evaluator/np_box_mask_list_ops.py new file mode 100644 index 0000000..c2e623e --- /dev/null +++ b/hiertext_eval/evaluator/np_box_mask_list_ops.py @@ -0,0 +1,407 @@ +# Lint as: python2, python3 +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +# copied from: https://github.com/tensorflow/models/tree/master/research/object_detection/utils + +"""Operations for np_box_mask_list.BoxMaskList. + +Example box operations that are supported: + * Areas: compute bounding box areas + * IOU: pairwise intersection-over-union scores +""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +from six.moves import range + +from . import np_box_list_ops +from . import np_box_mask_list +from . import np_mask_ops + + +def box_list_to_box_mask_list(boxlist): + """Converts a BoxList containing 'masks' into a BoxMaskList. + + Args: + boxlist: An np_box_list.BoxList object. + + Returns: + An np_box_mask_list.BoxMaskList object. + + Raises: + ValueError: If boxlist does not contain `masks` as a field. + """ + if not boxlist.has_field('masks'): + raise ValueError('boxlist does not contain mask field.') + box_mask_list = np_box_mask_list.BoxMaskList( + box_data=boxlist.get(), + mask_data=boxlist.get_field('masks')) + extra_fields = boxlist.get_extra_fields() + for key in extra_fields: + if key != 'masks': + box_mask_list.data[key] = boxlist.get_field(key) + return box_mask_list + + +def area(box_mask_list): + """Computes area of masks. + + Args: + box_mask_list: np_box_mask_list.BoxMaskList holding N boxes and masks + + Returns: + a numpy array with shape [N*1] representing mask areas + """ + return np_mask_ops.area(box_mask_list.get_masks()) + + +def intersection(box_mask_list1, box_mask_list2): + """Compute pairwise intersection areas between masks. 
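+
+  The pixel-level computation is delegated to np_mask_ops.intersection, which
+  only sums np.minimum(mask_i, mask_j) for pairs whose axis-aligned boxes
+  overlap. A minimal illustrative example (assumed values):
+
+    masks = np.zeros((1, 4, 4), dtype=np.uint8)
+    masks[0, :2, :2] = 1
+    bml = np_box_mask_list.BoxMaskList(np.array([[0., 0., 2., 2.]]), masks)
+    intersection(bml, bml)  # -> [[4.]]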
+ + Args: + box_mask_list1: BoxMaskList holding N boxes and masks + box_mask_list2: BoxMaskList holding M boxes and masks + + Returns: + a numpy array with shape [N*M] representing pairwise intersection area + """ + return np_mask_ops.intersection(box_mask_list1, + box_mask_list2) + + +def iou(box_mask_list1, box_mask_list2): + """Computes pairwise intersection-over-union between box and mask collections. + + Args: + box_mask_list1: BoxMaskList holding N boxes and masks + box_mask_list2: BoxMaskList holding M boxes and masks + + Returns: + a numpy array with shape [N, M] representing pairwise iou scores. + """ + return np_mask_ops.iou(box_mask_list1, + box_mask_list2) + + +def ioa(box_mask_list1, box_mask_list2): + """Computes pairwise intersection-over-area between box and mask collections. + + Intersection-over-area (ioa) between two masks mask1 and mask2 is defined as + their intersection area over mask2's area. Note that ioa is not symmetric, + that is, IOA(mask1, mask2) != IOA(mask2, mask1). + + Args: + box_mask_list1: np_box_mask_list.BoxMaskList holding N boxes and masks + box_mask_list2: np_box_mask_list.BoxMaskList holding M boxes and masks + + Returns: + a numpy array with shape [N, M] representing pairwise ioa scores. + """ + return np_mask_ops.ioa(box_mask_list1.get_masks(), box_mask_list2.get_masks()) + + +def gather(box_mask_list, indices, fields=None): + """Gather boxes from np_box_mask_list.BoxMaskList according to indices. + + By default, gather returns boxes corresponding to the input index list, as + well as all additional fields stored in the box_mask_list (indexing into the + first dimension). However one can optionally only gather from a + subset of fields. + + Args: + box_mask_list: np_box_mask_list.BoxMaskList holding N boxes + indices: a 1-d numpy array of type int_ + fields: (optional) list of fields to also gather from. If None (default), + all fields are gathered from. Pass an empty fields list to only gather + the box coordinates. + + Returns: + subbox_mask_list: a np_box_mask_list.BoxMaskList corresponding to the subset + of the input box_mask_list specified by indices + + Raises: + ValueError: if specified field is not contained in box_mask_list or if the + indices are not of type int_ + """ + if fields is not None: + if 'masks' not in fields: + fields.append('masks') + return box_list_to_box_mask_list( + np_box_list_ops.gather( + boxlist=box_mask_list, indices=indices, fields=fields)) + + +def sort_by_field(box_mask_list, field, + order=np_box_list_ops.SortOrder.DESCEND): + """Sort boxes and associated fields according to a scalar field. + + A common use case is reordering the boxes according to descending scores. + + Args: + box_mask_list: BoxMaskList holding N boxes. + field: A BoxMaskList field for sorting and reordering the BoxMaskList. + order: (Optional) 'descend' or 'ascend'. Default is descend. + + Returns: + sorted_box_mask_list: A sorted BoxMaskList with the field in the specified + order. + """ + return box_list_to_box_mask_list( + np_box_list_ops.sort_by_field( + boxlist=box_mask_list, field=field, order=order)) + + +def non_max_suppression(box_mask_list, + max_output_size=10000, + iou_threshold=1.0, + score_threshold=-10.0): + """Non maximum suppression. + + This op greedily selects a subset of detection bounding boxes, pruning + away boxes that have high IOU (intersection over union) overlap (> thresh) + with already selected boxes. In each iteration, the detected bounding box with + highest score in the available pool is selected. 
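+
+  A minimal illustrative example (assumed values) with two identical masks, of
+  which only the higher-scoring one survives:
+
+    masks = np.zeros((2, 4, 4), dtype=np.uint8)
+    masks[:, :2, :2] = 1
+    boxes = np.array([[0., 0., 2., 2.], [0., 0., 2., 2.]])
+    bml = np_box_mask_list.BoxMaskList(boxes, masks)
+    bml.add_field('scores', np.array([0.9, 0.8]))
+    non_max_suppression(bml, iou_threshold=0.5).num_boxes()  # -> 1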
+ + Args: + box_mask_list: np_box_mask_list.BoxMaskList holding N boxes. Must contain + a 'scores' field representing detection scores. All scores belong to the + same class. + max_output_size: maximum number of retained boxes + iou_threshold: intersection over union threshold. + score_threshold: minimum score threshold. Remove the boxes with scores + less than this value. Default value is set to -10. A very + low threshold to pass pretty much all the boxes, unless + the user sets a different score threshold. + + Returns: + an np_box_mask_list.BoxMaskList holding M boxes where M <= max_output_size + + Raises: + ValueError: if 'scores' field does not exist + ValueError: if threshold is not in [0, 1] + ValueError: if max_output_size < 0 + """ + if not box_mask_list.has_field('scores'): + raise ValueError('Field scores does not exist') + if iou_threshold < 0. or iou_threshold > 1.0: + raise ValueError('IOU threshold must be in [0, 1]') + if max_output_size < 0: + raise ValueError('max_output_size must be bigger than 0.') + + box_mask_list = filter_scores_greater_than(box_mask_list, score_threshold) + if box_mask_list.num_boxes() == 0: + return box_mask_list + + box_mask_list = sort_by_field(box_mask_list, 'scores') + + # Prevent further computation if NMS is disabled. + if iou_threshold == 1.0: + if box_mask_list.num_boxes() > max_output_size: + selected_indices = np.arange(max_output_size) + return gather(box_mask_list, selected_indices) + else: + return box_mask_list + + masks = box_mask_list.get_masks() + num_masks = box_mask_list.num_boxes() + + # is_index_valid is True only for all remaining valid boxes, + is_index_valid = np.full(num_masks, 1, dtype=bool) + selected_indices = [] + num_output = 0 + for i in range(num_masks): + if num_output < max_output_size: + if is_index_valid[i]: + num_output += 1 + selected_indices.append(i) + is_index_valid[i] = False + valid_indices = np.where(is_index_valid)[0] + if valid_indices.size == 0: + break + + intersect_over_union = np_mask_ops.iou( + np.expand_dims(masks[i], axis=0), masks[valid_indices]) + intersect_over_union = np.squeeze(intersect_over_union, axis=0) + is_index_valid[valid_indices] = np.logical_and( + is_index_valid[valid_indices], + intersect_over_union <= iou_threshold) + return gather(box_mask_list, np.array(selected_indices)) + + +def multi_class_non_max_suppression(box_mask_list, score_thresh, iou_thresh, + max_output_size): + """Multi-class version of non maximum suppression. + + This op greedily selects a subset of detection bounding boxes, pruning + away boxes that have high IOU (intersection over union) overlap (> thresh) + with already selected boxes. It operates independently for each class for + which scores are provided (via the scores field of the input box_list), + pruning boxes with score less than a provided threshold prior to + applying NMS. + + Args: + box_mask_list: np_box_mask_list.BoxMaskList holding N boxes. Must contain a + 'scores' field representing detection scores. This scores field is a + tensor that can be 1 dimensional (in the case of a single class) or + 2-dimensional, in which case we assume that it takes the + shape [num_boxes, num_classes]. We further assume that this rank is known + statically and that scores.shape[1] is also known (i.e., the number of + classes is fixed and known at graph construction time). + score_thresh: scalar threshold for score (low scoring boxes are removed). 
+ iou_thresh: scalar threshold for IOU (boxes that that high IOU overlap + with previously selected boxes are removed). + max_output_size: maximum number of retained boxes per class. + + Returns: + a box_mask_list holding M boxes with a rank-1 scores field representing + corresponding scores for each box with scores sorted in decreasing order + and a rank-1 classes field representing a class label for each box. + Raises: + ValueError: if iou_thresh is not in [0, 1] or if input box_mask_list does + not have a valid scores field. + """ + if not 0 <= iou_thresh <= 1.0: + raise ValueError('thresh must be between 0 and 1') + if not isinstance(box_mask_list, np_box_mask_list.BoxMaskList): + raise ValueError('box_mask_list must be a box_mask_list') + if not box_mask_list.has_field('scores'): + raise ValueError('input box_mask_list must have \'scores\' field') + scores = box_mask_list.get_field('scores') + if len(scores.shape) == 1: + scores = np.reshape(scores, [-1, 1]) + elif len(scores.shape) == 2: + if scores.shape[1] is None: + raise ValueError('scores field must have statically defined second ' + 'dimension') + else: + raise ValueError('scores field must be of rank 1 or 2') + + num_boxes = box_mask_list.num_boxes() + num_scores = scores.shape[0] + num_classes = scores.shape[1] + + if num_boxes != num_scores: + raise ValueError('Incorrect scores field length: actual vs expected.') + + selected_boxes_list = [] + for class_idx in range(num_classes): + box_mask_list_and_class_scores = np_box_mask_list.BoxMaskList( + box_data=box_mask_list.get(), + mask_data=box_mask_list.get_masks()) + class_scores = np.reshape(scores[0:num_scores, class_idx], [-1]) + box_mask_list_and_class_scores.add_field('scores', class_scores) + box_mask_list_filt = filter_scores_greater_than( + box_mask_list_and_class_scores, score_thresh) + nms_result = non_max_suppression( + box_mask_list_filt, + max_output_size=max_output_size, + iou_threshold=iou_thresh, + score_threshold=score_thresh) + nms_result.add_field( + 'classes', + np.zeros_like(nms_result.get_field('scores')) + class_idx) + selected_boxes_list.append(nms_result) + selected_boxes = np_box_list_ops.concatenate(selected_boxes_list) + sorted_boxes = np_box_list_ops.sort_by_field(selected_boxes, 'scores') + return box_list_to_box_mask_list(boxlist=sorted_boxes) + + +def prune_non_overlapping_masks(box_mask_list1, box_mask_list2, minoverlap=0.0): + """Prunes the boxes in list1 that overlap less than thresh with list2. + + For each mask in box_mask_list1, we want its IOA to be more than minoverlap + with at least one of the masks in box_mask_list2. If it does not, we remove + it. If the masks are not full size image, we do the pruning based on boxes. + + Args: + box_mask_list1: np_box_mask_list.BoxMaskList holding N boxes and masks. + box_mask_list2: np_box_mask_list.BoxMaskList holding M boxes and masks. + minoverlap: Minimum required overlap between boxes, to count them as + overlapping. + + Returns: + A pruned box_mask_list with size [N', 4]. + """ + intersection_over_area = ioa(box_mask_list2, box_mask_list1) # [M, N] tensor + intersection_over_area = np.amax(intersection_over_area, axis=0) # [N] tensor + keep_bool = np.greater_equal(intersection_over_area, np.array(minoverlap)) + keep_inds = np.nonzero(keep_bool)[0] + new_box_mask_list1 = gather(box_mask_list1, keep_inds) + return new_box_mask_list1 + + +def concatenate(box_mask_lists, fields=None): + """Concatenate list of box_mask_lists. 
+ + This op concatenates a list of input box_mask_lists into a larger + box_mask_list. It also + handles concatenation of box_mask_list fields as long as the field tensor + shapes are equal except for the first dimension. + + Args: + box_mask_lists: list of np_box_mask_list.BoxMaskList objects + fields: optional list of fields to also concatenate. By default, all + fields from the first BoxMaskList in the list are included in the + concatenation. + + Returns: + a box_mask_list with number of boxes equal to + sum([box_mask_list.num_boxes() for box_mask_list in box_mask_list]) + Raises: + ValueError: if box_mask_lists is invalid (i.e., is not a list, is empty, or + contains non box_mask_list objects), or if requested fields are not + contained in all box_mask_lists + """ + if fields is not None: + if 'masks' not in fields: + fields.append('masks') + return box_list_to_box_mask_list( + np_box_list_ops.concatenate(boxlists=box_mask_lists, fields=fields)) + + +def filter_scores_greater_than(box_mask_list, thresh): + """Filter to keep only boxes and masks with score exceeding a given threshold. + + This op keeps the collection of boxes and masks whose corresponding scores are + greater than the input threshold. + + Args: + box_mask_list: BoxMaskList holding N boxes and masks. Must contain a + 'scores' field representing detection scores. + thresh: scalar threshold + + Returns: + a BoxMaskList holding M boxes and masks where M <= N + + Raises: + ValueError: if box_mask_list not a np_box_mask_list.BoxMaskList object or + if it does not have a scores field + """ + if not isinstance(box_mask_list, np_box_mask_list.BoxMaskList): + raise ValueError('box_mask_list must be a BoxMaskList') + if not box_mask_list.has_field('scores'): + raise ValueError('input box_mask_list must have \'scores\' field') + scores = box_mask_list.get_field('scores') + if len(scores.shape) > 2: + raise ValueError('Scores should have rank 1 or 2') + if len(scores.shape) == 2 and scores.shape[1] != 1: + raise ValueError('Scores should have rank 1 or have shape ' + 'consistent with [None, 1]') + high_score_indices = np.reshape(np.where(np.greater(scores, thresh)), + [-1]).astype(np.int32) + return gather(box_mask_list, high_score_indices) diff --git a/hiertext_eval/evaluator/np_box_ops.py b/hiertext_eval/evaluator/np_box_ops.py new file mode 100644 index 0000000..d2ba703 --- /dev/null +++ b/hiertext_eval/evaluator/np_box_ops.py @@ -0,0 +1,105 @@ +# Lint as: python2, python3 +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +# copied from: https://github.com/tensorflow/models/tree/master/research/object_detection/utils + +"""Operations for [N, 4] numpy arrays representing bounding boxes. 
+ +Example box operations that are supported: + * Areas: compute bounding box areas + * IOU: pairwise intersection-over-union scores +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + + +def area(boxes): + """Computes area of boxes. + + Args: + boxes: Numpy array with shape [N, 4] holding N boxes + + Returns: + a numpy array with shape [N*1] representing box areas + """ + return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) + + +def intersection(boxes1, boxes2): + """Compute pairwise intersection areas between boxes. + + Args: + boxes1: a numpy array with shape [N, 4] holding N boxes + boxes2: a numpy array with shape [M, 4] holding M boxes + + Returns: + a numpy array with shape [N*M] representing pairwise intersection area + """ + [y_min1, x_min1, y_max1, x_max1] = np.split(boxes1, 4, axis=1) + [y_min2, x_min2, y_max2, x_max2] = np.split(boxes2, 4, axis=1) + + all_pairs_min_ymax = np.minimum(y_max1, np.transpose(y_max2)) + all_pairs_max_ymin = np.maximum(y_min1, np.transpose(y_min2)) + intersect_heights = np.maximum( + np.zeros(all_pairs_max_ymin.shape), + all_pairs_min_ymax - all_pairs_max_ymin) + all_pairs_min_xmax = np.minimum(x_max1, np.transpose(x_max2)) + all_pairs_max_xmin = np.maximum(x_min1, np.transpose(x_min2)) + intersect_widths = np.maximum( + np.zeros(all_pairs_max_xmin.shape), + all_pairs_min_xmax - all_pairs_max_xmin) + return intersect_heights * intersect_widths + + +def iou(boxes1, boxes2): + """Computes pairwise intersection-over-union between box collections. + + Args: + boxes1: a numpy array with shape [N, 4] holding N boxes. + boxes2: a numpy array with shape [M, 4] holding M boxes. + + Returns: + a numpy array with shape [N, M] representing pairwise iou scores. + """ + intersect = intersection(boxes1, boxes2) + area1 = area(boxes1) + area2 = area(boxes2) + union = np.expand_dims(area1, axis=1) + np.expand_dims( + area2, axis=0) - intersect + return intersect / union + + +def ioa(boxes1, boxes2): + """Computes pairwise intersection-over-area between box collections. + + Intersection-over-area (ioa) between two boxes box1 and box2 is defined as + their intersection area over box2's area. Note that ioa is not symmetric, + that is, IOA(box1, box2) != IOA(box2, box1). + + Args: + boxes1: a numpy array with shape [N, 4] holding N boxes. + boxes2: a numpy array with shape [M, 4] holding M boxes. + + Returns: + a numpy array with shape [N, M] representing pairwise ioa scores. + """ + intersect = intersection(boxes1, boxes2) + areas = np.expand_dims(area(boxes2), axis=0) + return intersect / areas diff --git a/hiertext_eval/evaluator/np_mask_ops.py b/hiertext_eval/evaluator/np_mask_ops.py new file mode 100644 index 0000000..115dae6 --- /dev/null +++ b/hiertext_eval/evaluator/np_mask_ops.py @@ -0,0 +1,133 @@ +# Lint as: python2, python3 +# Copyright 2017 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + +# copied and modified from: https://github.com/tensorflow/models/tree/master/research/object_detection/utils + +"""Operations for [N, height, width] numpy arrays representing masks. + +Example mask operations that are supported: + * Areas: compute mask areas + * IOU: pairwise intersection-over-union scores +""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +from . import np_box_list_ops + + +EPSILON = 1e-7 + + +def area(masks): + """Computes area of masks. + + Args: + masks: Numpy array with shape [N, height, width] holding N masks. Masks + values are of type np.uint8 and values are in {0,1}. + + Returns: + a numpy array with shape [N*1] representing mask areas. + + Raises: + ValueError: If masks.dtype is not np.uint8 + """ + if masks.dtype != np.uint8: + raise ValueError('Masks type should be np.uint8') + return np.sum(masks, axis=(1, 2), dtype=np.float32) + + +def intersection(mask_list_1, mask_list_2): + """Compute pairwise intersection areas between masks. + + Args: + mask_list_1: BoxMaskList holding N boxes and masks + mask_list_2: BoxMaskList holding M boxes and masks + + Returns: + a numpy array with shape [N*M] representing pairwise intersection area. + + Raises: + ValueError: If masks1 and masks2 are not of type np.uint8. + """ + masks1 = mask_list_1.get_masks() + masks2 = mask_list_2.get_masks() + box_intersection = np_box_list_ops.intersection(mask_list_1, mask_list_2) + if masks1.dtype != np.uint8 or masks2.dtype != np.uint8: + raise ValueError('masks1 and masks2 should be of type np.uint8') + n = masks1.shape[0] + m = masks2.shape[0] + answer = np.zeros([n, m], dtype=np.float32) + for i in np.arange(n): + for j in np.arange(m): + if box_intersection[i, j] > 0: + answer[i, j] = np.sum( + np.minimum(masks1[i], masks2[j]), dtype=np.float32) + return answer + + +def iou(mask_list_1, mask_list_2): + """Computes pairwise intersection-over-union between mask collections. + + Args: + mask_list_1: BoxMaskList holding N boxes and masks + mask_list_2: BoxMaskList holding M boxes and masks + + Returns: + a numpy array with shape [N, M] representing pairwise iou scores. + + Raises: + ValueError: If masks1 and masks2 are not of type np.uint8. + """ + masks1 = mask_list_1.get_masks() + masks2 = mask_list_2.get_masks() + if masks1.dtype != np.uint8 or masks2.dtype != np.uint8: + raise ValueError('masks1 and masks2 should be of type np.uint8') + intersect = intersection(mask_list_1, mask_list_2) + area1 = area(masks1) + area2 = area(masks2) + union = np.expand_dims(area1, axis=1) + np.expand_dims( + area2, axis=0) - intersect + return intersect / np.maximum(union, EPSILON) + + +def ioa(masks1, masks2): + """Computes pairwise intersection-over-area between box collections. + + Intersection-over-area (ioa) between two masks, mask1 and mask2 is defined as + their intersection area over mask2's area. Note that ioa is not symmetric, + that is, IOA(mask1, mask2) != IOA(mask2, mask1). + + Args: + masks1: a numpy array with shape [N, height, width] holding N masks. Masks + values are of type np.uint8 and values are in {0,1}. + masks2: a numpy array with shape [M, height, width] holding N masks. Masks + values are of type np.uint8 and values are in {0,1}. + + Returns: + a numpy array with shape [N, M] representing pairwise ioa scores. + + Raises: + ValueError: If masks1 and masks2 are not of type np.uint8. 
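+
+  Note: as adapted in this module, intersection() above expects BoxMaskList
+  inputs rather than raw mask arrays, so the masks passed here would need to
+  be wrapped accordingly before this helper can be used as documented.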
+ """ + if masks1.dtype != np.uint8 or masks2.dtype != np.uint8: + raise ValueError('masks1 and masks2 should be of type np.uint8') + intersect = intersection(masks1, masks2) + areas = np.expand_dims(area(masks2), axis=0) + return intersect / (areas + EPSILON) diff --git a/hiertext_eval/evaluator/polygon_list.py b/hiertext_eval/evaluator/polygon_list.py new file mode 100644 index 0000000..cd8cccf --- /dev/null +++ b/hiertext_eval/evaluator/polygon_list.py @@ -0,0 +1,86 @@ +"""Polygon list classes and functions.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +from shapely.geometry.polygon import Polygon +from shapely.validation import make_valid + +from . import np_box_list + + +class PolygonList(np_box_list.BoxList): + """Polygon collection. + + PolygonList represents a list of bounding polygons. A polygon with n vertices + is represented as a [n, 2] ndarray. + + Optionally, users can add additional related fields (such as + objectness/classification scores). + """ + + def __init__(self, data, set_aa_box=True): + """Constructs polygon collection. + + Args: + data: PolygonList accepts three kinds of representations: (1) a numpy + array of shape [L, N, 2] for L polygons with the same number of vertices + (2) a list of ndarray where the i-th ndarray has shape [n_i, 2]. It's + also stored in the 'boxes' field as a list of Polygon objects. (3) a + list of Polygon objects + set_aa_box: Whether to infer axis-aligned bounding boxes from data. + + Raises: + ValueError: if data format is incorrect. + """ + if isinstance(data, list): + for i in range(len(data)): + if not isinstance(data[i], np.ndarray): + continue + shape = data[i].shape + if len(shape) != 2: + raise ValueError(f"Shape ({shape}) is not supported for ndarray.") + n = shape[0] + if n < 3: + raise ValueError(f"Invalid number of polygon vertices (N={n}).") + if shape[1] != 2: + raise ValueError(f"Invalid polygon formats ({shape}).") + elif isinstance(data, np.ndarray): + shape = data.shape + if len(shape) != 3: + raise ValueError(f"Shape ({shape}) is not supported for ndarray.") + _, n, _ = shape + if n < 3: + raise ValueError(f"Invalid number of polygon vertices (N={n}).") + if shape[-1] != 2: + raise ValueError(f"Invalid polygon formats ({shape}).") + else: + raise ValueError(f"Type ({type(data)}) is not supported") + + self.data = {} + polys = [(Polygon(poly) if isinstance(poly, np.ndarray) else poly) for poly in data] + self.data["boxes"] = [poly if poly.is_valid else make_valid(poly) for poly in polys] + + # axis_aligned bboxes: + if set_aa_box: + self.data["aa_boxes"] = np.array( + [self.poly_to_axis_aligned_bboxes(poly) for poly in polys]) + + def num_boxes(self): + """Return number of boxes held in collections.""" + return len(self.data["boxes"]) + + def get_coordinates(self): + """This method is not supported for polygon.""" + raise NotImplementedError() + + def poly_to_axis_aligned_bboxes(self, poly): + """As titled.""" + x, y = poly.exterior.coords.xy + xmin = np.min(x) + xmax = np.max(x) + ymin = np.min(y) + ymax = np.max(y) + return [ymin, xmin, ymax, xmax] diff --git a/hiertext_eval/evaluator/polygon_ops.py b/hiertext_eval/evaluator/polygon_ops.py new file mode 100644 index 0000000..794eeed --- /dev/null +++ b/hiertext_eval/evaluator/polygon_ops.py @@ -0,0 +1,140 @@ +"""Operations for polygon lists.""" + +from absl import logging +import numpy as np + +from . import np_box_list +from . import np_box_list_ops +from . 
import polygon_list + + +EPSILON = 1e-5 + + +def area(polygons: polygon_list.PolygonList): + """Computes area of masks. + + Args: + polygons: A polygon_list.PolygonList object. + + Returns: + a numpy array with shape [N*1] representing polygons areas. + """ + n = polygons.num_boxes() + polygon_data = polygons.get() + return np.array([polygon_data[i].area for i in range(n)]) + + +def intersection_and_area(polygons1, polygons2): + """Computes area of masks. + + Args: + polygons1: A polygon_list.PolygonList object. + polygons2: A polygon_list.PolygonList object. + + Returns: + a numpy array with shape [N, M] representing intersections areas. + a numpy array with shape [N, 1] representing the area of polygons1. + a numpy array with shape [1, M] representing the area of polygons2. + """ + n = polygons1.num_boxes() + m = polygons2.num_boxes() + polygon_data1 = polygons1.get() + polygon_data2 = polygons2.get() + intersection_area = np.zeros((n, m), dtype=np.float) + area1 = area(polygons1) + area2 = area(polygons2) + + # Use axis-aligned bboxes for fast filtering. + aa_box_1 = np_box_list.BoxList(polygons1.get_field('aa_boxes')) + aa_box_2 = np_box_list.BoxList(polygons2.get_field('aa_boxes')) + intersection_aa_box = np_box_list_ops.intersection(aa_box_1, aa_box_2) + for i in range(n): + for j in range(m): + if intersection_aa_box[i, j] > 0: + if area1[i] < EPSILON or area2[j] < EPSILON: + if area1[i] < EPSILON: + logging.error( # pylint: disable=logging-format-interpolation + f'{i}-th text box in array 1 is empty: {polygon_data1[i]}') + if area2[j] < EPSILON: + logging.error( # pylint: disable=logging-format-interpolation + f'{j}-th text box in array 2 is empty: {polygon_data2[j]}') + intersection_area[i, j] = 0. + else: + intersection_area[i, j] = polygon_data1[i].intersection( + polygon_data2[j]).area + else: + intersection_area[i, j] = 0. + return intersection_area, area1.reshape(-1, 1), area2.reshape(1, -1) + + +def iou(polygons1, polygons2): + """Computes area of masks. + + Args: + polygons1: A polygon_list.PolygonList object. + polygons2: A polygon_list.PolygonList object. + + Returns: + a numpy array with shape [N, M] representing IoU. + """ + intersection_area, area1, area2 = intersection_and_area( + polygons1, polygons2) + union_area = area1 + area2 - intersection_area + return intersection_area / (union_area + EPSILON) + + +def ioa(polygons1, polygons2): + """Computes area of masks. + + Args: + polygons1: A polygon_list.PolygonList object. + polygons2: A polygon_list.PolygonList object. + + Returns: + a numpy array with shape [N, M] representing intersection over the area of + polygons2. + """ + intersection_area, _, area2 = intersection_and_area(polygons1, polygons2) + return intersection_area / (area2 + EPSILON) + + +def gather(polygons, indices, fields=None): + """Gather boxes from polygons according to indices and return new polygons. + + By default, gather returns PolygonList corresponding to the input index list, + as well as all additional fields stored in the PolygonList (indexing into the + first dimension). However one can optionally only gather from a + subset of fields. + + Args: + polygons: PolygonList holding N boxes + indices: a 1-d numpy array of type int_ + fields: (optional) list of fields to also gather from. If None (default), + all fields are gathered from. Pass an empty fields list to only gather + the box coordinates. 
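+
+  A minimal illustrative example (assumed values): keep only the second of two
+  polygons.
+
+    tri = np.array([[0., 0.], [0., 2.], [2., 0.]])
+    square = np.array([[0., 0.], [0., 1.], [1., 1.], [1., 0.]])
+    polygons = polygon_list.PolygonList([tri, square])
+    gather(polygons, np.array([1])).num_boxes()  # -> 1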
+ + Returns: + subboxlist: a PolygonList corresponding to the subset of the input + PolygonList specified by indices + + Raises: + ValueError: if specified field is not contained in boxlist or if the + indices are not of type int_ + """ + if indices.size: + if (np.amax(indices) >= polygons.num_boxes() or + np.amin(indices) < 0): + raise ValueError('indices are out of valid range.') + polygon_data = polygons.get() + sublist_polygon_data = polygon_list.PolygonList( + [polygon_data[index] for index in indices], + set_aa_box=False, + ) + if fields is None: + fields = polygons.get_extra_fields() + for field in fields: + extra_field_data = polygons.get_field(field) + sublist_polygon_data.add_field(field, extra_field_data[indices, ...]) + return sublist_polygon_data + diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..a7a40c3 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,15 @@ +numpy==1.23.5 +opencv-python==4.8.0.76 +pycocotools +matplotlib +absl-py +apache_beam +einops==0.6.1 +Polygon3==3.0.9.1 +pyclipper==1.3.0.post5 +python-Levenshtein +scikit-image==0.21.0 +scipy==1.10.1 +shapely==2.0.2 +timm +tqdm \ No newline at end of file diff --git a/train.py b/train.py new file mode 100644 index 0000000..b057190 --- /dev/null +++ b/train.py @@ -0,0 +1,341 @@ +import os +import argparse +import sys + +import numpy as np +import torch +import torch.optim as optim +import torch.nn as nn +import torch.nn.functional as F +from torch.autograd import Variable +import matplotlib.pyplot as plt +import cv2 +import random +from typing import Dict, List, Tuple +import time +import datetime + +from hi_sam.modeling.build import model_registry +from hi_sam.modeling.loss import loss_masks, loss_hi_masks, loss_iou_mse, loss_hi_iou_mse +from hi_sam.data.dataloader import get_im_gt_name_dict, create_dataloaders, eval_transforms, custom_collate_fn +from hi_sam.evaluation import Evaluator +import utils.misc as misc +import warnings +warnings.filterwarnings("ignore") + + +def get_args_parser(): + parser = argparse.ArgumentParser('Hi-SAM', add_help=False) + + parser.add_argument("--output", type=str, default="work_dirs/", + help="Path to the directory where masks and checkpoints will be output") + parser.add_argument("--model-type", type=str, default="vit_l", + help="The type of model to load, in ['vit_h', 'vit_l', 'vit_b']") + parser.add_argument("--checkpoint", type=str, required=True, + help="The path to the SAM checkpoint to use for mask generation.") + parser.add_argument("--device", type=str, default="cuda", + help="The device to run generation on.") + parser.add_argument("--train_datasets", type=str, nargs='+', default=['totaltext_train']) + parser.add_argument("--val_datasets", type=str, nargs='+', default=['totaltext_test']) + parser.add_argument("--hier_det", action='store_true', + help="If False, only text stroke segmentation.") + + parser.add_argument('--seed', default=42, type=int) + parser.add_argument('--lr', default=1e-4, type=float) + parser.add_argument('--lr_charac_mask_decoder_name', default=["mask_decoder"], type=str, nargs='+') + parser.add_argument('--lr_charac_mask_decoder', default=1e-4, type=float) + parser.add_argument('--start_epoch', default=0, type=int) + parser.add_argument('--lr_drop_epoch', default=40, type=int) + parser.add_argument('--max_epoch_num', default=50, type=int) + parser.add_argument('--input_size', default=[1024, 1024], type=list) + parser.add_argument('--batch_size_train', default=1, type=int) + parser.add_argument('--batch_size_valid', 
default=1, type=int) + parser.add_argument('--valid_period', default=1, type=int) + parser.add_argument('--model_save_fre', default=90, type=int) + + parser.add_argument('--world_size', default=1, type=int, + help='number of distributed processes') + parser.add_argument('--dist_url', default='env://', help='url used to set up distributed training') + parser.add_argument('--rank', default=0, type=int, + help='number of distributed processes') + parser.add_argument('--local_rank', type=int, help='local rank for dist') + parser.add_argument('--find_unused_params', action='store_true') + + parser.add_argument('--eval', action='store_true') + + # self-prompting + parser.add_argument('--attn_layers', default=1, type=int, + help='The number of image to token cross attention layers in model_aligner') + parser.add_argument('--prompt_len', default=12, type=int, help='The number of prompt token') + + return parser.parse_args() + + +def main(train_datasets, valid_datasets, args): + + misc.init_distributed_mode(args) + print('world size: {}'.format(args.world_size)) + print('rank: {}'.format(args.rank)) + print('local_rank: {}'.format(args.local_rank)) + print("args: " + str(args) + '\n') + + seed = args.seed + misc.get_rank() + torch.manual_seed(seed) + np.random.seed(seed) + random.seed(seed) + torch.cuda.manual_seed(seed) + torch.cuda.manual_seed_all(seed) + torch.backends.cudnn.benchmark = False + torch.backends.cudnn.deterministic = True + + ### --- Step 1: Train or Valid dataset --- + if not args.eval: + raise NotImplementedError + + print("--- create valid dataloader ---") + valid_datasets_names = [val_ds["name"] for val_ds in valid_datasets] + valid_im_gt_list = get_im_gt_name_dict(valid_datasets, flag="valid") + valid_dataloaders, valid_datasets = create_dataloaders( + valid_im_gt_list, + my_transforms=eval_transforms, + batch_size=args.batch_size_valid, + training=False + ) + print(len(valid_dataloaders), " valid dataloaders created") + + ### --- Step 2: DistributedDataParallel--- + model = model_registry[args.model_type](args=args) + if torch.cuda.is_available(): + model.cuda() + model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu], find_unused_parameters=args.find_unused_params) + model_without_ddp = model.module + + ### --- Step 3: Train or Evaluate --- + if not args.eval: + raise NotImplementedError + else: + print("restore model from:", args.checkpoint) + evaluate(args, model, valid_dataloaders, valid_datasets_names) + + +def inference_on_dataset(model, data_loader, data_name, evaluator, args): + print("Start inference on {}, {} batches".format(data_name, len(data_loader))) + num_devices = misc.get_world_size() + total = len(data_loader) + evaluator.reset() + num_warmup = min(5, total - 1) + start_time = time.perf_counter() + total_data_time = 0 + total_compute_time = 0 + total_eval_time = 0 + + start_data_time = time.perf_counter() + for idx_val, data_val in enumerate(data_loader): + inputs_val, labels_ori = data_val['image'], data_val['ori_label'] + ignore_mask = data_val.get('ignore_mask', None) + if torch.cuda.is_available(): + labels_ori = labels_ori.cuda() + batched_input = [] + for b_i in range(len(inputs_val)): + dict_input = dict() + dict_input['image'] = inputs_val[b_i].to(model.device).contiguous() + dict_input['original_size'] = labels_ori[b_i].shape[-2:] + batched_input.append(dict_input) + + total_data_time += time.perf_counter() - start_data_time + if idx_val == num_warmup: + start_time = time.perf_counter() + total_data_time = 0 + 
total_compute_time = 0 + total_eval_time = 0 + + start_compute_time = time.perf_counter() + with torch.no_grad(): + up_masks_logits, up_masks, iou_output, hr_masks_logits, hr_masks, hr_iou_output = model( + batched_input, multimask_output=False + ) + if torch.cuda.is_available(): + torch.cuda.synchronize() + total_compute_time += time.perf_counter() - start_compute_time + + start_eval_time = time.perf_counter() + evaluator.process(up_masks, hr_masks, labels_ori, ignore_mask) + total_eval_time += time.perf_counter() - start_eval_time + + iters_after_start = idx_val + 1 - num_warmup * int(idx_val >= num_warmup) + data_seconds_per_iter = total_data_time / iters_after_start + compute_seconds_per_iter = total_compute_time / iters_after_start + eval_seconds_per_iter = total_eval_time / iters_after_start + total_seconds_per_iter = (time.perf_counter() - start_time) / iters_after_start + if (idx_val+1) % 20 == 0: + eta = datetime.timedelta(seconds=int(total_seconds_per_iter * (total - idx_val - 1))) + print( + f"Inference done [{idx_val + 1}]/[{total}]. ", + f"Dataloading: {data_seconds_per_iter:.4f} s/iter. ", + f"Inference: {compute_seconds_per_iter:.4f} s/iter. ", + f"Eval: {eval_seconds_per_iter:.4f} s/iter. ", + f"Total: {total_seconds_per_iter:.4f} s/iter. ", + f"ETA={eta}" + ) + start_data_time = time.perf_counter() + + total_time = time.perf_counter() - start_time + total_time_str = str(datetime.timedelta(seconds=total_time)) + print( + "Total inference time: {} ({:.6f} s / iter per device, on {} devices)".format( + total_time_str, total_time / (total - num_warmup), num_devices + ) + ) + total_compute_time_str = str(datetime.timedelta(seconds=int(total_compute_time))) + print( + "Total inference pure compute time: {} ({:.6f} s / iter per device, on {} devices)".format( + total_compute_time_str, total_compute_time / (total - num_warmup), num_devices + ) + ) + + results = evaluator.evaluate() + if results is None: + results = {} + + return results + + +def evaluate(args, model, valid_dataloaders, valid_datasets_names): + model.eval() + test_stats = {} + + for k in range(len(valid_dataloaders)): + metric_logger = misc.MetricLogger(delimiter=" ") + valid_dataloader = valid_dataloaders[k] + valid_dataset_name = valid_datasets_names[k] + evaluator = Evaluator(valid_dataset_name, args, True) + print('============================') + results_k = inference_on_dataset(model, valid_dataloader, valid_dataset_name, evaluator, args) + print("Evaluation results for {}:".format(valid_dataset_name)) + for task, res in results_k.items(): + if '_hr' not in task: + print(f"copypaste: {task}={res}, {task}_hr={results_k[task+'_hr']}") + print('============================') + test_stats.update({valid_dataset_name: results_k}) + + return test_stats + + +if __name__ == "__main__": + + # train + totaltext_train = { + "name": "TotalText-train", + "im_dir": "./datasets/TotalText/Images/Train", + "gt_dir": "./datasets/TotalText/groundtruth_pixel/Train", + "im_ext": ".jpg", + "gt_ext": ".jpg", + } + hiertext_train = { + "name": "HierText-train", + "im_dir": "./datasets/HierText/train", + "gt_dir": "./datasets/HierText/train_gt", + "im_ext": ".jpg", + "gt_ext": ".png", + "json_dir": "./datasets/HierText/train_shrink_vert.json" + } + textseg_train = { + "name": "TextSeg-train", + "im_dir": "./datasets/TextSeg/train_images", + "gt_dir": "./datasets/TextSeg/train_gt", + "im_ext": ".jpg", + "gt_ext": ".png" + } + cocots_train = { + "name": "COCO_TS-train", + "im_dir": "./datasets/COCO_TS/train_images", + "gt_dir": 
"./datasets/COCO_TS/COCO_TS_labels", + "im_ext": ".jpg", + "gt_ext": ".png" + } + cocots_train_hier = { + "name": "COCO_TS-train", + "im_dir": "./datasets/COCO_TS/train_images", + "gt_dir": "./datasets/COCO_TS/hier-model_labels", + "im_ext": ".jpg", + "gt_ext": ".png" + } + cocots_train_tt = { + "name": "COCO_TS-train", + "im_dir": "./datasets/COCO_TS/train_images", + "gt_dir": "./datasets/COCO_TS/tt-model_labels", + "im_ext": ".jpg", + "gt_ext": ".png" + } + cocots_train_textseg = { + "name": "COCO_TS-train", + "im_dir": "./datasets/COCO_TS/train_images", + "gt_dir": "./datasets/COCO_TS/textseg-model_labels", + "im_ext": ".jpg", + "gt_ext": ".png" + } + + train_dataset_map = { + 'totaltext_train': totaltext_train, + 'hiertext_train': hiertext_train, + 'textseg_train': textseg_train, + 'cocots_train': cocots_train, + 'cocots_train_hier': cocots_train_hier, + 'cocots_train_tt': cocots_train_tt, + 'cocots_train_textseg': cocots_train_textseg, + } + + # validation and test + totaltext_test = { + "name": "TotalText-test", + "im_dir": "./datasets/TotalText/Images/Test", + "gt_dir": "./datasets/TotalText/groundtruth_pixel/Test", + "im_ext": ".jpg", + "gt_ext": ".jpg" + } + hiertext_val = { + "name": "HierText-val", + "im_dir": "./datasets/HierText/validation", + "gt_dir": "./datasets/HierText/validation_gt", + "im_ext": ".jpg", + "gt_ext": ".png" + } + hiertext_test = { + "name": "HierText-test", + "im_dir": "./datasets/HierText/test", + "gt_dir": "./datasets/HierText/test_gt", + "im_ext": ".jpg", + "gt_ext": ".png" + } + textseg_val = { + "name": "TextSeg-val", + "im_dir": "./datasets/TextSeg/val_images", + "gt_dir": "./datasets/TextSeg/val_gt", + "im_ext": ".jpg", + "gt_ext": ".png" + } + textseg_test = { + "name": "TextSeg-test", + "im_dir": "./datasets/TextSeg/test_images", + "gt_dir": "./datasets/TextSeg/test_gt", + "im_ext": ".jpg", + "gt_ext": ".png" + } + val_dataset_map = { + 'totaltext_test': totaltext_test, + 'hiertext_val': hiertext_val, + 'hiertext_test': hiertext_test, + 'textseg_val': textseg_val, + 'textseg_test': textseg_test, + } + + train_datasets = [] + val_datasets = [] + args = get_args_parser() + + for ds_name in args.train_datasets: + train_datasets.append(train_dataset_map[ds_name]) + for ds_name in args.val_datasets: + val_datasets.append(val_dataset_map[ds_name]) + + main(train_datasets, val_datasets, args) diff --git a/utils/__init__.py b/utils/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/utils/misc.py b/utils/misc.py new file mode 100644 index 0000000..05756ac --- /dev/null +++ b/utils/misc.py @@ -0,0 +1,438 @@ +# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved +""" +Misc functions, including distributed helpers. + +Mostly copy-paste from torchvision references. +""" +import os +import random +import subprocess +import sys +import time +from collections import OrderedDict, defaultdict, deque +import datetime +import pickle +from typing import Optional, List +import functools + +import json, time +import numpy as np +import torch +import torch.distributed as dist +from torch import Tensor + +import colorsys +import torch.nn.functional as F + +import cv2 +from PIL import Image +import matplotlib.pyplot as plt + +# needed due to empty tensor bug in pytorch and torchvision 0.5 +import torchvision + + +class SmoothedValue(object): + """Track a series of values and provide access to smoothed values over a + window or the global series average. 
+ """ + + def __init__(self, window_size=20, fmt=None): + if fmt is None: + fmt = "{median:.4f} ({global_avg:.4f})" + self.deque = deque(maxlen=window_size) + self.total = 0.0 + self.count = 0 + self.fmt = fmt + + def update(self, value, n=1): + self.deque.append(value) + self.count += n + self.total += value * n + + def synchronize_between_processes(self): + """ + Warning: does not synchronize the deque! + """ + if not is_dist_avail_and_initialized(): + return + t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda') + dist.barrier() + dist.all_reduce(t) + t = t.tolist() + self.count = int(t[0]) + self.total = t[1] + + @property + def median(self): + d = torch.tensor(list(self.deque)) + if d.shape[0] == 0: + return 0 + return d.median().item() + + @property + def avg(self): + d = torch.tensor(list(self.deque), dtype=torch.float32) + return d.mean().item() + + @property + def global_avg(self): + return self.total / self.count + + @property + def max(self): + return max(self.deque) + + @property + def value(self): + return self.deque[-1] + + def __str__(self): + return self.fmt.format( + median=self.median, + avg=self.avg, + global_avg=self.global_avg, + max=self.max, + value=self.value) + + +def all_gather(data): + """ + Run all_gather on arbitrary picklable data (not necessarily tensors) + Args: + data: any picklable object + Returns: + list[data]: list of data gathered from each rank + """ + world_size = get_world_size() + if world_size == 1: + return [data] + + # serialized to a Tensor + buffer = pickle.dumps(data) + storage = torch.ByteStorage.from_buffer(buffer) + tensor = torch.ByteTensor(storage).to("cuda") + + # obtain Tensor size of each rank + local_size = torch.tensor([tensor.numel()], device="cuda") + size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)] + dist.all_gather(size_list, local_size) + size_list = [int(size.item()) for size in size_list] + max_size = max(size_list) + + # receiving Tensor from all ranks + # we pad the tensor because torch all_gather does not support + # gathering tensors of different shapes + tensor_list = [] + for _ in size_list: + tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda")) + if local_size != max_size: + padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda") + tensor = torch.cat((tensor, padding), dim=0) + dist.all_gather(tensor_list, tensor) + + data_list = [] + for size, tensor in zip(size_list, tensor_list): + buffer = tensor.cpu().numpy().tobytes()[:size] + data_list.append(pickle.loads(buffer)) + + return data_list + + +def reduce_dict(input_dict, average=True): + """ + Args: + input_dict (dict): all the values will be reduced + average (bool): whether to do average or sum + Reduce the values in the dictionary from all processes so that all processes + have the averaged results. Returns a dict with the same fields as + input_dict, after reduction. 
+ """ + world_size = get_world_size() + if world_size < 2: + return input_dict + with torch.no_grad(): + names = [] + values = [] + # sort the keys so that they are consistent across processes + for k in sorted(input_dict.keys()): + names.append(k) + values.append(input_dict[k]) + values = torch.stack(values, dim=0) + dist.all_reduce(values) + if average: + values /= world_size + reduced_dict = {k: v for k, v in zip(names, values)} + return reduced_dict + + +class MetricLogger(object): + def __init__(self, delimiter="\t"): + self.meters = defaultdict(SmoothedValue) + self.delimiter = delimiter + + def update(self, **kwargs): + for k, v in kwargs.items(): + if isinstance(v, torch.Tensor): + v = v.item() + assert isinstance(v, (float, int)) + self.meters[k].update(v) + + def __getattr__(self, attr): + if attr in self.meters: + return self.meters[attr] + if attr in self.__dict__: + return self.__dict__[attr] + raise AttributeError("'{}' object has no attribute '{}'".format( + type(self).__name__, attr)) + + def __str__(self): + loss_str = [] + for name, meter in self.meters.items(): + # print(name, str(meter)) + # import ipdb;ipdb.set_trace() + if meter.count > 0: + loss_str.append( + "{}: {}".format(name, str(meter)) + ) + return self.delimiter.join(loss_str) + + def synchronize_between_processes(self): + for meter in self.meters.values(): + meter.synchronize_between_processes() + + def add_meter(self, name, meter): + self.meters[name] = meter + + def log_every(self, iterable, print_freq, header=None, logger=None): + if logger is None: + print_func = print + else: + print_func = logger.info + + i = 0 + if not header: + header = '' + start_time = time.time() + end = time.time() + iter_time = SmoothedValue(fmt='{avg:.4f}') + data_time = SmoothedValue(fmt='{avg:.4f}') + space_fmt = ':' + str(len(str(len(iterable)))) + 'd' + if torch.cuda.is_available(): + log_msg = self.delimiter.join([ + header, + '[{0' + space_fmt + '}/{1}]', + 'eta: {eta}', + '{meters}', + 'time: {time}', + 'data: {data}', + 'max mem: {memory:.0f}' + ]) + else: + log_msg = self.delimiter.join([ + header, + '[{0' + space_fmt + '}/{1}]', + 'eta: {eta}', + '{meters}', + 'time: {time}', + 'data: {data}' + ]) + MB = 1024.0 * 1024.0 + for obj in iterable: + data_time.update(time.time() - end) + yield obj + + iter_time.update(time.time() - end) + if i % print_freq == 0 or i == len(iterable) - 1: + eta_seconds = iter_time.global_avg * (len(iterable) - i) + eta_string = str(datetime.timedelta(seconds=int(eta_seconds))) + if torch.cuda.is_available(): + print_func(log_msg.format( + i, len(iterable), eta=eta_string, + meters=str(self), + time=str(iter_time), data=str(data_time), + memory=torch.cuda.max_memory_allocated() / MB)) + else: + print_func(log_msg.format( + i, len(iterable), eta=eta_string, + meters=str(self), + time=str(iter_time), data=str(data_time))) + i += 1 + end = time.time() + total_time = time.time() - start_time + total_time_str = str(datetime.timedelta(seconds=int(total_time))) + print_func('{} Total time: {} ({:.4f} s / it)'.format( + header, total_time_str, total_time / len(iterable))) + + +def get_sha(): + cwd = os.path.dirname(os.path.abspath(__file__)) + + def _run(command): + return subprocess.check_output(command, cwd=cwd).decode('ascii').strip() + sha = 'N/A' + diff = "clean" + branch = 'N/A' + try: + sha = _run(['git', 'rev-parse', 'HEAD']) + subprocess.check_output(['git', 'diff'], cwd=cwd) + diff = _run(['git', 'diff-index', 'HEAD']) + diff = "has uncommited changes" if diff else "clean" + branch = 
_run(['git', 'rev-parse', '--abbrev-ref', 'HEAD']) + except Exception: + pass + message = f"sha: {sha}, status: {diff}, branch: {branch}" + return message + + +def setup_for_distributed(is_master): + """ + This function disables printing when not in master process + """ + import builtins as __builtin__ + builtin_print = __builtin__.print + + def print(*args, **kwargs): + force = kwargs.pop('force', False) + if is_master or force: + builtin_print(*args, **kwargs) + + __builtin__.print = print + + +def is_dist_avail_and_initialized(): + if not dist.is_available(): + return False + if not dist.is_initialized(): + return False + return True + + +def get_world_size(): + if not is_dist_avail_and_initialized(): + return 1 + return dist.get_world_size() + + +def get_rank(): + if not is_dist_avail_and_initialized(): + return 0 + return dist.get_rank() + + +def is_main_process(): + return get_rank() == 0 + + +def save_on_master(*args, **kwargs): + if is_main_process(): + torch.save(*args, **kwargs) + + +def init_distributed_mode(args): + if 'WORLD_SIZE' in os.environ and os.environ['WORLD_SIZE'] != '': # 'RANK' in os.environ and + # args.rank = int(os.environ["RANK"]) + # args.world_size = int(os.environ['WORLD_SIZE']) + # args.gpu = args.local_rank = int(os.environ['LOCAL_RANK']) + + # launch by torch.distributed.launch + # Single node + # python -m torch.distributed.launch --nproc_per_node=8 main.py --world-size 1 --rank 0 ... + # Multi nodes + # python -m torch.distributed.launch --nproc_per_node=8 main.py --world-size 2 --rank 0 --dist-url 'tcp://IP_OF_NODE0:FREEPORT' ... + # python -m torch.distributed.launch --nproc_per_node=8 main.py --world-size 2 --rank 1 --dist-url 'tcp://IP_OF_NODE0:FREEPORT' ... + local_world_size = int(os.environ['WORLD_SIZE']) + args.world_size = args.world_size * local_world_size + args.gpu = args.local_rank = int(os.environ['LOCAL_RANK']) + args.rank = args.rank * local_world_size + args.local_rank + print('world size: {}, rank: {}, local rank: {}'.format(args.world_size, args.rank, args.local_rank)) + # print(json.dumps(dict(os.environ), indent=2)) + elif 'SLURM_PROCID' in os.environ: + args.rank = int(os.environ['SLURM_PROCID']) + args.gpu = args.local_rank = int(os.environ['SLURM_LOCALID']) + args.world_size = int(os.environ['SLURM_NPROCS']) + + print('world size: {}, world rank: {}, local rank: {}, device_count: {}'.format(args.world_size, args.rank, args.local_rank, torch.cuda.device_count())) + else: + print('Not using distributed mode') + args.distributed = False + args.world_size = 1 + args.rank = 0 + args.local_rank = 0 + return + + print("world_size:{} rank:{} local_rank:{}".format(args.world_size, args.rank, args.local_rank)) + args.distributed = True + torch.cuda.set_device(args.local_rank) + args.dist_backend = 'nccl' + print('| distributed init (rank {}): {}'.format(args.rank, args.dist_url), flush=True) + torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url, + world_size=args.world_size, rank=args.rank) + print("Before torch.distributed.barrier()") + torch.distributed.barrier() + print("End torch.distributed.barrier()") + setup_for_distributed(args.rank == 0) + + +def synchronize(): + """ + Helper function to synchronize (barrier) among all processes when + using distributed training + """ + if not dist.is_available(): + return + if not dist.is_initialized(): + return + world_size = dist.get_world_size() + if world_size == 1: + return + if dist.get_backend() == dist.Backend.NCCL: + # This argument is needed to avoid 
warnings. + # It's valid only for NCCL backend. + dist.barrier(device_ids=[torch.cuda.current_device()]) + else: + dist.barrier() + + +@functools.lru_cache() +def _get_global_gloo_group(): + """ + Return a process group based on gloo backend, containing all the ranks + The result is cached. + """ + if dist.get_backend() == "nccl": + return dist.new_group(backend="gloo") + else: + return dist.group.WORLD + + +def gather(data, dst=0, group=None): + """ + Run gather on arbitrary picklable data (not necessarily tensors). + + Args: + data: any picklable object + dst (int): destination rank + group: a torch process group. By default, will use a group which + contains all ranks on gloo backend. + + Returns: + list[data]: on dst, a list of data gathered from each rank. Otherwise, + an empty list. + """ + if get_world_size() == 1: + return [data] + if group is None: + group = _get_global_gloo_group() + world_size = dist.get_world_size(group=group) + if world_size == 1: + return [data] + rank = dist.get_rank(group=group) + + if rank == dst: + output = [None for _ in range(world_size)] + dist.gather_object(data, output, dst=dst, group=group) + return output + else: + dist.gather_object(data, None, dst=dst, group=group) + return [] diff --git a/utils/utilities.py b/utils/utilities.py new file mode 100644 index 0000000..ad3b845 --- /dev/null +++ b/utils/utilities.py @@ -0,0 +1,38 @@ +import collections +from typing import List, Optional, Union + + +class DisjointSet: + """ + A disjoint set implementation from HierText + github.com/tensorflow/models/blob/master/official/projects/unified_detector/utils/utilities.py + """ + + def __init__(self, num_elements: int): + self._num_elements = num_elements + self._parent = list(range(num_elements)) + + def find(self, item: int) -> int: + if self._parent[item] == item: + return item + else: + self._parent[item] = self.find(self._parent[item]) + return self._parent[item] + + def union(self, i1: int, i2: int) -> None: + r1 = self.find(i1) + r2 = self.find(i2) + self._parent[r1] = r2 + + def to_group(self) -> List[List[int]]: + """Return the grouping results. + + Returns: + A list of integer lists. Each list represents the IDs belonging to the + same group. + """ + groups = collections.defaultdict(list) + for i in range(self._num_elements): + r = self.find(i) + groups[r].append(i) + return list(groups.values())
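
The polygon utilities and the `DisjointSet` added in this patch can be exercised together with a few lines of Python. The sketch below is illustrative only and is not part of the diff: it assumes the repository root is on `PYTHONPATH` so that `hiertext_eval.evaluator` and `utils` resolve as packages, and the polygons and the 0.5 IoU threshold are made up for the example. It computes pairwise polygon IoU with `polygon_ops.iou` and then merges overlapping polygons into groups with `DisjointSet`, in the spirit of how grouped text entities can be formed from per-instance polygons.

```python
"""Minimal usage sketch (illustrative, not part of the diff).

Assumes it is run from the repository root so that `hiertext_eval.evaluator`
and `utils` are importable as packages; the polygons and the 0.5 threshold
are hypothetical values chosen for demonstration.
"""
import numpy as np

from hiertext_eval.evaluator import polygon_list, polygon_ops
from utils.utilities import DisjointSet

# Three quadrilaterals as an [L, N, 2] array: the first two overlap heavily,
# the third is far away from both.
quads = np.array([
    [[0.0, 0.0], [2.0, 0.0], [2.0, 1.0], [0.0, 1.0]],
    [[0.5, 0.0], [2.5, 0.0], [2.5, 1.0], [0.5, 1.0]],
    [[5.0, 5.0], [6.0, 5.0], [6.0, 6.0], [5.0, 6.0]],
])
polys = polygon_list.PolygonList(quads)

pairwise_iou = polygon_ops.iou(polys, polys)  # [3, 3] matrix, diagonal close to 1.0

# Merge polygons whose pairwise IoU exceeds a threshold into groups.
ds = DisjointSet(polys.num_boxes())
for i in range(polys.num_boxes()):
    for j in range(i + 1, polys.num_boxes()):
        if pairwise_iou[i, j] > 0.5:
            ds.union(i, j)

print(ds.to_group())  # expected: [[0, 1], [2]]
```

Note that `intersection_and_area` only falls back to Shapely's exact polygon intersection when the axis-aligned bounding boxes already overlap, so the pairwise loop stays cheap even for images with many text instances.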