1. Abnormality detection in mammography
The objective is to perform abnormality classification in mammography using Convolutional Neural Networks.
The dataset we will focus on is CBIS-DDSM: Curated Breast Imaging Subset of Digital Database for Screening Mammography.
The original data, along with a detailed description of the collection and the policies about usage and citation, can be found here: https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM
This collection is freely available to browse, download, and use for commercial, scientific and educational purposes as outlined in the Creative Commons Attribution 3.0 Unported License.
A description of the dataset is provided in:
Lee, Rebecca Sawyer, et al. "A curated mammography data set for use in computer-aided detection and diagnosis research." Scientific data 4 (2017): 170177. URL: https://www.nature.com/articles/sdata2017177
The original images are in DICOM format, the standard format for the communication and management of medical imaging information and related data.
The CBIS-DDSM dataset contains images of two classes of abnormalities. It therefore supports an abnormality classification task, which aims at distinguishing the following classes:
- Mass
- Calcification
Furthermore, several csv files are hosted here and provide a detailed description of each image, according to the following fields:
patient_id, breast density, left or right breast, image view, abnormality id, abnormality type, calc type, calc distribution, assessment, pathology, subtlety, image file path, cropped image file path, ROI mask file path
This description enables a second, more fine-grained task: abnormality diagnosis classification. It aims at distinguishing the following classes:
- Mass, Benign (with or without callback)
- Mass, Malignant
- Calcification, Benign
- Calcification, Malignant
In the following you can find a sample image from the original dataset:
Please note that:
- Full images have a high resolution, e.g. 3000x4000 pixels
- Full images and patches are grayscale images with a depth of 16 bits
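Because of the 16-bit depth, pixel values span a much wider range than the 8-bit images many pipelines assume. The following is a minimal sketch of one common preprocessing step, not necessarily the one intended by the project: it assumes the patches are stored as unsigned 16-bit integers (an assumption, check `images.dtype` on the real arrays), rescales them to floats in [0, 1], and appends a channel axis. The helper name `to_unit_float` is hypothetical.

```python
import numpy as np

def to_unit_float(images, max_value=65535.0):
    """Rescale integer grayscale images to float32 in [0, 1] and add a channel axis."""
    # max_value=65535 assumes the full unsigned 16-bit range is used.
    scaled = images.astype(np.float32) / max_value
    return scaled[..., np.newaxis]

# Synthetic stand-in for two 16-bit 4x4 patches (all black, all white).
demo = np.array([[[0] * 4] * 4, [[65535] * 4] * 4], dtype=np.uint16)
out = to_unit_float(demo)
print(out.shape, out.dtype)
```

The trailing channel axis turns a `(N, H, W)` tensor into `(N, H, W, 1)`, the layout that 2D convolutional layers in tf.keras expect by default.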
Working with the original dataset is challenging, since full images are high resolution and the DICOM format is not natively supported in tf.keras.
For this reason, you are provided with numpy arrays containing images and labels from the training and test sets.
The steps performed on each original image are described below:
- the abnormality patch has been extracted from the original image according to the binary mask;
- a patch of healthy tissue (baseline patch) adjacent to the abnormality patch has been extracted from the original image (left, right, top or bottom, with no overlap). Abnormality and baseline patches have been added to the images tensor only as a pair: an abnormality patch has been discarded if a related baseline patch could not be extracted.
- both abnormality patch and baseline patch have been resized to 150x150 pixels using the OpenCV resize function:
cv2.resize(img, dsize=(shape, shape), interpolation=cv2.INTER_NEAREST)
- class labels have been assigned to the patches according to the following mapping:
- 0: Baseline patch
- 1: Mass, benign
- 2: Mass, malignant
- 3: Calcification, benign
- 4: Calcification, malignant
- images of baseline patch and abnormality patch, and their related labels, have been added to distinct numpy arrays for images and labels.
- train_tensor.npy: images tensor for training
- train_labels.npy: labels tensor for training
- public_test_tensor.npy: images tensor for test
- public_test_labels.npy: labels tensor for test
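- The images tensor of a private test set is also provided. The corresponding labels tensor is not published within the project files.
- private_test_tensor.npy: images tensor for the private test set
- private_test_labels.npy: labels tensor for the private test set (not published)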
Training set:
- images tensor shape (5352, 150, 150)
- labels tensor shape (5352,)

Public Test set:
- images tensor shape (672, 150, 150)
- labels tensor shape (672,)
Train:

| | benign | malignant | total |
|---|---|---|---|
| Train Masses | 620 | 598 | 1218 |
| Train Calcification | 948 | 510 | 1458 |
| Total | 1568 | 1108 | 2676 |
Test:

| | benign | malignant | total |
|---|---|---|---|
| Global Test Masses | 214 | 144 | 358 |
| Global Test Calcification | 192 | 122 | 314 |
| Total | 406 | 266 | 672 |

| | right | left | top | bottom | total baselines |
|---|---|---|---|---|---|
| Baseline for train Masses | 836 | 240 | 89 | 53 | 1218 |
| Baseline for train Calcification | 894 | 246 | 163 | 155 | 1458 |
| Baseline for global Test Masses | 241 | 78 | 27 | 12 | 358 |
| Baseline for global Test Calcification | 180 | 50 | 42 | 42 | 314 |
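The figures in these tables can serve as a sanity check after loading the labels. The sketch below assumes the training tables are exact and combines them with the 0 to 4 label mapping listed earlier, which would give 2676 baseline patches and 620, 598, 948 and 510 patches for labels 1 to 4. The helper `class_counts` and the constant `EXPECTED_TRAIN_COUNTS` are hypothetical names, and the labels used here are synthetic stand-ins for `train_labels.npy`.

```python
import numpy as np

# Counts implied by the training tables (assumption: the tables are exact).
EXPECTED_TRAIN_COUNTS = {0: 2676, 1: 620, 2: 598, 3: 948, 4: 510}

def class_counts(labels, num_classes=5):
    """Return {class_id: number of patches} for an integer label array."""
    counts = np.bincount(labels, minlength=num_classes)
    return {cls: int(n) for cls, n in enumerate(counts)}

# Synthetic labels built from the expected counts, standing in for the real file.
demo_labels = np.concatenate(
    [np.full(n, cls, dtype=np.int64) for cls, n in EXPECTED_TRAIN_COUNTS.items()]
)
counts = class_counts(demo_labels)
print(counts == EXPECTED_TRAIN_COUNTS, sum(counts.values()))
```

On the real data, replacing `demo_labels` with the loaded training labels and comparing against the expected dictionary would flag any mismatch between the files and the tables.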
In light of the procedure described above, particular attention should be paid to the structure of the input tensors:
- odd indices [2*i + 1 for i in range(len(tensor) // 2)] refer to abnormality patches
- the preceding even indices [2*i for i in range(len(tensor) // 2)] refer to the respective baseline patches
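The even/odd layout described above maps directly onto strided slicing. The following is a minimal sketch, assuming the images and labels arrays are interleaved exactly as stated; `split_pairs` is a hypothetical helper and the arrays are small synthetic stand-ins.

```python
import numpy as np

def split_pairs(images, labels):
    """Split interleaved arrays into (baseline, abnormality) images and labels."""
    if len(images) != len(labels) or len(images) % 2 != 0:
        raise ValueError("expected equally sized arrays of even length")
    # Even positions hold baseline patches, odd positions the paired abnormality.
    baseline_x, abnormal_x = images[0::2], images[1::2]
    baseline_y, abnormal_y = labels[0::2], labels[1::2]
    return (baseline_x, baseline_y), (abnormal_x, abnormal_y)

# Synthetic stand-in: 3 pairs of 2x2 patches.
demo_images = np.arange(6 * 2 * 2).reshape(6, 2, 2)
demo_labels = np.array([0, 2, 0, 1, 0, 4])
(base_x, base_y), (abn_x, abn_y) = split_pairs(demo_images, demo_labels)
print(base_y, abn_y)  # [0 0 0] [2 1 4]
```

Keeping each pair together matters when building validation splits: shuffling single patches independently would separate a baseline from its adjacent abnormality.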
You will be able to load the arrays using the numpy load function.
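import os
import numpy as np

out_path = '.'  # assumption: folder holding the .npy files; adjust as needed

def load_training():
    # Load the training images and labels from out_path.
    images = np.load(os.path.join(out_path, 'train_tensor.npy'))
    labels = np.load(os.path.join(out_path, 'train_labels.npy'))
    return images, labels

images, labels = load_training()
print(labels[:10])
>>> [0 2 0 2 0 1 0 1 0 1]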
The first item is the baseline patch (label 0) that is adjacent to the patch described by the second element of the array, i.e. the first abnormality patch (malignant mass, label 2), and so on.
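For the two classification tasks described earlier (mass vs. calcification, and benign vs. malignant), the five-way labels can be remapped with small lookup tables. This is only a sketch under the label mapping given above; the names `TYPE_OF`, `PATHOLOGY_OF` and `abnormality_targets` are hypothetical, and dropping baseline patches is one possible design choice rather than a requirement of the project.

```python
import numpy as np

# Lookup tables indexed by the original label (0..4); -1 marks "not applicable".
TYPE_OF = np.array([-1, 0, 0, 1, 1])       # 0 = mass, 1 = calcification
PATHOLOGY_OF = np.array([-1, 0, 1, 0, 1])  # 0 = benign, 1 = malignant

def abnormality_targets(labels):
    """Drop baseline patches and return (keep_mask, type_labels, pathology_labels)."""
    labels = np.asarray(labels)
    keep = labels != 0  # True for abnormality patches only
    return keep, TYPE_OF[labels[keep]], PATHOLOGY_OF[labels[keep]]

demo = np.array([0, 2, 0, 2, 0, 1, 0, 1, 0, 1])  # the first ten labels shown above
keep, kind, pathology = abnormality_targets(demo)
print(kind, pathology)  # [0 0 0 0 0] [1 1 0 0 0]
```

The boolean mask `keep` can be applied to the images tensor as well (`images[keep]`) so that images and remapped labels stay aligned.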