1. Abnormality detection in mammography

Marsha Gómez edited this page Jan 4, 2021 · 1 revision

Convolutional Neural Network for Medical Imaging Analysis - Abnormality detection in mammography

The objective is to classify abnormalities in mammograms using Convolutional Neural Networks.

Original Dataset

The dataset we will focus on is CBIS-DDSM: Curated Breast Imaging Subset of Digital Database for Screening Mammography.

The original data, along with a detailed description of the collection and the policies about usage and citation, can be found here: https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM

This collection is freely available to browse, download, and use for commercial, scientific and educational purposes as outlined in the Creative Commons Attribution 3.0 Unported License.

A description of the dataset is provided in:

Lee, Rebecca Sawyer, et al. "A curated mammography data set for use in computer-aided detection and diagnosis research." Scientific data 4 (2017): 170177. URL: https://www.nature.com/articles/sdata2017177

The original images are in DICOM format, the standard format for the communication and management of medical imaging information and related data.

Description of abnormalities and Classification Tasks

The CBIS-DDSM dataset contains images of two classes of abnormalities. It therefore supports an abnormality classification task, which aims to distinguish the following classes:

  • Mass
  • Calcification

Furthermore, several CSV files are hosted here and describe each image in detail through the following fields:

patient_id, breast density, left or right breast, image view, abnormality id, abnormality type, calc type, calc distribution, assessment, pathology, subtlety, image file path, cropped image file path, ROI mask file path

This description enables a second, more fine-grained task: abnormality diagnosis classification. It aims to distinguish the following classes:

  • Mass, Benign (with or without callback)
  • Mass, Malignant
  • Calcification, Benign
  • Calcification, Malignant
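
As a rough illustration, the four diagnosis classes could be derived from the `abnormality type` and `pathology` fields along these lines. This is only a sketch: it assumes that `abnormality type` is spelled `mass` or `calcification` and that `pathology` takes the values `BENIGN`, `BENIGN_WITHOUT_CALLBACK` or `MALIGNANT`, which should be checked against the actual CSV files. The function name `diagnosis_class` is hypothetical.

```python
def diagnosis_class(abnormality_type: str, pathology: str) -> str:
    """Map two CSV fields to one of the four diagnosis classes."""
    kind = abnormality_type.strip().lower()
    outcome = pathology.strip().upper()
    if kind not in ("mass", "calcification"):
        raise ValueError(f"unexpected abnormality type: {abnormality_type!r}")
    # Assumption: benign cases with and without callback share one class.
    if outcome.startswith("BENIGN"):
        status = "Benign"
    elif outcome == "MALIGNANT":
        status = "Malignant"
    else:
        raise ValueError(f"unexpected pathology: {pathology!r}")
    return f"{kind.capitalize()}, {status}"


# Hypothetical field values, for illustration only.
print(diagnosis_class("mass", "BENIGN_WITHOUT_CALLBACK"))  # Mass, Benign
print(diagnosis_class("calcification", "MALIGNANT"))       # Calcification, Malignant
```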

[Figure: sample image from the original dataset]

Project Dataset

Please note that:

  • Full images have a high resolution, e.g. 3000x4000 pixels
  • Full images and patches are grayscale images with a depth of 16 bits
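
Because of the 16-bit depth, pixel values can range up to 65535, so it is usually convenient to rescale them before feeding a network. The sketch below shows one possible way to do this; it assumes the images are stored as unsigned 16-bit integers, and the helper name `to_unit_float` is hypothetical.

```python
import numpy as np


def to_unit_float(images: np.ndarray) -> np.ndarray:
    """Scale 16-bit grayscale values to float32 in the range [0, 1]."""
    # Assumption: the full 16-bit range (0..65535) is the reference scale.
    return images.astype(np.float32) / np.float32(65535.0)


# Synthetic 2x2 "image", for illustration only.
sample = np.array([[0, 65535], [32768, 1]], dtype=np.uint16)
scaled = to_unit_float(sample)
print(scaled.dtype, scaled.min(), scaled.max())
```

Dividing by the maximum value actually present in the training set is another common choice; which one works better here is not stated on this page.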

Dataset as it is provided for the final project

Working with the original dataset is challenging because full images are high resolution and the DICOM format is not natively supported in tf.keras.

For this reason, you are provided with NumPy arrays containing the images and labels of the training and test sets.

The steps performed on each original image are described below:

  • the abnormality patch has been extracted from the original image according to the binary mask;
  • a patch of healthy tissue (baseline patch) adjacent to the abnormality patch has been extracted from the original image (left, right, top or bottom, with no overlap). Abnormality and baseline patches have always been added to the images tensor in pairs; in other words, an abnormality patch has been discarded if no related baseline patch could be extracted;
  • both the abnormality patch and the baseline patch have been resized to shape (150x150) using the OpenCV resize function: cv2.resize(img, dsize=(shape, shape), interpolation=cv2.INTER_NEAREST)
  • class labels have been assigned to the patches according to the following mapping:
    • 0: Baseline patch
    • 1: Mass, benign
    • 2: Mass, malignant
    • 3: Calcification, benign
    • 4: Calcification, malignant
  • images of baseline patch and abnormality patch, and their related labels, have been added to distinct numpy arrays for images and labels.
    • train_tensor.npy: images tensor for training
    • train_labels.npy: labels tensor for training
    • public_test_tensor.npy: images tensor for test
    • public_test_labels.npy: labels tensor for test
  • The images tensor of a private test set is also provided. The corresponding labels tensor is not published within the project files.
    • private_test_tensor.npy
    • private_test_labels.npy (not published)
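
The extraction and resizing steps listed above can be sketched in plain NumPy as follows. This is not the code used to build the dataset: it assumes the abnormality patch is the bounding box of the non-zero mask region, and it replaces `cv2.resize(..., interpolation=cv2.INTER_NEAREST)` with a simple nearest-neighbour index lookup that may pick slightly different source pixels. The function names are hypothetical.

```python
import numpy as np


def extract_patch(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Crop the bounding box of the non-zero region of a binary mask."""
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any():
        raise ValueError("mask is empty")
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return image[r0:r1 + 1, c0:c1 + 1]


def resize_nearest(patch: np.ndarray, size: int = 150) -> np.ndarray:
    """Nearest-neighbour resize of a 2D patch to (size, size)."""
    h, w = patch.shape
    row_idx = np.arange(size) * h // size  # source row for each output row
    col_idx = np.arange(size) * w // size  # source column for each output column
    return patch[row_idx[:, None], col_idx]


# Synthetic image and mask, for illustration only.
image = np.arange(100 * 120, dtype=np.uint16).reshape(100, 120)
mask = np.zeros((100, 120), dtype=np.uint8)
mask[20:50, 30:90] = 1

patch = extract_patch(image, mask)
resized = resize_nearest(patch)
print(patch.shape, resized.shape)  # (30, 60) (150, 150)
```

Nearest-neighbour interpolation only copies existing pixel values, so the resized patch keeps the original 16-bit intensities without introducing interpolated ones.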

Dataset Structure

  • Training set:

    • images tensor shape (5352, 150, 150)
    • labels tensor shape (5352,)
  • Public Test set:

    • images tensor shape (672, 150, 150)
    • labels tensor shape (672,)
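
A quick sanity check of this layout, followed by the addition of a trailing channel axis (which convolutional layers in tf.keras typically expect), might look like the sketch below. It uses small synthetic arrays in place of the real tensors; the `uint16` and `int64` dtypes are assumptions.

```python
import numpy as np

# Small synthetic stand-ins with the documented layout (N, 150, 150) and (N,).
images = np.zeros((8, 150, 150), dtype=np.uint16)
labels = np.zeros((8,), dtype=np.int64)

# One label per image, and every image is 150x150.
assert images.ndim == 3 and images.shape[1:] == (150, 150)
assert labels.shape == (images.shape[0],)

# Add a single grayscale channel: (N, 150, 150) -> (N, 150, 150, 1).
images_with_channel = images[..., np.newaxis]
print(images_with_channel.shape)  # (8, 150, 150, 1)
```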

Class distribution of images

  • Train

| | Benign | Malignant | Total |
|---|---|---|---|
| Train Masses | 620 | 598 | 1218 |
| Train Calcification | 948 | 510 | 1458 |
| Total | 1568 | 1108 | 2676 |

  • Test

| | Benign | Malignant | Total |
|---|---|---|---|
| Global Test Masses | 214 | 144 | 358 |
| Global Test Calcification | 192 | 122 | 314 |
| Total | 406 | 266 | 672 |
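
Per-class counts like these can be recomputed from a labels tensor with `np.bincount`. The sketch below is only illustrative: the class names follow the label mapping given on this page, the helper name `class_counts` is hypothetical, and the ten labels are the ones printed in the loading example on this page rather than the full training set.

```python
import numpy as np

CLASS_NAMES = {
    0: "Baseline patch",
    1: "Mass, benign",
    2: "Mass, malignant",
    3: "Calcification, benign",
    4: "Calcification, malignant",
}


def class_counts(labels: np.ndarray) -> dict:
    """Count how many patches carry each of the five labels."""
    counts = np.bincount(labels, minlength=len(CLASS_NAMES))
    return {name: int(counts[idx]) for idx, name in CLASS_NAMES.items()}


# First ten training labels as printed on this page.
labels = np.array([0, 2, 0, 2, 0, 1, 0, 1, 0, 1])
print(class_counts(labels))
```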

Number and position of baseline patches

| | Right | Left | Top | Bottom | Total baselines |
|---|---|---|---|---|---|
| Baseline for train Masses | 836 | 240 | 89 | 53 | 1218 |
| Baseline for train Calcification | 894 | 246 | 163 | 155 | 1458 |
| Baseline for global Test Masses | 241 | 78 | 27 | 12 | 358 |
| Baseline for global Test Calcification | 180 | 50 | 42 | 42 | 314 |

In light of the procedure described above, particular attention should be paid to the structure of the input tensors:

  • odd indices [2*i + 1 for i in range(len(tensor) // 2)] refer to abnormality patches
  • the preceding even indices [2*i for i in range(len(tensor) // 2)] refer to the respective baseline patches
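
Given this interleaved layout, baseline and abnormality patches can be separated with strided slicing. The sketch below uses tiny synthetic arrays in place of the real tensors, and the helper name `split_pairs` is hypothetical.

```python
import numpy as np


def split_pairs(images: np.ndarray, labels: np.ndarray):
    """Separate baseline (even index) and abnormality (odd index) patches."""
    if len(images) % 2 != 0:
        raise ValueError("expected an even number of patches")
    baseline = images[0::2]         # indices 0, 2, 4, ...
    abnormal = images[1::2]         # indices 1, 3, 5, ...
    abnormal_labels = labels[1::2]  # labels of the abnormality patches
    return baseline, abnormal, abnormal_labels


# Synthetic data, for illustration only: six 2x2 "patches".
images = np.arange(6 * 4).reshape(6, 2, 2)
labels = np.array([0, 2, 0, 1, 0, 4])

baseline, abnormal, abnormal_labels = split_pairs(images, labels)
print(baseline.shape, abnormal.shape, abnormal_labels)
```

Position `k` in `baseline` and position `k` in `abnormal` then refer to the same adjacent pair.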

You can load the arrays with the NumPy load function, np.load.

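import os
import numpy as np

# Placeholder: set this to the folder that contains the .npy files
out_path = '.'

def load_training():
  # Load the training images and labels tensors from out_path
  images = np.load(os.path.join(out_path, 'train_tensor.npy'))
  labels = np.load(os.path.join(out_path, 'train_labels.npy'))
  return images, labels

images, labels = load_training()
print(labels[:10])
# Output: [0 2 0 2 0 1 0 1 0 1]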

The first item is the baseline patch (label 0) that is adjacent to the patch described by the second element of the array, i.e. the first abnormality patch (malignant mass, label 2), and so on.
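
For the mass versus calcification task, the five labels have to be reduced to two. One possible way, sketched below, is to drop the baseline patches and map labels 1 and 2 to one class and labels 3 and 4 to the other. The 0/1 encoding and the helper name `mass_vs_calcification` are assumptions, not part of the project files.

```python
import numpy as np


def mass_vs_calcification(images: np.ndarray, labels: np.ndarray):
    """Keep abnormality patches only and relabel: 0 = mass, 1 = calcification."""
    keep = labels != 0                        # drop baseline patches (label 0)
    kept_labels = labels[keep]
    binary = (kept_labels >= 3).astype(np.int64)  # 1, 2 -> 0 and 3, 4 -> 1
    return images[keep], binary


# Synthetic data, for illustration only: eight 2x2 "patches".
images = np.arange(8 * 4).reshape(8, 2, 2)
labels = np.array([0, 2, 0, 1, 0, 3, 0, 4])

abnormal_images, binary_labels = mass_vs_calcification(images, labels)
print(abnormal_images.shape, binary_labels)
```

Whether to keep the baseline patches as a separate class instead depends on the task being trained.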