Proposal: Integrating Image and Adjunct Patient Data from TCIA via CSV files #2212

aylward · 2021-05-19T06:26:25Z

aylward
May 19, 2021
Maintainer

This is a proposal from the MONAI Developers and the I/O, Data, and Deploy Working Groups.

A. Overview

Premise:
A potential growth area for MONAI is via the incorporation of adjunct data (patient demographics, lab results, image acquisition parameters and other non-image data) with images for diagnoses and outcome prediction. See MIDL 2020 keynote talk by Prof. Nikos Paragios https://2020.midl.io/keynotes.html

Goal:
Provide a reference implementation of a Dataset loader in MONAI to help guide challenge organizers and researchers in organizing adjunct data for input into MONAI.

Proposed solution:
Create a CSV_TCIA_Dataset loader for image and adjunct data, where the images are stored on The Cancer Image Archive (TCIA).

B. Data Details

Proposed data:
ISPY1: curated breast cancer MRI cases, with DICOM images and adjunct CSV data, available on TCIA.

Data Location:
https://wiki.cancerimagingarchive.net/display/Public/ISPY1

Data Description:
ACRIN 6657 was designed as a prospective study to test MRI for ability to predict response to treatment and risk-of-recurrence in patients with stage 2 or 3 breast cancer receiving neoadjuvant chemotherapy (NACT). ACRIN 6657 was conducted as a companion study to CALGB 150007, a correlative science study evaluating tissue-based biomarkers in the setting of neoadjuvant treatment of breast cancer. Collectively, CALGB 150007 and ACRIN 6657 formed the basis of the multicenter Investigation of Serial Studies to Predict Your Therapeutic Response with Imaging and moLecular Analysis (I-SPY TRIAL) breast cancer trial, a study of imaging and tissue-based biomarkers for predicting pathologic complete response (pCR) and recurrence-free survival (RFS).

Participant Eligibility and Enrollment: Criteria for inclusion were patients enrolling on CALGB 150007 with T3 tumors measuring at least 3 cm in diameter by clinical exam or imaging and receiving neoadjuvant chemotherapy with an anthracycline-cyclophosphamide regimen alone or followed by a taxane. Pregnant patients and those with ferromagnetic prostheses were excluded from the study. The study was open to enrollment from May 2002 to March 2006. 237 patients were enrolled, of which 230 met eligibility criteria.

C. Accessing the Data

1) MONAI users will have a local CSV file
The local CSV file points to the TCIA data via a URL and includes adjunct data and outcomes. This will be the input to the Dataset loader along with lists specifying columns for inputs and outcomes.

The local CSV file is based on info currently spread across multiple CSV files, but we should consolidate to one CSV for this demo. For example, see the two source CSV files at:
(a) https://drive.google.com/file/d/1DGAz4MVjupAiai3bOaYEerM6LImXIJg5/view?usp=sharing - Provides all adjunct medical data and the URL that points to each patient's collection on TCIA.
(b) https://drive.google.com/file/d/1D2zfyWCLfFHPwfDKIfNFEgeBqiS1YeNR/view?usp=sharing - Lists the MRI scans available for each patient on TCIA.
Both (a) and (b) should be combined into a single CSV file. That combination is partially completed in this file - https://drive.google.com/file/d/1HQ7BZvBr1edmi8HIwdG5KBweXWms5Uzk/view?usp=sharing
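As a rough illustration of that consolidation step, the sketch below joins a clinical table and a scan listing on a shared patient identifier using only the Python standard library. The column names (`SUBJECTID`, `age`, `RFS`, `series_description`) and the sample rows are invented for illustration; the real ISPY1 files may use different headers.

```python
import csv
import io

# Hypothetical stand-ins for the two source CSV files; headers and values are
# invented for illustration and are not taken from the ISPY1 files.
CLINICAL_CSV = """SUBJECTID,age,RFS
1001,38.7,751
1002,37.8,1043
"""

SCANS_CSV = """SUBJECTID,series_description
1001,Dynamic-3dfgre
1001,T2 FSE
1002,Dynamic-3dfgre
"""


def merge_on_key(clinical_text, scans_text, key="SUBJECTID"):
    """Return one row per scan, with that patient's clinical fields attached."""
    clinical = {row[key]: row for row in csv.DictReader(io.StringIO(clinical_text))}
    merged = []
    for scan in csv.DictReader(io.StringIO(scans_text)):
        patient = clinical.get(scan[key])
        if patient is None:
            continue  # skip scans that have no matching clinical record
        merged.append({**patient, **scan})
    return merged


def to_csv_text(rows):
    """Serialize merged rows back to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


merged_rows = merge_on_key(CLINICAL_CSV, SCANS_CSV)
combined_csv = to_csv_text(merged_rows)
```

Each scan row carries the patient's adjunct data, so a single file can drive both image download and outcome lookup.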

2) The CSV file will be passed to the CSV_TCIA_Dataset command
Let's assume we want to load the adjunct data in columns M (age), N (ERpos), and P (PfRpos); the URL of the data on TCIA is given in column AM (https); and the outcome to be determined is in column AE (RFS). The command may look like:
monai.apps.CsvTciaDataset("ISPY1_Combined.csv",["M","N","P"],["AM"],["AE"],"/tmp","training",transforms)
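One way to interpret the spreadsheet-style column letters in the call above is to convert them to zero-based indices before reading each row. The helper names below (`column_letter_to_index`, `select_columns`) are hypothetical and not part of MONAI, and the sample table is invented; this is only a sketch of how such a loader might map letters to CSV fields.

```python
import csv
import io


def column_letter_to_index(letters):
    """Convert a spreadsheet column label ("A", "M", "AM") to a zero-based index."""
    if not letters:
        raise ValueError("empty column label")
    index = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {letters!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def select_columns(csv_text, input_cols, url_cols, outcome_cols):
    """Yield one dict per data row, assuming the first row is a header."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        yield {
            "inputs": [row[column_letter_to_index(c)] for c in input_cols],
            "urls": [row[column_letter_to_index(c)] for c in url_cols],
            "outcomes": [row[column_letter_to_index(c)] for c in outcome_cols],
        }


# Tiny invented table: columns A-D hold an ID, an age, a URL, and an outcome.
SAMPLE = "id,age,https,RFS\n1001,38.7,https://example.org/1001,751\n"
records = list(select_columns(SAMPLE, ["B"], ["C"], ["D"]))
```

Under this scheme, "M" maps to index 12, "AE" to 30, and "AM" to 38.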

3) Image data can be loaded from TCIA using its REST API
The API is documented at https://wiki.cancerimagingarchive.net/display/Public/TCIA+Programmatic+Interface+REST+API+Guides

Via that API, we can access individual cases. In this study, for each case, there are studies from 4 different time points. At each time point, there are the DICOM images and segmentations. For our example, the Dynamic-3dfgre may be most informative of outcome, and we should use that scan from the last time point for each patient.
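The selection rule described above (the Dynamic-3dfgre scan from the last time point for each patient) could be sketched as a filter over series metadata records. The field names `PatientID`, `SeriesDescription`, `SeriesDate`, and `SeriesInstanceUID` are assumptions about what the metadata query returns, and the sample records and UIDs are invented.

```python
def latest_series_per_patient(series_records, description_keyword="dynamic-3dfgre"):
    """Pick, for each patient, the most recent series whose description matches."""
    latest = {}
    for record in series_records:
        description = record.get("SeriesDescription", "")
        if description_keyword.lower() not in description.lower():
            continue
        patient = record["PatientID"]
        current = latest.get(patient)
        # ISO-formatted dates (YYYY-MM-DD) compare correctly as plain strings.
        if current is None or record["SeriesDate"] > current["SeriesDate"]:
            latest[patient] = record
    return latest


# Invented records for illustration only.
SAMPLE_SERIES = [
    {"PatientID": "P1", "SeriesDescription": "Dynamic-3dfgre",
     "SeriesDate": "2003-01-10", "SeriesInstanceUID": "uid-1"},
    {"PatientID": "P1", "SeriesDescription": "Dynamic-3dfgre",
     "SeriesDate": "2003-05-02", "SeriesInstanceUID": "uid-2"},
    {"PatientID": "P1", "SeriesDescription": "T2 FSE",
     "SeriesDate": "2003-06-01", "SeriesInstanceUID": "uid-3"},
    {"PatientID": "P2", "SeriesDescription": "Dynamic-3dfgre",
     "SeriesDate": "2004-02-14", "SeriesInstanceUID": "uid-4"},
]

selected = latest_series_per_patient(SAMPLE_SERIES)
```

If the real dates are not ISO-formatted, they would need to be parsed before comparison.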

D. Open Issues

How to combine adjunct data with image pixel data as input to a MONAI network?
- This is a task for the Research Working Group. They may want to consider the keynote at MIDL 2020 by Prof. Nikos Paragios https://2020.midl.io/keynotes.html

E. Future Opportunities

The CSV files could use an ontology standard for naming columns to aid in the automated interpretation of CSV files. For more information on NIH initiatives to standardize the naming of medical data in CSV files, see https://wiki.cancerimagingarchive.net/display/DOI/SDTM+datasets+of+clinical+data+and+measurements+for+selected+cancer+collections+to+TCIA
The data in this collection has DICOM SR (structured reports) that include segmentations and outcome - eventually MONAI should have a DICOM SR reader.
The data in this collection can be used in longitudinal studies (look for changes over time points).

kirbyju · 2021-05-20T19:47:03Z

kirbyju
May 20, 2021
Collaborator

The "DICOM Metadata Digest (CSV)" for this dataset is an outlier. Most TCIA collections don't have this file provided as a CSV attachment on the wiki. However, all of the information in that CSV is available via our API, which is likely preferable to relying on a CSV wiki attachment.

Option #1: https://services.cancerimagingarchive.net/services/v4/TCIA/query/getSeries?Collection=ISPY1 might be sufficient for your needs. This will spit out a subset of image metadata for each DICOM scan/series in the ISPY1 collection which could then be merged with the clinical spreadsheet. This should be more generalizable and easier to work with than manually tracking down a spreadsheet from the wiki. Note that you can add "&format=JSON" to specify JSON/CSV/XML.
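A minimal sketch of Option #1, using only the standard library: it builds the query URL quoted above and parses a CSV response into dictionaries. The helper names are hypothetical, the endpoint is taken as written in this thread (it may have changed since), and `fetch_series_table` performs a network request, so it is defined here but not called.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

TCIA_QUERY_BASE = "https://services.cancerimagingarchive.net/services/v4/TCIA/query"


def build_query_url(endpoint, **params):
    """Build a query URL such as .../getSeries?Collection=ISPY1&format=CSV."""
    return f"{TCIA_QUERY_BASE}/{endpoint}?{urlencode(params)}"


def fetch_series_table(url, timeout=60):
    """Download a CSV response and return it as a list of dicts (network call)."""
    with urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


series_url = build_query_url("getSeries", Collection="ISPY1", format="CSV")
```

The resulting rows could then be merged with the clinical spreadsheet on the patient identifier.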

Option #2: If you want an even more robust set of DICOM metadata you could look into this NBIA API endpoint: https://wiki.cancerimagingarchive.net/display/Public/NBIA+Search+REST+API+Guide#NBIASearchRESTAPIGuide-SeriesMetadataAPI. This API endpoint lets a user specify a list of Series UIDs and provides a longer list of metadata fields. This NBIA API requires some extra steps to set up an authorization token before you can use it.

Once you have your list of Series UIDs from the previous query you can use https://services.cancerimagingarchive.net/services/v4/TCIA/query/getImage?SeriesInstanceUID=1.3.6.1.4.1.14519.5.2.1.7695.1700.250955243295773832626617549482 to iteratively download each one. The downside is that it provides a zip file for each scan, so you have to unpack the files and organize them into a logical directory hierarchy yourself.
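The unpack-and-organize step might look like the sketch below, which places each series archive under a Collection / Patient / Study / Series directory. The function names and the layout are assumptions for illustration; `download_series_zip` contacts the endpoint quoted above and is not called here, and the demo uses a tiny in-memory archive instead.

```python
import io
import tempfile
import zipfile
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

GET_IMAGE_URL = "https://services.cancerimagingarchive.net/services/v4/TCIA/query/getImage"


def download_series_zip(series_uid, timeout=60):
    """Fetch the zip archive for one series (performs a network request)."""
    url = f"{GET_IMAGE_URL}?{urlencode({'SeriesInstanceUID': series_uid})}"
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_series(zip_bytes, root, collection, patient_id, study_uid, series_uid):
    """Unpack a series archive into root/collection/patient/study/series."""
    target = Path(root) / collection / patient_id / study_uid / series_uid
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(target)
    return target


# Offline demo: build a tiny in-memory archive instead of calling the API.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("000001.dcm", b"not a real DICOM file")

with tempfile.TemporaryDirectory() as tmp_root:
    series_dir = extract_series(
        buffer.getvalue(), tmp_root, "ISPY1", "patient-1", "study-1", "series-1"
    )
    extracted_names = sorted(p.name for p in series_dir.iterdir())
    relative_parts = series_dir.relative_to(tmp_root).parts
```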

We are also very close (weeks, I think, not months) to releasing a "command line interface" version of the NBIA Data Retriever which might also be of interest. This will let you specify a ".TCIA" manifest file (which you can obtain using the "Image Download" button on the ISPY page) and that will easily download all the data from the collection/manifest into a well organized hierarchy (Collection / Patient / Study / Series / Image).

1 reply

aylward May 23, 2021
Maintainer Author

Thanks for the feedback! I think this information is going to be extremely helpful as our collection of dataset readers expands. The plans you've outlined for TCIA nicely parallel many of the items on our wish list!!

For this near-term proposal, our goal isn't to come up with the "best" TCIA reader, but we sought to create a data import method/example that illustrates a broad set of dataset reading capabilities to the MONAI community.

In particular, the motivation for using CSV files is that they are a commonly occurring and easy-to-craft data representation used in challenges, by other platforms, and in a variety of domains. So, by having MONAI ingest a CSV file:

that file can then reference data anywhere on the web.
that file can specify training, testing, and validation data allocations.
that file can be easily shared with others and other applications.
an MD5 checksum can be used to ensure the CSV file used in a challenge is the same file used by every participant.
that file is human readable and human editable using a multitude of programs.
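The MD5 check mentioned in the list above can be sketched with the standard library's `hashlib`. The function names are hypothetical, and the demo file is a throwaway stand-in for a challenge CSV.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_csv(path, expected_md5):
    """Raise ValueError when the file's digest differs from the published one."""
    actual = md5_of_file(path)
    if actual != expected_md5.lower():
        raise ValueError(f"checksum mismatch for {path}: {actual} != {expected_md5}")
    return True


# Demo on a throwaway file; an organizer would publish the digest with the CSV.
DEMO_BYTES = b"SUBJECTID,age,RFS\n1001,38.7,751\n"
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(DEMO_BYTES)
    demo_path = tmp.name

published_digest = md5_of_file(demo_path)
is_valid = verify_csv(demo_path, published_digest)
os.remove(demo_path)
```

Participants would run `verify_csv` on their local copy before training, so that any edited or truncated file is caught early.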

So, what we are proposing is a great demo in that it shows

how to use CSV files with MONAI
how data can be read from the web into MONAI
how DICOM data can be read into MONAI (e.g., after downloading it from the web)
one easy-to-understand option for getting DICOM data from any of the massive number of collections on TCIA

This is FAR from a final solution for data management or for TCIA data reading, but it is going to help a very broad audience of data producers and consumers by its simplicity and extensibility.

The Data (and Deploy) working group(s) is (are) looking at FHIR and other data formats as well as various ontologies to specify a more comprehensive solution for MONAI data management/ingest, but that is a much longer-term process. If you have image and meta-data now, a CSV file seems like a good near-term solution to quickly get it into MONAI.

The info you've provided is excellent for the Data and Deploy working groups as they consider FHIR and other data ingest solutions. For example, they may have to devise a way for a TCIA manifest to be easily converted to their proposed FHIR/? format. Additionally, they may consider a TCIA-specific reader, and we look forward to working with the TCIA team to figure out how to designate training, testing and validation subsets from a collection on TCIA (perhaps by ingesting manifests that are pre-defined, MD5-checksum verified, and hosted on the TCIA). We may also want a way of mixing TCIA data with local data...and then we're perhaps back to a version of the FHIR format that allows data to be stored in multiple locations. Again, it is going to be a longer-term process to get those details resolved, and MONAI would still want to support CSV reading, DICOM reading, and Web reading...as this proposal demonstrates.

With that in mind, do you agree that a CSV might be a good first step for MONAI or do you think there is other "lower hanging fruit" we should pursue instead?

Thanks!
Stephen

kirbyju · 2021-05-20T19:51:54Z

kirbyju
May 20, 2021
Collaborator

Regarding DICOM SR, you might also want to take a look at these datasets as examples:

QIN-PROSTATE-Repeatability - https://doi.org/10.7937/K9/TCIA.2018.MR1CKGND
QIN-HeadNeck - https://doi.org/10.7937/K9/TCIA.2015.K0F5CGLI
Standardized representation of the TCIA LIDC-IDRI annotations using DICOM - https://doi.org/10.7937/TCIA.2018.h7umfurq

1 reply

aylward May 23, 2021
Maintainer Author

Excellent! I think DICOM SR handling (ingest and generation) needs to move up the MONAI priority queue, but I do also wonder where it fits within the MONAI ecosystem, which is expanding to include Clara, MONAI Label, and such... However, regardless of the implementation details, it is going to become critical to the MONAI community, which is a great testament to how the MONAI community is moving closer and closer to clinical workflows! The pace of this progression to clinical applicability is phenomenal!

The examples you provide are exactly what is needed to start this process. Having examples from multiple groups/domains is critical to developing a robust solution. Thank you!!!

wyli · 2021-07-28T15:50:00Z

wyli
Jul 28, 2021
Collaborator

TCIA downloading APIs:

for reference (from @kirbyju)

1 reply

kirbyju Jul 28, 2021
Collaborator

I should note that those are community repositories. The official REST API Guides are at https://wiki.cancerimagingarchive.net/x/NIIiAQ. One more community repo of potential interest is https://github.com/oncoramedical/tcia_bootstrap which aims to make importing data into a DICOM server quick and easy.

wyli · 2021-09-06T13:16:14Z

wyli
Sep 6, 2021
Collaborator

We discussed a dev task in #2877, and @Nic-Ma has kindly come up with an initial solution:
https://github.com/Project-MONAI/tutorials/blob/82e1e623c2cfaad3b3dd94db537bb743dce523a6/modules/tcia_csv_processing.ipynb

0 replies

aylward · 2021-09-07T16:35:58Z

aylward
Sep 7, 2021
Maintainer Author

Nice! I had not thought of downloading the CSV file. As long as we also have the option of specifying only a filename (and not a URL) so that we can load a local CSV file, your solution looks good. Is it assumed that the first row of the CSV gives the name for each column (i.e., col_name)? That seems fine, but should be made explicit in the description. Should it be possible to specify rows for training, testing, and validation - much like the decathlon data? Should there be an MD5 checksum for the CSV, to ensure its data hasn't been modified?

0 replies

aylward · 2021-09-08T12:42:10Z

aylward
Sep 8, 2021
Maintainer Author

For specifying training, testing, and validation data, perhaps a column can be designated to contain that info. It would greatly reduce the complexity of the command line and allow for a full experiment to be defined and reproducible.
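A designated split column could be consumed roughly as follows. The column name `split`, its values, and the sample rows are assumptions made for illustration, and the first row is assumed to be a header.

```python
import csv
import io
from collections import defaultdict

# Invented sample data; the "split" column name and its values are assumptions.
SAMPLE = """SUBJECTID,age,RFS,split
1001,38.7,751,training
1002,37.8,1043,validation
1003,47.8,2387,training
1004,40.1,1151,testing
"""


def rows_by_split(csv_text, split_column="split"):
    """Group CSV rows by the value found in the designated split column."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[split_column].strip().lower()].append(row)
    return dict(groups)


splits = rows_by_split(SAMPLE)
training_rows = splits.get("training", [])
```

Because the allocation lives in the file itself, sharing the CSV (together with its checksum) is enough to reproduce the same training, validation, and testing sets.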

0 replies
