Question about data from Imaging Data Commons used for training/testing the MR model #313
Hi,
After converting the DICOMs to NIfTIs using dcm2niix I got the following IDs. Does that help?
Hi, hmm, it looks like those may be a combination of PatientID and some sort of date. I think the unique SeriesInstanceUID would be the most helpful: the CT volume will have a SeriesInstanceUID, and the DICOM SEG object will also have its own SeriesInstanceUID. You could use

Thanks!
Hi Jakob, I wanted to follow up on the previous comment: would it be possible to obtain the SeriesInstanceUIDs? It would also be great to match the SeriesInstanceUIDs to the IDs that you used (s0001, s0002, etc.). Thank you! Deepa
@wasserth I think it is quite important to know precisely what data was used to train the model. IDC makes it possible to very easily retrieve images identified by DICOM UIDs: all you have to do is provide the list of those UIDs and the IDC data release version. It would be great if we could work together to gather and share this information and, by doing this, improve the transparency of your training process.
When I downloaded all the files I got one directory with thousands of DICOM slices in it. It was not easily possible to see which ones make up one 3D volume. I ran dcm2niix on the directory and, luckily, it figured out all the files and generated a list of 3D NIfTI files. From these files I selected a subset of images; they have the names which I showed earlier. I do not really know how to go back from these names to the DICOM UIDs. For each file, dcm2niix also generated the following JSON file, but this also does not seem to contain the DICOM UID:
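For context, the sorting that dcm2niix performs can be approximated by grouping slices on their SeriesInstanceUID. A minimal Python sketch of that idea (the file names and UIDs below are invented placeholders; in practice each header would be read from the DICOM file with a library such as pydicom, which is not shown here):

```python
from collections import defaultdict

# Hypothetical per-slice headers; all names and UIDs are placeholders.
slice_headers = {
    "slice_001.dcm": {"SeriesInstanceUID": "1.2.3.1"},
    "slice_002.dcm": {"SeriesInstanceUID": "1.2.3.1"},
    "slice_003.dcm": {"SeriesInstanceUID": "1.2.3.2"},
}

def group_by_series(headers):
    """Group slice filenames into one list per SeriesInstanceUID (one 3D volume)."""
    series = defaultdict(list)
    for fname, hdr in sorted(headers.items()):
        series[hdr["SeriesInstanceUID"]].append(fname)
    return dict(series)

volumes = group_by_series(slice_headers)
print(len(volumes))  # -> 2 distinct 3D volumes
```

Each resulting list is one 3D series, which is exactly the unit a SeriesInstanceUID identifies.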
@wasserth thank you for the explanation. This definitely helps us understand why matching UIDs is not trivial for your processing approach. We have an idea of how to match those, and will explore this and update this issue. We will also improve the download process from the IDC Portal so that you do not have to deal with one directory containing thousands of DICOM slices.
@wasserth do you mind sharing the dcm2niix command you used? Did you only use the compression argument, as in -z y?
I don't know why I didn't think about this earlier, but I think I figured it out. It appears that the strings shared in #313 (comment) are formed as a concatenation of PatientID, StudyDate, and StudyTime. @vkt1414 here's the query:

```sql
WITH
  selected AS (
  SELECT
    REPLACE(CONCAT(PatientID, '_', CAST(StudyDate AS STRING FORMAT 'YYYYMMDD'), CAST(StudyTime AS STRING FORMAT 'HHMISS')), '-', '_') AS filename
  FROM
    `bigquery-public-data.idc_current.dicom_all`)
SELECT
  DISTINCT(filename)
FROM
  selected
WHERE
  # exact name CMB_CRC_MSB_09151_19591104154630
  filename LIKE "CMB_CRC_MSB_09151%"
ORDER BY
  filename
```

The only difference appears to be a shift in the HH part of the StudyTime.
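To make the naming logic the query reverses explicit, here is a small Python sketch of the same transformation (the function name `idc_filename` is mine, not something from the thread, and the truncation of StudyTime to HHMMSS is an assumption, since DICOM TM values may carry fractional seconds):

```python
def idc_filename(patient_id: str, study_date: str, study_time: str) -> str:
    """Mimic CONCAT(PatientID, '_', StudyDate, StudyTime) with '-' replaced by '_'."""
    # Assumption: keep only the HHMMSS part of StudyTime.
    raw = f"{patient_id}_{study_date}{study_time[:6]}"
    return raw.replace("-", "_")

print(idc_filename("CMB-CRC-MSB-09151", "19591104", "154630"))
# -> CMB_CRC_MSB_09151_19591104154630
```

This reproduces the exact-name example in the query's comment, which supports the PatientID + StudyDate + StudyTime hypothesis.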
@wasserth can you provide the mapping from the s0001, s0002, etc. IDs to the corresponding SeriesInstanceUIDs?
Hi @wasserth, we wanted to know if it would be possible to address the above issue. Thank you! Deepa
Hi Jakob and co-authors,
Thank you for creating a model to segment MR structures! My lab is very excited to try it out.
I was curious about the data that you used from Imaging Data Commons. In your supplementary material S4, you listed 21 collections from IDC that you used, and in the paper you mentioned you used data from 47 patients.
In your metadata CSV, would it be possible to include more identifiable information about which exact patients (and, if applicable, which corresponding segmentations) were used from the 21 IDC collections? Perhaps you could include the SeriesInstanceUID of the image data as well as of the segmentations.
We would like to try your model on more data from IDC, and would like to make sure that it does not overlap with the data used for training.
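As an illustration of the overlap check we have in mind, here is a minimal Python sketch on SeriesInstanceUID sets (all UIDs below are invented placeholders, not real IDC identifiers):

```python
# Sketch of a train/eval overlap check on SeriesInstanceUIDs.
# All UIDs are invented placeholders standing in for the lists we are
# requesting (training UIDs from the metadata CSV, candidates from IDC).
training_uids = {"1.2.840.0001", "1.2.840.0002"}
candidate_uids = {"1.2.840.0002", "1.2.840.0003"}

overlap = training_uids & candidate_uids             # series that must be excluded
evaluation_uids = sorted(candidate_uids - overlap)   # safe to evaluate on

print(overlap)          # -> {'1.2.840.0002'}
print(evaluation_uids)  # -> ['1.2.840.0003']
```

With the SeriesInstanceUIDs published, this check becomes a two-line set operation for anyone evaluating the model.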
Thank you!
Deepa