Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add indexing to filesystem client #28

Closed
wants to merge 1 commit into from
Closed

Conversation

dillonwilliams
Copy link
Contributor

Issue #27

@dillonwilliams dillonwilliams requested a review from johndgiese May 7, 2019 01:26
Copy link
Contributor

@johndgiese johndgiese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks great! Is there a noticable speedup?

I did a quick review, but feel free to have willy do a more in-depth review if you like

client.send_datasets([get_new_dataset()])
results = client.search_patients('new')
assert len(results) == 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra whitespace

This filesystem client can be used for testing in development when a PACS server
is not available. It may be slow if many datasets are present: All get/fetch operations
are O(N) on the number of DICOM datasets loaded from the `test_dicom_data` dir.
This filesystem client can be used for prototyping and testing in development when a PACS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice attention keeping the commend up to date

for dicom_file in glob.glob(f'{dicom_source_dir}/**/*.dcm', recursive=True):
self._read_and_add_data_set(dicom_file)
# load and use the index if it is present and its hash matches the current dir
index_path = self._filepath(INDEX_FILENAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could split:

        index_path = self._filepath(INDEX_FILENAME)
	        if os.path.exists(index_path):

into a method, e.g. _index_exists?

if os.path.exists(index_path):
with open(index_path, 'rb') as f:
self.index = pickle.load(f)
if self.index.dicom_source_dir_hash != self._dicom_source_dir_hash():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, but why did you decide to put the hash in the index file, vs putting it in the file name?

E.g., is it so that we don't accumulate index files?

If the hash was in the filename, we could check it without loading the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having one index file name makes it easier to keep track, put in things like .gitignore, etc. The index file is pretty small (less than 1MB for 150+ series), so loading the file to check the hash is not a big slowdown.

and modification times.
"""
h = hashlib.md5()
for dicom_file in glob.glob(f'{self.dicom_source_dir}/**/*.dcm', recursive=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we are assuming all dicom files end in .dcm. I think this is fine for now, but perhaps we should add a comment about it?

with open(index_path, 'wb') as f:
pickle.dump(self.index, f)

def _read_dataset(self, filepath: str) -> Dataset:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you made a method for this

self.study_id_to_filepaths: Dict[str, List[int]] = \
defaultdict(_default_empty_list)
self.patient_name_to_filepaths: Dict[str, List[int]] = \
defaultdict(_default_empty_list)


class FilesystemDicomClient(BaseDicomClient):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought, but would passing in the index filename into the FilesystemDicomClient, vs using a global const, make it easier to test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make it a little easier to test, but each user when then have to concern themselves with the index file location and name, as opposed to it being handled in the background. There are definitely use cases for this, like putting the index and the data on separate volumes, but they might be beyond the scope of the filesystem client at that point.

@dillonwilliams
Copy link
Contributor Author

So there is huge speedup on startup / first use when the files have not changed, but the first search of an uncached patient (for example) is actually slower than before, because the patient datasets have to be loaded at the search time. Before, all datasets were cached at the first use. Searching for a patient with 3 studies might take 3-5 seconds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants