-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add indexing to filesystem client #28
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks great! Is there a noticable speedup?
I did a quick review, but feel free to have willy do a more in-depth review if you like
client.send_datasets([get_new_dataset()]) | ||
results = client.search_patients('new') | ||
assert len(results) == 1 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra whitespace
This filesystem client can be used for testing in development when a PACS server | ||
is not available. It may be slow if many datasets are present: All get/fetch operations | ||
are O(N) on the number of DICOM datasets loaded from the `test_dicom_data` dir. | ||
This filesystem client can be used for prototyping and testing in development when a PACS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 nice attention keeping the commend up to date
for dicom_file in glob.glob(f'{dicom_source_dir}/**/*.dcm', recursive=True): | ||
self._read_and_add_data_set(dicom_file) | ||
# load and use the index if it is present and its hash matches the current dir | ||
index_path = self._filepath(INDEX_FILENAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could split:
index_path = self._filepath(INDEX_FILENAME)
if os.path.exists(index_path):
into a method, e.g. _index_exists
?
if os.path.exists(index_path): | ||
with open(index_path, 'rb') as f: | ||
self.index = pickle.load(f) | ||
if self.index.dicom_source_dir_hash != self._dicom_source_dir_hash(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering, but why did you decide to put the hash in the index file, vs putting it in the file name?
E.g., is it so that we don't accumulate index files?
If the hash was in the filename, we could check it without loading the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having one index file name makes it easier to keep track, put in things like .gitignore
, etc. The index file is pretty small (less than 1MB for 150+ series), so loading the file to check the hash is not a big slowdown.
and modification times. | ||
""" | ||
h = hashlib.md5() | ||
for dicom_file in glob.glob(f'{self.dicom_source_dir}/**/*.dcm', recursive=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we are assuming all dicom files end in .dcm
. I think this is fine for now, but perhaps we should add a comment about it?
with open(index_path, 'wb') as f: | ||
pickle.dump(self.index, f) | ||
|
||
def _read_dataset(self, filepath: str) -> Dataset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how you made a method for this
self.study_id_to_filepaths: Dict[str, List[int]] = \ | ||
defaultdict(_default_empty_list) | ||
self.patient_name_to_filepaths: Dict[str, List[int]] = \ | ||
defaultdict(_default_empty_list) | ||
|
||
|
||
class FilesystemDicomClient(BaseDicomClient): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought, but would passing in the index filename into the FilesystemDicomClient
, vs using a global const, make it easier to test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would make it a little easier to test, but each user when then have to concern themselves with the index file location and name, as opposed to it being handled in the background. There are definitely use cases for this, like putting the index and the data on separate volumes, but they might be beyond the scope of the filesystem client at that point.
So there is huge speedup on startup / first use when the files have not changed, but the first search of an uncached patient (for example) is actually slower than before, because the patient datasets have to be loaded at the search time. Before, all datasets were cached at the first use. Searching for a patient with 3 studies might take 3-5 seconds. |
Issue #27