Add indexing to filesystem client #28

dillonwilliams · 2019-05-07T01:26:27Z

Issue #27

johndgiese

This PR looks great! Is there a noticeable speedup?

I did a quick review, but feel free to have willy do a more in-depth review if you like.

johndgiese · 2019-05-07T12:39:55Z

pacsman/filesystem_client_test.py

+    client.send_datasets([get_new_dataset()])
+    results = client.search_patients('new')
+    assert len(results) == 1
+


Extra whitespace

johndgiese · 2019-05-07T12:40:24Z

pacsman/filesystem_dev_client.py

-This filesystem client can be used for testing in development when a PACS server
-is not available. It may be slow if many datasets are present: All get/fetch operations
-are O(N) on the number of DICOM datasets loaded from the `test_dicom_data` dir.
+This filesystem client can be used for prototyping and testing in development when a PACS


👍 Nice attention to keeping the comment up to date.

johndgiese · 2019-05-07T12:41:56Z

pacsman/filesystem_dev_client.py

-        for dicom_file in glob.glob(f'{dicom_source_dir}/**/*.dcm', recursive=True):
-            self._read_and_add_data_set(dicom_file)
+        # load and use the index if it is present and its hash matches the current dir
+        index_path = self._filepath(INDEX_FILENAME)


Perhaps we could split:

    index_path = self._filepath(INDEX_FILENAME)
    if os.path.exists(index_path):

into a method, e.g. _index_exists?
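A minimal sketch of what such a helper might look like. The class name, the value of `INDEX_FILENAME`, and the behavior of `_filepath` are assumptions made for illustration; only the method name `_index_exists` comes from the suggestion above.

```python
import os

INDEX_FILENAME = 'index.pickle'  # hypothetical value; the real constant is not shown in this thread


class FilesystemClientSketch:
    """Illustrative stand-in for the part of the client that locates the index."""

    def __init__(self, dicom_source_dir: str) -> None:
        self.dicom_source_dir = dicom_source_dir

    def _filepath(self, filename: str) -> str:
        # Assumed behavior: resolve a file name relative to the DICOM source dir.
        return os.path.join(self.dicom_source_dir, filename)

    def _index_exists(self) -> bool:
        # Wraps the path lookup and the existence check in one named step.
        return os.path.exists(self._filepath(INDEX_FILENAME))
```

The caller would then read `if self._index_exists():`, keeping the path construction out of the loading logic.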

johndgiese · 2019-05-07T12:43:38Z

pacsman/filesystem_dev_client.py

+        if os.path.exists(index_path):
+            with open(index_path, 'rb') as f:
+                self.index = pickle.load(f)
+                if self.index.dicom_source_dir_hash != self._dicom_source_dir_hash():


Just wondering, but why did you decide to put the hash in the index file, vs putting it in the file name?

E.g., is it so that we don't accumulate index files?

If the hash was in the filename, we could check it without loading the file.

dillonwilliams · 2019-05-15 (reply)

Having a single index file name makes it easier to keep track of and to add to things like .gitignore. The index file is pretty small (less than 1 MB for 150+ series), so loading the file to check the hash is not a big slowdown.
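For reference, the "hash inside the file" approach can be sketched roughly as below. This is an illustration only: it stores the index as a plain dict with a `dicom_source_dir_hash` key, whereas the PR keeps the hash as an attribute on an index object, and the function names are hypothetical.

```python
import pickle
from typing import Any, Dict, Optional


def save_index(index: Dict[str, Any], index_path: str) -> None:
    # Always written to the same path, so stale index files never accumulate.
    with open(index_path, 'wb') as f:
        pickle.dump(index, f)


def load_index_if_current(index_path: str, current_hash: str) -> Optional[Dict[str, Any]]:
    """Return the stored index, or None if it is missing or stale."""
    try:
        with open(index_path, 'rb') as f:
            index = pickle.load(f)
    except FileNotFoundError:
        return None
    # The file has to be loaded before the hash can be compared.
    if index.get('dicom_source_dir_hash') != current_hash:
        return None
    return index
```

The cost of this layout is one unpickle per startup even when the index turns out to be stale, which is cheap for an index of the size mentioned above.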

johndgiese · 2019-05-07T12:44:58Z

pacsman/filesystem_dev_client.py

+        and modification times.
+        """
+        h = hashlib.md5()
+        for dicom_file in glob.glob(f'{self.dicom_source_dir}/**/*.dcm', recursive=True):


It seems like we are assuming all dicom files end in .dcm. I think this is fine for now, but perhaps we should add a comment about it?
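A sketch of how the directory hash with such a comment might read. The function body is an assumption pieced together from the diff fragment above (MD5 over the globbed `.dcm` files and their modification times); the sorting step and the exact bytes fed to the hash are illustrative choices, not the PR's code.

```python
import glob
import hashlib
import os


def dicom_source_dir_hash(dicom_source_dir: str) -> str:
    """Hash the DICOM file paths and modification times under a directory."""
    h = hashlib.md5()
    # NOTE: assumes every DICOM file ends in .dcm; files with any other
    # extension (or none) are ignored and will not invalidate the index.
    pattern = os.path.join(dicom_source_dir, '**', '*.dcm')
    # Sorted so the hash does not depend on filesystem listing order.
    for dicom_file in sorted(glob.glob(pattern, recursive=True)):
        h.update(dicom_file.encode('utf-8'))
        h.update(str(os.path.getmtime(dicom_file)).encode('utf-8'))
    return h.hexdigest()
```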

johndgiese · 2019-05-07T12:46:49Z

pacsman/filesystem_dev_client.py

+        with open(index_path, 'wb') as f:
+            pickle.dump(self.index, f)
+
+    def _read_dataset(self, filepath: str) -> Dataset:


I like how you made a method for this

johndgiese · 2019-05-07T12:49:17Z

pacsman/filesystem_dev_client.py

+        self.study_id_to_filepaths: Dict[str, List[int]] = \
+            defaultdict(_default_empty_list)
+        self.patient_name_to_filepaths: Dict[str, List[int]] = \
+            defaultdict(_default_empty_list)


 class FilesystemDicomClient(BaseDicomClient):


Just a thought, but would passing in the index filename into the FilesystemDicomClient, vs using a global const, make it easier to test?

dillonwilliams · 2019-05-15 (reply)

It would make it a little easier to test, but each user would then have to concern themselves with the index file location and name, as opposed to it being handled in the background. There are definitely use cases for this, like putting the index and the data on separate volumes, but they might be beyond the scope of the filesystem client at that point.
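One possible middle ground, sketched under assumptions: an optional constructor argument that defaults to the module constant, so tests can override it while ordinary users never see it. The class name, parameter name, and constant value below are hypothetical and are not part of the PR.

```python
import os

INDEX_FILENAME = 'index.pickle'  # hypothetical default; the real constant is not shown


class ConfigurableIndexClient:
    """Illustrative only: shows the constructor shape, not the real client."""

    def __init__(self, dicom_source_dir: str,
                 index_filename: str = INDEX_FILENAME) -> None:
        self.dicom_source_dir = dicom_source_dir
        # Callers that do not care about the index keep the default.
        self.index_filename = index_filename

    def index_path(self) -> str:
        return os.path.join(self.dicom_source_dir, self.index_filename)
```

A test could pass something like `index_filename='test_index.pickle'` to avoid touching a developer's real index, while production callers keep the one-argument form.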

dillonwilliams · 2019-05-15T22:29:27Z

There is a huge speedup on startup and first use when the files have not changed, but the first search for an uncached patient (for example) is actually slower than before, because that patient's datasets have to be loaded at search time. Before, all datasets were cached on first use. Searching for a patient with 3 studies might take 3-5 seconds.
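The trade-off described above can be illustrated with a small lazy cache. This is a sketch under assumptions, not the PR's code: the class name, the `read_dataset` callable, and the dict-based index are all hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, List


class LazyDatasetCache:
    """Loads datasets on first access instead of all at startup."""

    def __init__(self, patient_name_to_filepaths: Dict[str, List[str]],
                 read_dataset: Callable[[str], Any]) -> None:
        self._index = patient_name_to_filepaths
        self._read_dataset = read_dataset  # e.g. a function that parses one DICOM file
        self._cache: Dict[str, Any] = {}

    def datasets_for_patient(self, patient_name: str) -> List[Any]:
        datasets = []
        for path in self._index.get(patient_name, []):
            if path not in self._cache:
                # First search for this patient pays the file-read cost here.
                self._cache[path] = self._read_dataset(path)
            datasets.append(self._cache[path])
        return datasets
```

Startup only needs the index, so it is fast; the first lookup for each patient reads that patient's files, and repeat lookups are served from the cache.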

Add indexing to filesystem client

d53deef

Issue #27

dillonwilliams requested a review from johndgiese May 7, 2019 01:26

johndgiese reviewed May 7, 2019

dillonwilliams closed this Feb 18, 2020
