Merge pull request #382 from hdmf-dev/hdmf_2.0
Co-authored-by: Oliver Ruebel <[email protected]>
Co-authored-by: Andrew Tritt <[email protected]>
3 people authored Jul 17, 2020
2 parents b39bada + 74ca76b commit c258390
Showing 20 changed files with 1,976 additions and 316 deletions.
21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,26 @@
# HDMF Changelog

## HDMF 2.0.0 (July 17, 2020)

### New features
- Users can now call `HDF5IO.export` and `HDF5IO.export_io` to write data that was read from one source to a new HDF5
file. Developers can implement the `export` method in classes that extend `HDMFIO` to customize the export
functionality. See https://hdmf.readthedocs.io/en/latest/export.html for more details. @rly (#388)
- Users can use the new export functionality to read data from one source, modify the data in-memory, and then write the
modified data to a new file. Modifications can include additions and removals. To facilitate removals,
`AbstractContainer` contains a new `_remove_child` method and `BuildManager` contains a new `purge_outdated` method.
@rly (#388)
- Users can now call `Container.generate_new_id` to generate new object IDs for the container and all of its children.
@rly (#401)

### Breaking changes
- `Builder` objects no longer have the `written` field, which was used by `HDF5IO` to mark the object as written. This
is replaced by `HDF5IO.get_written`. @rly (#381)
- `HDMFIO.write` and `HDMFIO.write_builder` no longer have the keyword argument `exhaust_dcis`. This remains present in
`HDF5IO.write` and `HDF5IO.write_builder`. @rly (#388)
- The class method `HDF5IO.copy_file` is no longer supported and may be removed in a future version. Please use the
`HDF5IO.export` method or `h5py.File.copy` method instead. @rly (#388)

## HDMF 1.6.4 (June 26, 2020)

### Internal improvements
6 changes: 3 additions & 3 deletions docs/source/conf.py
@@ -70,9 +70,9 @@

 intersphinx_mapping = {
     'python': ('https://docs.python.org/3.8', None),
-    'numpy': ('http://docs.scipy.org/doc/numpy/', None),
-    'scipy': ('http://docs.scipy.org/doc/scipy/reference', None),
-    'matplotlib': ('http://matplotlib.org', None),
+    'numpy': ('https://numpy.org/doc/stable/', None),
+    'scipy': ('https://docs.scipy.org/doc/scipy/reference', None),
+    'matplotlib': ('https://matplotlib.org', None),
     'h5py': ('http://docs.h5py.org/en/latest/', None),
     'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None),
 }
93 changes: 93 additions & 0 deletions docs/source/export.rst
@@ -0,0 +1,93 @@
Export
======

Export is a new feature in HDMF 2.0. You can use export to take a container that was read from a file and write it to
a different file, with or without modifications to the container in memory.
The in-memory container being exported will be written to the exported file as if it had never been read from a file.

To export a container, first read the container from a file, then create a new
:py:class:`~hdmf.backends.hdf5.h5tools.HDF5IO` object for exporting the data, then call
:py:meth:`~hdmf.backends.hdf5.h5tools.HDF5IO.export` on the
:py:class:`~hdmf.backends.hdf5.h5tools.HDF5IO` object, passing in the IO object used to read the container
and, optionally, the container itself, which may be modified in memory between reading and exporting.

For example:

.. code-block:: python

    with HDF5IO(self.read_path, manager=manager, mode='r') as read_io:
        with HDF5IO(self.export_path, mode='w') as export_io:
            export_io.export(src_io=read_io)

FAQ
---

Can I read a container from disk, modify it, and then export the modified container?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Yes, you can export the in-memory container after modifying it in memory. The modifications will appear in the
exported file, not in the file that was read.

- If the modifications are removals or additions of containers, then no special action is needed, as long as the
  container hierarchy is updated correspondingly.
- If the modifications are changes to attributes, then
  :py:meth:`Container.set_modified() <hdmf.container.AbstractContainer.set_modified>` must be called
  on the container before exporting, as shown in the sketch below.
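
The following is a minimal sketch of this workflow, assuming the container has a settable ``description``
attribute; the file names and the attribute are placeholders, not part of the HDMF API:

.. code-block:: python

    with HDF5IO('data.h5', manager=manager, mode='r') as read_io:
        container = read_io.read()
        container.description = 'updated description'  # placeholder attribute
        container.set_modified()  # flag the attribute change so it is exported
        with HDF5IO('export.h5', mode='w') as export_io:
            export_io.export(src_io=read_io, container=container)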

.. note::

   Modifications to :py:class:`h5py.Dataset <h5py.Dataset>` objects act *directly* on the read file on disk.
   Changes are applied immediately and do not require exporting or writing the file. If you want to modify a dataset
   only in the new file, then you should replace the whole object with a new array holding the modified data. To
   prevent unintentional changes to the source file, the source file should be opened with ``mode='r'``.

Can I export a newly instantiated container?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No, you can only export containers that have been read from a file. The ``src_io`` argument is required in
:py:meth:`HDMFIO.export <hdmf.backends.io.HDMFIO.export>`.

Can I read a container from disk and export only part of the container?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It depends. You can only export the root container from a file. To export the root container without certain other
sub-containers in the hierarchy, you can remove those other containers before exporting. However, you cannot export
only a sub-container of the container hierarchy.
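
As a hedged sketch, one way to remove a sub-container before export is the internal
``AbstractContainer._remove_child`` method added in HDMF 2.0; the file names and the choice of child are
placeholders, and internal methods may change in future releases:

.. code-block:: python

    with HDF5IO('data.h5', manager=manager, mode='r') as read_io:
        container = read_io.read()
        unwanted = container.children[0]  # placeholder: select the sub-container to remove
        container._remove_child(unwanted)  # internal API; use with care
        with HDF5IO('export.h5', mode='w') as export_io:
            export_io.export(src_io=read_io, container=container)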

Can I write a newly instantiated container to two different files?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
HDMF does not allow you to write a container that was not read from a file to two different files. For example, if you
instantiate container A and write it to file 1 and then try to write it to file 2, an error will be raised. However,
you can read container A from file 1 and then export it to file 2, with or without modifications to container A in
memory.
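
A sketch of this pattern, where ``MyContainer`` and the file names are placeholders:

.. code-block:: python

    container = MyContainer(...)  # placeholder container type
    with HDF5IO('file1.h5', manager=manager, mode='w') as write_io:
        write_io.write(container)

    # Writing the same in-memory container to a second file would raise an error.
    # Instead, read the container back from file 1 and export it to file 2.
    with HDF5IO('file1.h5', manager=manager, mode='r') as read_io:
        with HDF5IO('file2.h5', mode='w') as export_io:
            export_io.export(src_io=read_io)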

What happens to links when I export?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The exported file will not contain any links to the original file.

All links, including internal links (i.e., HDF5 soft links) and links to other files (i.e., HDF5 external links),
will be preserved in the exported file.

If a link to an :py:class:`h5py.Dataset <h5py.Dataset>` in another file is added to the in-memory container after
it is read from a file, and the container is then exported, then by default the export process will create an
external link to the existing :py:class:`h5py.Dataset <h5py.Dataset>` object. To instead copy the data from the
:py:class:`h5py.Dataset <h5py.Dataset>` in the other file to the exported file, pass the keyword argument
``write_args={'link_data': False}`` to
:py:meth:`HDF5IO.export <hdmf.backends.hdf5.h5tools.HDF5IO.export>`. This is similar to passing the keyword argument
``link_data=False`` to :py:meth:`HDF5IO.write <hdmf.backends.hdf5.h5tools.HDF5IO.write>` to write a file with copies
of externally linked datasets.
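
For example, a minimal sketch using placeholder file names:

.. code-block:: python

    with HDF5IO('data.h5', manager=manager, mode='r') as read_io:
        with HDF5IO('export.h5', mode='w') as export_io:
            # copy data from externally linked datasets into the exported file
            # instead of creating external links to them
            export_io.export(src_io=read_io, write_args={'link_data': False})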

What happens to references when I export?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
References will be preserved in the exported file.

.. note::

   Exporting a file involves loading into memory all datasets that contain references, as well as all attributes
   that are references. The HDF5 reference IDs within an exported file may differ from the reference IDs in the
   original file.

What happens to object IDs when I export?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After exporting a container, the object IDs of the container and its child containers will be identical to the object
IDs of the read container and its child containers. The object ID of a container uniquely identifies the container
within a file, but should *not* be used to distinguish between two different files.

If you would like all object IDs to change on export, then first call the method
:py:meth:`generate_new_id <hdmf.container.AbstractContainer.generate_new_id>` on the root container to generate
a new set of IDs for the root container and all of its children, recursively. Then export the container with its
new IDs. Note: calling the :py:meth:`generate_new_id <hdmf.container.AbstractContainer.generate_new_id>` method
changes the object IDs of the containers in memory. These changes are not reflected in the original file from
which the containers were read unless the :py:meth:`HDF5IO.write <hdmf.backends.hdf5.h5tools.HDF5IO.write>`
method is subsequently called.
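
A sketch of this pattern, using placeholder file names:

.. code-block:: python

    with HDF5IO('data.h5', manager=manager, mode='r') as read_io:
        container = read_io.read()
        # recursively assign new object IDs to the container and all of its
        # children; this changes only the in-memory objects, not the original file
        container.generate_new_id()
        with HDF5IO('export.h5', mode='w') as export_io:
            export_io.export(src_io=read_io, container=container)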
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -38,6 +38,7 @@ with the intention of providing it as an open-source tool for other scientific c
    extensions
    building_api
    validation
+   export
    api_docs
    software_process
    make_roundtrip_test
8 changes: 4 additions & 4 deletions requirements-dev.txt
@@ -1,5 +1,5 @@
-codecov==2.1.3
-coverage==5.1
-flake8==3.7.9
+codecov==2.1.8
+coverage==5.2
+flake8==3.8.3
 python-dateutil==2.8.1
-tox==3.14.3
+tox==3.17.1
4 changes: 2 additions & 2 deletions requirements.txt
@@ -1,5 +1,5 @@
 h5py==2.10.0
-numpy==1.18.1
+numpy==1.18.5
 scipy==1.4.1
 pandas==0.25.3
-ruamel.yaml==0.16.5
+ruamel.yaml==0.16.10
25 changes: 18 additions & 7 deletions src/hdmf/backends/hdf5/h5_utils.py
@@ -67,6 +67,16 @@ def invert(self):
         self.__inverted = cls(**kwargs)
         return self.__inverted

+    def _get_ref(self, ref):
+        return self.get_object(self.dataset.file[ref])
+
+    def __iter__(self):
+        for ref in super().__iter__():
+            yield self._get_ref(ref)
+
+    def __next__(self):
+        return self._get_ref(super().__next__())
+

class BuilderResolverMixin(BuilderResolver):
"""
@@ -108,7 +118,7 @@ def __init__(self, **kwargs):
             if t is RegionReference:
                 self.__refgetters[i] = self.__get_regref
             elif t is Reference:
-                self.__refgetters[i] = self.__get_ref
+                self.__refgetters[i] = self._get_ref
         self.__types = types
         tmp = list()
         for i in range(len(self.dataset.dtype)):
@@ -152,25 +162,26 @@ def __swap_refs(self, row):
             getref = self.__refgetters[i]
             row[i] = getref(row[i])

-    def __get_ref(self, ref):
-        return self.get_object(self.dataset.file[ref])
-
     def __get_regref(self, ref):
-        obj = self.__get_ref(ref)
+        obj = self._get_ref(ref)
         return obj[ref]

     def resolve(self, manager):
         return self[0:len(self)]

-    def __iter__(self):
-        for i in range(len(self)):
-            yield self[i]
-

 class AbstractH5ReferenceDataset(DatasetOfReferences):

     def __getitem__(self, arg):
         ref = super().__getitem__(arg)
         if isinstance(ref, np.ndarray):
-            return [self.get_object(self.dataset.file[x]) for x in ref]
+            return [self._get_ref(x) for x in ref]
         else:
-            return self.get_object(self.dataset.file[ref])
+            return self._get_ref(ref)

     @property
     def dtype(self):