Merge pull request #382 from hdmf-dev/hdmf_2.0
Co-authored-by: Oliver Ruebel <[email protected]>
Co-authored-by: Andrew Tritt <[email protected]>
3 people authored Jul 17, 2020
2 parents b39bada + 74ca76b commit c258390
Showing 20 changed files with 1,976 additions and 316 deletions.
21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,26 @@
# HDMF Changelog

## HDMF 2.0.0 (July 17, 2020)

### New features
- Users can now call `HDF5IO.export` and `HDF5IO.export_io` to write data that was read from one source to a new HDF5
file. Developers can implement the `export` method in classes that extend `HDMFIO` to customize the export
functionality. See https://hdmf.readthedocs.io/en/latest/export.html for more details. @rly (#388)
- Users can use the new export functionality to read data from one source, modify the data in-memory, and then write the
modified data to a new file. Modifications can include additions and removals. To facilitate removals,
`AbstractContainer` contains a new `_remove_child` method and `BuildManager` contains a new `purge_outdated` method.
@rly (#388)
- Users can now call `Container.generate_new_id` to generate new object IDs for the container and all of its children.
@rly (#401)

### Breaking changes
- `Builder` objects no longer have the `written` field, which was used by `HDF5IO` to mark the object as written. This
is replaced by `HDF5IO.get_written`. @rly (#381)
- `HDMFIO.write` and `HDMFIO.write_builder` no longer have the keyword argument `exhaust_dcis`. This remains present in
`HDF5IO.write` and `HDF5IO.write_builder`. @rly (#388)
- The class method `HDF5IO.copy_file` is no longer supported and may be removed in a future version. Please use the
`HDF5IO.export` method or `h5py.File.copy` method instead. @rly (#388)

## HDMF 1.6.4 (June 26, 2020)

### Internal improvements
6 changes: 3 additions & 3 deletions docs/source/conf.py
@@ -70,9 +70,9 @@

 intersphinx_mapping = {
     'python': ('https://docs.python.org/3.8', None),
-    'numpy': ('http://docs.scipy.org/doc/numpy/', None),
-    'scipy': ('http://docs.scipy.org/doc/scipy/reference', None),
-    'matplotlib': ('http://matplotlib.org', None),
+    'numpy': ('https://numpy.org/doc/stable/', None),
+    'scipy': ('https://docs.scipy.org/doc/scipy/reference', None),
+    'matplotlib': ('https://matplotlib.org', None),
     'h5py': ('http://docs.h5py.org/en/latest/', None),
     'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None),
 }
93 changes: 93 additions & 0 deletions docs/source/export.rst
@@ -0,0 +1,93 @@
Export
======

Export is a new feature in HDMF 2.0. You can use export to take a container that was read from a file and write it to
a different file, with or without modifications to the container in memory.
The in-memory container being exported will be written to the exported file as if it had never been read from a file.

To export a container, first read the container from a file, then create a new
:py:class:`~hdmf.backends.hdf5.h5tools.HDF5IO` object for exporting the data, then call
:py:meth:`~hdmf.backends.hdf5.h5tools.HDF5IO.export` on the
:py:class:`~hdmf.backends.hdf5.h5tools.HDF5IO` object, passing in the IO object used to read the container
and, optionally, the container itself, which may be modified in memory between reading and exporting.

For example:

.. code-block:: python

    with HDF5IO(self.read_path, manager=manager, mode='r') as read_io:
        with HDF5IO(self.export_path, mode='w') as export_io:
            export_io.export(src_io=read_io)

FAQ
---

Can I read a container from disk, modify it, and then export the modified container?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Yes, you can export the in-memory container after modifying it in memory. The modifications will appear in the
exported file, not in the file that was read.

- If the modifications are removals or additions of containers, then no special action is needed, as long as the
  container hierarchy is updated correspondingly.
- If the modifications are changes to attributes, then
  :py:meth:`Container.set_modified() <hdmf.container.AbstractContainer.set_modified>` must be called
  on the container before exporting, as shown in the sketch below.
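
The following is a minimal sketch of this workflow, assuming the container has a settable ``description``
attribute; the file names and the attribute are placeholders, not part of the HDMF API:

.. code-block:: python

    with HDF5IO('data.h5', manager=manager, mode='r') as read_io:
        container = read_io.read()
        container.description = 'updated description'  # placeholder attribute
        container.set_modified()  # flag the attribute change so it is exported
        with HDF5IO('export.h5', mode='w') as export_io:
            export_io.export(src_io=read_io, container=container)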

.. note::

   Modifications to :py:class:`h5py.Dataset <h5py.Dataset>` objects act *directly* on the read file on disk.
   Changes are applied immediately and do not require exporting or writing the file. If you want to modify a dataset
   only in the new file, then you should replace the whole object with a new array holding the modified data. To
   prevent unintentional changes to the source file, the source file should be opened with ``mode='r'``.

Can I export a newly instantiated container?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
No, you can only export containers that have been read from a file. The ``src_io`` argument is required in
:py:meth:`HDMFIO.export <hdmf.backends.io.HDMFIO.export>`.

Can I read a container from disk and export only part of the container?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It depends. You can only export the root container from a file. To export the root container without certain other
sub-containers in the hierarchy, you can remove those other containers before exporting. However, you cannot export
only a sub-container of the container hierarchy.
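
As a hedged sketch, one way to remove a sub-container before export is the internal
``AbstractContainer._remove_child`` method added in HDMF 2.0; the file names and the choice of child are
placeholders, and internal methods may change in future releases:

.. code-block:: python

    with HDF5IO('data.h5', manager=manager, mode='r') as read_io:
        container = read_io.read()
        unwanted = container.children[0]  # placeholder: select the sub-container to remove
        container._remove_child(unwanted)  # internal API; use with care
        with HDF5IO('export.h5', mode='w') as export_io:
            export_io.export(src_io=read_io, container=container)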

Can I write a newly instantiated container to two different files?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
HDMF does not allow you to write a container that was not read from a file to two different files. For example, if you
instantiate container A and write it to file 1 and then try to write it to file 2, an error will be raised. However,
you can read container A from file 1 and then export it to file 2, with or without modifications to container A in
memory.
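
A sketch of this pattern, where ``MyContainer`` and the file names are placeholders:

.. code-block:: python

    container = MyContainer(...)  # placeholder container type
    with HDF5IO('file1.h5', manager=manager, mode='w') as write_io:
        write_io.write(container)

    # Writing the same in-memory container to a second file would raise an error.
    # Instead, read the container back from file 1 and export it to file 2.
    with HDF5IO('file1.h5', manager=manager, mode='r') as read_io:
        with HDF5IO('file2.h5', mode='w') as export_io:
            export_io.export(src_io=read_io)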

What happens to links when I export?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The exported file will not contain any links to the original file.

All links, including internal links (i.e., HDF5 soft links) and links to other files (i.e., HDF5 external links),
will be preserved in the exported file.

If a link to an :py:class:`h5py.Dataset <h5py.Dataset>` in another file is added to the in-memory container after
it is read from a file, and the container is then exported, then by default the export process will create an
external link to the existing :py:class:`h5py.Dataset <h5py.Dataset>` object. To instead copy the data from the
:py:class:`h5py.Dataset <h5py.Dataset>` in the other file to the exported file, pass the keyword argument
``write_args={'link_data': False}`` to
:py:meth:`HDF5IO.export <hdmf.backends.hdf5.h5tools.HDF5IO.export>`. This is similar to passing the keyword argument
``link_data=False`` to :py:meth:`HDF5IO.write <hdmf.backends.hdf5.h5tools.HDF5IO.write>` to write a file with copies
of externally linked datasets.
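
For example, a minimal sketch using placeholder file names:

.. code-block:: python

    with HDF5IO('data.h5', manager=manager, mode='r') as read_io:
        with HDF5IO('export.h5', mode='w') as export_io:
            # copy data from externally linked datasets into the exported file
            # instead of creating external links to them
            export_io.export(src_io=read_io, write_args={'link_data': False})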

What happens to references when I export?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
References will be preserved in the exported file.

.. note::

   Exporting a file involves loading into memory all datasets that contain references, as well as all attributes
   that are references. The HDF5 reference IDs within an exported file may differ from the reference IDs in the
   original file.

What happens to object IDs when I export?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After exporting a container, the object IDs of the container and its child containers will be identical to the object
IDs of the read container and its child containers. The object ID of a container uniquely identifies the container
within a file, but should *not* be used to distinguish between two different files.

If you would like all object IDs to change on export, then first call the method
:py:meth:`generate_new_id <hdmf.container.AbstractContainer.generate_new_id>` on the root container to generate
a new set of IDs for the root container and all of its children, recursively. Then export the container with its
new IDs. Note: calling the :py:meth:`generate_new_id <hdmf.container.AbstractContainer.generate_new_id>` method
changes the object IDs of the containers in memory. These changes are not reflected in the original file from
which the containers were read unless the :py:meth:`HDF5IO.write <hdmf.backends.hdf5.h5tools.HDF5IO.write>`
method is subsequently called.
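
A sketch of this pattern, using placeholder file names:

.. code-block:: python

    with HDF5IO('data.h5', manager=manager, mode='r') as read_io:
        container = read_io.read()
        # recursively assign new object IDs to the container and all of its
        # children; this changes only the in-memory objects, not the original file
        container.generate_new_id()
        with HDF5IO('export.h5', mode='w') as export_io:
            export_io.export(src_io=read_io, container=container)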
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -38,6 +38,7 @@ with the intention of providing it as an open-source tool for other scientific c
    extensions
    building_api
    validation
+   export
    api_docs
    software_process
    make_roundtrip_test
8 changes: 4 additions & 4 deletions requirements-dev.txt
@@ -1,5 +1,5 @@
-codecov==2.1.3
-coverage==5.1
-flake8==3.7.9
+codecov==2.1.8
+coverage==5.2
+flake8==3.8.3
 python-dateutil==2.8.1
-tox==3.14.3
+tox==3.17.1
4 changes: 2 additions & 2 deletions requirements.txt
@@ -1,5 +1,5 @@
 h5py==2.10.0
-numpy==1.18.1
+numpy==1.18.5
 scipy==1.4.1
 pandas==0.25.3
-ruamel.yaml==0.16.5
+ruamel.yaml==0.16.10
25 changes: 18 additions & 7 deletions src/hdmf/backends/hdf5/h5_utils.py
@@ -67,6 +67,16 @@ def invert(self):
         self.__inverted = cls(**kwargs)
         return self.__inverted

+    def _get_ref(self, ref):
+        return self.get_object(self.dataset.file[ref])
+
+    def __iter__(self):
+        for ref in super().__iter__():
+            yield self._get_ref(ref)
+
+    def __next__(self):
+        return self._get_ref(super().__next__())
+

class BuilderResolverMixin(BuilderResolver):
"""
@@ -108,7 +118,7 @@ def __init__(self, **kwargs):
             if t is RegionReference:
                 self.__refgetters[i] = self.__get_regref
             elif t is Reference:
-                self.__refgetters[i] = self.__get_ref
+                self.__refgetters[i] = self._get_ref
         self.__types = types
         tmp = list()
         for i in range(len(self.dataset.dtype)):
@@ -152,25 +162,26 @@ def __swap_refs(self, row):
             getref = self.__refgetters[i]
             row[i] = getref(row[i])

-    def __get_ref(self, ref):
-        return self.get_object(self.dataset.file[ref])
-
     def __get_regref(self, ref):
-        obj = self.__get_ref(ref)
+        obj = self._get_ref(ref)
         return obj[ref]

     def resolve(self, manager):
         return self[0:len(self)]

-    def __iter__(self):
-        for i in range(len(self)):
-            yield self[i]
-

 class AbstractH5ReferenceDataset(DatasetOfReferences):

     def __getitem__(self, arg):
         ref = super().__getitem__(arg)
         if isinstance(ref, np.ndarray):
-            return [self.get_object(self.dataset.file[x]) for x in ref]
+            return [self._get_ref(x) for x in ref]
         else:
-            return self.get_object(self.dataset.file[ref])
+            return self._get_ref(ref)

     @property
     def dtype(self):