Commit

Merge branch 'dev' into postinit

mavaylon1 authored Apr 10, 2024
2 parents 4799bb9 + d85d0cb commit 8f991a7
Showing 6 changed files with 185 additions and 31 deletions.
27 changes: 14 additions & 13 deletions CHANGELOG.md
@@ -4,6 +4,7 @@

### Enhancements
- Added `TermSetConfigurator` to automatically wrap fields with `TermSetWrapper` according to a configuration file. @mavaylon1 [#1016](https://github.com/hdmf-dev/hdmf/pull/1016)
- Updated `TermSetWrapper` to support validating a single field within a compound array. @mavaylon1 [#1061](https://github.com/hdmf-dev/hdmf/pull/1061)

## HDMF 3.13.0 (March 20, 2024)

@@ -138,8 +139,8 @@ will increase the minor version number to 3.10.0. See the 3.9.1 release notes be
## HDMF 3.6.0 (May 12, 2023)

### New features and minor improvements
- Updated `ExternalResources` to have `FileTable` and new methods to query data. the `ResourceTable` has been removed along with methods relating to `Resource`. @mavaylon [#850](https://github.com/hdmf-dev/hdmf/pull/850)
- Updated hdmf-common-schema version to 1.6.0. @mavaylon [#850](https://github.com/hdmf-dev/hdmf/pull/850)
- Updated `ExternalResources` to have `FileTable` and new methods to query data. The `ResourceTable` has been removed along with methods relating to `Resource`. @mavaylon1 [#850](https://github.com/hdmf-dev/hdmf/pull/850)
- Updated hdmf-common-schema version to 1.6.0. @mavaylon1 [#850](https://github.com/hdmf-dev/hdmf/pull/850)
- Added testing of HDMF-Zarr on PR and nightly. @rly [#859](https://github.com/hdmf-dev/hdmf/pull/859)
- Replaced `setup.py` with `pyproject.toml`. @rly [#844](https://github.com/hdmf-dev/hdmf/pull/844)
- Use `ruff` instead of `flake8`. @rly [#844](https://github.com/hdmf-dev/hdmf/pull/844)
@@ -153,7 +154,7 @@ will increase the minor version number to 3.10.0. See the 3.9.1 release notes be
[#853](https://github.com/hdmf-dev/hdmf/pull/853)

### Documentation and tutorial enhancements:
- Updated `ExternalResources` how to tutorial to include the new features. @mavaylon [#850](https://github.com/hdmf-dev/hdmf/pull/850)
- Updated `ExternalResources` how to tutorial to include the new features. @mavaylon1 [#850](https://github.com/hdmf-dev/hdmf/pull/850)

## HDMF 3.5.6 (April 28, 2023)

@@ -193,13 +194,13 @@ will increase the minor version number to 3.10.0. See the 3.9.1 release notes be

### Bug fixes
- Fixed issue with conda CI. @rly [#823](https://github.com/hdmf-dev/hdmf/pull/823)
- Fixed issue with deprecated `pkg_resources`. @mavaylon [#822](https://github.com/hdmf-dev/hdmf/pull/822)
- Fixed `hdmf.common` deprecation warning. @mavaylon [#826]((https://github.com/hdmf-dev/hdmf/pull/826)
- Fixed issue with deprecated `pkg_resources`. @mavaylon1 [#822](https://github.com/hdmf-dev/hdmf/pull/822)
- Fixed `hdmf.common` deprecation warning. @mavaylon1 [#826](https://github.com/hdmf-dev/hdmf/pull/826)

### Internal improvements
- Fixed a number of typos and added a GitHub Action running codespell to ensure that no typo sneaks in. [#825](https://github.com/hdmf-dev/hdmf/pull/825)
- Added additional documentation for `__fields__` in `AbstactContainer`. @mavaylon [#827](https://github.com/hdmf-dev/hdmf/pull/827)
- Updated warning message for broken links. @mavaylon [#829](https://github.com/hdmf-dev/hdmf/pull/829)
- Added additional documentation for `__fields__` in `AbstractContainer`. @mavaylon1 [#827](https://github.com/hdmf-dev/hdmf/pull/827)
- Updated warning message for broken links. @mavaylon1 [#829](https://github.com/hdmf-dev/hdmf/pull/829)

## HDMF 3.5.1 (January 26, 2023)

@@ -218,9 +219,9 @@ will increase the minor version number to 3.10.0. See the 3.9.1 release notes be
- Added ``HDMFIO.__del__`` to ensure that I/O objects are being closed on delete. @oruebel [#811](https://github.com/hdmf-dev/hdmf/pull/811)

### Minor improvements
- Added support for reading and writing `ExternalResources` to and from denormalized TSV files. @mavaylon [#799](https://github.com/hdmf-dev/hdmf/pull/799)
- Changed the name of `ExternalResources.export_to_sqlite` to `ExternalResources.to_sqlite`. @mavaylon [#799](https://github.com/hdmf-dev/hdmf/pull/799)
- Updated the tutorial for `ExternalResources`. @mavaylon [#799](https://github.com/hdmf-dev/hdmf/pull/799)
- Added support for reading and writing `ExternalResources` to and from denormalized TSV files. @mavaylon1 [#799](https://github.com/hdmf-dev/hdmf/pull/799)
- Changed the name of `ExternalResources.export_to_sqlite` to `ExternalResources.to_sqlite`. @mavaylon1 [#799](https://github.com/hdmf-dev/hdmf/pull/799)
- Updated the tutorial for `ExternalResources`. @mavaylon1 [#799](https://github.com/hdmf-dev/hdmf/pull/799)
- Added `message` argument for assert methods defined by `hdmf.testing.TestCase` to allow developers to include custom error messages with asserts. @oruebel [#812](https://github.com/hdmf-dev/hdmf/pull/812)
- Clarify the expected chunk shape behavior for `DataChunkIterator`. @oruebel [#813](https://github.com/hdmf-dev/hdmf/pull/813)

@@ -361,7 +362,7 @@ the fields (i.e., when the constructor sets some fields to fixed values). @rly
- Plotted results in external resources tutorial. @oruebel (#667)
- Added support for Python 3.10. @rly (#679)
- Updated requirements. @rly @TheChymera (#681)
- Improved testing for `ExternalResources`. @mavaylon (#673)
- Improved testing for `ExternalResources`. @mavaylon1 (#673)
- Improved docs for export. @rly (#674)
- Enhanced data chunk iteration speeds through new ``GenericDataChunkIterator`` class. @CodyCBakerPhD (#672)
- Enhanced issue template forms on GitHub. @CodyCBakerPhD (#700)
@@ -437,7 +438,7 @@ the fields (i.e., when the constructor sets some fields to fixed values). @rly
- Allow passing ``index=True`` to ``DynamicTable.to_dataframe()`` to support returning `DynamicTableRegion` columns
as indices or Pandas DataFrame. @rly (#579)
- Improve ``DynamicTable`` documentation. @rly (#639)
- Updated external resources tutorial. @mavaylon (#611)
- Updated external resources tutorial. @mavaylon1 (#611)

### Breaking changes and deprecations
- Previously, when using ``DynamicTable.__getitem__`` or ``DynamicTable.get`` to access a selection of a
@@ -522,7 +523,7 @@ the fields (i.e., when the constructor sets some fields to fixed values). @rly
- Add experimental namespace to HDMF common schema. New data types should go in the experimental namespace
(hdmf-experimental) prior to being added to the core (hdmf-common) namespace. The purpose of this is to provide
a place to test new data types that may break backward compatibility as they are refined. @ajtritt (#545)
- `ExternalResources` was changed to support storing both names and URIs for resources. @mavaylon (#517, #548)
- `ExternalResources` was changed to support storing both names and URIs for resources. @mavaylon1 (#517, #548)
- The `VocabData` data type was replaced by `EnumData` to provide more flexible support for data from a set of
fixed values.
- Added `AlignedDynamicTable`, which defines a `DynamicTable` that supports storing a collection of sub-tables.
14 changes: 14 additions & 0 deletions docs/gallery/plot_term_set.py
@@ -67,6 +67,7 @@
"""
from hdmf.common import DynamicTable, VectorData
import os
import numpy as np

try:
import linkml_runtime # noqa: F401
@@ -129,6 +130,19 @@
data=TermSetWrapper(value=['Homo sapiens'], termset=terms)
)

######################################################
# Validate Compound Data with TermSetWrapper
# ----------------------------------------------------
# :py:class:`~hdmf.term_set.TermSetWrapper` can also wrap compound data.
# Set the ``field`` argument to the name of the field within the compound
# dtype that should be validated against the termset.
c_data = np.array([('Homo sapiens', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
data = VectorData(
name='species',
description='...',
data=TermSetWrapper(value=c_data, termset=terms, field='species')
)

######################################################
# Validate Attributes with TermSetWrapper
# ----------------------------------------------------
5 changes: 4 additions & 1 deletion src/hdmf/data_utils.py
@@ -20,7 +20,10 @@ def append_data(data, arg):
data.append(arg)
return data
elif isinstance(data, np.ndarray):
return np.append(data, np.expand_dims(arg, axis=0), axis=0)
if len(data.dtype) > 0:  # data is a structured array
return np.append(data, arg)
else: # arg is a scalar or row vector
return np.append(data, np.expand_dims(arg, axis=0), axis=0)
elif isinstance(data, h5py.Dataset):
shape = list(data.shape)
shape[0] += 1
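The new branch in `append_data` distinguishes structured (compound) arrays from plain ones because `np.append` treats the two differently. A minimal standalone sketch of that distinction, in plain NumPy and independent of HDMF:

```python
import numpy as np

# Plain 1-D array: len(dtype) == 0, so appending one scalar row needs
# expand_dims to give the argument a leading axis before np.append.
plain = np.array(['Homo sapiens'])
plain = np.append(plain, np.expand_dims('Mus musculus', axis=0), axis=0)
print(plain.shape)  # (2,)

# Structured array: len(dtype) > 0 (one entry per field), and np.append
# concatenates the records directly, so no expand_dims is needed.
dtype = [('species', 'U50'), ('age', 'i4')]
c_data = np.array([('Homo sapiens', 24)], dtype=dtype)
c_row = np.array([('Mus musculus', 36)], dtype=dtype)
print(len(c_data.dtype))  # 2
c_data = np.append(c_data, c_row)
print([str(s) for s in c_data['species']])  # ['Homo sapiens', 'Mus musculus']
```

`len(data.dtype)` is the number of named fields, so it is 0 for any plain dtype and positive only for structured arrays, which makes it a cheap way to pick the right append path.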
64 changes: 52 additions & 12 deletions src/hdmf/term_set.py
@@ -216,19 +216,26 @@ class TermSetWrapper:
{'name': 'value',
'type': (list, np.ndarray, dict, str, tuple),
'doc': 'The target item that is wrapped, either data or attribute.'},
{'name': 'field', 'type': str, 'default': None,
'doc': 'The field within a compound array.'}
)
def __init__(self, **kwargs):
self.__value = kwargs['value']
self.__termset = kwargs['termset']
self.__field = kwargs['field']
self.__validate()

def __validate(self):
# check if list, tuple, array
if isinstance(self.__value, (list, np.ndarray, tuple)): # TODO: Future ticket on DataIO support
values = self.__value
# create list if none of those -> mostly for attributes
if self.__field is not None:
values = self.__value[self.__field]
else:
values = [self.__value]
# check if list, tuple, array
if isinstance(self.__value, (list, np.ndarray, tuple)):
values = self.__value
# create list if none of those -> mostly for scalar attributes
else:
values = [self.__value]

# iteratively validate
bad_values = []
for term in values:
@@ -243,6 +250,10 @@ def __validate(self):
def value(self):
return self.__value

@property
def field(self):
return self.__field

@property
def termset(self):
return self.__termset
@@ -273,26 +284,55 @@ def __iter__(self):
"""
return self.__value.__iter__()

def __multi_validation(self, data):
"""
append_data includes numpy arrays. This is not the same as list append.
Numpy array append is essentially list extend. Now if a user appends an array (for compound data), we need to
support validating arrays with multiple items. This method is an internal bulk validation
check for numpy arrays and extend.
"""
bad_values = []
for item in data:
if not self.termset.validate(term=item):
bad_values.append(item)
return bad_values

def append(self, arg):
"""
Validate ``arg`` against the termset, then resolve the wrapper and
append ``arg`` to the wrapped value.
"""
if self.termset.validate(term=arg):
self.__value = append_data(self.__value, arg)
if isinstance(arg, np.ndarray):
if self.__field is not None: # compound array
values = arg[self.__field]
else:
msg = "Array needs to be a structured array with compound dtype. If this does not apply, use extend."
raise ValueError(msg)
else:
msg = ('"%s" is not in the term set.' % arg)
values = [arg]

bad_values = self.__multi_validation(values)

if len(bad_values) != 0:
msg = ('"%s" is not in the term set.' % ', '.join([str(value) for value in bad_values]))
raise ValueError(msg)

self.__value = append_data(self.__value, arg)

def extend(self, arg):
"""
Validate each item of ``arg`` against the termset, then resolve the
wrapper and extend the wrapped value with ``arg``.
"""
bad_data = []
for item in arg:
if not self.termset.validate(term=item):
bad_data.append(item)
if isinstance(arg, np.ndarray):
if self.__field is not None: # compound array
values = arg[self.__field]
else:
values = arg
else:
values = arg

bad_data = self.__multi_validation(values)

if len(bad_data) == 0:
self.__value = extend_data(self.__value, arg)
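The field-aware validation path added to `__validate` can be summarized outside HDMF. In this sketch a plain Python set stands in for a `TermSet`; `VALID_TERMS` and `validate_values` are illustration only, not the HDMF API:

```python
import numpy as np

# Hypothetical stand-in for a TermSet: here, validation is set membership.
VALID_TERMS = {'Homo sapiens', 'Mus musculus'}

def validate_values(value, field=None):
    """Mirror the logic of TermSetWrapper.__validate: select the named
    field from a compound array if one was given, otherwise validate the
    value itself (wrapping scalars in a list), and collect failures."""
    if field is not None:
        values = value[field]                    # one column of the records
    elif isinstance(value, (list, np.ndarray, tuple)):
        values = value
    else:
        values = [value]                         # scalar attribute
    return [str(v) for v in values if v not in VALID_TERMS]

c_data = np.array([('Homo sapiens', 24), ('Rattus norvegicus', 8)],
                  dtype=[('species', 'U50'), ('age', 'i4')])
print(validate_values(c_data, field='species'))  # ['Rattus norvegicus']
print(validate_values('Homo sapiens'))           # []
```

The key design point is that only the configured field of a compound array is checked, so other fields (here, `age`) can hold arbitrary non-term data.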
95 changes: 95 additions & 0 deletions tests/unit/common/test_table.py
@@ -220,6 +220,101 @@ def test_add_row_validate_bad_data_all_col(self):
with self.assertRaises(ValueError):
species.add_row(Species_1='bad data', Species_2='bad data')

def test_compound_data_append(self):
c_data = np.array([('Homo sapiens', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
c_data2 = np.array([('Mus musculus', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
compound_vector_data = VectorData(
name='Species_1',
description='...',
data=c_data
)
compound_vector_data.append(c_data2)

np.testing.assert_array_equal(compound_vector_data.data, np.append(c_data, c_data2))

@unittest.skipIf(not REQUIREMENTS_INSTALLED, "optional LinkML module is not installed")
def test_array_append_error(self):
c_data = np.array(['Homo sapiens'])
c_data2 = np.array(['Mus musculus'])

terms = TermSet(term_schema_path='tests/unit/example_test_term_set.yaml')
vectordata_termset = VectorData(
name='Species_1',
description='...',
data=TermSetWrapper(value=c_data, termset=terms)
)

with self.assertRaises(ValueError):
vectordata_termset.append(c_data2)

def test_compound_data_extend(self):
c_data = np.array([('Homo sapiens', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
c_data2 = np.array([('Mus musculus', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
compound_vector_data = VectorData(
name='Species_1',
description='...',
data=c_data
)
compound_vector_data.extend(c_data2)

np.testing.assert_array_equal(compound_vector_data.data, np.vstack((c_data, c_data2)))

@unittest.skipIf(not REQUIREMENTS_INSTALLED, "optional LinkML module is not installed")
def test_add_ref_wrapped_array_append(self):
data = np.array(['Homo sapiens'])
data2 = 'Mus musculus'
terms = TermSet(term_schema_path='tests/unit/example_test_term_set.yaml')
vector_data = VectorData(
name='Species_1',
description='...',
data=TermSetWrapper(value=data, termset=terms)
)
vector_data.append(data2)

np.testing.assert_array_equal(vector_data.data.data, np.append(data, data2))

@unittest.skipIf(not REQUIREMENTS_INSTALLED, "optional LinkML module is not installed")
def test_add_ref_wrapped_array_extend(self):
data = np.array(['Homo sapiens'])
data2 = np.array(['Mus musculus'])
terms = TermSet(term_schema_path='tests/unit/example_test_term_set.yaml')
vector_data = VectorData(
name='Species_1',
description='...',
data=TermSetWrapper(value=data, termset=terms)
)
vector_data.extend(data2)

np.testing.assert_array_equal(vector_data.data.data, np.vstack((data, data2)))

@unittest.skipIf(not REQUIREMENTS_INSTALLED, "optional LinkML module is not installed")
def test_add_ref_wrapped_compound_data_append(self):
c_data = np.array([('Homo sapiens', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
c_data2 = np.array([('Mus musculus', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
terms = TermSet(term_schema_path='tests/unit/example_test_term_set.yaml')
compound_vector_data = VectorData(
name='Species_1',
description='...',
data=TermSetWrapper(value=c_data, field='species', termset=terms)
)
compound_vector_data.append(c_data2)

np.testing.assert_array_equal(compound_vector_data.data.data, np.append(c_data, c_data2))

@unittest.skipIf(not REQUIREMENTS_INSTALLED, "optional LinkML module is not installed")
def test_add_ref_wrapped_compound_data_extend(self):
c_data = np.array([('Homo sapiens', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
c_data2 = np.array([('Mus musculus', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
terms = TermSet(term_schema_path='tests/unit/example_test_term_set.yaml')
compound_vector_data = VectorData(
name='Species_1',
description='...',
data=TermSetWrapper(value=c_data, field='species', termset=terms)
)
compound_vector_data.extend(c_data2)

np.testing.assert_array_equal(compound_vector_data.data.data, np.vstack((c_data, c_data2)))

def test_constructor_bad_columns(self):
columns = ['bad_column']
msg = "'columns' must be a list of dict, VectorData, DynamicTableRegion, or VectorIndex"
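The append path these tests exercise can be sketched without HDMF. `checked_append` below is a hypothetical mimic of the wrapped append for compound data, not the actual `TermSetWrapper` API:

```python
import numpy as np

TERMS = {'Homo sapiens', 'Mus musculus'}
DTYPE = [('species', 'U50'), ('age', 'i4')]

def checked_append(data, arg, field='species'):
    """Validate only the configured field of the incoming records, then
    delegate to np.append, which adds structured records row-wise."""
    bad = [str(v) for v in arg[field] if v not in TERMS]
    if bad:
        raise ValueError('"%s" is not in the term set.' % ', '.join(bad))
    return np.append(data, arg)

data = np.array([('Homo sapiens', 24)], dtype=DTYPE)
data = checked_append(data, np.array([('Mus musculus', 36)], dtype=DTYPE))
print([str(s) for s in data['species']])  # ['Homo sapiens', 'Mus musculus']

# A record whose species is not in the term set is rejected before any
# mutation, matching the ValueError the tests above assert on.
try:
    checked_append(data, np.array([('Rattus norvegicus', 8)], dtype=DTYPE))
except ValueError as err:
    print(err)
```

Because validation happens before `np.append`, a failed append leaves the wrapped data unchanged, which is what makes the `assertRaises(ValueError)` tests safe to run against shared fixtures.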
11 changes: 6 additions & 5 deletions tests/unit/test_term_set.py
@@ -155,21 +155,22 @@ def setUp(self):
self.wrapped_array = TermSetWrapper(value=np.array(['Homo sapiens']), termset=self.termset)
self.wrapped_list = TermSetWrapper(value=['Homo sapiens'], termset=self.termset)

c_data = np.array([('Homo sapiens', 24)], dtype=[('species', 'U50'), ('age', 'i4')])
self.wrapped_comp_array = TermSetWrapper(value=c_data,
termset=self.termset,
field='species')

self.np_data = VectorData(
name='Species_1',
description='...',
data=self.wrapped_array
)
self.list_data = VectorData(
name='Species_1',
description='...',
data=self.wrapped_list
)

def test_properties(self):
self.assertEqual(self.wrapped_array.value, ['Homo sapiens'])
self.assertEqual(self.wrapped_array.termset.view_set, self.termset.view_set)
self.assertEqual(self.wrapped_array.dtype, 'U12') # this covers __getattr__
self.assertEqual(self.wrapped_comp_array.field, 'species')

def test_get_item(self):
self.assertEqual(self.np_data.data[0], 'Homo sapiens')
