Automatically convert inconsistent data types #715

azngeek · 2023-10-25T16:08:24Z

Feature description

Data integration often faces the challenge of inconsistent data types for specific keys within JSON payloads. For instance, the value for a given key might be a dictionary in one record and a list in another, leading to complexities in data processing. The absence of uniform data types forces developers to write additional logic to handle these inconsistencies.

Are you a dlt user?

Yes, I use it for fun.

Use case

This feature is especially useful for developers who work with dynamic JSON data, which could come from various sources like APIs, databases, or data lakes.

Especially when the data source was XML, where it is common to have multiple nodes with the same name, like in the following example:

Example 1 - Single node

<root>
  <element>
    <subelement>Value1</subelement>
  </element>
</root>

gets converted to a string

{
  "root": {
    "element": {
      "subelement": "Value1"
    }
  }
}

Example 2 - multiple nodes

<root>
  <element>
    <subelement>Value2</subelement>
    <subelement>Value3</subelement>
  </element>
</root>

gets converted to a list

{
  "root": {
    "element": {
      "subelement": ["Value2", "Value3"]
    }
  }
}

It provides a straightforward way to standardize the data before performing operations like data transformation, analytics, or loading into databases.

Proposed solution

Introduce a universal function that traverses through JSON data recursively to identify and normalize all inconsistent data types. Specifically, values that sometimes appear as a dict and sometimes as a list should be converted to a list.

I could see that there i a test in

dlt/tests/common/normalizers/test_json_relational.py

Line 576 in 27cc8b6

def test_wrap_in_dict(norm: RelationalNormalizer) -> None:

which partially covers this topic but i am more looking for a solution, which automatically can do that.

def identify_inconsistent_keys(json_data):
    """
    Traverse the JSON data to identify keys with inconsistent value types.
    Return a dictionary where each key is a JSON key path and the value is a set of types observed for that key.
    """
    key_type_mapping = defaultdict(set)

    def traverse_json(json_obj, path=""):
        """
        Helper function to traverse the JSON object and populate key_type_mapping.
        """
        if isinstance(json_obj, dict):
            for k, v in json_obj.items():
                new_path = f"{path}.{k}" if path else k
                key_type_mapping[new_path].add(type(v).__name__)
                traverse_json(v, new_path)
        elif isinstance(json_obj, list):
            for item in json_obj:
                traverse_json(item, path)

    # Traverse the JSON data
    for item in json_data:
        traverse_json(item)

    # Identify and return keys with inconsistent types
    return {k: v for k, v in key_type_mapping.items() if len(v) > 1}
    
    
    def normalize_to_list(json_obj, inconsistent_keys, path=""):
    """
    Recursively traverse the JSON object to normalize inconsistent types.
    If a key has a dict value in one instance and a list value in another,
    convert the dict to a list containing that single dict.
    """
    if isinstance(json_obj, dict):
        for key, value in json_obj.items():
            new_path = f"{path}.{key}" if path else key
            # If the value itself is a dictionary or a list, recurse into it
            if isinstance(value, dict):
                json_obj[key] = normalize_to_list(value, inconsistent_keys, new_path)
            elif isinstance(value, list):
                json_obj[key] = [normalize_to_list(v, inconsistent_keys, new_path) for v in value]

            # Check if the key has inconsistent types across the JSON data
            if new_path in inconsistent_keys:
                # If the value is a dict but sometimes appears as a list, convert it to a list
                if isinstance(value, dict) and 'list' in inconsistent_keys[new_path]:
                    json_obj[key] = [value]
        return json_obj
    elif isinstance(json_obj, list):
        return [normalize_to_list(item, inconsistent_keys, path) for item in json_obj]
    else:
        return json_obj

Test Dataset before

[
    {
        "key1": "value1",
        "key2": {
            "subkey1": "subvalue1"
        },
        "key3": [
            {"subkey2": "subvalue2"},
            {"subkey2": "subvalue3"}
        ]
    },
    {
        "key1": "value2",
        "key2": [
            {"subkey1": "subvalue4"},
            {"subkey1": "subvalue5"}
        ],
        "key3": {"subkey2": "subvalue6"}
    }
]

inconsistent_keys = identify_inconsistent_keys(data)
normalized_data = [normalize_to_list(item, inconsistent_keys) for item in data]

Test Dataset after

[
    {
        "key1": "value1",
        "key2": [
            {"subkey1": "subvalue1"}
        ],
        "key3": [
            {"subkey2": "subvalue2"},
            {"subkey2": "subvalue3"}
        ]
    },
    {
        "key1": "value2",
        "key2": [
            {"subkey1": "subvalue4"},
            {"subkey1": "subvalue5"}
        ],
        "key3": [
            {"subkey2": "subvalue6"}
        ]
    }
]

Related issues

I could not find one.

rudolfix · 2023-10-26T11:50:04Z

thanks for this! we have a related issue here: #67
when implemented, it will correct some of the issues you fix in the code above: it makes the list to take precedence over dicts but if several dicts come first, they will be loaded as dicts. in other word this solution does not look ahead.

my take is that we put your code in dlt.sources.helpers.json (with a few others) and do an example in our docs on how to use it

azngeek · 2023-10-26T12:13:49Z

@rudolfix Thanks for the update. This looks promising! I think the most important impact in this case would be an extended documentation, as currently it seems like there is some "magic" working in the background.

As an engineer you could debug and code around those issues but if your background is more non-technical, it would be helpful to understand more about what to expect when working with different data types.

github-project-automation bot added this to dlt core library Oct 25, 2023

github-project-automation bot moved this to Todo in dlt core library Oct 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatically convert inconsistent data types #715

Automatically convert inconsistent data types #715

azngeek commented Oct 25, 2023

rudolfix commented Oct 26, 2023

azngeek commented Oct 26, 2023

Automatically convert inconsistent data types #715

Automatically convert inconsistent data types #715

Comments

azngeek commented Oct 25, 2023

Feature description

Are you a dlt user?

Use case

Example 1 - Single node

Example 2 - multiple nodes

Proposed solution

Test Dataset before

Test Dataset after

Related issues

rudolfix commented Oct 26, 2023

azngeek commented Oct 26, 2023