Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically convert inconsistent data types #715

Open
azngeek opened this issue Oct 25, 2023 · 2 comments
Open

Automatically convert inconsistent data types #715

azngeek opened this issue Oct 25, 2023 · 2 comments

Comments

@azngeek
Copy link

azngeek commented Oct 25, 2023

Feature description

Data integration often faces the challenge of inconsistent data types for specific keys within JSON payloads. For instance, the value for a given key might be a dictionary in one record and a list in another, leading to complexities in data processing. The absence of uniform data types forces developers to write additional logic to handle these inconsistencies.

Are you a dlt user?

Yes, I use it for fun.

Use case

This feature is especially useful for developers who work with dynamic JSON data, which could come from various sources like APIs, databases, or data lakes.

Especially when the data source was XML, where it is common to have multiple nodes with the same name, like in the following example:

Example 1 - Single node

<root>
  <element>
    <subelement>Value1</subelement>
  </element>
</root>

gets converted to a string

{
  "root": {
    "element": {
      "subelement": "Value1"
    }
  }
}

Example 2 - multiple nodes

<root>
  <element>
    <subelement>Value2</subelement>
    <subelement>Value3</subelement>
  </element>
</root>

gets converted to a list

{
  "root": {
    "element": {
      "subelement": ["Value2", "Value3"]
    }
  }
}

It provides a straightforward way to standardize the data before performing operations like data transformation, analytics, or loading into databases.

Proposed solution

Introduce a universal function that traverses through JSON data recursively to identify and normalize all inconsistent data types. Specifically, values that sometimes appear as a dict and sometimes as a list should be converted to a list.

I could see that there i a test in

def test_wrap_in_dict(norm: RelationalNormalizer) -> None:
which partially covers this topic but i am more looking for a solution, which automatically can do that.

def identify_inconsistent_keys(json_data):
    """
    Traverse the JSON data to identify keys with inconsistent value types.
    Return a dictionary where each key is a JSON key path and the value is a set of types observed for that key.
    """
    key_type_mapping = defaultdict(set)

    def traverse_json(json_obj, path=""):
        """
        Helper function to traverse the JSON object and populate key_type_mapping.
        """
        if isinstance(json_obj, dict):
            for k, v in json_obj.items():
                new_path = f"{path}.{k}" if path else k
                key_type_mapping[new_path].add(type(v).__name__)
                traverse_json(v, new_path)
        elif isinstance(json_obj, list):
            for item in json_obj:
                traverse_json(item, path)

    # Traverse the JSON data
    for item in json_data:
        traverse_json(item)

    # Identify and return keys with inconsistent types
    return {k: v for k, v in key_type_mapping.items() if len(v) > 1}
    
    
    def normalize_to_list(json_obj, inconsistent_keys, path=""):
    """
    Recursively traverse the JSON object to normalize inconsistent types.
    If a key has a dict value in one instance and a list value in another,
    convert the dict to a list containing that single dict.
    """
    if isinstance(json_obj, dict):
        for key, value in json_obj.items():
            new_path = f"{path}.{key}" if path else key
            # If the value itself is a dictionary or a list, recurse into it
            if isinstance(value, dict):
                json_obj[key] = normalize_to_list(value, inconsistent_keys, new_path)
            elif isinstance(value, list):
                json_obj[key] = [normalize_to_list(v, inconsistent_keys, new_path) for v in value]

            # Check if the key has inconsistent types across the JSON data
            if new_path in inconsistent_keys:
                # If the value is a dict but sometimes appears as a list, convert it to a list
                if isinstance(value, dict) and 'list' in inconsistent_keys[new_path]:
                    json_obj[key] = [value]
        return json_obj
    elif isinstance(json_obj, list):
        return [normalize_to_list(item, inconsistent_keys, path) for item in json_obj]
    else:
        return json_obj

Test Dataset before

[
    {
        "key1": "value1",
        "key2": {
            "subkey1": "subvalue1"
        },
        "key3": [
            {"subkey2": "subvalue2"},
            {"subkey2": "subvalue3"}
        ]
    },
    {
        "key1": "value2",
        "key2": [
            {"subkey1": "subvalue4"},
            {"subkey1": "subvalue5"}
        ],
        "key3": {"subkey2": "subvalue6"}
    }
]

inconsistent_keys = identify_inconsistent_keys(data)
normalized_data = [normalize_to_list(item, inconsistent_keys) for item in data]

Test Dataset after

[
    {
        "key1": "value1",
        "key2": [
            {"subkey1": "subvalue1"}
        ],
        "key3": [
            {"subkey2": "subvalue2"},
            {"subkey2": "subvalue3"}
        ]
    },
    {
        "key1": "value2",
        "key2": [
            {"subkey1": "subvalue4"},
            {"subkey1": "subvalue5"}
        ],
        "key3": [
            {"subkey2": "subvalue6"}
        ]
    }
]

Related issues

I could not find one.

@rudolfix
Copy link
Collaborator

thanks for this! we have a related issue here: #67
when implemented, it will correct some of the issues you fix in the code above: it makes the list to take precedence over dicts but if several dicts come first, they will be loaded as dicts. in other word this solution does not look ahead.

my take is that we put your code in dlt.sources.helpers.json (with a few others) and do an example in our docs on how to use it

@azngeek
Copy link
Author

azngeek commented Oct 26, 2023

@rudolfix Thanks for the update. This looks promising! I think the most important impact in this case would be an extended documentation, as currently it seems like there is some "magic" working in the background.

As an engineer you could debug and code around those issues but if your background is more non-technical, it would be helpful to understand more about what to expect when working with different data types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

No branches or pull requests

2 participants