Data integration often faces the challenge of inconsistent data types for specific keys within JSON payloads. For instance, the value for a given key might be a dictionary in one record and a list in another, leading to complexities in data processing. The absence of uniform data types forces developers to write additional logic to handle these inconsistencies.
Are you a dlt user?
Yes, I use it for fun.
Use case
This feature is especially useful for developers who work with dynamic JSON data, which could come from various sources like APIs, databases, or data lakes.
This is especially true when the data source is XML, where it is common to have multiple nodes with the same name, as in the following example:
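To make the XML case concrete, here is a minimal sketch using only the standard library (the tag names are invented for illustration) of the common XML-to-dict convention that produces a single value for one node but a list for repeated nodes:

```python
import xml.etree.ElementTree as ET

def xml_to_dict(element):
    """Naive XML -> dict conversion following the common convention:
    a tag that appears once becomes a single value, a tag that appears
    several times becomes a list -- which is exactly the inconsistency
    described above."""
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        value = xml_to_dict(child)
        if child.tag in result:
            # second occurrence of the same tag: promote to a list
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

single = ET.fromstring("<order><item>apple</item></order>")
multiple = ET.fromstring("<order><item>apple</item><item>pear</item></order>")
print(xml_to_dict(single))    # {'item': 'apple'}
print(xml_to_dict(multiple))  # {'item': ['apple', 'pear']}
```

Depending on the upstream data, the same key path therefore arrives sometimes as a scalar/dict and sometimes as a list.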
It provides a straightforward way to standardize the data before performing operations like data transformation, analytics, or loading into databases.
Proposed solution
Introduce a universal function that traverses through JSON data recursively to identify and normalize all inconsistent data types. Specifically, values that sometimes appear as a dict and sometimes as a list should be converted to a list.
There is an existing test in dlt/tests/common/normalizers/test_json_relational.py which partially covers this topic, but I am looking for a solution that can do this automatically.
```python
from collections import defaultdict


def identify_inconsistent_keys(json_data):
    """Traverse the JSON data to identify keys with inconsistent value types.

    Return a dictionary where each key is a JSON key path and the value is
    the set of type names observed for that key.
    """
    key_type_mapping = defaultdict(set)

    def traverse_json(json_obj, path=""):
        """Helper function to traverse the JSON object and populate key_type_mapping."""
        if isinstance(json_obj, dict):
            for k, v in json_obj.items():
                new_path = f"{path}.{k}" if path else k
                key_type_mapping[new_path].add(type(v).__name__)
                traverse_json(v, new_path)
        elif isinstance(json_obj, list):
            for item in json_obj:
                traverse_json(item, path)

    # Traverse the JSON data
    for item in json_data:
        traverse_json(item)

    # Identify and return keys with inconsistent types
    return {k: v for k, v in key_type_mapping.items() if len(v) > 1}


def normalize_to_list(json_obj, inconsistent_keys, path=""):
    """Recursively traverse the JSON object to normalize inconsistent types.

    If a key has a dict value in one instance and a list value in another,
    convert the dict to a list containing that single dict.
    """
    if isinstance(json_obj, dict):
        for key, value in json_obj.items():
            new_path = f"{path}.{key}" if path else key
            # If the value itself is a dictionary or a list, recurse into it
            if isinstance(value, dict):
                json_obj[key] = normalize_to_list(value, inconsistent_keys, new_path)
            elif isinstance(value, list):
                json_obj[key] = [normalize_to_list(v, inconsistent_keys, new_path) for v in value]
            # Check if the key has inconsistent types across the JSON data
            if new_path in inconsistent_keys:
                # If the value is a dict but sometimes appears as a list, convert it to a list
                if isinstance(value, dict) and "list" in inconsistent_keys[new_path]:
                    json_obj[key] = [value]
        return json_obj
    elif isinstance(json_obj, list):
        return [normalize_to_list(item, inconsistent_keys, path) for item in json_obj]
    else:
        return json_obj
```
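As a quick sanity check of the two-pass approach, here is a condensed, self-contained restatement of the same logic together with a small usage example; the function names and the sample records are invented for illustration:

```python
from collections import defaultdict

def find_inconsistent_paths(records):
    """Pass 1: record the set of value type names seen at every key path."""
    seen = defaultdict(set)

    def walk(obj, path=""):
        if isinstance(obj, dict):
            for key, value in obj.items():
                p = f"{path}.{key}" if path else key
                seen[p].add(type(value).__name__)
                walk(value, p)
        elif isinstance(obj, list):
            for item in obj:
                walk(item, path)

    for record in records:
        walk(record)
    return {path: types for path, types in seen.items() if len(types) > 1}

def to_list(obj, inconsistent, path=""):
    """Pass 2: wrap dict values in a one-element list wherever the same
    path was also observed as a list elsewhere in the data."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            p = f"{path}.{key}" if path else key
            obj[key] = to_list(value, inconsistent, p)
            if isinstance(obj[key], dict) and "list" in inconsistent.get(p, ()):
                obj[key] = [obj[key]]
        return obj
    if isinstance(obj, list):
        return [to_list(item, inconsistent, path) for item in obj]
    return obj

records = [
    {"person": {"name": "Ann"}},                      # "person" is a dict here ...
    {"person": [{"name": "Bob"}, {"name": "Cleo"}]},  # ... and a list here
]
inconsistent = find_inconsistent_paths(records)       # {'person': {'dict', 'list'}}
normalized = [to_list(r, inconsistent) for r in records]
print(normalized[0])  # {'person': [{'name': 'Ann'}]}
```

Note that this is a two-pass design: the whole dataset must be scanned before normalizing, which is what lets it "look ahead" at types seen in later records.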
thanks for this! we have a related issue here: #67
when implemented, it will correct some of the issues you fix in the code above: it makes lists take precedence over dicts, but if several dicts come first, they will be loaded as dicts. in other words, this solution does not look ahead.
my take is that we put your code in dlt.sources.helpers.json (with a few others) and add an example to our docs on how to use it
@rudolfix Thanks for the update. This looks promising! I think the most important impact in this case would be an extended documentation, as currently it seems like there is some "magic" working in the background.
As an engineer you could debug and code around those issues, but if your background is less technical, it would be helpful to understand more about what to expect when working with different data types.
In the XML case mentioned above: a single node (Example 1) gets converted to a string, whereas multiple nodes with the same name (Example 2) get converted to a list.
I could see that there is a test in dlt/tests/common/normalizers/test_json_relational.py (line 576 in 27cc8b6) which partially covers this topic.
Test Dataset before
Test Dataset after
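The actual test datasets did not survive in this page; a hypothetical before/after pair illustrating the intended normalization might look like:

```python
# Hypothetical stand-in for the missing test data: "value" appears as a
# dict in one record and as a list in another.
before = [
    {"id": 1, "value": {"x": 1}},
    {"id": 2, "value": [{"x": 2}, {"x": 3}]},
]

# After normalization, every "value" is a list of dicts; records that were
# already lists are left unchanged.
after = [
    {"id": 1, "value": [{"x": 1}]},
    {"id": 2, "value": [{"x": 2}, {"x": 3}]},
]
```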
Related issues
I could not find one.