Issues with REST API source - response body #2310

Open
barletza-pathid opened this issue Feb 14, 2025 · 10 comments
@barletza-pathid

Hi,

We have two issues with the way dlt (probably) transforms the API response before loading it to the destination (S3):

  1. Properties whose value is null are removed.
  2. The response is loaded without the "result", "data", or "values" key, and we need this key.

Regarding the first issue: in the example response below, "department" is null, and this property is missing at the destination. We read in the docs that there is an option to define a schema, but that is a lot of work for something we need across the board, for every API response.
Regarding the second issue: the data at the destination is saved without the "value" key, and we need it with this key.

Example response:

{
    "@odata.context": "https://graph.microsoft.com/beta/$metadata#users",
    "value": [
        {
            "id": "6e7b768e-07e2-4810-8459-485f84f8f204",
            "deletedDateTime": null,
            "accountEnabled": true,
            "ageGroup": null,
            "businessPhones": [],
            "city": null,
            "createdDateTime": "2017-09-04T15:35:02Z",
            "creationType": null,
            "companyName": null,
            "consentProvidedForMinor": null,
            "country": null,
            "department": null,
            "displayName": "Conf Room Adams",
            "employeeId": null,
            "employeeHireDate": null,
            "employeeLeaveDateTime": null,
            "employeeType": null,
            "faxNumber": null,
            "givenName": null,
            "imAddresses": [],
            "infoCatalogs": [],
            "isLicenseReconciliationNeeded": true,
            "isManagementRestricted": null,
            "isResourceAccount": null,
            "jobTitle": null,
            "legalAgeGroupClassification": null,
            "mail": "[email protected]",
            "mailNickname": "Adams",
            "mobilePhone": null,
            "onPremisesDistinguishedName": null,
            "officeLocation": null,
            "onPremisesDomainName": null,
            "onPremisesImmutableId": null,
            "onPremisesLastSyncDateTime": null,
            "onPremisesObjectIdentifier": null,
            "onPremisesSecurityIdentifier": null,
            "onPremisesSamAccountName": null,
            "onPremisesSyncEnabled": null,
            "onPremisesUserPrincipalName": null,
            "otherMails": [],
            "passwordPolicies": "None",
            "postalCode": null,
            "preferredDataLocation": null,
            "preferredLanguage": null,
            "proxyAddresses": [
                "SMTP:[email protected]"
            ],
            "refreshTokensValidFromDateTime": "2017-09-12T21:08:14Z",
            "securityIdentifier": "S-1-12-1-1853585038-1209010146-1598577028-83032196",
            "showInAddressList": null,
            "signInSessionsValidFromDateTime": "2017-09-12T21:08:14Z",
            "state": null,
            "streetAddress": null,
            "surname": null,
            "usageLocation": null,
            "userPrincipalName": "[email protected]",
            "externalUserConvertedOn": null,
            "externalUserState": null,
            "externalUserStateChangeDateTime": null,
            "userType": "Member",
            "employeeOrgData": null,
            "passwordProfile": null,
            "assignedLicenses": [],
            "assignedPlans": [],
            "authorizationInfo": {
                "certificateUserIds": []
            },
            "cloudRealtimeCommunicationInfo": {
                "isSipEnabled": false
            },
            "deviceKeys": [],
            "identities": [
                {
                    "signInType": "userPrincipalName",
                    "issuer": "M365x214355.onmicrosoft.com",
                    "issuerAssignedId": "[email protected]"
                }
            ],
            "onPremisesExtensionAttributes": {
                "extensionAttribute1": null,
                "extensionAttribute2": null,
                "extensionAttribute3": null,
                "extensionAttribute4": null,
                "extensionAttribute5": null,
                "extensionAttribute6": null,
                "extensionAttribute7": null,
                "extensionAttribute8": null,
                "extensionAttribute9": null,
                "extensionAttribute10": null,
                "extensionAttribute11": null,
                "extensionAttribute12": null,
                "extensionAttribute13": null,
                "extensionAttribute14": null,
                "extensionAttribute15": null
            },
            "onPremisesProvisioningErrors": [],
            "onPremisesSipInfo": {
                "isSipEnabled": false,
                "sipDeploymentLocation": null,
                "sipPrimaryAddress": null
            },
            "provisionedPlans": [],
            "serviceProvisioningErrors": []
        }
    ]
}
@sh-rp
Collaborator

sh-rp commented Feb 17, 2025

Please provide us with some of your pipeline code. Also, if columns that only contain null do not appear in the destination, it is most likely because none of the incoming rows/records has a typed value there, so dlt cannot determine the type of the column and drops it. You can provide hints for this if you like.
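
To illustrate why an all-null column ends up without a type, here is a minimal sketch in plain Python. It is a simplified model of schema inference, not dlt's actual implementation; the function names and the second record are hypothetical. In dlt itself, the corresponding fix would be a column hint (for example a `data_type` entry in the resource's `columns`).

```python
from typing import Any, Dict, Iterable, Optional


def infer_column_types(records: Iterable[Dict[str, Any]]) -> Dict[str, Optional[str]]:
    """Map each column to the first non-null Python type name seen, or None."""
    types: Dict[str, Optional[str]] = {}
    for record in records:
        for column, value in record.items():
            if value is not None and types.get(column) is None:
                types[column] = type(value).__name__
            else:
                # Remember the column, but leave it untyped if nothing was seen yet.
                types.setdefault(column, None)
    return types


def with_hints(
    inferred: Dict[str, Optional[str]], hints: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Fill columns that inference could not type from explicit hints."""
    return {col: (t if t is not None else hints.get(col)) for col, t in inferred.items()}


# Hypothetical records: "department" is null in every row.
records = [
    {"id": "6e7b768e", "department": None, "accountEnabled": True},
    {"id": "a1b2c3d4", "department": None, "accountEnabled": False},
]

inferred = infer_column_types(records)
untyped = [col for col, t in inferred.items() if t is None]
hinted = with_hints(inferred, {"department": "str"})
```

Here `untyped` contains only "department", which is the kind of column a schema-based destination has no type for unless a hint supplies one.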

@barletza-pathid
Author

> Please provide us with some of your pipeline code. Also, if columns that only contain null do not appear in the destination, it is most likely because none of the incoming rows/records has a typed value there, so dlt cannot determine the type of the column and drops it. You can provide hints for this if you like.

Thanks.

The thing is that the schema and the columns are dynamic. We can't know the properties ahead of time, so we need the API response to be loaded to S3 as is, without any transformation or modification, just as if the request had been executed from Postman.

@sh-rp
Collaborator

sh-rp commented Feb 17, 2025

I doubt that we will support columns with unknown types any time in the future, since pretty much all destinations (except for pure JSON files) have schemas. You could set max_table_nesting to 0 though; then the nested "null" columns will be retained.

@sh-rp
Collaborator

sh-rp commented Feb 17, 2025

If you want more help, we really need some of your code.

@sh-rp sh-rp self-assigned this Feb 17, 2025
@barletza-pathid
Author

@sh-rp

We already use max_table_nesting=0

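import dlt
from dlt.destinations import filesystem
from dlt.sources.rest_api import RESTAPIConfig, rest_api_resources

# api_config, run_path, and rest_client_config are defined elsewhere
resource_name = get_resource_name(api_config)
destination = filesystem(bucket_url=run_path)

pipeline = dlt.pipeline(
    destination=destination,
    dataset_name=api_config.resource_config.table_name,
)
source = create_dynamic_source(
    rest_client_config, api_config.resource_config, resource_name
)
load_info = pipeline.run(source())


def create_dynamic_source(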
    rest_client_config: ClientConfig,
    resource_config: ResourceConfig,
    resource_name: str,
) -> SourceFactory:
    """Creates a dynamic source based on the configuration"""

    @dlt.source(name=resource_name, max_table_nesting=0)
    def dynamic_source() -> Iterator:
        rest_config: RESTAPIConfig = {
            "client": rest_client_config,
            "resources": [
                {
                    "name": resource_name,
                    "primary_key": "id",
                    "write_disposition": "replace",
                    "table_name": resource_name,
                    "endpoint": {
                        "path": resource_config.path,
                        "params": resource_config.params,
                    },
                }
            ],
        }
        yield from rest_api_resources(rest_config)

    return dynamic_source

@sh-rp
Collaborator

sh-rp commented Feb 18, 2025

OK, so maybe try adjusting your path setting to include the full result (which was also part of your question). Then you should get the full result, and the nested columns that are null should also be retained. What is path set to in your example?

@barletza-pathid
Author

> OK, so maybe try adjusting your path setting to include the full result (which was also part of your question). Then you should get the full result, and the nested columns that are null should also be retained. What is path set to in your example?

The path is the API endpoint. For example, if the base URL is /api/<provider_api>/v1 and the path is users, then the request URL is /api/<provider_api>/v1/users.

I don't really understand how adjusting the path can help us.

@sh-rp
Collaborator

sh-rp commented Feb 18, 2025

Ah sorry, I meant the "data_selector" which selects which part of the returned json is forwarded into the resource: https://dlthub.com/docs/general-usage/http/rest-client
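
As a rough illustration of what a data selector does to the response from this issue, here is a sketch in plain Python. The `select_records` helper is a hypothetical toy stand-in for a JSONPath selector, not dlt's implementation, and whether the REST API source accepts `"$"` in exactly this way should be checked against the docs linked above.

```python
from typing import Any, Dict, List

# Shortened version of the Microsoft Graph response from this issue.
response: Dict[str, Any] = {
    "@odata.context": "https://graph.microsoft.com/beta/$metadata#users",
    "value": [
        {"id": "6e7b768e-07e2-4810-8459-485f84f8f204", "department": None},
    ],
}


def select_records(body: Dict[str, Any], selector: str) -> List[Dict[str, Any]]:
    """Toy selector: "$" means the whole body, anything else a top-level key."""
    selected = body if selector == "$" else body[selector]
    # A list becomes many records; a single object becomes one record.
    return selected if isinstance(selected, list) else [selected]


unwrapped = select_records(response, "value")  # records without the wrapper key
whole = select_records(response, "$")          # one record that keeps "value"
```

With the "value" selector each user becomes its own record and the wrapper key is gone; with "$" the whole body is a single record that still carries "@odata.context" and "value".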

@barletza-pathid
Author

> data_selector

Yeah, we have tried it, and it didn't work. I think the reason is the way this specific API (the Microsoft Graph API) returns the response, where "value" is an array (list) and not an object. I guess it works with other APIs (like the example in the docs).

@sh-rp
Collaborator

sh-rp commented Feb 24, 2025

You could add a transformer to change the data shape before it is ingested by the dlt extract stage, but IMHO both dictionaries and lists should work.
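
A reshaping step of that kind could look roughly like the sketch below. It is plain Python with a hypothetical `rewrap_pages` helper and made-up sample pages; how it would be wired into a dlt transformer is left open here and is an assumption, not something confirmed in this thread.

```python
from typing import Any, Dict, Iterable, Iterator, List


def rewrap_pages(
    pages: Iterable[List[Dict[str, Any]]], key: str = "value"
) -> Iterator[Dict[str, Any]]:
    """Yield one envelope per page so the wrapper key is kept in the output."""
    for page in pages:
        # Null properties are passed through untouched.
        yield {key: list(page)}


# Hypothetical pages of already-unwrapped records.
pages = [
    [{"id": "1", "department": None}, {"id": "2", "department": None}],
    [{"id": "3", "department": None}],
]

envelopes = list(rewrap_pages(pages))
```

Each element of `envelopes` is a single dictionary shaped like the original response body, with the records nested under "value" again.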
