Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

timezone-less NWB files are failing DANDI-side validation #284

Open
bendichter opened this issue May 23, 2024 · 16 comments
Open

timezone-less NWB files are failing DANDI-side validation #284

bendichter opened this issue May 23, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@bendichter
Copy link
Member

We have a server-side issue with validation of NWB files. Until this is resolved, most NWB files submitted to DANDI will be unpublishable.

NWB recently removed the requirement for NWB session_start_time to have a timezone (PR). We previously had been defaulting to assigning the timezone of where the conversion was run, but were concerned that this was often incorrect. It’s also very unusual for recording systems or any other metadata to include the timezone, so the timezone requirement made it challenging to automate conversion.

The necessary changes have been made to the software, and these new timezone-less files propagate just fine through NWB GUIDE, NeuroConv, PyNWB, HDMF, NWB Inspector, and the dandi CLI. Creation, validation, and upload work well.

However, these new files are failing validation on DANDI server side, which means it is currently impossible to publish these files. See here for an example.

Our preferred solution would be to relax the validation on DANDI server side to allow for NWB files that have datetime that do not have timezones. We're not sure where this line would be. The schema just lists the requirement as datetime, and it's not clear how that is meant to be serealized/deserialized and if that requires a timezone to be present.

startDate: Optional[datetime] = Field(None, json_schema_extra={"nskey": "schema"})

@yarikoptic
Copy link
Member

hm, not yet sure if that was a good step forward. @bendichter you said

, but were concerned that this was often incorrect.

was there some discussion of concern and/or why you would think they were incorrect? so far I seems to spot only technical considerations/problems which could have possibly inspired such a change.

@bendichter
Copy link
Member Author

@yarikoptic see discussion here: NeurodataWithoutBorders/nwb-schema#321

@satra
Copy link
Member

satra commented May 23, 2024 via email

@satra
Copy link
Member

satra commented May 23, 2024

i downloaded this example and ran it through dandischema validate and it does not error. so we need to check if versions are matched between the archive and the schema.

@bendichter
Copy link
Member Author

@satra I can pull the metadata down using the API into an dandischema.models.Asset object, which I assume means it passes schema validation.

client = DandiAPIClient(api_url="https://api-staging.dandiarchive.org/api")
metadata = client.get_asset("d5c1c760-2d7e-4a52-bbf0-52225e88cc3c").get_metadata()
type(metadata)
dandischema.models.Asset

"d5c1c760-2d7e-4a52-bbf0-52225e88cc3c" is the asset ID of one of the files.

@candleindark
Copy link
Member

candleindark commented May 23, 2024

The datetime validation is unlikely to be the cause of the problem, at least not at the Pydantic model level. I think @satra example has demonstrated that, but here is a more pointed example.

from typing import Optional

from datetime import datetime
from pydantic import BaseModel


class A(BaseModel):
    t: Optional[datetime]


a = A(t='2024-05-06T10:56:41')  # no error from here

@yarikoptic
Copy link
Member

@yarikoptic see discussion here: NeurodataWithoutBorders/nwb-schema#321

ok, thank you! I think it would still be great may be at the level of nwb inspector now to warn on times without timezone and possibly loosing ability to align multiple data files/points.

@CodyCBakerPhD
Copy link

I think it would still be great may be at the level of nwb inspector now to warn on times without timezone and possibly loosing ability to align multiple data files/points.

We agree, already a WIP PR about that (awaiting some final debugging): NeurodataWithoutBorders/nwbinspector#458

@yarikoptic
Copy link
Member

Summary so far from my looking at the problem - it is due to jsonschema formatting check requiring date-time to be standard compliant RFC3339 and thus with time zone. More info etc

@rly
Copy link

rly commented Sep 20, 2024

It would be nice to fix this upstream in pydantic, but until then, we could create our own special datetime type that uses a regex pattern in the JSON schema, rather than the built-in RFC 3339 date-time validator, to validate a datetime that may or may not have an offset/timezone. We could decide on whatever pattern is best here - ISO 8601, or RFC 3339 where offset is optional, or something else.

From what I can gather, there is no good or consensus regex for ISO 8601. Code seems to be preferred. But I think we are limited to using regex within JSON schema. The below code uses one regex for ISO 8601 that I found on the internet.

regex = r"(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d\.\d+)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d)"

from pydantic import WithJsonSchema, BaseModel
from typing import Annotated
import datetime

BetterDatetime = Annotated[
    datetime.datetime,
    WithJsonSchema({"pattern": regex, "type": "string"}, mode="serialization"),
    WithJsonSchema({"pattern": regex, "type": "string"}, mode="validation")
]

class Model(BaseModel):
    dt: BetterDatetime

schema = Model.model_json_schema()
schema
# {'properties': {'dt': {'pattern': '(\\d{4}-[01]\\d-[0-3]\\dT[0-2]\\d:[0-5]\\d:[0-5]\\d\\.\\d+)|(\\d{4}-[01]\\d-[0-3]\\dT[0-2]\\d:[0-5]\\d:[0-5]\\d)|(\\d{4}-[01]\\d-[0-3]\\dT[0-2]\\d:[0-5]\\d)', 'title': 'Dt', 'type': 'string'}}, 'required': ['dt'], 'title': 'Model', 'type': 'object'}

m1 = Model(dt="2032-04-23T10:20:30")
j1 = m1.model_dump_json()
j1
# '{"dt":"2032-04-23T10:20:30"}'

m2= Model(dt="2032-04-23T10:20:30+02:30")
j2 = m2.model_dump_json()
j2
# '{"dt":"2032-04-23T10:20:30+02:30"}'

import json
import jsonschema
jsonschema.validate(schema, json.loads(j1))  # success
jsonschema.validate(schema, json.loads(j2))  # success

@candleindark
Copy link
Member

regex = r"(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d\.\d+)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d)|(\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d)"

from pydantic import WithJsonSchema, BaseModel
from typing import Annotated
import datetime

BetterDatetime = Annotated[
    datetime.datetime,
    WithJsonSchema({"pattern": regex, "type": "string"}, mode="serialization"),
    WithJsonSchema({"pattern": regex, "type": "string"}, mode="validation")
]

class Model(BaseModel):
    dt: BetterDatetime

I think it is a good approach. This allows custom JSON schema for a particular set of datetime fields without customizing the fields of datetime type. However, I think BetterDatetime can be sufficiently defined as the following according to the doc for WithJsonSchema.

BetterDatetime = Annotated[
    datetime.datetime,
    WithJsonSchema({"pattern": regex, "type": "string"}),
]

@rly
Copy link

rly commented Jan 30, 2025

Any further thoughts on the proposed solution above? There does not seem to be movement on the pydantic side so I think a custom datetime type is appropriate. I can create a PR so that we can make progress on this issue here and with NWB.

@bendichter
Copy link
Member Author

bendichter commented Jan 30, 2025

I initially proposed this as a path towards being able to handle dates without times (e.g., for DOB and session start dates where exact times are unknown). However, given the pydantic validation constraints, I think we should reconsider this approach. Instead, we could simplify by only supporting:

  • datetimes (always with timezone)
  • date objects (timezone-agnostic by definition)

This would avoid the ambiguity of timezone-less datetimes while still meeting the core use cases.

@waxlamp
Copy link
Member

waxlamp commented Jan 31, 2025

@bendichter, @yarikoptic, @candleindark, @satra: is this a dandi-archive problem, or does it belong in dandi-schema or somewhere else? I'd like to move it to the appropriate place if there's nothing actionable in the dandi-archive codebase.

@yarikoptic
Copy link
Member

Let's indeed move to dandi-schema since I do not think there is anything what could/should be done on dandi-archive side -- for assets metadata it just invokes validation via dandi-schema so if any workaround to be done, should be done there. ref to code: https://github.com/dandi/dandi-archive/blob/HEAD/dandiapi/api/services/metadata/__init__.py#L60

Unfortunately issue is still not address on pydantic side anyhow so we might need to cook up a workaround if issue persists... but IIRC nwb reverted back to do contain timezone, (? @rly @bendichter) , so there might be no immediate rush, right?

@yarikoptic yarikoptic transferred this issue from dandi/dandi-archive Feb 7, 2025
@rly
Copy link

rly commented Feb 7, 2025

Correct, we have reverted NWB back to requiring timezones if a time is present. We will be allowing date-only datetimes, but I believe that should pass validation fine here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants