datecollected assigned as current month and day #229
Comments
Hi @mgaynor1, In order to do date range searches (collected between two dates) we convert "things that look like some kind of date" to an actual date. What would you suggest the month and date be set to when none are provided?
As a botanist, I rely heavily on eventDate to identify duplicate records. These could be two herbarium specimens taken from one individual but deposited at different herbaria, which then send their records for ingestion on different days. Because of this ingestion process, the date is now meaningless and cannot be used to identify true duplicates. Any study that has used the month and day provided by the datecollected column could be inferring biological meaning where none exists. At this time, I would caution against using iDigBio data for any phenology studies unless researchers use only the data.dwc columns. iDigBio's search feature should not come before the quality of this data, and I urge you to prioritize correcting this. Obtaining data within a certain date range is meaningless when the date was made up during ingestion and has no biological meaning. I suggest that you all stop creating data that has no meaning. Do not put a month and day when none is provided. You are not converting a "kind of date" to an "actual date"; you are converting it to a fake date. Maybe just follow GBIF and convert things to ISO 8601 (2004) (YYYY-MM-DD, YYYY-MM, or YYYY) - see GBIF's description here and a great recent blog post on dates. Also, the minimum date of a collection shouldn't be 1700 but somewhere closer to 1550 (https://doi.org/10.2307/2421492).
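To make the ISO 8601 suggestion concrete, here is a minimal sketch of parsing YYYY, YYYY-MM, and YYYY-MM-DD while keeping missing parts missing. `PartialDate` and `parse_partial_date` are hypothetical names for illustration; they are not part of idb-backend or GBIF.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class PartialDate(NamedTuple):
    """A date that remembers which parts were actually provided."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def isoformat(self) -> str:
        # Emit only the parts that exist: YYYY, YYYY-MM, or YYYY-MM-DD.
        parts = [f"{self.year:04d}"]
        if self.month is not None:
            parts.append(f"{self.month:02d}")
            if self.day is not None:
                parts.append(f"{self.day:02d}")
        return "-".join(parts)


_PATTERN = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def parse_partial_date(text: str) -> Optional[PartialDate]:
    """Return a PartialDate, or None if the text is not a valid reduced-precision date."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    year, month, day = (int(g) if g is not None else None for g in match.groups())
    if month is not None and not 1 <= month <= 12:
        return None
    if day is not None:
        try:
            date(year, month, day)  # rejects impossible dates such as 1960-02-30
        except ValueError:
            return None
    return PartialDate(year, month, day)


print(parse_partial_date("1960").isoformat())        # 1960
print(parse_partial_date("1960-06").isoformat())     # 1960-06
print(parse_partial_date("1960-06-15").isoformat())  # 1960-06-15
```

With a representation like this, a year-only record never acquires a month or day it was not given.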
To identify duplicate records you may have more success using the as-published data fields rather than interpreted fields. The interpreted fields are subject to change over time. A stronger way to say this is
I appreciate the links to the GBIF examples, which are good suggestions for data providers on sharing / publishing dates. I do not see a discussion of how this affects data access, display, and discovery in the GBIF system on these fields, though. For example, in GBIF's web UI, if a data record contains only a year such as '1960', which month in the histogram contains that record? In the iDigBio case, I'd be more concerned about whether this hypothetical '1960' record would ever show up in any search that included a month such as '1960-06'.
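One way to reason about the '1960' versus '1960-06' question is to treat each partial date as the interval it could cover and match on overlap. The sketch below is only an illustration under that assumption; `date_bounds` and `may_match` are hypothetical helpers and do not describe how iDigBio or GBIF search actually works.

```python
import calendar
from datetime import date
from typing import Tuple


def date_bounds(partial: str) -> Tuple[date, date]:
    """Return the first and last calendar day that a YYYY, YYYY-MM, or YYYY-MM-DD string could mean."""
    parts = [int(p) for p in partial.split("-")]
    year = parts[0]
    if len(parts) == 1:
        return date(year, 1, 1), date(year, 12, 31)
    month = parts[1]
    if len(parts) == 2:
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    exact = date(year, month, parts[2])
    return exact, exact


def may_match(record: str, query: str) -> bool:
    """True if the record's possible dates overlap the query's possible dates."""
    r_start, r_end = date_bounds(record)
    q_start, q_end = date_bounds(query)
    return r_start <= q_end and q_start <= r_end


print(may_match("1960", "1960-06"))        # True: the record might fall in June
print(may_match("1960-07", "1960-06"))     # False
print(may_match("1960-06-15", "1960-06"))  # True
```

Under this approach a year-only record is returned as a possible match for a month query without any month or day being stored for it. Whether possible matches should be shown by default is a separate design choice.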
Interesting! I confirm there are now some records in GBIF older than year 1700. We can use that "new" earlier date from now on, thanks for pointing that out.
As reported in #229 the oldest natural history records actually date back to the 1500s rather than 1700. I have confirmed there are some digitized records in GBIF from the 1600s. Moving our zero date back a few hundred years.
Hey Dan - Take a step back and look at that screen capture from GBIF. Notice the large number of records from January? They are assigning YEAR-01-01 when month and day are not provided. This is a standardized approach and means researchers can take out all 01-01 records if they need dates. You are right, I found a workaround. But do most data users know about the "data.dwc:" fields? Should iDigBio really return fake dates to users just to streamline a search feature? The answer to both of these is no. Taking the current date of ingestion and assigning it as the month/day of specimen collection is wild and needs to be fixed. Finally, where does iDigBio document these interpretations so users can make informed decisions?
For future readers, the specific GBIF implementation discussion: None of the solutions to inventing missing data are ideal and all of them have trade-offs. Converting "1960" to "1960-01-01" is also creating a fake date. I understand your opinion that this type of fake date is preferred to the current one. Previous domain experts on the project determined that artificially inflating how much collecting activity was happening on the first of every month, and the first of every year, was the poorer choice of the various potential solutions. Some managers or PIs had strong opinions on various aesthetics of this issue. For some research activities the current method was preferred (the introduction of these fake dates is statistically distributed across the entire date range rather than always falling into the first date bucket). I do not have answers to all of your questions, but I will make sure the Cyberinfrastructure team is aware of this GitHub issue.
José escalated this issue to the project PIs.
There are instances where the Darwin Core values take the approach of assigning the first day of the period for the dwc:month and dwc:day fields. This is likely why the GBIF monthly histograms have excessive January counts. Here's an example: https://portal.idigbio.org/portal/records/4125d2a8-1bc1-4744-86be-549ac814b579 Would a data quality flag specifying the eventDate interval size in days, or just a nonspecific eventDate flag, be useful in excluding these records from analyses that require a date? Best practice for eventDate includes a time. We currently default to midnight. This is the same problem but with hours. Perhaps we could have a projected time data quality flag as well.
Hi there. In the example you provided, iDigBio is only sampling the starting date in the interval and is not assigning 01-01 artificially. Intervals can be found in the DarwinCore eventDate field and should be found there. This discussion focuses on Adding a flag when Here is a great paper that discusses dates in research: https://doi.org/10.1111/1365-2435.14173 |
How about a date specificity value that is based on days? An eventDate specifying the year 1912 would generate a date specificity value of 366. A year and month such as 1902-06 would generate a date specificity value of 30. An ISO 8601:2019 interval of 2007-03-01T13:00:00Z/2008-05-11T15:30:00Z would generate a date specificity value of 437.104166667. Then you could use the value to exclude records based on your needs.
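The three values quoted above can be reproduced with a few lines of standard-library Python. This is only a sketch of the proposal as described in the comment; `date_specificity_days` is a hypothetical name, and treating a full date or date-time as one day is an assumption made here, not something the comment specifies.

```python
import calendar
from datetime import datetime


def _parse_instant(text: str) -> datetime:
    # Older Python versions do not accept a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def date_specificity_days(event_date: str) -> float:
    """Width, in days, of the period an eventDate string could refer to."""
    if "/" in event_date:
        # ISO 8601 interval: start/end
        start, end = event_date.split("/", 1)
        return (_parse_instant(end) - _parse_instant(start)).total_seconds() / 86400
    parts = event_date.split("-")
    year = int(parts[0])
    if len(parts) == 1:
        # Year only: 365 or 366 days
        return 366.0 if calendar.isleap(year) else 365.0
    if len(parts) == 2:
        # Year and month: number of days in that month
        return float(calendar.monthrange(year, int(parts[1]))[1])
    # Full date or date-time: treated as one day (assumption of this sketch)
    return 1.0


print(date_specificity_days("1912"))     # 366.0
print(date_specificity_days("1902-06"))  # 30.0
print(round(date_specificity_days("2007-03-01T13:00:00Z/2008-05-11T15:30:00Z"), 9))  # 437.104166667
```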
We are only discussing The purpose of
I strongly recommend you do not do this.
I was suggesting adding a data quality field in addition to the current fields, not as a replacement for datecollected.
In all responses, please be specific about which field you are discussing. An additional field based on your logic above would not be interpretable to users due to the intervals varying in length. If you want to flag with numbers, make a key and flag with numbers. However, I do not recommend number flags unless they are heavily documented and are stable values (ex. |
This would not be a boolean value but a float, similar to the coordinate uncertainty values we currently have.
Maybe something you can propose to tdwg? I do not see how this helps with this issue and would encourage us not to create more fields without documenting what exists in the current fields. The issue discussion here should shift back to the |
When the data.dwc:day or data.dwc:month is missing, but a data.dwc:year is provided, the datecollected column is assigned the current month and day.
This error comes from this:
idb-backend/idb/helpers/conversions.py
Lines 544 to 606 in 3c9551c
Here is the line causing this issue:
idb-backend/idb/helpers/conversions.py
Line 599 in 3c9551c
This is really easy to recreate in Python as well:
Out[1]: datetime.date(2010, 4, 1)
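The input that produced the output above is missing from this copy of the issue. The sketch below shows the same failure mode using only the standard library. It assumes the conversion fills missing fields from today's date, which is the documented default of `dateutil.parser.parse` when no `default` argument is passed; whether line 599 of conversions.py makes that call is an assumption. `naive_date_collected` is a hypothetical function, not code from idb-backend.

```python
import calendar
from datetime import date
from typing import Optional


def naive_date_collected(year: int,
                         month: Optional[int] = None,
                         day: Optional[int] = None,
                         today: Optional[date] = None) -> date:
    """Fill a missing month or day from the current date (the behavior reported in this issue)."""
    today = today or date.today()
    filled_month = month if month is not None else today.month
    filled_day = day if day is not None else today.day
    # Clamp the day so that, for example, the 31st still fits a 30-day month.
    last_day = calendar.monthrange(year, filled_month)[1]
    return date(year, filled_month, min(filled_day, last_day))


# The same year-only record, ingested on two different days:
print(naive_date_collected(2010, today=date(2022, 4, 1)))   # 2010-04-01
print(naive_date_collected(2010, today=date(2022, 9, 15)))  # 2010-09-15
```

The two results differ only because the ingestion day differs, which is why the interpreted date cannot be used to match duplicates.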