
requests.timestamp_day cannot be converted to default timezone #4

Open

robert-carroll opened this issue Nov 8, 2019 · 7 comments

robert-carroll (Collaborator) commented Nov 8, 2019

This field can't be converted from UTC to the local default timezone, because it holds only a date and there is no time component to shift. The field is also stored as VARCHAR, which makes some queries using it problematic. You can store the field using the Embulk output SQL type 'date', but no timezone conversion will occur.

The best solution would be to load the table to staging and convert the value to local time in production... or just use the timestamp column and derive a new day value from it.

Example: #4 (comment)
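The "derive a new value from timestamp" route can be sketched in Python — a minimal sketch, assuming the stored timestamps are UTC and America/Los_Angeles is the local zone (`local_day` is a hypothetical helper, not part of any plugin):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+


def local_day(utc_ts: str, tz: str = "America/Los_Angeles") -> str:
    """Derive the local calendar day from a full UTC timestamp string."""
    dt = datetime.strptime(utc_ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return dt.astimezone(ZoneInfo(tz)).date().isoformat()


# A late-night UTC timestamp belongs to the previous local day:
print(local_day("2019-12-10 02:30:00"))  # 2019-12-09
```

This is exactly the information a date-only `timestamp_day` column has already thrown away, which is why it can't be fixed after the fact.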

lafent (Contributor) commented Nov 9, 2019

Can this be accomplished via an existing embulk filter plugin? I'm not sure what plugins anyone is using, but I've got a couple I've used for importing some pretty messy data. I'm about 90% sure I've got one timestamp field that's partially parsed from character data. I'll take a whack at it. I'm not sure that's really a viable option for everyone, but I can document it as an option at least -- assuming it actually works like I remember.

lafent (Contributor) commented Dec 13, 2019

I've played around with this a little bit and I'm not sure I really understand what the problem is.

My current config for the requests data is (MySQL-centric, irrelevant bits removed):

```yaml
in:
    parser:
        columns:
        - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%L'}
out:
    default_timezone: 'America/Chicago'
    column_options:
        timestamp: {type: 'DATETIME NULL', value_type: timestamp, timestamp_format: '%Y-%m-%d %H:%M:%S.%L'}
```

This gets me the timestamp in my local time. I haven't cross-checked vs. daylight savings and I do have some other plugins installed. I don't think anything is mutating the workflow, but I haven't tried to run a vanilla instance either.
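For reference, the conversion that config performs can be approximated outside Embulk — a minimal Python sketch, assuming the raw values are UTC; note that Embulk's Ruby-style `%L` (milliseconds) corresponds roughly to Python's `%f` (microseconds):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Parse a raw UTC timestamp, then shift it to the configured default_timezone.
raw = "2019-12-13 18:30:00.123"
dt = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=timezone.utc)
local = dt.astimezone(ZoneInfo("America/Chicago"))

# Trim microseconds back down to milliseconds for display:
print(local.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])  # 2019-12-13 12:30:00.123
```

This works because the full timestamp carries the time-of-day needed to apply the offset — which is precisely what the date-only columns lack.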

robert-carroll (Collaborator, Author) commented Dec 13, 2019

The problem only exists in the timestamp_day, timestamp_month, and timestamp_year columns, since they carry no time component from which to evaluate a timezone difference. So if you run SELECT ... WHERE timestamp_day = '2019-12-13', the results will be off by the local-to-UTC offset.
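To make that concrete, here's a small self-contained sketch — synthetic hourly data, sqlite3 standing in for the real database, America/Los_Angeles assumed as the local zone:

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (timestamp_day TEXT, ts TEXT)")

# One request per UTC hour on 2019-12-13; timestamp_day is derived
# from the UTC timestamp, as in the export.
start = datetime(2019, 12, 13, tzinfo=timezone.utc)
for h in range(24):
    t = start + timedelta(hours=h)
    conn.execute("INSERT INTO requests VALUES (?, ?)",
                 (t.strftime("%Y-%m-%d"), t.strftime("%Y-%m-%d %H:%M:%S")))

# Filtering on the UTC-derived day column returns every row...
utc_count = conn.execute(
    "SELECT COUNT(*) FROM requests WHERE timestamp_day = '2019-12-13'"
).fetchone()[0]

# ...but only 16 of those 24 hours fall on 2019-12-13 in local time
# (the first 8 UTC hours belong to local 12/12).
tz = ZoneInfo("America/Los_Angeles")
local_count = sum(
    1 for (ts,) in conn.execute("SELECT ts FROM requests")
    if datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
       .replace(tzinfo=timezone.utc).astimezone(tz).date().isoformat()
       == "2019-12-13"
)
print(utc_count, local_count)  # 24 16
```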

robert-carroll (Collaborator, Author) commented Dec 13, 2019

I have some time today; I'll play with this one: https://github.com/sonots/embulk-filter-timestamp_format

lafent (Contributor) commented Dec 13, 2019 via email

robert-carroll (Collaborator, Author) commented Dec 13, 2019

Here's a better illustration of the problem, for detail rather than explanation.

Importing a single day's request files, 1 sequence from the cd:cli fetch command.

Using timestamp_day:

```sql
SELECT
	timestamp_day
	, MIN(timestamp) AS [first]
	, MAX(timestamp) AS [last]
	, COUNT(*) total
FROM requests
GROUP BY timestamp_day
ORDER BY [last] DESC
```

| timestamp_day | first | last | total |
| --- | --- | --- | --- |
| 2019-12-09 | 2019-12-09 16:00:00.0090000 | 2019-12-10 15:59:59.9070000 | 3467752 |

Resulting in 3,467,752 rows returned for 12/9.

Using CONVERT(date, timestamp):

```sql
SELECT
	CONVERT(date, timestamp)
	, MIN(timestamp) AS [first]
	, MAX(timestamp) AS [last]
	, COUNT(*) total
FROM requests
GROUP BY CONVERT(date, timestamp)
ORDER BY [last] DESC
```

While timestamp_day indicates 12/9, with UTC to America/Los_Angeles conversion, many of those rows are for 12/10.

| CONVERT(date, timestamp) | first | last | total |
| --- | --- | --- | --- |
| 2019-12-10 | 2019-12-10 00:00:00.2560000 | 2019-12-10 15:59:59.9070000 | 2674451 |
| 2019-12-09 | 2019-12-09 16:00:00.0090000 | 2019-12-09 23:59:59.8540000 | 793301 |
|  |  |  | 3467752 |

Resulting in 793,301 rows returned for 12/9, but this is still not complete.

Driving it home, importing 2 sequences of requests*.gz:

| CONVERT(date, timestamp) | first | last | total |
| --- | --- | --- | --- |
| 2019-12-10 | 2019-12-10 00:00:00.2560000 | 2019-12-10 15:59:59.9070000 | 2674451 |
| 2019-12-09 | 2019-12-09 00:00:00.0150000 | 2019-12-09 23:59:59.8540000 | 3377973 |
| 2019-12-08 | 2019-12-08 16:00:00.0140000 | 2019-12-08 23:59:59.9990000 | 613354 |

With a complete dataset, there are actually 3,377,973 rows returned for 12/9.
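The same "two UTC days spread over three local days" pattern can be reproduced in miniature with synthetic data — a sketch using sqlite3 in place of the real table, America/Los_Angeles assumed; the counts are illustrative, not the ones above:

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (timestamp_day TEXT, ts TEXT)")

# One row every 2 hours across two UTC days (12/9 and 12/10);
# timestamp_day keeps the UTC date, ts is already shifted to local time.
tz = ZoneInfo("America/Los_Angeles")
start = datetime(2019, 12, 9, tzinfo=timezone.utc)
for i in range(24):
    t = start + timedelta(hours=2 * i)
    conn.execute("INSERT INTO requests VALUES (?, ?)",
                 (t.strftime("%Y-%m-%d"),
                  t.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")))

# Grouping by the UTC-derived day column sees two clean days...
by_day_col = dict(conn.execute(
    "SELECT timestamp_day, COUNT(*) FROM requests GROUP BY timestamp_day"))

# ...while grouping by the date of the local timestamp spreads the
# same rows over three local days.
by_ts_date = dict(conn.execute(
    "SELECT date(ts), COUNT(*) FROM requests GROUP BY date(ts)"))

print(by_day_col)   # {'2019-12-09': 12, '2019-12-10': 12}
print(by_ts_date)   # {'2019-12-08': 4, '2019-12-09': 12, '2019-12-10': 8}
```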
