reduce chunkstore memory footprint #747
What's the memory saving? Have you measured it? Is it 50%, since there is only one copy of the data instead of two? It would be great to have a way to show the saving, and an automated test to avoid any accidental regressions.
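One way such a measurement could be automated is sketched below using the standard library's `tracemalloc`. This is only an illustration of the idea, not the PR's test: `peak_bytes`, `process_with_copy` and `process_without_copy` are hypothetical names, and a plain `bytearray` stands in for the chunk data.

```python
import tracemalloc


def peak_bytes(func, *args):
    """Return the peak traced allocation (in bytes) while func(*args) runs."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def process_with_copy(buf):
    # Hypothetical stand-in for a code path that duplicates the payload.
    duplicate = bytes(buf)
    return len(duplicate)


def process_without_copy(buf):
    # Hypothetical stand-in for a code path that works on a view instead.
    view = memoryview(buf)
    return len(view)


# The payload is allocated before tracing starts, so only extra
# allocations made inside each function are counted.
payload = bytearray(10 * 1024 * 1024)
copy_peak = peak_bytes(process_with_copy, payload)
view_peak = peak_bytes(process_without_copy, payload)

# A regression test could assert that the no-copy path stays well below
# the size of the payload, while the copying path does not.
assert view_peak < copy_peak
```

A check along these lines would fail if a later change reintroduced a second copy of the data.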
No changes needed per se, just questions.
@@ -218,16 +222,19 @@ def deserialize(self, data, columns=None):
            if index:
                columns = columns[:]
                columns.extend(meta[INDEX])
            if len(columns) > len(set(columns)):
                raise Exception("Duplicate columns specified, cannot de-serialize")
            columns = list(set(columns))
I'm not sure I see this as a win. It seems like the caller may have a bug if they are specifying duplicate columns; we're just hiding the error now.
The current logic is confusing when subsetting data frames with indexes. For example, if you have the data frame:
index: date, security
columns: price, volume
The logic works if the user passes:
['price']
Raises a duplicate columns error when passing:
['date','security','price']
I don't see the value in the check - it should just do the right thing.
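The behaviour described above can be reproduced with plain lists. This is a minimal sketch that mirrors the check shown in the diff; `resolve_columns_strict` and `INDEX_COLUMNS` are hypothetical names used only for illustration, with `INDEX_COLUMNS` standing in for `meta[INDEX]`.

```python
INDEX_COLUMNS = ["date", "security"]  # index of the example data frame


def resolve_columns_strict(columns):
    # Mirrors the check in the diff: index columns are always appended,
    # then any duplicate is treated as an error.
    columns = columns[:]
    columns.extend(INDEX_COLUMNS)
    if len(columns) > len(set(columns)):
        raise Exception("Duplicate columns specified, cannot de-serialize")
    return columns


print(resolve_columns_strict(["price"]))  # ['price', 'date', 'security']

try:
    resolve_columns_strict(["date", "security", "price"])
except Exception as exc:
    print(exc)  # Duplicate columns specified, cannot de-serialize
```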
Using pandas nomenclature, the columns and the index are separate. If there is an index, you always get it back, even if you specify a subset of columns (and even if they do not include the index columns). Maybe the documentation should be improved. If, for example, you specify price and security, you'll still get date as well as price and security, so your fix would only introduce more weirdness (in my opinion).
We could remove index columns from columns and then check for duplicates. This keeps the nomenclature but keeps the user interface 'minimum surprise'. Or raise an error saying they have included index columns in the column list.
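The first option might look roughly like the sketch below. It is an assumption about how the proposal could be implemented, not code from the PR; `resolve_columns` is a hypothetical helper and `index_columns` stands in for `meta[INDEX]`.

```python
def resolve_columns(columns, index_columns):
    # Drop any index columns the caller listed; the index is returned anyway.
    requested = [c for c in columns if c not in index_columns]
    # Duplicates among the remaining (non-index) columns are still an error.
    if len(requested) > len(set(requested)):
        raise ValueError("Duplicate columns specified, cannot de-serialize")
    return requested + list(index_columns)


print(resolve_columns(["date", "security", "price"], ["date", "security"]))
# ['price', 'date', 'security']
```

With this ordering of steps, a genuinely repeated column such as `['price', 'price']` would still be rejected.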
The result would be the same though, no? You'd supply index columns and it won't complain. I foresee someone opening a bug complaining they only specified 1 of 3 index columns but still got all 3 back.
Sure
Actually, in retrospect, that means breaking the API for clients. How about we keep the fuzziness for clients and simply output a warning instead?
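A warning-based variant could be sketched as follows, assuming the standard `logging` module is used for the message. `resolve_columns_lenient` is a hypothetical name, and the wording of the warning is illustrative only.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_columns_lenient(columns, index_columns):
    # Index columns named by the caller are tolerated, but flagged.
    overlap = [c for c in columns if c in index_columns]
    if overlap:
        logger.warning(
            "Index columns %s are always returned and need not be listed in columns",
            overlap,
        )
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(list(columns) + list(index_columns)))


print(resolve_columns_lenient(["date", "security", "price"], ["date", "security"]))
# ['date', 'security', 'price']
```

Existing callers keep working, and the log message tells them which entries they can drop.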
OK, let's see a warning :) but I still think that get_info should change, otherwise how would you ever know how to rid yourself of the warning?
Agree with get_info change
so sounds like you just need to fix the broken tests and add the log and we're all set :D
d4ccf47 to 036a89f
        Returns
        -------
        pandas dataframe or series
        """
        if not data:
            return pd.DataFrame()
        if not inplace:
            data = data[:]
There are some errors in the tests, so I'm thinking this will need to be tweaked a bit more.
@TomTaylorLondon are you going to have the bandwidth to finish this or would you like me to resolve it?
Hi @TomTaylorLondon any luck with this?
@shashank88 I spoke with @TomTaylorLondon and am going to take this over from him. I'll get it all fixed up later this week(end).
👍
Two changes:
Using a 1GB dataframe:
this PR:
master: