Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce chunkstore memory footprint #747

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

TomTaylorLondon
Copy link
Contributor

@TomTaylorLondon TomTaylorLondon commented Apr 19, 2019

Two changes:

  1. Reduce memory footprint when reading data
  2. Handle duplicate columns in the filter.

Using a 1GB dataframe:

this PR:
plot_new

master:
plot_old

import numpy as np
import pandas as pd
from datetime import datetime as dt
from datetime import timedelta as td

days = 2000
secs = 15000

a1 = [range(secs) for _ in range(days)]
a2 = [[dt(2000,1,1)+td(days=x)]*secs for x in range(days)]
a3 = [['foo']*secs for _ in range(days)]
a4 = [np.random.rand(secs) for _ in range(days)]
a5 = [np.random.rand(secs) for _ in range(days)]
a6 = [['HOLIDAY INN WORLD CORP']*secs for _ in range(days)]

now = dt.now()
result = []
for i in range(days):
    result.append(pd.DataFrame({'security_id':a1[i], 'date':a2[i], 'c':a3[i], 'd':a4[i], 'e':a5[i], 'f':a6[i]}, copy=True))
df = pd.concat(result)
print(df.shape)
print((dt.now() - now).total_seconds())
df = df.set_index(['date','security_id'])
print(df.memory_usage(index=True).sum() / 1e6)
from arctic import Arctic
import arctic
print(arctic.__file__)
a = Arctic('localhost')
a.initialize_library('test', lib_type='ChunkStoreV1')
lib = a['test']
lib.write('test', df)
del df
df = lib.read('test')

@TomTaylorLondon TomTaylorLondon requested a review from bmoscon April 19, 2019 13:43
@yschimke
Copy link
Contributor

What's the memory saving? Have you measured it? 50% only 1 copy of data instead of 2?

Would be great to have a way to show the saving, and automated test to avoid any accidental regressions.

Copy link
Collaborator

@bmoscon bmoscon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no changes needed per-se, just questions

arctic/serialization/numpy_arrays.py Outdated Show resolved Hide resolved
arctic/serialization/numpy_arrays.py Show resolved Hide resolved
@@ -218,16 +222,19 @@ def deserialize(self, data, columns=None):
if index:
columns = columns[:]
columns.extend(meta[INDEX])
if len(columns) > len(set(columns)):
raise Exception("Duplicate columns specified, cannot de-serialize")
columns = list(set(columns))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I see this as a win. It seems like the caller may have a bug if they are specifying duplicate columns, we're just hiding the error now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current logic is confusing when subsetting data frames with indexes. For example, if you have the data frame:
index: date, security
columns: price, volume

The logic works if the user passes:
['price']
Raises a duplicate columns error when passing:
['date','security','price']

I don't see the value in the check - it should just do the right thing..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using pandas nomenclature the columns and the index are separate. If there is an index, you always get it back, even if you specify a subset of columns (and even if they do not include the index columns). Maybe the documentation should be improved. If for example, you specify price and security, you'll still get date as well as price and security, so your fix would only introduce more weirdness (in my opinion).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could remove index columns from columns and then check for duplicates. This keeps the nomenclature but keeps the user interface 'minimum surprise'. Or raise an error saying they have included index columns in the column list.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result would be the same though, no? You'd supply index columns and it wont complain. I foresee someone opening a bug complaining they only specified 1 of 3 index columns but still got all 3 back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in retrospect, that means breaking the API for clients. How about we keep the fuzziness for clients and simply output a warning instead?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, lets see a warning :) but i still think that get info should change, otherwise how would you ever know how to rid yourself of the warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with get_info change

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so sounds like you just need to fix the broken tests and add the log and we're all set :D

@TomTaylorLondon TomTaylorLondon force-pushed the feature/chunkstore-reduce-memory-footprint branch from d4ccf47 to 036a89f Compare April 19, 2019 14:45

Returns
-------
pandas dataframe or series
"""
if not data:
return pd.DataFrame()
if not inplace:
data = data[:]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some errors in the tests, so I'm thinking this will need to be tweaked a bit more

@bmoscon
Copy link
Collaborator

bmoscon commented May 9, 2019

@TomTaylorLondon are you going to have the bandwidth to finish this or would you like me to resolve it?

@shashank88
Copy link
Contributor

Hi @TomTaylorLondon any luck with this?

@bmoscon
Copy link
Collaborator

bmoscon commented Jul 5, 2019

@shashank88 I spoke with @TomTaylorLondon and am going to take this over from him. I'll get it all fixed up later this week(end).

@shashank88
Copy link
Contributor

@shashank88 I spoke with @TomTaylorLondon and am going to take this over from him. I'll get it all fixed up later this week(end).

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants