RII with Billion scale dataset code configuration/reconfiguration crashes the kernel #66

Open
ashleyabraham opened this issue Jun 24, 2024 · 3 comments

Comments

@ashleyabraham
Contributor

ashleyabraham commented Jun 24, 2024

I have a Rii object built from a 3.3-billion-vector dataset. The vectors were batch-loaded with add(update_posting_lists=False), and at the end I ran reconfigure(), but reconfigure() crashes the Python kernel. I also tried adding and then reconfiguring immediately after each batch, but that took so long I couldn't tell whether it was working or just hung. Looking at the code, I saw some comments about large memory consumption. Is there an alternate way to do this without crashing?
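
For context, a rough back-of-the-envelope estimate of the compressed index size at this scale (assuming the M=5, 256-centroid PQ setup from the repro code below, i.e. 1 byte per sub-code, and 64-bit identifiers in the posting lists; both are assumptions about the internals on my part):

# Rough size estimate for the PQ codes and posting-list identifiers alone.
N, M = 3_300_000_000, 5          # dataset size and number of PQ sub-quantizers
code_bytes = N * M               # 1 byte per sub-code (assumes 256 centroids per sub-quantizer)
id_bytes = N * 8                 # assumes 64-bit vector identifiers in the posting lists
print(f"PQ codes:      ~{code_bytes / 1e9:.1f} GB")   # ~16.5 GB
print(f"Posting lists: ~{id_bytes / 1e9:.1f} GB")     # ~26.4 GB

So even before reconfigure() rebuilds anything, the codes and identifiers alone run to tens of gigabytes.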

@matsui528
Owner

Let me know the minimum code to reproduce the error.

@ashleyabraham
Contributor Author

ashleyabraham commented Jun 26, 2024

import rii
import pickle 
import numpy as np
import nanopq

N, Nt, D = 3_300_000_000, 660_000, 75

# MemoryError: Unable to allocate 1.80 TiB for an array with shape (3_300_000_000, 75) and data type float64
# X = np.random.random((N, D)).astype(np.float32)  # 3_300_000_000  75-dim vectors to be searched

Xt = np.random.random((Nt, D)).astype(np.float32)  # 660_000 75-dim vectors for training
q = np.random.random((D,)).astype(np.float32)  # a 75-dim vector

# Prepare a PQ codec with M=5 subspaces
codec = nanopq.PQ(M=5).fit(vecs=Xt)  # Trained using Xt

# Instantiate a Rii class with the codec
e = rii.Rii(fine_quantizer=codec)

# Batch-add vectors: 1_000_000 vectors x 3_300 batches = 3.3 billion
# (in reality the data is loaded from Parquet files)
for i in range(3_300):
    X = np.random.random((1_000_000, D)).astype(np.float32) 
    e.add(vecs=X, update_posting_lists=False)
    # e.reconfigure() ## takes longer and longer as the loop advances

e.reconfigure() # Crashes

# Search
ids, dists = e.query(q=q, topk=3)
print(ids, dists)  # e.g., [7484 8173 1556] [15.06257439 15.38533878 16.16935158]
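
Since the loop runs for a long time, here is a minimal sketch of checkpointing the engine between batches with pickle (assuming the Rii instance is picklable, which is why pickle is imported above; the path and checkpoint interval are placeholders):

import pickle
import numpy as np
import nanopq
import rii

D = 75
Xt = np.random.random((660_000, D)).astype(np.float32)
codec = nanopq.PQ(M=5).fit(vecs=Xt)
e = rii.Rii(fine_quantizer=codec)

CHECKPOINT = "rii_engine.pkl"  # placeholder path

for i in range(3_300):
    X = np.random.random((1_000_000, D)).astype(np.float32)
    e.add(vecs=X, update_posting_lists=False)
    if (i + 1) % 100 == 0:
        # Persist progress so a crash in the final reconfigure() does not lose the adds
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(e, f)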

@ashleyabraham
Contributor Author

ashleyabraham commented Jul 1, 2024

what is the best way to load a large billion-scale dataset like this?

When I downsample from 3.3 billion to 1 billion vectors, it doesn't error right away: reconfigure() succeeds with the minimal code above, but the 1-billion-vector model then crashes in the query() step, at self.impl_cpp.query_ivf(q_, topk, tids, L).
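
For reference, a sketch of the failing step with an explicit candidate count L (query() exposes L as the number of PQ codes considered per query; the value below is arbitrary, and whether a smaller L avoids the crash is untested):

# e and q as built in the repro code above
ids, dists = e.query(q=q, topk=3, L=10_000)  # explicit, smaller L -- diagnostic only
print(ids, dists)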
