Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: fix cosine residual calculation #2015

Merged
merged 13 commits into from
Mar 3, 2024
Merged

chore: fix cosine residual calculation #2015

merged 13 commits into from
Mar 3, 2024

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Feb 29, 2024

No description provided.

@eddyxu eddyxu requested review from wjones127, westonpace, chebbyChefNEQ and BubbleCal and removed request for wjones127 and westonpace February 29, 2024 23:15
<T as ArrowPrimitiveType>::Native: Float + Sum,
{
let v = arr.as_primitive::<T>();
Ok(Arc::new(PrimitiveArray::<T>::from_iter_values(normalize(v.values()))) as ArrayRef)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I plan to fix this soon, but normalize doesn't dispatch to the optimized SIMD kernels we have yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

let num_rows = data.len() / dimension;
let num_rows = data.num_rows();

let (data, metric_type) = if self.metric_type == MetricType::Cosine {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we move this higher? maybe to search?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i can try

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THe shuffler is still using this piece separately, tho

.collect::<Vec<_>>();
let data = T::ArrayType::from(values);
FixedSizeListArray::try_new_from_values(data, self.dimension as i32)?
normalize_fsl(&fsl)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove cosine type PQ entirely? And let IVF handle normalization and resid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^^ can we add a panic! that make sure we don't create a cosine PQ?

@eddyxu eddyxu force-pushed the lei/pq_cosine_idx branch from 7ef3c0c to 9d8a2e5 Compare March 1, 2024 21:54
@eddyxu eddyxu force-pushed the lei/pq_cosine_idx branch from 9d8a2e5 to 2be5e30 Compare March 2, 2024 04:46

// TODO: add range filter
let ivf_transform = Arc::new(IvfTransformer::new(
centroids.clone(),
metric_type,
MetricType::L2,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about dot product?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -402,7 +394,7 @@ where
let mut best_kmeans = Self::empty(k, dimension, params.metric_type);
let mut best_stddev = f32::MAX;

let rng = rand::rngs::SmallRng::from_entropy();
let rng = SmallRng::from_entropy();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a todo to use seeds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

.collect::<Vec<_>>();
return compute_partitions_l2(centroids_array, &normalized, dimension)
.collect();
panic!("KMeans: should not use cosine distance to train kmeans, use L2 instead.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this check in the constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The match arm needs a branch to handle MetricType::Cosine tho.

let partition_ids =
self.ivf
.find_partitions(&query.key, query.nprobes, self.metric_type)?;
let mut query = query.clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how cheap is this copy, can we avoid it for the L2 path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cheap, the query vector is zero copy (copied by pointer), other than that, there are just 6 other string/int fields.

// vector_column,
// )));
// };
let mt = if metric_type == MetricType::Cosine {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: make a macro for this?

@@ -163,6 +167,7 @@ impl VectorIndex for PQIndex {
});
}
pre_filter.wait_for_ready().await?;
println!("PQIndex::search: metric type: {:?}", self.metric_type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log?

training_data = normalize_fsl(&training_data)?;
}

println!("PQ Training, ivf: {:?} ", ivf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log

Comment on lines +367 to +369
span!(Level::INFO, "compute residual for PQ training")
.in_scope(|| ivf2.compute_residual(&training_data, None))
.await?
Copy link
Contributor

@chebbyChefNEQ chebbyChefNEQ Mar 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to make sure we don't residulize for dot.

So, maybe something like pq_params.should_residulize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the design lets the caller decide whether to run residual or not. If provided IVF, it runs residual PQ.

So PQ is orthogonal to distance type as much as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we always residualize currently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a flag for dot distance to not residualize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, checked it on the caller

Copy link
Contributor

@chebbyChefNEQ chebbyChefNEQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few println! and pq residual needs to be fixed

@eddyxu eddyxu merged commit f1218e3 into main Mar 3, 2024
17 checks passed
@eddyxu eddyxu deleted the lei/pq_cosine_idx branch March 3, 2024 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants