chore: fix cosine residual calculation #2015
rust/lance-linalg/src/kernels.rs
Outdated
<T as ArrowPrimitiveType>::Native: Float + Sum,
{
    let v = arr.as_primitive::<T>();
    Ok(Arc::new(PrimitiveArray::<T>::from_iter_values(normalize(v.values()))) as ArrayRef)
FYI I plan to fix this soon, but normalize doesn't dispatch to the optimized SIMD kernels we have yet.
🔥
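For reference, a minimal scalar sketch of what L2 normalization does, with no SIMD dispatch. The name normalize_scalar and the zero-vector handling are assumptions for illustration; this is not the lance-linalg normalize implementation.

```rust
/// Scales `v` to unit L2 norm using a plain scalar loop (no SIMD).
/// Assumption: a zero vector is returned unchanged rather than producing NaNs.
fn normalize_scalar(v: &[f32]) -> Vec<f32> {
    // L2 norm: square root of the sum of squares.
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}
```

An optimized kernel would be expected to produce the same values up to floating-point rounding.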
rust/lance-index/src/vector/ivf.rs
Outdated
let num_rows = data.len() / dimension;
let num_rows = data.num_rows();

let (data, metric_type) = if self.metric_type == MetricType::Cosine {
should we move this higher? maybe to search?
ok i can try
The shuffler is still using this piece separately, tho
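Why normalizing and then switching to L2 works for cosine: for unit vectors, the squared L2 distance equals 2 * (1 - cos). A small numeric sketch of that identity follows. The helper names are hypothetical, and cosine distance is assumed to be defined as 1 - cos, which may differ from the exact definition in the codebase.

```rust
/// Dot product of two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// L2 norm of a vector.
fn norm(a: &[f32]) -> f32 {
    dot(a, a).sqrt()
}

/// Cosine distance, assumed here to be 1 - cos(a, b).
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    1.0 - dot(a, b) / (norm(a) * norm(b))
}

/// Returns `a` scaled to unit length (assumes `a` is non-zero).
fn unit(a: &[f32]) -> Vec<f32> {
    let n = norm(a);
    a.iter().map(|x| x / n).collect()
}

/// Squared L2 distance between two equal-length slices.
fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}
```

Since the two quantities differ only by a constant factor, ranking neighbors by L2 on normalized vectors gives the same order as ranking by cosine distance.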
rust/lance-index/src/vector/pq.rs
Outdated
.collect::<Vec<_>>();
let data = T::ArrayType::from(values);
FixedSizeListArray::try_new_from_values(data, self.dimension as i32)?
normalize_fsl(&fsl)?
Should we remove cosine type PQ entirely, and let IVF handle normalization and residuals?
^^ can we add a panic! that makes sure we don't create a cosine PQ?
7ef3c0c to 9d8a2e5
9d8a2e5 to 2be5e30
rust/lance-index/src/vector/ivf.rs
Outdated

// TODO: add range filter
let ivf_transform = Arc::new(IvfTransformer::new(
    centroids.clone(),
    metric_type,
    MetricType::L2,
what about dot product?
fixed
@@ -402,7 +394,7 @@ where
let mut best_kmeans = Self::empty(k, dimension, params.metric_type);
let mut best_stddev = f32::MAX;

let rng = rand::rngs::SmallRng::from_entropy();
let rng = SmallRng::from_entropy();
nit: add a todo to use seeds
done.
rust/lance-linalg/src/kmeans.rs
Outdated
.collect::<Vec<_>>();
return compute_partitions_l2(centroids_array, &normalized, dimension)
    .collect();
panic!("KMeans: should not use cosine distance to train kmeans, use L2 instead.");
add this check in the constructor?
The match arm needs a branch to handle MetricType::Cosine tho.
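One way to reconcile the two comments above, as a rough sketch: reject cosine in the constructor, and keep the match exhaustive with an arm that documents the invariant. The Metric enum, Trainer type, and try_new are hypothetical stand-ins, not the actual KMeans types, and the Dot arm is only a placeholder formula.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Metric {
    L2,
    Cosine,
    Dot,
}

#[derive(Debug)]
struct Trainer {
    metric: Metric,
}

impl Trainer {
    /// Rejects cosine up front so later code can rely on the invariant.
    fn try_new(metric: Metric) -> Result<Self, String> {
        if metric == Metric::Cosine {
            return Err("KMeans: cosine is not supported for training; \
                        normalize the data and use L2 instead"
                .to_string());
        }
        Ok(Self { metric })
    }

    fn distance(&self, a: &[f32], b: &[f32]) -> f32 {
        match self.metric {
            // Squared L2 distance.
            Metric::L2 => a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>(),
            // Placeholder: 1 - dot product, for illustration only.
            Metric::Dot => 1.0 - a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>(),
            // The match must still be exhaustive; this arm cannot be reached
            // because try_new never constructs a cosine trainer.
            Metric::Cosine => unreachable!("cosine is rejected in try_new"),
        }
    }
}
```

Returning an error from the constructor surfaces the misuse at build time of the trainer instead of panicking deep inside partition assignment.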
let partition_ids =
    self.ivf
        .find_partitions(&query.key, query.nprobes, self.metric_type)?;
let mut query = query.clone();
how cheap is this copy, can we avoid it for the L2 path?
This is very cheap: the query vector is zero-copy (copied by pointer), and other than that there are just 6 other string/int fields.
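A small sketch of why such a clone is cheap when the vector sits behind a reference-counted pointer. The Query struct here is a simplified, hypothetical stand-in (an Arc<[f32]> key and only a few scalar fields), not the actual query type.

```rust
use std::sync::Arc;

/// Hypothetical, simplified query: the vector is shared behind an Arc,
/// so cloning the struct copies a pointer and bumps a reference count.
#[derive(Clone)]
struct Query {
    column: String,
    key: Arc<[f32]>,
    k: usize,
    nprobes: usize,
}

/// Returns a copy of `q` with the key replaced, e.g. by a normalized vector.
/// The original query and its key buffer are left untouched.
fn with_key(q: &Query, key: Arc<[f32]>) -> Query {
    let mut q = q.clone();
    q.key = key;
    q
}
```

Only the small String field is deep-copied on clone; the vector data itself is shared between the original and the copy.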
// vector_column,
// )));
// };
let mt = if metric_type == MetricType::Cosine {
nit: make a macro for this?
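A rough sketch of what such a macro could look like, assuming the repeated pattern is "map Cosine to L2, leave everything else alone". The Metric enum and the effective_metric! name are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Metric {
    L2,
    Cosine,
    Dot,
}

/// Maps cosine to L2 (the data is assumed to be normalized already)
/// and passes every other metric through unchanged.
macro_rules! effective_metric {
    ($mt:expr) => {
        match $mt {
            Metric::Cosine => Metric::L2,
            other => other,
        }
    };
}
```

A plain function with the same match body would work equally well here; a macro mainly saves an import at call sites.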
rust/lance/src/index/vector/pq.rs
Outdated
@@ -163,6 +167,7 @@ impl VectorIndex for PQIndex {
});
}
pre_filter.wait_for_ready().await?;
println!("PQIndex::search: metric type: {:?}", self.metric_type);
log?
rust/lance/src/index/vector/pq.rs
Outdated
training_data = normalize_fsl(&training_data)?;
}

println!("PQ Training, ivf: {:?} ", ivf);
log
span!(Level::INFO, "compute residual for PQ training")
    .in_scope(|| ivf2.compute_residual(&training_data, None))
    .await?
I think we still need to make sure we don't residualize for dot.
So, maybe something like pq_params.should_residulize?
So the design lets the caller decide whether to run residual or not. If an IVF is provided, it runs residual PQ.
So PQ is orthogonal to distance type as much as possible.
but we always residualize currently.
need a flag for dot distance to not residualize
Ok, checked it on the caller
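A sketch of the caller-side check discussed in this thread, under the assumption that residuals are taken for every metric except dot. The Metric enum and all function names are hypothetical, not the actual PQ or IVF API.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Metric {
    L2,
    Cosine,
    Dot,
}

/// Assumption from the thread: dot product must not be residualized.
fn should_residualize(metric: Metric) -> bool {
    metric != Metric::Dot
}

/// Residual of a vector against its partition centroid.
fn residual(v: &[f32], centroid: &[f32]) -> Vec<f32> {
    v.iter().zip(centroid).map(|(x, c)| x - c).collect()
}

/// The caller decides what PQ trains on, so PQ itself stays metric-agnostic.
fn pq_training_input(v: &[f32], centroid: &[f32], metric: Metric) -> Vec<f32> {
    if should_residualize(metric) {
        residual(v, centroid)
    } else {
        v.to_vec()
    }
}
```

Keeping the decision in the caller matches the stated design goal that PQ stays orthogonal to the distance type.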
just a few println! and pq residual needs to be fixed
No description provided.