-
Notifications
You must be signed in to change notification settings - Fork 3.7k
GPU k means example
Here is a C++ example of k-means on a single GPU, which incidentally shows how the GPU code can be a drop-in replacement for the CPU code:
// just generate some random vecs in a hypercube (CPU)
std::vector<float> makeRandomVecs(size_t numVecs, int dim) {
std::vector<float> vecs(numVecs * dim);
faiss::float_rand(vecs.data(), vecs.size(), 1);
return vecs;
}
int main(int argc, char** argv) {
// Reserves 18% of GPU memory for temporary work by default; the
// size can be adjusted, or your own implementation of GpuResources
// can be made to manage memory in a different way.
StandardGpuResources res;
int cudaDevice = 0;
int dim = 128;
bool useFloat16 = false;
int numberOfEMIterations = 20;
size_t numberOfClusters = 20000;
size_t numVecsToCluster = 5000000;
// generate a bunch of random vectors; note that this is on the CPU!
std::vector<float> vecs = makeRandomVecs(numVecsToCluster, dim);
faiss::gpu::GpuIndexFlatL2 index(&res, cudaDevice, dim, false);
faiss::ClusteringParameters cp;
cp.niter = numberOfEMIterations;
cp.verbose = true; // print out per-iteration stats
// For spherical k-means, use GpuIndexFlatIP and set cp.spherical = true
// By default faiss only samples 256 vectors per centroid, in case
// you are asking for too few centroids for too many vectors.
// e.g., numberOfClusters = 1000, numVecsToCluster = 1000000 would
// only sample 256000 vectors.
//
// You can override this to use any number of clusters
// cp.max_points_per_centroid =
// ((numVecsToCluster + numberOfClusters - 1) / numberOfClusters);
faiss::Clustering kMeans(dim, numberOfClusters, cp);
// do the work!
kMeans.train(numVecsToCluster, vecs.data(), index);
// kMeans.centroids contains the resulting cluster centroids (on CPU)
printf("centroid 3 dim 6 is %f\n", kMeans.centroids[3 * dim + 6]);
return 0;
}
On a P100-PCIe GPU, it generates output that looks like this:
Clustering 5000000 points in 128D to 20000 clusters, redo 1 times, 20 iterations
Preprocessing in 0.585711 s
Iteration 19 (109.14 s, search 106.78 s): objective=4.51621e+07 imbalance=1.012 nsplit=0
centroid 3 dim 6 is 0.468013
Of note, the actual vectors being clustered above are coming from the CPU. They don't remain resident on the GPU and are only paged in as needed; the memory usage on the GPU is limited. Thus, massive datasets can be clustered.
Part of the k-means code is shared between GPU and CPU (the faiss::Clustering
object); in fact, our k-means could be made even faster by moving the CPU components to the GPU, but for very large k-means applications (e.g., the clustering for the DEEP1B k-NN graph), the CPU overhead is a small fraction (<5%) of overall work. In the meantime, faiss::Clustering
requires CPU input and generates CPU output; unlike GpuIndexFlatL2
, it cannot accept GPU-resident input. Converting the above to CPU k-means requires changing GpuIndexFlatL2
to IndexFlatL2
.
Multi-GPU clustering is similar, the GpuIndexFlatL2
must be replaced with an IndexProxy
of GpuIndexFlatL2
s. See benchs/kmeans_mnist.py for an example in Python.
TODO: show multi-GPU clustering in C++
Faiss building blocks: clustering, PCA, quantization
Index IO, cloning and hyper parameter tuning
Threads and asynchronous calls
Inverted list objects and scanners
Indexes that do not fit in RAM
Brute force search without an index
Fast accumulation of PQ and AQ codes (FastScan)
Setting search parameters for one query
Binary hashing index benchmark