[Bug]: When building an index for the same data, the CPU usage of the master image is 100%, but that of the 2.4 image is only 60% #39090

Closed
1 task done
ThreadDao opened this issue Jan 8, 2025 · 9 comments
Assignees
Labels
kind/bug Issues or changes related a bug severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20250106-9e221063-amd64 vs master-20250108-f0dae814-amd64
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Tested index build cost for SIFT 10M data on the master and 2.4 branches. CPU usage on master is 100%, while on 2.4 it is only 60%.

  1. standalone config
    standalone:
      env:
      - name: GOTRACEBACK
        value: crash
      replicas: 1
      resources:
        limits:
          cpu: "8" 
          memory: 32Gi
        requests:
          cpu: "8" 
          memory: 30Gi
  2. index params:
{'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 30, 'efConstruction': 360}}
  3. index cost and cpu usage

pyroscope of stats-24-op-55-4263
image

pyroscope of stats-master-op-62-8296
image

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/8a74abd1-0140-4ce5-af21-f146e2275f1f?nodeId=zong-stats-index-4

Milvus Log

  • pods of 2.4:
stats-24-op-55-4263-etcd-0                                        1/1     Running                  0               5h20m   10.104.25.5     4am-node30   <none>           <none>
stats-24-op-55-4263-kafka-0                                       1/1     Running                  1 (5h19m ago)   5h19m   10.104.30.233   4am-node38   <none>           <none>
stats-24-op-55-4263-kafka-1                                       1/1     Running                  1 (5h19m ago)   5h19m   10.104.32.33    4am-node39   <none>           <none>
stats-24-op-55-4263-kafka-2                                       1/1     Running                  0               5h19m   10.104.19.161   4am-node28   <none>           <none>
stats-24-op-55-4263-kafka-zookeeper-0                             1/1     Running                  0               5h19m   10.104.32.31    4am-node39   <none>           <none>
stats-24-op-55-4263-kafka-zookeeper-1                             1/1     Running                  0               5h19m   10.104.17.211   4am-node23   <none>           <none>
stats-24-op-55-4263-kafka-zookeeper-2                             1/1     Running                  0               5h19m   10.104.19.160   4am-node28   <none>           <none>
stats-24-op-55-4263-milvus-standalone-b95fd7954-x65ss             1/1     Running                  0               5h18m   10.104.9.159    4am-node14   <none>           <none>
stats-24-op-55-4263-minio-64c5f5f586-zdqkb                        1/1     Running                  0               5h19m   10.104.25.7     4am-node30   <none>           <none>
  • pods of master
stats-master-op-62-8296-etcd-0                                    1/1     Running                  0              5h6m    10.104.25.3     4am-node30   <none>           <none>
stats-master-op-62-8296-kafka-0                                   1/1     Running                  1 (5h6m ago)   5h6m    10.104.19.156   4am-node28   <none>           <none>
stats-master-op-62-8296-kafka-1                                   1/1     Running                  1 (5h6m ago)   5h6m    10.104.25.4     4am-node30   <none>           <none>
stats-master-op-62-8296-kafka-2                                   1/1     Running                  1 (5h6m ago)   5h6m    10.104.30.231   4am-node38   <none>           <none>
stats-master-op-62-8296-kafka-zookeeper-0                         1/1     Running                  0              5h6m    10.104.19.157   4am-node28   <none>           <none>
stats-master-op-62-8296-kafka-zookeeper-1                         1/1     Running                  0              5h6m    10.104.32.28    4am-node39   <none>           <none>
stats-master-op-62-8296-kafka-zookeeper-2                         1/1     Running                  0              5h6m    10.104.20.211   4am-node22   <none>           <none>
stats-master-op-62-8296-milvus-standalone-77c5ddc7d6-gzj5t        1/1     Running                  0              5h5m    10.104.6.4      4am-node13   <none>           <none>
stats-master-op-62-8296-minio-7fdf8fbc77-r4rnz                    1/1     Running                  0              5h6m    10.104.32.27    4am-node39   <none>           <none>
stats-master-op-62-8296-minio-update-prometheus-secret-zzh49      0/1     Completed                0              5h6m    10.104.6.3      4am-node13   <none>           <none>

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 8, 2025
@ThreadDao ThreadDao added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Jan 8, 2025
@ThreadDao ThreadDao added this to the 2.4.21 milestone Jan 8, 2025
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 8, 2025
@yanliang567
Contributor

/unassign

@xiaocai2333
Contributor

xiaocai2333 commented Jan 10, 2025

@ThreadDao In standalone mode, the default CPU usage is limited to 75%, so the usage of 2.4 is expected. However, I'm still investigating why the master can utilize 100% of the CPU.

@yanliang567
Contributor

> @ThreadDao In standalone mode, the default CPU usage is limited to 75%, so the usage of 2.4 is expected. However, I'm still investigating why the master can utilize 100% of the CPU.

Is there any config for this CPU usage limitation? I remember it being 50%. @xiaocai2333

@xiaocai2333
Contributor

> > @ThreadDao In standalone mode, the default CPU usage is limited to 75%, so the usage of 2.4 is expected. However, I'm still investigating why the master can utilize 100% of the CPU.
>
> is there any config for this cpu usage limitation? I remember it was 50%? @xiaocai2333

Yeah, the index-building pool size is usually set to 50% in standalone mode.
When Milvus starts, both indexnode and querynode set their respective index-building pool sizes based on the configuration:

In standalone mode, the component that starts later overwrites the value set by the earlier one. Typically the querynode starts last, so the index-building pool size usually ends up at 50% in standalone mode.
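
The override described above can be pictured with a small sketch. It assumes a single process-wide pool size that each component sets during startup. The function names, the rounding rule, and the 0.75 / 0.5 ratios are illustrative assumptions based on the numbers mentioned in this thread, not Milvus source code.

```python
import math

# Hypothetical process-wide setting shared by all components in standalone mode.
_build_pool_threads = None


def _set_build_pool(ratio: float, cpu_cores: int) -> None:
    """Overwrite the shared build pool size; the last caller wins."""
    global _build_pool_threads
    _build_pool_threads = max(1, math.floor(cpu_cores * ratio))


def init_index_node(cpu_cores: int) -> None:
    _set_build_pool(0.75, cpu_cores)  # assumed indexnode ratio


def init_query_node(cpu_cores: int) -> None:
    _set_build_pool(0.5, cpu_cores)  # assumed querynode ratio


def start_standalone(cpu_cores: int, order=("indexnode", "querynode")) -> int:
    """Start components in the given order and return the final pool size."""
    starters = {"indexnode": init_index_node, "querynode": init_query_node}
    for name in order:
        starters[name](cpu_cores)
    return _build_pool_threads
```

With 8 cores, `start_standalone(8)` ends with 4 build threads because the querynode runs last, while reversing the order would leave 6.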

@cqy123456
Contributor

CPU utilization when building a DiskANN index on the master branch is normal: 50% (= build thread number / total core number).

Image
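
As a quick check of the ratio quoted above, here is a minimal sketch of the arithmetic. The function name is hypothetical, and the 8-core figure comes from the standalone resource limits in the issue description.

```python
def expected_build_cpu_util(build_threads: int, total_cores: int) -> float:
    """Expected CPU utilization = build thread number / total core number."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    # Threads beyond the core count cannot push utilization above 100%.
    return min(build_threads, total_cores) / total_cores
```

For the 8-core standalone pod, 4 build threads give 0.5 and 8 build threads give 1.0, which matches the 50% and 100% figures discussed in this issue.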

@cqy123456
Contributor

cqy123456 commented Jan 16, 2025

The HNSW implementation was switched from hnswlib to faiss between Milvus 2.4 and master. Is it possible that the omp setter does not take effect in the faiss_hnsw train and add functions?

/assign @alexanderguzhva

@foxspy
Contributor

foxspy commented Jan 16, 2025

Image

Image

https://github.com/zilliztech/knowhere/blob/main/src/index/hnsw/faiss_hnsw.cc#L113

From the test results, build_pool is effective and can limit the CPU resources used to build the index. The code also shows that the build is limited by omp: https://github.com/zilliztech/knowhere/blob/main/src/index/hnsw/faiss_hnsw.cc#L113
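
To illustrate the general idea of a pool capping build parallelism, here is a rough Python stand-in. It is not knowhere's code and does not use OpenMP; it only shows that a fixed-size pool bounds how many tasks run at once. The function name and the sleep-based workload are assumptions for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(pool_size: int, num_tasks: int) -> int:
    """Run num_tasks on pool_size workers and return the peak number running at once."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def task() -> None:
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # stand-in for index-building work
        with lock:
            running -= 1

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(task) for _ in range(num_tasks)]
        for future in futures:
            future.result()  # re-raise any exception from a task
    return peak
```

However many tasks are submitted, the observed peak never exceeds `pool_size`.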

@foxspy
Contributor

foxspy commented Jan 16, 2025

Image

sre-ci-robot pushed a commit that referenced this issue Jan 16, 2025
sre-ci-robot pushed a commit that referenced this issue Jan 16, 2025
issue: #39090 
The num_build_thread parameter limits the number of OpenMP threads used
for a build, and it overrides the effect of buildIndexThreadPoolRatio.
Removing this parameter has no other effect: it is actually only used
in the growing index, where it is set explicitly.

Signed-off-by: xianliang.li <[email protected]>
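
The override rule in the commit message can be sketched as follows. This is a minimal model under the assumption that an explicitly set `num_build_thread` simply replaces the thread count derived from `buildIndexThreadPoolRatio`; the function name and the example values are hypothetical and are not taken from the Milvus or knowhere source.

```python
from typing import Optional


def effective_build_threads(pool_threads: int,
                            num_build_thread: Optional[int] = None) -> int:
    """Return the thread count a build would use under the assumed override rule."""
    if num_build_thread is not None:
        # Assumption: an explicit per-build parameter wins over the pool-derived value.
        return num_build_thread
    return pool_threads
```

Under this model, a pool sized to 4 threads on an 8-core pod still builds with 8 threads if `num_build_thread=8` is passed, which would be consistent with the 100% CPU usage reported here.
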
@yanliang567
Contributor

Verified on 2.5-20250120-44649664-amd64: CPU usage stays at 50% in standalone mode.
