[Bug]: Milvus memory's usage is too high, and the minio's cpu usage is also very high #40270

Open
ljpassingby opened this issue Feb 28, 2025 · 6 comments
Labels: area/performance, kind/bug, severity/major, triage/needs-information

Comments


ljpassingby commented Feb 28, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.5.3
- Deployment mode(standalone or cluster): standalone (without Docker)
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): java-sdk-2.5.2
- OS(Ubuntu or CentOS): EulerOS
- CPU/Memory:  96 cores, 755GB

Current Behavior

I have inserted 200,000,000 rows into a standalone Milvus server, and the Milvus server's memory usage rose to 329.4 GB.
After the vector index was built, MinIO's CPU usage rose to 8000%.

When I restart MinIO after shutting down the Milvus server, MinIO behaves normally. But once I start Milvus again, MinIO's CPU usage climbs back to 8000%, and Milvus's memory usage rises to 300+ GB.

The index type is IVF_SQ8 with 256-dimensional vectors.
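
For reference, a minimal sketch of how an index like this is typically created with the Java SDK 2.x (the collection and field names here are placeholders, and the nlist value is illustrative, not taken from the actual setup):

    import io.milvus.client.MilvusServiceClient;
    import io.milvus.param.ConnectParam;
    import io.milvus.param.IndexType;
    import io.milvus.param.MetricType;
    import io.milvus.param.index.CreateIndexParam;

    public class CreateIvfSq8Index {
        public static void main(String[] args) {
            MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder()
                    .withHost("localhost")   // placeholder host
                    .withPort(19530)
                    .build());

            // Build an IVF_SQ8 index on a 256-dimensional float vector field
            client.createIndex(CreateIndexParam.newBuilder()
                .withCollectionName("my_collection")  // placeholder collection name
                .withFieldName("embedding")           // placeholder vector field (dim=256)
                .withIndexType(IndexType.IVF_SQ8)
                .withMetricType(MetricType.L2)
                .withExtraParam("{\"nlist\":1024}")   // illustrative nlist value
                .build());

            client.close();
        }
    }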

Is there any configuration I can change to avoid the high memory and CPU usage?

Expected Behavior

No response

Steps To Reproduce

Milvus Log

There are many log lines like the following, printed very frequently.

[2025/02/28 15:40:00.581 +08:00] [INFO] [datacoord/task_scheduler.go:258] ["there is no idle indexing node, waiting for retry..."]
[2025/02/28 15:40:01.578 +08:00] [INFO] [datacoord/task_scheduler.go:248] ["task scheduler"] ["task num"=3507]
[2025/02/28 15:40:01.578 +08:00] [INFO] [datacoord/task_scheduler.go:279] ["task is processing"] [taskID=456276053287398509] ["task type"=JobTypeIndexJob] [state=JobStateInProgress]
[2025/02/28 15:40:01.579 +08:00] [INFO] [datacoord/task_index.go:316] ["query task index info successfully"] [taskID=456276053287398509] ["result state"=InProgress] [failReason=]
[2025/02/28 15:40:01.579 +08:00] [INFO] [datacoord/task_scheduler.go:279] ["task is processing"] [taskID=456276053287600963] ["task type"=JobTypeIndexJob] [state=JobStateInit]
[2025/02/28 15:40:01.580 +08:00] [INFO] [datacoord/index_engine_version_manager.go:84] ["Merged current version"] [current=6]
[2025/02/28 15:40:01.580 +08:00] [INFO] [datacoord/task_index.go:261] ["index task pre check successfully"] [taskID=456276053287600963] [segID=456276053211821517]
[2025/02/28 15:40:01.580 +08:00] [INFO] [indexnode/indexnode_service.go:210] ["Get Index Job Stats"] [traceID=230aa89e91105c95d078d95033492943] [unissued=0] [active=1] [slot=0]
[2025/02/28 15:40:01.580 +08:00] [INFO] [datacoord/task_scheduler.go:258] ["there is no idle indexing node, waiting for retry..."]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456135731644206123] [replicaID=456135731990495233] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456135731644206123] [replicaID=456135731990495233] [nodes="[\"{NodeID: 13, AssignedScore: 20017600.000000, CurrentScore: 22019360.000000, Priority: 2001760}\"]"]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456162640469972779] [replicaID=456162640804708354] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456162640469972779] [replicaID=456162640804708354] [nodes="[\"{NodeID: 13, AssignedScore: 20017600.000000, CurrentScore: 22019360.000000, Priority: 2001760}\"]"]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456276053142546157] [replicaID=456276053475721217] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:01.995 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456276053142546157] [replicaID=456276053475721217] [nodes="[\"{NodeID: 13, AssignedScore: 20103600.000000, CurrentScore: 22113960.000000, Priority: 2010360}\"]"]
[2025/02/28 15:40:01.995 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456276053143974342] [replicaID=456276053475721219] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:01.995 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456276053143974342] [replicaID=456276053475721219] [nodes="[\"{NodeID: 13, AssignedScore: 220107600.000000, CurrentScore: 242118360.000000, Priority: 22010760}\"]"]
[2025/02/28 15:40:02.002 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456276053367555199] [replicaID=456276053475721223] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:02.002 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456276053367555199] [replicaID=456276053475721223] [nodes="[\"{NodeID: 13, AssignedScore: 20017601.000000, CurrentScore: 22019361.100000, Priority: 2001760}\"]"]

Anything else?

No response


dosubot bot commented Feb 28, 2025

There are a few similar discussions related to high memory and CPU usage in Milvus:

  1. Milvus can cause the CPU to suddenly soar to 90% - This discussion is still open and was last updated on January 24, 2025.
  2. Questions on memory usage and disk access patterns of Milvus - This discussion is also open and was last updated on April 2, 2024.
  3. The milvus server has high CPU usage. I don't insert the data and search now. The version is 2.3.11. Why? - This discussion is open and was last updated on March 18, 2024.

For addressing high memory and CPU usage in Milvus and MinIO, you can consider updating the following configurations:

Milvus

  • Reduce CPU Usage:

    • Enable skip mode to reduce CPU usage by skipping some timetick messages:
      dataNode.dataSync.skipMode.enable: true
      dataNode.dataSync.skipMode.skipNum: 4
      dataNode.dataSync.skipMode.coldTime: 60
  • Manage Memory Usage:

    • Adjust the buffer size for segments to control memory usage:
      dataNode.segment.insertBufSize: [desired_value_in_bytes]
      dataNode.segment.deleteBufBytes: [desired_value_in_bytes]
    • Enable force sync when memory usage is high:
      dataNode.memory.forceSyncEnable: true
      dataNode.memory.forceSyncSegmentNum: 1
      dataNode.memory.forceSyncWatermark: 0.5
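
    Taken together, these flat keys correspond to a nested block in milvus.yaml. A sketch of the combined layout (the buffer sizes below are illustrative values, not tuned recommendations):

        dataNode:
          dataSync:
            skipMode:
              enable: true
              skipNum: 4
              coldTime: 60
          segment:
            insertBufSize: 16777216    # illustrative: 16 MiB insert buffer per segment
            deleteBufBytes: 67108864   # illustrative: 64 MiB delete buffer
          memory:
            forceSyncEnable: true
            forceSyncSegmentNum: 1
            forceSyncWatermark: 0.5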

MinIO

  • Adjust MinIO configurations in the milvus.yaml file:
    minio.accessKeyID: [your_access_key]
    minio.secretAccessKey: [your_secret_key]

These configurations can help manage resource usage more effectively. For more detailed configuration options, you can refer to the Milvus documentation [1][2].



ljpassingby commented Feb 28, 2025

For standalone Milvus, is there any recommended guidance on maximum data scale?
Is this resource usage normal for my current data scale?

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
121140 root      20   0  387.7g 330.4g 109384 S   7.5  43.7 217:12.27 CGO_LOAD
114963 root      20   0 4545952   3.4g  23584 S  8104   0.5   8508:51 minio


dosubot bot commented Feb 28, 2025

Milvus Standalone is suitable for medium-sized datasets, scaling up to 100 million vectors [1]. Given that you've inserted 200 million vectors, this exceeds the recommended scale for standalone mode, which could contribute to the high resource usage you're experiencing.

For your current data scale, the resource usage you're observing might be expected due to the large dataset size. To manage resource usage more effectively, consider optimizing memory and CPU usage by adjusting configurations such as enabling skip mode, managing segment buffer sizes, and enabling force sync when memory usage is high [2].


yanliang567 (Contributor) commented

@ljpassingby The memory usage is a bit high if all the segments are indexed, but based on the description above I think there are many compaction tasks running; if so, this is reasonable.
Quick questions:

  1. How many collections and partitions do you have?
  2. Please describe the index to double-check that there are 0 pending_index_rows (see the sketch below).
  3. Also, please share the birdwatcher backup files for investigation. To back up etcd with birdwatcher, refer to this doc: https://github.com/milvus-io/birdwatcher

/assign @ljpassingby
/unassign
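
For question 2, a minimal sketch of checking index state with the Java SDK (collection and index names are placeholders; this assumes the response exposes the proto's pending_index_rows field via getPendingIndexRows()):

    import io.milvus.client.MilvusServiceClient;
    import io.milvus.grpc.DescribeIndexResponse;
    import io.milvus.grpc.IndexDescription;
    import io.milvus.param.ConnectParam;
    import io.milvus.param.R;
    import io.milvus.param.index.DescribeIndexParam;

    public class CheckPendingIndexRows {
        public static void main(String[] args) {
            MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder().withHost("localhost").withPort(19530).build());

            R<DescribeIndexResponse> resp = client.describeIndex(
                DescribeIndexParam.newBuilder()
                    .withCollectionName("my_collection")  // placeholder collection name
                    .withIndexName("my_index")            // placeholder; name used at index creation
                    .build());

            // Print per-index progress; pending_index_rows should be 0 once indexing is done
            for (IndexDescription d : resp.getData().getIndexDescriptionsList()) {
                System.out.printf("index=%s state=%s indexedRows=%d totalRows=%d pendingIndexRows=%d%n",
                    d.getIndexName(), d.getState(),
                    d.getIndexedRows(), d.getTotalRows(), d.getPendingIndexRows());
            }

            client.close();
        }
    }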

ljpassingby (Author) commented

@yanliang567 Thanks for your reply. I will try the suggestions later, but I have a question: why does MinIO use so much CPU, and why does its usage not seem to decrease?

xiaofan-luan (Collaborator) commented

@ljpassingby
There might be a bug in 2.5.3 related to compaction.

Please upgrade to 2.5.5 and see if the issue persists.
