[Bug]: Milvus memory's usage is too high, and the minio's cpu usage is also very high #40270

Open
ljpassingby opened this issue Feb 28, 2025 · 6 comments
Labels: area/performance, kind/bug, severity/major, triage/needs-information

Comments


ljpassingby commented Feb 28, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.5.3
- Deployment mode(standalone or cluster): standalone (without Docker)
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): java-sdk-2.5.2
- OS(Ubuntu or CentOS): EulerOS
- CPU/Memory:  96 cores, 755GB

Current Behavior

I have inserted 200,000,000 rows into a standalone Milvus server, and the Milvus server's memory usage rose to 329.4 GB.
After the vector index was built, MinIO's CPU usage rose to 8000%.

When I restart MinIO after shutting down the Milvus server, MinIO behaves normally. But once I start Milvus again, MinIO's CPU usage climbs back to 8000%, and Milvus's memory usage rises to 300+ GB.

The index type is IVF_SQ8 with 256-dimensional vectors.
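
For reference, a minimal sketch of how an index like this is typically created with the Java SDK 2.x (the collection and field names here are placeholders, and the nlist value is illustrative, not taken from the actual setup):

    import io.milvus.client.MilvusServiceClient;
    import io.milvus.param.ConnectParam;
    import io.milvus.param.IndexType;
    import io.milvus.param.MetricType;
    import io.milvus.param.index.CreateIndexParam;

    public class CreateIvfSq8Index {
        public static void main(String[] args) {
            MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder()
                    .withHost("localhost")   // placeholder host
                    .withPort(19530)
                    .build());

            // Build an IVF_SQ8 index on a 256-dimensional float vector field
            client.createIndex(CreateIndexParam.newBuilder()
                .withCollectionName("my_collection")  // placeholder collection name
                .withFieldName("embedding")           // placeholder vector field (dim=256)
                .withIndexType(IndexType.IVF_SQ8)
                .withMetricType(MetricType.L2)
                .withExtraParam("{\"nlist\":1024}")   // illustrative nlist value
                .build());

            client.close();
        }
    }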

Is there any configuration I can change to avoid the high memory and CPU usage?

Expected Behavior

No response

Steps To Reproduce

Milvus Log

There are many log lines like the following, printed very frequently.

[2025/02/28 15:40:00.581 +08:00] [INFO] [datacoord/task_scheduler.go:258] ["there is no idle indexing node, waiting for retry..."]
[2025/02/28 15:40:01.578 +08:00] [INFO] [datacoord/task_scheduler.go:248] ["task scheduler"] ["task num"=3507]
[2025/02/28 15:40:01.578 +08:00] [INFO] [datacoord/task_scheduler.go:279] ["task is processing"] [taskID=456276053287398509] ["task type"=JobTypeIndexJob] [state=JobStateInProgress]
[2025/02/28 15:40:01.579 +08:00] [INFO] [datacoord/task_index.go:316] ["query task index info successfully"] [taskID=456276053287398509] ["result state"=InProgress] [failReason=]
[2025/02/28 15:40:01.579 +08:00] [INFO] [datacoord/task_scheduler.go:279] ["task is processing"] [taskID=456276053287600963] ["task type"=JobTypeIndexJob] [state=JobStateInit]
[2025/02/28 15:40:01.580 +08:00] [INFO] [datacoord/index_engine_version_manager.go:84] ["Merged current version"] [current=6]
[2025/02/28 15:40:01.580 +08:00] [INFO] [datacoord/task_index.go:261] ["index task pre check successfully"] [taskID=456276053287600963] [segID=456276053211821517]
[2025/02/28 15:40:01.580 +08:00] [INFO] [indexnode/indexnode_service.go:210] ["Get Index Job Stats"] [traceID=230aa89e91105c95d078d95033492943] [unissued=0] [active=1] [slot=0]
[2025/02/28 15:40:01.580 +08:00] [INFO] [datacoord/task_scheduler.go:258] ["there is no idle indexing node, waiting for retry..."]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456135731644206123] [replicaID=456135731990495233] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456135731644206123] [replicaID=456135731990495233] [nodes="[\"{NodeID: 13, AssignedScore: 20017600.000000, CurrentScore: 22019360.000000, Priority: 2001760}\"]"]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456162640469972779] [replicaID=456162640804708354] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456162640469972779] [replicaID=456162640804708354] [nodes="[\"{NodeID: 13, AssignedScore: 20017600.000000, CurrentScore: 22019360.000000, Priority: 2001760}\"]"]
[2025/02/28 15:40:01.994 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456276053142546157] [replicaID=456276053475721217] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:01.995 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456276053142546157] [replicaID=456276053475721217] [nodes="[\"{NodeID: 13, AssignedScore: 20103600.000000, CurrentScore: 22113960.000000, Priority: 2010360}\"]"]
[2025/02/28 15:40:01.995 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456276053143974342] [replicaID=456276053475721219] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:01.995 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456276053143974342] [replicaID=456276053475721219] [nodes="[\"{NodeID: 13, AssignedScore: 220107600.000000, CurrentScore: 242118360.000000, Priority: 22010760}\"]"]
[2025/02/28 15:40:02.002 +08:00] [INFO] [balance/score_based_balancer.go:582] ["node channel workload status"] [collectionID=456276053367555199] [replicaID=456276053475721223] [nodes="[\"{NodeID: 13, AssignedScore: 15.000000, CurrentScore: 15.000000, Priority: 0}\"]"]
[2025/02/28 15:40:02.002 +08:00] [INFO] [balance/score_based_balancer.go:514] ["node segment workload status"] [collectionID=456276053367555199] [replicaID=456276053475721223] [nodes="[\"{NodeID: 13, AssignedScore: 20017601.000000, CurrentScore: 22019361.100000, Priority: 2001760}\"]"]

Anything else?

No response


dosubot bot commented Feb 28, 2025

There are a few similar discussions related to high memory and CPU usage in Milvus:

  1. Milvus can cause the CPU to suddenly soar to 90% - This discussion is still open and was last updated on January 24, 2025.
  2. Questions on memory usage and disk access patterns of Milvus - This discussion is also open and was last updated on April 2, 2024.
  3. The milvus server has high CPU usage. I don't insert the data and search now. The version is 2.3.11. Why? - This discussion is open and was last updated on March 18, 2024.

For addressing high memory and CPU usage in Milvus and MinIO, you can consider updating the following configurations:

Milvus

  • Reduce CPU Usage:

    • Enable skip mode to reduce CPU usage by skipping some timetick messages:
      dataNode.dataSync.skipMode.enable: true
      dataNode.dataSync.skipMode.skipNum: 4
      dataNode.dataSync.skipMode.coldTime: 60
  • Manage Memory Usage:

    • Adjust the buffer size for segments to control memory usage:
      dataNode.segment.insertBufSize: [desired_value_in_bytes]
      dataNode.segment.deleteBufBytes: [desired_value_in_bytes]
    • Enable force sync when memory usage is high:
      dataNode.memory.forceSyncEnable: true
      dataNode.memory.forceSyncSegmentNum: 1
      dataNode.memory.forceSyncWatermark: 0.5
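
    Taken together, these flat keys correspond to a nested block in milvus.yaml. A sketch of the combined layout (the buffer sizes below are illustrative values, not tuned recommendations):

        dataNode:
          dataSync:
            skipMode:
              enable: true
              skipNum: 4
              coldTime: 60
          segment:
            insertBufSize: 16777216    # illustrative: 16 MiB insert buffer per segment
            deleteBufBytes: 67108864   # illustrative: 64 MiB delete buffer
          memory:
            forceSyncEnable: true
            forceSyncSegmentNum: 1
            forceSyncWatermark: 0.5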

MinIO

  • Adjust MinIO configurations in the milvus.yaml file:
    minio.accessKeyID: [your_access_key]
    minio.secretAccessKey: [your_secret_key]

These configurations can help manage resource usage more effectively. For more detailed configuration options, you can refer to the Milvus documentation [1][2].



ljpassingby commented Feb 28, 2025

For standalone Milvus, is there any recommended guidance on maximum data scale?
Is this resource usage normal for my current data scale?

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
121140 root      20   0  387.7g 330.4g 109384 S   7.5  43.7 217:12.27 CGO_LOAD
114963 root      20   0 4545952   3.4g  23584 S  8104   0.5   8508:51 minio


dosubot bot commented Feb 28, 2025

Milvus Standalone is suitable for medium-sized datasets, scaling up to 100 million vectors [1]. Given that you've inserted 200 million vectors, this exceeds the recommended scale for standalone mode, which could contribute to the high resource usage you're experiencing.

For your current data scale, the resource usage you're observing might be expected due to the large dataset size. To manage resource usage more effectively, consider optimizing memory and CPU usage by adjusting configurations such as enabling skip mode, managing segment buffer sizes, and enabling force sync when memory usage is high [2].


yanliang567 (Contributor) commented

@ljpassingby The memory usage is a bit high if all the segments are indexed, but based on the description above I think there are many compaction tasks running; if so, this is reasonable.
Quick questions:

  1. How many collections and partitions do you have?
  2. Please describe the index to double-check that there are 0 pending_index_rows (see the sketch below).
  3. Also, please share the birdwatcher backup files for investigation. To back up etcd with birdwatcher, refer to this doc: https://github.com/milvus-io/birdwatcher

/assign @ljpassingby
/unassign
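
For question 2, a minimal sketch of checking index state with the Java SDK (collection and index names are placeholders; this assumes the response exposes the proto's pending_index_rows field via getPendingIndexRows()):

    import io.milvus.client.MilvusServiceClient;
    import io.milvus.grpc.DescribeIndexResponse;
    import io.milvus.grpc.IndexDescription;
    import io.milvus.param.ConnectParam;
    import io.milvus.param.R;
    import io.milvus.param.index.DescribeIndexParam;

    public class CheckPendingIndexRows {
        public static void main(String[] args) {
            MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder().withHost("localhost").withPort(19530).build());

            R<DescribeIndexResponse> resp = client.describeIndex(
                DescribeIndexParam.newBuilder()
                    .withCollectionName("my_collection")  // placeholder collection name
                    .withIndexName("my_index")            // placeholder; name used at index creation
                    .build());

            // Print per-index progress; pending_index_rows should be 0 once indexing is done
            for (IndexDescription d : resp.getData().getIndexDescriptionsList()) {
                System.out.printf("index=%s state=%s indexedRows=%d totalRows=%d pendingIndexRows=%d%n",
                    d.getIndexName(), d.getState(),
                    d.getIndexedRows(), d.getTotalRows(), d.getPendingIndexRows());
            }

            client.close();
        }
    }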

ljpassingby (Author) commented

@yanliang567 Thanks for your reply. I will try the suggestions later, but I have a question: why does MinIO use so much CPU, and why does its usage not seem to decrease?

xiaofan-luan (Collaborator) commented

@ljpassingby
There might be a bug in 2.5.3 related to compaction.

Please upgrade to 2.5.5 and see if the issue persists.
