
[Bug]: Index Node CrashLoopBackOff #40241

Open
1 task done
roy-akash opened this issue Feb 27, 2025 · 10 comments
Assignees
Labels
component/indexnode kind/bug Issues or changes related a bug severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@roy-akash

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.5.3
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): java-sdk
- OS(Ubuntu or CentOS): centos
- CPU/Memory: 12/60GB
- GPU: No
- Others:

Current Behavior

There is a specific task ID "456147103870884550" related to stats_log for collection ID "452973157558657785". When this task is scheduled on the index node, it causes a panic and the index node crashes.

From my debugging, I can trace this panic to the part of the code where req.GetInsertLogs()[0] is accessed: the slice appears to be empty, so indexing it panics and the index node crashes.
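The failure mode described above can be illustrated with a minimal, self-contained Go sketch. The type and function names here are hypothetical stand-ins, not the actual Milvus code; the point is only that indexing an empty slice panics with "index out of range", and a length check turns that into a recoverable error:

```go
package main

import "fmt"

// statsRequest is a hypothetical stand-in for the request whose
// GetInsertLogs() slice was observed to be empty in this issue.
type statsRequest struct {
	insertLogs []string
}

func (r *statsRequest) GetInsertLogs() []string { return r.insertLogs }

// firstInsertLog indexes the slice defensively instead of assuming it is
// non-empty; req.GetInsertLogs()[0] on an empty slice would panic with
// "index out of range", matching the crash described above.
func firstInsertLog(req *statsRequest) (string, error) {
	logs := req.GetInsertLogs()
	if len(logs) == 0 {
		return "", fmt.Errorf("stats task has no insert logs; failing the task instead of panicking")
	}
	return logs[0], nil
}

func main() {
	if _, err := firstInsertLog(&statsRequest{}); err != nil {
		fmt.Println("handled gracefully:", err)
	}
}
```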

Expected Behavior

No response

Steps To Reproduce

Not sure how to reproduce, but it is currently happening in our non-production Milvus instance. We recently upgraded this instance from 2.3.19 to 2.5.3.

Garbage collection has been off on this instance for quite some time, and the collection was created 4 months ago. Currently there are no rows in the collection.

Milvus Log

milvus-log.tar.gz

Anything else?

How do I fix this?

@roy-akash roy-akash added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 27, 2025

dosubot bot commented Feb 27, 2025

Hey @roy-akash! 👋 I'm here to help you with any bugs, questions, or contributions you have in mind. Let's squash those bugs together!

To fix the Index Node CrashLoopBackOff issue in Milvus version 2.5.3 caused by a panic due to an empty req.GetInsertLogs()[0], you should ensure that the DataPaths are correctly populated before they are accessed. This can be done by modifying the PreExecute function in the indexBuildTask struct to check if DataPaths are empty and then construct the paths using metautil.BuildInsertLogPath. Here's a relevant code snippet:

func (it *indexBuildTask) PreExecute(ctx context.Context) error {
    if len(it.req.DataPaths) == 0 {
        for _, id := range it.req.GetDataIds() {
            path := metautil.BuildInsertLogPath(it.req.GetStorageConfig().RootPath, it.req.GetCollectionID(), it.req.GetPartitionID(), it.req.GetSegmentID(), it.req.GetField().GetFieldID(), id)
            it.req.DataPaths = append(it.req.DataPaths, path)
        }
    }
    // Additional logic...
    return nil
}

Ensure that this logic is correctly implemented to prevent the panic caused by accessing an empty req.GetInsertLogs()[0]. If the DataPaths are still empty after this logic, further investigation may be needed to determine why the DataIds are not being set or if there is an issue with the BuildInsertLogPath function [1].


@dosubot dosubot bot added component/indexnode severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Feb 27, 2025
@yanliang567
Contributor

/assign @xiaocai2333
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 28, 2025
@roy-akash
Author

@xiaocai2333 please let me know if any info is required regarding this. I want to recover our Milvus instance.

Should I go ahead and try to drop this collection as a mitigation?

@xiaocai2333
Contributor

> @xiaocai2333 please let me know if any info is required regarding this ? I want to recover our milvus instance.
>
> Should i go ahead and try to drop this collection as a mitigation ?

If you have a way to delete a single segment, just delete segment 452973157558666152. You can refer to Birdwatcher for the method to delete an individual segment.
Perhaps this segment contains 0 rows; you can also check it with Birdwatcher.

@xiaocai2333
Contributor

[2025/02/27 10:41:01.652 +00:00] [INFO] [indexnode/indexnode_service.go:380] ["receive stats job"] [traceID=b52a0b2da86be6e002264e7fb531c53e] [clusterID=milvusdefaultv2] [TaskID=456147103870884550] [jobType=JobTypeStatsJob] [collectionID=452973157558657785] [partitionID=452973157558657794] [segmentID=452973157558666152] [targetSegmentID=456147103870884549] [subJobType=Sort] [startLogID=456301132344294350] [endLogID=456301132344299251]

@xiaocai2333
Contributor

By the way, version 2.5.5 has been released and you can upgrade to it directly.

@roy-akash
Author

@xiaocai2333 sure, we can try upgrading as a long-term goal.

When I run show segment for this segment ID, I get:

Milvus(milvusdefaultv2-1707325333543) > show segment --segment 452973157558666152
SegmentID: 452973157558666152 State: Flushed, Row Count:0
--- Growing: 0, Sealed: 0, Flushed: 1 --- Total Segments: 1, row count: 0

When I try to remove it, I get:

Milvus(milvusdefaultv2-1707325333543) > remove segment --segment 452973157558666152
failed to list segments context deadline exceeded

@roy-akash
Author

I've gone ahead and dropped the collection, things seem to be improving now.

However, I am very interested in how a segment with 0 rows can be in a Flushed state.

@xiaocai2333
Contributor

xiaocai2333 commented Mar 3, 2025

> I've gone ahead and dropped the collection, things seem to be improving now.
>
> However i am very much interested in how a segment with 0 rows can be in a flushed state.

After v2.4.x, a segment with 0 rows will be directly marked as dropped.

@roy-akash
Author

So there should be a case added here for backward compatibility, right?
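The backward-compatibility case being asked about can be sketched in Go. This is an illustrative sketch only, not the actual Milvus scheduling code: type and function names are hypothetical, and the idea is simply that a Flushed segment with 0 rows, left over from a pre-2.4 version, should be skipped rather than handed to the index node for a stats job:

```go
package main

import "fmt"

// SegmentInfo is a simplified, hypothetical stand-in for Milvus segment
// metadata; the field names are illustrative, not the actual proto fields.
type SegmentInfo struct {
	ID      int64
	NumRows int64
	State   string // e.g. "Flushed", "Dropped"
}

// shouldScheduleStats sketches the compatibility check discussed above:
// empty segments flushed by older versions carry no data to sort, so a
// stats task for them should be skipped instead of causing a panic.
func shouldScheduleStats(seg SegmentInfo) bool {
	if seg.State != "Flushed" {
		return false
	}
	if seg.NumRows == 0 {
		// Legacy empty segment: nothing to compute stats for.
		return false
	}
	return true
}

func main() {
	legacy := SegmentInfo{ID: 452973157558666152, NumRows: 0, State: "Flushed"}
	fmt.Println(shouldScheduleStats(legacy)) // false: the task is skipped
}
```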
