[Bug]: milvus load stuck at 0% with querynode error: [msgdispatcher/dispatcher.go:122] ["seek failed"] [pchannel=by-dev-rootcoord-dml_6] #39562

Open
1 task done
eskinmi opened this issue Jan 23, 2025 · 6 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.


eskinmi commented Jan 23, 2025

Is there an existing issue for this?

Environment

- Milvus version: 2.5.0
- Deployment mode(standalone or cluster): Cluster (EKS)
- MQ type(rocksmq, pulsar or kafka):  pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.9
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: Xeon Scalable 3.5 GHz 16vCPU & 128Gi
- GPU: N/A

Current Behavior

When I try to load my collection in Milvus, Attu shows the load stuck at 0%. The collection I'm trying to load has about 20M entities. When I look into the querynode logs, I see:

[2025/01/23 21:42:50.576 +00:00] [ERROR] [msgdispatcher/dispatcher.go:122] ["seek failed"] [pchannel=by-dev-rootcoord-dml_6] [subName=querynode-48-by-dev-rootcoord-dml_6_455282050800631750v0-true] [isMain=true] [error="failed to seek, error server error: UnknownError: Error when resetting subscription: unable to persist readPosition for cursor reset 464:805"] [stack="github.com/milvus-io/milvus/pkg/mq/msgdispatcher.NewDispatcher\n\t/workspace/source/pkg/mq/msgdispatcher/dispatcher.go:122\ngithub.com/milvus-io/milvus/pkg/mq/msgdispatcher.(*dispatcherManager).Add\n\t/workspace/source/pkg/mq/msgdispatcher/manager.go:106\ngithub.com/milvus-io/milvus/pkg/mq/msgdispatcher.(*client).Register\n\t/workspace/source/pkg/mq/msgdispatcher/client.go:94\ngithub.com/milvus-io/milvus/internal/util/pipeline.(*streamPipeline).ConsumeMsgStream\n\t/workspace/source/internal/util/pipeline/stream_pipeline.go:122\ngithub.com/milvus-io/milvus/internal/querynodev2.(*QueryNode).WatchDmChannels\n\t/workspace/source/internal/querynodev2/services.go:327\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).WatchDmChannels\n\t/workspace/source/internal/distributed/querynode/service.go:296\ngithub.com/milvus-io/milvus/pkg/proto/querypb._QueryNode_WatchDmChannels_Handler.func1\n\t/workspace/source/pkg/proto/querypb/query_coord_grpc.pb.go:2020\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).startGrpcLoop.ServerIDValidationUnaryServerInterceptor.func7\n\t/workspace/source/pkg/util/interceptor/server_id_interceptor.go:54\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).startGrpcLoop.ChainUnaryServer.func8.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).startGrpcLoop.ClusterValidationUnaryServerInterceptor.func6\n\t/workspace/source/pkg/util/interceptor/cluster_interceptor.go:48\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).startGrpcLoop.ChainUnaryServer.func8.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/milvus-io/milvus/pkg/util/logutil.UnaryTraceLoggerInterceptor\n\t/workspace/source/pkg/util/logutil/grpc_interceptor.go:23\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).startGrpcLoop.ChainUnaryServer.func8.1.1\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25\ngithub.com/milvus-io/milvus/internal/distributed/querynode.(*Server).startGrpcLoop.ChainUnaryServer.func8\n\t/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34\ngithub.com/milvus-io/milvus/pkg/proto/querypb._QueryNode_WatchDmChannels_Handler\n\t/workspace/source/pkg/proto/querypb/query_coord_grpc.pb.go:2022\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1379\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1790\ngoogle.golang.org/grpc.(*Server).serveStreams.func2.1\n\t/go/pkg/mod/google.golang.org/[email protected]/server.go:1029"]
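
For anyone scanning many querynode log lines like the one above, the bracketed `key=value` fields can be pulled out with a small parser. This is a minimal sketch using only the Python standard library. The `parse_fields` helper and the regular expression are illustrative, not part of Milvus, and the sample line is the error above with the `stack` field left out for brevity.

```python
import re

# Matches "[key=value]" fields. A quoted value may contain "]" or escaped
# characters; an unquoted value runs up to the closing bracket.
FIELD_RE = re.compile(r'\[(\w+)=("(?:[^"\\]|\\.)*"|[^\]]*)\]')


def parse_fields(line: str) -> dict:
    """Return the key=value fields of one log line as a dict of strings."""
    fields = {}
    for key, raw in FIELD_RE.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1]
        fields[key] = raw
    return fields


# The log line from this report, without the stack field.
sample = (
    '[2025/01/23 21:42:50.576 +00:00] [ERROR] [msgdispatcher/dispatcher.go:122] '
    '["seek failed"] [pchannel=by-dev-rootcoord-dml_6] '
    '[subName=querynode-48-by-dev-rootcoord-dml_6_455282050800631750v0-true] '
    '[isMain=true] [error="failed to seek, error server error: UnknownError: '
    'Error when resetting subscription: unable to persist readPosition for '
    'cursor reset 464:805"]'
)

fields = parse_fields(sample)
print(fields["pchannel"])
print(fields["error"])
```

Grouping the parsed `pchannel` values across a log file is one way to see whether the seek failure is limited to a single channel.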

Expected Behavior

The collection should load successfully.

Steps To Reproduce

No response

Milvus Log

milvus-log.tar.gz

Anything else?

No response

@eskinmi eskinmi added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 23, 2025
@yanliang567
Contributor

/assign @bigsheeper
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2025
@xiaofan-luan
Collaborator

@eskinmi

I think this might be a different issue.
The error message shows a seek failure, which means Pulsar reported an error when subscribing to by-dev-rootcoord-dml_6.

  1. Restart the Pulsar cluster and see whether that works.
  2. Use pulsarctl to connect to the Pulsar cluster and subscribe to the topic by-dev-rootcoord-dml_6 to see whether data can be read.
  3. Upgrade Milvus to 2.5.4.
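
The subscribe check in step 2 can also be done from Python instead of pulsarctl. The sketch below swaps in the `pulsar-client` package and makes several assumptions that are not confirmed in this thread: the topic lives under the `public` tenant and `default` namespace, the broker URL is whatever your deployment exposes, and `debug-seek-check` is a made-up throwaway subscription name. The helper names are hypothetical.

```python
def full_topic_name(pchannel: str, tenant: str = "public",
                    namespace: str = "default") -> str:
    """Build a fully qualified topic name for a Milvus pchannel.

    The tenant and namespace defaults are assumptions; check the pulsar
    section of your Milvus configuration for the real values.
    """
    return f"persistent://{tenant}/{namespace}/{pchannel}"


def try_read_one(service_url: str, pchannel: str, timeout_ms: int = 5000) -> str:
    """Subscribe from the earliest position and read a single message.

    Returns the message ID as a string. receive() raises if nothing
    arrives within timeout_ms, which is itself a useful signal here.
    """
    import pulsar  # from the pulsar-client package, imported lazily

    client = pulsar.Client(service_url)
    try:
        consumer = client.subscribe(
            full_topic_name(pchannel),
            subscription_name="debug-seek-check",  # hypothetical name
            initial_position=pulsar.InitialPosition.Earliest,
        )
        try:
            msg = consumer.receive(timeout_millis=timeout_ms)
            return str(msg.message_id())
        finally:
            # Remove the throwaway subscription so it does not hold a backlog.
            consumer.unsubscribe()
    finally:
        client.close()


# Example call (needs a reachable broker; the URL is an assumption):
# try_read_one("pulsar://localhost:6650", "by-dev-rootcoord-dml_6")
```

If the call returns a message ID, the topic is readable and the problem is more likely tied to the specific position the querynode tried to seek to.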

@xiaofan-luan
Collaborator

If your Pulsar TTL (Time-To-Live) is set to a very short duration, it can also cause similar issues. When the system recovers from a failure, the corresponding position may have already been garbage collected.

@xiaofan-luan
Collaborator

Check which collection's watch failed and use birdwatcher to inspect the checkpoints. If some of the checkpoints are old, we need to reset them to a newer timestamp.
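
To judge whether a checkpoint is "old" relative to the Pulsar TTL, it helps to turn the checkpoint timestamp into wall-clock time. The sketch below assumes the checkpoint is a Milvus hybrid timestamp with the physical time in milliseconds in the high bits and an 18-bit logical counter in the low bits. The function names and the TTL value are illustrative; read the real checkpoint from birdwatcher and the real TTL from your Pulsar namespace policy.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # assumed width of the logical counter in a hybrid timestamp
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def physical_time(ts: int) -> datetime:
    """Return the wall-clock (UTC) part of a hybrid timestamp."""
    physical_ms = ts >> LOGICAL_BITS
    return EPOCH + timedelta(milliseconds=physical_ms)


def checkpoint_may_be_expired(ts: int, ttl: timedelta, now: datetime) -> bool:
    """True if the checkpoint is older than the TTL as of `now`."""
    return now - physical_time(ts) > ttl


# Illustrative values only: a made-up checkpoint and an assumed 3-day TTL.
example_ts = (1737668570576 << LOGICAL_BITS) | 5
example_now = physical_time(example_ts) + timedelta(days=4)
print(physical_time(example_ts).isoformat())
print(checkpoint_may_be_expired(example_ts, timedelta(days=3), example_now))
```

A checkpoint that is older than the TTL is a candidate for the reset described above, since the position it points to may no longer exist in Pulsar.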

@eskinmi
Author

eskinmi commented Jan 31, 2025

@xiaofan-luan

Thank you very much for your help! Restarting the pulsar cluster did work indeed.
I will make sure to upgrade to the latest version.

🎉

@eskinmi
Author

eskinmi commented Jan 31, 2025

@xiaofan-luan btw I can't see the TTL configuration in here: https://milvus.io/docs/configure_pulsar.md
