[Bug]: Data race and panics in wal and wal scanner #40264

Closed
1 task done
chyezh opened this issue Feb 28, 2025 · 2 comments
Assignees
Labels
- feature/streaming node: streaming node feature
- kind/bug: Issues or changes related to a bug
- severity/major: Major, major function doesn't work under some condition.
- triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

Contributor

chyezh commented Feb 28, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 0a4e7b51164ad7188cce954dfb0a745faabee0f0
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Data race:

2025-02-27T13:26:59.3567122Z WARNING: DATA RACE
2025-02-27T13:26:59.3567222Z Read at 0x00c0029fcb48 by goroutine 14410:
2025-02-27T13:26:59.3567538Z   github.com/milvus-io/milvus/internal/streamingnode/server/wal/metricsutil.(*ScannerMetrics).Close()
2025-02-27T13:26:59.3567851Z       /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/metricsutil/wal_scan.go:179 +0x394
2025-02-27T13:26:59.3568153Z   github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor.(*scannerAdaptorImpl).Close()
2025-02-27T13:26:59.3568469Z       /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor/scanner_adaptor.go:88 +0x129
2025-02-27T13:26:59.3568912Z   github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor_test.(*testOneWALFramework).testRead.deferwrap1()
2025-02-27T13:26:59.3569206Z       /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor/wal_test.go:355 +0x42
2025-02-27T13:26:59.3569286Z   runtime.deferreturn()
2025-02-27T13:26:59.3569627Z       /go/pkg/mod/golang.org/[email protected]/src/runtime/panic.go:602 +0x5d
2025-02-27T13:26:59.3570017Z   github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor_test.(*testOneWALFramework).testReadAndWrite.func2()
2025-02-27T13:26:59.3570372Z       /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor/wal_test.go:175 +0xfc
2025-02-27T13:26:59.3570378Z
2025-02-27T13:26:59.3570503Z Previous write at 0x00c0029fcb48 by goroutine 14430:
2025-02-27T13:26:59.3570826Z   github.com/milvus-io/milvus/internal/streamingnode/server/wal/metricsutil.(*ScannerMetrics).SwitchModel()
2025-02-27T13:26:59.3571133Z       /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/metricsutil/wal_scan.go:124 +0x184
2025-02-27T13:26:59.3571538Z   github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor.(*scannerAdaptorImpl).produceEventLoop()
2025-02-27T13:26:59.3571855Z       /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor/scanner_adaptor.go:138 +0xb15
2025-02-27T13:26:59.3572178Z   github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor.(*scannerAdaptorImpl).execute.func2()
2025-02-27T13:26:59.3572495Z       /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor/scanner_adaptor.go:105 +0xdc
2025-02-27T13:26:59.3572499Z

Panic:

2025-02-27T13:47:39.1441817Z panic: runtime error: invalid memory address or nil pointer dereference
2025-02-27T13:47:39.1441991Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x97c0797]
2025-02-27T13:47:39.1441998Z
2025-02-27T13:47:39.1442079Z goroutine 6388 [running]:
2025-02-27T13:47:39.1442567Z github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick/mvcc.(*MVCCManager).UpdateMVCC(0x0, {0xb514958, 0xc002152d60})
2025-02-27T13:47:39.1442954Z    /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick/mvcc/mvcc_manager.go:52 +0x117
2025-02-27T13:47:39.1443384Z github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick.(*timeTickAppendInterceptor).DoAppend.func1()
2025-02-27T13:47:39.1443777Z    /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick/timetick_interceptor.go:48 +0xe5
2025-02-27T13:47:39.1444441Z github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick.(*timeTickAppendInterceptor).DoAppend(0xc002eff4a0, {0xb41b128, 0xc002f48720}, {0xb514958, 0xc002152d60}, 0xc000a589f0)
2025-02-27T13:47:39.1444902Z    /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick/timetick_interceptor.go:125 +0xecb
2025-02-27T13:47:39.1445668Z github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors.getChainAppendInvoker.func1.adaptAppendWithMetricCollecting.1({0xb41b128, 0xc002f48720}, {0xb514958, 0xc002152d60}, 0xc002f487b0)
2025-02-27T13:47:39.1446020Z    /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/chain_interceptor.go:106 +0x16e
2025-02-27T13:47:39.1446537Z github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors.getChainAppendInvoker.func1({0xb41b128, 0xc002f48720}, {0xb514958, 0xc002152d60})
2025-02-27T13:47:39.1446879Z    /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/chain_interceptor.go:96 +0x2c5
2025-02-27T13:47:39.1447526Z github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/flusher.(*flusherAppendInterceptor).DoAppend(0xc003088140?, {0xb41b128, 0xc002f48720}, {0xb514958, 0xc002152d60}, 0xc002f48780)
2025-02-27T13:47:39.1448028Z    /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/flusher/flusher_interceptor.go:29 +0xa8
2025-02-27T13:47:39.1448537Z github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors.getChainAppendInvoker.func1({0xb41b128, 0xc002f48720}, {0xb514958, 0xc002152d60})
2025-02-27T13:47:39.1448880Z    /go/src/github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/chain_interceptor.go:98 +0x393
2025-02-27T13:47:39.1449490Z github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/redo.(*redoAppendInterceptor).DoAppend(0xc003088140?, {0xb41b128, 0xc002f48720}, {0xb514958, 0xc002152d60}, 0xc002f48750)

Expected Behavior

No response

Steps To Reproduce

Milvus Log

No response

Anything else?

No response

chyezh added the kind/bug and needs-triage (indicates an issue or PR lacks a `triage/foo` label and requires one) labels Feb 28, 2025
Contributor Author

chyezh commented Feb 28, 2025

/assign

chyezh added the feature/streaming node label Feb 28, 2025
dosubot bot added the severity/major label Feb 28, 2025

dosubot bot commented Feb 28, 2025

Hey @chyezh! I'm here to help you with any bugs, questions, or contributions while you wait for a human maintainer. Let me know how I can assist you!

The data race and panic issues you're encountering in the Milvus WAL module seem to be related to concurrency management and potential nil pointer dereferences.

  1. Data Race: The data race is occurring between the Close method and the SwitchModel method of ScannerMetrics. This suggests that there might be concurrent access to shared resources without proper synchronization. To resolve this, you should ensure that access to shared data structures is properly synchronized using mutexes or other concurrency control mechanisms. Specifically, check the scannerAdaptorImpl and ScannerMetrics implementations to ensure that all shared state modifications are protected.

  2. Panic (Nil Pointer Dereference): The panic is happening in the UpdateMVCC method of MVCCManager, indicating that the MVCCManager instance might not be properly initialized before use. Ensure that the MVCCManager is correctly initialized using the NewMVCCManager function before it is used in the timeTickAppendInterceptor. This will prevent nil pointer dereferences by ensuring that all necessary fields are set up before any method calls.

Review the initialization and usage patterns of these components to ensure they are thread-safe and properly initialized. This should help in resolving the data race and panic issues you're facing.
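
As a rough illustration of the synchronization suggested in point 1, here is a minimal Go sketch. It is not the Milvus code: `scannerMetrics`, its `model` and `closed` fields, and the model names are hypothetical stand-ins, since the real `ScannerMetrics` struct is not shown in this issue. It only demonstrates one way to serialize a `Close` call against a `SwitchModel` call running in another goroutine.

```go
package main

import (
	"fmt"
	"sync"
)

// scannerMetrics is a hypothetical stand-in for ScannerMetrics; the real
// struct and its fields are not shown in this issue.
type scannerMetrics struct {
	mu     sync.Mutex
	model  string // hypothetical field written by SwitchModel, read by Close
	closed bool
}

// SwitchModel records the new scan model unless the metrics are closed.
func (m *scannerMetrics) SwitchModel(model string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.closed {
		return
	}
	m.model = model
}

// Close marks the metrics closed and returns the last model it observed.
func (m *scannerMetrics) Close() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.closed = true
	return m.model
}

func main() {
	m := &scannerMetrics{model: "catchup"}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // plays the role of the produce event loop
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			m.SwitchModel("tailing")
		}
	}()
	last := m.Close() // concurrent with SwitchModel, serialized by mu
	wg.Wait()
	fmt.Println("last model seen by Close:", last)
}
```

Running a sketch like this with `go run -race` should not report a race, because every access to `model` happens under `mu`. An alternative design, which may or may not be what the actual fix does, is to wait for the producing goroutine to exit before closing the metrics, so that no lock is needed at all.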


yanliang567 added this to the 2.6.0 milestone Feb 28, 2025
yanliang567 added the triage/accepted label and removed the needs-triage label Feb 28, 2025
yanliang567 removed their assignment Feb 28, 2025
sre-ci-robot pushed a commit that referenced this issue Feb 28, 2025
issue: #40264

- wal may panic when the context is canceled
- scanner may hit a data race when closing

Signed-off-by: chyezh <[email protected]>
chyezh closed this as completed Feb 28, 2025
2 participants