
bug: simple aggregation does not return #17036

Open
1 of 2 tasks
rad-pat opened this issue Dec 11, 2024 · 18 comments · May be fixed by #17235
Labels
C-bug Category: something isn't working

Comments

@rad-pat

rad-pat commented Dec 11, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Version

v1.2.670-nightly

What's Wrong?

A simple aggregation query does not return. If we remove just one field from the aggregation, it runs quickly. Extracting the data to another server and running the query there also seems to work, so I'm not sure why the query will not run where it is. I was advised to record logs at the INFO level; they are attached.

Please see the attached logs; the query of interest has id 1adebd66-611e-41eb-9f2c-3bbf93931d36
databend-query-logs.txt

If you require any further details, please let me know.

How to Reproduce?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@zhang2014
Member

Possibly related (only in version 670): #17037

@rad-pat
Author

rad-pat commented Dec 11, 2024

> Possibly related (only in version 670): #17037

It was happening prior to 670; we upgraded to see if it had been fixed.

@zhang2014
Member

> Possibly related (only in version 670): #17037
>
> It was happening prior to 670; we upgraded to see if it had been fixed.

Please wait for the version released after #17037 is merged.

@rad-pat
Author

rad-pat commented Dec 12, 2024

> Possibly related (only in version 670): #17037
>
> It was happening prior to 670; we upgraded to see if it had been fixed.
>
> Please wait for the version released after #17037 is merged.

I tried version v1.2.672, which includes #17037, and the query still does not return. Please see the revised logs attached.
databend-query-logs12Dec2024.txt

@inviscid

inviscid commented Jan 8, 2025

@zhang2014 any chance this could get some attention on the priority list? :)

This is causing significant issues in workflows that involve any significant data size.

@sundy-li
Member

sundy-li commented Jan 8, 2025

Can you reproduce it with set enable_experimental_aggregate_hashtable = 0?
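
A minimal sketch of applying that setting in a SQL session before re-running the aggregation (the table and column names below are placeholders, not taken from this issue):

-- Disable the experimental aggregate hashtable for the current session
SET enable_experimental_aggregate_hashtable = 0;

-- Re-run the hanging aggregation (placeholder table/columns)
SELECT group_col, COUNT(*) AS cnt, SUM(value_col) AS total
FROM my_table
GROUP BY group_col;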

@rad-pat
Author

rad-pat commented Jan 8, 2025

I will try this and report back

@rad-pat
Author

rad-pat commented Jan 9, 2025

I have tried that just now and it doesn't seem to make a difference
databend-query-logs-09Jan25.txt

@rad-pat
Author

rad-pat commented Jan 9, 2025

Does the message below mean that it is waiting for a task for which no workers are assigned?
WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(28), name: "ExchangeSourceReader", elapsed: 210.051349435s, active sync workers: 0

@rad-pat
Author

rad-pat commented Jan 9, 2025

The query was eventually aborted with these final logs:

5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:05.505018Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(170), name: "ExchangeSourceReader", elapsed: 1195.281296491s, active sync workers: 0
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:05.817455Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(28), name: "ExchangeSourceReader", elapsed: 1195.260041322s, active sync workers: 0
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:05.859804Z  WARN databend_query::pipelines::executor::processor_async_task: processor_async_task.rs:146 Very slow processor async task, query_id:"5cb349b3-d3d0-4126-94c9-776c52e96cce", processor id: NodeIndex(28), name: "ExchangeSourceReader", elapsed: 1195.281849182s, active sync workers: 0
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.222264Z  INFO databend_common_pipeline_core::finished_chain: finished_chain.rs:217 Executor apply finished callback state:
├──✅:4.34µs - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──✅:711ns - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.222308Z  INFO databend_query::pipelines::executor::pipeline_executor: pipeline_executor.rs:173 Pipeline executor finished, elapsed: 1200.001358518s
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.222469Z ERROR databend_query::servers::flight::v1::packets::packet_data: packet_data.rs:93 Got error code data packet: AbortedQuery. Code: 1043, Text = Aborted query, because the execution time exceeds the maximum execution time limit.
<Backtrace disabled by default. Please use RUST_BACKTRACE=1 to enable>
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.224702Z  INFO databend_common_base::runtime::runtime: runtime.rs:400 close_sender to shutdown Runtime is sent
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.224747Z  INFO databend_common_base::runtime::runtime: runtime.rs:157 Runtime(Some("StatisticsReceiver")) received shutdown signal, start to shut down
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.225725Z  INFO databend_common_base::runtime::runtime: runtime.rs:400 close_sender to shutdown Runtime is sent
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.225740Z  INFO databend_common_base::runtime::runtime: runtime.rs:157 Runtime(Some("query-ctx")) received shutdown signal, start to shut down
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.226255Z  INFO databend_query::interpreters::common::query_log: query_log.rs:72 query: 5cb349b3-d3d0-4126-94c9-776c52e96cce becomes Aborted
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.226281Z  INFO databend_common_pipeline_core::finished_chain: finished_chain.rs:217 Executor apply finished callback state:
├──❌:1.1879ms - src/query/service/src/servers/flight/v1/exchange/exchange_manager.rs:484:41
├──❌:0ns - (interrupt) src/query/service/src/servers/flight/v1/exchange/exchange_manager.rs:499:22
├──❌:0ns - (interrupt) src/query/service/src/interpreters/hook/compact_hook.rs:80:14
├──❌:0ns - (interrupt) src/query/service/src/interpreters/hook/refresh_hook.rs:61:14
├──✅:3.56µs - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──✅:480ns - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──✅:9.22µs - (always) src/query/service/src/pipelines/pipeline_builder.rs:98:14
├──❌:411.29µs - (always) src/query/service/src/interpreters/interpreter.rs:119:14
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.226484Z  INFO databend_query::pipelines::executor::pipeline_executor: pipeline_executor.rs:173 Pipeline executor finished, elapsed: 1200.004236593s
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228367Z  INFO databend_query::interpreters::common::query_log: query_log.rs:72 query: 5cb349b3-d3d0-4126-94c9-776c52e96cce becomes Aborted
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228504Z  INFO databend_query::servers::http::v1::query::execute_state: execute_state.rs:297 5cb349b3-d3d0-4126-94c9-776c52e96cce: http query changing state from Running to Stopped, reason Err(AbortedQuery. Code: 1043, Text = Aborted query, because the execution time exceeds the maximum execution time limit.
<Backtrace disabled by default. Please use RUST_BACKTRACE=1 to enable> )
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228522Z  INFO databend_query::servers::http::v1::query::execute_state: execute_state.rs:326 5cb349b3-d3d0-4126-94c9-776c52e96cce: http query has change state to Stopped, reason Err(AbortedQuery. Code: 1043, Text = Aborted query, because the execution time exceeds the maximum execution time limit.
<Backtrace disabled by default. Please use RUST_BACKTRACE=1 to enable> )
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228588Z  INFO databend_query::servers::http::v1::query::page_manager: page_manager.rs:195 http query reach end of blocks
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228668Z  INFO databend_common_base::runtime::runtime: runtime.rs:400 close_sender to shutdown Runtime is sent
5cb349b3-d3d0-4126-94c9-776c52e96cce 2025-01-09T09:40:10.228690Z  INFO databend_common_base::runtime::runtime: runtime.rs:157 Runtime(Some("query-ctx")) received shutdown signal, start to shut down

@sundy-li
Member

sundy-li commented Jan 9, 2025

We still can't figure out the reason; it would be better if you could provide us with a case that reproduces the issue.

@rad-pat
Author

rad-pat commented Jan 9, 2025

It is difficult to provide a reproduction case; I have previously tried by extracting the data to a development machine, and the query appeared to run fine there. However, I will try that again in case I get a different result and can demonstrate the failure. If not, is there any other useful information I can provide from this instance?

@zhang2014
Member

> I have tried that just now and it doesn't seem to make a difference databend-query-logs-09Jan25.txt

It appears to be caused by other nodes. Could you provide the logs from all nodes in this cluster?

@rad-pat
Author

rad-pat commented Jan 9, 2025

Those are consolidated logs containing entries from both query pods; unfortunately, they do not indicate which node produced each entry. I will gather the individual logs.
databend-query-node-0.txt
databend-query-node-1.txt

zhang2014 linked a pull request on Jan 9, 2025 that will close this issue
@sundy-li
Member

sundy-li commented Jan 9, 2025

@rad-pat

Hi, we may have found some code that could lead to this issue. You can try two different approaches:

  1. Expand your cluster to three nodes and check whether the query works.
  2. Check PR fix(cluster): try fix cluster agg hang #17235; you can build it from source and test it with the original cluster deployment (two nodes).

@rad-pat
Author

rad-pat commented Jan 9, 2025

Thanks @sundy-li, I will try Option 1 first as that is easy to try. Option 2 requires a little more work to get a build environment going; I don't suppose you could provide a link to a built Docker image for that PR?

@sundy-li
Member

sundy-li commented Jan 9, 2025

See the Makefile, which has:

build-in-docker:
	bash ./scripts/setup/run_build_tool.sh cargo build --target $(TARGET)

@rad-pat
Author

rad-pat commented Jan 9, 2025

Expanding the cluster to 3 nodes made the query work.
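
For reference, one way to confirm how many query nodes the cluster sees after such an expansion is to check the system.clusters table (a sketch assuming a standard Databend deployment):

-- Lists one row per query node registered in the cluster; three rows are expected here
SELECT * FROM system.clusters;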
