Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve index benefit estimation with more accurate query execution (with index) cost calculations #6

Merged
merged 6 commits into from
Jan 21, 2025

Conversation

YanRong-au
Copy link
Contributor

This pull request includes updates to two files to enhance the accuracy of index benefit estimation by improving query execution cost calculations:

  • bin/mindexer:

    • Integrated the improved query execution cost calculation logic into the main script.
  • mindexer/utils/query.py:

    • Added a new function to generate queries designed to determine the number of index keys that will be examined for a particular query and index configuration.

Copy link
Member

@rueckstiess rueckstiess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Yan, I left a few suggestions for improvements inline. Can you take a look?

Also the PR includes the PDF of your thesis. I'm creating a section on the README and will link directly to the PDF in your github repo, so we can remove the PDF from the PR. I'd like to avoid having large binary files in there because everyone who installs the tool would otherwise download the PDF as well.

@@ -161,6 +161,25 @@ def index_intersect(self, index):

return query

def index_number_key_qurey(self, index):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in the function name. Can you change it here and where this is called?

Suggested change
def index_number_key_qurey(self, index):
def index_number_key_query(self, index):

Comment on lines 164 to 182
def index_number_key_qurey(self, index):
"""
return the query that can be used to determine the number of
index key needs to be examined.
"""
query = Query()

for field in index:
if field in self.filter:
if isinstance(self.filter[field], dict):
query.add_predicate({field: self.filter[field]})
break
else:
query.add_predicate({field: self.filter[field]})
else:
break

return query

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a few unit tests under tests/test_query.py to confirm this function does what it is supposed to?

bin/mindexer Outdated
@@ -1,5 +1,6 @@
#!/usr/bin/env python

from ast import Not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be used, can be removed again.

bin/mindexer Outdated
# collscan = 1.0 relative to other costs)
benefit = estimator.get_cardinality() * COLLSCAN_COST - est * cost
# calculating index benefit by substract index cost from collection scan cost
index_cost = (IXSCAN_COST+(len(candidate)-1)*0.05) * index_key_scanned_est
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the 0.05 factor do here? Maybe define at the top with the other constants?

bin/mindexer Outdated
# benefit of the index over a collection scan (assuming cost of
# collscan = 1.0 relative to other costs)
benefit = estimator.get_cardinality() * COLLSCAN_COST - est * cost
# calculating index benefit by substract index cost from collection scan cost
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the comment and explain how the index_key_scanned_est factors in? Feel free to use multiple comment lines, more explicit is better :-)

Copy link
Member

@rueckstiess rueckstiess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Yan, thanks, the code changes are perfect!

I saw you removed the thesis PDF in a commit, but the next commit added it back again. If you could remove the PDF again then we can merge the PR.

I made a separate PR to add a Contributors section in the readme where I link to your thesis in your github repo, so people will still be able to find it.

Copy link
Member

@rueckstiess rueckstiess left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@rueckstiess rueckstiess merged commit 946a999 into mongodb-labs:main Jan 21, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants