Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index on list of strings #7142

Merged
merged 3 commits into from
Nov 18, 2023
Merged

Index on list of strings #7142

merged 3 commits into from
Nov 18, 2023

Conversation

jedelbo
Copy link
Contributor

@jedelbo jedelbo commented Nov 15, 2023

What, How & Why?

I have chosen to store the whole word in the index as I believe it would be too slow to try to find the value in the list based on the prefix.

Partial fix for #7132

☑️ ToDos

  • 📝 Changelog update
  • 🚦 Tests (or not relevant)
  • C-API, if public C++ API changed
  • bindgen/spec.yml, if public C++ API changed

Copy link

coveralls-official bot commented Nov 15, 2023

Pull Request Test Coverage Report for Build github_pull_request_284878

  • 345 of 348 (99.14%) changed or added relevant lines in 10 files are covered.
  • 70 unchanged lines in 16 files lost coverage.
  • Overall coverage increased (+0.006%) to 91.667%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/realm/index_string.cpp 75 78 96.15%
Files with Coverage Reduction New Missed Lines %
src/realm/sync/noinst/client_impl_base.hpp 1 93.91%
test/test_query2.cpp 1 99.08%
src/realm/array_string.cpp 2 88.0%
src/realm/sync/network/http.hpp 2 80.87%
src/realm/sync/network/websocket.cpp 2 74.23%
src/realm/table_view.cpp 2 94.18%
test/test_sync.cpp 2 94.14%
src/realm/sync/network/network.cpp 3 89.86%
src/realm/util/serializer.cpp 3 90.03%
src/realm/sync/instructions.hpp 4 75.96%
Totals Coverage Status
Change from base Build 1841: 0.006%
Covered Lines: 231384
Relevant Lines: 252419

💛 - Coveralls

@jedelbo jedelbo force-pushed the je/list-string-index branch 2 times, most recently from 827b4d8 to 5288797 Compare November 16, 2023 11:50
@jedelbo jedelbo force-pushed the je/list-string-index branch from 5288797 to 0052d98 Compare November 16, 2023 13:15
@jedelbo jedelbo requested a review from ironage November 16, 2023 13:15
@jedelbo jedelbo marked this pull request as ready for review November 16, 2023 14:36
// std::cout << tv.get_object(0).get<Int>("_id") << std::endl;
}

// TEST_TYPES(StringIndex_ListOfStrings, std::true_type, std::false_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems useful for test coverage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, but I had a hard time making it compile, but I found out.

if (col_key.is_list()) {
auto list = o.get_list<String>(col_key);
for (auto& s : list) {
index->insert(key, s); // Throws
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will insert duplicate values if they exist in the list. Elsewhere, that isn't allowed, shouldn't this protect against that as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I made this first - before really thinking of duplicates.

void Lst<StringData>::do_insert(size_t ndx, StringData value)
{
if (auto index = m_obj.get_table()->get_search_index(m_col_key)) {
if (m_tree->find_first(value) == realm::not_found) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the idea behind handling duplicates at this level is to optimize the find operation. But the cost of O(N) on insert/set/remove seems very high. What would you think about allowing duplicates into the index to make these operations faster and less error prone? Then we would have a small sort/unique cost applied in StringIndex::from_list_all to remove duplicates from the results. Or perhaps we could even return duplicates there? Then users could see how many items in the list match which could be a useful feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give it a try.

Copy link
Contributor Author

@jedelbo jedelbo Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit wary of changing the invariant that there are no duplicate object keys in the index. I have now changed the code so that inserting a value twice is idempotent, so at least we can avoid the O(N) behavior on insertions. Would that be an acceptable compromise?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with that. Thanks!

@@ -2229,7 +2229,7 @@ TEST(Parser_list_of_primitive_strings)

constexpr bool nullable = true;
auto col_str_list = t->add_column_list(type_String, "strings", nullable);
CHECK_THROW_ANY(t->add_search_index(col_str_list));
t->add_search_index(col_str_list);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe template this test so that it runs with and without the index?

std::vector<ObjKey> result;

StringIndex* index = m_link_map.get_target_table()->get_search_index(m_column_key);
REALM_ASSERT(index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit wary of this. Is there really no path through the query engine that executes find_all without being indexed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is only used in Compare<TCond>::init, and here it is guarded by a call to has_search_index().

@@ -268,7 +275,7 @@ int64_t IndexArray::index_string(Mixed value, InternalFindResult& result_ref, co
// List of row indices with common prefix up to this point, in sorted order.
if (!sub_isindex) {
const IntegerColumn sub(m_alloc, ref_type(ref));
if (column.is_fulltext()) {
if (column.full_word()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea to share this logic with the fulltext index 👍

@jedelbo jedelbo requested a review from ironage November 17, 2023 10:09
@jedelbo jedelbo force-pushed the je/list-string-index branch from 6d1aad4 to 43ff35e Compare November 17, 2023 10:19
@jedelbo jedelbo merged commit 2e66295 into master Nov 18, 2023
2 checks passed
@jedelbo jedelbo deleted the je/list-string-index branch November 18, 2023 14:13
jedelbo added a commit that referenced this pull request Nov 20, 2023
jedelbo added a commit that referenced this pull request Nov 20, 2023
jedelbo added a commit that referenced this pull request Nov 20, 2023
jedelbo added a commit that referenced this pull request Nov 20, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants