Index on list of strings #7142
test/test_index_string.cpp
Outdated
// std::cout << tv.get_object(0).get<Int>("_id") << std::endl;
}

// TEST_TYPES(StringIndex_ListOfStrings, std::true_type, std::false_type)
this seems useful for test coverage
Indeed. I had a hard time making it compile, but I figured it out.
if (col_key.is_list()) {
    auto list = o.get_list<String>(col_key);
    for (auto& s : list) {
        index->insert(key, s); // Throws
This will insert duplicate values if they exist in the list. Elsewhere that isn't allowed; shouldn't this protect against it as well?
Good catch. I wrote this first, before really thinking about duplicates.
src/realm/list.cpp
Outdated
void Lst<StringData>::do_insert(size_t ndx, StringData value)
{
    if (auto index = m_obj.get_table()->get_search_index(m_col_key)) {
        if (m_tree->find_first(value) == realm::not_found) {
I suppose the idea behind handling duplicates at this level is to optimize the find operation, but an O(N) cost on insert/set/remove seems very high. What would you think about allowing duplicates into the index to make these operations faster and less error prone? Then we would have a small sort/unique cost applied in StringIndex::from_list_all to remove duplicates from the results. Or perhaps we could even return duplicates there? Then users could see how many items in the list match, which could be a useful feature.
I'll give it a try.
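For readers following the thread, the "allow duplicates, then sort/unique the results" idea can be sketched with standard containers. This is only an illustration under stated assumptions: `DupIndex` and `Key` are hypothetical names, a `std::multimap` stands in for the real index structure, and nothing here reflects how `StringIndex` is actually implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an object key; not Realm's ObjKey.
using Key = std::int64_t;

// Toy index that accepts duplicates: one entry per list element.
struct DupIndex {
    std::multimap<std::string, Key> entries;

    // No lookup in the list before inserting, so duplicates are simply stored.
    void insert(Key key, const std::string& value)
    {
        entries.emplace(value, key);
    }

    // Gather every matching key, then sort and drop repeats at query time.
    std::vector<Key> find_all(const std::string& value) const
    {
        std::vector<Key> result;
        auto range = entries.equal_range(value);
        for (auto it = range.first; it != range.second; ++it)
            result.push_back(it->second);
        std::sort(result.begin(), result.end());
        result.erase(std::unique(result.begin(), result.end()), result.end());
        return result;
    }
};

// Usage: object 1 holds "a" twice, object 2 holds it once.
inline void dup_index_example()
{
    DupIndex index;
    index.insert(1, "a");
    index.insert(1, "a");
    index.insert(2, "a");
    assert(index.entries.count("a") == 3);                   // duplicates are stored
    assert((index.find_all("a") == std::vector<Key>{1, 2})); // but not returned
}
```

Skipping the `std::unique` step would give the "return duplicates" variant mentioned in the comment, where the number of repeats tells the caller how many list elements matched.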
I am a bit wary of changing the invariant that there are no duplicate object keys in the index. I have now changed the code so that inserting a value twice is idempotent, so at least we can avoid the O(N) behavior on insertions. Would that be an acceptable compromise?
I'm good with that. Thanks!
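The compromise agreed on here, an insert that is idempotent per (value, object key) pair, might look roughly like the following toy. `UniqueIndex` and `Key` are hypothetical names and a `std::map` of `std::set` stands in for the real index; the sketch only shows that the "no duplicate object keys" invariant can hold without first scanning the list. It does not address removal, where an index entry has to survive as long as another copy of the value remains in the list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for an object key; not Realm's ObjKey.
using Key = std::int64_t;

// Toy index that keeps at most one entry per (value, key) pair.
struct UniqueIndex {
    std::map<std::string, std::set<Key>> entries;

    // Idempotent: inserting the same pair again changes nothing.
    // Returns true only when the pair was newly added.
    bool insert(Key key, const std::string& value)
    {
        return entries[value].insert(key).second;
    }

    // Number of distinct keys stored for this value.
    std::size_t count(const std::string& value) const
    {
        auto it = entries.find(value);
        return it == entries.end() ? 0 : it->second.size();
    }
};

// Usage: the second insert of the same pair is a no-op.
inline void unique_index_example()
{
    UniqueIndex index;
    bool first = index.insert(1, "a");
    bool second = index.insert(1, "a");
    assert(first && !second);
    assert(index.count("a") == 1);
}
```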
test/test_parser.cpp
Outdated
@@ -2229,7 +2229,7 @@ TEST(Parser_list_of_primitive_strings)

constexpr bool nullable = true;
auto col_str_list = t->add_column_list(type_String, "strings", nullable);
- CHECK_THROW_ANY(t->add_search_index(col_str_list));
+ t->add_search_index(col_str_list);
maybe template this test so that it runs with and without the index?
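One way to picture the suggestion is a test body templated on `std::true_type` / `std::false_type`, echoing the commented-out `TEST_TYPES` line earlier in the review. The sketch below is self-contained and does not use Realm's test macros: `ToyTable`, `query_b`, and `run_both_variants` are hypothetical, and the toy table simply has an indexed and an unindexed lookup path that must agree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Toy table with a single list-of-strings column; not Realm's Table.
class ToyTable {
public:
    void add_row(std::vector<std::string> list)
    {
        if (m_indexed)
            index_row(m_rows.size(), list);
        m_rows.push_back(std::move(list));
    }

    void add_search_index()
    {
        m_indexed = true;
        for (std::size_t row = 0; row < m_rows.size(); ++row)
            index_row(row, m_rows[row]);
    }

    // Rows whose list contains `value`, in ascending order without repeats.
    std::vector<std::size_t> find_all(const std::string& value) const
    {
        std::vector<std::size_t> out;
        if (m_indexed) { // indexed path
            auto it = m_index.find(value);
            if (it != m_index.end())
                out.assign(it->second.begin(), it->second.end());
            return out;
        }
        for (std::size_t row = 0; row < m_rows.size(); ++row) { // linear scan
            const auto& list = m_rows[row];
            if (std::find(list.begin(), list.end(), value) != list.end())
                out.push_back(row);
        }
        return out;
    }

private:
    void index_row(std::size_t row, const std::vector<std::string>& list)
    {
        for (const auto& s : list)
            m_index[s].insert(row);
    }

    std::vector<std::vector<std::string>> m_rows;
    std::map<std::string, std::set<std::size_t>> m_index;
    bool m_indexed = false;
};

// Same scenario, compiled once with and once without the index.
template <class WithIndex>
std::vector<std::size_t> query_b()
{
    ToyTable t;
    t.add_row({"a", "b"});
    if (WithIndex::value)
        t.add_search_index();
    t.add_row({"b", "b"}); // duplicate inside one list
    t.add_row({});
    return t.find_all("b");
}

inline void run_both_variants()
{
    assert(query_b<std::true_type>() == query_b<std::false_type>());
}
```

Running both instantiations against the same expectations is what catches a divergence between the index path and the scan path.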
std::vector<ObjKey> result;

StringIndex* index = m_link_map.get_target_table()->get_search_index(m_column_key);
REALM_ASSERT(index);
I'm a bit wary of this. Is there really no path through the query engine that executes find_all without being indexed?
It is only used in Compare<TCond>::init, and here it is guarded by a call to has_search_index().
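The pattern described in this reply, an indexed helper that asserts its precondition while the single call site checks `has_search_index()` first, can be sketched as follows. All types and functions here (`ToyColumn`, `ToyIndex`, `find_all_indexed`, `find_all`) are hypothetical simplifications over a plain string column; they are not the query engine's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy types; none of these are Realm's classes.
using RowList = std::vector<std::size_t>;
using ToyIndex = std::map<std::string, RowList>;

struct ToyColumn {
    std::vector<std::string> values;
    const ToyIndex* index = nullptr; // null when the column is not indexed

    bool has_search_index() const { return index != nullptr; }
};

// Indexed path. Precondition: the column has an index.
RowList find_all_indexed(const ToyColumn& col, const std::string& value)
{
    assert(col.index);
    auto it = col.index->find(value);
    return it == col.index->end() ? RowList{} : it->second;
}

// Call site: takes the indexed path only when the guard holds.
RowList find_all(const ToyColumn& col, const std::string& value)
{
    if (col.has_search_index())
        return find_all_indexed(col, value);
    RowList rows;
    for (std::size_t i = 0; i < col.values.size(); ++i) {
        if (col.values[i] == value)
            rows.push_back(i);
    }
    return rows;
}
```

The assert is safe only as long as every caller goes through a guard like this, which is the point of the reviewer's question.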
@@ -268,7 +275,7 @@ int64_t IndexArray::index_string(Mixed value, InternalFindResult& result_ref, co
// List of row indices with common prefix up to this point, in sorted order.
if (!sub_isindex) {
    const IntegerColumn sub(m_alloc, ref_type(ref));
-    if (column.is_fulltext()) {
+    if (column.full_word()) {
good idea to share this logic with the fulltext index 👍
What, How & Why?
I have chosen to store the whole word in the index as I believe it would be too slow to try to find the value in the list based on the prefix.
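A rough illustration of the trade-off, under the assumption that a prefix-keyed index can only narrow a lookup down to candidate rows: each candidate's list then has to be scanned to confirm the full value, whereas a whole-word key needs no confirmation step. The names `PrefixIndex`, `WholeWordIndex`, and `prefix_len` are hypothetical, and the toy maps are not the actual index layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using Lists = std::vector<std::vector<std::string>>; // one string list per row
using Rows = std::vector<std::size_t>;

constexpr std::size_t prefix_len = 4; // hypothetical prefix length

// Keyed on a fixed-length prefix: a hit is only a candidate row.
struct PrefixIndex {
    std::map<std::string, std::set<std::size_t>> entries;

    void insert(std::size_t row, const std::string& s)
    {
        entries[s.substr(0, prefix_len)].insert(row);
    }

    Rows find_all(const Lists& lists, const std::string& s) const
    {
        Rows out;
        auto it = entries.find(s.substr(0, prefix_len));
        if (it == entries.end())
            return out;
        for (std::size_t row : it->second) {
            // Confirmation step: linear scan of the candidate's list.
            const auto& list = lists[row];
            if (std::find(list.begin(), list.end(), s) != list.end())
                out.push_back(row);
        }
        return out;
    }
};

// Keyed on the whole string: a hit is already an exact match.
struct WholeWordIndex {
    std::map<std::string, std::set<std::size_t>> entries;

    void insert(std::size_t row, const std::string& s)
    {
        entries[s].insert(row);
    }

    Rows find_all(const std::string& s) const
    {
        auto it = entries.find(s);
        return it == entries.end() ? Rows{} : Rows(it->second.begin(), it->second.end());
    }
};

// Usage: both rows share the prefix "appl", only row 1 contains "applet".
inline void whole_word_example()
{
    Lists lists = {{"apple", "applesauce"}, {"applet"}};
    PrefixIndex by_prefix;
    WholeWordIndex by_word;
    for (std::size_t row = 0; row < lists.size(); ++row) {
        for (const auto& s : lists[row]) {
            by_prefix.insert(row, s);
            by_word.insert(row, s);
        }
    }
    assert((by_prefix.find_all(lists, "applet") == Rows{1}));
    assert((by_word.find_all("applet") == Rows{1}));
}
```

Both lookups return the same rows; the difference is that the prefix variant pays a list scan per candidate.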
Partial fix for #7132
☑️ ToDos
bindgen/spec.yml, if public C++ API changed