-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement infrastructure for IND discovery and Faida algorithm #272
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
src/core/algorithms/ind/faida/include_lib/atomicbitvector/atomic_bitvector.hpp
Outdated
Show resolved
Hide resolved
src/core/algorithms/ind/faida/include_lib/atomicbitvector/atomic_bitvector.hpp
Outdated
Show resolved
Hide resolved
src/core/algorithms/ind/faida/include_lib/atomicbitvector/atomic_bitvector_no_warn.hpp
Outdated
Show resolved
Hide resolved
src/core/algorithms/ind/faida/include_lib/emhash/hash_set2_no_warn.hpp
Outdated
Show resolved
Hide resolved
src/core/algorithms/ind/faida/include_lib/emhash/hash_table7.hpp
Outdated
Show resolved
Hide resolved
src/core/algorithms/ind/faida/include_lib/emhash/hash_table7_no_warn.hpp
Outdated
Show resolved
Hide resolved
src/core/algorithms/ind/faida/include_lib/emhash/hash_table8.hpp
Outdated
Show resolved
Hide resolved
src/core/algorithms/ind/faida/include_lib/emhash/hash_table8_no_warn.hpp
Outdated
Show resolved
Hide resolved
d57de91
to
289a289
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
289a289
to
9f5725a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of these days I will try to review some points in more detail and add some comments.
src/core/algorithms/ind/faida/candidate_generation/apriori_candidate_generator.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
src/core/algorithms/ind/faida/inclusion_testing/combined_inclusion_tester.cpp
Outdated
Show resolved
Hide resolved
src/core/algorithms/ind/faida/inclusion_testing/combined_inclusion_tester.cpp
Outdated
Show resolved
Hide resolved
@alexandrsmirn, please take a look at the two pull requests (#295 and #296). We communicated in a personal dialogue and I hope that I was able to convince of the need to merge the correct option for working with tables.
Regarding the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Faida
is a false positive algorithm, so it should find all dependencies, plus it may find the wrong dependencies. So I decided to run the Faida
for unary dependencies (that is, I specified the appropriate option findNary: false
) on my tests to make sure that the Faida
works correctly. I found some differences in the results:
- for
TestWide.csv
Faida
doesn't find any dependencies (but expected2 -> 3
and3 -> 2
) - for
neighbors10k.csv
andneighbors100k.csv
Faida
doesn’t find dependency5 -> 6
(expected3 -> 4
,4 -> 3
,5 -> 6
, but last dependency wasn’t found — column 5 have unique values {“2”}, column 6 have unique values {“1”, “2”})
I also have a question that arose when comparing the results of dependency mining on the dataset CIPublicHighway700.csv
and EpicMeds.csv
(these datasets are necessary for testing work with nulls). Therefore, I will not provide the expected results, but will propose to discuss the rules for inferring inds in the case of working with nulls. I would also like to hear the opinion from @chernishev. I have also already discussed these rules with @polyntsov.
For simplicity, we will talk about unary dependencies. We will use the following notation:
Suppose dependency A -> B
satisfied, where A
and B
are columns without nulls. Also, the notation A + nulls
means that we added nulls to column A
.
Then the following is true:
A + nulls -> B
not satisfiedA -> B + nulls
satisfiedA + nulls -> B + nulls
not satisfied (if flagnull_equal_nulls
isfalse
)A + nulls -> B + nulls
satisfied (if flagnull_equal_nulls
istrue
)
We also separately consider the case when one of the columns consists entirely of nulls (nulls -> B
and A -> nulls
) - such dependencies are not satisfied.
src/core/algorithms/ind/faida/inclusion_testing/sampled_inverted_index.h
Outdated
Show resolved
Hide resolved
We have discussed this in voice call; proposed rules are fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
02470b5
to
23778fe
Compare
This was because my implementation of Faida used some optimization and ignored columns which are constant (the sutuation when all the values in a column are the same), and the columns which are null-columns (all values are null). Now I added two options to control this feature, and now it disabled by default, so the results should be the same. Also, the algorithm now supports the rules proposed by you (except that Faida does not support the semantics |
1eddb08
to
7f7150a
Compare
src/tests/test_faida.cpp
Outdated
TEST_F(FaidaINDAlgorithmTest, TestEmptyFolder) { | ||
IndsTest expected_inds{}; | ||
|
||
auto file_name = "empty_dataset"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is worth adding configs to the all_tables_config.h and using them in this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of small comments, but I approve.
76e56ec
to
160e761
Compare
This pull request introduces Faida algorithm for inclusion dependency discovery.
It is easier to review this PR by individual commits:
Node: this PR contains source code of third-party libraries: emhash, murmurhash3, atomicbitvector. The code is added in the second commit. This is a temporary solution, so later third-party code should be removed, and libraries should be downloaded at the build stage.