Implement infrastructure for IND discovery and Faida algorithm #272

alexandrsmirn · 2023-09-18T18:44:15Z

This pull request introduces the Faida algorithm for inclusion dependency (IND) discovery.

It is easier to review this PR commit by commit:

* The first commit adds the infrastructure for IND discovery algorithms, along with support for multi-file datasets in Desbordante.
* The second commit adds the implementation of the Faida IND discovery algorithm and some tests for it.
* The third commit integrates Faida into the Desbordante codebase.

Note: this PR contains the source code of third-party libraries: emhash, murmurhash3, atomicbitvector. The code is added in the second commit. This is a temporary solution; later the third-party code should be removed and the libraries downloaded at the build stage.

github-actions

clang-tidy made some suggestions

src/core/algorithms/ind/faida/faida.cpp

src/core/algorithms/ind/faida/hashing/hashing.h

src/core/algorithms/ind/faida/include_lib/atomicbitvector/atomic_bitvector.hpp

src/core/algorithms/ind/faida/include_lib/atomicbitvector/atomic_bitvector_no_warn.hpp

src/core/algorithms/ind/faida/include_lib/emhash/hash_set2_no_warn.hpp

src/core/algorithms/ind/faida/include_lib/emhash/hash_table7.hpp

src/core/algorithms/ind/faida/include_lib/emhash/hash_table7_no_warn.hpp

src/core/algorithms/ind/faida/include_lib/emhash/hash_table8.hpp

src/core/algorithms/ind/faida/include_lib/emhash/hash_table8_no_warn.hpp

github-actions

clang-tidy made some suggestions

src/core/algorithms/ind/faida/hashing/hashing.h

vs9h

One of these days I will try to review some points in more detail and add some comments.

src/core/algorithms/algo_factory.h

src/core/algorithms/ind/ind.h

src/core/algorithms/ind/ind_algorithm.h

src/core/algorithms/ind/faida/util/simple_ind.h

src/tests/test_faida.cpp

src/core/algorithms/ind/faida/faida.h

src/tests/test_faida.cpp

src/core/config/tabular_data/input_tableset_type.h

src/core/algorithms/ind/ind_algorithm.h

src/core/algorithms/ind/faida/candidate_generation/apriori_candidate_generator.cpp

src/core/config/descriptions.h

github-actions

clang-tidy made some suggestions

src/core/algorithms/ind/faida/inclusion_testing/combined_inclusion_tester.cpp

src/core/config/tabular_data/input_table_collection_type.h

src/tests/test_faida.cpp

src/core/algorithms/ind/ind.h

vs9h · 2023-11-25T19:59:12Z

@alexandrsmirn, please take a look at the two pull requests (#295 and #296). We discussed this privately, and I hope I was able to convince you of the need to merge the correct approach to working with tables.
Changes:

* added a `--csv_paths` option that expects a vector of paths (tests now need to pass a vector of files explicitly)
* minor changes to the table collection option (the option and the corresponding files have been renamed)
* removed the `PrettyIND` class
* added type aliases for table and column indexes (both are now `unsigned int`)

Regarding the `PrettyIND` class: transforming an IND into a more user-friendly form is more complex than it might seem at first glance. We need to decide whether to do this in the C++ core or in Python, and which type to use.
Since the order of the tables is no longer ambiguous, there is no need to provide information about it.
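
To illustrate why plain indexes are enough once the table order is fixed, here is a minimal sketch of an IND identified purely by table and column indexes. The alias names `TableIndex` and `ColumnIndex`, the structs `ColumnRefs` and `SimpleInd`, and the `ToString` helpers are hypothetical and are not taken from the Desbordante code; only the choice of `unsigned int` comes from the change list above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical alias names; the change list only states that both are unsigned int.
using TableIndex = unsigned int;
using ColumnIndex = unsigned int;

// One side of an IND: a table (by its position in the input) and some of its columns.
struct ColumnRefs {
    TableIndex table;
    std::vector<ColumnIndex> columns;
};

// An inclusion dependency lhs -> rhs expressed only through indexes.
struct SimpleInd {
    ColumnRefs lhs;
    ColumnRefs rhs;
};

// Renders one side as "(table, [col, col, ...])".
std::string ToString(ColumnRefs const& refs) {
    std::string result = "(" + std::to_string(refs.table) + ", [";
    for (std::size_t i = 0; i < refs.columns.size(); ++i) {
        if (i != 0) result += ", ";
        result += std::to_string(refs.columns[i]);
    }
    return result + "])";
}

// Renders the whole dependency as "lhs -> rhs".
std::string ToString(SimpleInd const& ind) {
    return ToString(ind.lhs) + " -> " + ToString(ind.rhs);
}
```

Because the tables are passed as an ordered vector of paths, a table index maps back to exactly one input file, so a human-readable name can be attached later on either side of the C++/Python boundary.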

vs9h

Faida is an approximate algorithm that may produce false positives: it should find all valid dependencies, but it may also report invalid ones. So I ran Faida for unary dependencies (that is, with the option `findNary: false`) on my tests to make sure that it works correctly. I found some differences in the results:

* for `TestWide.csv`, `Faida` doesn't find any dependencies (expected `2 -> 3` and `3 -> 2`)
* for `neighbors10k.csv` and `neighbors100k.csv`, `Faida` doesn't find the dependency `5 -> 6` (expected `3 -> 4`, `4 -> 3`, `5 -> 6`; column 5 has the unique values {"2"}, column 6 has the unique values {"1", "2"})
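
For reference, the expected `5 -> 6` result follows from the plain definition of a unary IND: every value of the left-hand column must occur in the right-hand column. The sketch below is an exact check of that definition, not Faida's approximate procedure; `HoldsUnaryInd` is a hypothetical helper, not a Desbordante function.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Exact unary IND check: lhs -> rhs holds when every value of lhs occurs in rhs.
// Hypothetical helper for illustration only; nulls are not modelled here.
bool HoldsUnaryInd(std::vector<std::string> const& lhs,
                   std::vector<std::string> const& rhs) {
    std::unordered_set<std::string> const rhs_values(rhs.begin(), rhs.end());
    for (std::string const& value : lhs) {
        if (rhs_values.count(value) == 0) return false;
    }
    return true;
}
```

With a left-hand column whose only distinct value is "2" and a right-hand column with distinct values "1" and "2", the check succeeds, while the reverse direction fails because "1" does not occur on the left.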

I also have a question that arose when comparing the results of dependency mining on the datasets `CIPublicHighway700.csv` and `EpicMeds.csv` (these datasets are needed for testing how nulls are handled). I will not provide the expected results for them yet; instead, I propose to discuss the rules for inferring INDs in the presence of nulls. I would also like to hear @chernishev's opinion. I have already discussed these rules with @polyntsov.

For simplicity, we will talk about unary dependencies and use the following notation. Suppose the dependency `A -> B` is satisfied, where `A` and `B` are columns without nulls, and let `A + nulls` denote column `A` with nulls added.
Then the following is true:

* `A + nulls -> B` is not satisfied
* `A -> B + nulls` is satisfied
* `A + nulls -> B + nulls` is not satisfied (if the flag `null_equal_nulls` is `false`)
* `A + nulls -> B + nulls` is satisfied (if the flag `null_equal_nulls` is `true`)

We also consider separately the case when one of the columns consists entirely of nulls (`nulls -> B` and `A -> nulls`): such dependencies are not satisfied.
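
The rules above can be sketched as a small exact checker. This is only one reading of the proposal, not the Desbordante implementation: the `Column` alias and the function `HoldsUnaryIndWithNulls` are hypothetical, nulls are modelled as `std::nullopt`, and an empty column is treated like an all-null one, which is an assumption of the sketch.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// A column whose cells may be null (std::nullopt). Hypothetical representation.
using Column = std::vector<std::optional<std::string>>;

// Returns true when lhs -> rhs is satisfied under the proposed null rules.
bool HoldsUnaryIndWithNulls(Column const& lhs, Column const& rhs, bool null_equal_nulls) {
    std::unordered_set<std::string> rhs_values;
    bool rhs_has_null = false;
    for (auto const& cell : rhs) {
        if (cell.has_value()) {
            rhs_values.insert(*cell);
        } else {
            rhs_has_null = true;
        }
    }
    // "A -> nulls": a right-hand side without any non-null value never satisfies an IND.
    if (rhs_values.empty()) return false;

    bool lhs_has_value = false;
    for (auto const& cell : lhs) {
        if (!cell.has_value()) {
            // A null on the left is covered only by a null on the right,
            // and only when nulls are considered equal to each other.
            if (!(null_equal_nulls && rhs_has_null)) return false;
        } else {
            lhs_has_value = true;
            if (rhs_values.count(*cell) == 0) return false;
        }
    }
    // "nulls -> B": a left-hand side without any non-null value is not satisfied.
    return lhs_has_value;
}
```

Nulls on the right-hand side alone never break a dependency here, because they only add values to the set that the left-hand side is compared against.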

src/core/algorithms/ind/faida/inclusion_testing/sampled_inverted_index.h

src/core/config/all_options.cpp

chernishev · 2023-12-07T18:53:30Z

> Faida is a false positive algorithm, so it should find all dependencies, plus it may find the wrong dependencies. So I decided to run the Faida for unary dependencies (that is, I specified the appropriate option findNary: false) on my tests to make sure that the Faida works correctly. I found some differences in the results:
>
> * for `TestWide.csv` `Faida` doesn't find any dependencies (but expected `2 -> 3` and `3 -> 2`)
> * for `neighbors10k.csv` and `neighbors100k.csv` `Faida` doesn't find dependency `5 -> 6` (expected `3 -> 4`, `4 -> 3`, `5 -> 6`, but last dependency wasn't found; column 5 have unique values {"2"}, column 6 have unique values {"1", "2"})
>
> I also have a question that arose when comparing the results of dependency mining on the dataset CIPublicHighway700.csv and EpicMeds.csv (these datasets are necessary for testing work with nulls). Therefore, I will not provide the expected results, but will propose to discuss the rules for inferring inds in the case of working with nulls. I would also like to hear the opinion from @chernishev. I have also already discussed these rules with @polyntsov.
>
> For simplicity, we will talk about unary dependencies. We will use the following notation: Suppose dependency A -> B satisfied, where A and B are columns without nulls. Also, the notation A + nulls means that we added nulls to column A. Then the following is true:
>
> * `A + nulls -> B` not satisfied
> * `A -> B + nulls` satisfied
> * `A + nulls -> B + nulls` not satisfied (if flag `null_equal_nulls` is `false`)
> * `A + nulls -> B + nulls` satisfied (if flag `null_equal_nulls` is `true`)
>
> We also separately consider the case when one of the columns consists entirely of nulls (nulls -> B and A -> nulls) - such dependencies are not satisfied.

We have discussed this in voice call; proposed rules are fine.

github-actions

clang-tidy made some suggestions

src/tests/test_faida.cpp

alexandrsmirn · 2023-12-18T23:17:57Z

> Faida is a false positive algorithm, so it should find all dependencies, plus it may find the wrong dependencies. So I decided to run the Faida for unary dependencies (that is, I specified the appropriate option findNary: false) on my tests to make sure that the Faida works correctly. I found some differences in the results:
>
> * for `TestWide.csv` `Faida` doesn't find any dependencies (but expected `2 -> 3` and `3 -> 2`)
> * for `neighbors10k.csv` and `neighbors100k.csv` `Faida` doesn't find dependency `5 -> 6` (expected `3 -> 4`, `4 -> 3`, `5 -> 6`, but last dependency wasn't found; column 5 have unique values {"2"}, column 6 have unique values {"1", "2"})
>
> I also have a question that arose when comparing the results of dependency mining on the dataset CIPublicHighway700.csv and EpicMeds.csv (these datasets are necessary for testing work with nulls). Therefore, I will not provide the expected results, but will propose to discuss the rules for inferring inds in the case of working with nulls. I would also like to hear the opinion from @chernishev. I have also already discussed these rules with @polyntsov.
>
> For simplicity, we will talk about unary dependencies. We will use the following notation: Suppose dependency A -> B satisfied, where A and B are columns without nulls. Also, the notation A + nulls means that we added nulls to column A. Then the following is true:
>
> * `A + nulls -> B` not satisfied
> * `A -> B + nulls` satisfied
> * `A + nulls -> B + nulls` not satisfied (if flag `null_equal_nulls` is `false`)
> * `A + nulls -> B + nulls` satisfied (if flag `null_equal_nulls` is `true`)
>
> We also separately consider the case when one of the columns consists entirely of nulls (nulls -> B and A -> nulls) - such dependencies are not satisfied.

This happened because my implementation of Faida used an optimization that ignored constant columns (columns in which all values are the same) and null columns (columns in which all values are null). I have added two options to control this behavior; it is now disabled by default, so the results should match. The algorithm now also supports the rules you proposed, except that Faida does not support the semantics `null_equal_nulls == false`, only `null_equal_nulls == true`.
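
A minimal sketch of how such a pre-filter might look is given below. It is not the code from this PR: the names `IsNullColumn`, `IsConstantColumn`, `SelectCandidateColumns` and the two flag names are hypothetical stand-ins for the two options mentioned above, and a constant column is taken here to mean a single repeated non-null value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// A column whose cells may be null (std::nullopt). Hypothetical representation.
using Column = std::vector<std::optional<std::string>>;

// True when the column is non-empty and every cell is null.
bool IsNullColumn(Column const& column) {
    return !column.empty() &&
           std::none_of(column.begin(), column.end(),
                        [](std::optional<std::string> const& cell) { return cell.has_value(); });
}

// True when the column is non-empty and every cell holds the same non-null value.
bool IsConstantColumn(Column const& column) {
    if (column.empty() || !column.front().has_value()) return false;
    return std::all_of(column.begin(), column.end(),
                       [&column](std::optional<std::string> const& cell) {
                           return cell == column.front();
                       });
}

// Returns the indexes of the columns that stay in the search space.
// With both flags set to false (the default described above) nothing is dropped.
std::vector<std::size_t> SelectCandidateColumns(std::vector<Column> const& columns,
                                                bool ignore_constant_cols,
                                                bool ignore_null_cols) {
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (ignore_null_cols && IsNullColumn(columns[i])) continue;
        if (ignore_constant_cols && IsConstantColumn(columns[i])) continue;
        kept.push_back(i);
    }
    return kept;
}
```

Under this sketch, a column such as column 5 of the `neighbors` datasets, whose only value is "2", is dropped when constant columns are ignored, which is consistent with `5 -> 6` not being reported before the options were added.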

vs9h · 2024-01-18T19:35:20Z

src/tests/test_faida.cpp

+TEST_F(FaidaINDAlgorithmTest, TestEmptyFolder) {
+    IndsTest expected_inds{};
+
+    auto file_name = "empty_dataset";


It is worth adding configs to `all_tables_config.h` and using them in this file.

vs9h

I left a couple of small comments, but I approve.

src/core/config/descriptions.h

alexandrsmirn marked this pull request as draft September 18, 2023 18:44

github-actions bot reviewed Sep 18, 2023

alexandrsmirn force-pushed the faida branch from d57de91 to 289a289 September 18, 2023 20:39

github-actions bot reviewed Sep 18, 2023

src/core/algorithms/ind/faida/hashing/hashing.h (outdated, resolved)

alexandrsmirn force-pushed the faida branch from 289a289 to 9f5725a September 27, 2023 11:59

alexandrsmirn marked this pull request as ready for review October 22, 2023 13:41

polyntsov requested a review from vs9h October 23, 2023 18:01

alexandrsmirn force-pushed the faida branch from 9f5725a to 8881f10 October 23, 2023 18:32

vs9h requested changes Oct 28, 2023

vs9h reviewed Oct 28, 2023

src/core/algorithms/ind/faida/faida.h (outdated, resolved)

vs9h reviewed Oct 28, 2023

src/core/algorithms/ind/faida/faida.h (outdated, resolved)

vs9h reviewed Oct 28, 2023

src/tests/test_faida.cpp (outdated, resolved)

vs9h reviewed Oct 28, 2023

src/tests/test_faida.cpp (outdated, resolved)

vs9h reviewed Nov 5, 2023

src/core/config/tabular_data/input_tableset_type.h (outdated, resolved)

vs9h reviewed Nov 5, 2023

src/core/algorithms/ind/ind_algorithm.h (outdated, resolved)

vs9h reviewed Nov 5, 2023

src/core/algorithms/ind/ind_algorithm.h (outdated, resolved)

vs9h reviewed Nov 5, 2023

src/core/algorithms/ind/faida/candidate_generation/apriori_candidate_generator.cpp (outdated, resolved)

vs9h reviewed Nov 7, 2023

src/core/config/descriptions.h (outdated, resolved)

alexandrsmirn force-pushed the faida branch from 8881f10 to a5e01f1 November 17, 2023 23:17

github-actions bot reviewed Nov 17, 2023

src/core/algorithms/ind/faida/inclusion_testing/combined_inclusion_tester.cpp (outdated, resolved)

src/core/algorithms/ind/faida/inclusion_testing/combined_inclusion_tester.cpp (outdated, resolved)

alexandrsmirn force-pushed the faida branch from a5e01f1 to a87676d November 17, 2023 23:46

vs9h reviewed Nov 21, 2023

src/core/config/tabular_data/input_table_collection_type.h (outdated, resolved)

vs9h reviewed Nov 22, 2023

src/tests/test_faida.cpp (outdated, resolved)

vs9h reviewed Nov 22, 2023

src/core/algorithms/ind/ind.h (outdated, resolved)

vs9h mentioned this pull request Nov 23, 2023

Add INDAlgorithm infrastructure #296

Merged

alexandrsmirn force-pushed the faida branch from a87676d to 884110d November 26, 2023 22:12

vs9h requested changes Dec 5, 2023

src/core/algorithms/ind/faida/inclusion_testing/sampled_inverted_index.h (outdated, resolved)

src/core/config/all_options.cpp (outdated, resolved)

vs9h requested a review from polyntsov December 7, 2023 13:40

alexandrsmirn force-pushed the faida branch from 884110d to 450a180 December 18, 2023 18:22

github-actions bot reviewed Dec 18, 2023

src/tests/test_faida.cpp (outdated, resolved)

alexandrsmirn force-pushed the faida branch 5 times, most recently from 02470b5 to 23778fe December 18, 2023 22:43

alexandrsmirn force-pushed the faida branch 3 times, most recently from 1eddb08 to 7f7150a December 24, 2023 20:20

polyntsov requested a review from vs9h January 18, 2024 19:23

vs9h reviewed Jan 18, 2024

Add virtual destructor to ColumnCombination

700beb9

alexandrsmirn force-pushed the faida branch from 7f7150a to 8179cbc January 18, 2024 19:41

vs9h approved these changes Jan 18, 2024

src/core/config/descriptions.h (outdated, resolved)

alexandrsmirn force-pushed the faida branch 2 times, most recently from 76e56ec to 160e761 January 18, 2024 21:35

polyntsov approved these changes Jan 18, 2024

alexandrsmirn added 4 commits January 19, 2024 00:49

Implement Faida

f3e33b7

Integrate Faida into Desbordante codebase

0c84f44

Optimize csv_parser

adde53a

Change InvertedIndex hash table to hash_table8

d062840

alexandrsmirn force-pushed the faida branch from 160e761 to d062840 January 18, 2024 21:49

polyntsov merged commit 19ad025 into Desbordante:main Jan 18, 2024
20 checks passed

vs9h mentioned this pull request Feb 24, 2024

Add Spider algorithm #304

Merged
