Implement infrastructure for IND discovery and Faida algorithm #272

Merged: 5 commits, Jan 18, 2024

@alexandrsmirn alexandrsmirn commented Sep 18, 2023

This pull request introduces the Faida algorithm for inclusion dependency (IND) discovery.

It is easier to review this PR commit by commit:

  • The first commit adds infrastructure for IND discovery algorithms, along with support for multi-file datasets in Desbordante;
  • The second commit adds the implementation of the Faida IND discovery algorithm and some tests for it;
  • The third commit integrates Faida into the Desbordante codebase.

Note: this PR contains the source code of third-party libraries: emhash, murmurhash3, atomicbitvector. The code is added in the second commit. This is a temporary solution: later the third-party code should be removed, and the libraries should be downloaded at the build stage.

@alexandrsmirn alexandrsmirn marked this pull request as draft September 18, 2023 18:44
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

src/core/algorithms/ind/faida/hashing/hashing.h (outdated, resolved)
@vs9h vs9h left a comment

One of these days I will try to review some points in more detail and add some comments.

src/core/algorithms/algo_factory.h (outdated, resolved)
src/core/algorithms/ind/ind.h (outdated, resolved)
src/core/algorithms/ind/ind.h (outdated, resolved)
src/core/algorithms/ind/ind.h (outdated, resolved)
src/core/algorithms/ind/ind_algorithm.h (outdated, resolved)
src/core/algorithms/ind/faida/util/simple_ind.h (outdated, resolved)
src/tests/test_faida.cpp (outdated, resolved)
src/tests/test_faida.cpp (outdated, resolved)
src/tests/test_faida.cpp (outdated, resolved)
src/tests/test_faida.cpp (outdated, resolved)
src/tests/test_faida.cpp (outdated, resolved)
src/tests/test_faida.cpp (outdated, resolved)
src/core/config/descriptions.h (outdated, resolved)
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

src/tests/test_faida.cpp (outdated, resolved)
src/core/algorithms/ind/ind.h (outdated, resolved)

vs9h commented Nov 25, 2023

@alexandrsmirn, please take a look at the two pull requests (#295 and #296). We discussed this privately, and I hope I was able to convince you of the need to merge the correct approach to working with tables.
Changes:

  • added a --csv_paths option that expects a vector of paths (tests now need to pass a vector of files explicitly)
  • minor changes to the table collection option (the option and the corresponding files have been renamed)
  • removed the PrettyIND class
  • added type aliases for table and column indexes (both are now unsigned int)

Regarding the PrettyIND class: transforming INDs into a more user-friendly form is more complex than it might seem at first glance. We need to decide whether to do this in the C++ core or in Python, and which type to use.
And since the order of the tables is no longer ambiguous, there is no need to provide information about it.

@vs9h vs9h left a comment

Faida is an approximate algorithm that may produce false positives but no false negatives: it should find every valid dependency and may additionally report some invalid ones. To check that it works correctly, I ran Faida for unary dependencies only (with the option findNary: false) on my tests and found some differences in the results:

  • for TestWide.csv, Faida doesn't find any dependencies (expected: 2 -> 3 and 3 -> 2)
  • for neighbors10k.csv and neighbors100k.csv, Faida doesn't find the dependency 5 -> 6 (expected: 3 -> 4, 4 -> 3, 5 -> 6). Column 5 has the unique values {“2”}; column 6 has the unique values {“1”, “2”}.
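
The property being checked here (no false negatives, so every expected IND must appear in what Faida reports) can be sketched as a set difference. This is only an illustration: the `ColumnIndex` and `UnaryInd` aliases and the `MissingInds` helper are hypothetical names introduced for the sketch, not part of the PR, and the column indexes are the ones quoted above for neighbors10k.csv.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Hypothetical aliases for this sketch: a unary IND lhs -> rhs is a pair of column indexes.
using ColumnIndex = unsigned int;
using UnaryInd = std::pair<ColumnIndex, ColumnIndex>;

// Returns the expected INDs that are absent from the reported set.
// For an algorithm without false negatives the result must be empty;
// extra (false positive) INDs in `found` are allowed and ignored.
std::vector<UnaryInd> MissingInds(std::set<UnaryInd> const& expected,
                                  std::set<UnaryInd> const& found) {
    std::vector<UnaryInd> missing;
    std::set_difference(expected.begin(), expected.end(), found.begin(), found.end(),
                        std::back_inserter(missing));
    return missing;
}

// Mirrors the neighbors10k.csv report above: 5 -> 6 is expected but was not found.
void CheckNeighborsReport() {
    std::set<UnaryInd> const expected{{3, 4}, {4, 3}, {5, 6}};
    std::set<UnaryInd> const found{{3, 4}, {4, 3}};
    std::vector<UnaryInd> const missing = MissingInds(expected, found);
    assert(missing.size() == 1 && missing[0] == UnaryInd(5, 6));
}
```

A test built this way fails only on missing dependencies, so it stays valid even if the approximate algorithm reports extra ones.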

I also have a question that arose when comparing the results of dependency mining on the datasets CIPublicHighway700.csv and EpicMeds.csv (these datasets are needed for testing how nulls are handled). I will therefore not provide the expected results; instead, I propose discussing the rules for inferring INDs in the presence of nulls. I would also like to hear @chernishev's opinion. I have already discussed these rules with @polyntsov.

For simplicity, consider unary dependencies only. Notation: suppose the dependency A -> B is satisfied, where A and B are columns without nulls, and let A + nulls denote column A with nulls added.
Then the following holds:

  • A + nulls -> B is not satisfied
  • A -> B + nulls is satisfied
  • A + nulls -> B + nulls is not satisfied (if the flag null_equal_nulls is false)
  • A + nulls -> B + nulls is satisfied (if the flag null_equal_nulls is true)

We also treat separately the case when one of the columns consists entirely of nulls (nulls -> B and A -> nulls): such dependencies are not satisfied.
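
A minimal C++ sketch of these rules for a single pair of columns, assuming cells are strings and a null is modeled as `std::nullopt`. The names `Column`, `ColumnSummary`, `Summarize` and `UnaryIndHolds` are hypothetical and exist only for this illustration; the sketch is a naive exact check, not how Faida works internally.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// A column is a list of cells; std::nullopt stands for a null.
using Column = std::vector<std::optional<std::string>>;

struct ColumnSummary {
    std::unordered_set<std::string> values;  // distinct non-null values
    bool has_null = false;
};

ColumnSummary Summarize(Column const& column) {
    ColumnSummary summary;
    for (auto const& cell : column) {
        if (cell.has_value()) {
            summary.values.insert(*cell);
        } else {
            summary.has_null = true;
        }
    }
    return summary;
}

// Checks the unary IND lhs -> rhs under the rules listed above.
// Assumption of this sketch: a column with no rows at all is not treated specially.
bool UnaryIndHolds(Column const& lhs, Column const& rhs, bool null_equal_nulls) {
    ColumnSummary const l = Summarize(lhs);
    ColumnSummary const r = Summarize(rhs);

    // A column consisting entirely of nulls never takes part in a satisfied IND.
    bool const lhs_all_null = l.has_null && l.values.empty();
    bool const rhs_all_null = r.has_null && r.values.empty();
    if (lhs_all_null || rhs_all_null) return false;

    // A null on the left must be matched by a null on the right,
    // which is only possible when nulls compare equal to each other.
    if (l.has_null && !(null_equal_nulls && r.has_null)) return false;

    // Every non-null value on the left must occur on the right.
    for (std::string const& value : l.values) {
        if (r.values.count(value) == 0) return false;
    }
    return true;
}

// The four listed rules, for A = {1, 2} and B = {1, 2, 3}.
void CheckNullRules() {
    Column const a{"1", "2"};
    Column const b{"1", "2", "3"};
    Column const a_nulls{"1", "2", std::nullopt};
    Column const b_nulls{"1", "2", "3", std::nullopt};
    assert(!UnaryIndHolds(a_nulls, b, true));         // A + nulls -> B
    assert(UnaryIndHolds(a, b_nulls, false));         // A -> B + nulls
    assert(!UnaryIndHolds(a_nulls, b_nulls, false));  // nulls are not equal
    assert(UnaryIndHolds(a_nulls, b_nulls, true));    // nulls are equal
}
```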

src/core/config/all_options.cpp (outdated, resolved)
@vs9h vs9h requested a review from polyntsov December 7, 2023 13:40

chernishev commented Dec 7, 2023

We have discussed this in voice call; proposed rules are fine.

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

src/tests/test_faida.cpp (outdated, resolved)
@alexandrsmirn alexandrsmirn force-pushed the faida branch 5 times, most recently from 02470b5 to 23778fe Compare December 18, 2023 22:43
@alexandrsmirn

This was because my implementation of Faida used an optimization that ignored constant columns (all values in the column are the same) and null columns (all values are null). I have added two options to control this feature, and it is now disabled by default, so the results should be the same. The algorithm now also supports the rules you proposed, except that Faida supports only the null_equal_nulls == true semantics, not null_equal_nulls == false.
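
To illustrate why such an optimization hides dependencies like 5 -> 6, here is a small C++ sketch of classifying columns and skipping them behind two switches. It is not the PR's code: `ColumnKind`, `Classify`, `SkipOptions`, `ColumnsToProcess` and the option names `ignore_constant_cols` and `ignore_null_cols` are all hypothetical, and treating empty or mixed null/value columns as regular is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// A column is a list of cells; std::nullopt stands for a null.
using Column = std::vector<std::optional<std::string>>;

enum class ColumnKind { kRegular, kConstant, kAllNull };

// Constant: every cell holds the same non-null value. All-null: every cell is null.
// Assumptions of this sketch: an empty column and a column mixing nulls with
// values are both treated as regular.
ColumnKind Classify(Column const& column) {
    if (column.empty()) return ColumnKind::kRegular;
    std::optional<std::string> const& first = column.front();
    for (auto const& cell : column) {
        if (cell != first) return ColumnKind::kRegular;
    }
    return first.has_value() ? ColumnKind::kConstant : ColumnKind::kAllNull;
}

// Hypothetical option names; both default to false, i.e. the optimization is off.
struct SkipOptions {
    bool ignore_constant_cols = false;
    bool ignore_null_cols = false;
};

// Returns the indexes of the columns that would still be checked for INDs.
std::vector<std::size_t> ColumnsToProcess(std::vector<Column> const& table,
                                          SkipOptions const& options = {}) {
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < table.size(); ++i) {
        ColumnKind const kind = Classify(table[i]);
        bool const skip = (kind == ColumnKind::kConstant && options.ignore_constant_cols) ||
                          (kind == ColumnKind::kAllNull && options.ignore_null_cols);
        if (!skip) kept.push_back(i);
    }
    return kept;
}

// Columns shaped like 5 and 6 from the neighbors example, plus an all-null column.
void CheckSkipping() {
    std::vector<Column> const table{Column{"2", "2"}, Column{"1", "2"},
                                    Column{std::nullopt, std::nullopt}};
    assert(ColumnsToProcess(table).size() == 3);  // default: nothing is skipped

    SkipOptions options;
    options.ignore_constant_cols = true;  // the constant column drops out
    assert((ColumnsToProcess(table, options) == std::vector<std::size_t>{1, 2}));
}
```

With the constant column skipped, no IND with it on the left-hand side can be reported, which matches the missing 5 -> 6 result described in the review.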

@alexandrsmirn alexandrsmirn force-pushed the faida branch 3 times, most recently from 1eddb08 to 7f7150a Compare December 24, 2023 20:20
@polyntsov polyntsov requested a review from vs9h January 18, 2024 19:23
TEST_F(FaidaINDAlgorithmTest, TestEmptyFolder) {
IndsTest expected_inds{};

auto file_name = "empty_dataset";

It is worth adding configs to all_tables_config.h and using them in this file.

@vs9h vs9h left a comment

I left a couple of small comments, but I approve.

src/core/config/descriptions.h (outdated, resolved)
@alexandrsmirn alexandrsmirn force-pushed the faida branch 2 times, most recently from 76e56ec to 160e761 Compare January 18, 2024 21:35
@polyntsov polyntsov merged commit 19ad025 into Desbordante:main Jan 18, 2024
20 checks passed
@vs9h vs9h mentioned this pull request Feb 24, 2024