Updating the index
rcap107 committed May 24, 2024
1 parent c3d461a commit 6612574
Showing 1 changed file with 39 additions and 34 deletions.
73 changes: 39 additions & 34 deletions index.markdown
@@ -2,7 +2,41 @@
layout: home
# title: Retrieve, Merge, Predict
---
This repository contains the code for implementing and running the pipeline described in the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes".
This repository contains the code for implementing and running the pipeline
described in the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes".

It also includes a series of additional ancillary scripts used for preparing the plots
and carrying out additional measurements over the different steps.

## A simple example

![pipeline](/assets/img/benchmark-pipeline-v6.png)
<!-- ![alice-example](/assets/img/alice-example.drawio.png) -->

> Alice is working on a *base table* that contains information about movies. She also has access to a data lake, that is, a collection
> of other tables on all sorts of subjects.
> She is looking for a way to predict the rating of a movie based on as much information as possible. Given that she has
> access to a data lake, she would like to extract some of the stored information to improve the performance of her model.
> The problem is that, while the information she needs may indeed be available, it is mixed with a huge amount of unrelated
> data.
> Thus, Alice's problem is: "How can I find tables that are relevant to my problem? How do I combine them with my base table?"

This toy example was our starting point for creating our pipeline, which illustrates the various steps that
Alice may need to take in order to predict the movie rating.

We highlight three operations that must be performed to go from Alice's base table to an integrated table on which she
can train an ML model:
- **Retrieve** the join candidates, i.e., extract from the data lake those tables that can be joined with the base table.
- **Merge** the candidates with the base table in the most effective way possible.
- **Predict** the result using an ML model.
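
The Retrieve and Merge steps can be illustrated with a minimal sketch in plain Python. This is not the pipeline's implementation: the helper names (`containment`, `retrieve`, `left_join`), the toy tables, the made-up values, and the 0.5 threshold are all assumptions chosen for illustration. Candidates are ranked by how many distinct values of the base join column also appear in a data lake column, and each candidate is then left-joined so that the base table keeps all of its rows.

```python
# Toy sketch of the Retrieve and Merge steps on small in-memory tables.
# Every name, value, and threshold below is a hypothetical example.


def containment(query_values, candidate_values):
    """Fraction of distinct query values that also appear in the candidate column."""
    query_set = set(query_values)
    if not query_set:
        return 0.0
    return len(query_set & set(candidate_values)) / len(query_set)


def retrieve(base_table, join_column, data_lake, threshold=0.5):
    """Return (table_name, column_name, score) for lake columns that overlap enough."""
    query_values = [row[join_column] for row in base_table]
    candidates = []
    for table_name, table in data_lake.items():
        if not table:
            continue
        for column_name in table[0]:
            column_values = [row[column_name] for row in table]
            score = containment(query_values, column_values)
            if score >= threshold:
                candidates.append((table_name, column_name, score))
    # Best-overlapping candidates first.
    return sorted(candidates, key=lambda c: c[2], reverse=True)


def left_join(base_table, join_column, candidate_table, candidate_column, prefix):
    """Left-join the candidate onto the base table, keeping every base row."""
    lookup = {}
    for row in candidate_table:
        # Keep only the first match so the number of base rows never changes.
        lookup.setdefault(row[candidate_column], row)
    other_columns = [c for c in candidate_table[0] if c != candidate_column]
    merged = []
    for row in base_table:
        match = lookup.get(row[join_column])
        new_row = dict(row)
        for column in other_columns:
            new_row[f"{prefix}.{column}"] = match[column] if match else None
        merged.append(new_row)
    return merged


base = [
    {"title": "Movie A", "rating": 8.5},
    {"title": "Movie B", "rating": 7.0},
    {"title": "Movie C", "rating": 6.1},
]
lake = {
    "directors": [
        {"movie": "Movie A", "director": "Director X"},
        {"movie": "Movie B", "director": "Director Y"},
    ],
    "cities": [
        {"city": "City 1", "country": "Country 1"},
    ],
}

candidates = retrieve(base, "title", lake)
augmented = base
for table_name, column_name, _ in candidates:
    augmented = left_join(augmented, "title", lake[table_name], column_name, table_name)
```

In this toy setup only the `directors` table is retrieved, and `Movie C` receives a missing value for the new column because it has no match. The augmented table would then be passed to the Predict step.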

We design and build [YADL](https://github.com/rcap107/YADL), a synthetic data lake based on the YAGO3 knowledge base to
use as our benchmarking data lake. We make the variants we used in the paper available [on Zenodo](https://zenodo.org/doi/10.5281/zenodo.10600047).

## Running the code

**Code repositories**
The repository containing the pipeline and the code required to run the experiments
@@ -22,6 +56,9 @@ The base tables used for the experiments are available in the [pipeline reposito
The data lakes used for our experiments are stored on Zenodo. Follow the instructions in [Downloading the data lakes](docs/resources) to prepare them.

**Requirements**
The repository relies heavily on the `parquet` format [[ref](https://parquet.apache.org/docs/file-format/)], and will expect all tables (both source tables, and data lake
tables) to be stored in `parquet` format. Please convert your data to parquet before working on the pipeline.

We recommend using conda environments to fetch the required packages. The file `environment.yaml` lists the
required dependencies and can be used to create a conda environment directly:
```
@@ -39,44 +76,12 @@ Running the experiments requires some preparation to build the data structures t
**Pipeline execution**
The page [Running the pipeline](docs/execution) contains the information required for preparing configurations, running the pipeline, recovering from a pipeline crash, and exploring the profiling results.

Make sure that the preparation has been run properly, since the pipeline relies on data structures that are assumed to have been built during the previous step.
Make sure that the preparation has been run properly, since the pipeline relies on data structures that are assumed to have been prepared during the previous step.


<!-- ## [Experimental results](docs/results) -->


## A simple example

![pipeline](/assets/img/benchmark-pipeline-v6.png)
<!-- ![alice-example](/assets/img/alice-example.drawio.png) -->

Alice is working on a table that contains information about movies. She also has access to a data lake, or a collection
of other tables on all sorts of subjects.

She is looking for a way to predict the box office revenue of a movie based on as much information as possible, so she
would like to leverage the information stored in the data to improve the performance of her model.

The problem is that, while the information is indeed available, it is mixed with a huge amount of unrelated data. Alice's
problem is thus figuring out how to find those tables that are actually relevant, and how to join them with her starting
table.

This toy example was our starting point for creating our pipeline, which illustrates the various steps that
Alice may need to take in order to predict the revenue.

The candidates produced by the join discovery methods are used to augment the base table, then the performance of the
joined tables is compared to that of the base table by training a regressor with CatBoost and comparing the R2 score
measured before and after joining.
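
The before/after comparison can be sketched as follows. The predictions below are made-up numbers standing in for the output of a regressor trained on the base table and on the joined table; the `r2_score` helper is a plain-Python version of the standard coefficient of determination, not the pipeline's code.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
pred_base = [2.5, 2.5, 2.5, 2.5]    # model trained on the base table alone
pred_joined = [1.0, 2.0, 3.0, 5.0]  # model trained on the joined table

r2_base = r2_score(y_true, pred_base)
r2_joined = r2_score(y_true, pred_joined)
gain = r2_joined - r2_base  # positive when joining helped
```

A positive `gain` suggests that the joined candidates carried useful signal, while a value near zero or below suggests that the join did not help on this split.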

We use YADL as our data lake, a synthetic data lake based on the YAGO3 knowledge base. The YADL variants used in the paper
are available [on Zenodo](https://zenodo.org/doi/10.5281/zenodo.10600047).

The code for preparing the YADL variants can be found in [this repo](https://github.com/rcap107/YADL).

The base tables used for the experiments are provided in the repository.

**NOTE:** The repository relies heavily on the `parquet` format [[ref](https://parquet.apache.org/docs/file-format/)], and will expect all tables (both source tables, and data lake
tables) to be stored in `parquet` format. Please convert your data to parquet before working on the pipeline.


[zenodo_link]: https://zenodo.org/doi/10.5281/zenodo.10600047
[prepare_repo]: https://github.com/rcap107/YADL