Updating the index

rcap107 · May 24, 2024 · 6612574
1 parent c3d461a
commit 6612574
Showing 1 changed file with 39 additions and 34 deletions.
diff --git a/index.markdown b/index.markdown
@@ -2,7 +2,41 @@
 layout: home
 # title: Retrieve, Merge, Predict
 ---
-This repository contains the code for implementing and running the pipeline described in the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes".
+This repository contains the code for implementing and running the pipeline 
+described in the paper "Retrieve, Merge, Predict: Augmenting Tables with Data Lakes".
+
+It also includes a set of ancillary scripts used for preparing the plots
+and carrying out additional measurements over the different steps.
+
+## A simple example
+
+![pipeline](/assets/img/benchmark-pipeline-v6.png)
+<!-- ![alice-example](/assets/img/alice-example.drawio.png) -->
+
+> Alice is working on a *base table* that contains information about movies. She also has access to a data lake, that is, a collection
+> of other tables on all sorts of subjects.
+
+> She is looking for a way to predict the rating of a movie based on as much information as possible. Given that she has
+> access to a data lake, she would like to extract some of the stored information to improve the performance of her model.
+
+> The problem is that, while the information she needs may indeed be available, it is mixed with a huge amount of unrelated
+> data. 
+
+> Thus, Alice's problem is: "How can I find tables that are relevant to my problem? How do I combine them with my base table?"
+
+This toy example was the starting point for our pipeline, in which we illustrate the various steps that
+Alice may need to take in order to predict the movie rating.
+
+We highlight three operations that must be performed to go from Alice's base table to an integrated table on which she
+can train an ML model:
+- **Retrieve** the join candidates, i.e., extract from the data lake those tables that can be joined with the base table.
+- **Merge** the candidates with the base table in the most effective way possible.
+- **Predict** the result using an ML model.
+
+We designed and built [YADL](https://github.com/rcap107/YADL), a synthetic data lake based on the YAGO3 knowledge base, to
+use as our benchmarking data lake. The variants we used in the paper are available [on Zenodo](https://zenodo.org/doi/10.5281/zenodo.10600047).
+
+## Running the code
 
 **Code repositories**
 The repository containing the pipeline and the code required to run the experiments 
@@ -22,6 +56,9 @@ The base tables used for the experiments are available in the [pipeline reposito
 The data lakes used for our experiments are stored on Zenodo. Follow the instructions in [Downloading the data lakes](docs/resources) to prepare them. 
 
 **Requirements** 
+The repository relies heavily on the `parquet` format [[ref](https://parquet.apache.org/docs/file-format/)] and expects all tables (both source tables and data lake
+tables) to be stored in `parquet` format. Please convert your data to `parquet` before running the pipeline.
+
 We recommend using conda environments to fetch the required packages. The file `environment.yaml` lists the
 required dependencies and can be used to create a conda environment directly:
 ```
@@ -39,44 +76,12 @@ Running the experiments requires some preparation to build the data structures t
 **Pipeline execution**
 The page [Running the pipeline](docs/execution) contains the information required for preparing configurations, running the pipeline, recovering from a pipeline crash, and exploring the profiling results. 
 
-Make sure that the preparation has been run properly, since the pipeline is relying on data structures that are assumed to have been build during the previous step. 
+Make sure that the preparation has been run properly, since the pipeline relies on data structures that are assumed to have been prepared during the previous step.
 
 
 <!-- ## [Experimental results](docs/results) -->
 
 
-## A simple example
-
-![pipeline](/assets/img/benchmark-pipeline-v6.png)
-<!-- ![alice-example](/assets/img/alice-example.drawio.png) -->
-
-Alice is working on a table that contains information about movies. She has also access to a data lake, or a collection 
-of other tables on all sorts of subjects. 
-
-She is looking for a way to predict the box office revenue of a movie based on as much information as possible, so she
-would like to leverage the information stored in the data to improve the performance of her model. 
-
-The problem is that, while the information is indeed available, it is mixed with a huge amount of unrelated data. Alice's
-problem is thus figuring out how to find those tables that are actually relevant, and how to join them with her starting
-table. 
-
-This toy example was our starting point for creating the our pipeline, where we illustrate the various step that 
-Alice may need in order to predict the revenue. 
-
-The candidates produced by the join discovery methods are used to augment the base table, then the performance of the
-joined tables is compared to that of the base table by training a regressor with Catboost and comparing the R2 score
-measured before and after joining.
-
-We use YADL as our data lake, a synthetic data lake based on the YAGO3 knowledge base. The YADL variants used in the paper
-are available [on Zenodo](https://zenodo.org/doi/10.5281/zenodo.10600047).
-
-The code for preparing the YADL variants can be found in [this repo](https://github.com/rcap107/YADL).
-
-The base tables used for the experiments are provided in the repository 
-
-**NOTE:** The repository relies heavily on the `parquet` format [[ref](https://parquet.apache.org/docs/file-format/)], and will expect all tables (both source tables, and data lake
-tables) to be stored in `parquet` format. Please convert your data to parquet before working on the pipeline. 
-
 
 [zenodo_link]: https://zenodo.org/doi/10.5281/zenodo.10600047
 [prepare_repo]: https://github.com/rcap107/YADL
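
A note on the Retrieve, Merge, Predict steps described in the added section: the toy sketch below may help readers picture how the three operations fit together. It is not the repository's implementation. The tables, the `containment` score, the threshold, and the single-feature least-squares model are all assumptions made for illustration, and the model stands in for the regressor used in the experiments. Scores are computed in-sample purely to keep the example short.

```python
from statistics import mean

# Hypothetical toy data: a base table of movies and a tiny "data lake".
base_table = [
    {"title": "A", "rating": 8.0},
    {"title": "B", "rating": 6.0},
    {"title": "C", "rating": 7.0},
    {"title": "D", "rating": 5.0},
]
data_lake = {
    "directors": [
        {"title": "A", "awards": 3},
        {"title": "B", "awards": 1},
        {"title": "C", "awards": 2},
        {"title": "D", "awards": 0},
    ],
    "cities": [{"title": "Paris", "population": 2_100_000}],
}


def containment(base, candidate, key):
    """Fraction of base-table key values that also appear in the candidate."""
    base_vals = {row[key] for row in base}
    cand_vals = {row[key] for row in candidate if key in row}
    return len(base_vals & cand_vals) / len(base_vals) if base_vals else 0.0


def retrieve(base, lake, key, threshold=0.5):
    """Retrieve: keep the lake tables whose key overlap reaches the threshold."""
    return [
        name for name, table in lake.items()
        if containment(base, table, key) >= threshold
    ]


def merge(base, candidate, key):
    """Merge: left join that keeps every base row and adds matching columns."""
    lookup = {row[key]: row for row in candidate}
    # Base-table values win if both tables share a column name.
    return [{**lookup.get(row[key], {}), **row} for row in base]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def r2_score(ys, preds):
    """Coefficient of determination."""
    my = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Retrieve, then merge the first candidate into the base table.
candidates = retrieve(base_table, data_lake, "title")
augmented = merge(base_table, data_lake[candidates[0]], "title")

# Predict: compare a mean-only baseline with a model that uses the new column.
ratings = [row["rating"] for row in augmented]
score_before = r2_score(ratings, [mean(ratings)] * len(ratings))

awards = [row["awards"] for row in augmented]
slope, intercept = fit_line(awards, ratings)
score_after = r2_score(ratings, [slope * a + intercept for a in awards])

print(candidates, score_before, score_after)
```

The comparison of `score_before` and `score_after` mirrors the idea of measuring whether the joined table helps the downstream model. A real evaluation would use held-out data and handle base rows with no match in the candidate table.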