Time-Series-Forecasting-with-Deep-Learning

Implemented by Spiros Chalkias & Harry Maraziaris

This project is separated into 4 topics:

  • A: Time series forecasting
  • B: Time series anomaly detection with LSTM autoencoders
  • C: Autoencoders for the compression of stock market time series
  • D: Comparison of normal with compressed datasets, using kNN-and-Clustering-on-Curves-and-Time-Series

Usage

You can access the project's parent directory by typing:
$ cd ~/Time-Series-Forecasting-with-Deep-Learning

Prerequisites

In order to run the project, you need to install the following:

  • Python 3
  • pip
  • Pandas
  • NumPy
  • matplotlib
  • seaborn
  • TensorFlow
  • scikit-learn
  • tqdm

Project's Structure

The filesystem structure is as follows:

  • /Time-Series-Forecasting-with-Deep-Learning/ : Project's main directory.
  • /src/ : Project's source code.
  • /data/ : Input data used by the project.
  • /out_files/ : Files generated by the compression of time series.
  • /saved models/ : Folder where all models are saved, so that their usage can be demonstrated quickly without re-training them.
  • /reports/ : Experiment reports for topics A, B and C individually.
  • /reports/d_comparison_results/ex4_results/ : Comparison of the original dataset versus the compressed dataset using this project's executables.
  • /src/preprocess.py : File containing all the utility functions used by the project's main files.
  • /src/forecast.py : File used for time series forecasting.
  • /src/detect.py : File used for time series anomaly detection with LSTM autoencoders.
  • /src/reduce.py : File used for compression of stock market time series using autoencoders.
  • /src/time-series-forecasting.ipynb : The Python notebook used to train the models and tune the data.
  • /data/nasdaq2007_17.csv : Data file used in topics A and B.
  • /data/input.csv : Input file used in topic C.
  • /data/query.csv : Query file used in topic C.
  • /out_files/output_dataset_file.csv : Compressed time series file used as an input file in this project.
  • /out_files/output_query_file.csv : Compressed time series file used as query file in this project.

Build & Run

While in the project's parent directory, run the following commands to execute each topic's corresponding file.

Run A - Time Series forecasting

python3 ./src/forecast.py -d <dataset> -n <number of time series selected>

Run B - Time Series Anomaly Detection with LSTM Autoencoders

python3 ./src/detect.py -d <dataset> -n <number of time series selected> -mae <error value as double>
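
The flags above could be parsed roughly as follows. This is a minimal sketch using the standard-library argparse module, not the actual contents of detect.py: the function name build_parser, the destination names and the default for -n are assumptions made for illustration. Only the flags -d, -n and -mae come from the command shown above.

```python
import argparse


def build_parser():
    """Build a parser for the detect.py flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Time series anomaly detection with LSTM autoencoders")
    # -d: path to the dataset, e.g. data/nasdaq2007_17.csv
    parser.add_argument("-d", dest="dataset", required=True,
                        help="path to the input dataset")
    # -n: number of time series selected (the default of 1 is an assumption)
    parser.add_argument("-n", dest="n_series", type=int, default=1,
                        help="number of time series selected")
    # -mae: optional error threshold; None means "compute it automatically"
    parser.add_argument("-mae", dest="mae", type=float, default=None,
                        help="anomaly threshold as a floating-point value")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.dataset, args.n_series, args.mae)
```

Leaving -mae unset yields None, which lets the caller fall back to a threshold computed from the training set.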

Run C - Autoencoders for the compression of stock market time series

python3 ./src/reduce.py -d <dataset> -q <queryset> -od <output_dataset_file> -oq <output_query_file>

General Notes

  1. The project was written in Python 3, using TensorFlow and specifically the Keras API.
  2. The assignment's code was inspired by the three (3) articles provided in the lectures and displayed in the References section.
  3. To prevent overfitting, Early Stopping has been added to every model.
  4. Each model is compiled with:
    • Mean Squared Error (MSE) as a loss function.
    • Adam as an optimizer.
    • Mean Absolute Error (MAE) as an evaluation metric.
  5. A MinMax scaler is used to scale the data.
  6. In Anomaly Detection, if the user does not provide the anomaly threshold, it is computed automatically as the maximum Mean Absolute Error (MAE) observed on the training set.
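
The automatic threshold described in note 6 can be sketched as follows. This is an illustration under stated assumptions, not the code in detect.py: the function names, the array shapes (windows, timesteps, features) and the toy reconstructions are all hypothetical, and in the project the reconstructions would come from the trained LSTM autoencoder.

```python
import numpy as np


def window_mae(x, x_pred):
    """Mean Absolute Error of each window; inputs are (windows, timesteps, features)."""
    return np.mean(np.abs(x_pred - x), axis=(1, 2))


def anomaly_threshold(x_train, x_train_pred, user_mae=None):
    """Use the user's threshold if given, else the maximum training-set MAE."""
    if user_mae is not None:
        return float(user_mae)
    return float(window_mae(x_train, x_train_pred).max())


def flag_anomalies(x_test, x_test_pred, threshold):
    """Mark a window as anomalous when its reconstruction MAE exceeds the threshold."""
    return window_mae(x_test, x_test_pred) > threshold


# Toy data standing in for scaled windows and their reconstructions (assumed values).
x_train = np.zeros((3, 4, 1))
x_train_pred = x_train + np.array([0.1, 0.2, 0.3]).reshape(3, 1, 1)
x_test = np.zeros((2, 4, 1))
x_test_pred = x_test + np.array([0.25, 0.5]).reshape(2, 1, 1)

threshold = anomaly_threshold(x_train, x_train_pred)
flags = flag_anomalies(x_test, x_test_pred, threshold)
```

Taking the maximum training error means no training window is flagged, so only test windows that reconstruct worse than anything seen in training count as anomalies.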

Fine-tuning

Fine-tuning reports showcasing our experiments for topics A, B and C can be found in the additional PDFs provided in the submitted directory.

Search

  • MAF : Maximum Approximation Factor
  • AAT : Average Approximation 1-NN Time taken

We observe that our search algorithms run several hundred to several thousand times faster on the compressed dataset (from roughly 370 times for LSH-Discrete-Frechet to roughly 3,500 times for LSH-Continuous-Frechet), which is expected. Our approximation algorithms also perform well, obtaining a perfect MAF of 1 on the reduced datasets and a MAF below 4 on the original dataset.

Stats | LSH-Euclidean | LSH-Discrete-Frechet | LSH-Continuous-Frechet
MAF | 3.43 | 3.84 | 2.67
AAT (sec) | 40.61 | 3.66 | 105.39

Table 1: Original input and query files

Stats | LSH-Euclidean | LSH-Discrete-Frechet | LSH-Continuous-Frechet
MAF | 1 | 1 | 1
AAT (sec) | 0.03 | 0.01 | 0.03

Table 2: Reduced input and query files
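
As a reading aid for the MAF rows, the sketch below assumes the usual definition: for each query, divide the distance to the approximate nearest neighbour by the distance to the true nearest neighbour, then take the maximum over all queries. The function name and the sample distances are hypothetical, and the project's own executables may compute the statistic differently.

```python
def maximum_approximation_factor(approx_dists, true_dists):
    """Largest ratio of approximate to exact 1-NN distance over all queries."""
    # Queries whose exact distance is zero are skipped to avoid division by zero.
    ratios = [a / t for a, t in zip(approx_dists, true_dists) if t > 0]
    if not ratios:
        raise ValueError("no query with a positive exact distance")
    return max(ratios)


# Made-up distances for three queries.
approx = [2.0, 3.0, 5.0]
exact = [2.0, 1.5, 4.0]
maf = maximum_approximation_factor(approx, exact)
```

Under this definition, a MAF of 1 means every approximate neighbour was as close as the exact one.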

Clustering

Clustering - Mean Vector

We observe that our clustering algorithms run faster on the reduced datasets, as expected, by a factor of roughly 18 (LSH) to 48 (Hypercube). Silhouette scores are high on average (> 0.8 across the six runs) for both the Original and the Reduced datasets, although the Reduced scores are lower for Lloyd's assignment (0.73 versus 0.97) and for Hypercube (0.46 versus 0.97). Thus we could argue that much of the information used to cluster our time series is preserved after compression, most clearly for LSH (0.91 versus 0.98).

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 1 | 21 | 1 | 326 | 349
Silhouette | 1 | 0.24274 | 1 | 0.75713 | 0.72757
Clustering Time : 0.002 sec

Table 3: Reduced clustering: Lloyd's assignment

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 1 | 21 | 1 | 326 | 349
Silhouette | 1 | 0.44962 | 1 | 0.97536 | 0.97249
Clustering Time : 0.092 sec

Table 4: Original clustering: Lloyd's assignment

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 2 | 1 | 1 | 345 | 349
Silhouette | 0.81551 | 1 | 1 | 0.9105 | 0.91047
Clustering Time : 0.003 sec

Table 5: Reduced clustering: LSH

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 1 | 1 | 1 | 346 | 349
Silhouette | 1 | 1 | 1 | 0.9769 | 0.97718
Clustering Time : 0.053 sec

Table 6: Original clustering: LSH

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 1 | 154 | 1 | 193 | 349
Silhouette | 1 | 0.12298 | 1 | 0.71973 | 0.45801
Clustering Time : 0.002 sec

Table 7: Reduced clustering: Hypercube

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 2 | 1 | 1 | 345 | 349
Silhouette | 0.44962 | 1 | 1 | 0.97536 | 0.97249
Clustering Time : 0.095 sec

Table 8: Original clustering: Hypercube
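
For readers unfamiliar with the Silhouette statistic reported above, here is a minimal sketch of the standard per-point formula s = (b - a) / max(a, b), where a is the mean distance to the other members of the point's own cluster and b is the smallest mean distance to any other cluster. The function name, the 1-D toy points and the distance callback are assumptions for illustration; this is not the project's implementation.

```python
def silhouette_scores(points, labels, dist):
    """Per-point silhouette values; requires at least two distinct labels."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster (0 for a singleton)
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        a = sum(own) / len(own) if own else 0.0
        # b: smallest mean distance to the members of any other cluster
        others = {}
        for q, l in zip(points, labels):
            if l != lab:
                others.setdefault(l, []).append(dist(p, q))
        b = min(sum(d) / len(d) for d in others.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy example: two nearby points in cluster 0 and one isolated point in cluster 1.
points = [0.0, 1.0, 10.0]
labels = [0, 0, 1]
scores = silhouette_scores(points, labels, lambda x, y: abs(x - y))
```

With this formula a single-member cluster has a = 0 and therefore scores 1, which matches the value shown for the size-1 clusters in the tables. Some implementations assign 0 to such clusters instead, so values may differ between tools.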

Clustering - Mean Frechet

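We observe that our clustering algorithms run far faster on the reduced datasets: clustering time drops from roughly 2,500 to 3,500 seconds to about 0.05 seconds, a factor of roughly 47,000 to 67,000. Silhouette scores on the Reduced datasets range from 0.57 overall (Lloyd's assignment) to 0.91 overall (LSH). No Silhouette scores are reported for the Original datasets (shown as "-"), so the two cannot be compared directly.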

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 264 | 1 | 1 | 83 | 349
Silhouette | 0.6975 | 1 | 1 | 0.1360 | 0.5657
Clustering Time : 0.054 sec

Table 9: Reduced clustering: Lloyd's assignment

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 345 | 1 | 2 | 1 | 349
Silhouette | - | - | - | - | -
Clustering Time : 2551.43 sec

Table 10: Original clustering: Lloyd's assignment

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 1 | 2 | 1 | 345 | 349
Silhouette | 1 | 0.74056 | 1 | 0.90767 | 0.90724
Clustering Time : 0.053 sec

Table 11: Reduced clustering: LSH

Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall
Cluster Size | 1 | 345 | 1 | 2 | 349
Silhouette | - | - | - | - | -
Clustering Time : 3555.52 sec

Table 12: Original clustering: LSH
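
To illustrate why Frechet-based clustering is so much more expensive on long curves, the sketch below computes the discrete Frechet distance with the textbook dynamic program, which fills a table of size len(p) x len(q) for every pair of curves compared. It is a generic illustration, not the code of the clustering executables; the function name and the representation of a curve as a list of (t, value) tuples are assumptions.

```python
from math import dist


def discrete_frechet(p, q):
    """Discrete Frechet distance between two curves given as lists of points."""
    if not p or not q:
        raise ValueError("curves must be non-empty")
    n, m = len(p), len(q)
    # ca[i][j]: discrete Frechet distance between the prefixes p[:i+1] and q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


# Two short curves as (t, value) points; values are made up.
curve_a = [(0, 0), (1, 0), (2, 0)]
curve_b = [(0, 1), (1, 1), (2, 1)]
frechet_ab = discrete_frechet(curve_a, curve_b)
```

Because the table grows with the product of the two curve lengths, shortening each series through compression reduces the cost of every distance evaluation quadratically.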

References