Implemented by Spiros Chalkias & Harry Maraziaris
This project is separated into 4 topics:
- A: Time series forecasting
- B: Time series anomaly detection with LSTM autoencoders
- C: Autoencoders for the compression of stock market time series
- D: Comparison of normal with compressed datasets, using kNN-and-Clustering-on-Curves-and-Time-Series
You can access the project's parent directory by typing:
$ cd ~/Time-Series-Forecasting-with-Deep-Learning
In order to run the project, you need to install the following:
- Python 3
- pip
- Pandas
- NumPy
- matplotlib
- seaborn
- TensorFlow
- scikit-learn (sklearn)
- tqdm
The filesystem structure is as follows:
- /Time-Series-Forecasting-with-Deep-Learning/ : Project's main directory.
- /src/ : Project's source code.
- /data/ : Input data used by the project.
- /out_files/ : Files generated by the compression of time series.
- /saved models/ : Folder where all models are being saved, in order to quickly demonstrate their usage without re-training them.
- /reports/ : Experiment reports for topics A, B and C individually.
- /reports/d_comparison_results/ex4_results/ : Comparison of the original dataset versus the compressed dataset using this project's executables.
- /src/preprocess.py : File containing all the utility functions used by the project's main files.
- /src/forecast.py : File used for time series forecasting.
- /src/detect.py : File used for time series anomaly detection with LSTM autoencoders.
- /src/reduce.py : File used for compression of stock market time series using autoencoders.
- /src/time-series-forecasting.ipynb : The python notebook used in order to train the models and tune the data!
- /data/nasdaq2007_17.csv : Data file used in topics A and B.
- /data/input.csv : Input file used in topic C.
- /data/query.csv : Query file used in topic C.
- /out_files/output_dataset_file.csv : Compressed time series file used as an input file in this project.
- /out_files/output_query_file.csv : Compressed time series file used as query file in this project.
From the project's parent directory, run the following commands to execute the file corresponding to each topic:
python3 ./src/forecast.py -d <dataset> -n <number of time series selected>
python3 ./src/detect.py -d <dataset> -n <number of time series selected> -mae <error value as double>
python3 ./src/reduce.py -d <dataset> -q <queryset> -od <output_dataset_file> -oq <output_query_file>
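
As an illustration of how the flags above might be parsed, here is a minimal argparse sketch for detect.py. It is not taken from the project's source: the function name `build_detect_parser`, the `dest` names, and the choice to make `-d` and `-n` required are assumptions. The `-mae` flag is left optional, since the notes in this README state that the threshold is computed automatically when the user does not provide it.

```python
import argparse


def build_detect_parser():
    """Hypothetical parser mirroring the documented detect.py flags."""
    parser = argparse.ArgumentParser(
        prog="detect.py",
        description="Time series anomaly detection with LSTM autoencoders",
    )
    # -d <dataset>: path to the input CSV (assumed to be required)
    parser.add_argument("-d", dest="dataset", required=True)
    # -n <number of time series selected> (assumed to be required)
    parser.add_argument("-n", dest="num_series", type=int, required=True)
    # -mae <error value as double>: optional anomaly threshold
    parser.add_argument("-mae", dest="mae_threshold", type=float, default=None)
    return parser


# Parse an example command line without touching sys.argv.
args = build_detect_parser().parse_args(
    ["-d", "./data/nasdaq2007_17.csv", "-n", "5", "-mae", "0.65"]
)
print(args.dataset, args.num_series, args.mae_threshold)
```

Leaving out `-mae` yields `mae_threshold = None`, which a script could use as the signal to fall back to the automatic threshold.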
- The project was written in Python 3, using TensorFlow, specifically the Keras API.
- The assignment's code was inspired by the three (3) articles provided in the lectures and displayed in the Resources section.
- In order to prevent overfitting, Early Stopping has been added to every model.
- Each model is compiled with:
  - Mean Squared Error (MSE) as the loss function.
  - Adam as the optimizer.
  - Mean Absolute Error (MAE) as the evaluation metric.
- A MinMax scaler is used to scale the data.
- In Anomaly Detection, if the user does not provide the anomaly threshold, it is computed automatically as the maximum of the Mean Absolute Error (MAE) values computed on the training set.
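
The scaling and thresholding notes above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the project's code: all function names are hypothetical, the scaler is assumed to be fitted on the training portion only, the MAE is assumed to be computed per window, and hard-coded "reconstructions" stand in for the output of a trained autoencoder.

```python
import numpy as np


def minmax_fit(train_values):
    """Return the (min, max) of the training data only (assumption)."""
    return float(np.min(train_values)), float(np.max(train_values))


def minmax_apply(values, lo, hi):
    """Scale values so that the training range maps to [0, 1]."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


def mae_per_window(original, reconstructed):
    """Mean Absolute Error of each window (one value per row)."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return np.mean(np.abs(original - reconstructed), axis=1)


def pick_threshold(train_errors, user_threshold=None):
    """Use the user's threshold if given, else the maximum training MAE."""
    if user_threshold is not None:
        return float(user_threshold)
    return float(np.max(train_errors))


# Scaling: fit on the first three prices, then apply to all four.
prices = np.array([10.0, 20.0, 30.0, 40.0])
lo, hi = minmax_fit(prices[:3])
scaled = minmax_apply(prices, lo, hi)

# Thresholding: stand-in reconstructions replace a real autoencoder here.
train_windows = np.array([[0.0, 0.5, 1.0], [0.5, 1.0, 0.5]])
train_recon = train_windows + np.array([[0.1], [0.3]])
test_windows = np.array([[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]])
test_recon = test_windows + np.array([[0.2], [0.5]])

threshold = pick_threshold(mae_per_window(train_windows, train_recon))
is_anomaly = mae_per_window(test_windows, test_recon) > threshold
print(scaled, threshold, is_anomaly)
```

With this choice of threshold no training window is flagged, so only test windows that are reconstructed worse than the worst training window count as anomalies. Values outside the training range scale to outside [0, 1], as the last price does here.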
Fine-tuning reports showcasing our experiments for topics A, B and C can be found in the additional PDFs provided in the submitted directory.
Comparison with kNN-and-Clustering-on-Curves-and-Time-Series
- MAF : Maximum Approximation Factor
- AAT : Average Approximation 1-NN Time taken
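
To make the two metrics concrete, here is a small sketch. It assumes that MAF is the largest ratio, over all queries, between the distance to the approximate nearest neighbour and the distance to the exact nearest neighbour, and that AAT is the mean time of an approximate 1-NN query. The function names and the sample numbers are hypothetical, and skipping queries whose exact distance is zero is also an assumption.

```python
def maximum_approximation_factor(approx_dists, exact_dists):
    """Largest approx/exact distance ratio over all queries."""
    # Queries whose exact distance is 0 are skipped to avoid division by zero.
    ratios = [a / e for a, e in zip(approx_dists, exact_dists) if e > 0]
    if not ratios:
        raise ValueError("no query with a non-zero exact distance")
    return max(ratios)


def average_time(query_times_sec):
    """Mean time per query, in seconds."""
    return sum(query_times_sec) / len(query_times_sec)


# Made-up distances for three queries.
approx = [2.0, 3.0, 5.0]
exact = [2.0, 1.5, 4.0]
maf = maximum_approximation_factor(approx, exact)
aat = average_time([0.02, 0.04, 0.03])
print(maf, aat)
```

Under this definition, an MAF of 1 means the approximate search returned a true nearest neighbour (or one at the same distance) for every query.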
We observe that our search algorithms run far faster on the compressed dataset, which is expected: the average 1-NN time drops by a factor of roughly 370 to 3,500 (Tables 1 and 2). Our approximation algorithms also perform well, achieving a perfect MAF of 1 on the reduced datasets and an MAF below 4 on the original dataset.
Stats | LSH-Euclidean | LSH-Discrete-Frechet | LSH-Continuous-Frechet |
---|---|---|---|
MAF | 3.43 | 3.84 | 2.67 |
AAT (sec) | 40.61 | 3.66 | 105.39 |
Table 1: Original input and query files
Stats | LSH-Euclidean | LSH-Discrete-Frechet | LSH-Continuous-Frechet |
---|---|---|---|
MAF | 1 | 1 | 1 |
AAT (sec) | 0.03 | 0.01 | 0.03 |
Table 2: Reduced input and query files
We observe that our clustering algorithms run faster on the reduced datasets, as expected, by a factor of roughly 18 to 48 (Tables 3 to 8). Overall Silhouette scores are about 0.97 on the Original dataset; on the Reduced dataset they are 0.73 (Lloyd's), 0.91 (LSH) and 0.46 (Hypercube). Thus we could argue that much of the information used to cluster our time series is preserved after compression, particularly for Lloyd's and LSH assignment, although the Hypercube result is noticeably weaker.
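
For readers unfamiliar with the Silhouette score, the following sketch computes it per point using the standard definition s(i) = (b(i) - a(i)) / max(a(i), b(i)). It is only an illustration: the function name is hypothetical, Euclidean distance is assumed, and a(i) is taken as 0 for a point that is alone in its cluster. That last choice gives singleton clusters a score of 1, which is consistent with the size-1 clusters in the tables, but whether the project uses exactly this convention is an assumption.

```python
import numpy as np


def silhouette_per_point(points, labels):
    """Silhouette value of every point (needs at least two clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        own = labels == labels[i]
        n_own = int(own.sum())
        # a(i): mean distance to the other members of the point's cluster;
        # taken as 0 for a singleton cluster (assumption).
        a = dist[i, own].sum() / (n_own - 1) if n_own > 1 else 0.0
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores


# Two nearby points in cluster 0 and one distant point alone in cluster 1.
scores = silhouette_per_point([[0.0], [1.0], [10.0]], [0, 0, 1])
print(scores, scores.mean())
```

A per-cluster score is the mean over that cluster's points, and the overall score is the mean over all points.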
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 1 | 21 | 1 | 326 | 349 |
Silhouette | 1 | 0.24274 | 1 | 0.75713 | 0.72757 |
Clustering Time : 0.002 sec |
Table 3: Reduced clustering: Lloyd's assignment
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 1 | 21 | 1 | 326 | 349 |
Silhouette | 1 | 0.44962 | 1 | 0.97536 | 0.97249 |
Clustering Time : 0.092 sec |
Table 4: Original clustering: Lloyd's assignment
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 2 | 1 | 1 | 345 | 349 |
Silhouette | 0.81551 | 1 | 1 | 0.9105 | 0.91047 |
Clustering Time : 0.003 sec |
Table 5: Reduced clustering: LSH
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 1 | 1 | 1 | 346 | 349 |
Silhouette | 1 | 1 | 1 | 0.9769 | 0.97718 |
Clustering Time : 0.053 sec |
Table 6: Original clustering: LSH
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 1 | 154 | 1 | 193 | 349 |
Silhouette | 1 | 0.12298 | 1 | 0.71973 | 0.45801 |
Clustering Time : 0.002 sec |
Table 7: Reduced clustering: Hypercube
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 2 | 1 | 1 | 345 | 349 |
Silhouette | 0.44962 | 1 | 1 | 0.97536 | 0.97249 |
Clustering Time : 0.095 sec |
Table 8: Original clustering: Hypercube
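We observe that our clustering algorithms run much faster on the reduced datasets, by a factor of roughly 47,000 to 67,000 (Tables 9 to 12). We also obtain Silhouette scores of 0.57 (Lloyd's) and 0.91 (LSH) on the Reduced datasets, indicating a reasonably good clustering; no Silhouette scores are reported for the Original datasets.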
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 264 | 1 | 1 | 83 | 349 |
Silhouette | 0.6975 | 1 | 1 | 0.1360 | 0.5657 |
Clustering Time : 0.054 sec |
Table 9: Reduced clustering: Lloyd's assignment
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 345 | 1 | 2 | 1 | 349 |
Silhouette | - | - | - | - | - |
Clustering Time : 2551.43 sec |
Table 10: Original clustering: Lloyd's assignment
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 1 | 2 | 1 | 345 | 349 |
Silhouette | 1 | 0.74056 | 1 | 0.90767 | 0.90724 |
Clustering Time : 0.053 sec |
Table 11: Reduced clustering: LSH
Stats | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Overall |
---|---|---|---|---|---|
Cluster Size | 1 | 345 | 1 | 2 | 349 |
Silhouette | - | - | - | - | - |
Clustering Time : 3555.52 sec |
Table 12: Original clustering: LSH