---
title: "Time series forecasting"
author: "Jaakko Paavola"
date: "19 Mar 2024"
output:
  html_document:
    toc: true
    toc_depth: '3'
    df_print: paged
  html_notebook:
    pandoc_args: "--number-offset=1,0"
    fig_caption: true
    number_sections: true
    toc: true
    toc_depth: 3
header-includes:
  - \usepackage{fontspec}
  - \usepackage{amsmath}
  - \usepackage{calligra}
  - \defaultfontfeatures[\rmfamily,\sffamily]{Ligatures=TeX,Numbers=OldStyle}
  - \setmainfont{Libertinus Serif}
  - \setsansfont{Libertinus Sans}
  - \setmathfont{Libertinus Math}%[RawFeature={+ss08}
  - \setmonofont{Libertinus Mono}
  - \usepackage{eurosym}
editor_options:
  markdown:
    wrap: 72
---
Work in progress. The idea of this project is that you can read in any time series (currently tested only with the enclosed united-states-gdp-growth-rate.csv, which contains quarterly US GDP growth rates), and the program fits the following time series models to it: 1) seasonal regression, 2) trend + seasonal regression, 3) Holt-Winters multiplicative, 4) Holt-Winters additive, 5) ARIMA, 6) ARIMA with a Box-Cox transformation, 7) ETS, and 8) TBATS. It then evaluates each model's performance against a test set and picks the "best performer" to forecast one to four steps into the future (truly out of sample). The best performer can differ for each of the four forecast horizons. The forecasts are also compared against two naive baselines as benchmarks: 1) predict the value of the same month last year, and 2) predict the value of the previous datapoint.
```{r}
options(width = 10000)
options(setWidthOnResize = TRUE)
```
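The functions below call a number of packages that are never loaded anywhere in this excerpt. A setup chunk along the following lines is assumed; the package list is inferred from the function bodies (e.g. `flog.info` comes from futile.logger, `na_interpolation`/`na_replace` from imputeTS, and `red` from crayon):

```{r}
# Assumed setup chunk: load the packages the functions below rely on.
library(forecast)      # tslm, auto.arima, ets, tbats, forecast, accuracy
library(lubridate)     # floor_date, ceiling_date, year, month, %m+%, %m-%, years, days
library(zoo)           # as.yearmon, na.approx
library(xts)           # as.xts
library(imputeTS)      # na_interpolation, na_replace
library(futile.logger) # flog.info, flog.warn
library(crayon)        # red
library(ggplot2)       # forecast vs. actuals plots
library(gridExtra)     # grid.arrange
```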
```{r}
preprocess_data <- function(since = NULL, query, first.month.to.forecast = NULL){
### Fetches the data and preprocesses it. Can check whether there is complete data (data from the last day) for the month before the current one.
### Parameters:
### since: The earliest date for which we want to fetch data.
### query: The name of the query.
### first.month.to.forecast: If given, used to check that there is complete data for the previous month (check currently commented out below).
### Returns:
### The fetched and preprocessed dataset.
flog.info("Loading data")
if (query == "USGDPGR") {
# Import data from file "united-states-gdp-growth-rate.csv"
data <- read.csv("data/united-states-gdp-growth-rate.csv", header = TRUE)
}
# Rename columns in data to "Date" and "Value"
colnames(data) <- c("Date", query)
data$Date <- as.Date(data$Date, "%d/%m/%Y")
# since = "2013-02-01"
if(!is.null(since)) {
since <- as.Date(since, "%Y-%m-%d")
data <- data[data$Date >= since,]
data <- data[order(data$Date),]
}
data <- as.data.frame(data)
# print(data)
# if(!is.null(first.month.to.forecast)) {
# if(max(data$Date) < as.Date(as.yearmon(first.month.to.forecast)) - 1) {
# stop(paste0("No complete data from previous month (", month(as.Date(as.yearmon(Sys.Date())) - 1, abbr = TRUE, label = T), ")"))
# }
# }
return(data)
}
monthly_decomposition <- function(dataFull, id_of_ts, imputations.output.filename){
### Composes a monthly time series object and imputes missing values in it.
### Parameters:
### dataFull: The preprocessed data frame with a Date column and the value column.
### id_of_ts: The name of the value column, i.e., the identifier of the time series.
### imputations.output.filename: If the parameter is given, the imputation information will be printed out to a file.
### Returns: A list with the imputed time series as the first element and the original time series as the second.
# Fill the time series if there are any NAs
index <- as.integer(min(which(!is.na(dataFull[, id_of_ts]))))
cleaned_data <- dataFull
cleaned_data[, id_of_ts] <- ifelse(is.na(dataFull[, id_of_ts]), 0, dataFull[, id_of_ts])
cleaned_data <- cleaned_data[index:nrow(cleaned_data), ]
cleaned_data[, id_of_ts] <- na_interpolation(cleaned_data[, id_of_ts])
# monthly.ts <- apply.monthly(as.xts(cleaned_data[, id_of_ts], order.by = as.Date(cleaned_data$Date)), sum) # If time steps are shorter than a month, sum them up to month granularity
# Generate a sequence of the first day of each month between the start and end dates
start_date <- floor_date(min(cleaned_data$Date), "month") + months(1)
end_date <- ceiling_date(max(cleaned_data$Date), "month")
date_seq_first <- seq(from = start_date, to = end_date, by = "month")
# Adjust to the last day of each month and create a new dataframe with this sequence
date_seq_last <- date_seq_first - days(1)
# Merge with the original, keeping all dates
monthly_df <- merge(data.frame(Date = date_seq_last), cleaned_data, by = "Date", all.x = TRUE)
# Interpolate the missing values of the target column using linear interpolation
monthly_df[, id_of_ts] <- round(na.approx(monthly_df[, id_of_ts], na.rm = FALSE), 1)
monthly.ts <- ts(monthly_df[, id_of_ts], start = c(year(min(monthly_df$Date)), month(min(monthly_df$Date))), frequency = 12)
monthly.ts.original <- monthly.ts
# imputed <- vector(length = nrow(monthly.ts))
# for(aux in 1:nrow(monthly.ts)) {
# if(as.numeric(monthly.ts[aux]) < X) { # Impute based on the following rows if the value is under X
# if(aux > 24) { # Impute the average of the sum of one year ago (12) and two years ago (24) if we are over 24 months from the start of the time series
# if((as.numeric(monthly.ts.original[aux - 12]) + as.numeric(monthly.ts.original[aux - 24])) / 2 > as.numeric(monthly.ts.original[aux])) {
# monthly.ts[aux] <- (as.numeric(monthly.ts.original[aux - 12]) + as.numeric(monthly.ts.original[aux - 24])) / 2
# imputed[aux] <- 2
# }
# } else if(aux > 12) { # Impute the value from one year ago if we are over 12 months from the start of the time series
# if(as.numeric(monthly.ts.original[aux - 12]) > as.numeric(monthly.ts.original[aux])) {
# monthly.ts[aux] <- as.numeric(monthly.ts.original[aux - 12])
# imputed[aux] <- 1
# }
# } else if(as.numeric(monthly.ts.original[aux]) < 10) { # If within 12 months from the start, impute Y
# monthly.ts[aux] <- Y
# imputed[aux] <- 3
# }
# }
# if(as.numeric(monthly.ts[aux]) == 0) { # Impute 1 if the value is 0
# monthly.ts[aux] <- 1
# imputed[aux] <- 4
# }
# }
# If a filename was given, print out the imputation information in a file.
# if (!is.null(imputations.output.filename)) {
# monthly.ts.for.printing <- cbind(monthly.ts, imputed)
# colnames(monthly.ts.for.printing) <- c(id_of_ts, "Imputed")
# print(monthly.ts.for.printing)
# sink(file = paste("imputations.txt", sep = "_"), append = TRUE)
# print(paste(id_of_ts))
# print(monthly.ts.for.printing)
# sink()
# }
list.ts <- list(monthly.ts, monthly.ts.original)
return(list.ts)
}
train_ts_models <- function(ts_train, id_of_ts, skipTimeConsumingAlgorithmsFlag){
### Trains the eight models on the training data.
### Parameters:
### ts_train: The training time series object.
### id_of_ts: The identifier of the time series (used in log messages).
### skipTimeConsumingAlgorithmsFlag: If TRUE, the slow models (ARIMA, ARIMA with Box-Cox, ETS, TBATS) are skipped.
list_mod <- list()
list_mod[["TSLM.s"]] <- tryCatch({
tslm(ts_train ~ season, lambda="auto")
},
error=function(cond){
message(cond)
if(length(ts_train) < 24) {
start <- start(ts_train) - c(0, (24 - length(ts_train)))
}
else {
start <- start(ts_train)
}
tslm(ts(rep(0, length(ts_train)), start = start, end = end(ts_train), frequency = frequency(ts_train)) ~ season, lambda="auto")
}
)
flog.info(paste0(id_of_ts, ": ", "Seasonal regression trained"))
list_mod[["TSLM.st"]] <- tryCatch({
tslm(ts_train ~ trend + season, lambda="auto")
},
error=function(cond){
message(cond)
if(length(ts_train) < 24) start <- start(ts_train) - c(0, (24 - length(ts_train)))
else start <- start(ts_train)
tslm(ts(rep(0, length(ts_train)), start = start, end = end(ts_train), frequency = frequency(ts_train)) ~ trend + season, lambda="auto")
}
)
flog.info(paste0(id_of_ts, ": ", "Trend + Seasonal regression trained"))
list_mod[["HW.mul"]] <- tryCatch({
HoltWinters(ts_train, seasonal = "mul")
},
error=function(cond){
message(cond)
if(length(ts_train) < 24) start <- start(ts_train) - c(0, (24 - length(ts_train)))
else start <- start(ts_train)
HoltWinters(ts(rep(0, length(ts_train)), start = start, end = end(ts_train), frequency = frequency(ts_train)))
}
)
flog.info(paste0(id_of_ts, ": ", "Holt-Winters multiplicative trained"))
list_mod[["HW.add"]] <- tryCatch({
HoltWinters(ts_train, seasonal = "add")
},
error=function(cond){
message(cond)
if(length(ts_train) < 24) start <- start(ts_train) - c(0, (24 - length(ts_train)))
else start <- start(ts_train)
HoltWinters(ts(rep(0, length(ts_train)), start = start, end = end(ts_train), frequency = frequency(ts_train)))
}
)
flog.info(paste0(id_of_ts, ": ", "Holt-Winters additive trained"))
if(!skipTimeConsumingAlgorithmsFlag) {
list_mod[["ARIMA"]] <- tryCatch({
auto.arima(ts_train, max.p=2, max.q=2,
max.P=2, max.Q=2, max.d=1, max.D=1, allowdrift = TRUE,
stepwise=FALSE, approximation=FALSE,
ic = "bic", parallel = TRUE)
},
error=function(cond){
message(cond)
return(NA)
}
)
flog.info(paste0(id_of_ts, ": ", "ARIMA trained"))
list_mod[["ARIMA.boxcox"]] <- tryCatch({
auto.arima(ts_train, max.p=2, max.q=2,
max.P=2, max.Q=2, max.d=1, max.D=1, allowdrift = TRUE,
stepwise=FALSE, approximation=FALSE, lambda="auto",
ic = "bic", parallel = TRUE)
},
error=function(cond){
return(NA)
}
)
flog.info(paste0(id_of_ts, ": ", "ARIMA with boxcox trained"))
list_mod[["ets"]] <- tryCatch({
ets(ts_train, model="ZZZ", lambda = "auto", bounds = "both")
},
error=function(cond){
message(cond)
return(NA)
}
)
flog.info(paste0(id_of_ts, ": ", "ETS trained"))
list_mod[["tbats"]] <- tryCatch({
tbats(ts_train, use.box.cox = NULL, use.trend = NULL, use.damped.trend = NULL,
seasonal.periods = NULL, use.arma.errors = TRUE)
},
error=function(cond){
message(cond)
return(NA)
}
)
flog.info(paste0(id_of_ts, ": ", "TBATS trained"))
}
return(list_mod)
}
forecast_with_external_regressors <- function(mod, ts_disruption_train, n){
tryCatch({
forecast(mod, xreg = ts_disruption_train, h = n)$mean[1]
},
error=function(cond){
message(cond)
return(0)
}
)
}
forecast_without_external_regressors <- function(mod, n){
tryCatch({
na_replace(forecast(mod, h = max(n))$mean) # Replace NA's with zeros.
},
error=function(cond){
message(cond)
return(NA) # Return NA instead of NULL so that downstream indexing does not break
}
)
}
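# Illustrative usage of the two wrappers above (not run; `mod` is a
# placeholder for any fitted model object):
#   mod <- auto.arima(ts_train)
#   forecast_without_external_regressors(mod, n = 1:4)       # next four steps
#   forecast_with_external_regressors(mod, ts_disruption_train, n = 1)  # one step, with xreg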
predict_ts_models <- function(list_mod, ts_val, forecasts_per_month, step, total_steps, months.ahead, id_of_ts, skipTimeConsumingAlgorithmsFlag){
### Makes the forecasts for one step and appends them, one row per model and horizon, to the data frame of monthly forecasts.
### Parameters:
### list_mod: A list containing all the trained models.
### ts_val: The time series.
### forecasts_per_month: The data frame that accumulates the forecasts (and the observed "Base" values).
### step: The current position of the "test set window".
### total_steps: The total number of steps, including the truly out-of-sample ones.
### months.ahead: The months ahead to be forecasted as a vector. 1 is the current (incomplete) month.
### Returns: The forecasts_per_month data frame with this step's forecasts appended.
if(skipTimeConsumingAlgorithmsFlag) {
# Only the four fast models were trained, so the naive baseline comes right after them.
names <- c("Reg.seasonal", "Reg.trend.seasonal",
"HW.Mul", "HW.Add"
, "Naive.prev.year"#, "Naive.prev.datapoint"
)
} else {
names <- c("Reg.seasonal", "Reg.trend.seasonal",
"HW.Mul", "HW.Add"
, "ARIMA", "ARIMA.boxcox"
, "ETS", "TBATS"
, "Naive.prev.year"#, "Naive.prev.datapoint"
)
}
no_of_naive_models <- 1
# no_of_naive_models <- 2
for(j in 1:(length(list_mod) + no_of_naive_models)) {
# j <- 6
if (j <= length(list_mod)) {
one_month_forecast <- forecast_without_external_regressors(mod = list_mod[[j]], n = months.ahead)
flog.info(paste(paste0(id_of_ts, ":"), "Predictions made for", class(list_mod[[j]])[1], as.Date(as.yearmon(time(one_month_forecast)))))
} else { # Lastly, do the naive forecasts.
if(j == length(list_mod) + 1) {
# Naive baseline 1: predict the value of the same month one year earlier, for each of the four horizons.
aux <- do.call(rbind, lapply(1:4, function(h) {
horizon_date <- max(subset(forecasts_per_month, type == paste0(h, "_ahead"))$date)
data.frame("Y" = forecasts_per_month[forecasts_per_month$model == "Base" & forecasts_per_month$date == horizon_date - years(1), 1], date = horizon_date)
}))
} else {
# Naive baseline 2: predict the value of the previous datapoint for every horizon.
prev_value <- forecasts_per_month[forecasts_per_month$model == "Base" & forecasts_per_month$date == max(subset(forecasts_per_month, type == "1_ahead")$date) %m-% months(1), 1]
aux <- do.call(rbind, lapply(1:4, function(h) {
data.frame("Y" = prev_value, date = max(subset(forecasts_per_month, type == paste0(h, "_ahead"))$date))
}))
}
ts.forecast <- as.xts(aux$Y, order.by = as.Date(aux$date))
ts.all <- ts(ts.forecast,
start = c(year(as.Date(time(ts.forecast))[1]), month(as.Date(time(ts.forecast))[1])),
end = c(year(as.Date(time(ts.forecast))[length(ts.forecast)]), month(as.Date(time(ts.forecast))[length(ts.forecast)])),
frequency = 12)
one_month_forecast <- ts.all
flog.info(paste(paste0(id_of_ts, ":"), "Predictions made for Naive no.", j - length(list_mod), as.Date(as.yearmon(time(one_month_forecast)))))
}
for(p in months.ahead){
# Append the forecast if it falls within the test set, or if this is the final, truly out-of-sample step.
if((step + p - 1) < total_steps || step == total_steps){
forecasts_per_month <- rbind(forecasts_per_month, data.frame(Y = as.matrix(one_month_forecast[p]), date = as.Date(as.yearmon(time(one_month_forecast)[p])), model = names[j], type = paste0(p, "_ahead")))
row.names(forecasts_per_month) <- seq_len(nrow(forecasts_per_month))
}
}
}
return(forecasts_per_month)
}
create_train_test <- function(start_train, start_test, ts.all, step, number.of.steps.in.test, months.ahead){
### Creates the training and testing datasets.
### Parameters:
### start_train: The start date of the training dataset.
### start_test: The start date of the testing dataset.
### ts.all: The time series.
### step: The current step number, i.e., the current position of the "test set window".
### number.of.steps.in.test: The total number of steps in the test set.
### months.ahead: The months ahead to be forecasted as a vector. 1 is the current (incomplete) month.
### Returns: A list with the given step's training time series as the first element and the forecasts based on
### that time series as the second element.
ts_train <- window(ts.all, start = c(year(as.Date(start_train)), month(as.Date(start_train))), end = c(year(as.Date(start_test + months(step) - months(2))), month(start_test + months(step) - months(2))), frequency = frequency(ts.all))
ts_vsl_list <- list() # A list to hold the forecasts
for(j in months.ahead){
# j <- 1
if((step + j - 1) < number.of.steps.in.test){ # If this step plus the number of months the current forecast is ahead is within the test set, i.e, not out-of-sample
ts_vsl_list[j] <- window(ts.all, start = c(year(start_test), month(start_test + months(step + j - 2))), end = c(year(as.Date(start_test)), month(start_test + months(step + j - 2))), frequency = frequency(ts.all))
} else{ # If this step plus the number of months the current forecast is ahead is beyond the test set, i.e., out-of-sample
# ts_vsl_list[j] <- ts(0, frequency=frequency(ts.all), start=c(year(as.Date(start_test)), month(as.Date(start_test)) - 1 + step + (j - 1)))
ts_vsl_list[j] <- ts(0, frequency = frequency(ts.all), start = c(year(start_test), month(start_test + months(step + j - 2))))
}
}
ts_train[ts_train == 0] <- 0.01 # Replace exact zeros so multiplicative models do not break (ifelse() would strip the ts attributes)
return(list(ts_train, ts_vsl_list))
}
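# Illustrative sketch of the windowing above (not run; dates are placeholders):
# with a monthly series ts.all starting 2013-03 and start_test = "2020-09-01",
# step = 1 yields
#   ts_train <- window(ts.all, start = c(2013, 3), end = c(2020, 8))
# and the "test set window" is the 1-4 months that follow, each extracted with
# a one-month window() call as in the loop above.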
split_data_and_model_monthly <- function(dataFull, list.ts, id_of_ts, months.ahead, start_test, start_train, first.month.to.forecast, mape.output.filename, progressCounter, skipTimeConsumingAlgorithmsFlag){
### Creates the training and testing datasets and trains the time series models. Calls a function that makes the predictions and then a function that does the evaluations.
### Parameters:
### list.ts: A list with the time series. The first element holds the imputed values (never zero) and the second the original values.
### months.ahead: The months ahead to be forecasted as a vector. 1 is the current (incomplete) month.
### mape.output.filename: The filename used for saving the MAPEs.
### Returns: The value given by the last forecast.
# Do the splits for training and test sets.
# The start_train and start_test arguments are currently overridden: the split is always recomputed as a 70/30 train/test partition of the series.
rm(start_train, start_test)
start_train <- min(as.Date(time(list.ts[[1]]))) # The starting point of the training (and test) set within the entire time series
start_test <- as.Date(time(list.ts[[1]]))[round(length(as.Date(time(list.ts[[1]]))) * 0.7)] # The training set is the first 70% and the 30% after it is the test set
ts.all <- window(list.ts[[1]], start = c(year(as.Date(start_train)), month(as.Date(start_train))), frequency = frequency(list.ts[[1]]))
ts.all.original <- window(list.ts[[2]], start = c(year(as.Date(start_train)), month(as.Date(start_train))), frequency = frequency(list.ts[[1]]))
number.of.steps.in.train <- length(ts.all[as.Date(as.yearmon(time(ts.all))) < start_test]) # The length of our training set.
if(number.of.steps.in.train < 12) { # If the training set is shorter than twelve months but at least six months long, push start_test forward until the training set is twelve months
if(number.of.steps.in.train < 6) {
flog.warn(cat(red(paste("Skipping due to fewer than six months between start_train and start_test for", id_of_ts))))
return(NULL)
}
start_test <- as.Date(start_test) %m+% months(12 - number.of.steps.in.train)
number.of.steps.in.train <- length(ts.all[as.Date(as.yearmon(time(ts.all))) < start_test])
}
number.of.steps.in.test <- (length(ts.all) - number.of.steps.in.train) + 1 # The test set is everything after the training set; the + 1 adds the final step, the "true" out-of-sample forecast of the months whose actual values we do not yet know.
total.number.of.steps <- number.of.steps.in.test + max(months.ahead) # The number of training steps and the "true" out-of-sample forecast steps together.
# step represents the shift in the ending position of the "training set window". For example, if max(as.Date(time(list.ts[[1]]))) = "2023-12-01", start_train = "2013-03-01" and start_test = "2020-09-01", we get:
# step <- 1 # The "training set window" becomes:
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 2013 -1.0 -0.8 6.4 6.2 6.8 2.3 0.4 -5.4 -1.4 4.2
# 2014 -3.3 16.7 12.8 16.4 7.9 5.5 7.1 8.5 0.9 4.3 0.9 2.9
# 2015 13.8 7.6 3.1 -2.2 -5.9 -1.9 0.4 4.6 8.1 11.9 6.7 5.5
# 2016 2.4 -1.5 3.3 -0.4 6.7 2.6 -0.9 4.0 -4.1 -10.0 -1.0 -0.8
# 2017 6.4 6.2 6.8 2.3 0.4 -5.4 -1.4 4.2 -3.3 16.7 12.8 16.4
# 2018 7.9 5.5 7.1 8.5 0.9 4.3 0.9 2.9 13.8 7.6 3.1 -2.2
# 2019 -5.9 -1.9 0.4 4.6 8.1 11.9 6.7 5.5 2.4 -1.5 3.3 -0.4
# 2020 6.7 2.6 -0.9 4.0 -4.1 -10.0 -1.0 -0.8
# step <- 2 # The "training set window" lengthens by one becoming:
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 2013 -1.0 -0.8 6.4 6.2 6.8 2.3 0.4 -5.4 -1.4 4.2
# 2014 -3.3 16.7 12.8 16.4 7.9 5.5 7.1 8.5 0.9 4.3 0.9 2.9
# 2015 13.8 7.6 3.1 -2.2 -5.9 -1.9 0.4 4.6 8.1 11.9 6.7 5.5
# 2016 2.4 -1.5 3.3 -0.4 6.7 2.6 -0.9 4.0 -4.1 -10.0 -1.0 -0.8
# 2017 6.4 6.2 6.8 2.3 0.4 -5.4 -1.4 4.2 -3.3 16.7 12.8 16.4
# 2018 7.9 5.5 7.1 8.5 0.9 4.3 0.9 2.9 13.8 7.6 3.1 -2.2
# 2019 -5.9 -1.9 0.4 4.6 8.1 11.9 6.7 5.5 2.4 -1.5 3.3 -0.4
# 2020 6.7 2.6 -0.9 4.0 -4.1 -10.0 -1.0 -0.8 6.4
# step <- 3 # The "training set window" lengthens by one becoming:
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 2013 -1.0 -0.8 6.4 6.2 6.8 2.3 0.4 -5.4 -1.4 4.2
# 2014 -3.3 16.7 12.8 16.4 7.9 5.5 7.1 8.5 0.9 4.3 0.9 2.9
# 2015 13.8 7.6 3.1 -2.2 -5.9 -1.9 0.4 4.6 8.1 11.9 6.7 5.5
# 2016 2.4 -1.5 3.3 -0.4 6.7 2.6 -0.9 4.0 -4.1 -10.0 -1.0 -0.8
# 2017 6.4 6.2 6.8 2.3 0.4 -5.4 -1.4 4.2 -3.3 16.7 12.8 16.4
# 2018 7.9 5.5 7.1 8.5 0.9 4.3 0.9 2.9 13.8 7.6 3.1 -2.2
# 2019 -5.9 -1.9 0.4 4.6 8.1 11.9 6.7 5.5 2.4 -1.5 3.3 -0.4
# 2020 6.7 2.6 -0.9 4.0 -4.1 -10.0 -1.0 -0.8 6.4 6.2
# And so on, while the "test set window" is always the next four steps that come after the "training set window", such as for step = 1:
# [[2]]
# [[2]][[1]]
# [1] 6.4
# [[2]][[2]]
# [1] 6.2
# [[2]][[3]]
# [1] 6.8
# [[2]][[4]]
# [1] 2.3
# The "test set window" is what the forecasts made from each step's "training set window" are evaluated against, until the step reaches a point where the "test set window" extends beyond the last observation ("2023-12-01"); from then on the forecasts are truly out-of-sample and can no longer be evaluated.
if (exists("forecasts.per.month")) {
rm(forecasts.per.month)
}
if (exists("temp.list")) {
rm(temp.list)
}
# Create the dataframe where to place the forecasts, first as a place holder consisting only of the observed values ("Base")
forecasts.per.month <- data.frame(Y=as.matrix(ts.all.original), date=as.Date(as.yearmon(time(ts.all.original))), model = "Base", type = "None")
# step <- 1
# step <- 2
# step <- 3
# step <- 4
# step <- 5
# step <- 39
for(step in 1:(number.of.steps.in.test)){
# print.progress(progressCounter, dataFull, id_of_ts)
flog.info(cat(red(paste("Step", step, "out of", number.of.steps.in.test))))
train_test_set <- create_train_test(start_train, start_test, ts.all.original, step, number.of.steps.in.test, months.ahead) # Where [[1]] is the current step's "training-test set window", and [[2]] holds the "test set window"
# Train the models with this "training set window":
list_mod <- train_ts_models(ts_train = train_test_set[[1]], id_of_ts, skipTimeConsumingAlgorithmsFlag)
# if(!exists("forecasts.per.month")){
# }
# Forecast this step:
# names(forecasts.per.month)[1] <- "Y"
forecasts.per.month <- predict_ts_models(list_mod, ts_val = train_test_set[[2]], forecasts_per_month = forecasts.per.month, step, total_steps = total.number.of.steps, months.ahead, id_of_ts, skipTimeConsumingAlgorithmsFlag)
# Sort forecasts.per.month by date ascending and then type
forecasts.per.month <- forecasts.per.month[order(forecasts.per.month$date, forecasts.per.month$type), ]
# Round to one decimal
forecasts.per.month$Y <- round(forecasts.per.month$Y, 1)
upper_limit <- 10
# Cap forecasts over upper_limit
forecasts.per.month$Y[forecasts.per.month$Y > upper_limit] <- NA
forecasts.per.month$Y[is.na(forecasts.per.month$Y)] <- max(forecasts.per.month$Y, na.rm = TRUE)
forecasts.per.month$Y <- round(forecasts.per.month$Y, 1)
# print(forecasts.per.month)
}
# For target variables that should not have negative values, handle them:
# forecasts.per.month$Y <- ifelse(forecasts.per.month$Y < 0, 0, forecasts.per.month$Y)
temp.list <- evaluate_models(dataFull, ts_all = ts.all, ts_all_original = ts.all.original, forecasts.per.month, id_of_ts, start_test, months.ahead, first.month.to.forecast, progressCounter, mape.output.filename) # nolint
temp.list[[5]] <- forecasts.per.month # In temp.list, [[1]] holds the "best-performer out-of-sample forecasts" for each of the four truly out-of-sample steps, [[2]] the name of the "best performer" for each of those forecasts, respectively, [[3]] the "best performer's" predictive performance metrics for each step in the test set, [[4]] the data frame with all the observations and all the forecasts, both in-sample and out-of-sample, and [[5]] the full forecasts.per.month data frame.
flog.info(cat(red("The contents of temp.list:\n[[1]] are the 'best-performer out-of-sample forecasts' for each four, truly out-of-sample steps,\n[[2]] is the name of the 'best-performer' for those forecasts, respectively,\n[[3]] is the 'best-performer's' predictive performance metrics for each step in the test set, and\n[[4]] is the dataframe with all the observations and all the forecasts, both in-the-sample and out-of-sample.\n")))
return(temp.list)
}
prepare_data_for_evaluation <- function(ts_all, forecasts.subset, start.test, months.to.forecast){
### Creates a list out of the forecast values.
### Parameters:
### ts_all: The time series.
### forecasts.subset: The forecasts.
### start.test: The start date of the test data.
### months.to.forecast: The forecast horizon as a string, e.g. "1_ahead".
### Returns: A list with the wide data frame of predictions as the first element and the test-set forecasts as the second.
# For testing:
# rm(pred)
# months.to.forecast <- "1_ahead"
# months.to.forecast <- "2_ahead"
# months.to.forecast <- "3_ahead"
# months.to.forecast <- "4_ahead"
flog.info("Preparing data for evaluation")
if(frequency(ts_all) == 12){
forecasts.subset <- forecasts.subset[forecasts.subset$date >= as.Date(start.test) %m+% months(as.numeric(substr(months.to.forecast, 1, 1))-1),]
}else{
forecasts.subset <- forecasts.subset[forecasts.subset$date >= nrow(forecasts.subset[forecasts.subset$model %in% "Base",]) - nrow(forecasts.subset[forecasts.subset$model %in% c("Reg.seasonal", "Reg", "reg"),])+1,]
}
forecasts_test_set <- forecasts.subset[forecasts.subset$date <= max(forecasts.subset$date[forecasts.subset$model %in% "Base"]),]
for(i in as.character(unique(forecasts_test_set$model))){
# i <- "Base"
# i <- "Reg.seasonal"
# i <- "Reg.trend.seasonal"
# i <- "HW.Mul"
# i <- "HW.Add"
# i <- "Naive.prev.year"
# i <- "Naive.prev.datapoint"
if (exists("pred")) {
tmp <- forecasts_test_set[forecasts_test_set$model == i, c("Y", "date")]
pred <- merge(pred, tmp, by = "date", all.x = TRUE)
} else {
pred <- forecasts_test_set[forecasts_test_set$model == i, c("Y", "date")]
pred <- data.frame(date = pred$date, ts(pred$Y))
}
names(pred)[2:length(names(pred))] <- as.character(unique(forecasts_test_set$model))[seq_len(length(pred) - 1)]
}
for (i in 2:ncol(pred)) {
pred[, i] <- ifelse(is.na(pred[, i]), 0, pred[, i])
}
# for (i in 2:ncol(pred)) {
# if (any(is.na(pred[, i]))) {
# for (naValueIndex in which(is.na(pred[, i]))) {
# if (naValueIndex > 1) {
# if (naValueIndex < nrow(pred)) {
# pred[naValueIndex, i] <- (pred[naValueIndex - 1, i] + pred[naValueIndex + 1, i]) / 2
# } else {
# pred[naValueIndex, i] <- pred[naValueIndex - 1, i]
# }
# } else {
# pred[naValueIndex, i] <- pred[naValueIndex + 1, i]
# }
# if (is.na(pred[naValueIndex, i])) {
# pred[naValueIndex, i] <- 0
# }
# }
# }
# }
return(list(pred, forecasts_test_set))
}
error_metrics <- function(pred, id_of_ts, months.to.forecast, mape.output.filename = NULL){
### Computes the error metrics.
### Parameters:
### pred: The wide data frame with the date, the observed "Base" values, and one column of predictions per model.
### months.to.forecast: The forecast horizon as a string, e.g. "1_ahead".
### mape.output.filename: The filename of an output text file for the MAPE outputs.
### Returns:
### The error metrics in a data frame.
flog.info("Getting the error metrics.")
evaluation_in_whole_test <- rbind(accuracy(ts(pred[,3]), ts(pred$Base)))
for(i in 4:length(pred)){
evaluation_in_whole_test <- rbind(evaluation_in_whole_test, accuracy(ts(pred[,i]), ts(pred$Base))
)
}
row.names(evaluation_in_whole_test) <- names(pred)[3:length(pred)]
evaluation_in_whole_test <- data.frame(evaluation_in_whole_test)
evaluation_in_whole_test <- evaluation_in_whole_test[, (names(evaluation_in_whole_test) %in% c("RMSE", "MAE", "MAPE"))] # Because MAPE gives Inf in many cases (due to division by zero)
if (!is.null(mape.output.filename)) {
print(paste(id_of_ts))
print(round(evaluation_in_whole_test[order(evaluation_in_whole_test$MAPE),], 5))
sink(file = paste(months.to.forecast, mape.output.filename, sep = "_"), append = TRUE)
print(paste(id_of_ts))
print(round(evaluation_in_whole_test[order(evaluation_in_whole_test$MAPE),], 5))
sink()
}
return(evaluation_in_whole_test)
}
plot_base_forecasts <- function(forecasts, best.model.performance, months.ahead, id_of_ts, plotModelQuadrantGraphFlag = FALSE, plotBestPerformerErrorGraphsFlag = FALSE) {
### Plots graphs of the forecasts against the realized values.
### Parameters:
### forecasts: A long-format data frame with the base values and each model's forecasts.
### best.model.performance: The per-date MAPE/MAE of the best performing model.
### months.ahead: The months ahead to be forecasted as a vector. 1 is the current (incomplete) month.
### plotModelQuadrantGraphFlag: Whether to plot each model against the base values, two models per page.
### plotBestPerformerErrorGraphsFlag: Whether to plot the MAPE/MAE error graphs of the best performer.
no_of_naive_models <- 1
# Create graphs with the actual values and the forecast of each of the models in the same picture.
# k <- 1
options(repr.plot.width=30, repr.plot.height=8)
for (k in months.ahead) {
plots_list <- list()
for (j in seq(no_of_naive_models, length(unique(forecasts$model)), 1)) {
plot.data <- forecasts[forecasts$model %in% as.character(unique(forecasts$model)[c(1, j)]) &
(forecasts$type == paste0(k, "_ahead") |
(forecasts$date >= as.Date(as.yearmon(Sys.Date())) &
na_replace(as.numeric(substr(forecasts$type, 1, 1)), 99) <= k) |
forecasts$type == "None"), ]
p <- ggplot(plot.data, aes(x = date, y = Y, col = model)) +
geom_line() +
geom_point(size = 0.8) +
ylab("") +
xlab("") +
ggtitle(paste(id_of_ts, sep = " - "), subtitle = paste(paste0("Current + ", k - 1, " mon ahead"), "-", "Creation time:", Sys.time(), "UTC")) +
theme(axis.line = element_line(), axis.text = element_text(color = "black"), axis.title = element_text(colour = "black"), legend.text = element_text(), legend.title = element_text(), legend.key = element_rect(colour = "black"), plot.title = element_text(size = 10), plot.subtitle = element_text(size = 8))
plots_list <- c(plots_list, list(p))
}
# print(plots_list)
if (plotModelQuadrantGraphFlag) {
# Draw the model plots two per page. Uncomment the svg()/dev.off() calls to save each page to a file.
for (first in seq(1, length(plots_list) - 1, by = 2)) {
# svg(filename = paste(id_of_ts, k, "ahead", paste0("page_", (first + 1) / 2, ".svg"), sep = "_"), width = 10.7, height = 6, pointsize = 3)
grid.arrange(plots_list[[first]], plots_list[[first + 1]], ncol = 2, widths = c(2, 2))
# dev.off()
}
}
# Create error/residual graphs:
mape <- ggplot(best.model.performance, aes(x = Date, y = MAPE)) +
geom_point() +
ggtitle(paste(id_of_ts, paste0("best model: ", as.character(best.model.performance$name[1])), sep = " - ")) +
geom_hline(yintercept = 0, linetype = "longdash") +
geom_hline(yintercept = 0.2, linetype = "dashed", alpha = 0.5) +
geom_hline(yintercept = -0.2, linetype = "dashed", alpha = 0.5) +
xlab("") +
ylab("forecast % difference to true value") +
ylim(-1, 1) +
theme(axis.line = element_line(), axis.text = element_text(color = "black"), axis.title = element_text(colour = "black"), legend.text = element_text(), legend.title = element_text(), legend.key = element_rect(colour = "black"))
mae <- ggplot(best.model.performance, aes(x = Date, y = MAE)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "longdash") +
xlab("") +
ylab("forecast difference to true value") +
theme(axis.line = element_line(), axis.text = element_text(color = "black"), axis.title = element_text(colour = "black"), legend.text = element_text(), legend.title = element_text(), legend.key = element_rect(colour = "black"))
# plotBestPerformerErrorGraphsFlag <- TRUE
if (plotBestPerformerErrorGraphsFlag) {
svg(filename = paste(id_of_ts, k, "ahead", "best.svg", sep = "_"), width = 10.7, height = 6, pointsize = 3)
grid.arrange(mape, mae, ncol = 1)
dev.off()
}
}
return(list(mape, mae))
}
# TODO:
plot.summary.graph <- function(id_of_ts, actualValuesAndForecastWithYearCols, graph.filename, first.month.to.forecast) {
highestYear <- year(as.Date(first.month.to.forecast) %m-% months(1))
svg(filename = graph.filename, width = 10.7, height = 6, pointsize = 3)
p <- ggplot(actualValuesAndForecastWithYearCols) +
geom_line(size = 2, aes(x = factor(row.names(actualValuesAndForecastWithYearCols), levels = row.names(actualValuesAndForecastWithYearCols)), y = threeYrAvg, col = "3 year average"), group = 1) +
geom_line(size = 1, aes(x = factor(row.names(actualValuesAndForecastWithYearCols), levels = row.names(actualValuesAndForecastWithYearCols)), y = fiveYrAvg, col = "5 year average"), group = 1) +
geom_line(aes(x = factor(row.names(actualValuesAndForecastWithYearCols), levels = row.names(actualValuesAndForecastWithYearCols)), y = prevYear, col = "Previous year"), group = 1) +
geom_line(aes(x = factor(row.names(actualValuesAndForecastWithYearCols), levels = row.names(actualValuesAndForecastWithYearCols)), y = pred, col = "Forecasts"), group = 1, linetype = "dashed") +
geom_point(shape = 15, size = 2, color = "red", aes(x = factor(row.names(actualValuesAndForecastWithYearCols), levels = row.names(actualValuesAndForecastWithYearCols)), y = pred, col = "Forecasts")) +
geom_line(aes(x = factor(row.names(actualValuesAndForecastWithYearCols), levels = row.names(actualValuesAndForecastWithYearCols)), y = currentYear, col = as.character(highestYear), group = 1)) +
geom_point(shape = 16, size = 3, aes(x = factor(row.names(actualValuesAndForecastWithYearCols), levels = row.names(actualValuesAndForecastWithYearCols)), y = currentYear, col = as.character(highestYear))) +
scale_color_manual(values = c("black", "green4", "orange", "red", "blue")) +
ylab("") +
xlab("") +
ggtitle(paste(toupper(id_of_ts), " forecasts:", sep = " - "), subtitle = paste0("Current (", as.yearmon(first.month.to.forecast), ") + 3 months ahead. Creation time: ", Sys.time(), " UTC")) +
theme(axis.line = element_line(),
axis.text = element_text(color='black'),
axis.title = element_text(colour = 'black'),
legend.text = element_text(),
legend.title = element_blank(),
legend.key = element_rect(colour = "black"),
legend.position = "right",
plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 10))
print(p)
dev.off()
}
best_model <- function(pred, forecasts.test.set, evaluation_in_whole_test, months.to.forecast){
### Gets the best performing model based on MAPE (primary) or MAE (secondary).
### Parameters:
### pred: A data frame with the base (actual) values and one column of forecasts per model.
### forecasts.test.set: The test-set forecasts in long format (used for the dates).
### evaluation_in_whole_test: The error metrics returned by error_metrics().
### months.to.forecast: The forecast horizon being evaluated (e.g. "1_ahead").
flog.info(paste("Determining the best models for", months.to.forecast))
# Filter out row index "Base" from evaluation_in_whole_test
evaluation_in_whole_test <- evaluation_in_whole_test[!(rownames(evaluation_in_whole_test) == "Base"), ]
# Compute the MAPEs and MAEs
if (exists("evaluation")) {
rm(evaluation)
}
# First Reg.trend.seasonal...
evaluation <- cbind((ts(pred[,!(names(pred) %in% c("date", "Base"))][1])-ts(pred$Base))/ts(pred$Base),(ts(pred[,!(names(pred) %in% c("date", "Base"))][1])-ts(pred$Base)))
# ...then the rest
for(i in 4:length(pred)){
evaluation <- cbind(evaluation, (ts(pred[,i]) - ts(pred$Base))/ts(pred$Base))
evaluation <- cbind(evaluation, (ts(pred[,i]) - ts(pred$Base)))
}
evaluation <- data.frame(evaluation)
names(evaluation) <- rep(names(pred)[3:length(pred)], times = 1, each = 2)
if(any(!is.infinite(evaluation_in_whole_test$MAPE))) { # If at least one of the MAPEs is not infinite, rank by MAPE, otherwise by MAE
ranking <- order(evaluation_in_whole_test$MAPE)
} else {
ranking <- order(evaluation_in_whole_test$MAE)
}
best.name <- row.names(evaluation_in_whole_test[ranking, ])[1]
best_model_performance <- data.frame(name = best.name,
value = evaluation[, names(evaluation) %in% best.name],
Date = forecasts.test.set$date[as.character(forecasts.test.set$model) %in% best.name])
names(best_model_performance)[2:3] <- c("MAPE", "MAE")
return(best_model_performance)
}
evaluate_models <- function(dataFull, ts_all, ts_all_original, forecasts.per.month, id_of_ts, start_test, months.ahead, first.month.to.forecast, progressCounter, mape.output.filename, ...) {
### Evaluates the performance of the models, determines the best performing model and calls functions for plotting
### and creating an Excel file for the best performing model's data.
### Parameters:
### dataFull: The full preprocessed data.
### ts_all: The time series object.
### ts_all_original: The time series object before imputation (the evaluation is done against it).
### forecasts.per.month: The baseline (actual monthly values) and all models' forecasts per month.
### id_of_ts: The identifier of the time series.
### start_test: The start date for the testing dataset.
### months.ahead: The months ahead to be forecasted as a vector. 1 is the current (incomplete) month.
### mape.output.filename: The filename of an output text file for the MAPE results.
highestYearinBase <- max(year(subset(forecasts.per.month, model == "Base")$date))
highestYearinForecasts <- max(year(forecasts.per.month$date))
flog.info("Evaluating models")
# forecasts[forecasts["model"] == "Base", "Y"] <- c(ts.all.original) # We do the evaluation against the original (non-imputed) realized values.
best.model.names.per.step <- c()
# Sort forecasts.per.month by date ascending and then type
forecasts.per.month <- forecasts.per.month[order(forecasts.per.month$date, forecasts.per.month$type), ]
# print(forecasts.per.month)
if (exists("pred")) {
rm(pred)
}
for(months.to.forecast in as.character(unique(forecasts.per.month$type)[!(unique(forecasts.per.month$type) %in% "None")])){
# months.to.forecast <- "1_ahead"
# months.to.forecast <- "2_ahead"
# months.to.forecast <- "3_ahead"
# months.to.forecast <- "4_ahead"
flog.info(paste("Evaluating", months.to.forecast))
forecasts.subset <- forecasts.per.month[(forecasts.per.month$type == months.to.forecast | forecasts.per.month$type == "None"),]
forecasts.test.set <- prepare_data_for_evaluation(ts_all = ts_all_original, forecasts.subset, start.test = start_test, months.to.forecast)
evaluation.in.whole.test <- error_metrics(pred = forecasts.test.set[[1]], id_of_ts, months.to.forecast, mape.output.filename)
evaluation.in.whole.test <- as.data.frame(evaluation.in.whole.test)
best.model.performance <- best_model(pred = forecasts.test.set[[1]], forecasts.test.set = forecasts.test.set[[2]], evaluation_in_whole_test = evaluation.in.whole.test, months.to.forecast)
best.model.names.per.step <- append(best.model.names.per.step, as.character(best.model.performance$name[1]), length(best.model.names.per.step))
}
forecastValues <- data.frame(
"fourth.month" = subset(forecasts.per.month, model == as.character(best.model.names.per.step[4]) & type == "4_ahead" & date == max(forecasts.per.month$date))$Y,
"third.month" = subset(forecasts.per.month, model == as.character(best.model.names.per.step[3]) & type == "3_ahead" & as.yearmon(date) == as.yearmon(max(forecasts.per.month$date) %m-% months(1)))$Y,
"second.month" = subset(forecasts.per.month, model == as.character(best.model.names.per.step[2]) & type == "2_ahead" & as.yearmon(date) == as.yearmon(max(forecasts.per.month$date) %m-% months(2)))$Y,
"current.month" = subset(forecasts.per.month, model == as.character(best.model.names.per.step[1]) & type == "1_ahead" & as.yearmon(date) == as.yearmon(max(forecasts.per.month$date) %m-% months(3)))$Y)
# Save the forecasts in a global variable so that summary graph plotting can access them.
assign(paste(id_of_ts, "4-month forecast", sep = "_"), forecastValues, envir = globalenv())
return(list(forecastValues, best.model.names.per.step, best.model.performance, evaluation.in.whole.test))
}
print.progress <- function(progressCounter, dataFull, id_of_ts) {
# cat() returns NULL, so wrapping it in flog.info() logs nothing; print directly instead.
cat(red(paste0("\nOverall progress: ", round(100 * progressCounter / (length(unique(dataFull$id_of_ts)) * length(unique(dataFull$Value))), 1))), "%\n\n")
}
```
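The ranking in `error_metrics()` and `best_model()` above rests on two error metrics. As a minimal sketch on toy vectors (the `actual`/`fc` names are illustrative, not part of the pipeline), they can be computed by hand; `forecast::accuracy()` returns the same quantities alongside RMSE:

```r
# Toy base series and one model's forecasts (illustrative values only).
actual <- c(100, 120, 80, 110)
fc     <- c(90, 130, 80, 100)

# MAE: mean absolute error, in the units of the series.
mae <- mean(abs(fc - actual))  # 7.5

# MAPE: mean absolute percentage error; it becomes Inf/NaN when `actual`
# contains zeros, which is why the pipeline falls back to MAE in that case.
mape <- mean(abs((fc - actual) / actual)) * 100  # ~6.86
```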
```{r}
start_modeling_process <- function(){
### The main function: sets up the environment, fetches the data, prepares the time series,
### calls split_data_and_model_monthly for the model training and evaluation, and reports the results.
id_of_tss <- list("USGDPGR")
aggregateLevelForecastsFlag <- FALSE # Set TRUE to compute the forecasts on data aggregated on the id_of_ts level.
plotModelQuadrantGraphFlag <- TRUE # Set TRUE to plot the "quadrant" graphs, which show each algorithm's performance against the actual values (the base).
plotBestPerformerErrorGraphsFlag <- FALSE # Set TRUE to plot the MAPE and MAE of the algorithm that was evaluated to have performed the best.
disablePreviousMonthDataCheckFlag <- FALSE # Set TRUE to disable checking for the existence of data for the previous month.
skipTimeConsumingAlgorithmsFlag <- FALSE # Set TRUE to skip the forecasts of the four most time-consuming models (ARIMA, ARIMA + Box-Cox, ETS, TBATS).
skipComputingForecastsFlag <- FALSE # Set TRUE to run the script without computing the forecasts (instead reuse the forecasts already in the global environment).
mape.output.filename <- NULL # Give a file name to save the MAPE results in a file.
imputations.output.filename <- NULL # Give a file name to save the records about imputations in a file.
months.ahead <- c(1, 2, 3, 4) # By default we forecast 4 months ahead.
if (!require("pacman")) install.packages("pacman") # install pacman if not present and load it
pacman::p_load( # let pacman control packages
"data.table", # data handling
"jsonlite", # JSON parsing
"ggplot2",
"forecast",
"lubridate",
"futile.logger",
"argparse",
"imputeTS",
"gridExtra",
"tidyr",
"quantmod",
"zoo",
"GGally",
"caret",
"dplyr",
"nnfor",
"prophet",
"xlsx",
"DataCombine",
"XLConnect",
"tools",
"crayon",
"rJava"
)
first.month.to.forecast <- as.yearmon(Sys.Date()) # By default, the first month to forecast is the ongoing month.
start_test <- as.yearmon(Sys.Date() - years(1)) # By default, the test set starts from one year ago.
if(disablePreviousMonthDataCheckFlag) {
first.month.to.forecast <- as.yearmon(as.Date(first.month.to.forecast) %m-% months(1)) # Subtract the number of months from now wanted to be disregarded. By default subtract one month (e.g if data missing from the previous month).
start_test <- as.Date(start_test) %m-% months(1) # Subtract the same number of months as above from the start test date.
}
start_test <- as.character(as.Date(start_test))
start_train <- "2015-06-01"
id_of_ts <- "USGDPGR"
dataFull <- preprocess_data(since = "2013-02-01", query = id_of_ts, first.month.to.forecast)
dataAggr <- preprocess_data(since = "2013-02-01", query = id_of_ts, first.month.to.forecast)
# Let's keep this row here in comments and uncomment it if we want to see what the predictions would have been for months already realized. Set the filter condition
# below to the first day of the month you want filtered out (and all months that come after it).
# dataFull <- dataFull[format(dataFull$Date, "%Y-%m") < format(as.Date("2018-01-01"), "%Y-%m"),]
progressCounter <- 1
if(skipComputingForecastsFlag) {
progressCounter <- progressCounter + length(id_of_tss)
cat(red("Skipping computing forecasts and trying to read them from memory instead, as skipComputingForecastsFlag is set to TRUE.\n"))
return(invisible(NULL)) # `next` is only valid inside a loop; exit the function instead.
}
if(!is.null(dataFull)) {
list.ts <- monthly_decomposition(dataFull, id_of_ts, imputations.output.filename) # Make the time series object.
forecastValues.best.model.names.per.step.and.forecasts.per.month <- split_data_and_model_monthly(dataFull, list.ts, id_of_ts, months.ahead, start_test, start_train, first.month.to.forecast, mape.output.filename, progressCounter, skipTimeConsumingAlgorithmsFlag)
# print.progress(progressCounter, dataFull, id_of_ts)
if(!is.null(forecastValues.best.model.names.per.step.and.forecasts.per.month)) {
# TODO:
# create.excel(forecastValues.best.model.names.per.step.and.forecasts.per.month[[3]], forecastValues.best.model.names.per.step.and.forecasts.per.month[[2]], id_of_ts)
}
if (plotModelQuadrantGraphFlag | plotBestPerformerErrorGraphsFlag) { # If either flag was set, plot the corresponding plot(s).
plots <- plot_base_forecasts(forecasts = forecastValues.best.model.names.per.step.and.forecasts.per.month[[5]], best.model.performance = forecastValues.best.model.names.per.step.and.forecasts.per.month[[3]], months.ahead, id_of_ts, plotModelQuadrantGraphFlag, plotBestPerformerErrorGraphsFlag)
}
print("The best performing models for current month, current + 1 month, current + 2 months, and current + 3 months:")
print(forecastValues.best.model.names.per.step.and.forecasts.per.month[[2]])
print(forecastValues.best.model.names.per.step.and.forecasts.per.month[[3]])
# TODO:
# plot.summary.graph(id_of_ts, actualValuesAndForecastWithYearCols, graph.filename, first.month.to.forecast)
}
progressCounter <- progressCounter + 1
}
```
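A side note on an R precedence pitfall that matters for index arithmetic like that in `prepare_data_for_evaluation()`: `:` binds tighter than `-`, so `1:length(x) - 1` is `(1:length(x)) - 1`, not `1:(length(x) - 1)`. A quick illustration with a hypothetical three-element vector:

```r
x <- c("a", "b", "c")
1:length(x) - 1    # (1:3) - 1  ->  0 1 2
1:(length(x) - 1)  # 1:2        ->  1 2
```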
```{r}
start_modeling_process()
```