When parallel_processing
within
forecast_time_series()
is set to “local_machine”, each time
series (including training models on the entire data set) is ran in
parallel on the users local machine. Each time series will run on a
separate core of the machine. Hyperparameter tuning, model refitting,
and model averaging will be ran sequentially, which cannot be done in
parallel since a parallel process is already running on the machine for
each time series. This works well for data that contains many time
series where you might only want to run a few simpler models, and in
scenarios where cloud computing is not available.
If parallel_processing
is set to NULL and
inner_parallel
is set to TRUE within
forecast_time_series
, then each time series is ran
sequentially but the hyperparameter tuning, model refitting, and model
averaging is ran in parallel. This works great for data that has a
limited number of time series where you want to run a lot of back
testing and build dozens of models within Finn.
To leverage the full power of Finn, running within Azure is the best
choice in building production ready forecasts that can easily scale. The
most efficient way to run Finn is to set
parallel_processing
to “spark” within
forecast_time_series()
. This will run each time series in
parallel across a spark compute cluster.
Sparklyr is a great R package that allows you to run R code across a spark cluster. A user simply has to connect to a spark cluster then run Finn. Below is an example on how you can run Finn using spark on Azure Databricks. Also check out the growing R support with using spark on Azure Synapse.
# load CRAN libraries
library(finnts)
library(sparklyr)
install.packages("qs")
library(qs)
# connect to spark cluster
options(sparklyr.log.console = TRUE)
options(sparklyr.spark_apply.serializer = "qs") # uses the qs package to improve data serialization before sending to spark cluster
sc <- sparklyr::spark_connect(method = "databricks")
# call Finn with spark parallel processing
hist_data <- timetk::m4_monthly %>%
dplyr::rename(Date = date) %>%
dplyr::mutate(id = as.character(id))
data_sdf <- sparklyr::copy_to(sc, hist_data, "data_sdf", overwrite = TRUE)
run_info <- set_run_info(
experiment_name = "finn_fcst",
run_name = "spark_run_1",
path = "/dbfs/mnt/example/folder" # important that you mount an ADLS path
)
forecast_time_series(
run_info = run_info,
input_data = data_sdf,
combo_variables = c("id"),
target_variable = "value",
date_type = "month",
forecast_horizon = 3,
parallel_processing = "spark",
return_data = FALSE
)
# return the outputs as a spark data frame
finn_output_tbl <- get_forecast_data(run_info = run_info,
return_type = "sdf")
The above example runs each time series on a separate core on a spark
cluster. You can also submit multiple time series where each time series
runs on a separate spark executor (VM) and then leverage all of the
cores on that executor to run things like hyperparameter tuning or model
refitting in parallel. This creates two levels of parallelization. One
at the time series level, then another when doing things like
hyperparameter tuning within a specific time series. To do that set
inner_parallel
to TRUE in
forecast_time_series()
. Also make sure that you adjust the
number of spark executor cores to 1, that ensures that only 1 time
series runs on an executor at a time. Leverage the
“spark.executor.cores” argument when configuring your spark connection.
This can be done using sparklyr
or within the cluster manager itself within the Azure resource. Use the
“num_cores” argument in the “forecast_time_series” function to control
how many cores should be used within an executor when running things
like hyperparameter tuning.
forecast_time_series()
will be looking for a variable
called “sc” to use when submitting tasks to the spark cluster, so make
sure you use that as the variable name when connecting to spark. Also
it’s important that you mount your spark session to an Azure Data Lake
Storage (ADLS) account, and provide the mounted path to where you’d like
your Finn results to be written to within
set_run_info()
.