# Probabilistic Forecast Reconciliation via energy score optimisation.

## Background

The ProbReco package carries out probabilistic forecast reconciliation via score optimisation. This vignette describes how to set up the inputs to the main functions inscoreopt() and scoreopt() which find reconciliation weights using Stochastic Gradient Descent.

### Probabilistic Forecast Reconciliation

For a full treatment of probabilistic forecast reconciliation see (Panagiotelis et al. 2020). Let $$\boldsymbol{y}_t$$ be a $$n$$-vector observed at time $$t$$ that is known to belong to an $$m$$-dimensional subspace of $$\mathbb{R}^n$$ denoted $$\mathfrak{s}$$ with $$m<n$$. Suppose that a base probabilistic forecast for $$\boldsymbol{y}_{t+h}$$ at time $$t$$ is defined as

$\hat{\nu}_{t+h|t}(\mathcal{A})=\textit{Pr}(\boldsymbol{y}_t\in\mathcal{A})\,,$

where $$\mathcal{A}$$ is some region of $$\mathbb{R}^n$$ (more precisely a member of the usual Borel $$\sigma$$-algebra on $$\mathbb{R}^n$$). A reconciled probabilistic forecast is a probability measure $$\tilde{\nu}_{t+h|t}$$ with the property

$\tilde{\nu}_{t+h|t}(\psi(\mathcal{A}))=\hat{\nu}_{t+h|t}(\mathcal{A})\,,$

where $$\psi:\mathbb{R}^n\rightarrow\mathfrak{s}$$ is a mapping and $$\psi({\mathcal{A}})$$ is the image of $$\mathcal{A}$$ under $$\psi$$. In this package we consider linear transformations for $$\psi$$

$\tilde{\boldsymbol{y}}_{t+h|t}=\psi(\hat{\boldsymbol{y}}_{t+h|t})=\boldsymbol{S}(\boldsymbol{d}+\boldsymbol{G}\hat{\boldsymbol{y}}_{t+h|t})\,,$ where $$\boldsymbol{S}$$ is an $$n\times m$$ matrix whose columns span $$\mathfrak{s}$$. If $$\hat{\boldsymbol{y}}_{t+h|t}$$ is sampled from the base forecast then $$\tilde{\boldsymbol{y}}_{t+h|t}$$ will be sampled from the reconciled forecast. The objective of probabilistic forecast reconciliation is to find values of $$\boldsymbol{G}$$ and $$\boldsymbol{d}$$ that are optimal.

### Scoring Rules

Scoring rules are used to measure the quality of probabilistic forecasts. The default loss function used by the package is the total energy score given by

$L=\sum\limits_{t\in\mathcal{T}}\hat{K}_{t+h}$

where $$\mathcal{T}$$ is a training set and $$K_{t+h}$$ is an estimate for energy score given by

$\hat{K}_{t+h}=Q^{-1}\left(\sum\limits_{q=1}^Q\left|\left|\boldsymbol{y}_{t+h}-\boldsymbol{S}\left(\boldsymbol{d}+\boldsymbol{G}\hat{\boldsymbol{y}}^{[q]}_{t+h|t}\right)\right|\right|^\alpha-\frac{1}{2}\sum\limits_{q=1}^Q\left|\left|\boldsymbol{SG}\left(\hat{\boldsymbol{y}}^{[q]}_{t+h|t}-\hat{\boldsymbol{y}}^{[q]*}_{t+h|t}\right)\right|\right|^\alpha\right)\,,$

where $$\hat{\boldsymbol{y}}^{}_{t+h|t},\ldots,\hat{\boldsymbol{y}}^{[Q]}_{t+h|t}$$ and $$\hat{\boldsymbol{y}}^{*}_{t+h|t},\ldots,\hat{\boldsymbol{y}}^{[Q]*}_{t+h|t}$$ are independent copies drawn from the base forecast distribution, $$||.||$$ is the $$L_2$$ norm and $$\alpha\in(0,2]$$ with $$\alpha=1$$ by default. Note that this representation is a negatively oriented energy score, i.e. smaller values of the score indicate more accurate forecasts. For more on scoring rules see (Gneiting and Raftery 2007).

The variogram score (Scheuerer and Hamill 2015) is also implemented in the package and is given by

$\hat{K}_{t+h}=\sum\limits_{i=1}^n\sum\limits_{j=1}^n\left(\left|\left|y_{i,t+h}-y_{j,t+h}\right|\right|^\alpha-Q^{-1}\sum\limits_{q=1}^Q\left|\left|\tilde{y}^{[q]}_{i,t+h|t}-\tilde{y}^{[q]*}_{j,t+h|t}\right|\right|^\alpha\right)\,,$ where $$\tilde{\boldsymbol{y}}^{[q]}_{t+h|t}=\boldsymbol{S}(\boldsymbol{d}+\boldsymbol{G}\hat{\boldsymbol{y}}^{[q]}_{t+h|t})$$, $$\tilde{\boldsymbol{y}}^{[q]*}_{t+h|t}=\boldsymbol{S}(\boldsymbol{d}+\boldsymbol{G}\hat{\boldsymbol{y}}^{[q]*}_{t+h|t})$$ and $$\alpha=1$$ by default (but can be changed).

### Optimisation

A challenging aspect of optimising the loss function is its stochastic nature. The technique used to maximise the loss function is Stochastic Gradient Descent. The gradients with respect to $$\boldsymbol{G}$$ and $$\boldsymbol{d}$$ are found using algorithmic differentiation which is implemented using the Stan Math C++ header only library (see Carpenter et al. 2015). The learning rates for each update are determined using the Adam method (see Kingma and Ba 2014).

## Using the inscoreopt() function

The more straightforward function to use in the package is inscoreopt(). This only requires the data and predicted values. The probabilistic forecasts are not true forecasts in the sense that they are obtained using in sample information.

### The S matrix

Consider a 3-variable hierarchy where $$y_{1,t}=y_{2,t}+y_{3,t}$$. An $$\boldsymbol{S}$$ matrix for this hierarchy is

$\boldsymbol{S}=\begin{pmatrix}1&1\\1&0\\0&1\end{pmatrix}$

which we define in R by

S<-matrix(c(1,1,1,0,0,1),3,2, byrow = TRUE)
S
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    1    0
#> [3,]    0    1

### The data

Suppose the data are given by

$\begin{pmatrix}y_{2,t}\\y_{3,t}\end{pmatrix}\sim N\left(\begin{pmatrix}1\\1\end{pmatrix},\begin{pmatrix}1&0\\0&1\end{pmatrix}\right)\,,$

to demonstrate the functions, ten observations of data from the full hierarchy can be simulated by

y<-matrix(NA,3,10) #preallocate matrix
for (t in 1:10){
y[,t]<-S%*%(c(1,1)+rnorm(2))
}

Naturally, in real applications the data will not need to be simulated.

For demonstration purposes, suppose the predicted values are simply randomly generated from a uniform distribution (in a real applications these will come from a model). The argument yhat must be a matrix of the same size as y

yhat<-matrix(runif(30),3,10)

The reconciliation weights can be trained via

library(ProbReco)
out<-inscoreopt(y=y,yhat=yhat,S=S)
out$d #>  0.2890895 0.2739345 out$G
#>           [,1]       [,2]        [,3]
#> [1,] 0.5204758  0.7639945 -0.02491508
#> [2,] 0.4550564 -0.1715359  0.74337060

Different assumptions can be made about Gaussianity and dependence in the base forecasts using the arguments basedep and basedist

## Setting up function inputs for scoreopt()

This section describes how to set up the inputs for the scoreopt() function. These are more easily created by the purrr package which can be loaded using:

library(purrr)

### The data argument

In general, rather than using in-sample predictions, a rolling window can be set up where probabilistic forecasts and realisations are both used to train reconciliation weights. The data argument is a list where each list element is an $$n$$-vector corresponding to a realisation of the data. Each element of the list corresponds to a training observation. Assuming the same DGP as the previous example, this can be constructed as follows

data<-map(1:10,function(i){S%*%(c(1,1)+rnorm(2))})
data[]
#>          [,1]
#> [1,] 3.020937
#> [2,] 1.786201
#> [3,] 1.234737
data[]
#>          [,1]
#> [1,] 2.501348
#> [2,] 1.270163
#> [3,] 1.231185

In practice, an actual dataset will need to be converted into this form. This form of storing data is unconvential but allows for great flexibility in the way base probabilistic forecasts can be defined.

### The prob argument

The base probabilistic forecast in general will vary for each period of the training sample. This information is also stored in a list. Each list element corresponds to a training observation. The list elements themselves are functions that simulate from the base probabilistic distribution.

Suppose that the base probabilistic forecast is to simulate from a trivariate standard normal $$N(\boldsymbol{0}_{3\times 1},\boldsymbol{I}_{3\times 3})$$ for each training observation (in practice the probabilistic forecast will vary over the training observations). First define a function that simulates 50 iterates from $$N(\boldsymbol{0}_{3\times 1},\boldsymbol{I}_{3\times 3})$$ and stores these in a matrix. In practice a value larger than 50 should be used to estimate the energy score. A list of these functions can be created using the map() function.

#Function
f<-function(){matrix(rnorm(50*3),3,50)}
prob<-map(1:10,function(i){f})
prob[]
#> function(){matrix(rnorm(50*3),3,50)}
prob[]
#> function(){matrix(rnorm(50*3),3,50)}

### The Gvec and Ginit argument

The reconciliation parameters enter functions as a single argument. The first $$m$$ elements are the elements of $$\boldsymbol{d}$$. The remaining elements are the elements of $$\boldsymbol{G}$$ vectorised (in column-major ordering). A random initial value can be simulated as

Gvec<-runif(8)
Gvec
#>  0.84437123 0.45541695 0.59426441 0.06415706 0.32560712 0.49533097 0.41080253
#>  0.13278460

### Checking inputs with checkinputs

To check that all inputs are correctly specified use the checkinputs() function

checkinputs(data,prob,S,Gvec)

An error will be thrown if an input is incorrectly set up.

## The total_score and scoreopt functions

The function total_score() provides an estimate of the total score and its gradient. The gradient is evaluated by algorithmic differentiation.

out<-total_score(data,prob,S,Gvec)
out$value #>  20.21764 out$grad
#>  -3.7151235 -2.1254724 -1.7135674 -0.1695672 -0.3625651 -2.0611004 -1.1712909
#>  -0.8725126

Since the loss function is stochastic, these values will change each time the function is run.

out2<-total_score(data,prob,S,Gvec)
out$value #>  20.21764 out2$value
#>  19.99923
out$grad #>  -3.7151235 -2.1254724 -1.7135674 -0.1695672 -0.3625651 -2.0611004 -1.1712909 #>  -0.8725126 out2$grad
#>  -4.1454552 -2.4424070 -1.9166395 -0.7954665 -1.0984027 -2.2855283 -0.8056046
#>  -0.4656124

The function scoreopt() maximises the loss function using stochastic gradient descent.

out<-scoreopt(data,prob,S)

The output includes $$\boldsymbol{d}$$ and $$\boldsymbol{G}$$ which is converted back to a matrix using column-major ordering. The converged value of the loss function is also provided.

out$d #>  0.4430438 0.4169748 out$G
#>           [,1]        [,2]       [,3]
#> [1,] 0.6733619  1.04035368 -0.3174013
#> [2,] 0.6452485 -0.02940679  0.9918338
out$val #>  20.1352 Since everything is Gaussian, and the true mean of the base forecasts is $$\boldsymbol{0}$$, the mean of the reconciled distribution is given by $$\boldsymbol{S}\boldsymbol{a}$$. The variance covariance matrix is given by $$\boldsymbol{S}\boldsymbol{G}\boldsymbol{I}\boldsymbol{G}'\boldsymbol{S}'$$. The mean S%*%out$d
#>           [,1]
#> [1,] 0.8600187
#> [2,] 0.4430438
#> [3,] 0.4169748

is close to the true mean of $$(2,1,1)'$$ (keeping in mind that the sample size is small). The variance covariance matrix is

S%*%out$G%*%t(out$G)%*%t(S)
#>          [,1]       [,2]       [,3]
#> [1,] 3.215606 1.72557850 1.49002758
#> [2,] 1.725578 1.63649558 0.08908292
#> [3,] 1.490028 0.08908292 1.40094467

which is close to the true variance covariance matrix

$\begin{pmatrix}2&1&1\\1&1&0\\1&0&1\end{pmatrix}$

These values will be even closer if more training observations are used and more iterates are drawn from the probabilistic forecasts to estimate the score.