```
library(hardhat)
library(tibble)
library(modeldata)
data(penguins)
penguins <- na.omit(penguins)
```

The goal of this vignette is to teach you how to use `mold()` and `forge()` in a modeling package. This is the intended use of these functions, even though they can also be called interactively. Creating a new modeling package has two main stages: creating the model fitting function, and implementing a predict method. The stages break down like this:

Stage 1 - Model Fitting

Create a model constructor.

Create a fitting implementation function.

Create a common bridge to go between high level user facing methods and the lower level constructor and implementation function.

Create a user facing function with methods for data frame, matrix, formula, and recipe inputs.

I imagine that this comes together as a few internal pieces that power the user facing methods for your model.

Stage 2 - Model Prediction

Create one or more prediction implementation functions, varying by the `"type"` of prediction to make.

Create a common bridge between a high level predict method and the lower level prediction implementation functions.

Create a user facing predict method.

In this case, there is only one user facing method, `predict()`. Many models have multiple internal implementation functions that you’ll switch between, depending on the `"type"`.

The end result is a single high level modeling function that has methods for multiple different “interfaces”, and a corresponding predict method to make predictions using one of these models along with new data (by “interfaces”, I just mean the different types of inputs, so: data frame, matrix, formula and recipe).

There are obviously other things that you might want your modeling package to do. For instance, you might implement a `plot()` or `summary()` method. But the two stages described here are necessary for almost every model, and they involve the inputs and outputs that hardhat helps the most with.

We will use the underlying `lm()` infrastructure, `lm.fit()`, to create our model. Linear regression should be recognizable to many, so we can focus on understanding how `mold()` and `forge()` fit in the bigger picture, rather than trying to understand how the model works.

`lm.fit()` takes `x` and `y` directly, rather than using the formula method. It will serve as the core part of our modeling implementation function. More generally, it is easiest if the core implementation function of your algorithm takes `x` and `y` in this manner, since that is how `mold()` will standardize the inputs.
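As a minimal sketch of that `x`/`y` calling convention, `lm.fit()` can be called directly on a numeric matrix and a numeric vector. The built-in `mtcars` data and the hand-built intercept column are choices made here for illustration only; they are not part of the vignette's running example.

```r
# Design matrix with a manually added intercept column
x <- cbind("(Intercept)" = 1, wt = mtcars$wt)
y <- mtcars$mpg

fit <- lm.fit(x, y)

# Coefficients are named after the columns of `x`
fit$coefficients

# They agree with what the formula interface of `lm()` computes
all.equal(
  unname(fit$coefficients),
  unname(coef(lm(mpg ~ wt, data = mtcars)))
)
```

Note that nothing in `lm.fit()` adds the intercept for you: whatever columns are in `x` are the terms of the model.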

We will call the model `simple_lm()`. It won’t have all the features of normal linear regression (weights, offsets, etc), but will serve as a nice dummy model to show off features you get with hardhat.

The first thing we need is a modeling *constructor*. Constructors are very simple functions that create new objects of our model class. In the arguments to the constructor you supply all of the individual pieces, and it wraps them up into a model object. The hardhat function `new_model()` can help with creating that.

A model constructor should:

Have the name `new_<model_class>()`.

Take the required model elements as named arguments, including a required `blueprint`.

Validate the types of the new elements.

Pass the named elements on to `new_model()` along with setting the class to `"<model_class>"`.

If you want to learn more about the details of constructors and creating S3 classes, take a look at the S3 section in Advanced R.

```
new_simple_lm <- function(coefs, coef_names, blueprint) {
  if (!is.numeric(coefs)) {
    stop("`coefs` should be a numeric vector.", call. = FALSE)
  }

  if (!is.character(coef_names)) {
    stop("`coef_names` should be a character vector.", call. = FALSE)
  }

  if (length(coefs) != length(coef_names)) {
    stop("`coefs` and `coef_names` must have the same length.")
  }

  new_model(
    coefs = coefs,
    coef_names = coef_names,
    blueprint = blueprint,
    class = "simple_lm"
  )
}
```

A `"simple_lm"` object has just enough information to make numeric predictions on new data, but you can store other things here as well to enable your model object to work with extra post-fitting functionality.

We can test this by manually generating a model object. Along with the custom class we provided, this object also has a `"hardhat_model"` class. There is a very simple print method for objects of this type. Specifically, it prints the name of the class at the top, and only prints out the custom elements (i.e. not the `blueprint`).

```
manual_model <- new_simple_lm(1, "my_coef", default_xy_blueprint())

manual_model
#> <simple_lm>
#> $coefs
#> [1] 1
#>
#> $coef_names
#> [1] "my_coef"

names(manual_model)
#> [1] "coefs" "coef_names" "blueprint"

manual_model$blueprint
#> XY blueprint:
#>
#> # Predictors: 0
#> # Outcomes: 0
#> Intercept: FALSE
#> Novel Levels: FALSE
#> Composition: tibble
```

The implementation function is where the hard work is done. I generally recommend naming it `<model_class>_impl()`. It should accept predictors and outcomes in whatever form is required for the algorithm, run the algorithm, and return a named list of the new elements you added to the model constructor. You might also have arguments for extra options that can be used to tweak the internal algorithm.

```
simple_lm_impl <- function(predictors, outcomes) {
  lm_fit <- lm.fit(predictors, outcomes)

  coefs <- lm_fit$coefficients

  coef_names <- names(coefs)
  coefs <- unname(coefs)

  list(
    coefs = coefs,
    coef_names = coef_names
  )
}
```

This simple linear regression implementation just calls `lm.fit()` with `x = predictors` and `y = outcomes`. `lm.fit()` expects a matrix of predictors and a vector of outcomes (at least for univariate regression). In a moment we will discuss how to create those.

```
predictors <- as.matrix(subset(penguins, select = bill_length_mm))
outcomes <- penguins$body_mass_g

simple_lm_impl(predictors, outcomes)
#> $coefs
#> [1] 95.49649
#>
#> $coef_names
#> [1] "bill_length_mm"
```

Now that we have our constructor and our implementation function, we can create a common function that will be used in all of our top level methods (for data frames, matrices, formulas, and recipes). It will call the implementation function, and then use that information along with the blueprint to create a new instance of our model. It should have an argument for the output from a call to `mold()`; here I’ve called that `processed`. That object will contain (at a minimum) the predictors, outcomes, and blueprint. You might also have arguments for additional options to pass on to the implementation function.

The bridge function should take the standardized predictors and outcomes and convert them to the lower level types that the implementation function requires. The `predictors` and `outcomes` that are returned from `mold()` will *always* be data frames, so in this case we can convert them to matrices and vectors directly for use in the lower level function.
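The conversion described above can be sketched in isolation. This assumes hardhat is installed and uses the built-in `iris` data as a stand-in for your own data; the variable names `x` and `y` are only illustrative.

```r
library(hardhat)

# `iris` stands in for any data set here
processed <- mold(Sepal.Length ~ Sepal.Width + Petal.Length, iris)

# Both elements come back as data frames (tibbles)
is.data.frame(processed$predictors)
is.data.frame(processed$outcomes)

# Convert to the lower level types an `x`/`y` implementation expects
x <- as.matrix(processed$predictors)
y <- processed$outcomes[[1]]

is.matrix(x)
is.numeric(y)
```

Because the outcome is always a data frame, even a single outcome has to be pulled out with `[[1]]` before it can be used as a vector.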

This is also a good place to use some of hardhat’s validation functions. In this case, we always expect the outcome to have a single column since this is a univariate model, so we can use `validate_outcomes_are_univariate()` to enforce that.

```
simple_lm_bridge <- function(processed) {
  validate_outcomes_are_univariate(processed$outcomes)

  predictors <- as.matrix(processed$predictors)
  outcomes <- processed$outcomes[[1]]

  fit <- simple_lm_impl(predictors, outcomes)

  new_simple_lm(
    coefs = fit$coefs,
    coef_names = fit$coef_names,
    blueprint = processed$blueprint
  )
}
```

At this point, we can simulate user input and pass it on to our bridge to run a model.

```
# Simulate formula interface
processed_1 <- mold(bill_length_mm ~ body_mass_g + species, penguins)

# Simulate xy interface
processed_2 <- mold(x = penguins["body_mass_g"], y = penguins$bill_length_mm)

simple_lm_bridge(processed_1)
#> <simple_lm>
#> $coefs
#> [1] 0.003754612 24.908763524 34.817525835 28.447942512
#>
#> $coef_names
#> [1] "body_mass_g" "speciesAdelie" "speciesChinstrap" "speciesGentoo"
simple_lm_bridge(processed_2)
#> <simple_lm>
#> $coefs
#> [1] 0.01022951
#>
#> $coef_names
#> [1] "body_mass_g"
```

Multiple outcomes are an error:

```
multi_outcome <- mold(bill_length_mm + bill_depth_mm ~ body_mass_g + species, penguins)

simple_lm_bridge(multi_outcome)
#> Error: The outcome must be univariate, but 2 columns were found.
```

With all of the pieces in place, we have everything we need to create our high level modeling interface. This should be a generic function, generally with methods for data frames, matrices, formulas, and recipes. Each method should call `mold()` with the method specific inputs to run the preprocessing, and then pass off to the bridge function to run the actual model. It is also good practice to provide a default method with a nice error message for unknown types.

```
# Generic
simple_lm <- function(x, ...) {
  UseMethod("simple_lm")
}

# Default
simple_lm.default <- function(x, ...) {
  stop(
    "`simple_lm()` is not defined for a '", class(x)[1], "'.",
    call. = FALSE
  )
}

# XY method - data frame
simple_lm.data.frame <- function(x, y, ...) {
  processed <- mold(x, y)
  simple_lm_bridge(processed)
}

# XY method - matrix
simple_lm.matrix <- function(x, y, ...) {
  processed <- mold(x, y)
  simple_lm_bridge(processed)
}

# Formula method
simple_lm.formula <- function(formula, data, ...) {
  processed <- mold(formula, data)
  simple_lm_bridge(processed)
}

# Recipe method
simple_lm.recipe <- function(x, data, ...) {
  processed <- mold(x, data)
  simple_lm_bridge(processed)
}
```

Let’s give it a try:

```
predictors <- penguins[c("bill_length_mm", "bill_depth_mm")]
outcomes_vec <- penguins$body_mass_g
outcomes_df <- penguins["body_mass_g"]

# Vector outcome
simple_lm(predictors, outcomes_vec)
#> <simple_lm>
#> $coefs
#> [1] 110.88151 -40.16918
#>
#> $coef_names
#> [1] "bill_length_mm" "bill_depth_mm"
# 1 column data frame outcome
simple_lm(predictors, outcomes_df)
#> <simple_lm>
#> $coefs
#> [1] 110.88151 -40.16918
#>
#> $coef_names
#> [1] "bill_length_mm" "bill_depth_mm"
# Formula interface
simple_lm(body_mass_g ~ bill_length_mm + bill_depth_mm, penguins)
#> <simple_lm>
#> $coefs
#> [1] 110.88151 -40.16918
#>
#> $coef_names
#> [1] "bill_length_mm" "bill_depth_mm"
```

We can use preprocessing as well, and it is handled by `mold()`.

```
library(recipes)
# - Log a predictor
# - Generate dummy variables for factors
simple_lm(body_mass_g ~ log(bill_length_mm) + species, penguins)
#> <simple_lm>
#> $coefs
#> [1] 3985.047 -10865.973 -11753.182 -10290.188
#>
#> $coef_names
#> [1] "log(bill_length_mm)" "speciesAdelie" "speciesChinstrap"
#> [4] "speciesGentoo"
# Same, but with a recipe
rec <- recipe(body_mass_g ~ bill_length_mm + species, penguins) %>%
  step_log(bill_length_mm) %>%
  step_dummy(species, one_hot = TRUE)

simple_lm(rec, penguins)
#> <simple_lm>
#> $coefs
#> [1] 3985.047 -10865.973 -11753.182 -10290.188
#>
#> $coef_names
#> [1] "bill_length_mm" "species_Adelie" "species_Chinstrap"
#> [4] "species_Gentoo"
```

You might have noticed that our linear regression isn’t adding an intercept. Generally, with linear regression models we will want a default intercept added on. To accomplish this, we can add an `intercept` argument to our user facing function, and then use that to tweak the blueprint that would otherwise be created for you automatically.

```
simple_lm <- function(x, ...) {
  UseMethod("simple_lm")
}

simple_lm.data.frame <- function(x, y, intercept = TRUE, ...) {
  blueprint <- default_xy_blueprint(intercept = intercept)
  processed <- mold(x, y, blueprint = blueprint)
  simple_lm_bridge(processed)
}

simple_lm.matrix <- function(x, y, intercept = TRUE, ...) {
  blueprint <- default_xy_blueprint(intercept = intercept)
  processed <- mold(x, y, blueprint = blueprint)
  simple_lm_bridge(processed)
}

simple_lm.formula <- function(formula, data, intercept = TRUE, ...) {
  blueprint <- default_formula_blueprint(intercept = intercept)
  processed <- mold(formula, data, blueprint = blueprint)
  simple_lm_bridge(processed)
}

simple_lm.recipe <- function(x, data, intercept = TRUE, ...) {
  blueprint <- default_recipe_blueprint(intercept = intercept)
  processed <- mold(x, data, blueprint = blueprint)
  simple_lm_bridge(processed)
}
```

```
# By default an intercept is included
simple_lm(predictors, outcomes_df)
#> <simple_lm>
#> $coefs
#> [1] 3413.45185 74.81263 -145.50718
#>
#> $coef_names
#> [1] "(Intercept)" "bill_length_mm" "bill_depth_mm"
# But the user can turn this off
simple_lm(body_mass_g ~ log(bill_length_mm) + species, penguins, intercept = FALSE)
#> <simple_lm>
#> $coefs
#> [1] 3985.047 -10865.973 -11753.182 -10290.188
#>
#> $coef_names
#> [1] "log(bill_length_mm)" "speciesAdelie" "speciesChinstrap"
#> [4] "speciesGentoo"
```

Note that even the formula method respects this `intercept` argument. To recap, by default `mold()` will *not* automatically add an intercept for any method, including the formula method.
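To see that default in isolation, here is a small sketch that calls `mold()` directly with and without an intercept-enabled blueprint. It assumes hardhat is installed and uses the built-in `iris` data in place of `penguins`; the names `no_int`, `with_int`, and `bp` are illustrative.

```r
library(hardhat)

# Default formula blueprint: no intercept column is added
no_int <- mold(Sepal.Length ~ Sepal.Width, iris)
colnames(no_int$predictors)

# Opt in through the blueprint
bp <- default_formula_blueprint(intercept = TRUE)
with_int <- mold(Sepal.Length ~ Sepal.Width, iris, blueprint = bp)
colnames(with_int$predictors)
```

The intercept is just another predictor column, which is why the implementation function does not need any special handling for it.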

On the prediction side, we need implementation functions, just as we did for fitting our model. These vary based on the `"type"` argument to `predict()`. The `"type"` might be `"numeric"` for numeric predictions as we will use here, or it could be `"class"` for hard class predictions, `"prob"` for class probabilities, and more. A set of recommended names for `"type"` can be found in the Model Predictions section of the implementation principles.

For our model, we will focus on only returning numeric predictions. I generally like to name these prediction implementation functions `predict_<model_class>_<type>()`. The arguments for the implementation functions should include the model object and the predictors in the form that your prediction algorithm expects (here, a matrix).

Also used here is another hardhat function for standardizing prediction output, `spruce_numeric()`. This function tidies up the numeric response output, and automatically standardizes it to match the recommendations in the principles guide. The output is always a tibble, and for the `"numeric"` type it has 1 column, `.pred`.
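A quick sketch of that output shape, assuming hardhat is installed. The values in `raw_pred` are made up for illustration and stand in for whatever your prediction code computes.

```r
library(hardhat)

# Hypothetical raw predictions from some implementation function
raw_pred <- c(39.1, 39.3, 36.9)

out <- spruce_numeric(raw_pred)

names(out)  # a single column, `.pred`
nrow(out)   # one row per prediction
```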

```
predict_simple_lm_numeric <- function(object, predictors) {
  coefs <- object$coefs

  pred <- as.vector(predictors %*% coefs)

  out <- spruce_numeric(pred)

  out
}
```

To test it, we will have to run a model and call `forge()` on the output manually. The higher level user facing function will do this automatically.

```
model <- simple_lm(bill_length_mm ~ body_mass_g + species, penguins)

predictors <- forge(penguins, model$blueprint)$predictors
predictors <- as.matrix(predictors)

predict_simple_lm_numeric(model, predictors)
#> # A tibble: 333 x 1
#> .pred
#> <dbl>
#> 1 39.0
#> 2 39.2
#> 3 37.1
#> 4 37.9
#> 5 38.6
#> 6 38.5
#> 7 42.5
#> 8 36.9
#> 9 39.2
#> 10 41.4
#> # … with 323 more rows
```

A prediction bridge converts the standardized `predictors` into the lower level type that the prediction implementation functions expect. `predictors` here is always a data frame, and is part of the return value of a call to `forge()`. Since the prediction implementation function takes a matrix, we convert it to that here.

Additionally, it should `switch()` on the `type` argument to decide which of the prediction implementation functions to call. Here, when `type == "numeric"`, `predict_simple_lm_numeric()` is called.

I also like using `rlang::arg_match()` to validate that `type` is one of the accepted prediction types. This has an advantage over `match.arg()` in that partial matches are not allowed, and the error messages are a bit nicer.
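The difference between the two can be sketched with a pair of throwaway helpers. `base_type()` and `strict_type()` are hypothetical names used only here, and the `"class"` option is added purely so there is more than one value to match against. The sketch assumes rlang is installed.

```r
# Base R: partial matches are silently completed
base_type <- function(type) {
  match.arg(type, c("numeric", "class"))
}

# rlang: `type` must match one of the values exactly
strict_type <- function(type) {
  rlang::arg_match(type, values = c("numeric", "class"))
}

# Partial input is accepted and expanded by `match.arg()`
base_type("num")

# The same input is an error with `arg_match()`
res <- tryCatch(strict_type("num"), error = function(e) e)
inherits(res, "error")
```

Rejecting partial matches keeps a typo from quietly selecting a prediction type the user did not intend.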

```
predict_simple_lm_bridge <- function(type, object, predictors) {
  type <- rlang::arg_match(type, "numeric")

  predictors <- as.matrix(predictors)

  switch(
    type,
    numeric = predict_simple_lm_numeric(object, predictors)
  )
}
```

Let’s test:

```
model <- simple_lm(bill_length_mm ~ body_mass_g + species, penguins)

# Pass in the data frame
predictors <- forge(penguins, model$blueprint)$predictors

predict_simple_lm_bridge("numeric", model, predictors)
#> # A tibble: 333 x 1
#> .pred
#> <dbl>
#> 1 39.0
#> 2 39.2
#> 3 37.1
#> 4 37.9
#> 5 38.6
#> 6 38.5
#> 7 42.5
#> 8 36.9
#> 9 39.2
#> 10 41.4
#> # … with 323 more rows
# Partial matches are an error
predict_simple_lm_bridge("numer", model, predictors)
#> Error: `type` must be one of "numeric".
#> Did you mean "numeric"?
```

Finally, we can create an S3 method for the generic `predict()` function. To match the modeling principles, it should use `new_data` to accept a matrix or data frame of new predictors.

The first thing that the `predict()` method should do is call `forge()` with the `new_data` and the `blueprint` that we attached to our `simple_lm` model at fit time. This performs the required preprocessing on the new data, and checks that the *type* of the data frame supplied to `new_data` matches the type of data frame supplied at fit time. This is one of the most valuable features of `forge()`, as it adds a large amount of robustness to your `predict()` method. You will see some examples of this at the end of the vignette.
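To make the shape of that call concrete, here is a sketch of `mold()` and `forge()` used back to back outside of any model. It assumes hardhat is installed and uses the built-in `iris` data as a stand-in; `forged` and `res` are illustrative names.

```r
library(hardhat)

# Fit-time preprocessing on a stand-in data set
processed <- mold(Sepal.Length ~ Sepal.Width + Species, iris)

# Prediction-time preprocessing reuses the stored blueprint
new_data <- head(iris)
forged <- forge(new_data, processed$blueprint)

# Same predictor columns as at fit time, one row per row of `new_data`
identical(colnames(forged$predictors), colnames(processed$predictors))
nrow(forged$predictors) == nrow(new_data)

# A missing required column is caught before any prediction code runs
res <- tryCatch(
  forge(new_data[c("Sepal.Length", "Species")], processed$blueprint),
  error = function(e) e
)
inherits(res, "error")
```

The predictors element of the result is what a `predict()` method would hand to its bridge function.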

After calling `forge()`, it should pass off to the bridge function to call the correct prediction function based on the type.

Finally, it is good practice to call the hardhat function `validate_prediction_size()` with the return value and the original `new_data` to ensure that the number of rows in the output is the same as the number of rows in the input. If a prediction cannot be made for a row of `new_data`, an `NA` value should be placed there instead. Mainly, this validation function is a check on the model developer to ensure that you always return output with a sane length.
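A small sketch of that check, assuming hardhat and tibble are installed. The prediction values and the one-column `new_data` are made up for illustration.

```r
library(hardhat)
library(tibble)

new_data <- tibble(x = 1:3)

# Correct size: one prediction per row, with `NA` where none could be made
good <- spruce_numeric(c(1.2, NA, 3.4))
validate_prediction_size(good, new_data)

# Wrong size: a row was dropped somewhere upstream
bad <- spruce_numeric(c(1.2, 3.4))
res <- tryCatch(
  validate_prediction_size(bad, new_data),
  error = function(e) e
)
inherits(res, "error")
```

Padding with `NA` rather than dropping rows means users can always bind the predictions back onto `new_data` column-wise.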

```
predict.simple_lm <- function(object, new_data, type = "numeric", ...) {
  # Enforces column order, type, column names, etc
  processed <- forge(new_data, object$blueprint)

  out <- predict_simple_lm_bridge(type, object, processed$predictors)

  validate_prediction_size(out, new_data)

  out
}
```

Finally, we can test out our top level modeling function along with its corresponding `predict()` method.

```
model <- simple_lm(bill_length_mm ~ log(body_mass_g) + species, penguins)

predict(model, penguins)
#> # A tibble: 333 x 1
#> .pred
#> <dbl>
#> 1 39.1
#> 2 39.3
#> 3 36.9
#> 4 37.9
#> 5 38.7
#> 6 38.6
#> 7 42.5
#> 8 36.7
#> 9 39.3
#> 10 41.5
#> # … with 323 more rows
```

By using `forge()`, you automatically get powerful type checking to ensure that the `new_data` is in a form that you expect.

```
# `new_data` isn't a data frame
predict(model, penguins$species)
#> Error: The class of `new_data`, 'factor', is not recognized.
# Missing a required column
predict(model, subset(penguins, select = -body_mass_g))
#> Error: The following required columns are missing: 'body_mass_g'.
# In this case, 'species' is a character,
# but can be losslessly converted to a factor.
# That happens for you automatically and silently.
penguins_chr_species <- transform(penguins, species = as.character(species))

predict(model, penguins_chr_species)
#> # A tibble: 333 x 1
#> .pred
#> <dbl>
#> 1 39.1
#> 2 39.3
#> 3 36.9
#> 4 37.9
#> 5 38.7
#> 6 38.6
#> 7 42.5
#> 8 36.7
#> 9 39.3
#> 10 41.5
#> # … with 323 more rows
# Slightly different from above. Here, 'species' is a character,
# AND has an extra unexpected factor level. It is
# removed with a warning, but you still get a factor
# with the correct levels
penguins_chr_bad_species <- penguins_chr_species
penguins_chr_bad_species$species[1] <- "new_level"

predict(model, penguins_chr_bad_species)
#> Warning: Novel levels found in column 'species': 'new_level'. The levels have
#> been removed, and values have been coerced to 'NA'.
#> # A tibble: 333 x 1
#> .pred
#> <dbl>
#> 1 NA
#> 2 39.3
#> 3 36.9
#> 4 37.9
#> 5 38.7
#> 6 38.6
#> 7 42.5
#> 8 36.7
#> 9 39.3
#> 10 41.5
#> # … with 323 more rows
# This case throws an error.
# Here, 'species' is a double
# when it should have been a factor.
# You can't cast a double to a factor!
penguins_dbl_species <- transform(penguins, species = 1)

predict(model, penguins_dbl_species)
#> Error: Can't convert `species` <double> to match type of `species` <factor<600db>>.
```