butcher

library(butcher)
library(parsnip)

One of the beauties of working with R is the ease with which you can implement intricate models and make challenging data analysis pipelines seem almost trivial. Take, for example, the parsnip package; with the installation of a few associated libraries and a few lines of code, you can fit something as complex as a boosted tree:

library(C50)

fitted_model <- boost_tree(trees = 15) %>%
  set_engine("C5.0") %>%
  fit(as.factor(am) ~ disp + hp, data = mtcars)
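
With the model in hand, prediction is just as compact; as a quick sketch, parsnip’s predict() method returns a tibble of predictions (a .pred_class column for classification):

predict(fitted_model, new_data = mtcars)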

Or, say you’re working with petabytes of data distributed across many nodes; just switch out the parsnip engine:

library(sparklyr)

sc <- spark_connect(master = "local")

mtcars_tbls <- sdf_copy_to(sc, mtcars[, c("am", "disp", "hp")])

fitted_model <- boost_tree(trees = 15) %>%
  set_engine("spark") %>%
  fit(am ~ disp + hp, data = mtcars_tbls)
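
When you’re done with the local cluster, a small housekeeping step closes the connection:

spark_disconnect(sc)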

Yet, while our code may appear compact, the underlying fitted result may not be. Since parsnip works as a wrapper for many modeling packages, its fitted model objects inherit the same properties as those that arise from the original modeling package. A straightforward example is the popular lm function from the base stats package. Whether you leverage parsnip or not, you arrive at the same result:

parsnip_lm <- linear_reg() %>% 
  set_engine("lm") %>% 
  fit(mpg ~ ., data = mtcars) 
parsnip_lm
#> parsnip model object
#> 
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          cyl         disp           hp         drat  
#>    12.30337     -0.11144      0.01334     -0.02148      0.78711  
#>          wt         qsec           vs           am         gear  
#>    -3.71530      0.82104      0.31776      2.52023      0.65541  
#>        carb  
#>    -0.19942

Using just lm:

old_lm <- lm(mpg ~ ., data = mtcars) 
old_lm
#> 
#> Call:
#> lm(formula = mpg ~ ., data = mtcars)
#> 
#> Coefficients:
#> (Intercept)          cyl         disp           hp         drat  
#>    12.30337     -0.11144      0.01334     -0.02148      0.78711  
#>          wt         qsec           vs           am         gear  
#>    -3.71530      0.82104      0.31776      2.52023      0.65541  
#>        carb  
#>    -0.19942
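
We can verify that the two approaches agree exactly; parsnip stores the underlying engine fit in the fit element, so a quick check (expect TRUE) is:

all.equal(coef(parsnip_lm$fit), coef(old_lm))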

Let’s say we take this familiar old_lm approach when building our in-house modeling pipeline. Such a pipeline might entail wrapping lm() in another function, but in doing so, we may end up carrying some junk along:

in_house_model <- function() {
  some_junk_in_the_environment <- runif(1e6) # we didn't know about
  lm(mpg ~ ., data = mtcars) 
}

The linear model fit that exists in our pipeline is:

library(lobstr)
obj_size(in_house_model())
#> 8,022,440 B

Yet it is fundamentally the same as our old_lm, which takes up only:

obj_size(old_lm)
#> 22,224 B
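
The same gap shows up if we serialize both objects; a minimal sketch with temporary files (sizes are illustrative and will vary by platform):

tmp_big   <- tempfile(fileext = ".rds")
tmp_small <- tempfile(fileext = ".rds")
saveRDS(in_house_model(), tmp_big)
saveRDS(old_lm, tmp_small)
file.size(tmp_big)    # on the order of megabytes
file.size(tmp_small)  # on the order of kilobytes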

Ideally, we want to avoid saving this new in_house_model() on disk, when we could have something like old_lm that takes up less memory. So, what the heck is going on here? We can examine possible issues with a fitted model object using the butcher package:

big_lm <- in_house_model()
butcher::weigh(big_lm, threshold = 0, units = "MB")
#> # A tibble: 25 x 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         8.01    
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # … with 15 more rows

The problem here is in the terms component of big_lm. Because of how lm() is implemented in the base stats package (relying on intermediate forms of the data from the model.frame() and model.matrix() output), the environment in which the linear fit was created is carried along in the model object.

We can see this with the env_print function from the rlang package:

library(rlang)
env_print(big_lm$terms)
#> <environment: 0x7fe9e8b00070>
#> parent: <environment: global>
#> bindings:
#>  * some_junk_in_the_environment: <dbl>
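
This behavior is not specific to lm(); any formula in R captures the environment in which it was created. A minimal sketch of the same trap:

f <- local({
  some_junk <- runif(1e6)  # stand-in for accidental environment junk
  mpg ~ .
})
ls(environment(f))  # "some_junk" is still reachable through the formula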

To avoid carrying possible junk around in our production pipeline, whether it is associated with an lm model or something more complex, we can leverage axe_env() from the butcher package:

cleaned_lm <- butcher::axe_env(big_lm, verbose = TRUE)
#> ✔ Memory released: '7,999,256 B'
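
A quick sanity check shows the junk binding is no longer reachable (what replaces the environment is an implementation detail of butcher):

ls(environment(cleaned_lm$terms))  # the junk binding should be gone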

Comparing it against our old_lm, we find:

butcher::weigh(cleaned_lm, threshold = 0, units = "MB")
#> # A tibble: 25 x 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         0.00789 
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # … with 15 more rows

…it now takes up essentially the same memory:

butcher::weigh(old_lm, threshold = 0, units = "MB")
#> # A tibble: 25 x 2
#>    object            size
#>    <chr>            <dbl>
#>  1 terms         0.00781 
#>  2 qr.qr         0.00666 
#>  3 residuals     0.00286 
#>  4 fitted.values 0.00286 
#>  5 effects       0.0014  
#>  6 coefficients  0.00109 
#>  7 call          0.000728
#>  8 model.mpg     0.000304
#>  9 model.cyl     0.000304
#> 10 model.disp    0.000304
#> # … with 15 more rows

Axing the environment, however, is not the only functionality of butcher. The package provides five S3 generics:

- axe_call(): to remove the call object.
- axe_ctrl(): to remove controls associated with training.
- axe_data(): to remove the original training data.
- axe_env(): to remove environments.
- axe_fitted(): to remove fitted values.

In our case with lm, if prediction is the only end product of our modeling pipeline, we can free up a lot of memory by executing all of these axe functions at once. To do so, we simply run butcher():

butchered_lm <- butcher::butcher(big_lm)
predict(butchered_lm, mtcars[, 2:11])
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710 
#>            22.59951            22.11189            26.25064 
#>      Hornet 4 Drive   Hornet Sportabout             Valiant 
#>            21.23740            17.69343            20.38304 
#>          Duster 360           Merc 240D            Merc 230 
#>            14.38626            22.49601            24.41909 
#>            Merc 280           Merc 280C          Merc 450SE 
#>            18.69903            19.19165            14.17216 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
#>            15.59957            15.74222            12.03401 
#> Lincoln Continental   Chrysler Imperial            Fiat 128 
#>            10.93644            10.49363            27.77291 
#>         Honda Civic      Toyota Corolla       Toyota Corona 
#>            29.89674            29.51237            23.64310 
#>    Dodge Challenger         AMC Javelin          Camaro Z28 
#>            16.94305            17.73218            13.30602 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
#>            16.69168            28.29347            26.15295 
#>        Lotus Europa      Ford Pantera L        Ferrari Dino 
#>            27.63627            18.87004            19.69383 
#>       Maserati Bora          Volvo 142E 
#>            13.94112            24.36827
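
If you want to see what was stripped, weighing the butchered object the same way makes the difference explicit (output omitted here; exact sizes will vary):

butcher::weigh(butchered_lm, threshold = 0, units = "MB")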

Alternatively, we can pick and choose specific axe functions, removing only those parts of the model object that we are no longer interested in characterizing.

butchered_lm <- big_lm %>%
  butcher::axe_env() %>% 
  butcher::axe_fitted()
predict(butchered_lm, mtcars[, 2:11])
#>           Mazda RX4       Mazda RX4 Wag          Datsun 710 
#>            22.59951            22.11189            26.25064 
#>      Hornet 4 Drive   Hornet Sportabout             Valiant 
#>            21.23740            17.69343            20.38304 
#>          Duster 360           Merc 240D            Merc 230 
#>            14.38626            22.49601            24.41909 
#>            Merc 280           Merc 280C          Merc 450SE 
#>            18.69903            19.19165            14.17216 
#>          Merc 450SL         Merc 450SLC  Cadillac Fleetwood 
#>            15.59957            15.74222            12.03401 
#> Lincoln Continental   Chrysler Imperial            Fiat 128 
#>            10.93644            10.49363            27.77291 
#>         Honda Civic      Toyota Corolla       Toyota Corona 
#>            29.89674            29.51237            23.64310 
#>    Dodge Challenger         AMC Javelin          Camaro Z28 
#>            16.94305            17.73218            13.30602 
#>    Pontiac Firebird           Fiat X1-9       Porsche 914-2 
#>            16.69168            28.29347            26.15295 
#>        Lotus Europa      Ford Pantera L        Ferrari Dino 
#>            27.63627            18.87004            19.69383 
#>       Maserati Bora          Volvo 142E 
#>            13.94112            24.36827
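
In a deployment setting, this butchered object is the one worth serializing; a minimal sketch (the file path is hypothetical):

saveRDS(butchered_lm, "lm-for-production.rds")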

butcher makes it easy to axe parts of the fitted output that are no longer needed, without sacrificing much functionality from the original model object.