When you are presented with longitudinal data, it is useful to summarise the data into a format where you have one row per key. That means one row per unique identifier of the data. If you aren’t sure what this means, see the vignette “Longitudinal Data Structures”.
Say, for example, you wanted to find features in the wages data, which looks like this:
library(brolgar)
wages
#> # A tsibble: 6,402 x 9 [!]
#> # Key: id [888]
#> id ln_wages xp ged xp_since_ged black hispanic high_grade unemploy_…¹
#> <int> <dbl> <dbl> <int> <dbl> <int> <int> <int> <dbl>
#> 1 31 1.49 0.015 1 0.015 0 1 8 3.21
#> 2 31 1.43 0.715 1 0.715 0 1 8 3.21
#> 3 31 1.47 1.73 1 1.73 0 1 8 3.21
#> 4 31 1.75 2.77 1 2.77 0 1 8 3.3
#> 5 31 1.93 3.93 1 3.93 0 1 8 2.89
#> 6 31 1.71 4.95 1 4.95 0 1 8 2.49
#> 7 31 2.09 5.96 1 5.96 0 1 8 2.6
#> 8 31 2.13 6.98 1 6.98 0 1 8 4.8
#> 9 36 1.98 0.315 1 0.315 0 0 9 4.89
#> 10 36 1.80 0.983 1 0.983 0 0 9 7.4
#> # … with 6,392 more rows, and abbreviated variable name ¹unemploy_rate
You can return a dataset that has one row per key, with say the minimum value for ln_wages, for each key:
#> # A tibble: 888 × 2
#> id min
#> <int> <dbl>
#> 1 31 1.43
#> 2 36 1.80
#> 3 53 1.54
#> 4 122 0.763
#> 5 134 2.00
#> 6 145 1.48
#> 7 155 1.54
#> 8 173 1.56
#> 9 206 2.03
#> 10 207 1.58
#> # … with 878 more rows
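The table above is printed without the call that produced it. A minimal sketch of one way to get it, assuming the result is stored as wages_min (the name the plotting code below expects) and that the features function introduced later in this vignette is given a named list containing min:

```r
library(brolgar)

# One row per key (id), holding the minimum of ln_wages for that key.
# The list name ("min") becomes the name of the summary column.
wages_min <- features(wages, ln_wages, list(min = min))

wages_min
```

The name on the left of each list entry is what appears as the column name, so list(lowest = min) would give a column called lowest instead.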
This then allows us to summarise these kinds of data, for example to find the distribution of minimum values:
library(ggplot2)
ggplot(wages_min,
       aes(x = min)) +
  geom_density()
We call these summaries features of the data.
This vignette discusses how to calculate these features of the data.
We can calculate features of longitudinal data using the features function (from fabletools, made available in brolgar).
features works by specifying the data, the variable to summarise, and the feature to calculate:
features(<DATA>, <VARIABLE>, <FEATURE>)
or with the pipe:
<DATA> %>% features(<VARIABLE>, <FEATURE>)
As an example, we can calculate a five number summary (minimum, 25th percentile, median, 75th percentile, and maximum) of the data using feat_five_num, like so:
wages_five <- wages %>%
  features(ln_wages, feat_five_num)

wages_five
#> # A tibble: 888 × 6
#> id min q25 med q75 max
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 31 1.43 1.48 1.73 2.02 2.13
#> 2 36 1.80 1.97 2.32 2.59 2.93
#> 3 53 1.54 1.58 1.71 1.89 3.24
#> 4 122 0.763 2.10 2.19 2.46 2.92
#> 5 134 2.00 2.28 2.36 2.79 2.93
#> 6 145 1.48 1.58 1.77 1.89 2.04
#> 7 155 1.54 1.83 2.22 2.44 2.64
#> 8 173 1.56 1.68 2.00 2.05 2.34
#> 9 206 2.03 2.07 2.30 2.45 2.48
#> 10 207 1.58 1.87 2.15 2.26 2.66
#> # … with 878 more rows
Here we are taking the wages data, piping it to features, and then telling it to summarise the ln_wages variable, using feat_five_num.
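To make the idea concrete, here is a base R sketch of what a per-key summary amounts to: split a variable by key, then apply each function in a named list to every key's values. This is only an illustration on made-up data (toy, feats, and toy_features are hypothetical names), not how fabletools implements features.

```r
# Toy longitudinal data: two keys with a few observations each.
toy <- data.frame(
  id = c(1, 1, 1, 2, 2),
  value = c(3, 1, 2, 10, 20)
)

# A named list of functions plays the role of <FEATURE>.
feats <- list(min = min, max = max)

# Split the variable by key, then apply every function to each key's values.
by_key <- split(toy$value, toy$id)
summary_rows <- lapply(names(by_key), function(k) {
  data.frame(id = k, lapply(feats, function(f) f(by_key[[k]])))
})

# Bind the one-row summaries together: one row per key, one column per feature.
toy_features <- do.call(rbind, summary_rows)
toy_features
```

The result has one row per key and one column per entry in the list, which is the same shape as the output of features shown above.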
There are several handy functions for calculating features of the data that brolgar provides. These all start with feat_.
You can, for example, find the keys whose values only increase or only decrease with feat_monotonic:
wages_mono <- wages %>%
  features(ln_wages, feat_monotonic)

wages_mono
#> # A tibble: 888 × 5
#> id increase decrease unvary monotonic
#> <int> <lgl> <lgl> <lgl> <lgl>
#> 1 31 FALSE FALSE FALSE FALSE
#> 2 36 FALSE FALSE FALSE FALSE
#> 3 53 FALSE FALSE FALSE FALSE
#> 4 122 FALSE FALSE FALSE FALSE
#> 5 134 FALSE FALSE FALSE FALSE
#> 6 145 FALSE FALSE FALSE FALSE
#> 7 155 FALSE FALSE FALSE FALSE
#> 8 173 FALSE FALSE FALSE FALSE
#> 9 206 TRUE FALSE FALSE TRUE
#> 10 207 FALSE FALSE FALSE FALSE
#> # … with 878 more rows
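As a rough illustration of what these four columns mean, the sketch below classifies a single series by the sign of its successive differences. The helper names (is_increasing, is_decreasing, is_unvarying, sketch_monotonic) are hypothetical, and the use of strict inequalities is an assumption; this is not brolgar's source code.

```r
# Hypothetical helpers: look at the differences between consecutive values.
is_increasing <- function(x) all(diff(x) > 0)
is_decreasing <- function(x) all(diff(x) < 0)
is_unvarying <- function(x) all(diff(x) == 0)

# Return a named list, the same shape a feature function returns.
sketch_monotonic <- function(x) {
  list(
    increase = is_increasing(x),
    decrease = is_decreasing(x),
    unvary = is_unvarying(x),
    monotonic = is_increasing(x) || is_decreasing(x)
  )
}

# ln_wages values for id 206, as printed later in this vignette.
sketch_monotonic(c(2.03, 2.30, 2.48))

# The first three ln_wages values for id 31 go down and then up.
sketch_monotonic(c(1.49, 1.43, 1.47))
```

Under these assumptions the first call flags the series as increasing and monotonic, which agrees with the row for id 206 above, while the second is neither increasing nor decreasing.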
These could then be used to identify individuals who only increase like so:
library(dplyr)
wages_mono %>%
  filter(increase)
#> # A tibble: 50 × 5
#> id increase decrease unvary monotonic
#> <int> <lgl> <lgl> <lgl> <lgl>
#> 1 206 TRUE FALSE FALSE TRUE
#> 2 295 TRUE FALSE FALSE TRUE
#> 3 518 TRUE FALSE FALSE TRUE
#> 4 1508 TRUE FALSE FALSE TRUE
#> 5 2178 TRUE FALSE FALSE TRUE
#> 6 2194 TRUE FALSE FALSE TRUE
#> 7 2330 TRUE FALSE FALSE TRUE
#> 8 2456 TRUE FALSE FALSE TRUE
#> 9 2612 TRUE FALSE FALSE TRUE
#> 10 2890 TRUE FALSE FALSE TRUE
#> # … with 40 more rows
They could then be joined back to the data:
wages_mono_join <- wages_mono %>%
  filter(increase) %>%
  left_join(wages, by = "id")

wages_mono_join
#> # A tibble: 164 × 13
#> id increase decrease unvary monotonic ln_wages xp ged xp_sinc…¹ black
#> <int> <lgl> <lgl> <lgl> <lgl> <dbl> <dbl> <int> <dbl> <int>
#> 1 206 TRUE FALSE FALSE TRUE 2.03 1.87 0 0 0
#> 2 206 TRUE FALSE FALSE TRUE 2.30 2.81 0 0 0
#> 3 206 TRUE FALSE FALSE TRUE 2.48 4.31 0 0 0
#> 4 295 TRUE FALSE FALSE TRUE 1.79 2.03 0 0 0
#> 5 295 TRUE FALSE FALSE TRUE 1.81 3.12 0 0 0
#> 6 295 TRUE FALSE FALSE TRUE 2.11 4.16 0 0 0
#> 7 295 TRUE FALSE FALSE TRUE 2.13 5.08 0 0 0
#> 8 295 TRUE FALSE FALSE TRUE 2.31 6.58 0 0 0
#> 9 518 TRUE FALSE FALSE TRUE 1.27 0.525 1 0.525 1
#> 10 518 TRUE FALSE FALSE TRUE 1.61 1.93 1 1.93 1
#> # … with 154 more rows, 3 more variables: hispanic <int>, high_grade <int>,
#> # unemploy_rate <dbl>, and abbreviated variable name ¹xp_since_ged
And these could be plotted:
ggplot(wages_mono_join,
       aes(x = xp,
           y = ln_wages,
           group = id)) +
  geom_line()
To get a sense of the data and where it came from, we could create a plot with gghighlight to highlight those that only increase, by using gghighlight(increase). Since increase is a logical, this tells gghighlight to highlight those that are TRUE.
library(gghighlight)
wages_mono %>%
  left_join(wages, by = "id") %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) +
  geom_line() +
  gghighlight(increase)
To explore the available features, see the function reference.
To create your own features or summaries to pass to features, you provide a named list of functions. For example:
library(brolgar)
feat_three <- list(min = min,
                   med = median,
                   max = max)

feat_three
#> $min
#> function (..., na.rm = FALSE) .Primitive("min")
#>
#> $med
#> function (x, na.rm = FALSE, ...)
#> UseMethod("median")
#> <bytecode: 0x13b9c4fa0>
#> <environment: namespace:stats>
#>
#> $max
#> function (..., na.rm = FALSE) .Primitive("max")
These are then passed to features like so:
wages %>%
  features(ln_wages, feat_three)
#> # A tibble: 888 × 4
#> id min med max
#> <int> <dbl> <dbl> <dbl>
#> 1 31 1.43 1.73 2.13
#> 2 36 1.80 2.32 2.93
#> 3 53 1.54 1.71 3.24
#> 4 122 0.763 2.19 2.92
#> 5 134 2.00 2.36 2.93
#> 6 145 1.48 1.77 2.04
#> 7 155 1.54 2.22 2.64
#> 8 173 1.56 2.00 2.34
#> 9 206 2.03 2.30 2.48
#> 10 207 1.58 2.15 2.66
#> # … with 878 more rows
The same list can be applied to other data, for example heights:
heights %>%
  features(height_cm, feat_three)
#> # A tibble: 144 × 4
#> country min med max
#> <chr> <dbl> <dbl> <dbl>
#> 1 Afghanistan 161. 167. 168.
#> 2 Albania 168. 170. 170.
#> 3 Algeria 166. 169 171.
#> 4 Angola 159. 167. 169.
#> 5 Argentina 167. 168. 174.
#> 6 Armenia 164. 169. 172.
#> 7 Australia 170 172. 178.
#> 8 Austria 162. 167. 179.
#> 9 Azerbaijan 170. 172. 172.
#> 10 Bahrain 161. 164. 164
#> # … with 134 more rows
Inside brolgar, the features are created with the following syntax:
feat_five_num <- function(x, ...) {
  list(
    min = b_min(x, ...),
    q25 = b_q25(x, ...),
    med = b_median(x, ...),
    q75 = b_q75(x, ...),
    max = b_max(x, ...)
  )
}
Here the b_ functions are summary functions with a default of na.rm = TRUE. The quantile functions also use type = 8 and names = FALSE.
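To show what those defaults do in practice, here is a base R sketch of helpers written in the same spirit. The names sketch_min, sketch_q25, and sketch_median are hypothetical stand-ins and are not brolgar's own definitions; only the arguments na.rm, type, and names come from the description above.

```r
# Hypothetical stand-ins for b_-style helpers, using base R only.
# Each one drops missing values by default.
sketch_min <- function(x, ...) min(x, na.rm = TRUE, ...)
sketch_median <- function(x, ...) median(x, na.rm = TRUE, ...)

# For quantiles, type = 8 selects the estimator and names = FALSE
# returns a bare number instead of a vector named "25%".
sketch_q25 <- function(x, ...) {
  quantile(x, probs = 0.25, type = 8, names = FALSE, na.rm = TRUE, ...)
}

# A short series with a missing value.
x <- c(1.43, NA, 1.47, 1.75, 1.93)

min(x)        # NA, because base min() keeps missing values by default
sketch_min(x) # the smallest non-missing value
sketch_q25(x) # unnamed 25th percentile using type = 8
quantile(x, probs = 0.25, na.rm = TRUE) # base default (type = 7), named
```

Returning an unnamed value matters inside a feature list, because the column names should come from the list (q25, med, and so on) rather than from quantile's own labels.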
If you want to run many or all features from a package on your data, you can collect them all with feature_set. For example:
library(fabletools)
feat_brolgar <- feature_set(pkgs = "brolgar")
length(feat_brolgar)
#> [1] 6
You could then run these like so:
wages %>%
  features(ln_wages, feat_brolgar)
#> # A tibble: 888 × 46
#> id min...1 med...2 max...3 min...4 q25...5 med...6 q75...7 max...8 min...9
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 31 1.43 1.73 2.13 1.43 1.48 1.73 2.02 2.13 1.43
#> 2 36 1.80 2.32 2.93 1.80 1.97 2.32 2.59 2.93 1.80
#> 3 53 1.54 1.71 3.24 1.54 1.58 1.71 1.89 3.24 1.54
#> 4 122 0.763 2.19 2.92 0.763 2.10 2.19 2.46 2.92 0.763
#> 5 134 2.00 2.36 2.93 2.00 2.28 2.36 2.79 2.93 2.00
#> 6 145 1.48 1.77 2.04 1.48 1.58 1.77 1.89 2.04 1.48
#> 7 155 1.54 2.22 2.64 1.54 1.83 2.22 2.44 2.64 1.54
#> 8 173 1.56 2.00 2.34 1.56 1.68 2.00 2.05 2.34 1.56
#> 9 206 2.03 2.30 2.48 2.03 2.07 2.30 2.45 2.48 2.03
#> 10 207 1.58 2.15 2.66 1.58 1.87 2.15 2.26 2.66 1.58
#> # … with 878 more rows, and 36 more variables: max...10 <dbl>,
#> # range_diff...11 <dbl>, iqr...12 <dbl>, var...13 <dbl>, sd...14 <dbl>,
#> # mad...15 <dbl>, iqr...16 <dbl>, min...17 <dbl>, max...18 <dbl>,
#> # median <dbl>, mean <dbl>, q25...21 <dbl>, q75...22 <dbl>, range1 <dbl>,
#> # range2 <dbl>, range_diff...25 <dbl>, sd...26 <dbl>, var...27 <dbl>,
#> # mad...28 <dbl>, iqr...29 <dbl>, increase...30 <dbl>, decrease...31 <dbl>,
#> # unvary...32 <dbl>, diff_min <dbl>, diff_q25 <dbl>, diff_median <dbl>, …
For more information, see ?fabletools::feature_set.
If you create features in your own package and want to make them accessible with feature_set, do the following.
Functions can be registered via fabletools::register_feature(). To register features in a package, I create a file called zzz.R, and use the .onLoad(...) function to set this up on loading the package:
.onLoad <- function(...) {
  fabletools::register_feature(feat_three_num, c("summary"))
  # ... and as many as you want here!
}