Title: Simulate and Diagnose (Generalized) Linear Models
Description: Simulate samples from populations with known covariate distributions, generate response variables according to common linear and generalized linear model families, draw from sampling distributions of regression estimates, and perform visual inference on diagnostics from model fits.
Authors: Alex Reinhart [aut, cre]
Maintainer: Alex Reinhart <[email protected]>
License: MIT + file LICENSE
Version: 0.2.1
Built: 2025-02-09 06:02:45 UTC
Source: https://github.com/capnrefsmmat/regressinator
Use broom::augment()
to augment a model fit with residual and fit
information, then reformat the resulting data frame into a "long" format with
one row per predictor per observation, to facilitate plotting of the result.
augment_longer(x, ...)

x: A model fit object, such as those returned by lm() or glm().
...: Additional arguments passed to broom::augment().
The name comes by analogy to tidyr::pivot_longer(), and the concept of long versus wide data formats.

A data frame (tibble) in similar form to those produced by broom::augment(), but expanded to have one row per predictor per observation. Columns .predictor_name and .predictor_value identify the predictor and its value. An additional column .obs records the original observation numbers so results can be matched to observations in the original model data.
Factor predictors (as factors, logical, or character vectors) can't coexist with numeric variables in the .predictor_value column. If there are some numeric and some factor predictors, the factor predictors will automatically be omitted. If all predictors are factors, they will be combined into one factor with all levels. However, if a numeric variable is converted to factor in the model formula, such as with y ~ factor(x), the function cannot determine the appropriate types and will raise an error. Create factors as needed in the source data frame before fitting the model to avoid this issue.
partial_residuals(), binned_residuals()
fit <- lm(mpg ~ cyl + disp + hp, data = mtcars)

# each observation appears 3 times, once per predictor:
augment_longer(fit)
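The long format is convenient for faceted residual plots. A minimal plotting sketch, assuming ggplot2 is installed (.resid comes from broom::augment(), and the .predictor_* columns are described above):

library(ggplot2)

fit <- lm(mpg ~ cyl + disp + hp, data = mtcars)

# One facet per predictor, with residuals on the vertical axis:
augment_longer(fit) |>
  ggplot(aes(x = .predictor_value, y = .resid)) +
  geom_point() +
  facet_wrap(vars(.predictor_name), scales = "free_x")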
Generates a data frame containing a model's predictors, the residuals, and the randomized quantile residuals as additional columns.
augment_quantile(x, ...)
augment_quantile_longer(x, ...)

x: Fitted model to obtain randomized quantile residuals from.
...: Additional arguments to pass to broom::augment().
Randomized quantile residuals provide more interpretable residuals for generalized linear models (GLMs), such as logistic regression. See Dunn and Smyth (1996) for details, or review the examples provided in vignette("DHARMa", package = "DHARMa").
Let $F(y \mid x)$ be the predicted cumulative distribution function for $Y$ when $X = x$, using the fitted GLM. When the response is continuous, the randomized quantile residual for observation $i$ is

$$q_i = F(y_i \mid x_i).$$

When the response is discrete, let

$$a_i = \lim_{y \to y_i^-} F(y \mid x_i) \qquad \text{and} \qquad b_i = F(y_i \mid x_i),$$

then draw the randomized quantile residual as

$$q_i \sim \text{Uniform}(a_i, b_i).$$

As cumulative distribution functions are right-continuous, this "jitters" the values between the discrete steps, resulting in a residual that is uniformly distributed when the model is correct.

Some definitions of randomized quantile residuals transform the resulting values using the standard normal inverse cdf, so they are normally distributed. That step is omitted here, as uniform residuals are easy to work with.
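As an illustration of the discrete case, a randomized quantile residual for a single Poisson observation could be computed by hand as follows. This is only a sketch with hypothetical values mu_hat and y_obs; augment_quantile() handles all of this automatically:

# Fitted Poisson mean and observed count (hypothetical values)
mu_hat <- 3.2
y_obs <- 4

a <- ppois(y_obs - 1, mu_hat)  # F(y-): P(Y < y_obs)
b <- ppois(y_obs, mu_hat)      # F(y):  P(Y <= y_obs)

# Draw the randomized quantile residual uniformly between the two steps
q <- runif(1, min = a, max = b)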
Data frame with one row per observation used to fit x, including a .quantile.resid column containing the quantile residuals. See broom::augment() and its methods for details of other columns.
For augment_quantile_longer(), the output is in "long" format with one row per predictor per observation. Columns .predictor_name and .predictor_value identify the predictor and its value. An additional column .obs records the original observation numbers so results can be matched to observations in the original model data. See Limitations in augment_longer() for limitations on factor predictors.

Uses broom::augment() to generate the data frame, then uses the DHARMa package to generate randomized quantile residuals for the model.
Dunn, Peter K., and Gordon K. Smyth (1996). "Randomized Quantile Residuals." Journal of Computational and Graphical Statistics 5 (3): 236–44. doi:10.2307/1390802
vignette("logistic-regression-diagnostics")
and
vignette("other-glm-diagnostics")
for examples of plotting and
interpreting randomized quantile residuals; augment_longer()
;
broom::augment()
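A minimal usage sketch, assuming the DHARMa package is installed (it is used to generate the quantile residuals):

fit <- glm(am ~ hp, data = mtcars, family = binomial)

# One row per observation, with a .quantile.resid column:
augment_quantile(fit)

# Long format, one row per predictor per observation:
augment_quantile_longer(fit)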
Groups a data frame (similarly to dplyr::group_by()) based on the values of a column, either by dividing up the range into equal pieces or by quantiles.
bin_by_interval(.data, col, breaks = NULL)
bin_by_quantile(.data, col, breaks = NULL)

.data: Data frame to bin.
col: Column to bin by.
breaks: Number of bins to create.
bin_by_interval() breaks the numerical range of that column into equal-sized intervals, or into intervals specified by breaks. bin_by_quantile() splits the range into pieces based on quantiles of the data, so each interval contains roughly an equal number of observations.

Grouped data frame, similar to those returned by dplyr::group_by(). An additional column .bin indicates the bin number for each group. Use dplyr::summarize() to calculate values within each group, or other dplyr operations that work on groups.
suppressMessages(library(dplyr))

cars |>
  bin_by_interval(speed, breaks = 5) |>
  summarize(mean_speed = mean(speed), mean_dist = mean(dist))

cars |>
  bin_by_quantile(speed, breaks = 5) |>
  summarize(mean_speed = mean(speed), mean_dist = mean(dist))
Construct a data frame by binning the fitted values or predictors of a model into discrete bins of equal width, and calculating the average value of the residuals within each bin.
binned_residuals(fit, predictors = !".fitted", breaks = NULL, ...)

fit: The model to obtain residuals for. This can be a model fit with lm() or glm().
predictors: Predictors to calculate binned residuals for. Defaults to all predictors, skipping factors. Predictors can be specified using tidyselect syntax.
breaks: Number of bins to create. If NULL, a default is chosen.
...: Additional arguments passed on when computing the residuals.
In many generalized linear models, the residual plots (Pearson or deviance) are not useful because the response variable takes on very few possible values, causing strange patterns in the residuals. For instance, in logistic regression, plotting the residuals versus covariates usually produces two curved lines.
If we first bin the data, i.e. divide up the observations into breaks bins based on their fitted values, we can calculate the average residual within each bin. This can be more informative: if a region has 20 observations and its average residual value is large, this suggests those observations are collectively poorly fit. We can also bin each predictor and calculate averages within those bins, allowing the detection of misspecification for specific model terms.
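For example, the binned residuals can be plotted per predictor. A sketch, assuming ggplot2 is installed and using the output columns described below:

library(ggplot2)

fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Average residual in each bin, plotted against the bin's mean predictor
# value, with one facet per predictor:
binned_residuals(fit, breaks = 5) |>
  ggplot(aes(x = predictor_mean, y = resid_mean)) +
  geom_point() +
  facet_wrap(vars(predictor_name), scales = "free_x")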
Data frame (tibble) with one row per bin per selected predictor, and the following columns:

.bin: Bin number.
n: Number of observations in this bin.
predictor_name: Name of the predictor that has been binned.
predictor_min, predictor_max, predictor_mean, predictor_sd: Minimum, maximum, mean, and standard deviation of the predictor (or fitted values).
resid_mean: Mean residual in this bin.
resid_sd: Standard deviation of residuals in this bin.
Factor predictors (as factors, logical, or character vectors) are detected automatically and omitted. However, if a numeric variable is converted to factor in the model formula, such as with y ~ factor(x), the function cannot determine the appropriate type and will raise an error. Create factors as needed in the source data frame before fitting the model to avoid this issue.
Gelman, A., Hill, J., and Vehtari, A. (2021). Regression and Other Stories. Section 14.5. Cambridge University Press.
partial_residuals() for the related partial residuals; vignette("logistic-regression-diagnostics") and vignette("other-glm-diagnostics") for examples of use and interpretation of binned residuals in logistic regression and GLMs; bin_by_interval() and bin_by_quantile() to bin data and calculate other values in each bin
fit <- lm(mpg ~ disp + hp, data = mtcars)

# Automatically bins both predictors:
binned_residuals(fit, breaks = 5)

# Just bin one predictor, selected with tidyselect syntax. Multiple could be
# selected with c().
binned_residuals(fit, disp, breaks = 5)

# Bin the fitted values:
binned_residuals(fit, predictors = .fitted)

# Bins are made using the predictor, not regressors derived from it, so here
# disp is binned, not its polynomial
fit2 <- lm(mpg ~ poly(disp, 2), data = mtcars)
binned_residuals(fit2)
Replace each entry in a vector with its corresponding numeric value, for instance to use a factor variable to specify intercepts for different groups in a regression model.
by_level(x, ...)

x: Vector of factor values.
...: Mapping from factor levels to values. Can be provided either as a series of named arguments, whose names correspond to factor levels, or as a single named vector.
Named vector of the same length as x, with values replaced with those specified. Names are the original factor level names.
rfactor() to draw random factor levels, and the forcats package (https://forcats.tidyverse.org/) for additional factor manipulation tools
foo <- factor(c("spam", "ham", "spam", "ducks")) by_level(foo, spam = 4, ham = 10, ducks = 16.7) by_level(foo, c("spam" = 4, "ham" = 10, "ducks" = 16.7)) # to define a population with a factor that affects the regression intercept intercepts <- c("foo" = 2, "bar" = 30, "baz" = 7) pop <- population( group = predictor(rfactor, levels = c("foo", "bar", "baz"), prob = c(0.1, 0.6, 0.3)), x = predictor(runif, min = 0, max = 10), y = response(by_level(group, intercepts) + 0.3 * x, error_scale = 1.5) ) sample_x(pop, 5)
foo <- factor(c("spam", "ham", "spam", "ducks")) by_level(foo, spam = 4, ham = 10, ducks = 16.7) by_level(foo, c("spam" = 4, "ham" = 10, "ducks" = 16.7)) # to define a population with a factor that affects the regression intercept intercepts <- c("foo" = 2, "bar" = 30, "baz" = 7) pop <- population( group = predictor(rfactor, levels = c("foo", "bar", "baz"), prob = c(0.1, 0.6, 0.3)), x = predictor(runif, min = 0, max = 10), y = response(by_level(group, intercepts) + 0.3 * x, error_scale = 1.5) ) sample_x(pop, 5)
Allows specification of the random component and link function for a response variable. In principle this could be used to specify any GLM family, but it is usually easier to use the predefined families, such as gaussian() and binomial().
custom_family(distribution, inverse_link)

distribution: The distribution of the random component. This should be in the form of a function taking one argument, the vector of values on the inverse link scale, and returning a vector of draws from the distribution.
inverse_link: The inverse link function.
A GLM is specified by a combination of:

- Random component, i.e. the distribution that Y is drawn from
- Link function relating the mean of the random component to the linear predictor
- Linear predictor

Using custom_family() we can specify the random component and link function, while the linear predictor is set in population() when setting up the population relationships. A family specified this way can be used to specify a population (via population()), but can't be used to estimate a model (such as with glm()).
A family object representing this family.

ols_with_error() for the special case of linear regression with a custom error distribution
# A zero-inflated Poisson family
rzeroinfpois <- function(ys) {
  n <- length(ys)
  rpois(n, lambda = ys * rbinom(n, 1, prob = 0.4))
}

custom_family(rzeroinfpois, exp)
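A family built this way can then be supplied to response() when defining a population. A sketch continuing the example above (the coefficients and predictor distribution are chosen arbitrarily for illustration):

pop <- population(
  x = predictor(runif, min = 0, max = 2),
  y = response(0.5 + 0.8 * x,
               family = custom_family(rzeroinfpois, exp))
)

# Draw predictors, then simulate responses from the custom family:
pop |> sample_x(5) |> sample_y()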
Decrypts the message printed by model_lineup() indicating the location of the true diagnostics in the lineup.
decrypt(...)

...: Message to decrypt, specifying the location of the true diagnostics.
The decrypted message.
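A sketch of the workflow. The encrypted message differs in every session, so the string below is only a placeholder taken from the model_lineup() documentation:

fit <- lm(dist ~ speed, data = cars)

# Prints a message like: decrypt("sD0f gCdC En JP2EdEPn ZY")
lineup <- model_lineup(fit, nsim = 5)

# Paste the printed call into the console to reveal the true position, e.g.
# decrypt("sD0f gCdC En JP2EdEPn ZY")
# produces a string such as "True data in position 5"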
Calculates the average value of the response variable, and places this on the link scale. Plotting these against a predictor (by dividing the dataset into bins) can help assess the choice of link function.
empirical_link(response, family, na.rm = FALSE)

response: Vector of response variable values.
family: Family object representing the response distribution and link function. Only the link function will be used.
na.rm: Should NA values be removed before the mean is calculated?
Mean response value, on the link scale.
suppressMessages(library(dplyr))
suppressMessages(library(ggplot2))

mtcars |>
  bin_by_interval(disp, breaks = 5) |>
  summarize(
    mean_disp = mean(disp),
    link = empirical_link(am, binomial())
  ) |>
  ggplot(aes(x = mean_disp, y = link)) +
  geom_point()
A lineup hides diagnostics among "null" diagnostics, i.e. the same diagnostics calculated using models fit to data where all model assumptions are correct. For each null diagnostic, model_lineup() simulates new responses from the model using the fitted covariate values and the model's error distribution, link function, and so on. Hence the new response values are generated under ideal conditions: the fitted model is true and all assumptions hold. decrypt() reveals which diagnostics are the true diagnostics.
model_lineup(fit, fn = augment, nsim = 20, ...)

fit: A model fit to data, such as by lm() or glm().
fn: A diagnostic function. The function's first argument should be the fitted model, and it must return a data frame. Defaults to broom::augment().
nsim: Number of total diagnostics. For example, if nsim = 20, the diagnostics for the fitted model are hidden among 19 sets of null diagnostics.
...: Additional arguments passed to fn each time it is called.
To generate different kinds of diagnostics, the user can provide a custom fn. The fn should take a model fit as its argument and return a data frame. For instance, the data frame might contain one row per observation and include the residuals and fitted values for each observation; or it might be a single row containing a summary statistic or test statistic.

fn will be called on the original fit provided. Then parametric_boot_distribution() will be used to simulate data from the model fit nsim - 1 times, refit the model to each simulated dataset, and run fn on each refit model. The null distribution is conditional on X, i.e. the covariates used will be identical, and only the response values will be simulated. The data frames are concatenated with an additional .sample column identifying which fit each row came from.

When called, this function will print a message such as decrypt("sD0f gCdC En JP2EdEPn ZY"). This is how to get the location of the true diagnostics among the null diagnostics: evaluating this in the R console will produce a string such as "True data in position 5".
A data frame (tibble) with columns corresponding to the columns returned by fn. The additional column .sample indicates which set of diagnostics each row is from. For instance, if the true data is in position 5, selecting rows with .sample == 5 will retrieve the diagnostics from the original model fit.
Because this function uses S3 generic methods such as simulate() and update(), it can be used with any model fit for which methods are provided. In base R, this includes lm() and glm().
The model provided as fit must be fit using the data argument to provide a data frame. For example:

fit <- lm(dist ~ speed, data = cars)

When simulating new data, this function provides the simulated data as the data argument and re-fits the model. If you instead refer directly to local variables in the model formula, this will not work. For example, if you fit a model this way:

# will not work
fit <- lm(cars$dist ~ cars$speed)

it will not be possible to refit the model using simulated datasets, as that would require modifying your environment to edit cars.
Buja et al. (2009). Statistical inference for exploratory data analysis and model diagnostics. Philosophical Transactions of the Royal Society A, 367 (1906), pp. 4361-4383. doi:10.1098/rsta.2009.0120
Wickham et al. (2010). Graphical inference for infovis. IEEE Transactions on Visualization and Computer Graphics, 16 (6), pp. 973-979. doi:10.1109/TVCG.2010.161
parametric_boot_distribution() to simulate draws by using the fitted model to draw new response values; sampling_distribution() to simulate draws from the population distribution, rather than from the model
fit <- lm(dist ~ speed, data = cars)

model_lineup(fit, nsim = 5)

resids_vs_speed <- function(f) {
  data.frame(resid = residuals(f),
             speed = model.frame(f)$speed)
}

model_lineup(fit, fn = resids_vs_speed, nsim = 5)
The ols_with_error() family can represent any non-Gaussian error, provided random variates can be drawn by an R function. A family specified this way can be used to specify a population (via population()), but can't be used to estimate a model (such as with glm()).
ols_with_error(error, ...)

error: Function that can draw random variables from the non-Gaussian distribution, or a string giving the name of the function. For example, rt draws t-distributed random variates.
...: Further arguments passed to the error function when drawing random variates.
A family object representing this family.
custom_family() for fully custom families, including for GLMs
# t-distributed errors with 3 degrees of freedom
ols_with_error(rt, df = 3)

# A linear regression with t-distributed error, using error_scale to make
# errors large
population(
  x1 = predictor(rnorm, mean = 4, sd = 10),
  x2 = predictor(runif, min = 0, max = 10),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2,
               family = ols_with_error(rt, df = 4),
               error_scale = 2.5)
)

# Cauchy-distributed errors
ols_with_error(rcauchy, scale = 3)

# A contaminated error distribution, where
# 95% of observations are Gaussian and 5% are Cauchy
rcontaminated <- function(n) {
  contaminant <- rbinom(n, 1, prob = 0.05)
  return(ifelse(contaminant == 1,
                rcauchy(n, scale = 20),
                rnorm(n, sd = 1)))
}
ols_with_error(rcontaminated)
Repeatedly simulates new response values by using the fitted model, holding the covariates fixed. By default, refits the same model to each simulated dataset, but an alternative model can be provided. Estimates, confidence intervals, or other quantities are extracted from each fitted model and returned as a tidy data frame.
parametric_boot_distribution(fit, alternative_fit = fit, data = get_data(fit), fn = tidy, nsim = 100, ...)

fit: A model fit to data, such as by lm() or glm(), used to simulate new response values.
alternative_fit: A model fit to data, to refit to the data sampled from fit. Defaults to fit, so the same model is refit each time.
data: Data frame to be used in the simulation. Must contain the predictors needed for both fit and alternative_fit. Defaults to the data used to fit fit.
fn: Function to call on each new model fit to produce a data frame of estimates. Defaults to broom::tidy().
nsim: Number of total simulations to run.
...: Additional arguments passed to fn each time it is called.
The default behavior samples from a model and refits the same model to the sampled data; this is useful when, for example, exploring how model diagnostics look when the model is well-specified. Another common use of the parametric bootstrap is hypothesis testing, where we might simulate from a null model and fit an alternative model to the data, to obtain the null distribution of a particular estimate or statistic. Provide alternative_fit to have a specific model fit to each simulated dataset, rather than the model they are simulated from.

Only the response variable from the fit (or alternative_fit, if given) is redrawn; other response variables in the population are left unchanged from their values in data.
A data frame (tibble) with columns corresponding to the columns returned by fn. The additional column .sample indicates which fit each row is from.
Because this function uses S3 generic methods such as simulate() and update(), it can be used with any model fit for which methods are provided. In base R, this includes lm() and glm().
The model provided as fit must be fit using the data argument to provide a data frame. For example:

fit <- lm(dist ~ speed, data = cars)

When simulating new data, this function provides the simulated data as the data argument and re-fits the model. If you instead refer directly to local variables in the model formula, this will not work. For example, if you fit a model this way:

# will not work
fit <- lm(cars$dist ~ cars$speed)

it will not be possible to refit the model using simulated datasets, as that would require modifying your environment to edit cars.
model_lineup() to use resampling to aid in regression diagnostics; sampling_distribution() to simulate draws from the population distribution, rather than the null
# Bootstrap distribution of estimates:
fit <- lm(mpg ~ hp, data = mtcars)
parametric_boot_distribution(fit, nsim = 5)

# Bootstrap distribution of estimates for a quadratic model, when true
# relationship is linear:
quad_fit <- lm(mpg ~ poly(hp, 2), data = mtcars)
parametric_boot_distribution(fit, quad_fit, nsim = 5)

# Bootstrap distribution of estimates for a model with an additional
# predictor, when it's truly zero. data argument must be provided so
# alternative fit has all predictors available, not just hp:
alt_fit <- lm(mpg ~ hp + wt, data = mtcars)
parametric_boot_distribution(fit, alt_fit, data = mtcars, nsim = 5)
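Continuing the last example, the simulated null distribution can be summarized to judge whether an observed coefficient is surprising. A sketch, assuming dplyr is installed; the summary statistics chosen here are illustrative:

suppressMessages(library(dplyr))

# Null distribution of the wt estimate when its true coefficient is zero:
parametric_boot_distribution(fit, alt_fit, data = mtcars, nsim = 100) |>
  filter(term == "wt") |>
  summarize(null_mean = mean(estimate),
            crit_value = quantile(abs(estimate), 0.95))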
Construct a data frame containing the model data, partial residuals for all quantitative predictors, and predictor effects, for use in residual diagnostic plots and other analyses. The result is in tidy form (one row per predictor per observation), allowing it to be easily manipulated for plots and simulations.
partial_residuals(fit, predictors = everything())

fit: The model to obtain residuals for. This can be a model fit with lm() or glm().
predictors: Predictors to calculate partial residuals for. Defaults to all predictors, skipping factors. Predictors can be specified using tidyselect syntax.
Data frame (tibble) containing the model data and residuals in tidy form. There is one row per selected predictor per observation. All predictors are included as columns, plus the following additional columns:

.obs: Row number of this observation in the original model data frame.
.predictor_name: Name of the predictor this row gives the partial residual for.
.predictor_value: Value of the predictor this row gives the partial residual for.
.partial_resid: Partial residual for this predictor for this observation.
.predictor_effect: Predictor effect for this predictor for this observation; see Details.
To define partial residuals, we must distinguish between the predictors, the measured variables we are using to fit our model, and the regressors, which are calculated from them. In a simple linear model, the regressors are equal to the predictors. But in a model with polynomials, splines, or other nonlinear terms, the regressors may be functions of the predictors.
For example, in a regression with a single predictor $X$, the regression model $Y = \beta_0 + \beta_1 X + e$ has one regressor, $X$. But if we choose a polynomial of degree 3, the model is $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + e$, and the regressors are $\{X, X^2, X^3\}$.

Similarly, if we have predictors $X_1$ and $X_2$ and form a model with main effects and an interaction, the regressors are $\{X_1, X_2, X_1 X_2\}$.
Partial residuals are defined in terms of the predictors, not the regressors, and are intended to allow us to see the shape of the relationship between a particular predictor and the response, and to compare it to how we have chosen to model it with regressors. Partial residuals are not useful for categorical (factor) predictors, and so these are omitted.
Consider a linear model where $E[Y \mid X = x] = \mu(x)$. The mean function $\mu(x)$ is a linear combination of regressors. Let $\hat{\mu}$ be the fitted model and $\hat{\beta}_0$ be its intercept.

Choose a predictor $X_f$, the focal predictor, to calculate partial residuals for. Write the mean function as $\mu(X_f, X_o)$, where $X_f$ is the value of the focal predictor, and $X_o$ represents all other predictors.

If $\hat{e}_i$ is the residual for observation $i$, the partial residual is

$$r_{if} = \hat{e}_i + \left( \hat{\mu}(x_{if}, 0) - \hat{\beta}_0 \right).$$

Setting $X_o = 0$ means setting all other numeric predictors to 0; factor predictors are set to their first (baseline) level.
Consider a generalized linear model where $g(E[Y \mid X = x]) = \mu(x)$, where $g$ is a link function and $\mu(x)$ is the linear predictor. Let $\hat{\mu}$ be the fitted linear predictor and $\hat{\beta}_0$ be its intercept. Let $\hat{y}_i = g^{-1}(\hat{\mu}(x_i))$ be the fitted mean function.

The working residual for observation $i$ is

$$\hat{e}_i = (y_i - \hat{y}_i) \, g'(\hat{y}_i).$$

Choose a predictor $X_f$, the focal predictor, to calculate partial residuals for. Write $\mu$ as $\mu(X_f, X_o)$, where $X_f$ is the value of the focal predictor, and $X_o$ represents all other predictors. Hence $\mu(X_f, X_o)$ gives the model's prediction on the link scale.

The partial residual is again

$$r_{if} = \hat{e}_i + \left( \hat{\mu}(x_{if}, 0) - \hat{\beta}_0 \right).$$
In linear regression, because the residuals $\hat{e}_i$ should have mean zero in a well-specified model, plotting the partial residuals against $x_{if}$ should produce a shape matching the modeled relationship $\mu(X_f, 0)$. If the model is wrong, the partial residuals will appear to deviate from the fitted relationship. Provided the regressors are uncorrelated or approximately linearly related to each other, the plotted trend should approximate the true relationship between $X_f$ and the response.

In generalized linear models, this is approximately true if the link function $g$ is approximately linear over the range of observed $X_f$ values.

Additionally, the function $\hat{\mu}(X_f, 0) - \hat{\beta}_0$ (in linear models) or $g^{-1}(\hat{\mu}(X_f, 0) - \hat{\beta}_0)$ (in generalized linear models) can be used to show the relationship between the focal predictor and the response. In a linear model, the function is linear; with polynomial or spline regressors, it is nonlinear. This function is the predictor effect function, and the estimated predictor effects are included in this function's output.
Factor predictors (as factors, logical, or character vectors) are detected automatically and omitted. However, if a numeric variable is converted to factor in the model formula, such as with y ~ factor(x), the function cannot determine the appropriate type and will raise an error. Create factors as needed in the source data frame before fitting the model to avoid this issue.
R. Dennis Cook (1993). "Exploring Partial Residual Plots", Technometrics, 35:4, 351-362. doi:10.1080/00401706.1993.10485350
Cook, R. Dennis, and Croos-Dabrera, R. (1998). "Partial Residual Plots in Generalized Linear Models." Journal of the American Statistical Association 93, no. 442: 730–39. doi:10.2307/2670123
Fox, J., & Weisberg, S. (2018). "Visualizing Fit and Lack of Fit in Complex Regression Models with Predictor Effect Plots and Partial Residuals." Journal of Statistical Software, 87(9). doi:10.18637/jss.v087.i09
binned_residuals() for the related binned residuals; augment_longer() for a similarly formatted data frame of ordinary residuals; vignette("linear-regression-diagnostics"), vignette("logistic-regression-diagnostics"), and vignette("other-glm-diagnostics") for examples of plotting and interpreting partial residuals
fit <- lm(mpg ~ cyl + disp + hp, data = mtcars)

partial_residuals(fit)

# You can select predictors with tidyselect syntax:
partial_residuals(fit, c(disp, hp))

# Predictors with multiple regressors are supported:
fit2 <- lm(mpg ~ poly(disp, 2), data = mtcars)
partial_residuals(fit2)

# Allowing an interaction by number of cylinders is fine, but partial
# residuals are not generated for the factor. Notice the factor must be
# created first, not in the model formula:
mtcars$cylinders <- factor(mtcars$cyl)
fit3 <- lm(mpg ~ cylinders * disp + hp, data = mtcars)
partial_residuals(fit3)
Specifies a hypothetical infinite population of cases. Each case has some predictor variables and one or more response variables. The relationships between the predictors and the response variables are defined, as well as the population marginal distribution of each predictor variable.
population(...)

...: A sequence of named arguments defining predictor and response variables. These are evaluated in order, so later response variables may refer to earlier predictor and response variables. All predictors should be provided first, before any response variables.
A population object.
predictor() and response() to define the population; sample_x() and sample_y() to draw samples from it
# A population with a simple linear relationship
linear_pop <- population(
  x1 = predictor(rnorm, mean = 4, sd = 10),
  x2 = predictor(runif, min = 0, max = 10),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 1.0)
)

# A population whose response depends on local variables
slope <- 2.2
intercept <- 0.7
sigma <- 2.5
variable_pop <- population(
  x = predictor(rnorm),
  y = response(intercept + slope * x, error_scale = sigma)
)

# Response error scale is heteroskedastic and depends on predictors
heteroskedastic_pop <- population(
  x1 = predictor(rnorm, mean = 4, sd = 10),
  x2 = predictor(runif, min = 0, max = 10),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 1 + x2 / 10)
)

# A binary outcome Y, using a binomial family with logistic link
binary_pop <- population(
  x1 = predictor(rnorm, mean = 4, sd = 10),
  x2 = predictor(runif, min = 0, max = 10),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2,
               family = binomial(link = "logit"))
)

# A binomial outcome Y, with 10 trials per observation, using a logistic link
# to determine the probability of success for each trial
binomial_pop <- population(
  x1 = predictor(rnorm, mean = 4, sd = 10),
  x2 = predictor(runif, min = 0, max = 10),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2,
               family = binomial(link = "logit"), size = 10)
)

# Another binomial outcome, but the number of trials depends on another
# predictor
binom_size_pop <- population(
  x1 = predictor(rnorm, mean = 4, sd = 10),
  x2 = predictor(runif, min = 0, max = 10),
  trials = predictor(rpois, lambda = 20),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2,
               family = binomial(link = "logit"), size = trials)
)

# A population with a simple linear relationship and collinearity. Because X
# is bivariate, there will be two predictors, named x1 and x2.
library(mvtnorm)
collinear_pop <- population(
  x = predictor(rmvnorm, mean = c(0, 1),
                sigma = matrix(c(1, 0.8, 0.8, 1), nrow = 2)),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 1.0)
)
Predictor variables can have any marginal distribution as long as a function is provided to sample from the distribution. Multivariate distributions are also supported: if the random generation function returns multiple columns, multiple random variables will be created. If the columns are named, the random variables will be named accordingly; otherwise, they will be successively numbered.
predictor(dist, ...)

dist: Function to generate draws from this predictor's distribution, provided as a function or as a string naming the function.
...: Additional arguments to pass to dist.
The random generation function must take an argument named n specifying the number of draws. For univariate distributions, it should return a vector of length n; for multivariate distributions, it should return an array or matrix with n rows and a column per variable.
Multivariate predictors are successively numbered. For instance, if predictor X is specified with

library(mvtnorm)
predictor(dist = rmvnorm, mean = c(0, 1),
          sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))

then the population predictors will be named X1 and X2, and will have covariance 0.5.

If the multivariate predictor has named columns, the names will be used instead. For instance, if predictor X generates a matrix with columns A and B, the population predictors will be named XA and XB.
A predictor_dist object, to be used in population() to specify a population distribution.
# Univariate normal distribution
predictor(dist = rnorm, mean = 10, sd = 2.5)

# Multivariate normal distribution
library(mvtnorm)
predictor(dist = rmvnorm, mean = c(0, 1, 7))

# Multivariate with named columns
rmulti <- function(n) {
  cbind(treatment = rbinom(n, size = 1, prob = 0.5),
        confounder = rnorm(n))
}
predictor(dist = rmulti)
Response variables are related to predictors (and other response variables) through a link function and response distribution. First the expression provided is evaluated using the predictors, to give this response variable's value on the link scale; then the inverse link function and response distribution are used to get the response value. See Details for more information.
response(expr, family = gaussian(), error_scale = NULL, size = 1L)

expr: An expression, in terms of other predictor or response variables, giving this response variable's value on the link scale.
family: The family of this response variable, e.g. gaussian(), binomial(), or poisson().
error_scale: Scale factor for errors. Used only for linear families, such as gaussian() and ols_with_error().
size: When the family is binomial(), the number of trials for each observation. Defaults to 1, representing a binary outcome.
Response variables are drawn based on a typical generalized linear model setup. Let $Y$ represent the response variable and $X$ represent the predictor variables. We specify that

$$Y \mid X \sim \text{SomeDistribution},$$

where

$$E[Y \mid X = x] = g^{-1}(\mu(x)).$$

Here $\mu(x)$ is the expression expr, and both the distribution and link function $g$ are specified by the family provided. For instance, if the family is gaussian(), the distribution is Normal and the link is the identity function; if the family is binomial(), the distribution is binomial and the link is (by default) the logistic link.
The following response families are supported.

gaussian(): The default family is gaussian() with the identity link function, specifying the relationship

$$Y \mid X = x \sim \text{Normal}(\mu(x), \sigma^2),$$

where $\sigma$ is given by error_scale.

ols_with_error(): Allows specification of custom non-Normal error distributions, specifying the relationship

$$Y = \mu(X) + e,$$

where $e$ is drawn from an arbitrary distribution, specified by the error argument to ols_with_error().

binomial(): Binomial responses include binary responses (as in logistic regression) and responses giving a total number of successes out of a number of trials. The response has distribution

$$Y \mid X = x \sim \text{Binomial}(N, g^{-1}(\mu(x))),$$

where $N$ is set by the size argument and $g$ is the link function. The default link is the logistic link, and others can be chosen with the link argument to binomial(). The default $N$ is 1, representing a binary outcome.

poisson(): Poisson-distributed responses with distribution

$$Y \mid X = x \sim \text{Poisson}(g^{-1}(\mu(x))),$$

where $g$ is the link function. The default link is the log link, and others can be chosen with the link argument to poisson().

custom_family(): Responses drawn from an arbitrary distribution with arbitrary link function, i.e.

$$Y \mid X = x \sim \text{SomeDistribution}(g^{-1}(\mu(x))),$$

where both $g$ and SomeDistribution are specified by arguments to custom_family().
The expr, error_scale, and size arguments are evaluated only when simulating data for this response variable. They are evaluated in an environment with access to the predictor variables and the preceding response variables, which they can refer to by name. Additionally, these arguments can refer to variables in scope when the enclosing population() was defined. See the Examples below.
A response_dist object, to be used in population() to specify a population distribution.
predictor() and population() to define populations; ols_with_error() and custom_family() for custom response distributions
# Defining a binomial response. The expressions can refer to other predictors
# and to the environment where the `population()` is defined:
slope1 <- 2.5
slope2 <- -3
intercept <- -4.6
size <- 10

population(
  x1 = predictor(rnorm),
  x2 = predictor(rnorm),
  y = response(intercept + slope1 * x1 + slope2 * x2,
               family = binomial(), size = size)
)
To specify the population distribution of a factor variable, specify the probability for each of its factor levels. When drawn from the population, factor levels are drawn with replacement according to their probability.
rfactor(n, levels, prob = rep_len(1/length(levels), length(levels)))

n: Number of values to draw.
levels: Character vector specifying the levels for the factor.
prob: Vector specifying the probability for each factor level.
Sample of n values from levels, drawn in proportion to their probabilities. By default, levels are equally likely.
by_level() to assign numeric values based on factor levels, such as to set population regression coefficients by factor level
rfactor(5, c("foo", "bar", "baz"), c(0.4, 0.3, 0.3))
rfactor(5, c("foo", "bar", "baz"), c(0.4, 0.3, 0.3))
Sampling is split into two steps, for predictors and for response variables, to allow users to choose which to simulate. sample_x() will only sample predictor variables, and sample_y() will augment a data frame of predictors with columns for response variables, overwriting any already present. Hence one can use sample_y() as part of a simulation with fixed predictors, for instance.
sample_x(population, n)
sample_y(xs)

population: Population, as defined by population().
n: Number of observations to draw from the population.
xs: Data frame of predictor values drawn from the population, as obtained from sample_x().
Data frame (tibble) of n rows, with columns matching the variables specified in the population.
# A population with a simple linear relationship
pop <- population(
  x1 = predictor(rnorm, mean = 4, sd = 10),
  x2 = predictor(runif, min = 0, max = 10),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 1.0)
)

xs <- pop |> sample_x(5)
xs

xs |> sample_y()
Repeatedly refits the model to new samples from the population, calculates estimates for each fit, and compiles a data frame of the results.
sampling_distribution(fit, data, fn = tidy, nsim = 100, fixed_x = TRUE, ...)

fit: A model fit to data, such as by lm() or glm(), to refit to each sample from the population.
data: Data drawn from a population(), using sample_x() and sample_y().
fn: Function to call on each new model fit to produce a data frame of estimates. Defaults to broom::tidy().
nsim: Number of simulations to run.
fixed_x: If TRUE (the default), the predictor values are held fixed across simulations and only new response values are drawn; if FALSE, new predictor values are drawn as well.
...: Additional arguments passed to fn each time it is called.
To generate sampling distributions of different quantities, the user can provide a custom fn. The fn should take a model fit as its argument and return a data frame. For instance, the data frame might contain one row per estimated coefficient and include the coefficient and its standard error; or it might contain only one row of model summary statistics.
Data frame (tibble) of nsim + 1 simulation results, formed by concatenating together the data frames returned by fn. The .sample column identifies which simulated sample each row came from. Rows with .sample == 0 come from the original fit.
Because this function uses S3 generic methods such as simulate() and update(), it can be used with any model fit for which methods are provided. In base R, this includes lm() and glm().
The model provided as fit must be fit using the data argument to provide a data frame. For example:

fit <- lm(dist ~ speed, data = cars)

When simulating new data, this function provides the simulated data as the data argument and re-fits the model. If you instead refer directly to local variables in the model formula, this will not work. For example, if you fit a model this way:

# will not work
fit <- lm(cars$dist ~ cars$speed)

it will not be possible to refit the model using simulated datasets, as that would require modifying your environment to edit cars.
parametric_boot_distribution() to simulate draws from a fitted model, rather than from the population
pop <- population(
  x1 = predictor(rnorm, mean = 4, sd = 10),
  x2 = predictor(runif, min = 0, max = 10),
  y = response(0.7 + 2.2 * x1 - 0.2 * x2, error_scale = 4.0)
)
d <- sample_x(pop, n = 20) |> sample_y()

fit <- lm(y ~ x1 + x2, data = d)

# using the default fn = broom::tidy(). conf.int argument is passed to
# broom::tidy()
samples <- sampling_distribution(fit, d, conf.int = TRUE)
samples

suppressMessages(library(dplyr))

# the model is correctly specified, so the estimates are unbiased:
samples |>
  group_by(term) |>
  summarize(mean = mean(estimate),
            sd = sd(estimate))

# instead of coefficients, get the sampling distribution of R^2
rsquared <- function(fit) {
  data.frame(r2 = summary(fit)$r.squared)
}
sampling_distribution(fit, d, rsquared, nsim = 10)