Takes a single dependent variable and a vector of explanatory variable names (continuous or categorical) and produces a summary table.

summary_factorlist(
  .data,
  dependent = NULL,
  explanatory = NULL,
  formula = NULL,
  cont = "mean",
  cont_nonpara = NULL,
  cont_cut = 5,
  cont_range = TRUE,
  p = FALSE,
  p_cont_para = "aov",
  p_cat = "chisq",
  column = TRUE,
  total_col = FALSE,
  orderbytotal = FALSE,
  digits = c(1, 1, 3, 1, 0),
  na_include = FALSE,
  na_include_dependent = FALSE,
  na_complete_cases = FALSE,
  na_to_p = FALSE,
  na_to_prop = TRUE,
  fit_id = FALSE,
  add_dependent_label = FALSE,
  dependent_label_prefix = "Dependent: ",
  dependent_label_suffix = "",
  add_col_totals = FALSE,
  include_col_totals_percent = TRUE,
  col_totals_rowname = NULL,
  col_totals_prefix = "",
  add_row_totals = FALSE,
  include_row_totals_percent = TRUE,
  include_row_missing_col = TRUE,
  row_totals_colname = "Total N",
  row_missing_colname = "Missing N",
  catTest = NULL,
  weights = NULL
)

Arguments

.data

Dataframe.

dependent

Character vector of length 1: name of dependent variable (2 to 5 factor levels).

explanatory

Character vector of any length: name(s) of explanatory variables.

formula

An object of class "formula" (or one that can be coerced to that class). An optional alternative to the standard dependent/explanatory arguments. Do not supply it if using dependent/explanatory.

cont

Summary for continuous explanatory variables: "mean" (standard deviation) or "median" (interquartile range). If "median", a non-parametric hypothesis test is performed (see below).

cont_nonpara

Numeric vector of form e.g. c(1,2). Specify which variables to perform non-parametric hypothesis tests on and summarise with "median".

cont_cut

Numeric: number of unique values in continuous variable at which to consider it a factor.

cont_range

Logical. Median is shown with 1st and 3rd quartiles.

p

Logical: Include null hypothesis statistical test.

p_cont_para

Character. Continuous variable parametric test. One of "aov" (analysis of variance) or "t.test" (Welch two-sample t-test). Note that the continuous non-parametric test is always Kruskal-Wallis (kruskal.test), which in the two-group setting is equivalent to the Mann-Whitney U / Wilcoxon rank sum test.

For a continuous dependent and a continuous explanatory variable, the parametric p-value returned is for the Pearson correlation coefficient. The non-parametric equivalent is the p-value for the Spearman correlation coefficient.

p_cat

Character. Categorical variable test. One of either "chisq" or "fisher".

column

Logical: Compute margins by column rather than row.

total_col

Logical: include a total column summing across factor levels.

orderbytotal

Logical: order final table by total column high to low.

digits

Number of digits to round to (1) mean/median, (2) standard deviation / interquartile range, (3) p-value, (4) count percentage, (5) weighted count.

na_include

Logical: make explanatory variables missing data explicit (NA).

na_include_dependent

Logical: make dependent variable missing data explicit.

na_complete_cases

Logical: include only rows with complete data.

na_to_p

Logical: include missing as group in statistical test.

na_to_prop

Logical: include missing in calculation of column proportions.

fit_id

Logical: allows merging via finalfit_merge.

add_dependent_label

Logical. Add the dependent variable label to the top left of the table.

dependent_label_prefix

Add text before dependent label.

dependent_label_suffix

Add text after dependent label.

add_col_totals

Logical. Include column total n.

include_col_totals_percent

Logical. Include column percentage of total.

col_totals_rowname

Character. Row name for column totals.

col_totals_prefix

Character. Prefix to column totals, e.g. "N=".

add_row_totals

Logical. Include row totals. Note this differs from total_col above particularly for continuous explanatory variables.

include_row_totals_percent

Logical. Include row percentage of total.

include_row_missing_col

Logical. Include missing data total for each row. Only used when add_row_totals is TRUE.

row_totals_colname

Character. Column name for row totals.

row_missing_colname

Character. Column name for missing data totals for each row.

catTest

Deprecated. See p_cat above.

weights

Character vector of length 1: name of column to use for weights. Explanatory continuous variables are multiplied by weights. Explanatory categorical variables are counted with a frequency weight (sum(weights)).

Value

Returns a factorlist dataframe.

Details

This function aims to produce publication-ready summary tables for categorical or continuous dependent variables. It usually takes a categorical dependent variable and produces a cross table in which categorical explanatory variables are shown as counts and proportions (expressed as percentages) and continuous explanatory variables are summarised. It will also take a continuous dependent variable, in which case it produces the mean (standard deviation) or median (interquartile range) for use with linear regression models.

Examples

library(finalfit)
library(dplyr)
# Load example dataset, modified version of survival::colon
data(colon_s)

# Table 1 - Patient demographics ----
explanatory = c("age", "age.factor", "sex.factor", "obstruct.factor")
dependent = "perfor.factor"
colon_s %>%
  summary_factorlist(dependent, explanatory, p=TRUE)
#> Warning: There was 1 warning in `dplyr::summarise()`.
#>  In argument: `chisq.test(age.factor, perfor.factor)$p.value`.
#> Caused by warning in `chisq.test()`:
#> ! Chi-squared approximation may be incorrect
#>        label      levels          No         Yes     p
#>  Age (years)   Mean (SD) 59.8 (11.9) 58.4 (13.3) 0.542
#>          Age   <40 years    68 (7.5)     2 (7.4) 1.000
#>              40-59 years  334 (37.0)   10 (37.0)      
#>                60+ years  500 (55.4)   15 (55.6)      
#>          Sex      Female  432 (47.9)   13 (48.1) 1.000
#>                     Male  470 (52.1)   14 (51.9)      
#>  Obstruction          No  715 (81.2)   17 (63.0) 0.035
#>                      Yes  166 (18.8)   10 (37.0)      
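
# Table 1 again, as a sketch with median (IQR) and Fisher's exact test ----
# Uses only arguments documented above. cont = "median" switches the
# continuous test to Kruskal-Wallis, and p_cat = "fisher" replaces the
# chi-squared test that raised the approximation warning above.
# Output is not reproduced here.
table1_np = colon_s %>%
  summary_factorlist(dependent, explanatory,
    p = TRUE, cont = "median", p_cat = "fisher")
table1_np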

# summary_factorlist() is also commonly used to summarise any number of
# variables by an outcome variable (say dead yes/no).

# Table 2 - 5 yr mortality ----
explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "mort_5yr"
colon_s %>%
  summary_factorlist(dependent, explanatory)
#> Note: dependent includes missing data. These are dropped.
#>        label      levels      Alive       Died
#>          Age   <40 years   31 (6.1)   36 (8.9)
#>              40-59 years 208 (40.7) 131 (32.4)
#>                60+ years 272 (53.2) 237 (58.7)
#>          Sex      Female 243 (47.6) 194 (48.0)
#>                     Male 268 (52.4) 210 (52.0)
#>  Obstruction          No 408 (82.1) 312 (78.6)
#>                      Yes  89 (17.9)  85 (21.4)
#>  Perforation          No 497 (97.3) 391 (96.8)
#>                      Yes   14 (2.7)   13 (3.2)
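
# The sketches below use only arguments documented above. Output is not
# reproduced here, so treat them as illustrations, not reference results.

# Table 2 with column totals, row totals and per-row missing counts ----
explanatory = c("age.factor", "sex.factor", "obstruct.factor", "perfor.factor")
dependent = "mort_5yr"
table2_totals = colon_s %>%
  summary_factorlist(dependent, explanatory,
    add_col_totals = TRUE, add_row_totals = TRUE)
table2_totals

# Formula interface, used instead of dependent/explanatory ----
table2_formula = colon_s %>%
  summary_factorlist(formula = mort_5yr ~ age.factor + sex.factor)
table2_formula

# Continuous dependent: each explanatory level is summarised ----
# Assumes colon_s keeps the numeric `nodes` column from survival::colon.
dependent = "nodes"
table3_cont = colon_s %>%
  summary_factorlist(dependent, explanatory, p = TRUE)
table3_cont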