This vignette shows you how to read in and prepare any dataset for use with finalfit. The demonstration uses the boot::melanoma dataset; see ?boot::melanoma for the help page and data description. I will use library(tidyverse) methods. First, I'll write_csv() the data to a file, just to demonstrate reading it back in.
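
A minimal sketch of that round trip, assuming the boot and readr packages are installed. The temporary file path csv_path is an illustrative choice and not part of the original example:

```r
library(readr)

# Illustrative location for the file; any writable path would do.
csv_path <- tempfile(fileext = ".csv")

# Write the example data out, then read it back in as if it were your own file.
write_csv(boot::melanoma, csv_path)
melanoma <- read_csv(csv_path)
```

Reading the file back with read_csv() returns a tibble and reports the type it guessed for each column.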

Column types

When the file is read, note that the output shows how each column/variable has been parsed. For full details see ?readr::cols().

Continuous data

  • Integer (whole numbers) - col_integer()
  • Double or numeric (real numbers; the name comes from “double-precision floating point”) - col_double()
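
If the guessed types are not what you want, they can be stated explicitly through the col_types argument. The sketch below uses a small hypothetical file with made-up values (the column names age and thickness are borrowed from the melanoma data for illustration only):

```r
library(readr)

# Hypothetical three-row file, written only so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age,thickness", "76,6.76", "56,0.65", "41,1.34"), csv_path)

# State the type of each column instead of relying on guessing.
example_data <- read_csv(
  csv_path,
  col_types = cols(
    age       = col_integer(),  # whole numbers
    thickness = col_double()    # real numbers
  )
)
```

Stating the types up front means a value that does not fit, such as text in a numeric column, is reported as a parsing problem rather than silently changing the type of the column.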

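Categorical data

  • Factor (a fixed set of names/strings or numbers) - col_factor()
  • Character (sequences of letters, numbers, and symbols) - col_character()
  • Logical (containing only TRUE or FALSE) - col_logical()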

Dates and times

  • Date - col_date()
  • Date-time - col_datetime()
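
Dates are a common source of parsing trouble, because the same date can be written in many ways. As a sketch, assuming a hypothetical file with a day/month/year date column (the file and the column names id, visit_date and visit_datetime are made up for illustration), the format can be given to col_date():

```r
library(readr)

# Hypothetical file with a non-ISO date and an ISO 8601 date-time.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,visit_date,visit_datetime",
    "1,31/01/2020,2020-01-31T14:30:00",
    "2,15/02/2020,2020-02-15T09:00:00"),
  csv_path
)

visits <- read_csv(
  csv_path,
  col_types = cols(
    id             = col_integer(),
    visit_date     = col_date(format = "%d/%m/%Y"),  # day/month/year
    visit_datetime = col_datetime()                  # default expects ISO 8601
  )
)
```

Without the format argument, col_date() expects year-month-day input, so a day/month/year column would fail to parse.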

Specify factors

Use an original description of the data (often called a data dictionary) to correctly assign and label any factor variables. This can be done in a single pipe.

library(finalfit)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
melanoma %>% 
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3), 
      labels = c("Died from melanoma", "Alive", "Died from other causes")) %>% 
    ff_label("Status"),
    sex.factor = factor(sex, levels = c(1, 0),
      labels = c("Male", "Female")) %>% 
    ff_label("Sex"),
    ulcer.factor = factor(ulcer, levels = c(1, 0),
      labels = c("Present", "Absent")) %>% 
    ff_label("Ulcer")
  ) -> melanoma

ff_glimpse(melanoma)
#> Continuous
#>               label var_type   n missing_n missing_percent   mean     sd
#> time           time    <dbl> 205         0             0.0 2152.8 1122.1
#> status       status    <dbl> 205         0             0.0    1.8    0.6
#> sex             sex    <dbl> 205         0             0.0    0.4    0.5
#> age             age    <dbl> 205         0             0.0   52.5   16.7
#> year           year    <dbl> 205         0             0.0 1969.9    2.6
#> thickness thickness    <dbl> 205         0             0.0    2.9    3.0
#> ulcer         ulcer    <dbl> 205         0             0.0    0.4    0.5
#>              min quartile_25 median quartile_75    max
#> time        10.0      1525.0 2005.0      3042.0 5565.0
#> status       1.0         1.0    2.0         2.0    3.0
#> sex          0.0         0.0    0.0         1.0    1.0
#> age          4.0        42.0   54.0        65.0   95.0
#> year      1962.0      1968.0 1970.0      1972.0 1977.0
#> thickness    0.1         1.0    1.9         3.6   17.4
#> ulcer        0.0         0.0    0.0         1.0    1.0
#> 
#> Categorical
#>                label var_type   n missing_n missing_percent levels_n
#> status.factor Status    <fct> 205         0             0.0        3
#> sex.factor       Sex    <fct> 205         0             0.0        2
#> ulcer.factor   Ulcer    <fct> 205         0             0.0        2
#>                                                                levels
#> status.factor "Died from melanoma", "Alive", "Died from other causes"
#> sex.factor                                           "Male", "Female"
#> ulcer.factor                                      "Present", "Absent"
#>               levels_count   levels_percent
#> status.factor  57, 134, 14 27.8, 65.4,  6.8
#> sex.factor         79, 126           39, 61
#> ulcer.factor       90, 115           44, 56
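
One further check you may find useful is to cross-tabulate an original coded variable against its new factor, to confirm that every code received the intended label. This is a sketch rather than part of the finalfit workflow; it rebuilds status.factor from boot::melanoma so that it runs on its own, and the names melanoma_check and check_table are illustrative:

```r
library(dplyr)

# Recreate the status factor using the same coding as above.
melanoma_check <- boot::melanoma %>%
  mutate(
    status.factor = factor(status, levels = c(1, 2, 3),
      labels = c("Died from melanoma", "Alive", "Died from other causes"))
  )

# One row per combination of original code and new label.
check_table <- melanoma_check %>%
  count(status, status.factor)

check_table
```

Each code should appear against exactly one label, and the counts should match the levels_count column in the ff_glimpse() output.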

Everything looks good and you are ready to start analysis.