simdata: Technical documentation

Introduction

This document describes the inner workings of the simdata package for users who wish to extend it for their own purposes.

The simdata package is built around two simple components:

  • the simdesign S3 class, and any concrete subclass implemented by the user, which provides a data generating mechanism, and stores all necessary data to simulate data from the data generating mechanism
  • the simulate_data method for the simdesign class, which actually implements drawing from the data generating mechanism

Both components can be extended with further features to suit the task at hand; the Demo vignette of the package shows how. The package also provides utilities around this core functionality that assist in simulation tasks but are not essential for using the package.

simdesign S3 class

The main class of this package is the simdesign S3 class. It is a list with class attribute simdesign and entries as defined in the documentation of the simdesign class.

Subclassing simdesign

A template for a constructor implementing a subclass for a specific simulation design is given by:

# constructor takes any number of arguments arg1, arg2, and so on
# and it must use the ellipsis ... as final argument
new_simdesign <- function(arg1, arg2, ...) {
    
    # define generator function in one argument
    generator = function(n) {
        # implement data generating mechanism
        # make use of any argument passed to the new_simdesign constructor
        # make sure it returns a two-dimensional array
    }
    
    # setup simdesign subclass
    # make sure to pass generator function and ...
    # all other information passed is optional
    dsgn = simdesign(
        generator = generator, 
        arg1 = arg1, 
        arg2 = arg2, 
        ...
    )
    
    # extend the class attribute 
    class(dsgn) = c("binomial_simdesign", class(dsgn))
    
    # return the object
    dsgn
}

Examples for actual implementations are provided in the Demo vignette of this package.
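
As an illustration of the template, the following sketch fills in the generator for a hypothetical design that draws independent binomial variables. To keep it runnable without the package, simdesign_stub is a stand-in (an assumption, not the package's constructor) that merely stores its arguments in a list with class simdesign. The names binomial_simdesign_sketch, size, prob and n_var are likewise chosen for illustration only.

```r
# Stand-in for the simdesign() constructor, for illustration only:
# stores the generator and all further arguments in a classed list.
simdesign_stub <- function(generator, ...) {
    structure(list(generator = generator, ...), class = "simdesign")
}

# Hypothetical subclass constructor following the template above.
binomial_simdesign_sketch <- function(size = 1, prob = 0.5, n_var = 2, ...) {

    # generator in one argument; it sees the constructor arguments
    # because it is defined inside the constructor (closure)
    generator <- function(n) {
        draws <- rbinom(n * n_var, size = size, prob = prob)
        matrix(draws, nrow = n, ncol = n_var)  # two-dimensional array
    }

    dsgn <- simdesign_stub(
        generator = generator,
        size = size,
        prob = prob,
        n_var = n_var,
        ...
    )

    # prepend the subclass name so that S3 dispatch tries it first
    class(dsgn) <- c("binomial_simdesign_sketch", class(dsgn))
    dsgn
}

set.seed(1)
dsgn <- binomial_simdesign_sketch(size = 10, prob = 0.3, n_var = 3)
x <- dsgn$generator(5)
dim(x)                       # 5 rows, 3 columns
inherits(dsgn, "simdesign")  # TRUE
```

Because the subclass name is prepended rather than substituted, any method written for simdesign objects still applies to the new object.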

Simulation of data

simulate_data method

The data generation in the simulate_data method follows a simple recipe. In principle, the method can be used without a simdesign object, but here we assume they are used together. In the following graphic, circular shapes denote functions.

  1. Data is drawn from an initial distribution using the generator field (a function object) of the simdesign object.
    • Relevant input: the function stored in the generator field of the simdesign object, n_obs (number of observations), any further arguments passed to simulate_data that are not among its documented parameters
    • Output: initial generated dataset Z
  2. The initial data Z is transformed by one or several functions which are applied to the dataset.
    • Relevant input: Z, function stored in the transform_initial field of the simdesign class (can be implemented by using a function_list, see documentation of this package)
    • Default: base::identity is used to return the dataset Z unchanged
    • Output: final generated dataset X
  3. Optional: the final data X can be post-processed before further usage.
    • Relevant input: X, functions stored in the process_final field of the simdesign object
    • Default: base::identity is used to return the dataset X unchanged
    • Output: post-processed dataset X'.

The final output of the method is a dataset (a matrix or data.frame depending on the data generating mechanism) which can be used in further analysis steps.
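
The three steps above can be sketched in base R as follows. This is a simplified, hypothetical re-implementation for illustration, not the package code: the function name simulate_sketch is invented, post-processing is assumed to be a plain list of functions applied in order, and forwarding extra arguments to the generator is likewise an assumption of the sketch.

```r
# Hypothetical re-implementation of the three steps, not the package code.
simulate_sketch <- function(generator, n_obs,
                            transform_initial = base::identity,
                            process_final = list(), ...) {
    # Step 1: draw the initial dataset Z; extra arguments go to the generator
    z <- generator(n_obs, ...)

    # Step 2: transform Z into the final dataset X
    x <- transform_initial(z)

    # Step 3: apply the optional post-processing functions in order
    for (f in process_final) {
        x <- f(x)
    }
    x
}

set.seed(42)
x <- simulate_sketch(
    generator = function(n, n_var = 2) matrix(rnorm(n * n_var), nrow = n),
    n_obs = 100,
    # keep the first column and add its square as a second column
    transform_initial = function(z) cbind(z[, 1], z[, 1]^2),
    # center and scale each column as a post-processing step
    process_final = list(function(x) scale(x)),
    n_var = 3
)
dim(x)  # 100 rows, 2 columns
```

With the defaults (identity transformation, empty post-processing list) the sketch returns the generator output unchanged, mirroring the defaults listed in steps 2 and 3.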

Implemented methods

simulate_data is an S3 generic for which the following methods are implemented:

  • simulate_data.default: the default method doing all the actual work
  • simulate_data.simdesign: calls simulate_data.default with appropriate parameters as stored in the simdesign object; the intended way to use this function
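
The division of labour between the two methods can be illustrated with a small S3 sketch. The generic draw_data, the class design_sketch and the reduced argument list are hypothetical and only mirror the pattern described above; they are not the package's functions.

```r
# Hypothetical generic mirroring the default/simdesign split.
draw_data <- function(generator, ...) UseMethod("draw_data")

# default method: does the actual work on a generator function
draw_data.default <- function(generator, n_obs = 1,
                              transform_initial = base::identity, ...) {
    transform_initial(generator(n_obs))
}

# method for a design object: unpack the stored fields and delegate
draw_data.design_sketch <- function(generator, n_obs = 1, ...) {
    dsgn <- generator
    draw_data.default(
        generator = dsgn$generator,
        n_obs = n_obs,
        transform_initial = dsgn$transform_initial,
        ...
    )
}

# a design object is a classed list holding everything needed to simulate
dsgn <- structure(
    list(
        generator = function(n) matrix(runif(n * 2), nrow = n),
        transform_initial = function(z) z * 10
    ),
    class = "design_sketch"
)

set.seed(7)
x_design <- draw_data(dsgn, n_obs = 4)

set.seed(7)
x_default <- draw_data(dsgn$generator, n_obs = 4,
                       transform_initial = dsgn$transform_initial)

identical(x_design, x_default)  # TRUE: both routes do the same work
```

Keeping all logic in the default method means the method for the design object only has to map stored fields to arguments, which is why calling it on the object is the more convenient route.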

simulate_data_conditional function

Data can be simulated to conform to user-specified constraints. These constraints are implemented through a rejection function applied to a simulated dataset. Only datasets for which the function returns FALSE (i.e. not rejected) are returned. This is implemented by repeatedly calling simulate_data to obtain new datasets from the data generating mechanism, either until the rejection function accepts a dataset, or until a maximum number of iterations has been reached. This process is depicted in the following diagram, in which circular shapes denote functions.
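
The rejection loop described above can be sketched in base R as follows. The function simulate_conditional_sketch, its return value (a list with the dataset and the number of tries) and the choice to return NULL when no dataset is accepted are assumptions made for this illustration; the package's own behaviour on failure may differ.

```r
# Hypothetical rejection loop, not the package implementation.
simulate_conditional_sketch <- function(generator, n_obs, reject,
                                        reject_max_iter = 10) {
    for (i in seq_len(reject_max_iter)) {
        x <- generator(n_obs)
        # accept only when reject() returns FALSE
        if (isFALSE(reject(x))) {
            return(list(data = x, tries = i))
        }
    }
    # no dataset was accepted within the iteration budget
    list(data = NULL, tries = reject_max_iter)
}

# example constraint: reject datasets that contain a constant column
has_constant_column <- function(x) {
    any(apply(x, 2, function(v) length(unique(v)) == 1))
}

set.seed(123)
res <- simulate_conditional_sketch(
    generator = function(n) {
        matrix(rbinom(n * 3, size = 1, prob = 0.1), nrow = n)
    },
    n_obs = 10,
    reject = has_constant_column,
    reject_max_iter = 50
)
res$tries  # number of datasets drawn until one was accepted
```

A rare binary variable in a small sample is a typical use case: without the constraint, a sizeable share of simulated datasets would contain a column of zeros only and could break downstream model fitting.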

R session information

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggcorrplot_0.1.4.1 patchwork_1.3.0    dplyr_1.1.4        fitdistrplus_1.2-1
##  [5] survival_3.7-0     MASS_7.3-61        nhanesA_1.1        doRNG_1.8.6       
##  [9] rngtools_1.5.2     doParallel_1.0.17  iterators_1.0.14   foreach_1.5.2     
## [13] knitr_1.48         GGally_2.2.1       reshape2_1.4.4     ggplot2_3.5.1     
## [17] simdata_0.4.0      rmarkdown_2.28    
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.48          bslib_0.8.0        lattice_0.22-6    
##  [5] vctrs_0.6.5        tools_4.4.1        generics_0.1.3     curl_5.2.3        
##  [9] tibble_3.2.1       fansi_1.0.6        highr_0.11         pkgconfig_2.0.3   
## [13] Matrix_1.7-1       RColorBrewer_1.1-3 lifecycle_1.0.4    compiler_4.4.1    
## [17] farver_2.1.2       stringr_1.5.1      munsell_0.5.1      codetools_0.2-20  
## [21] htmltools_0.5.8.1  sys_3.4.3          buildtools_1.0.0   sass_0.4.9        
## [25] yaml_2.3.10        pillar_1.9.0       jquerylib_0.1.4    tidyr_1.3.1       
## [29] cachem_1.1.0       ggstats_0.7.0      tidyselect_1.2.1   rvest_1.0.4       
## [33] digest_0.6.37      mvtnorm_1.3-1      stringi_1.8.4      purrr_1.0.2       
## [37] maketools_1.3.1    labeling_0.4.3     splines_4.4.1      fastmap_1.2.0     
## [41] grid_4.4.1         colorspace_2.1-1   cli_3.6.3          magrittr_2.0.3    
## [45] utf8_1.2.4         foreign_0.8-87     withr_3.0.2        scales_1.3.0      
## [49] httr_1.4.7         igraph_2.1.1       evaluate_1.0.1     viridisLite_0.4.2 
## [53] rlang_1.1.4        Rcpp_1.0.13        glue_1.8.0         selectr_0.4-2     
## [57] xml2_1.3.6         jsonlite_1.8.9     R6_2.5.1           plyr_1.8.9