This document elaborates on the inner workings of the simdata package, for users who may wish to extend it for their own purposes.
The simdata package is based on a very simple idea with two components:
- the simdesign S3 class, and any concrete subclass implemented by the user, which provides a data generating mechanism and stores all information necessary to simulate data from it
- the simulate_data method for the simdesign class, which actually implements drawing from the data generating mechanism

Both key functionalities can be extended with further features to adapt to the task of interest. How to do this is presented in the Demo vignette of the package. The package further provides some utilities around the core functionality to assist in simulation tasks, but these are not essential to using the package.

simdesign S3 class

The main class of this package is the simdesign S3 class. It is a list with class attribute simdesign and entries as defined in the documentation of the simdesign class.
A template for a constructor implementing a subclass for a specific simulation design is given by:
# constructor takes any number of arguments arg1, arg2, and so on,
# and it must use the ellipsis ... as final argument
new_simdesign <- function(arg1, arg2, ...) {
    # define generator function in one argument
    generator <- function(n) {
        # implement data generating mechanism
        # make use of any argument passed to the new_simdesign constructor
        # make sure it returns a two-dimensional array
    }

    # setup simdesign subclass
    # make sure to pass generator function and ...
    # all other information passed is optional
    dsgn <- simdesign(
        generator = generator,
        arg1 = arg1,
        arg2 = arg2,
        ...
    )

    # extend the class attribute
    # ("binomial_simdesign" is an example name for the subclass)
    class(dsgn) <- c("binomial_simdesign", class(dsgn))

    # return the object
    dsgn
}
Examples of actual implementations are provided in the Demo vignette of this package.
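
To make the template more concrete, the following is a minimal sketch of a subclass constructor that draws independent binomial columns. It is an illustration under stated assumptions, not code from the package: sketch_simdesign is a hypothetical stand-in for the package's simdesign constructor (so that the example runs with base R only), and sketch_binomial_simdesign together with its arguments size, prob and n_var are made up for this example.

```r
# Hypothetical stand-in for the package's simdesign() constructor, so that
# this sketch runs with base R only. It stores the generator and any further
# fields in a list with class attribute "simdesign".
sketch_simdesign <- function(generator, ...) {
  structure(list(generator = generator, ...), class = "simdesign")
}

# Constructor for a hypothetical subclass drawing independent binomial columns.
sketch_binomial_simdesign <- function(size = 1, prob = 0.5, n_var = 2, ...) {
  # generator in one argument n (number of observations);
  # it returns a two-dimensional array, here an n x n_var matrix
  generator <- function(n) {
    matrix(rbinom(n * n_var, size = size, prob = prob),
           nrow = n, ncol = n_var)
  }

  # pass the generator and ...; the remaining fields are optional metadata
  dsgn <- sketch_simdesign(
    generator = generator,
    size = size,
    prob = prob,
    n_var = n_var,
    ...
  )

  # extend the class attribute so that the subclass comes first
  class(dsgn) <- c("sketch_binomial_simdesign", class(dsgn))
  dsgn
}

# Usage: build a design and draw five observations from its generator
dsgn <- sketch_binomial_simdesign(size = 1, prob = 0.3, n_var = 3)
set.seed(1)
Z <- dsgn$generator(5)
```

Because the subclass name is prepended to the class attribute, methods written for the parent class remain applicable to the object, while subclass-specific methods take precedence.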

simulate_data method

The data generation in the simulate_data method follows a simple recipe. In principle, the method can be used without a simdesign object, but here we assume they are used together. In the following graphic, circular shapes denote functions.

1. An initial dataset Z is generated by the generator field (a function object) of the simdesign class.
   - Input: generator field of the simdesign class, n_obs (number of observations), and any further argument passed to simulate_data that is not specified in its documentation
   - Output: Z
2. Z is transformed by one or several functions which are applied to the dataset.
   - Input: Z, and the function stored in the transform_initial field of the simdesign class (can be implemented by using a function_list, see the documentation of this package)
   - Default: base::identity is used to return the dataset Z unchanged
   - Output: X
3. X can be post-processed before further usage.
   - Input: X, and the functions stored in the process_final field of the simdesign object
   - Default: base::identity is used to return the dataset X unchanged
   - Output: X'

The final output of the method is a dataset (a matrix or data.frame, depending on the data generating mechanism) which can be used in further analysis steps.
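
The recipe above can be sketched with base R alone. The function simulate_sketch below is a hypothetical helper written for this illustration and is not the package's simulate_data implementation; the argument names merely mirror the fields mentioned above, and the assumption that further arguments are forwarded to the generator is part of the sketch.

```r
# Hypothetical helper mirroring the three steps with base R only.
simulate_sketch <- function(generator, n_obs = 1,
                            transform_initial = base::identity,
                            process_final = list(), ...) {
  # Step 1: draw the initial dataset Z; further arguments are forwarded
  # to the generator (an assumption of this sketch)
  Z <- generator(n_obs, ...)

  # Step 2: transform Z into X (identity by default)
  X <- transform_initial(Z)

  # Step 3: apply the post-processing functions in order to obtain X'
  # (an empty list leaves X unchanged)
  for (f in process_final) {
    X <- f(X)
  }
  X
}

# Example ingredients, all made up for illustration
generator <- function(n, mean = 0) {
  matrix(rnorm(n * 2, mean = mean), nrow = n)
}
# derive a third column from the two initial ones
transform_initial <- function(Z) cbind(Z, Z[, 1] + Z[, 2])
# post-processing: convert to a data.frame, then round to two digits
process_final <- list(as.data.frame, function(X) round(X, 2))

set.seed(42)
X_final <- simulate_sketch(
  generator,
  n_obs = 10,
  transform_initial = transform_initial,
  process_final = process_final,
  mean = 5
)
```

Keeping the three steps as separate function arguments means that each can be swapped independently, for example to reuse one generator with several transformations.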
simulate_data
is a S3 method, which implements
simulate_data.default
: the default method doing all the
actual worksimulate_data.simdesign
: calls
simulate_data.default
with appropriate parameters as stored
in the simdesign
object; the intended way to use this
functionsimulate_data_conditional
functionData can be simulated to conform to specific user-specified
constraints. These constraints are implemented through a rejection
function applied to a simulated dataset. Only datasets for which the
function returns FALSE (i.e. not rejected) are returned. This is
implemented by repeatedly calling simulate_data
to obtain
new instances of datasets from the data generating mechanism, either
until the rejection function accepts the dataset, or until a maximum
number of iterations was conducted. This process is depicted in the
following diagram, in which circular shaps denote functions.
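
The rejection loop described above can be sketched in base R as follows. This is an illustration under stated assumptions and not the package's implementation: simulate_conditional_sketch, its arguments simulate, reject and max_iter, and the shape of the returned list are all hypothetical. In particular, returning NULL as data when every try is rejected is a choice made for this sketch only.

```r
# Hypothetical rejection loop: `simulate` is a function without arguments
# that returns a new dataset, `reject` returns TRUE if a dataset violates
# the constraints, and `max_iter` bounds the number of tries.
simulate_conditional_sketch <- function(simulate, reject, max_iter = 10) {
  stopifnot(max_iter >= 1)

  for (i in seq_len(max_iter)) {
    X <- simulate()
    # accept the dataset as soon as the rejection function returns FALSE
    if (!isTRUE(reject(X))) {
      return(list(data = X, tries = i, accepted = TRUE))
    }
  }

  # all tries were rejected (assumption of this sketch: return no data)
  list(data = NULL, tries = max_iter, accepted = FALSE)
}

# Example: draw sparse binary data and reject datasets with a constant column
simulate <- function() matrix(rbinom(20, size = 1, prob = 0.1), nrow = 10)
reject <- function(X) any(apply(X, 2, var) == 0)

set.seed(123)
res <- simulate_conditional_sketch(simulate, reject, max_iter = 50)
```

Bounding the number of tries guarantees termination even when the constraints are very unlikely to be satisfied by the data generating mechanism.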

R session information

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggcorrplot_0.1.4.1 patchwork_1.3.0 dplyr_1.1.4 fitdistrplus_1.2-1
## [5] survival_3.7-0 MASS_7.3-61 nhanesA_1.1 doRNG_1.8.6
## [9] rngtools_1.5.2 doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2
## [13] knitr_1.48 GGally_2.2.1 reshape2_1.4.4 ggplot2_3.5.1
## [17] simdata_0.4.0 rmarkdown_2.28
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.48 bslib_0.8.0 lattice_0.22-6
## [5] vctrs_0.6.5 tools_4.4.1 generics_0.1.3 curl_5.2.3
## [9] tibble_3.2.1 fansi_1.0.6 highr_0.11 pkgconfig_2.0.3
## [13] Matrix_1.7-1 RColorBrewer_1.1-3 lifecycle_1.0.4 compiler_4.4.1
## [17] farver_2.1.2 stringr_1.5.1 munsell_0.5.1 codetools_0.2-20
## [21] htmltools_0.5.8.1 sys_3.4.3 buildtools_1.0.0 sass_0.4.9
## [25] yaml_2.3.10 pillar_1.9.0 jquerylib_0.1.4 tidyr_1.3.1
## [29] cachem_1.1.0 ggstats_0.7.0 tidyselect_1.2.1 rvest_1.0.4
## [33] digest_0.6.37 mvtnorm_1.3-1 stringi_1.8.4 purrr_1.0.2
## [37] maketools_1.3.1 labeling_0.4.3 splines_4.4.1 fastmap_1.2.0
## [41] grid_4.4.1 colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
## [45] utf8_1.2.4 foreign_0.8-87 withr_3.0.2 scales_1.3.0
## [49] httr_1.4.7 igraph_2.1.1 evaluate_1.0.1 viridisLite_0.4.2
## [53] rlang_1.1.4 Rcpp_1.0.13 glue_1.8.0 selectr_0.4-2
## [57] xml2_1.3.6 jsonlite_1.8.9 R6_2.5.1 plyr_1.8.9