Title: | Generate Simulated Datasets |
---|---|
Description: | Generate simulated datasets from an initial underlying distribution and apply transformations to obtain realistic data. Implements the 'NORTA' (Normal-to-anything) approach from Cario and Nelson (1997) and other data generating mechanisms. Simple network visualization tools are provided to facilitate communicating the simulation setup. |
Authors: | Michael Kammer [aut, cre] |
Maintainer: | Michael Kammer <[email protected]> |
License: | GPL-3 |
Version: | 0.4.1 |
Built: | 2025-02-01 05:59:50 UTC |
Source: | https://github.com/matherealize/simdata |
Used to perform apply-like operations, regardless of whether the input is a matrix or a data.frame.
apply_array(obj, dim, fun)
obj |
Matrix or data.frame. |
dim |
Dimension to apply function to. |
fun |
Function object to apply. |
Create function_list object from list of functions
Create a function_list object from a list of functions. This is useful if such a list is created programmatically.
as_function_list(flist, ...)
flist |
List in which each entry is a function object. Can be named or unnamed. |
... |
Passed to |
Function with a single input which outputs a data.frame. Has special 'flist' entry in its environment which stores individual functions as list.
Helper function to simplify workflow with lists of functions.
colapply_functions(obj, flist)
obj |
2-dimensional array (matrix or data.frame). |
flist |
List of functions of length equal to the number of columns of |
Matrix or data.frame (same type as obj) with names taken from obj.
Check if matrix contains constant column(s)
contains_constant(x, eps = .Machine$double.eps)
x |
Matrix or Data.frame. |
eps |
Threshold for standard deviation below which a column is considered to be constant. |
TRUE if one of the columns has a standard deviation below 'eps', else FALSE.
Prints a warning if constant is found.
Used to specify a correlation matrix in a convenient way by giving entries of the upper triangular part.
cor_from_upper(n_var, entries = NULL)
n_var |
Integer, number of variables (= rows = columns of matrix). |
entries |
Matrix of correlation entries. Consists of 3 columns (variable_1, variable_2, correlation) that specify both variables and corresponding correlation in the upper triangular part of the matrix (i.e. variable_1 < variable_2). |
Matrix with user supplied entries.
cor_from_upper(2, rbind(c(1, 2, 0.8)))
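To illustrate the triple format, the following base R sketch fills a symmetric matrix from (variable_1, variable_2, correlation) rows. The helper name cor_from_triples is hypothetical and this is not the package's implementation. It assumes that pairs which are not listed get a correlation of zero.

```r
# Hypothetical helper, base R only: build a symmetric correlation matrix
# from rows of (variable_1, variable_2, correlation).
cor_from_triples <- function(n_var, entries = NULL) {
  m <- diag(1, n_var)                       # unit diagonal, zeros elsewhere
  if (!is.null(entries)) {
    idx <- entries[, 1:2, drop = FALSE]
    m[idx] <- entries[, 3]                  # fill the upper triangle
    m[idx[, 2:1, drop = FALSE]] <- entries[, 3]  # mirror to the lower triangle
  }
  m
}

m <- cor_from_triples(3, rbind(c(1, 2, 0.8), c(2, 3, -0.3)))
m
```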
Rescale correlation matrix by variable standard deviations to yield a covariance matrix.
cor_to_cov(m, sds = NULL)
m |
Symmetric correlation matrix. |
sds |
Standard deviations of the variables. Set to 1 for all variables by default. |
Symmetric covariance matrix.
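The rescaling itself is elementwise: entry (i, j) of the covariance matrix is the correlation multiplied by sds[i] * sds[j]. A minimal base R sketch of that identity follows. The function name cor_to_cov_sketch is hypothetical and not taken from the package.

```r
# Hypothetical helper: covariance = correlation * outer product of the sds.
cor_to_cov_sketch <- function(m, sds = NULL) {
  if (is.null(sds)) sds <- rep(1, nrow(m))  # default: unit standard deviations
  m * outer(sds, sds)
}

m <- matrix(c(1, 0.5, 0.5, 1), 2)
v <- cor_to_cov_sketch(m, sds = c(2, 3))
v
```

Applying stats::cov2cor to the result recovers the original correlation matrix.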
Convert correlation matrix to specification used by cor_from_upper
cor_to_upper(m, remove_below = .Machine$double.eps)
m |
Symmetric correlation matrix. |
remove_below |
Threshold for absolute correlation values below which they are removed from the returned matrix. If NULL then no filtering is applied. |
Matrix with 3 columns (variable_1, variable_2, correlation), where correlation gives the entry at position (variable_1, variable_2) of the input correlation matrix. Note that variable_1 < variable_2 holds for all entries.
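The conversion can be sketched in base R as follows. The helper name cor_to_triples is hypothetical, and the filtering rule (keep entries whose absolute value is at least remove_below) is an assumption based on the argument description above, not the package's code.

```r
# Hypothetical helper: list the upper triangle as (variable_1, variable_2,
# correlation) rows and drop entries with small absolute correlation.
cor_to_triples <- function(m, remove_below = .Machine$double.eps) {
  idx <- which(upper.tri(m), arr.ind = TRUE)   # row index < column index
  out <- cbind(idx, m[idx])
  colnames(out) <- c("variable_1", "variable_2", "correlation")
  if (!is.null(remove_below)) {
    out <- out[abs(out[, 3]) >= remove_below, , drop = FALSE]
  }
  out
}

m <- matrix(c(1, 0.8, 0,
              0.8, 1, -0.3,
              0, -0.3, 1), 3)
out <- cor_to_triples(m)
out
```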
Applies functions to a matrix or data.frame.
do_processing(x, functions = list())
x |
Matrix or Data.frame. |
functions |
List of lists, specifying functions to be applied as well as their arguments. See details. |
Functions are passed into the post-processor as a named list. The name f of the list entry is the function to be applied via base::do.call. The list entry itself is another named list, specifying the arguments to the function f as named arguments.
The functions must take a matrix or data.frame as first argument and return another matrix or data.frame of the same dimensions as single output.
Examples of post-processing steps are truncation (process_truncate_by_iqr, process_truncate_by_threshold) or centering / standardizing data (via scale, see example section below).
Can be useful to apply on simulated datasets, even outside of the simulation function (e.g. when standardization is only required at the modeling step).
Matrix or data.frame with post-processing applied.
Use with caution - no error checking is done for now so the user has to take care of everything themselves! Furthermore, output of the functions is not checked either.
do_processing(diag(5), functions = list(scale = list(center = TRUE, scale = FALSE)))
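The named-list convention described in the details can be sketched with base::do.call. The helper apply_steps below is hypothetical and only mirrors the documented behaviour (function name as list name, arguments as a nested named list). It is not the package's implementation and, like the original, performs no error checking.

```r
# Hypothetical helper: apply each named step in turn. The list name is the
# function to call, the list entry holds its additional named arguments.
apply_steps <- function(x, functions = list()) {
  for (f in names(functions)) {
    x <- do.call(f, c(list(x), functions[[f]]))  # x is the first argument
  }
  x
}

x <- apply_steps(diag(5), list(scale = list(center = TRUE, scale = FALSE)))
colMeans(x)
```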
Used to obtain an estimate of the correlation matrix after transforming the initial data.
estimate_final_correlation( obj, n_obs = 1e+05, cor_type = "pearson", seed = NULL, ... )
obj |
S3 class object of type |
n_obs |
Number of observations to simulate. |
cor_type |
Can be either a character ( |
seed |
Random number seed. NULL does not change the current seed. |
... |
Further arguments are passed to the function that computes the correlation
matrix (either |
This function is useful to estimate the final correlation of the data after transformation of the initial data. To provide a robust estimate it is advised to use a very large number of observations to compute the correlation matrix.
A numeric matrix given by the pairwise correlation coefficients for each pair of variables defined by obj and computed according to cor_type.
Apply list of functions to input
function_list(..., stringsAsFactors = FALSE, check.names = TRUE)
... |
Named or unnamed arguments, each of which is a function taking exactly one input. See details. |
stringsAsFactors , check.names
|
Arguments of |
This is a convenience function which takes a number of functions and returns
another function which applies all of the user specified functions to a new
input, and collects the results as list or data.frame.
This is useful to e.g. transform columns of a data.frame or check the validity of a matrix during simulations. See the example here and in simulate_data_conditional.
The assumptions for the individual functions are:
Each function is expected to take a single input.
Each function is expected to output a result consistent with the other functions (i.e. same output length) to ensure that the results can be summarized as a data.frame.
Function with a single input which outputs a data.frame. Has special 'flist' entry in its environment which stores individual functions as list.
This function works fine without naming the input arguments, but the resulting data.frames have empty column names if that is the case. Thus, it is recommended to only pass named function arguments.
data.frame, get_from_function_list, get_names_from_function_list
f <- function_list(
  v1 = function(x) x[, 1] * 2,
  v2 = function(x) x[, 2] + 10)
f(diag(2))

# function_list can be used to add new columns
# naming of columns should be handled separately in such cases
f <- function_list(
  function(x) x, # return x as it is
  X1_X2 = function(x) x[, 2] + 10) # add new column
f(diag(2))
Extract individual function objects from the environment of a function_list object.
get_from_function_list(flist)
flist |
|
List with named or unnamed entries corresponding to individual function objects that were passed to the function_list object. If flist is a simple function, returns NULL.
Extract names of individual function objects from the environment of a function_list object.
get_names_from_function_list(flist)
flist |
|
Names of list corresponding to individual function objects that were passed to the function_list object. If flist is a simple function, returns NULL.
Check if matrix is collinear
is_collinear(x)
x |
Matrix or Data.frame. |
TRUE if matrix is collinear, else FALSE.
Prints a warning if collinear.
Checks if matrix is numeric, symmetric, has diagonal elements of one, has only entries in [-1, 1], and is positive definite. Prints a warning if a problem was found.
is_cor_matrix(m, tol = 1e-09)
m |
Matrix. |
tol |
Tolerance for checking diagonal elements. |
TRUE if matrix is a correlation matrix, else FALSE.
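The listed conditions translate fairly directly into base R. The sketch below uses a hypothetical helper name, omits the warnings, and checks positive definiteness via eigenvalues, which is an assumption about how such a check could be done rather than a description of the package's code.

```r
# Hypothetical helper: TRUE only if all documented conditions hold.
looks_like_cor_matrix <- function(m, tol = 1e-09) {
  is.matrix(m) && is.numeric(m) &&
    nrow(m) == ncol(m) &&
    isSymmetric(unname(m)) &&
    all(abs(diag(m) - 1) < tol) &&            # unit diagonal within tolerance
    all(m >= -1 & m <= 1) &&                  # entries in [-1, 1]
    all(eigen(m, symmetric = TRUE, only.values = TRUE)$values > 0)
}

ok <- looks_like_cor_matrix(matrix(c(1, 0.5, 0.5, 1), 2))
# Symmetric with unit diagonal and valid entries, but not positive definite:
bad <- looks_like_cor_matrix(matrix(c(1, 0.9, -0.9,
                                      0.9, 1, 0.9,
                                      -0.9, 0.9, 1), 3))
c(ok, bad)
```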
This function can be used to find a suitable initial correlation for use in the NORTA procedure for a pair of variables with given marginal distributions and target correlation.
optimize_cor_for_pair( cor_target, dist1, dist2, n_obs = 1e+05, seed = NULL, tol = 0.01, ... )
cor_target |
Target correlation of variable pair. |
dist1 , dist2
|
Marginal distributions of variable pair, given as univariable quantile functions. |
n_obs |
Number of observations to be used in the numerical optimization procedure. |
seed |
Seed for generating standard normal random variables in the numerical optimization procedure. |
tol , ...
|
Further parameters passed to |
Uses stats::uniroot for actual optimization.
Output of stats::uniroot for the univariable optimization to find the initial correlation.
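The idea behind the pairwise optimization can be sketched in base R: simulate correlated standard normal pairs for a candidate initial correlation, push them through the two quantile functions, and let stats::uniroot find the candidate for which the resulting correlation matches the target. The helper find_initial_cor and its internals are hypothetical and are not the package's implementation.

```r
# Hypothetical helper: root-finding for the initial (normal) correlation.
find_initial_cor <- function(cor_target, dist1, dist2,
                             n_obs = 1e5, seed = 1, tol = 0.01) {
  set.seed(seed)
  z1 <- rnorm(n_obs)
  e <- rnorm(n_obs)          # fixed noise, so the objective is deterministic
  gap <- function(rho) {
    z2 <- rho * z1 + sqrt(1 - rho^2) * e     # cor(z1, z2) is roughly rho
    cor(dist1(pnorm(z1)), dist2(pnorm(z2))) - cor_target
  }
  stats::uniroot(gap, interval = c(-1, 1), tol = tol)
}

# Exponential and uniform marginals with a target correlation of 0.5
res <- find_initial_cor(0.5, qexp, qunif)
res$root
```

The returned root is typically larger in absolute value than the target, because the transformation to non-normal marginals attenuates the correlation.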
This function can be used to find a suitable correlation matrix to be used for simulating initial multivariate normal data in a NORTA based simulation design (see simdesign_norta).
optimize_cor_mat( cor_target, dist, ensure_cor_mat = TRUE, conv_norm_type = "O", return_diagnostics = FALSE, ... )
cor_target |
Target correlation matrix. |
dist |
List of functions of marginal distributions for simulated variables.
Must have the same length as the specified correlation matrix
( |
ensure_cor_mat |
If TRUE, this function ensures that the optimized matrix is a proper correlation matrix by ensuring positive definiteness. If FALSE, the optimized matrix is returned as is. |
conv_norm_type |
Metric to be used to find closest positive definite matrix to optimal matrix,
used if |
return_diagnostics |
TRUE to return additional diagnostics of the optimization procedure, see below. |
... |
Additional parameters passed to |
This function first finds a suitable correlation matrix for the underlying multivariate normal data used in the NORTA procedure. It does so by solving k*(k-1) univariable optimisation problems (where k is the number of variables). In case the result is not a positive-definite matrix, the nearest positive-definite matrix is found according to the user specified metric using Matrix::nearPD.
See e.g. Ghosh and Henderson (2003) for an overview of the procedure.
If return_diagnostics is FALSE, a correlation matrix to be used in the definition of a simdesign_norta object. If TRUE, then a list with two entries: cor_mat containing the correlation matrix, and convergence containing a list of objects returned by the individual optimisation problems from stats::uniroot.
Ghosh, S. and Henderson, S. G. (2003) Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation.
Partial functions are useful to define marginal distributions based on additional parameters.
partial(f, ...)
f |
Function in two or more parameters. |
... |
Parameters to be held fixed for function |
This helper function stores passed arguments in a list, and stores this list in the environment of the returned function. Thus, it remembers the arguments that should be held fixed, such that the returned partial function now is a function with fewer arguments.
Function object.
marginal <- partial(function(x, meanx) qnorm(x, meanx), meanx = 2)
marginal(0.5)
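The closure mechanism described in the details can be sketched in a few lines of base R. The name partial_sketch is hypothetical and this is only an illustration of the idea, not the package's implementation.

```r
# Hypothetical helper: remember fixed arguments in the enclosing environment
# and append them whenever the returned function is called.
partial_sketch <- function(f, ...) {
  fixed <- list(...)
  function(...) do.call(f, c(list(...), fixed))
}

marginal <- partial_sketch(qnorm, mean = 2, sd = 3)
marginal(0.5)  # the median of N(2, 3^2), i.e. 2
```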
Useful to visualize e.g. the associations of the initial multivariate Gaussian distribution used by simdesign_mvtnorm.
plot_cor_network(obj, ...)

## Default S3 method:
plot_cor_network(
  obj,
  categorical_indices = NULL,
  decimals = 2,
  cor_cutoff = 0.1,
  vertex_labels = NULL,
  vertex_label_prefix = "z",
  edge_width_function = function(x) x * 10,
  edge_label_function = function(x) round(x, decimals),
  use_edge_weights = FALSE,
  edge_weight_function = base::identity,
  seed = NULL,
  return_network = FALSE,
  mar = c(0, 0, 0, 0),
  vertex.size = 12,
  margin = 0,
  asp = 0,
  vertex.color = "#ececec",
  vertex.frame.color = "#979797",
  vertex.label.color = "black",
  edge.color = "ramp",
  edge.label.color = "black",
  edge.label.cex = 0.8,
  ...
)

## S3 method for class 'simdesign_mvtnorm'
plot_cor_network(obj, ...)
obj |
Correlation matrix or S3 class object which has a class method available (see below). |
... |
Passed to |
categorical_indices |
Vector of indices of variables which should be drawn as rectangles (i.e. represent categorical data). |
decimals |
Number of decimals, used for default labeling of the network edges. |
cor_cutoff |
Threshold of absolute correlation below which nodes are not considered as connected. Useful to control complexity of drawn network. Set to NULL to disable. |
vertex_labels |
Character vector of length |
vertex_label_prefix |
String which is added as prefix to node labels. |
edge_width_function |
Function which takes one vector input (absolute correlation values) and outputs transformation of this vector (must be >= 0). Defines edge widths. |
edge_label_function |
Function which takes one vector input (absolute correlation values) and outputs labels for these values as character vector. Defines edge labels. If set to NULL, then no edge labels will be displayed. |
use_edge_weights |
Logical, if TRUE then the layout will be influenced by the absolute correlations (i.e. edge weights) such that highly correlated variables will be put closer together. If FALSE, then the layout is independent of the correlation structure. |
edge_weight_function |
Function which takes one vector input (absolute correlation values) and
outputs transformation of this vector (must be >= 0). Defines edge weights.
Only relevant if |
seed |
Set random seed to ensure reproducibility of results. Can be fixed to obtain same layout but vary edge widths, correlation functions etc. Can also be used to obtain nicer looking graph layouts. |
return_network |
If TRUE, the |
mar |
|
vertex.size , margin , asp , vertex.frame.color , vertex.label.color , edge.label.color , edge.label.cex
|
Arguments to |
vertex.color |
Argument passed to |
edge.color |
Argument passed to |
For an explanation of all parameters not listed here, please refer to igraph::plot.
If return_network is TRUE, then an igraph network object is returned that can be plotted by the user using e.g. the interactive igraph::tkplot function. Otherwise, the network object is plotted directly and no output is returned.
plot_cor_network(default): Function to be used for correlation matrix.
plot_cor_network(simdesign_mvtnorm): Function to be used with simdesign_mvtnorm S3 class object to visualize initial correlation network of the underlying multivariate normal distribution.
plot_cor_network.simdesign_mvtnorm, plot_estimated_cor_network
Based on approximation via simulation specified by given simulation design. Convenience wrapper for combining estimate_final_correlation and plot_cor_network.
plot_estimated_cor_network( obj, n_obs = 1e+05, cor_type = "pearson", seed = NULL, show_categorical = TRUE, return_network = FALSE, ... )
obj |
S3 class object of type |
n_obs |
Number of observations to simulate. |
cor_type |
Can be either a character ( |
seed |
Random number seed. NULL does not change the current seed. |
show_categorical |
If TRUE, marks categorical variables differently from numeric ones.
Determined by the |
return_network |
If TRUE, the |
... |
Passed to |
This function is useful to estimate the correlation network of a simulation setup after the initial underlying distribution Z has been transformed to the final dataset X.
If return_network is TRUE, then an igraph network object is returned that can be plotted by the user using e.g. the interactive igraph::tkplot function. Otherwise, the network object is plotted directly and no output is returned.
plot_cor_network, estimate_final_correlation
Truncation based on the interquartile range to be applied to a dataset.
process_truncate_by_iqr(x, truncate_multipliers = NA, only_numeric = TRUE)
x |
Matrix or Data.frame. |
truncate_multipliers |
Vector of truncation parameters. Either a single value which is
replicated as necessary or of same dimension as |
only_numeric |
If TRUE and if |
Truncation is processed as follows:
Compute the 1st and 3rd quartile q1 / q3 of variables in x.
Multiply these quantities by values in truncate_multipliers to obtain L and U. If a value is NA, the corresponding variable is not truncated.
Set any value smaller / larger than L / U to L / U.
Truncation multipliers can be specified in three ways (note that whenever only_numeric is set to TRUE, then only numeric columns are affected):
A single numeric - then all columns will be processed in the same way
A numeric vector without names - it is assumed that the length can be replicated to the number of columns in x, each column is processed by the corresponding value in the vector
A numeric vector with names - length can differ from the columns in x and only the columns for which the names occur in the vector are processed
Matrix or data.frame of same dimensions as input.
Truncation based on fixed thresholds to be applied to a dataset. Allows truncation by measures derived from the overall data generating mechanism.
process_truncate_by_threshold( x, truncate_lower = NA, truncate_upper = NA, only_numeric = TRUE )
x |
Matrix or Data.frame. |
truncate_lower , truncate_upper
|
Vectors of truncation parameters, i.e. lower and upper thresholds for
truncation.
Either a single value which is replicated as necessary or of same dimension
as |
only_numeric |
If TRUE and if |
Truncation is defined by setting all values below or above the truncation threshold to the truncation threshold.
Truncation parameters can be specified in three ways (note that whenever only_numeric is set to TRUE, then only numeric columns are affected):
A single numeric - then all columns will be processed in the same way
A numeric vector without names - it is assumed that the length can be replicated to the number of columns in x, each column is processed by the corresponding value in the vector
A numeric vector with names - length can differ from the columns in x and only the columns for which the names occur in the vector are processed
Matrix or data.frame of same dimensions as input.
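Threshold truncation amounts to clamping each column with pmax and pmin. The base R sketch below uses a hypothetical helper name and covers only the first two ways of specifying thresholds (a single value or an unnamed vector recycled over the columns). Named vectors and the only_numeric switch are left out.

```r
# Hypothetical helper: clamp each column to [lower, upper]; NA means that
# the corresponding bound is not applied to that column.
truncate_columns <- function(x, lower = NA, upper = NA) {
  lower <- rep_len(lower, ncol(x))
  upper <- rep_len(upper, ncol(x))
  for (j in seq_len(ncol(x))) {
    if (!is.na(lower[j])) x[, j] <- pmax(x[, j], lower[j])
    if (!is.na(upper[j])) x[, j] <- pmin(x[, j], upper[j])
  }
  x
}

x <- cbind(a = c(-5, 0, 5), b = c(-5, 0, 5))
x_trunc <- truncate_columns(x, lower = c(-1, NA), upper = 2)
x_trunc
```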
Helper to estimate quantile functions from data for NORTA
quantile_functions_from_data(
  data,
  method_density = "linear",
  n_density = 200,
  method_quantile = "constant",
  probs_quantile = seq(0, 1, 0.01),
  n_small = 10,
  use_quantile = c(),
  ...
)

quantile_function_from_density(
  x,
  method_density = "linear",
  n_density = 200,
  ...
)

quantile_function_from_quantiles(
  x,
  method_quantile = "constant",
  probs_quantile = seq(0, 1, 0.01)
)
data |
A matrix or data.frame for which quantile function should be estimated. |
method_density |
Interpolation method used for density based quantile functions,
passed to |
n_density |
Number of points at which the density is estimated for density based
quantile functions, passed to |
method_quantile |
Interpolation method used for quantile based quantile functions,
passed to |
probs_quantile |
Specification of quantiles to be estimated from data for quantile based
quantile functions, passed to |
n_small |
An integer giving the number of distinct values below which quantile
functions are estimated using |
use_quantile |
A vector of names indicating columns for which the quantile function
should be estimated using |
... |
Passed to |
x |
Single vector representing variable input to |
The NORTA approach requires the specification of the marginals by quantile functions. This helper estimates those given a dataset automatically and non-parametrically. There are two ways implemented to estimate quantile functions from data.
Estimate the quantile function by interpolating the observed quantiles from the data. This is most useful for categorical data, when the interpolation is using a step-function (default). Implemented in quantile_function_from_quantiles().
Estimate the quantile function via the empirical cumulative distribution function derived from the density of the data. Since the density is only estimated at specific points, any values in between are interpolated linearly (default, other options are possible). This is most useful for continuous data. Implemented in quantile_function_from_density().
A named list of functions with length ncol(data) giving the quantile functions of the input data. Each entry is a function returned from stats::approxfun.
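The quantile-based variant can be sketched with stats::quantile and stats::approxfun. The helper name below is hypothetical, and the choice of type = 1 (inverse empirical distribution function, so that only observed values are returned) is an assumption of this sketch, not something stated by the package documentation.

```r
# Hypothetical helper: step-function quantile estimate for one variable.
quantile_fun_sketch <- function(x, probs = seq(0, 1, 0.01),
                                method = "constant") {
  # type = 1 is an assumption here; it avoids values between categories
  q <- stats::quantile(x, probs = probs, names = FALSE, type = 1)
  stats::approxfun(probs, q, method = method)
}

# Binary variable with 60% zeros and 40% ones
qf <- quantile_fun_sketch(c(0, 0, 0, 1, 1))
qf(c(0.25, 0.85))
```

A function of this shape maps probabilities in [0, 1] to data values and can therefore serve as a marginal distribution in a NORTA design.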
Stores information necessary to simulate and visualize datasets based on underlying distribution Z.
simdesign( generator, transform_initial = base::identity, n_var_final = -1, types_final = NULL, names_final = NULL, prefix_final = "v", process_final = list(), name = "Simulation design", check_and_infer = TRUE, ... )
generator |
Function which generates data from the underlying base distribution. It is
assumed it takes the number of simulated observations |
transform_initial |
Function which specifies the transformation of the underlying
dataset |
n_var_final |
Integer, number of columns in final datamatrix |
types_final |
Optional vector of length equal to |
names_final |
NULL or character vector with variable names for final dataset |
prefix_final |
NULL or prefix attached to variables in final dataset |
process_final |
List of lists specifying post-processing functions applied to final
datamatrix |
name |
Character, optional name of the simulation design. |
check_and_infer |
If TRUE, then the simulation design is tested by simulating 5 observations
using |
... |
Further arguments are directly stored in the list object to be passed to
|
The simdesign class should be used in the following workflow:
Specify a design template which will be used in subsequent data generating / visualization steps.
Sample / visualize datamatrix following template (possibly multiple times) using simulate_data.
Use sampled datamatrix for simulation study.
For more details on generators and transformations, please see the documentation of simulate_data.
For details on post-processing, please see the documentation of do_processing.
List object with class attribute "simdesign" (S3 class) containing the following entries (if no further information given, entries are directly saved from user input):
generator
name
transform_initial
n_var_final
types_final
names_final
process_final
entries for further information as passed by the user
If check_and_infer is set to TRUE, the following procedure determines the names of the variables:
use names_final if specified and of correct length
otherwise, use the names of transform_initial if present and of correct length
otherwise, use prefix_final to prefix the variable number if not NULL
otherwise, use names from dataset as generated by the generator function
This class is intended to be used as a template for simulation designs which are based on specific underlying distributions. All such a template needs to do is define the generator function and its construction, and pass it to this function along with the other arguments. See simdesign_mvtnorm for an example.
simdesign_mvtnorm, simulate_data, simulate_data_conditional
generator <- function(n) mvtnorm::rmvnorm(n, mean = 0)
sim_design <- simdesign(generator)
simulate_data(sim_design, 10, seed = 19)
Provides 2-dimensional points, spread uniformly over disc, or partial disc segment (i.e. a circle, or ring, or ring segment). Useful for e.g. building up clustering exercises.
simdesign_discunif( r_min = 0, r_max = 1, angle_min = 0, angle_max = 2 * pi, name = "Uniform circle simulation design", ... )
r_min |
Minimum radius of points. |
r_max |
Maximum radius of points. |
angle_min |
Minimum angle of points (between 0 and 2pi). |
angle_max |
Maximum angle of points (between 0 and 2pi). |
name |
Character, optional name of the simulation design. |
... |
Further arguments are passed to the |
The distribution of points on a disk depends on the radius - the farther out, the more area the points need to cover. Thus, simply sampling two uniform values for radius and angle will not work. See references.
List object with class attribute "simdesign_discunif" (S3 class), inheriting from "simdesign". It contains the same entries as a simdesign object but in addition the following entries:
r_min
r_max
angle_min
angle_max
https://mathworld.wolfram.com/DiskPointPicking.html
disc_sampler <- simdesign_discunif()
plot(simulate_data(disc_sampler, 1000, seed = 19))

ring_segment_sampler <- simdesign_discunif(r_min = 0.5, angle_min = 0.5 * pi)
plot(simulate_data(ring_segment_sampler, 1000, seed = 19))

circle_sampler <- simdesign_discunif(r_min = 1)
plot(simulate_data(circle_sampler, 1000, seed = 19))
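The point made in the details (area grows with the radius, so the radius must not be sampled uniformly) can be illustrated in base R. The sketch below uses a hypothetical helper and the standard square-root correction: it samples the squared radius uniformly. It is not taken from the package source.

```r
# Hypothetical helper: uniform points on a disc, ring, or ring segment.
sample_disc <- function(n, r_min = 0, r_max = 1,
                        angle_min = 0, angle_max = 2 * pi) {
  r <- sqrt(runif(n, r_min^2, r_max^2))   # uniform in area, not in radius
  theta <- runif(n, angle_min, angle_max)
  cbind(x = r * cos(theta), y = r * sin(theta))
}

set.seed(19)
pts <- sample_disc(1000, r_min = 0.5)
plot(pts, asp = 1)
```

With this correction, about half of the points fall inside the radius that splits the ring into two parts of equal area, which would not hold if the radius itself were drawn uniformly.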
Stores information necessary to simulate and visualize datasets based on an underlying multivariate normal distribution Z.
simdesign_mvtnorm( relations_initial, mean_initial = 0, sd_initial = 1, is_correlation = TRUE, method = "svd", name = "Multivariate-normal based simulation design", ... )
relations_initial |
Correlation / Covariance matrix of the initial multivariate
Normal distribution |
mean_initial |
Vector of mean values of the initial multivariate Normal
distribution |
sd_initial |
Vector of standard deviations of the initial multivariate
Normal distribution Z. Dimension needs to correspond to dimension
of |
is_correlation |
If TRUE, then |
method |
|
name |
Character, optional name of the simulation design. |
... |
Further arguments are passed to the |
This S3 class implements a simulation design based on an underlying multivariate normal distribution by creating a generator function based on mvtnorm::rmvnorm.
List object with class attribute "simdesign_mvtnorm" (S3 class), inheriting from "simdesign". It contains the same entries as a simdesign object but in addition the following entries:
mean_initial
sd_initial
cor_initial
Initial correlation matrix of multivariate normal distribution
Data will be generated by simulate_data using the following procedure:
The underlying data matrix Z is sampled from a multivariate Normal distribution (number of dimensions specified by dimensions of relations).
Z is then transformed into the final dataset X by applying the transform_initial function to Z.
X is post-processed if specified.
Note that relations specifies the correlation / covariance of the underlying Normal data Z and thus does not directly translate into correlations between the variables of the final datamatrix X.
simdesign, simulate_data, simulate_data_conditional, plot_cor_network.simdesign_mvtnorm
Stores information necessary to simulate datasets based on the NORTA procedure (Cario and Nelson 1997).
simdesign_norta( cor_target_final = NULL, cor_initial = NULL, dist = list(), tol_initial = 0.001, n_obs_initial = 10000, seed_initial = 1, conv_norm_type = "O", method = "svd", name = "NORTA based simulation design", ... )
cor_target_final |
Target correlation matrix for simulated datasets. At least one of
|
cor_initial |
Correlation matrix for underlying multivariate standard normal distribution
on which the final data is based. At least one of |
dist |
List of functions of marginal distributions for simulated variables.
Must have the same length as the specified correlation matrix
( |
tol_initial |
If |
n_obs_initial |
If |
seed_initial |
Seed used for draws of the initial distribution used during optimization to estimate correlations. |
conv_norm_type |
If |
method |
|
name |
Character, optional name of the simulation design. |
... |
Further arguments are passed to the |
This S3 class implements a simulation design based on the NORmal-To-Anything (NORTA) procedure by Cario and Nelson (1997). See the corresponding NORTA vignette for usage examples of how to approximate real datasets.
List object with class attribute "simdesign_norta" (S3 class), inheriting from "simdesign". It contains the same entries as a simdesign object but in addition the following entries:
cor_target_final
cor_initial
Initial correlation matrix of multivariate normal distribution
dist
tol_initial
n_obs_initial
conv_norm_type
method
Data will be generated using the following procedure:
1. An underlying data matrix Z is sampled from a multivariate standard Normal
   distribution with correlation structure given by cor_initial.
2. Z is then transformed into a dataset X by applying the functions given in
   dist to the columns of Z. The resulting dataset X will then have the desired
   marginal distributions, and approximate the target correlation
   cor_target_final, if specified.
3. X is further transformed by the transformation transform_initial (note that
   this may affect the correlation of the final dataset and is not respected by
   the optimization procedure), and post-processed if specified.
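The three steps above can be sketched in base R. This is only an illustration and not the package's implementation: it assumes that the normal draws are mapped to [0, 1] with pnorm before the quantile functions are applied, and all object names are chosen here for the example.

```r
set.seed(1)
n <- 1000

# Step 1: sample Z from a bivariate standard normal with correlation 0.5.
cor_initial <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
Z <- matrix(rnorm(n * 2), ncol = 2) %*% chol(cor_initial)

# Step 2: map Z to the unit interval (assumption), then apply one
# quantile function per column.
U <- pnorm(Z)
dist <- list(
  function(u) qnorm(u, mean = 10, sd = 4),
  function(u) qbinom(u, size = 1, prob = 0.3)
)
X <- sapply(seq_along(dist), function(j) dist[[j]](U[, j]))

# Step 3: optional further transformation (here: rescale the first column).
X[, 1] <- X[, 1] / 10

# Correlation of the final data; it generally differs from cor_initial.
cor(X)[1, 2]
```

Comparing cor(X) with cor_initial shows why a target correlation for X cannot simply be plugged in as the initial correlation.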
A list of functions dist is used to define the marginal distributions of
the variables. Each entry must be a quantile function, i.e. a function
that maps [0, 1] to the domain of a probability distribution. Each entry
must take a single input vector, and return a single numeric vector.
Examples for acceptable entries include all standard quantile functions
implemented in R (e.g. qnorm, qbinom, ...), user defined functions
wrapping these (e.g. function(x) qnorm(x, mean = 10, sd = 4)), or
empirical quantile functions. The helper function
quantile_functions_from_data can be used to automatically estimate
empirical quantile functions from a given dataset in order to reproduce it
using the NORTA approach. See the example in the NORTA vignette of this
package for workflow details.
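As a hedged illustration of what such a list might look like, the following base R sketch builds four quantile functions, including an empirical one based on stats::quantile. The data vector x_obs is made up for the example, and the empirical entry is a simple stand-in, not the output of quantile_functions_from_data.

```r
# Hypothetical observed values used to build an empirical quantile function.
x_obs <- c(2.1, 3.4, 3.9, 5.0, 7.2, 8.8)

dist <- list(
  qnorm,                                        # standard quantile function
  function(u) qnorm(u, mean = 10, sd = 4),      # wrapper fixing parameters
  function(u) qbinom(u, size = 1, prob = 0.3),  # binary variable
  function(u) quantile(x_obs, probs = u, type = 1, names = FALSE)
)

# Each entry takes one vector in [0, 1] and returns one numeric vector.
u <- c(0.1, 0.5, 0.9)
lapply(dist, function(f) f(u))
```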
Not every valid correlation matrix (i.e. symmetric, positive-definite matrix
with elements in [-1, 1] and unity diagonal) for a number of variables
is feasible for given desired marginal distributions (see e.g.
Ghosh and Henderson 2003). Therefore, if cor_target_final is specified
as target correlation, this class optimises cor_initial in such a way that
the final simulated dataset has a correlation which approximates
cor_target_final. However, the actual correlation in the end may differ
if cor_target_final is infeasible for the given specification, or the
NORTA procedure cannot exactly reproduce the target correlation. In general,
however, approximations should be acceptable if target correlations and
marginal structures are derived from real datasets.
See e.g. Ghosh and Henderson 2003 for the motivation why this works.
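To make the idea of optimising the initial correlation concrete, here is a minimal bivariate sketch in base R. It is not the optimiser used by this class: it assumes a simple one-dimensional root search with uniroot over fixed normal draws, and the marginals, the target value and all names are chosen only for illustration.

```r
set.seed(1)
n <- 10000
# Fixed base draws, reused for every candidate correlation.
z1 <- rnorm(n)
z2 <- rnorm(n)

# Correlation of the transformed variables for a given initial correlation rho.
final_cor <- function(rho) {
  zb <- rho * z1 + sqrt(1 - rho^2) * z2  # cor(z1, zb) is approximately rho
  x1 <- qexp(pnorm(z1))
  x2 <- qbinom(pnorm(zb), size = 1, prob = 0.2)
  cor(x1, x2)
}

# Search for the initial correlation that yields the target final correlation.
target <- 0.3
rho_initial <- uniroot(function(r) final_cor(r) - target,
                       interval = c(-0.99, 0.99))$root
c(initial = rho_initial, final = final_cor(rho_initial))
```

Reusing the same base draws for every candidate value keeps the objective deterministic, which is one plausible reason for fixing a seed and a sample size during such a search.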
Cario, M. C. and Nelson, B. L. (1997) Modeling and generating random vectors with arbitrary marginal distributions and correlation matrix. Technical Report, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois.
Ghosh, S. and Henderson, S. G. (2003) Behavior of the NORTA method for correlated random vector generation as the dimension increases. ACM Transactions on Modeling and Computer Simulation.
simdesign, simulate_data, simulate_data_conditional, quantile_functions_from_data
Generate simulated dataset based on transformation of an underlying base distribution.
simulate_data(generator, ...)

## Default S3 method:
simulate_data(
  generator = function(n) matrix(rnorm(n)),
  n_obs = 1,
  transform_initial = base::identity,
  names_final = NULL,
  prefix_final = NULL,
  process_final = list(),
  seed = NULL,
  ...
)

## S3 method for class 'simdesign'
simulate_data(
  generator,
  n_obs = 1,
  seed = NULL,
  apply_transformation = TRUE,
  apply_processing = TRUE,
  ...
)
generator |
Function which generates data from the underlying base distribution. It is
assumed it takes the number of simulated observations |
... |
Further arguments passed to |
n_obs |
Number of simulated observations. |
transform_initial |
Function which specifies the transformation of the underlying
dataset |
names_final |
NULL or character vector with variable names for final dataset |
prefix_final |
NULL or prefix attached to variables in final dataset |
process_final |
List of lists specifying post-processing functions applied to final
datamatrix |
seed |
Set random seed to ensure reproducibility of results. |
apply_transformation |
This argument can be set to FALSE to override the information stored in the
passed |
apply_processing |
This argument can be set to FALSE to override the information stored in the
passed |
Data is generated using the following procedure:
1. An underlying dataset Z is sampled from some distribution. This is
   done by a call to the generator function.
2. Z is then transformed into the final dataset X by applying the
   transform_initial function to Z.
3. X is post-processed if specified (e.g. truncation to avoid outliers).
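The procedure above can be mimicked with a few lines of base R. This is a simplified sketch under stated assumptions, not the code behind simulate_data: the function simulate_sketch and its arguments transform and process are hypothetical names, and post-processing is reduced to a single function.

```r
# Hypothetical helper: generate, transform, then post-process.
simulate_sketch <- function(generator, n_obs, transform = identity,
                            process = identity, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  z <- generator(n_obs)  # step 1: draw from the base distribution
  x <- transform(z)      # step 2: transform Z into X
  process(x)             # step 3: post-process X
}

generator <- function(n) matrix(rnorm(n * 2), ncol = 2)
transform <- function(z) cbind(z[, 1], exp(z[, 2]))
truncate <- function(x) pmin(pmax(x, -2), 2)  # clip values to [-2, 2]

x <- simulate_sketch(generator, 10, transform, truncate, seed = 24)
dim(x)
```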
Data.frame or matrix with n_obs rows for simulated dataset X.
simulate_data(default): Function to be used if no simdesign S3 class is used.
simulate_data(simdesign): Function to be used with simdesign S3 class.
The generator function which is either passed directly, or via a
simdata::simdesign object, is assumed to provide the same interface
as the random generation functions in the R stats and extraDistr
packages. Specifically, that means it takes the number of observations as
first argument. All further arguments can be set by passing them as
named arguments to this function. It is expected to return a two-dimensional
array (matrix or data.frame) for which the number of columns can be
determined. Otherwise the check_and_infer step will fail.
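For illustration, two generator functions that follow this interface could look as follows. Both are hypothetical examples written in base R; the names gen_normal and gen_mixed do not come from the package.

```r
# Number of observations first, further parameters as named arguments,
# and a two-dimensional return value with a well-defined number of columns.
gen_normal <- function(n, mean = 0, sd = 1) {
  matrix(rnorm(n * 3, mean = mean, sd = sd), ncol = 3)
}

# A data.frame is also a two-dimensional array in the sense used above.
gen_mixed <- function(n) {
  data.frame(a = rnorm(n), b = runif(n))
}

dim(gen_normal(5, mean = 10))
dim(gen_mixed(5))
```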
Transformations should be applicable to the output of the generator
function (i.e. take a data.frame or matrix as input) and output another
data.frame or matrix. A convenience function function_list is
provided by this package to specify transformations as a list of functions,
which take the whole datamatrix Z as single argument and can be used to
apply specific transformations to the columns of that matrix. See the
documentation for function_list for details.
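The idea of specifying a transformation as a list of functions can be sketched in base R as follows. The helper make_transform is a hypothetical stand-in written for this example; it only imitates the behaviour described above and is not the package's function_list.

```r
# Hypothetical stand-in: every function receives the whole matrix Z and
# returns one column of the transformed dataset.
make_transform <- function(...) {
  flist <- list(...)
  function(z) as.data.frame(lapply(flist, function(f) f(z)))
}

transform <- make_transform(
  v1 = function(z) z[, 1],                  # keep the first column
  v2 = function(z) z[, 1] * z[, 2],         # interaction of two columns
  v3 = function(z) as.numeric(z[, 2] > 0)   # dichotomised second column
)

set.seed(24)
z <- matrix(rnorm(20), ncol = 2)
x <- transform(z)
names(x)
```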
Post-processing the datamatrix is based on do_processing.
Variables are named by names_final if not NULL and of correct length.
Otherwise, if prefix_final is not NULL, it is used as prefix for variable
numbers. Otherwise, variable names remain as returned by the generator
function.
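The naming rules can be written down as a small base R function. This is a sketch of the logic as described above: the helper name_columns is hypothetical, and joining prefix and number with paste0 (no separator) is an assumption, since the exact format is not stated here.

```r
# Hypothetical helper mirroring the described precedence of naming rules.
name_columns <- function(x, names_final = NULL, prefix_final = NULL) {
  if (!is.null(names_final) && length(names_final) == ncol(x)) {
    colnames(x) <- names_final
  } else if (!is.null(prefix_final)) {
    # Assumption: prefix directly followed by the column number.
    colnames(x) <- paste0(prefix_final, seq_len(ncol(x)))
  }
  x  # otherwise names remain as returned by the generator
}

x <- matrix(0, nrow = 2, ncol = 3)
colnames(name_columns(x, names_final = c("a", "b", "c")))
colnames(name_columns(x, names_final = c("a", "b"), prefix_final = "v"))
colnames(name_columns(x))
```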
This function is best used in conjunction with the simdesign
S3 class or any template based upon it, which facilitates further data
visualization and conveniently stores information as a template for
simulation tasks.
simdesign, simdesign_mvtnorm, simulate_data_conditional, do_processing
generator <- function(n) mvtnorm::rmvnorm(n, mean = 0)
simulate_data(generator, 10, seed = 24)
Generate simulated dataset based on transformation of an underlying base distribution while checking that certain conditions are met.
simulate_data_conditional(
  generator,
  n_obs = 1,
  reject = function(x) TRUE,
  reject_max_iter = 10,
  on_reject = "ignore",
  return_tries = FALSE,
  seed = NULL,
  ...
)
generator |
Function which generates data from the underlying base distribution. It is
assumed it takes the number of simulated observations |
n_obs |
Number of simulated observations. |
reject |
Function which takes a matrix or data.frame |
reject_max_iter |
Integer > 0. In case of rejection, how many times should a new datamatrix be
simulated until the conditions in |
on_reject |
If "stop", an error is returned if after |
return_tries |
If TRUE, then the function also outputs the number of tries necessary to find a dataset fulfilling the condition. Useful to record to assess the possible bias of the simulated datasets. See Value. |
seed |
Set random seed to ensure reproducibility of results. See Note below. |
... |
All further parameters are passed to |
For details on generating, transforming and post-processing datasets, see
simulate_data. This function simulates data conditional
on certain requirements that must be met by the final datamatrix X.
This checking is conducted on the output of simulate_data (i.e.
also includes possible post-processing steps).
Data.frame or matrix with n_obs rows for simulated dataset X if all
conditions are met within the iteration limit. Otherwise NULL.
If return_tries is TRUE, then the output is a list with the first entry
being the data.frame or matrix as described above, and the second entry
(n_tries) giving a numeric with the number of tries necessary to
find the returned dataset.
Examples for restrictions include
- variance restrictions (e.g. no constant columns which could happen due
  to extreme transformations of the initial Gaussian distribution Z),
- ensuring a sufficient number of observations in a given class (e.g. certain
  binary variables should have at least x% of their observations in a given
  class),
- multicollinearity (e.g. X must have full column rank).
If reject evaluates to TRUE, the current datamatrix X is rejected.
In case of rejection, new datasets can be simulated until the conditions
are met or a given maximum iteration limit is hit (reject_max_iter),
after which the latest datamatrix is returned or an error is reported.
The reject function should take a single input (a data.frame or matrix)
and output TRUE if the dataset is to be rejected or FALSE if it is to be
accepted.
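The rejection loop described above can be sketched in base R. This is an illustrative reimplementation under assumptions, not the package's code: simulate_conditional_sketch, gen_binary and has_constant are hypothetical names, and the sketch always returns the latest dataset together with a flag instead of reproducing the exact return value of simulate_data_conditional.

```r
simulate_conditional_sketch <- function(generator, n_obs, reject,
                                        reject_max_iter = 10,
                                        on_reject = "ignore", seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  x <- NULL
  rejected <- TRUE
  n_tries <- 0
  # Draw new datasets until one is accepted or the limit is reached.
  while (rejected && n_tries < reject_max_iter) {
    n_tries <- n_tries + 1
    x <- generator(n_obs)
    rejected <- isTRUE(reject(x))  # TRUE means: reject this dataset
  }
  if (rejected && on_reject == "stop") {
    stop("No acceptable dataset found within reject_max_iter tries.")
  }
  list(x = x, n_tries = n_tries, rejected = rejected)
}

# Rare binary events make constant columns likely in small samples.
gen_binary <- function(n) matrix(rbinom(n * 2, size = 1, prob = 0.1), ncol = 2)
has_constant <- function(x) any(apply(x, 2, sd) == 0)

res <- simulate_conditional_sketch(gen_binary, 10, has_constant, seed = 18)
res$n_tries
```

Recording the number of tries, as in the n_tries entry here, gives a rough idea of how strongly the condition restricts the simulated datasets.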
This package provides the function_list convenience function
which makes it easy to create a rejection function that assesses several
conditions on the input dataset by simply passing individual test functions
to function_list. Such test function templates are found in
is_collinear and contains_constant.
See the example below.
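The following base R sketch shows one way several test functions could be combined into a single rejection function. The helpers has_constant_column, is_rank_deficient and combine_rejections are hypothetical and written for this example; they are simplified stand-ins for contains_constant, is_collinear and function_list, not their actual implementations.

```r
# Simplified stand-ins for the package's test function templates.
has_constant_column <- function(x) any(apply(x, 2, sd) == 0)
is_rank_deficient <- function(x) qr(as.matrix(x))$rank < ncol(x)

# Reject a dataset as soon as any of the individual tests returns TRUE.
combine_rejections <- function(...) {
  tests <- list(...)
  function(x) any(vapply(tests, function(test) isTRUE(test(x)), logical(1)))
}

reject <- combine_rejections(has_constant_column, is_rank_deficient)

ok <- cbind(c(1, 2, 3, 4), c(1, 0, 1, 0))
collinear <- cbind(c(1, 2, 3, 4), c(2, 4, 6, 8))
reject(ok)         # FALSE: no constant column, full column rank
reject(collinear)  # TRUE: second column is a multiple of the first
```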
Seeding the random number generator is tricky in this case. The seed cannot
be passed on to simulate_data; instead it is set once before the first call,
since otherwise the random number generation would be the same for each of
the tries. This means that the seed used to call this function might not be
the seed corresponding to the returned dataset.
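A short base R illustration of this point, using a made-up helper draw_with_seed: re-seeding before every try reproduces the same draw, whereas seeding once lets consecutive tries differ.

```r
# Re-seeding before each try yields identical draws for every try.
draw_with_seed <- function(seed) {
  set.seed(seed)
  rnorm(3)
}
identical(draw_with_seed(18), draw_with_seed(18))  # TRUE

# Seeding once before the loop lets consecutive tries differ.
set.seed(18)
first_try <- rnorm(3)
second_try <- rnorm(3)
identical(first_try, second_try)  # FALSE
```

In the second case, only the first try corresponds directly to the seed, which is why an accepted later try cannot be reproduced by seeding a single draw with the same value.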
simdesign, simulate_data, function_list, is_collinear, contains_constant
dsgn <- simdesign_mvtnorm(diag(5))
simulate_data_conditional(dsgn, 10,
  reject = function_list(is_collinear, contains_constant),
  seed = 18)