Performs parametric statistical testing — association

Performs parametric statistical testing (T-test for marginal effects, partial F-test for joint effects) on (1) the marginal effect of each covariate in C1 at the source-specific level, (2) the joint effect across all sources for each covariate in C1, and (3) the non-source-specific effect of each covariate in C2. In the context of bulk genomic data containing a mixture of cell types, these correspond to the marginal effect of each covariate in C1 (potentially including the phenotype of interest) in each cell type, the joint tissue-level effect of each covariate in C1, and the tissue-level effect of each covariate in C2.

association_parametric(
  X,
  Unico.mdl,
  slot_name = "parametric",
  diag_only = FALSE,
  intercept = TRUE,
  X_max_stds = 2,
  Q_max_stds = Inf,
  XQ_max_stds = Inf,
  parallel = TRUE,
  num_cores = NULL,
  log_file = "Unico.log",
  verbose = FALSE,
  debug = FALSE
)

Arguments

X: An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k sources. Note that X must include row names and column names and that NA values are currently not supported. X should not include features that are constant across all observations. Note that X must be the same X used to learn Unico.mdl (i.e. the original observed 2D mixture used to fit the model).
Unico.mdl: The entire set of model parameters estimated by Unico on the 2D mixture matrix (i.e. the list returned by applying function Unico to X).
slot_name: A string indicating the key under which the results are stored in Unico.mdl.
diag_only: A logical value indicating whether to use only the estimated source-level variances (thus ignoring the estimated covariances) for controlling the heterogeneity in the observed mixture. If set to FALSE, Unico instead estimates the observation- and feature-specific variance in the mixture by leveraging the entire k by k variance-covariance matrix.
intercept: A logical value indicating whether to fit the intercept term when performing the statistical testing.
X_max_stds: A non-negative numeric value that determines, for each feature, which observations are considered outliers based on their observed mixture value. Only samples whose observed mixture value falls within X_max_stds standard deviations of the mean will be used for the statistical testing of a given feature.
Q_max_stds: A non-negative numeric value that determines, for each feature, which observations are considered outliers based on their estimated mixture variance. Only samples whose estimated mixture variance falls within Q_max_stds standard deviations of the mean will be used for the statistical testing of a given feature.
XQ_max_stds: A non-negative numeric value that determines, for each feature, which observations are considered outliers based on their weighted mixture value. Only samples whose weighted mixture value falls within XQ_max_stds standard deviations of the mean will be used for the statistical testing of a given feature.
parallel: A logical value indicating whether to use parallel computing (possible when using a multi-core machine).
num_cores: A numeric value indicating the number of cores to use (activated only if parallel == TRUE). If num_cores == NULL then all available cores except for one will be used.
log_file: A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file; note that if verbose == FALSE then no output file will be generated regardless of the value of log_file.
verbose: A logical value indicating whether to print logs.
debug: A logical value indicating whether to set the logger to a more detailed debug level; set debug to TRUE before reporting issues.
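
The three *_max_stds arguments above all describe the same kind of per-feature filter. As a rough sketch only (this is not Unico's internal code), and assuming the mean and standard deviation are the plain per-feature sample estimates, such a filter could look like the following in base R. The helper make_mask and the toy matrix X_toy are hypothetical names introduced for illustration.

```r
# Hypothetical helper, not part of Unico: for each feature (row), keep the
# observations lying within `max_stds` sample standard deviations of the row mean.
make_mask <- function(V, max_stds) {
  mu  <- rowMeans(V)
  sds <- apply(V, 1, sd)
  # Center each row by its mean, then scale it by its standard deviation
  z <- sweep(sweep(V, 1, mu, "-"), 1, sds, "/")
  abs(z) <= max_stds
}

set.seed(1)
X_toy <- matrix(rnorm(2 * 50), nrow = 2,
                dimnames = list(c("f1", "f2"), paste0("s", 1:50)))
X_toy["f1", "s1"] <- 25                 # plant an obvious outlier

mask <- make_mask(X_toy, max_stds = 2)
mask["f1", "s1"]                        # FALSE: sample s1 is dropped for feature f1
all(make_mask(X_toy, max_stds = Inf))   # TRUE: Inf disables the filter
```

Under this reading, the default values Q_max_stds = Inf and XQ_max_stds = Inf leave those two filters switched off, so only X_max_stds = 2 removes observations by default.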

Value

An updated Unico.mdl object with the following list of effect size and p-value estimates stored in an additional key specified by slot_name:

gammas_hat: An m by k*p1 matrix of the estimated effects of the p1 covariates in C1 on each of the m features in X, where the first p1 columns are the source-specific effects of the p1 covariates on the first source, the following p1 columns are the source-specific effects on the second source and so on.
betas_hat: An m by p2 matrix of the estimated effects of the p2 covariates in C2 on the mixture values of each of the m features in X.
gammas_hat_pvals: An m by k*p1 matrix of p-values for the estimates in gammas_hat (based on a T-test).
betas_hat_pvals: An m by p2 matrix of p-values for the estimates in betas_hat (based on a T-test).
gammas_hat_pvals.joint: An m by p1 matrix of p-values for the joint effects (i.e. across all k sources) of each of the p1 covariates in C1 on each of the m features in X (based on a partial F-test). In other words, these are p-values for the combined statistical effects (across all sources) of each one of the p1 covariates on each of the m features under the Unico model.
Q: An m by n matrix of weights used for controlling the heterogeneity of each observation at each feature (activated only if debug == TRUE).
masks: An m by n matrix of logical values indicating whether each observation participated in the statistical testing at each feature (activated only if debug == TRUE).
phi_hat: An m by k+p1*k+p2 matrix containing all estimated effect sizes (including those on source weights) for each feature (activated only if debug == TRUE).
phi_se: An m by k+p1*k+p2 matrix containing the estimated standard errors associated with phi_hat for each feature (activated only if debug == TRUE).
phi_hat_pvals: An m by k+p1*k+p2 matrix containing the p-values associated with phi_hat for each feature (activated only if debug == TRUE).
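
The column ordering of gammas_hat and gammas_hat_pvals described above (p1 consecutive columns per source, source after source) can be expressed as a small index calculation. The helper gamma_cols and the toy matrix G below are hypothetical and only illustrate the layout; with a real result one would index the matrices stored under the key given by slot_name.

```r
# Hypothetical helper: columns of an m by k*p1 matrix holding the effects of
# the p1 covariates on source h, given the source-by-source layout.
gamma_cols <- function(h, p1) {
  ((h - 1) * p1 + 1):(h * p1)
}

k <- 3; p1 <- 2; m <- 4
# Toy stand-in for gammas_hat: every entry equals its own column index
G <- matrix(rep(seq_len(k * p1), each = m), nrow = m)

gamma_cols(2, p1)          # 3 4: both covariates on the second source
G[, gamma_cols(2, p1)]     # the m by p1 block for source 2

# Effects of covariate l across all k sources: every p1-th column starting at l
l <- 1
seq(l, k * p1, by = p1)    # 1 3 5
```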

Details

If we assume that the source-specific values $Z_{ijh}$ are normally distributed, then under the Unico model we have the following: $$Z_{ijh} \sim \mathcal{N}\left(\mu_{jh} + (c_i^{(1)})^T \gamma_{jh}, \sigma_{jh}^2 \right)$$ $$X_{ij} \sim \mathcal{N}\left(\sum_{h=1}^{k} w_{ih} \left(\mu_{jh} + (c_i^{(1)})^T \gamma_{jh}\right) + (c_i^{(2)})^T \beta_j, \text{Sum}\left((w_i w_i^T ) \odot \Sigma_j\right) + \tau_j^2\right)$$ Here $\text{Sum}(\cdot)$ adds up all entries of its matrix argument, so the first variance term equals $w_i^T \Sigma_j w_i$. For a given feature $j$ under test, the second equation corresponds to a heteroskedastic regression problem with $X_{ij}$ as the dependent variable and $\{\{w_i\}, \{w_i c_i^{(1)}\}, \{c_i^{(2)}\}\}$ as the set of independent variables. This view allows us to perform parametric statistical testing (T-test for marginal effects and partial F-test for joint effects) by solving a generalized least squares problem in which sample $i$ is scaled by the inverse of its estimated standard deviation.
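
The regression view above can be illustrated for a single feature with base R alone. This is a minimal sketch under stated assumptions, not the package's implementation: the weights W, the covariates c1 and c2, and the "estimated" parameters Sigma, tau2, mu, gamma and beta are all made-up toy values, and no outlier filtering is applied.

```r
set.seed(42)
n <- 200; k <- 3

# Toy inputs (all hypothetical): source weights W with rows summing to 1,
# one C1 covariate (c1) and one C2 covariate (c2)
W  <- matrix(rexp(n * k), n, k); W <- W / rowSums(W)
c1 <- rnorm(n)
c2 <- rnorm(n)

# Assumed parameters for one feature j
Sigma <- diag(c(1, 2, 0.5)); Sigma[1, 2] <- Sigma[2, 1] <- 0.3
tau2  <- 0.1
mu    <- c(1, 2, 3)
gamma <- c(0.8, 0, 0)        # c1 affects only the first source
beta  <- 0.5

# Per-sample mixture variance: sum of all entries of (w w^T) * Sigma
# (elementwise product), i.e. w' Sigma w, plus tau^2
q <- rowSums((W %*% Sigma) * W) + tau2
# Analogue of diag_only = TRUE: use the source variances, ignore covariances
q_diag <- as.vector(W^2 %*% diag(Sigma)) + tau2

# Simulate the observed mixture for this feature
x <- as.vector(W %*% mu + (W * c1) %*% gamma + beta * c2) + rnorm(n, sd = sqrt(q))

# Design matrix {w_i}, {w_i c_i^(1)}, {c_i^(2)}. No separate intercept in this
# toy design: the rows of W sum to 1, so the w columns already span it.
D <- cbind(W, W * c1, c2)
colnames(D) <- c(paste0("w", 1:k), paste0("w", 1:k, ".c1"), "c2")

# Scale every sample by the inverse of its standard deviation, then run OLS
s  <- 1 / sqrt(q)
xs <- x * s
Ds <- D * s                  # multiplies row i of D by s[i]
fit <- lm(xs ~ 0 + Ds)
summary(fit)$coefficients    # estimates, standard errors, t statistics, p-values

# Cross-check: same estimates as weighted least squares with weights 1/q
fit_w <- lm(x ~ 0 + D, weights = 1 / q)
all.equal(unname(coef(fit)), unname(coef(fit_w)))   # TRUE

# Partial F-test for the joint effect of c1 across all k sources:
# compare against the model without the w_i * c1 columns
fit0 <- lm(xs ~ 0 + Ds[, -(k + 1:k)])
p_joint <- anova(fit0, fit)[["Pr(>F)"]][2]
p_joint
```

The rows of the coefficient table for the w*.c1 columns play the role of the source-specific T-tests, and p_joint plays the role of the joint test for that covariate. Replacing q with q_diag gives the variant that ignores the covariances between sources.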

Examples

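# Simulate a small mixture dataset: 100 observations, 2 features, 3 sources
data = simulate_data(n=100, m=2, k=3, p1=1, p2=1, taus_std=0, log_file=NULL)
res = list()
# Fit the Unico model on the observed mixture
res$params.hat = Unico(data$X, data$W, data$C1, data$C2, parallel=FALSE, log_file=NULL)
# Run the tests; results are stored under the default slot_name, "parametric"
res$params.hat = association_parametric(data$X, res$params.hat, parallel=FALSE, log_file=NULL)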