Performs asymptotic statistical testing under no distribution assumption

Performs asymptotic statistical testing on (1) the marginal effect of each covariate in C1 at the source-specific level and (2) the non-source-specific effect of each covariate in C2. In the context of bulk genomic data containing a mixture of cell types, these correspond to the marginal effect of each covariate in C1 (potentially including the phenotype of interest) in each cell type and the tissue-level effect of each covariate in C2.

association_asymptotic(
  X,
  Unico.mdl,
  slot_name = "asymptotic",
  diag_only = FALSE,
  intercept = TRUE,
  X_max_stds = 2,
  Q_max_stds = Inf,
  V_min_qlt = 0.05,
  parallel = TRUE,
  num_cores = NULL,
  log_file = "Unico.log",
  verbose = FALSE,
  debug = FALSE
)

Arguments

X: An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k sources. Note that X must include row names and column names and that NA values are currently not supported. X should not include features that are constant across all observations. Note that X must be the same X used to learn Unico.mdl (i.e. the original observed 2D mixture used to fit the model).
Unico.mdl: The entire set of model parameters estimated by Unico on the 2D mixture matrix (i.e. the list returned by applying function Unico to X).
slot_name: A string indicating the key under which the results are stored in Unico.mdl.
diag_only: A logical value indicating whether to use only the estimated source-level variances (thus ignoring the estimated covariances) for controlling the heterogeneity in the observed mixture. If set to FALSE, Unico instead estimates the observation- and feature-specific variance in the mixture by leveraging the entire k by k variance-covariance matrix.
intercept: A logical value indicating whether to fit the intercept term when performing the statistical testing.
X_max_stds: A non-negative numeric value that determines, for each feature, which observations are treated as outliers based on their observed mixture value. Only samples whose observed mixture value falls within X_max_stds standard deviations of the mean will be used for the statistical testing of a given feature.
Q_max_stds: A non-negative numeric value that determines, for each feature, which observations are treated as outliers based on their estimated mixture variance. Only samples whose estimated mixture variance falls within Q_max_stds standard deviations of the mean will be used for the statistical testing of a given feature.
V_min_qlt: A numeric value between 0 and 1 that determines, for each feature, which observations are treated as outliers based on their estimated moment condition variance. Only samples whose estimated moment condition variance falls outside the bottom V_min_qlt quantile will be used for the statistical testing of a given feature.
parallel: A logical value indicating whether to use parallel computing (possible when using a multi-core machine).
num_cores: A numeric value indicating the number of cores to use (activated only if parallel == TRUE). If num_cores == NULL then all available cores except for one will be used.
log_file: A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file; note that if verbose == FALSE then no output file will be generated regardless of the value of log_file.
verbose: A logical value indicating whether to print logs.
debug: A logical value indicating whether to set the logger to a more detailed debug level; set debug to TRUE before reporting issues.
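
The filtering controlled by X_max_stds and V_min_qlt can be illustrated with a small base R sketch for a single feature. This is only an illustration of the rules as described above, not Unico's internal code: the vectors x and v are simulated stand-ins (v is a hypothetical per-sample moment condition variance), and the exact inequalities and standard deviation estimator used by the package are assumptions.

```r
# Toy illustration of per-feature outlier masking (not Unico's internal code).
set.seed(2)

# Observed mixture values of one feature across 51 samples; the last is extreme.
x <- c(rnorm(50), 10)

# X_max_stds rule: keep samples within X_max_stds standard deviations of the mean.
X_max_stds <- 2
keep_x <- abs(x - mean(x)) <= X_max_stds * sd(x)

# V_min_qlt rule: drop samples in the bottom V_min_qlt quantile of v.
# v is a hypothetical stand-in for the estimated moment condition variance.
v <- rexp(length(x))
V_min_qlt <- 0.05
keep_v <- v > quantile(v, probs = V_min_qlt)

# A sample participates in testing only if it passes every criterion.
mask <- keep_x & keep_v
sum(mask)
```

A logical vector such as mask plays the role that one row of the masks matrix plays in the returned object when debug == TRUE.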

Value

An updated Unico.mdl object with the following list of effect size and p-value estimates stored in an additional key specified by slot_name:

gammas_hat: An m by k*p1 matrix of the estimated effects of the p1 covariates in C1 on each of the m features in X, where the first p1 columns are the source-specific effects of the p1 covariates on the first source, the following p1 columns are the source-specific effects on the second source and so on.
betas_hat: An m by p2 matrix of the estimated effects of the p2 covariates in C2 on the mixture values of each of the m features in X.
gammas_hat_pvals: An m by k*p1 matrix of p-values for the estimates in gammas_hat (based on a t-test).
betas_hat_pvals: An m by p2 matrix of p-values for the estimates in betas_hat (based on a t-test).
Q: An m by n matrix of weights used for controlling the heterogeneity of each observation at each feature (activated only if debug == TRUE).
masks: An m by n matrix of logical values indicating whether each observation participated in the statistical testing of each feature (activated only if debug == TRUE).
fphi_hat: An m by n matrix containing the entire estimated moment condition variance for each feature. Note that observations that are considered outliers by any of the criteria will be marked as -1 in the estimated moment condition variance (activated only if debug == TRUE).
phi_hat: An m by k+p1*k+p2 matrix containing the entire estimated effect sizes (including those on source weights) for each feature (activated only if debug == TRUE).
phi_se: An m by k+p1*k+p2 matrix containing the estimated standard errors associated with phi_hat for each feature (activated only if debug == TRUE).
phi_hat_pvals: An m by k+p1*k+p2 matrix containing the p-values associated with phi_hat for each feature (activated only if debug == TRUE).

Details

Under no distribution assumption, we can solve the following weighted least squares problem, which is similar to the heteroskedastic regression view described in association_parametric. $$\hat{\phi_j}^{\text{asym}} = \text{argmin}_{\phi_j} (x_j - S\phi_j) ^T Q_j (x_j - S\phi_j)$$

$S$ denotes the design matrix formed by stacking samples in the rows and the independent variables $\{\{w_i\}, \{w_i c_i^{(1)}\}, \{c_i^{(2)}\}\}$ in the columns. $\phi_j$ denotes the corresponding effect sizes of these variables. $Q_j$ denotes the feature-specific weighting scheme. Similar to the parametric counterpart, $Q_j=\text{diag}(q_{1j}^2,...,q_{nj}^2)$, where the weight of each sample $i$ is the inverse of its estimated variance in the mixture: $q_{ij}^2 = \frac{1}{\text{sum}(w_i w_i^T \odot \hat{\Sigma}_j)}$. Marginal testing can thus be carried out on each variable via the asymptotic distribution of the estimator $\hat{\phi_j}^{\text{asym}}$.
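
The weighted least squares problem above has the closed-form solution $(S^T Q_j S)^{-1} S^T Q_j x_j$, which can be sketched in base R for a single feature. The following is a minimal toy example under stated assumptions, not the package's implementation: all inputs (W, c1, c2, Sigma_j, x_j, phi_true) are simulated and hypothetical, Sigma_j is treated as known rather than estimated, and the outlier filtering and the computation of asymptotic standard errors are omitted.

```r
# Toy sketch of the weighted least squares estimate for one feature j.
# All objects below are simulated; none of this is Unico's internal code.
set.seed(1)
n <- 200  # observations
k <- 3    # sources

# Source weights w_i (rows sum to 1), one C1 covariate and one C2 covariate.
W <- matrix(rexp(n * k), n, k)
W <- W / rowSums(W)
c1 <- rnorm(n)
c2 <- rnorm(n)

# Design matrix S with columns {w_i}, {w_i * c_i^(1)}, {c_i^(2)}.
S <- cbind(W, W * c1, c2)

# A positive definite k by k covariance for feature j (assumed known here).
A <- matrix(rnorm(k * k), k, k)
Sigma_j <- crossprod(A) + diag(k)

# q_ij^2 = 1 / sum(w_i w_i^T * Sigma_j), which equals 1 / (w_i^T Sigma_j w_i).
q2 <- 1 / rowSums((W %*% Sigma_j) * W)

# Simulate x_j with sample-specific variance 1 / q_ij^2.
phi_true <- c(1, 2, 3, 0.5, 0, -0.5, 0.3)
x_j <- as.vector(S %*% phi_true + rnorm(n, sd = sqrt(1 / q2)))

# Closed form: phi_hat = (S^T Q S)^{-1} S^T Q x, with Q = diag(q2).
SQ <- S * q2  # scales row i of S by q2[i], i.e. Q %*% S
phi_hat <- solve(crossprod(S, SQ), crossprod(SQ, x_j))

# The same estimate via the weighted fitting routine in the stats package.
phi_ref <- lm.wfit(S, x_j, w = q2)$coefficients
```

Here phi_hat has k + p1*k + p2 = 7 entries, matching the column layout documented for phi_hat, and agrees with phi_ref up to numerical precision.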

Examples

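# Simulate a small dataset: 100 observations, 2 features, 3 sources
data = simulate_data(n=100, m=2, k=3, p1=1, p2=1, taus_std=0, log_file=NULL)
res = list()
# Fit the Unico model, then run the asymptotic tests on the same X
res$params.hat = Unico(data$X, data$W, data$C1, data$C2, parallel=FALSE, log_file=NULL)
res$params.hat = association_asymptotic(data$X, res$params.hat, parallel=FALSE, log_file=NULL)
# Results are stored under the key given by slot_name ("asymptotic" by default)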