Last updated: 2018-05-15

workflowr checks: (Click a bullet for more information)
  • R Markdown file: up-to-date

    Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

  • Environment: empty

    Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

  • Seed: set.seed(12345)

    The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

  • Session information: recorded

    Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

  • Repository version: 388e65e

    Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

    Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
    
    Ignored files:
        Ignored:    .DS_Store
        Ignored:    .Rhistory
        Ignored:    .Rproj.user/
        Ignored:    analysis/.DS_Store
        Ignored:    analysis/BH_robustness_cache/
        Ignored:    analysis/FDR_Null_cache/
        Ignored:    analysis/FDR_null_betahat_cache/
        Ignored:    analysis/Rmosek_cache/
        Ignored:    analysis/StepDown_cache/
        Ignored:    analysis/alternative2_cache/
        Ignored:    analysis/alternative_cache/
        Ignored:    analysis/ash_gd_cache/
        Ignored:    analysis/average_cor_gtex_2_cache/
        Ignored:    analysis/average_cor_gtex_cache/
        Ignored:    analysis/brca_cache/
        Ignored:    analysis/cash_deconv_cache/
        Ignored:    analysis/cash_fdr_1_cache/
        Ignored:    analysis/cash_fdr_2_cache/
        Ignored:    analysis/cash_fdr_3_cache/
        Ignored:    analysis/cash_fdr_4_cache/
        Ignored:    analysis/cash_fdr_5_cache/
        Ignored:    analysis/cash_fdr_6_cache/
        Ignored:    analysis/cash_plots_cache/
        Ignored:    analysis/cash_sim_1_cache/
        Ignored:    analysis/cash_sim_2_cache/
        Ignored:    analysis/cash_sim_3_cache/
        Ignored:    analysis/cash_sim_4_cache/
        Ignored:    analysis/cash_sim_5_cache/
        Ignored:    analysis/cash_sim_6_cache/
        Ignored:    analysis/cash_sim_7_cache/
        Ignored:    analysis/correlated_z_2_cache/
        Ignored:    analysis/correlated_z_3_cache/
        Ignored:    analysis/correlated_z_cache/
        Ignored:    analysis/create_null_cache/
        Ignored:    analysis/cutoff_null_cache/
        Ignored:    analysis/design_matrix_2_cache/
        Ignored:    analysis/design_matrix_cache/
        Ignored:    analysis/diagnostic_ash_cache/
        Ignored:    analysis/diagnostic_correlated_z_2_cache/
        Ignored:    analysis/diagnostic_correlated_z_3_cache/
        Ignored:    analysis/diagnostic_correlated_z_cache/
        Ignored:    analysis/diagnostic_plot_2_cache/
        Ignored:    analysis/diagnostic_plot_cache/
        Ignored:    analysis/efron_leukemia_cache/
        Ignored:    analysis/fitting_normal_cache/
        Ignored:    analysis/gaussian_derivatives_2_cache/
        Ignored:    analysis/gaussian_derivatives_3_cache/
        Ignored:    analysis/gaussian_derivatives_4_cache/
        Ignored:    analysis/gaussian_derivatives_5_cache/
        Ignored:    analysis/gaussian_derivatives_cache/
        Ignored:    analysis/gd-ash_cache/
        Ignored:    analysis/gd_delta_cache/
        Ignored:    analysis/gd_lik_2_cache/
        Ignored:    analysis/gd_lik_cache/
        Ignored:    analysis/gd_w_cache/
        Ignored:    analysis/knockoff_10_cache/
        Ignored:    analysis/knockoff_2_cache/
        Ignored:    analysis/knockoff_3_cache/
        Ignored:    analysis/knockoff_4_cache/
        Ignored:    analysis/knockoff_5_cache/
        Ignored:    analysis/knockoff_6_cache/
        Ignored:    analysis/knockoff_7_cache/
        Ignored:    analysis/knockoff_8_cache/
        Ignored:    analysis/knockoff_9_cache/
        Ignored:    analysis/knockoff_cache/
        Ignored:    analysis/knockoff_var_cache/
        Ignored:    analysis/marginal_z_alternative_cache/
        Ignored:    analysis/marginal_z_cache/
        Ignored:    analysis/mosek_reg_2_cache/
        Ignored:    analysis/mosek_reg_4_cache/
        Ignored:    analysis/mosek_reg_5_cache/
        Ignored:    analysis/mosek_reg_6_cache/
        Ignored:    analysis/mosek_reg_cache/
        Ignored:    analysis/pihat0_null_cache/
        Ignored:    analysis/plot_diagnostic_cache/
        Ignored:    analysis/poster_obayes17_cache/
        Ignored:    analysis/real_data_simulation_2_cache/
        Ignored:    analysis/real_data_simulation_3_cache/
        Ignored:    analysis/real_data_simulation_4_cache/
        Ignored:    analysis/real_data_simulation_5_cache/
        Ignored:    analysis/real_data_simulation_cache/
        Ignored:    analysis/rmosek_primal_dual_2_cache/
        Ignored:    analysis/rmosek_primal_dual_cache/
        Ignored:    analysis/seqgendiff_cache/
        Ignored:    analysis/simulated_correlated_null_2_cache/
        Ignored:    analysis/simulated_correlated_null_3_cache/
        Ignored:    analysis/simulated_correlated_null_cache/
        Ignored:    analysis/simulation_real_se_2_cache/
        Ignored:    analysis/simulation_real_se_cache/
        Ignored:    analysis/smemo_2_cache/
        Ignored:    data/LSI/
        Ignored:    docs/.DS_Store
        Ignored:    docs/figure/.DS_Store
        Ignored:    output/fig/
    
    
    Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
Expand here to see past versions:
    File Version Author Date Message
    html e05bc83 LSun 2018-05-12 Update to 1.0
    Rmd cc0ab83 Lei Sun 2018-05-11 update
    html 0f36d99 LSun 2017-12-21 Build site.
    html 853a484 LSun 2017-11-07 Build site.
    html 6639968 LSun 2017-11-05 transfer
    Rmd 7344c2d LSun 2017-03-30 constraints
    html 7344c2d LSun 2017-03-30 constraints
    Rmd b7aaa28 LSun 2017-03-29 debug
    html b7aaa28 LSun 2017-03-29 debug
    Rmd c803b43 LSun 2017-03-27 fitting and write-up
    html c803b43 LSun 2017-03-27 fitting and write-up
    Rmd 3adfbee LSun 2017-03-23 fit

Last updated: 2018-05-15

Code version: 388e65e

Empirical null

Correlation among summary statistics can distort hypothesis testing and signal estimation by making the observed statistics deviate from their theoretical distribution. See previous simulations (here, here, and here) for illustration.

Efron is his series of publications proposed an “empirical null” idea to tackle the correlation among \(z\) scores in the simultaneous inference setting. Essentially, he assumed that \(z\) scores from those null hypotheses among all hypotheses simultaneously considered should follow an “empirical null” distribution that’s different from their marginal \(N(0, 1)\), but still a normal with estimated mean and standard deviation, thus to produce what Efron terms as the “correlation induced inflation.”

However, as Matthew first noted, this doesn’t seem to hold true in our RNA-seq gene expression study. After transforming the original count data to \(z\) scores using limma and edgeR (see here for the pipeline), these \(z\) scores will have a \(N(0, 1)\) marginal distribution, but correlation makes their empirical distribution far from normal. Normality suggests that all observations should be inflated at the same time, yet oftentimes, the more moderate observations tend to be inflated, but not the more extreme ones. Indeed, if anything, the most extreme observations tend to be deflated. See previous simulations (here and here) for illustrations.

In his response to Efron’s paper, Schwartzman pointed out that “the core issue… is the behavior of a large collection of correlated normal variables.” Let \(z\) scores \(z_1, \ldots, z_n\) be \(N(0, 1)\) variables with pairwise correlations \(\rho_{ij} = \text{cor}(z_i, z_j)\), and \(\hat F(x) = \frac1n\sum\limits_{i=1}^n1(z_i\geq x)\), the right-sided empirical cdf of all \(z\) scores. Schwartzman showed that when the number of null hypothesis \(n\to\infty\), \(\hat F(x)\) approximates a random function \(F_0(x)\) generated as

\[ F_0(x) = \Phi^+(x) + \sum\limits_{k = 1}^\infty W_k\varphi^{(k - 1)}(x) \]

where \(\Phi^+(x) = 1 - \Phi(x)\) is the right-sided cdf of \(N(0, 1)\) and \(\varphi^{(k - 1)}(x)\) is the \((k - 1)^\text{th}\) order derivative of the pdf of \(N(0,1)\). \(W_k\)’s are independent random variables with mean \(0\) and variance \(\text{var}(W_k) = \alpha_k/k!\), where

\[ \alpha_k = \frac{2}{n(n-1)}\sum\limits_{i<j}\rho_{ij}^k = \int_{-1}^{1}\rho^kdG(\rho) \] is the \(k^\text{th}\) moment of \(G(\rho)\), the empirical distribution of all the \(n(n-1)/2\) pairwise correlations \(\rho_{ij}\).

Differentiating \(F_0(x)\) gives

\[ f_0(x) = \varphi(x) + \sum\limits_{k = 1}^\infty W_k\varphi^{(k)}(x) \] which can be seen as approximating the density of the observed distribution of correlated \(z\) scores when the number of them \(n\) is large. Or in other words, \(f_0(x)\) can be seen as to describe the histogram of those \(z\) scores.

Gaussian derivatives and Hermite polynomials

\(f_0(x)\) involves Gaussian derivatives \(\varphi^{(k)}\), and many properties of them have been thoroughly studied. For starters, they are closely related to the probabilists’ Hermite polynomials \(h_k\) in that

\[ \varphi^{(k)}(x) = (-1)^kh_k(x)\varphi(x) \] where \(h_k(x)\) is defined by its generating function

\[ \begin{array}{rcl} h_k(x) &=& (-1)^ke^{\frac{x^2}{2}}\frac{d^k}{dx^k} e^{-\frac{x^2}{2}}\\ &=& \left(x - \frac{d}{dx}\right)^k\cdot1 \end{array} \] The relationship gives

\[ \begin{array}{rcl} f_0(x) &=& \varphi(x) + \sum\limits_{k = 1}^\infty W_k(-1)^kh_k(x)\varphi(x)\\ &=& \varphi(x)\left(1 + \sum\limits_{k = 1}^\infty W_kh_k(x)\right)\\ &=& \sum\limits_{k = 0}^\infty W_kh_k(x)\varphi(x) \end{array} \] The simplification comes from the observation that \(W_k\)’s are zero mean independent random variables, so all \((-1)^k\) can be absorbed into \(W_k\). Furthermore, let \(W_0\equiv1\), and use the fact that \(h_0(x)\equiv1\), so \(f_0(x)\) can be written in a more compact way.

Probabilists’ Hermite polynomials are orthogonal with respect to \(\varphi(x)\), the pdf of \(N(0, 1)\). That is,

\[ \int_{-\infty}^\infty h_k(x)h_l(x)\varphi(x)dx = \begin{cases} k! & k = l \\0 & k\neq l \end{cases} \] This orthogonality gives us

\[ W_k = \frac{1}{k!}\int_{-\infty}^\infty f_0(x)h_k(x)dx \] With \(W_k\)’s defined and obtained this way,

\[ \begin{array}{rcl} F_0(x) &=& \displaystyle\int_{-\infty}^x f_0(u)du \\ &=& \displaystyle\int_{-\infty}^x \sum\limits_{k = 0}^\infty W_k h_k(u)\varphi(u)du\\ &=& \displaystyle\int_{-\infty}^x \sum\limits_{k = 0}^\infty W_k(-1)^k(-1)^k h_k(u)\varphi(u)du\\ &=& \displaystyle\int_{-\infty}^x \sum\limits_{k = 0}^\infty W_k(-1)^k \varphi^{(k)}(u)du\\ &=& \displaystyle\int_{-\infty}^x \varphi(u)du + \sum\limits_{k = 1}^\infty W_k(-1)^k \displaystyle\int_{-\infty}^x\varphi^{(k)}(u)du\\ &=& \displaystyle\Phi(x) + \sum\limits_{k = 1}^\infty W_k(-1)^k \varphi^{(k - 1)}(x)\\ &=& \displaystyle\Phi(x) + \sum\limits_{k = 1}^\infty W_k(-1)^k(-1)^{k - 1} h_{k-1}(x)\varphi(x)\\ &=& \displaystyle\Phi(x) - \sum\limits_{k = 1}^\infty W_kh_{k-1}(x)\varphi(x) \end{array} \]

Illustration

First orders of Gaussian derivatives and Hermite polynomials

Scaled and unscaled Gaussian derivatives

The expectation of \(W_k^2\) vanishes in the order of \(k!\), and the average \(h_k^2\) also scales in the order of \(k!\), and we can define “scaled” version of these functions and quantities:

\[ \begin{array}{c} \varphi_s^{(k)} = \frac1{\sqrt{k!}}\varphi^{(k)}\\ h_k^s(x) = \frac1{\sqrt{k!}}h_k(x)\\ W_k^s = \sqrt{k!}W_k \end{array} \]

We can see that the scaled Gaussian derivatives (\(\varphi_s^{(k)}\)) and Hermite polynomials \(h_k^s(x)\) are basically in the same order of magnitude for increasing orders. Therefore, it might make more sense to think of certain quantities in terms of their scaled versions

\[ \begin{array} {rcl} f_0(x) &=& \varphi(x)\left(1 + \sum\limits_{k = 1}^\infty W_kh_k(x)\right)\\ &= & \varphi(x)\left(1 + \sum\limits_{k = 1}^\infty W_k^sh_k^s(x)\right) \end{array} \]

Fitting the empirical null

Given \(z\) scores \(z_1, \ldots, z_n\) with known marginal distribution \(N(0, 1)\) but unknown correlation, their (right-sided) empirical cdf \(\hat F(x)\) can be observed. Then these \(z\) scores can be seen as if they are independently and identically sampled from this observed empirical cdf.

Furthermore, when \(n\) is sufficiently large, this observed empirical cdf approximates a cdf \(F_0(x)\) which has a density

\[ f_0(x) = \varphi(x)\left(1 + \sum\limits_{k = 1}^\infty W_kh_k(x)\right) \] Therefore, with observed \(z\) scores, we can fit \(f_0\) with estimated (realized) \(W_k\). One way of doing that is by maxmimum likelihood. In particular, we maximize the approximated log-likelihood

\[\log\prod\limits_{i = 1}^nf_0(z_i) = \sum\limits_{i = 1}^n\log f_0(z_i) = \sum\limits_{i = 1}^n\log \varphi(z_i) + \sum\limits_{i = 1}^n\log\left(1 +\sum\limits_{k = 1}^\infty w_kh_k(z_i)\right) \]

by a constrained optimization problem

\[ \begin{array}{rl} \max\limits_{w_1, w_2, \ldots} & \sum\limits_{i = 1}^n\log\left(1 +\sum\limits_{k = 1}^\infty w_kh_k(z_i)\right)\\ \text{s.t.} & \int_{-\infty}^{\infty}f_0(x)dx=1\Leftrightarrow \sum\limits_{k = 1}^\infty w_k\int_{-\infty}^\infty h_k(x)\varphi(x)dx = 0\\ & f_0(x) \geq 0 \Leftrightarrow 1 + \sum\limits_{k = 1}^\infty w_kh_k(x) \geq0, \forall x\in\mathbb{R} \end{array} \] The first constraint is self-satisfied for all \(w\): \(\int_{-\infty}^\infty h_k(x)\varphi(x)dx = 0\) for any \(k\geq1\) since \(h_k\) and \(h_0 \equiv1\) are orthogonal with respect to \(\varphi\).

The second constraint and the objective are intractable in the present form because of the involvement of \(\infty\). Therefore we need to make another two key assumptions as follows.

Assumption 1: the pairwise correlation \(\rho_{ij}\) are on average moderate enough, such that the influence of \(W_kh_k\) will vanish for sufficiently large \(k\).

We assume that the pairwise correlation \(\rho_{ij}\) are on average moderate enough, such that the influence of \(W_kh_k\) will vanish for sufficiently large \(k\). Note that it doesn’t hold true in general. Recall that

\[ \text{var}(W_k) = \alpha_k / k! = \bar{\rho_{ij}^k} / k! \leq 1/k! \ , \] so it seems \(W_k^2\) will vanish in factorial order, yet on the other hand,

\[ \int_{-\infty}^\infty h_k^2(x)\varphi(x)dx = k! \]

so \(h_k^2\) also scales in factorial order. Therefore, it’s not at all clear that the influence of \(W_kh_k\) will decay in the most general case. So the decay of higher orders largely rely on the decay of higher order empirical moments of pairwise correlation \(\alpha_k\), which characterizes \(\rho_{ij}^k\).

As a result, we here assume that \(\rho_{ij}\)’s are on average moderate, so \(\alpha_k\) decays faster, leading to faster decaying of \(W_k\), probably as well as that of \(W_kh_k\). Ignoring this assumption would cause problem at least theoretically, as detailed later.

With this assumption, we can stop at a large \(K\); that is, \(f_0(x) = \varphi(x)\left(1 + \sum\limits_{k = 1}^K W_kh_k(x)\right)\). We also need to specify a method to choose \(K\) in a systematic way.

Assumption 2: we further assume that the number of observations \(n\) is sufficiently large, so that we can use the \(n\) observed \(z\) scores in place of the intractable all \(x\in\mathbb{R}\) in the second constraint.

We further assume that the number of observations \(n\) is sufficiently large, so that we can use the \(n\) observed \(z\) scores in place of the intractable all \(x\in\mathbb{R}\) in the second constraint.

That is, \(1 + \sum\limits_{k = 1}^K w_kh_k(z_i) \geq0\). Ignoring this assumption would also lead to numerical instability, as detailed later.

Convex optimization

With these assumptions, the problem is now a convex optimization.

\[ \begin{array}{rl} \max\limits_{w_1, \ldots, w_K} & \sum\limits_{i = 1}^n\log\left(1 +\sum\limits_{k = 1}^K w_kh_k(z_i)\right)\\ \text{s.t.} & 1 + \sum\limits_{k = 1}^K w_kh_k(z_i) \geq0 \end{array} \] It can also be written as

\[ \begin{array}{rl} \max\limits_{w} & \sum\log\left(1 +Hw\right)\\ \text{s.t.} & 1 +Hw \geq0 \end{array} \]

where \(H_{ik} = h_k(z_i)\).

Session Information

Session information

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.0.1   Rcpp_0.12.16      digest_0.6.15    
 [4] rprojroot_1.3-2   R.methodsS3_1.7.1 backports_1.1.2  
 [7] git2r_0.21.0      magrittr_1.5      evaluate_0.10.1  
[10] stringi_1.1.6     whisker_0.3-2     R.oo_1.21.0      
[13] R.utils_2.6.0     rmarkdown_1.9     tools_3.4.3      
[16] stringr_1.3.0     yaml_2.1.18       compiler_3.4.3   
[19] htmltools_0.3.6   knitr_1.20       

This reproducible R Markdown analysis was created with workflowr 1.0.1