Last updated: 2018-05-15
workflowr checks: Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results. The global environment was empty when the code was run; objects defined in the global environment can affect the analysis in the R Markdown file in unpredictable ways, so for reproducibility it is best to always run the code in an empty environment. The command set.seed(12345) was run prior to running the code in the R Markdown file; setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible. The operating system, R version, and package versions were recorded, which is critical for reproducibility. The results were committed with wflow_publish or wflow_git_commit. Note that workflowr only checks the R Markdown file; you know whether there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files: analysis caches (analysis/*_cache/), .DS_Store and .Rhistory files, .Rproj.user/, data/LSI/, and output/fig/.
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | e05bc83 | LSun | 2018-05-12 | Update to 1.0 |
Rmd | cc0ab83 | Lei Sun | 2018-05-11 | update |
html | 0f36d99 | LSun | 2017-12-21 | Build site. |
html | 853a484 | LSun | 2017-11-07 | Build site. |
html | 6639968 | LSun | 2017-11-05 | transfer |
Rmd | 7344c2d | LSun | 2017-03-30 | constraints |
html | 7344c2d | LSun | 2017-03-30 | constraints |
Rmd | b7aaa28 | LSun | 2017-03-29 | debug |
html | b7aaa28 | LSun | 2017-03-29 | debug |
Rmd | c803b43 | LSun | 2017-03-27 | fitting and write-up |
html | c803b43 | LSun | 2017-03-27 | fitting and write-up |
Rmd | 3adfbee | LSun | 2017-03-23 | fit |
Code version: 388e65e
Correlation among summary statistics can distort hypothesis testing and signal estimation by making the observed statistics deviate from their theoretical distribution. See previous simulations (here, here, and here) for illustration.
Efron, in a series of publications, proposed an “empirical null” idea to tackle the correlation among z-scores in the simultaneous inference setting. Essentially, he assumed that the z-scores from the null hypotheses among all hypotheses simultaneously considered follow an “empirical null” distribution that differs from their marginal N(0, 1) but is still a normal, with estimated mean and standard deviation, thus producing what Efron terms “correlation-induced inflation.”
However, as Matthew first noted, this does not seem to hold true in our RNA-seq gene expression study. After transforming the original count data to z-scores using limma and edgeR (see here for the pipeline), these z-scores have a N(0, 1) marginal distribution, but correlation makes their empirical distribution far from normal. Normality would suggest that all observations are inflated at the same time; yet oftentimes the more moderate observations tend to be inflated, but not the more extreme ones. Indeed, if anything, the most extreme observations tend to be deflated. See previous simulations (here and here) for illustrations.
In his response to Efron’s paper, Schwartzman pointed out that “the core issue… is the behavior of a large collection of correlated normal variables.” Let z-scores z_1, \ldots, z_n be N(0, 1) variables with pairwise correlations \rho_{ij} = \operatorname{cor}(z_i, z_j), and let

\hat{F}(x) = \frac{1}{n}\sum\limits_{i = 1}^n 1(z_i \geq x)

be the right-sided empirical cdf of all z-scores. Schwartzman showed that as the number of null hypotheses n \to \infty, \hat{F}(x) approximates a random function F_0(x) generated as

F_0(x) = \Phi^+(x) + \sum\limits_{k = 1}^\infty W_k\varphi^{(k - 1)}(x)

where \Phi^+(x) = 1 - \Phi(x) is the right-sided cdf of N(0, 1) and \varphi^{(k - 1)}(x) is the (k - 1)^\text{th} order derivative of the pdf of N(0, 1). The W_k’s are independent random variables with mean 0 and variance \operatorname{var}(W_k) = \alpha_k / k!, where

\alpha_k = \frac{2}{n(n - 1)}\sum\limits_{i < j}\rho_{ij}^k = \int_{-1}^1 \rho^k \, dG(\rho)

is the k^\text{th} moment of G(\rho), the empirical distribution of all the n(n - 1)/2 pairwise correlations \rho_{ij}.
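As a quick numerical illustration of this definition (a Python sketch with NumPy, rather than the R used in this analysis; the data matrix and its dimensions are hypothetical), \alpha_k computed by the sum formula agrees with the k^\text{th} empirical moment of the pairwise correlations, and for moderate correlations the even moments decay quickly:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 variables measured on 20 samples, so the
# empirical pairwise correlations are noisy but moderate on average.
X = rng.standard_normal((100, 20))
R = np.corrcoef(X)
n = R.shape[0]
rho = R[np.triu_indices(n, k=1)]  # the n(n-1)/2 pairwise correlations rho_ij

for k in range(1, 6):
    # alpha_k via the sum formula ...
    alpha_k = 2.0 / (n * (n - 1)) * np.sum(rho ** k)
    # ... equals the k-th moment of G(rho), the empirical distribution of rho_ij
    assert np.isclose(alpha_k, np.mean(rho ** k))

# var(W_k) = alpha_k / k!: even moments of moderate correlations decay fast
assert np.mean(rho ** 4) / math.factorial(4) < np.mean(rho ** 2) / math.factorial(2)
```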
Differentiating F_0(x) gives

f_0(x) = \varphi(x) + \sum\limits_{k = 1}^\infty W_k\varphi^{(k)}(x)

which can be seen as approximating the density of the observed distribution of correlated z-scores when their number n is large; in other words, f_0(x) describes the histogram of those z-scores.
f_0(x) involves the Gaussian derivatives \varphi^{(k)}, many of whose properties have been thoroughly studied. For starters, they are closely related to the probabilists’ Hermite polynomials h_k in that

\varphi^{(k)}(x) = (-1)^k h_k(x)\varphi(x)

where h_k(x) is defined by

h_k(x) = (-1)^k e^{\frac{x^2}{2}}\frac{d^k}{dx^k}e^{-\frac{x^2}{2}} = \left(x - \frac{d}{dx}\right)^k \cdot 1

This relationship gives

f_0(x) = \varphi(x) + \sum\limits_{k = 1}^\infty W_k(-1)^k h_k(x)\varphi(x) = \varphi(x)\left(1 + \sum\limits_{k = 1}^\infty W_kh_k(x)\right) = \sum\limits_{k = 0}^\infty W_kh_k(x)\varphi(x)

The simplification comes from two observations: since the W_k’s are zero-mean independent random variables, all the factors (-1)^k can be absorbed into the W_k’s; and letting W_0 \equiv 1 and using the fact that h_0(x) \equiv 1, f_0(x) can be written in a more compact form.
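The relation \varphi^{(k)}(x) = (-1)^k h_k(x)\varphi(x) can be checked mechanically (a Python sketch using NumPy’s probabilists’ Hermite module, `hermite_e`): since \varphi'(x) = -x\varphi(x), writing \varphi^{(k)}(x) = p_k(x)\varphi(x) gives the recursion p_{k+1} = p_k' - x\,p_k, and the resulting polynomials match (-1)^k h_k order by order:

```python
import numpy as np
from numpy.polynomial import polynomial as P
from numpy.polynomial import hermite_e as He

# phi^{(k)}(x) = p_k(x) * phi(x); differentiating once more gives
# p_{k+1}(x) = p_k'(x) - x * p_k(x), starting from p_0 = 1.
p = np.array([1.0])
for k in range(8):
    # power-basis coefficients of the probabilists' Hermite polynomial h_k
    h_k = He.herme2poly([0.0] * k + [1.0])
    assert np.allclose(np.pad(p, (0, len(h_k) - len(p))), (-1) ** k * h_k)
    p = P.polysub(P.polyder(p), P.polymulx(p))
```

The check is not circular: the left-hand side uses only repeated differentiation of \varphi, while the right-hand side uses NumPy’s independent Hermite coefficients.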
The probabilists’ Hermite polynomials are orthogonal with respect to \varphi(x), the pdf of N(0, 1). That is,

\int_{-\infty}^\infty h_k(x)h_l(x)\varphi(x)dx = \begin{cases} k! & k = l\\ 0 & k \neq l \end{cases}

This orthogonality gives us

W_k = \frac{1}{k!}\int_{-\infty}^\infty f_0(x)h_k(x)dx

With the W_k’s defined and obtained this way,
\begin{aligned} F_0(x) &= \int_{-\infty}^x f_0(u)\,du = \int_{-\infty}^x \sum\limits_{k = 0}^\infty W_kh_k(u)\varphi(u)\,du = \int_{-\infty}^x \sum\limits_{k = 0}^\infty W_k(-1)^k(-1)^kh_k(u)\varphi(u)\,du\\ &= \int_{-\infty}^x \sum\limits_{k = 0}^\infty W_k(-1)^k\varphi^{(k)}(u)\,du = \int_{-\infty}^x \varphi(u)\,du + \sum\limits_{k = 1}^\infty W_k(-1)^k\int_{-\infty}^x \varphi^{(k)}(u)\,du\\ &= \Phi(x) + \sum\limits_{k = 1}^\infty W_k(-1)^k\varphi^{(k - 1)}(x) = \Phi(x) + \sum\limits_{k = 1}^\infty W_k(-1)^k(-1)^{k - 1}h_{k - 1}(x)\varphi(x)\\ &= \Phi(x) - \sum\limits_{k = 1}^\infty W_kh_{k - 1}(x)\varphi(x) \end{aligned}
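Both identities used in this derivation — the orthogonality \int h_kh_l\varphi\,dx = k!\,1(k = l) and the antiderivative step \int_{-\infty}^x \varphi^{(k)}(u)\,du = \varphi^{(k-1)}(x), equivalently \int_{-\infty}^x h_k(u)\varphi(u)\,du = -h_{k-1}(x)\varphi(x) — can be spot-checked by numerical quadrature (a Python sketch with NumPy/SciPy):

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He
from scipy.integrate import quad
from scipy.stats import norm

def h(k, x):
    """Probabilists' Hermite polynomial h_k evaluated at x."""
    return He.hermeval(x, [0.0] * k + [1.0])

# Orthogonality: integral of h_k h_l phi over the real line is k! if k == l, else 0
for k in range(4):
    for l in range(4):
        val, _ = quad(lambda x: h(k, x) * h(l, x) * norm.pdf(x), -np.inf, np.inf)
        assert abs(val - (math.factorial(k) if k == l else 0.0)) < 1e-6

# Antiderivative: integral_{-inf}^x of h_k(u) phi(u) du equals -h_{k-1}(x) phi(x)
for k in range(1, 5):
    for x0 in (-1.5, 0.0, 2.0):
        lhs, _ = quad(lambda u: h(k, u) * norm.pdf(u), -np.inf, x0)
        assert abs(lhs - (-h(k - 1, x0) * norm.pdf(x0))) < 1e-6
```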
The expectation of W_k^2 vanishes at the rate of 1/k!, while the average of h_k^2 grows at the rate of k!, so we can define “scaled” versions of these functions and quantities:

\varphi_s^{(k)} = \frac{1}{\sqrt{k!}}\varphi^{(k)}, \quad h_k^s(x) = \frac{1}{\sqrt{k!}}h_k(x), \quad W_k^s = \sqrt{k!}\,W_k

The scaled Gaussian derivatives \varphi_s^{(k)} and Hermite polynomials h_k^s(x) then remain on essentially the same order of magnitude as the order k increases. Therefore, it may make more sense to think of certain quantities in terms of their scaled versions:

f_0(x) = \varphi(x)\left(1 + \sum\limits_{k = 1}^\infty W_kh_k(x)\right) = \varphi(x)\left(1 + \sum\limits_{k = 1}^\infty W_k^sh_k^s(x)\right)
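The claim that the scaled versions stay on a comparable magnitude can be made concrete: by the orthogonality above, E[h_k(Z)^2] = k! for Z \sim N(0, 1), so E[h_k^s(Z)^2] = 1 for every order k. A Monte Carlo spot check (a Python sketch with NumPy; the sample size and tolerance are arbitrary choices for illustration):

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

rng = np.random.default_rng(12345)
z = rng.standard_normal(200_000)  # Z ~ N(0, 1)

for k in range(1, 6):
    h_k = He.hermeval(z, [0.0] * k + [1.0])
    # E[(h_k^s(Z))^2] = E[h_k(Z)^2] / k! should be close to 1 for every order k
    second_moment_scaled = np.mean(h_k ** 2) / math.factorial(k)
    assert abs(second_moment_scaled - 1.0) < 0.25
```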
Given z-scores z_1, \ldots, z_n with known marginal distribution N(0, 1) but unknown correlation, their (right-sided) empirical cdf \hat{F}(x) can be observed. These z-scores can then be seen as if they were sampled independently and identically from this observed empirical cdf. Furthermore, when n is sufficiently large, this observed empirical cdf approximates a cdf F_0(x) with density
f_0(x) = \varphi(x)\left(1 + \sum\limits_{k = 1}^\infty W_kh_k(x)\right)

Therefore, with observed z-scores, we can fit f_0 with estimated (realized) W_k. One way of doing that is by maximum likelihood; in particular, we maximize the approximated log-likelihood

\log\prod\limits_{i = 1}^n f_0(z_i) = \sum\limits_{i = 1}^n \log f_0(z_i) = \sum\limits_{i = 1}^n \log\varphi(z_i) + \sum\limits_{i = 1}^n \log\left(1 + \sum\limits_{k = 1}^\infty w_kh_k(z_i)\right)
by solving a constrained optimization problem

\begin{array}{rl} \max\limits_{w_1, w_2, \ldots} & \sum\limits_{i = 1}^n\log\left(1 + \sum\limits_{k = 1}^\infty w_kh_k(z_i)\right)\\ \text{s.t.} & \int_{-\infty}^\infty \varphi(x)\left(1 + \sum\limits_{k = 1}^\infty w_kh_k(x)\right)dx = 1\\ & 1 + \sum\limits_{k = 1}^\infty w_kh_k(x) \geq0, \quad \forall x\in\mathbb{R} \end{array}

The first constraint is automatically satisfied for all w: \int_{-\infty}^\infty h_k(x)\varphi(x)dx = 0 for any k\geq1, since h_k and h_0 \equiv1 are orthogonal with respect to \varphi.
The second constraint and the objective are intractable in their present form because of the involvement of \infty (an infinite sum over k, and all x\in\mathbb{R}). Therefore we need to make two further key assumptions, as follows.
First, we assume that the pairwise correlations \rho_{ij} are on average moderate enough that the influence of W_kh_k vanishes for sufficiently large k. Note that this does not hold true in general. Recall that

\text{var}(W_k) = \alpha_k / k! = \overline{\rho_{ij}^k} / k! \leq 1/k! \ ,

so it seems W_k^2 vanishes at a factorial rate; yet on the other hand,

\int_{-\infty}^\infty h_k^2(x)\varphi(x)dx = k!

so h_k^2 also scales at a factorial rate. Therefore, it is not at all clear that the influence of W_kh_k decays in the most general case. The decay of the higher-order terms relies largely on the decay of \alpha_k, the higher-order empirical moments of the pairwise correlations, which characterize \rho_{ij}^k. As a result, we assume here that the \rho_{ij}’s are on average moderate, so that \alpha_k decays faster, leading to faster decay of W_k, and presumably of W_kh_k as well. Ignoring this assumption would cause problems, at least theoretically, as detailed later.
With this assumption, we can stop at a large K; that is, f_0(x) = \varphi(x)\left(1 + \sum\limits_{k = 1}^K W_kh_k(x)\right). We also need to specify a method to choose K in a systematic way.
Second, we assume that the number of observations n is sufficiently large that we can enforce the second constraint at the n observed z-scores in place of the intractable all x\in\mathbb{R}; that is, 1 + \sum\limits_{k = 1}^K w_kh_k(z_i) \geq0. Ignoring this assumption would lead to numerical instability, as detailed later.
With these assumptions, the problem becomes a convex optimization problem:
\begin{array}{rl} \max\limits_{w_1, \ldots, w_K} & \sum\limits_{i = 1}^n\log\left(1 +\sum\limits_{k = 1}^K w_kh_k(z_i)\right)\\ \text{s.t.} & 1 + \sum\limits_{k = 1}^K w_kh_k(z_i) \geq0 \end{array} It can also be written as
\begin{array}{rl} \max\limits_{w} & \sum\log\left(1 +Hw\right)\\ \text{s.t.} & 1 +Hw \geq0 \end{array}
where H_{ik} = h_k(z_i).
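As a minimal numerical sketch of this optimization (in Python with SciPy’s SLSQP solver rather than the R-based pipeline used in this analysis; the truncation K = 4 and the simulated independent z-scores are assumptions for illustration), we maximize \sum\log(1 + Hw) subject to 1 + Hw \geq 0. With independent N(0, 1) z-scores, the fitted w should come out close to zero:

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from scipy.optimize import minimize

rng = np.random.default_rng(12345)
z = rng.standard_normal(1000)  # independent N(0,1) z-scores (no correlation)
K = 4                          # truncation order, chosen for illustration

# Hermite design matrix: H[i, k-1] = h_k(z_i)
H = np.column_stack([He.hermeval(z, [0.0] * k + [1.0]) for k in range(1, K + 1)])

def neg_loglik(w):
    # clipping keeps the objective finite if a line-search step
    # momentarily leaves the feasible set 1 + Hw >= 0
    return -np.sum(np.log(np.clip(1.0 + H @ w, 1e-10, None)))

res = minimize(neg_loglik, np.zeros(K), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda w: 1.0 + H @ w}])

assert res.success
assert np.all(1.0 + H @ res.x > -1e-8)  # fitted density nonnegative at the data
assert np.max(np.abs(res.x)) < 0.2      # independent z-scores: w near zero
```

Starting from the feasible point w = 0 makes the objective well defined at the outset; with correlated z-scores one would instead expect the lower-order w_k to move away from zero to capture the inflation or deflation of the empirical distribution.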
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.0.1 Rcpp_0.12.16 digest_0.6.15
[4] rprojroot_1.3-2 R.methodsS3_1.7.1 backports_1.1.2
[7] git2r_0.21.0 magrittr_1.5 evaluate_0.10.1
[10] stringi_1.1.6 whisker_0.3-2 R.oo_1.21.0
[13] R.utils_2.6.0 rmarkdown_1.9 tools_3.4.3
[16] stringr_1.3.0 yaml_2.1.18 compiler_3.4.3
[19] htmltools_0.3.6 knitr_1.20
This reproducible R Markdown analysis was created with workflowr 1.0.1