Last updated: 2018-05-15

workflowr checks:
  • R Markdown file: up-to-date

    Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

  • Environment: empty

    Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

  • Seed: set.seed(12345)

    The command set.seed(12345) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

  • Session information: recorded

    Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

  • Repository version: 388e65e

    Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

    Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
    
    Ignored files:
        Ignored:    .DS_Store
        Ignored:    .Rhistory
        Ignored:    .Rproj.user/
        Ignored:    analysis/.DS_Store
        Ignored:    analysis/BH_robustness_cache/
        Ignored:    analysis/FDR_Null_cache/
        Ignored:    analysis/FDR_null_betahat_cache/
        Ignored:    analysis/Rmosek_cache/
        Ignored:    analysis/StepDown_cache/
        Ignored:    analysis/alternative2_cache/
        Ignored:    analysis/alternative_cache/
        Ignored:    analysis/ash_gd_cache/
        Ignored:    analysis/average_cor_gtex_2_cache/
        Ignored:    analysis/average_cor_gtex_cache/
        Ignored:    analysis/brca_cache/
        Ignored:    analysis/cash_deconv_cache/
        Ignored:    analysis/cash_fdr_1_cache/
        Ignored:    analysis/cash_fdr_2_cache/
        Ignored:    analysis/cash_fdr_3_cache/
        Ignored:    analysis/cash_fdr_4_cache/
        Ignored:    analysis/cash_fdr_5_cache/
        Ignored:    analysis/cash_fdr_6_cache/
        Ignored:    analysis/cash_plots_cache/
        Ignored:    analysis/cash_sim_1_cache/
        Ignored:    analysis/cash_sim_2_cache/
        Ignored:    analysis/cash_sim_3_cache/
        Ignored:    analysis/cash_sim_4_cache/
        Ignored:    analysis/cash_sim_5_cache/
        Ignored:    analysis/cash_sim_6_cache/
        Ignored:    analysis/cash_sim_7_cache/
        Ignored:    analysis/correlated_z_2_cache/
        Ignored:    analysis/correlated_z_3_cache/
        Ignored:    analysis/correlated_z_cache/
        Ignored:    analysis/create_null_cache/
        Ignored:    analysis/cutoff_null_cache/
        Ignored:    analysis/design_matrix_2_cache/
        Ignored:    analysis/design_matrix_cache/
        Ignored:    analysis/diagnostic_ash_cache/
        Ignored:    analysis/diagnostic_correlated_z_2_cache/
        Ignored:    analysis/diagnostic_correlated_z_3_cache/
        Ignored:    analysis/diagnostic_correlated_z_cache/
        Ignored:    analysis/diagnostic_plot_2_cache/
        Ignored:    analysis/diagnostic_plot_cache/
        Ignored:    analysis/efron_leukemia_cache/
        Ignored:    analysis/fitting_normal_cache/
        Ignored:    analysis/gaussian_derivatives_2_cache/
        Ignored:    analysis/gaussian_derivatives_3_cache/
        Ignored:    analysis/gaussian_derivatives_4_cache/
        Ignored:    analysis/gaussian_derivatives_5_cache/
        Ignored:    analysis/gaussian_derivatives_cache/
        Ignored:    analysis/gd-ash_cache/
        Ignored:    analysis/gd_delta_cache/
        Ignored:    analysis/gd_lik_2_cache/
        Ignored:    analysis/gd_lik_cache/
        Ignored:    analysis/gd_w_cache/
        Ignored:    analysis/knockoff_10_cache/
        Ignored:    analysis/knockoff_2_cache/
        Ignored:    analysis/knockoff_3_cache/
        Ignored:    analysis/knockoff_4_cache/
        Ignored:    analysis/knockoff_5_cache/
        Ignored:    analysis/knockoff_6_cache/
        Ignored:    analysis/knockoff_7_cache/
        Ignored:    analysis/knockoff_8_cache/
        Ignored:    analysis/knockoff_9_cache/
        Ignored:    analysis/knockoff_cache/
        Ignored:    analysis/knockoff_var_cache/
        Ignored:    analysis/marginal_z_alternative_cache/
        Ignored:    analysis/marginal_z_cache/
        Ignored:    analysis/mosek_reg_2_cache/
        Ignored:    analysis/mosek_reg_4_cache/
        Ignored:    analysis/mosek_reg_5_cache/
        Ignored:    analysis/mosek_reg_6_cache/
        Ignored:    analysis/mosek_reg_cache/
        Ignored:    analysis/pihat0_null_cache/
        Ignored:    analysis/plot_diagnostic_cache/
        Ignored:    analysis/poster_obayes17_cache/
        Ignored:    analysis/real_data_simulation_2_cache/
        Ignored:    analysis/real_data_simulation_3_cache/
        Ignored:    analysis/real_data_simulation_4_cache/
        Ignored:    analysis/real_data_simulation_5_cache/
        Ignored:    analysis/real_data_simulation_cache/
        Ignored:    analysis/rmosek_primal_dual_2_cache/
        Ignored:    analysis/rmosek_primal_dual_cache/
        Ignored:    analysis/seqgendiff_cache/
        Ignored:    analysis/simulated_correlated_null_2_cache/
        Ignored:    analysis/simulated_correlated_null_3_cache/
        Ignored:    analysis/simulated_correlated_null_cache/
        Ignored:    analysis/simulation_real_se_2_cache/
        Ignored:    analysis/simulation_real_se_cache/
        Ignored:    analysis/smemo_2_cache/
        Ignored:    data/LSI/
        Ignored:    docs/.DS_Store
        Ignored:    docs/figure/.DS_Store
        Ignored:    output/fig/
    
    
    Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
Past versions:
    File Version Author Date Message
    html ddf9062 LSun 2018-05-12 Update to 1.0
    rmd cc0ab83 Lei Sun 2018-05-11 update
    html 0f36d99 LSun 2017-12-21 Build site.
    html 853a484 LSun 2017-11-07 Build site.
    html 043bf89 LSun 2017-11-05 transfer
    html 321f66a LSun 2017-05-30 alternative
    rmd 0f49e8a LSun 2017-03-31 alternative simulation
    html 0f49e8a LSun 2017-03-31 alternative simulation
    rmd 2a9b0b7 LSun 2017-03-30 weights
    html 2a9b0b7 LSun 2017-03-30 weights
    rmd 3c146da LSun 2017-03-30 N(0, 2) pdf
    html 3c146da LSun 2017-03-30 N(0, 2) pdf
    rmd e5e405c LSun 2017-03-30 tails
    html e5e405c LSun 2017-03-30 tails
    rmd e36f505 LSun 2017-03-30 tail
    html e36f505 LSun 2017-03-30 tail
    rmd c472cb3 LSun 2017-03-29 n(0,2)
    html c472cb3 LSun 2017-03-29 n(0,2)

Identifiability of true signals from correlated noise

We’ve shown that in many real data sets, when we have correlated null \(z\)-scores, we can fit their empirical distribution with a Gaussian and its derivatives.

But what if we have true signals instead of the global null? Theoretically, any distribution can be decomposed into a Gaussian and its derivatives; such a decomposition is also called an Edgeworth series or Edgeworth expansion. We’ve shown that the Dirac delta function \(\delta_z\) and the associated \(0\)-\(1\) step function can be decomposed into Gaussian derivatives. Essentially all distributions can be represented by (usually infinitely many) \(\delta_z\), and thus decomposed into a Gaussian and its derivatives. There is a rich literature on this topic, probably of further use to this project.
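
For concreteness, the decomposition we have in mind has roughly the following form, writing \(\varphi\) for the standard normal density and \(h_k\) for the \(k\)-th probabilists’ Hermite polynomial (a sketch in the notation of our earlier analyses, not a new result):

\[ f(z) = \varphi(z) + \sum_{k = 1}^\infty W_k \varphi^{(k)}(z), \qquad \varphi^{(k)}(z) = (-1)^k h_k(z)\varphi(z) . \]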

Now the more urgent question is: can true signals also be fitted by Gaussian derivatives in a way similar to the correlated null? Let the normalized weights be \(W_k^s = W_k\sqrt{k!}\). As shown previously, under the correlated null the variance \(\text{var}(W_k^s) = \alpha_k = \overline{\rho_{ij}^k}\). Thus, under the correlated null, the Gaussian derivative decomposition of the empirical distribution should have “reasonable” weights with a similar decaying pattern.
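
Spelled out, with the average taken over all pairs \(i \neq j\) of the correlated \(z\)-scores (the explicit averaging formula below is our reading of the notation):

\[ W_k^s = \sqrt{k!}\,W_k, \qquad \text{var}(W_k^s) = \alpha_k = \overline{\rho_{ij}^k} = \frac{1}{\binom{n}{2}} \sum_{i < j} \rho_{ij}^k . \]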

If it turns out that Gaussian derivatives of limited order (say, \(K \leq 10\)) with reasonable normalized weights can only fit the empirical correlated null and nothing else, then properly regularized Gaussian derivatives can readily be used to account for the usually correlated noise, that is, the correlated null, and leave the true signal to ash. But if true signals can also be fitted this way, the identifiability of true signals from correlated noise becomes an issue.

Let’s start with the simplest case: \(z \sim N(0, \sqrt{2}^2)\), independently. This data set can be seen as being generated as follows.

\[ \begin{array}{c} \beta_j \sim N(0, 1)\\ z_j \sim N(\beta_j, 1) \end{array} \]

That is, a \(N(0, 1)\) true signal is polluted by \(N(0, 1)\) noise.
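
As a quick sanity check of this equivalence, both sampling schemes should give mean near \(0\) and variance near \(2\). The snippet below is only an illustrative sketch; its object names (n.check, z.hier, z.marg) are hypothetical and not part of the analysis code that follows.

set.seed(98765)                      # arbitrary seed for this check only
n.check = 1e5
beta = rnorm(n.check, 0, 1)          # true signals beta_j ~ N(0, 1)
z.hier = rnorm(n.check, beta, 1)     # z_j ~ N(beta_j, 1): signal plus noise
z.marg = rnorm(n.check, 0, sqrt(2))  # direct draw z_j ~ N(0, 2)
c(var(z.hier), var(z.marg))          # both should be close to 2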

Illustration

n = 1e4  # number of z-scores per data set
m = 5    # number of simulated data sets
set.seed(777)
## Each row of zmat is an independent data set of n z-scores drawn from N(0, 2)
zmat = matrix(rnorm(n * m, 0, sd = sqrt(2)), nrow = m, byrow = TRUE)
library(ashr)
source("../code/ecdfz.R")
res = list()
for (i in 1:m) {
  z = zmat[i, ]
  ## Two-sided p-values
  p = (1 - pnorm(abs(z))) * 2
  ## Number of discoveries by Benjamini-Hochberg at nominal FDR 0.05
  bh.fd = sum(p.adjust(p, method = "BH") <= 0.05)
  ## Estimated null proportion from ash, with all standard errors set to 1
  pihat0.ash = get_pi0(ash(z, 1, method = "fdr"))
  ## Fit Gaussian derivatives to the empirical distribution of z
  ecdfz.fit = ecdfz.optimal(z)
  res[[i]] = list(z = z, p = p, bh.fd = bh.fd, pihat0.ash = pihat0.ash, ecdfz.fit = ecdfz.fit)
}
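
For reference, the “Log-likelihood with \(N(0, 2)\)” values reported below can be recomputed from the stored \(z\)-scores with something like the one-liner that follows (ll.n02 is an illustrative name, not part of the analysis code).

## Log-likelihood of the first simulated data set under the true N(0, 2)
ll.n02 = sum(dnorm(res[[1]]$z, mean = 0, sd = sqrt(2), log = TRUE))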
Example 1 : Number of Discoveries: 246 ; pihat0 = 0.3245191 
Log-likelihood with N(0, 2): -17704.62 
Log-likelihood with Gaussian Derivatives: -17702.15 
Log-likelihood ratio between true N(0, 2) and fitted Gaussian derivatives: -2.473037 
Normalized weights:
1 : -0.0126888368547959 ; 2 : 0.717062378249889 ; 3 : -0.0184536200134752 ; 4 : 0.649465525394262 ; 5 : 0.00859163522314002 ; 6 : 0.521325079359314 ; 7 : 0.0334885164431775 ; 8 : 0.22636494735755 ;

Zoom in to the left tail:

Zoom in to the right tail:

Example 2 : Number of Discoveries: 218 ; pihat0 = 0.3007316 
Log-likelihood with N(0, 2): -17620.91 
Log-likelihood with Gaussian Derivatives: -17618.13 
Log-likelihood ratio between true N(0, 2) and fitted Gaussian derivatives: -2.787631 
Normalized weights:
1 : 0.0102680011779709 ; 2 : 0.696012169853609 ; 3 : 0.0113000171720435 ; 4 : 0.544236663386519 ; 5 : -0.0208432030918437 ; 6 : 0.359654087688657 ; 7 : 0.00449356234470338 ; 8 : 0.129368209367989 ;

Zoom in to the left tail:

Zoom in to the right tail:

Example 3 : Number of Discoveries: 201 ; pihat0 = 0.3524008 
Log-likelihood with N(0, 2): -17627.66 
Log-likelihood with Gaussian Derivatives: -17623.26 
Log-likelihood ratio between true N(0, 2) and fitted Gaussian derivatives: -4.397359 
Normalized weights:
1 : 0.000611199281683122 ; 2 : 0.697833563596919 ; 3 : -9.24232505276873e-05 ; 4 : 0.593310577011007 ; 5 : 0.0690423192366928 ; 6 : 0.402719962212205 ; 7 : 0.0821756084741036 ; 8 : 0.137136244590824 ;

Zoom in to the left tail:

Zoom in to the right tail:

Example 4 : Number of Discoveries: 134 ; pihat0 = 0.3039997 
Log-likelihood with N(0, 2): -17572.28 
Log-likelihood with Gaussian Derivatives: -17589.35 
Log-likelihood ratio between true N(0, 2) and fitted Gaussian derivatives: 17.07424 
Normalized weights:
1 : -0.00303021567753385 ; 2 : 0.667140676046508 ; 3 : -0.00744442518950379 ; 4 : 0.4335954662891 ; 5 : 0.00652056989516479 ; 6 : 0.163579551221406 ; 7 : 0.0434395776822699 ;

Zoom in to the left tail:

Zoom in to the right tail:

Example 5 : Number of Discoveries: 201 ; pihat0 = 0.3864133 
Log-likelihood with N(0, 2): -17602.8 
Log-likelihood with Gaussian Derivatives: -17607.36 
Log-likelihood ratio between true N(0, 2) and fitted Gaussian derivatives: 4.565327 
Normalized weights:
1 : -0.0149505230188178 ; 2 : 0.681006373173563 ; 3 : -0.029408092099831 ; 4 : 0.526597120212115 ; 5 : -0.0649823448928799 ; 6 : 0.248323484516014 ; 7 : -0.077154633635199 ;

Zoom in to the left tail:

Zoom in to the right tail:

Session information

sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.0.1   Rcpp_0.12.16      digest_0.6.15    
 [4] rprojroot_1.3-2   R.methodsS3_1.7.1 backports_1.1.2  
 [7] git2r_0.21.0      magrittr_1.5      evaluate_0.10.1  
[10] stringi_1.1.6     whisker_0.3-2     R.oo_1.21.0      
[13] R.utils_2.6.0     rmarkdown_1.9     tools_3.4.3      
[16] stringr_1.3.0     yaml_2.1.18       compiler_3.4.3   
[19] htmltools_0.3.6   knitr_1.20       

This reproducible R Markdown analysis was created with workflowr 1.0.1