Truncated Adaptive Shrinkage (`truncash`)

Last updated: 2018-12-14

✔ Repository version: 73ed3b7

Past versions:

File	Version	Author	Date	Message
html	562bda2	Lei Sun	2018-10-21	Build site.
Rmd	7b6b664	Lei Sun	2018-10-21	wflow_publish(c(“analysis/cash_paper_fig_gd.rmd”,
html	c8099e5	LSun	2018-10-17	Build site.
Rmd	5db2f75	LSun	2018-10-17	wflow_publish(“index.Rmd”)
html	39b2b84	LSun	2018-10-12	Build site.
Rmd	f20502c	LSun	2018-10-12	wflow_publish(c(“analysis/cash_plots_2.rmd”, “analysis/cash_plots_4.rmd”, “analysis/cash_plots_5.rmd”,
html	b4e8312	LSun	2018-10-07	Build site.
Rmd	409c566	LSun	2018-10-07	wflow_publish(“analysis/index.Rmd”)
html	f7bde81	Lei Sun	2018-10-06	Build site.
Rmd	3b81007	Lei Sun	2018-10-06	wflow_publish(c(“analysis/cash_paper_fig_g1sep.rmd”,
html	133541a	LSun	2018-10-05	Build site.
Rmd	49f870a	LSun	2018-10-05	wflow_publish(c(“cash_paper_fig_leukemia.rmd”, “cash_paper_fig1.rmd”,
html	f576ac1	LSun	2018-09-05	Build site.
Rmd	814582e	LSun	2018-09-05	wflow_publish(“analysis/index.Rmd”)
html	4521488	LSun	2018-09-05	Build site.
Rmd	c0f1a6f	LSun	2018-09-05	wflow_publish(“analysis/index.Rmd”)
html	6007071	LSun	2018-06-08	Build site.
Rmd	a3e7d09	LSun	2018-06-08	wflow_publish(“analysis/index.Rmd”)
html	658535a	LSun	2018-05-18	Build site.
Rmd	5eb3d95	LSun	2018-05-18	wflow_publish(c(“analysis/gd_lik_2.rmd”,
html	4d653b1	LSun	2018-05-15	Build site.
html	b7cf71b	LSun	2018-05-14	Build site.
Rmd	69e0c7d	LSun	2018-05-14	Update to 1.0
html	69e0c7d	LSun	2018-05-14	Update to 1.0
html	f70370c	Lei Sun	2018-05-14	revised _site.yml
html	eaaa5f9	LSun	2018-05-14	Build site.
Rmd	a9d12a8	LSun	2018-05-14	wflow_publish(c(“analysis/BH_robustness.rmd”, “analysis/index.Rmd”))
html	e05bc83	LSun	2018-05-12	Update to 1.0
Rmd	cc0ab83	Lei Sun	2018-05-11	update
html	720d179	LSun	2018-04-18	Build site.
Rmd	b82e2bc	LSun	2018-04-18	wflow_publish(c(“analysis/gd_w.rmd”, “analysis/index.Rmd”))
html	38dd7e1	LSun	2018-04-14	index
Rmd	ac4d19c	LSun	2018-04-14	index
html	4b179a9	LSun	2018-04-05	Build site.
Rmd	20ea328	LSun	2018-04-05	wflow_publish(c(“analysis/knockoff_7.rmd”, “analysis/knockoff_8.rmd”,
html	cb9d445	LSun	2018-03-11	Build site.
Rmd	2c9cf26	Lei Sun	2018-03-11	obser
html	c9714a8	LSun	2018-02-23	Build site.
Rmd	4b4505a	LSun	2018-02-23	wflow_publish(c(“knockoff_7.rmd”, “index.rmd”))
html	85b6c52	LSun	2018-02-16	Build site.
Rmd	7235300	LSun	2018-02-16	wflow_publish(“analysis/index.Rmd”)
Rmd	016ccb2	LSun	2018-02-09	plots added
html	016ccb2	LSun	2018-02-09	plots added
Rmd	34434cf	LSun	2018-02-06	index
html	34434cf	LSun	2018-02-06	index
html	3e6f3ab	LSun	2018-01-30	Build site.
Rmd	c17f67c	LSun	2018-01-30	wflow_publish(c(“analysis/index.rmd”, “analysis/knockoff.rmd”, “analysis/knockoff_2.rmd”, “analysis/knockoff_3.rmd”))
html	b831895	LSun	2018-01-25	Build site.
Rmd	4bd59b5	LSun	2018-01-25	wflow_publish(files = c(“analysis/knockoff.rmd”, “analysis/index.Rmd”))
html	2a15a32	LSun	2017-12-21	Build site.
Rmd	e3b0d2c	LSun	2017-12-21	wflow_publish(all = TRUE)
html	8ae6c11	LSun	2017-12-08	Build site.
Rmd	547ac2c	LSun	2017-12-08	analysis/poster_obayes17.rmd
html	fb76f41	LSun	2017-11-29	webpages
Rmd	32adf96	LSun	2017-11-29	index
html	8de2239	LSun	2017-11-28	Build site.
Rmd	10a2525	LSun	2017-11-28	wflow_publish(“analysis/index.Rmd”)
html	8e5c59f	LSun	2017-11-28	Build site.
Rmd	0cc70b9	LSun	2017-11-28	wflow_publish(c(“analysis/cash_deconv.rmd”, “analysis/index.Rmd”))
html	92244e4	LSun	2017-11-28	Build site.
Rmd	b8bcdda	LSun	2017-11-28	wflow_publish(c(“analysis/index.Rmd”, “analysis/cash_fdr_1.rmd”,
html	1f91ef0	LSun	2017-11-22	Build site.
Rmd	b64bef2	LSun	2017-11-22	wflow_publish(“analysis/index.Rmd”)
html	501bd28	LSun	2017-11-10	Build site.
Rmd	d337287	LSun	2017-11-10	wflow_publish(c(“cash_sim_1.rmd”, “cash_sim_2.rmd”, “index.Rmd”))
Rmd	4b385a1	LSun	2017-07-21	brca analysis
html	4b385a1	LSun	2017-07-21	brca analysis
html	76a05c2	LSun	2017-06-23	Build site.
Rmd	4fb0ff1	LSun	2017-06-23	wflow_publish(“index.rmd”)
html	b9706c5	LSun	2017-06-23	Build site.
Rmd	1542061	LSun	2017-06-23	wflow_publish(“index.rmd”)
Rmd	a1e16ac	LSun	2017-06-18	smemo
html	a1e16ac	LSun	2017-06-18	smemo
Rmd	e6d5a64	LSun	2017-06-17	GD-Lik
html	e6d5a64	LSun	2017-06-17	GD-Lik
Rmd	73b4bb7	LSun	2017-06-17	GD-Lik simulations
html	73b4bb7	LSun	2017-06-17	GD-Lik simulations
Rmd	40fc7d9	LSun	2017-06-02	index
html	a930ad9	LSun	2017-06-02	BH
Rmd	a40ffff	LSun	2017-06-01	webpage
html	a40ffff	LSun	2017-06-01	webpage
Rmd	c4782cc	LSun	2017-05-30	index
html	c4782cc	LSun	2017-05-30	index
Rmd	9f7dd01	LSun	2017-05-21	website
html	9f7dd01	LSun	2017-05-21	website
Rmd	055f9f0	LSun	2017-05-17	websites
html	055f9f0	LSun	2017-05-17	websites
Rmd	7b4811c	LSun	2017-05-17	index
html	7b4811c	LSun	2017-05-17	index
Rmd	e7d9466	LSun	2017-05-16	index
html	e7d9466	LSun	2017-05-16	index
Rmd	807eb71	LSun	2017-05-14	index
html	807eb71	LSun	2017-05-14	index
html	603b826	LSun	2017-05-11	writeups
Rmd	3501eca	LSun	2017-05-11	index
Rmd	8dcfa2f	LSun	2017-05-09	index
html	8dcfa2f	LSun	2017-05-09	index
Rmd	b864975	LSun	2017-05-09	revision
html	b864975	LSun	2017-05-09	revision
html	b141020	LSun	2017-05-09	writeups
Rmd	c0c29a2	LSun	2017-05-09	index
html	27bcaf9	LSun	2017-05-08	writeups
Rmd	6bac69e	LSun	2017-05-08	writeups
html	e368074	LSun	2017-05-06	Rmosek
Rmd	06eff07	LSun	2017-05-06	Rmosek
Rmd	8751494	LSun	2017-05-05	diagnostic
html	8751494	LSun	2017-05-05	diagnostic
Rmd	5a3f36c	LSun	2017-05-05	index
html	7425db5	LSun	2017-05-05	index
Rmd	98a94d1	LSun	2017-05-03	uniformity diagnostic
html	98a94d1	LSun	2017-05-03	uniformity diagnostic
Rmd	84a84ca	LSun	2017-04-23	diagnostic plots
html	84a84ca	LSun	2017-04-23	diagnostic plots
Rmd	3c00d15	LSun	2017-04-21	de bogged
html	3c00d15	LSun	2017-04-21	de bogged
Rmd	239f195	LSun	2017-04-21	GD-ASH
html	239f195	LSun	2017-04-21	GD-ASH
Rmd	4eb2aa0	LSun	2017-04-19	diagnostic
html	4eb2aa0	LSun	2017-04-19	diagnostic
Rmd	020da62	LSun	2017-04-06	SNR
html	020da62	LSun	2017-04-06	SNR
Rmd	0f49e8a	LSun	2017-03-31	alternative simulation
html	0f49e8a	LSun	2017-03-31	alternative simulation
Rmd	d914e14	LSun	2017-03-30	weights
html	d914e14	LSun	2017-03-30	weights
Rmd	2a9b0b7	LSun	2017-03-30	weights
html	2a9b0b7	LSun	2017-03-30	weights
Rmd	c472cb3	LSun	2017-03-29	n(0,2)
html	c472cb3	LSun	2017-03-29	n(0,2)
Rmd	5bd5d70	LSun	2017-03-28	revision
html	5bd5d70	LSun	2017-03-28	revision
Rmd	c803b43	LSun	2017-03-27	fitting and write-up
html	c803b43	LSun	2017-03-27	fitting and write-up
Rmd	bfde8e5	LSun	2017-03-10	theoretical
html	bfde8e5	LSun	2017-03-10	theoretical
Rmd	894c395	LSun	2017-03-07	histogram
html	894c395	LSun	2017-03-07	histogram
Rmd	03366d9	LSun	2017-03-06	correlated_z
html	03366d9	LSun	2017-03-06	correlated_z
Rmd	1c0be20	LSun	2017-03-06	write-ups
html	1c0be20	LSun	2017-03-06	write-ups
Rmd	be223d3	LSun	2017-03-04	t
html	be223d3	LSun	2017-03-04	t
Rmd	0a13d52	LSun	2017-02-23	index
html	0a13d52	LSun	2017-02-23	index
Rmd	dc6ad76	LSun	2017-02-23	truncash
html	dc6ad76	LSun	2017-02-23	truncash
Rmd	242b122	LSun	2017-02-22	pihat0
html	242b122	LSun	2017-02-22	pihat0
Rmd	2dbb6bb	LSun	2017-02-16	index
html	2dbb6bb	LSun	2017-02-16	index
Rmd	7db293d	LSun	2017-02-14	index
html	7db293d	LSun	2017-02-14	index
html	81d84f5	LSun	2017-02-10	FDR
Rmd	2696cb2	LSun	2017-02-10	fwer
Rmd	313897f	LSun	2017-02-03	details
html	313897f	LSun	2017-02-03	details
html	36c1e4c	LSun	2017-02-03	Build site.
Rmd	d25a6e3	LSun	2017-02-03	step-down
Rmd	d616c3d	LSun	2017-02-03	occurrence
html	d616c3d	LSun	2017-02-03	occurrence
Rmd	858f0e4	LSun	2017-02-01	background
html	858f0e4	LSun	2017-02-01	background
html	d575294	LSun	2017-01-31	Build site.
Rmd	58b04b8	LSun	2017-01-29	intro
html	58b04b8	LSun	2017-01-29	intro
Rmd	1d187aa	LSun	2017-01-29	First Commit
html	1d187aa	LSun	2017-01-29	First Commit

truncash (Truncated ASH) is an exploratory project with Matthew, built on ashr.

Matthew’s initial observation on null, correlated data

Prof. Matthew Stephens did a quick investigation of the p values and z scores obtained for simulated null data (using just voom transform, no correction) from real RNA-seq data of GTEx. Here is what he found.

“I found something that I hadn’t realized, although is obvious in hindsight: although you sometimes see inflation under null of \(p\)-values/\(z\)-scores, the most extreme values are not inflated compared with expectations (and tend to be deflated). That is the histograms of \(p\)-values that show inflation near \(0\) (and deflation near \(1\)) actually hide something different going on in the very left hand side near \(0\). The qq-plots are clearer… showing most extreme values are deflated, or not inflated. This is expected under positive correlation i think. For example, if all \(z\)-scores were the same (complete correlation), then most extreme of n would just be \(N(0,1)\). but if independent the most extreme of n would have longer tails…”

Matthew’s initial observation inspired this project. If, under positive correlation, the most extreme observations tend not to be inflated, maybe we can use them to control false discoveries. Meanwhile, if the moderate observations are more prone to inflation due to correlation, maybe it’s better to make only partial use of their information.
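The intuition in the quote can be checked with a small simulation. The sketch below is not the GTEx-based simulation used in this project. It assumes a toy equicorrelated model, \(z_i = \sqrt{\rho}\,w + \sqrt{1 - \rho}\,e_i\) with \(w, e_i\) independent \(N(0, 1)\), so that every \(z_i\) is marginally \(N(0, 1)\). The function names and the values of `n`, `rho`, and `reps` are illustrative choices.

```python
import random
from statistics import mean

def equicorrelated_z(n, rho, rng):
    """Draw n z-scores, each marginally N(0, 1), with common pairwise correlation rho."""
    shared = rng.gauss(0.0, 1.0)                 # factor shared by all n scores
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    return [a * shared + b * rng.gauss(0.0, 1.0) for _ in range(n)]

def mean_max_abs(n, rho, reps, seed=1):
    """Monte Carlo estimate of E[max_i |z_i|] under the toy model."""
    rng = random.Random(seed)
    return mean(max(abs(z) for z in equicorrelated_z(n, rho, rng))
                for _ in range(reps))

independent = mean_max_abs(n=1000, rho=0.0, reps=200)   # no correlation
correlated = mean_max_abs(n=1000, rho=0.5, reps=200)    # positive correlation
complete = mean_max_abs(n=1000, rho=1.0, reps=200)      # all z-scores identical
print(independent, correlated, complete)
```

Under this toy model the average most extreme \(|z|\) shrinks as \(\rho\) grows, which matches the deflation described in the quote. Real expression data need not be equicorrelated.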

Occurrence of extreme observations

As Prof. Michael Stein pointed out during a conversation with Matthew, if the marginal distribution is correct then the expected number exceeding any threshold should be correct. So if the tail is “usually” deflated, it should be that with some small probability there are many large \(z\)-scores (even in the tail). Therefore, if “on average” we have the right number of large \(z\)-scores/small \(p\)-values, and “usually” we have too few, then “rarely” we should have too many. A simulation is run to check this intuition.
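A minimal sketch of the “on average right, usually too few, rarely too many” argument follows. It is not the simulation referred to above. It assumes the same kind of toy equicorrelated model, \(z_i = \sqrt{\rho}\,w + \sqrt{1 - \rho}\,e_i\), and the threshold, `rho`, and number of replicates are arbitrary illustrative values.

```python
import random
from statistics import NormalDist, mean, median

def exceedance_counts(n, rho, threshold, reps, seed=2):
    """For each replicate, count how many of n equicorrelated z-scores have |z| > threshold."""
    rng = random.Random(seed)
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    counts = []
    for _ in range(reps):
        shared = rng.gauss(0.0, 1.0)
        counts.append(sum(abs(a * shared + b * rng.gauss(0.0, 1.0)) > threshold
                          for _ in range(n)))
    return counts

n, threshold = 1000, 3.0
# Expected count depends only on the N(0, 1) marginal, not on the correlation.
expected = n * 2 * (1 - NormalDist().cdf(threshold))
counts = exceedance_counts(n, rho=0.5, threshold=threshold, reps=2000)
print(expected, mean(counts), median(counts), max(counts))
```

In this toy setting the mean count stays near the marginal expectation, the median falls below it, and a few replicates contain far more exceedances than expected.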

Two FWER-controlling procedures on correlated null

In order to understand the behavior of \(p\)-values of top expressed, correlated genes under the global null, simulated from GTEx data, we apply two FWER-controlling multiple comparison procedures: Holm’s “step-down” ([Holm 1979]) and Hochberg’s “step-up” ([Hochberg 1988]).
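For reference, the two procedures can be sketched as follows. This is a plain textbook implementation written for illustration, not the code used in the analysis; the function names and the example \(p\)-values are made up.

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down: walk up from the smallest p value, stop at the first failure."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    reject = [False] * n
    for rank, i in enumerate(order):              # rank 0 is the smallest p value
        if pvalues[i] > alpha / (n - rank):
            break
        reject[i] = True
    return reject

def hochberg_reject(pvalues, alpha=0.05):
    """Hochberg step-up: walk down from the largest p value, stop at the first success."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    reject = [False] * n
    for rank in range(n - 1, -1, -1):             # rank n - 1 is the largest p value
        if pvalues[order[rank]] <= alpha / (n - rank):
            for i in order[: rank + 1]:           # reject it and everything smaller
                reject[i] = True
            break
    return reject

example = [0.010, 0.020, 0.030, 0.040]
print(sum(holm_reject(example)), sum(hochberg_reject(example)))
```

Both procedures compare the ordered \(p\)-values with the same critical values \(\alpha / (n - k + 1)\). They differ only in the direction of the walk, so Hochberg rejects at least as many hypotheses as Holm; in the example Holm rejects one hypothesis and Hochberg rejects all four.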

A toy model is used to examine and document each step of the pipeline that simulates null summary statistics, including edgeR::calcNormFactors, limma::voom, limma::lmFit, and limma::eBayes.

Apply two FDR-controlling procedures, BH and BY, as well as two \(s\) value models, ash and truncash, to the simulated, correlated null data, and compare the numbers of false discoveries obtained (under the global null, every discovery is false by definition). Part 1 uses \(z\) scores only; Part 2 uses \(\hat \beta\) and moderated \(\hat s\).
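As a reminder of how BH and BY differ, here is a minimal sketch of both step-up rules. It is a textbook implementation for illustration only; ash and truncash are not reproduced here, and the function name, the `dependence_correction` flag, and the example \(p\)-values are made up.

```python
def bh_reject(pvalues, alpha=0.05, dependence_correction=False):
    """Step-up FDR procedure: BH by default, BY when dependence_correction is True."""
    n = len(pvalues)
    # BY divides the BH critical values by the harmonic sum 1 + 1/2 + ... + 1/n.
    c = sum(1.0 / k for k in range(1, n + 1)) if dependence_correction else 1.0
    order = sorted(range(n), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / (n * c):
            n_reject = rank                       # keep the largest rank that passes
    reject = [False] * n
    for i in order[:n_reject]:
        reject[i] = True
    return reject

example = [0.001, 0.011, 0.02, 0.035, 0.5]
bh = bh_reject(example)
by = bh_reject(example, dependence_correction=True)
print(sum(bh), sum(by))
```

On the example BH makes four discoveries and BY makes one, which shows how much more conservative the BY correction for arbitrary dependence is.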

\(\hat\pi_0\) estimated in correlated global null

\(\hat\pi_0\) is estimated by ash and by truncash with \(T = 1.96\) on correlated global null data simulated from GTEx/Liver. Ideally the estimates should be close to \(1\).

Ordered \(p\) values vs critical values

For various FWER / FDR controlling procedures, and for truncash, what matters the most is the behavior of the most extreme observations. Here these extreme \(p\) values are plotted along with common critical values used by various procedures, in order to shed light on their behavior.

It’s very exploratory. May be related to Extreme Value Theory and Concentration of Measure. To be continued.

Single most extreme observation

What will happen if we allow the threshold in truncash to depend on the data? Let’s start from the case when we only know the single most extreme observation.

Handling \(t\) likelihood

When moving to \(t\) likelihood, or in other words, when taking randomness of \(\hat s\) into consideration, things get trickier. Here we review several techniques currently used in Matthew’s lab, regarding \(t\) likelihood and uniform mixture priors, based on a discussion with Matthew.

An implicit key question of this inquiry is: what’s the behavior of \(z\) scores under dependency? We take a look at their histograms. First for randomly sampled data sets. Second for those most “hostile” to ash. Third for those most “hostile” to BH. The bottom line is we are reproducing what Efron observed in microarray data, that “the theoretical null may fail” in different ways. Now the key questions are

Why may the theoretical null fail? What do we mean by correlation?
Can truncash make ash more robust against some of the foes that make the theoretical null fail?
Generally, how robust is empirical Bayes? Is empirical Bayes robust or non-robust to certain kinds of correlation?

Inspired by Schwartzman 2010, we experiment with a new way to tackle the “empirical null.”

Fitting empirical null with Gaussian derivatives: Weight constraints

In the Gaussian derivatives decomposition, the weights \(W_k\) and especially the normalized weights \(W_k^s = W_k\sqrt{k!}\) contain substantial information. In order to produce a proper density, and in order to regularize the fitted density so that it describes a plausible empirical correlated null, constraints need to be imposed on the weights.
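To make the role of the weights concrete, the sketch below evaluates a density built from Gaussian derivatives. It assumes the decomposition has the form \(f(z) = \sum_k W_k \varphi^{(k)}(z)\) with \(W_0 = 1\), where \(\varphi^{(k)}(z) = (-1)^k h_k(z)\varphi(z)\) and \(h_k\) is the probabilists’ Hermite polynomial. That form, the function names, and the example weights are assumptions of this illustration.

```python
import math

def hermite_polynomials(z, order):
    """Probabilists' Hermite polynomials He_0(z), ..., He_order(z) by recurrence."""
    values = [1.0, z]
    for k in range(1, order):
        values.append(z * values[k] - k * values[k - 1])   # He_{k+1} = z He_k - k He_{k-1}
    return values[: order + 1]

def gd_density(z, weights):
    """f(z) = sum_k W_k phi^{(k)}(z), using phi^{(k)}(z) = (-1)^k He_k(z) phi(z)."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    he = hermite_polynomials(z, len(weights) - 1)
    return phi * sum(w * (-1) ** k * h for k, (w, h) in enumerate(zip(weights, he)))

def normalized_weights(weights):
    """W_k^s = W_k * sqrt(k!)."""
    return [w * math.sqrt(math.factorial(k)) for k, w in enumerate(weights)]

weights = [1.0, 0.0, 0.3, 0.0, 0.05]
grid = [-10.0 + 0.01 * j for j in range(2001)]
total_mass = sum(gd_density(z, weights) for z in grid) * 0.01   # Riemann sum
negative_value = gd_density(2.0, [1.0, 0.0, -0.8])
print(total_mass, negative_value, normalized_weights(weights))
```

With \(W_0 = 1\) the function always integrates to one, because every higher derivative of \(\varphi\) integrates to zero. Nothing, however, keeps it non-negative: the weights \((1, 0, -0.8)\) give a negative value at \(z = 2\), which is one reason constraints on the weights are needed.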

Both true effects and correlation can distort the empirical distribution away from the standard normal \(N(0, 1)\), and Gaussian derivatives are presumably able to fit both. Therefore, is there a way to let Gaussian derivatives with a reasonable number of reasonable weights automatically distinguish the correlated null from true effects? Right now it works well when the effects are not too small.

Fitting \(N\left(0, \sigma^2\right)\) with Gaussian derivatives

Continuing from the empirical studies above, we are looking at why \(N\left(0, \sigma^2\right)\) when \(\sigma^2\) is small can be satisfactorily fitted by a limited number of Gaussian derivatives.
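A formal expansion through characteristic functions gives one way to look at this question. The derivation below is a sketch: it assumes the decomposition \(f(z) = \sum_k W_k \varphi^{(k)}(z)\), where \(\varphi\) is the standard normal density and \(\varphi^{(k)}\) its \(k\)-th derivative, and it treats convergence only heuristically.

Integrating by parts \(k\) times gives

\[
\int e^{itz}\varphi^{(k)}(z)\,dz = \left(-it\right)^k e^{-t^2/2} .
\]

The characteristic function of \(N\left(0, \sigma^2\right)\) can be written as

\[
e^{-\sigma^2 t^2/2}
= e^{-t^2/2}\sum_{m=0}^{\infty}\frac{1}{m!}\left(\frac{\sigma^2 - 1}{2}\right)^m\left(-t^2\right)^m
= \sum_{m=0}^{\infty}\frac{\left(\sigma^2 - 1\right)^m}{2^m\, m!}\left(-it\right)^{2m} e^{-t^2/2} ,
\]

so matching terms suggests

\[
\frac{1}{\sigma}\varphi\left(\frac{z}{\sigma}\right) = \sum_{m=0}^{\infty} W_{2m}\,\varphi^{(2m)}(z),
\qquad
W_{2m} = \frac{\left(\sigma^2 - 1\right)^m}{2^m\, m!},
\qquad
W_{2m+1} = 0 .
\]

The normalized weights are then

\[
W_{2m}^s = W_{2m}\sqrt{(2m)!} = \left(\sigma^2 - 1\right)^m\sqrt{\binom{2m}{m}4^{-m}} \approx \frac{\left(\sigma^2 - 1\right)^m}{\left(\pi m\right)^{1/4}} ,
\]

which decay geometrically in \(m\) whenever \(\left|\sigma^2 - 1\right| < 1\). As a check on the first correction term, \(\varphi + \frac{\sigma^2 - 1}{2}\varphi^{(2)}\) has variance \(1 + \frac{\sigma^2 - 1}{2}\cdot 2 = \sigma^2\).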

Under the exchangeability assumption, the goodness of fit of empirical Bayes can be measured by the behavior of \(\left\{\hat F_j = \hat F_{\hat\beta_j | \hat s_j}\left(\hat\beta_j\mid\hat s_j\right)\right\}\). Meanwhile, if ASH is applied to correlated null \(z\) scores, estimated \(\hat g\) will not only be different from \(\delta_0\); moreover, with this estimated \(\hat g\), \(\left\{\hat F_j\right\}\) might not behave like \(\text{Unif}\left[0, 1\right]\).

Essentially we hope to tell whether \(n\) random samples are uniformly distributed, and the task seems more complicated than it sounds. There are multiple ways to do it, and some of them have been absorbed into the ashr package.
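One of the simplest uniformity checks is the Kolmogorov-Smirnov distance between \(\left\{\hat F_j\right\}\) and \(\text{Unif}\left[0, 1\right]\). The sketch below is not how ashr implements its diagnostics. It assumes a normal prior \(g = N\left(0, \sigma_g^2\right)\), so that \(\hat\beta_j \mid \hat s_j \sim N\left(0, \sigma_g^2 + \hat s_j^2\right)\), and it uses independent simulated data; all names and parameter values are illustrative.

```python
import math
import random
from statistics import NormalDist

def predictive_cdf(betahat, shat, prior_sd):
    """F_j = P(B <= betahat_j | shat_j) with B ~ N(0, prior_sd^2 + shat_j^2)."""
    return [NormalDist(0.0, math.sqrt(prior_sd ** 2 + s ** 2)).cdf(b)
            for b, s in zip(betahat, shat)]

def ks_distance_to_uniform(values):
    """Kolmogorov-Smirnov distance between the empirical CDF and Unif[0, 1]."""
    x = sorted(values)
    n = len(x)
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(x))

rng = random.Random(3)
n, true_sd = 2000, 2.0
shat = [rng.uniform(0.5, 1.5) for _ in range(n)]
betahat = [rng.gauss(0.0, true_sd) + s * rng.gauss(0.0, 1.0) for s in shat]

# Correct prior: the F_j should look uniform.
ks_correct = ks_distance_to_uniform(predictive_cdf(betahat, shat, prior_sd=true_sd))
# Wrong prior g = delta_0 (prior_sd = 0): the F_j pile up near 0 and 1.
ks_wrong = ks_distance_to_uniform(predictive_cdf(betahat, shat, prior_sd=0.0))
print(ks_correct, ks_wrong)
```

A single distance like this is only a summary. It says nothing about where the fit fails, and it assumes the \(\hat F_j\) are independent, which is exactly what correlation breaks.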

Take a look at the null \(z\) scores we’ve been experimenting with so far and check whether they truly are marginally \(N\left(0, 1\right)\). It’s an enhanced version of the simulation on occurrence of extreme observations. The non-null \(z\) scores are produced for comparison.

Can we create correlated null from the raw data?

This whole project comes from a ubiquitous issue: in the simultaneous inference problem, the deviation from what would be expected under the null can come from both distortion due to correlation and enrichment due to true effects. Ideally, we could create controls that keep the distortion by correlation unchanged but remove the true effects. Efron, in a series of papers (for example, Efron et al 2001), suggests creating the null from the raw data by resampling or permutation. We apply the idea to our data sets, and the results are not promising.

Normal means inference with heteroskedastic and correlated noise: Model

The previous investigation boils down to a prototypical problem: the classic normal means inference, but with heteroskedastic and especially correlated noise. Gaussian derivatives and ASH provide a way to tackle it.

cvxr is still under active development. We are moving to the more established Rmosek, exploring and experimenting with this convex optimization package.

Several ways are explored to make fitting Gaussian derivatives more efficient and accurate.

Fitting the correlated null with Gaussian derivatives has two constraints, non-negativity and orderly decay, and both are intractable. We are using \(L_1\) regularization to try to address both.

Using the seqgendiff pipeline developed by David, Mengyin, and Matthew, we add simulated \(\pi_0\delta_0 + \left(1 - \pi_0\right)N\left(0, \sigma^2\right)\) signals to the GTEx/Liver RNA-seq expression matrix, and compare our method with BH, qvalue, locfdr, ash, sva, cate, and mouthwash. The pipeline and results here are mostly exploratory and still in development. A more mature version will follow.
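The signal distribution itself is easy to sample from. The sketch below only draws effects from \(\pi_0\delta_0 + \left(1 - \pi_0\right)N\left(0, \sigma^2\right)\); it does not reproduce how seqgendiff adds them to the expression matrix, and the function name and parameter values are illustrative.

```python
import random

def simulate_effects(n, pi0, sigma, seed=0):
    """Draw n effects: exactly 0 with probability pi0, otherwise N(0, sigma^2)."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < pi0 else rng.gauss(0.0, sigma) for _ in range(n)]

effects = simulate_effects(n=10000, pi0=0.9, sigma=2.0)
null_fraction = sum(e == 0.0 for e in effects) / len(effects)
print(null_fraction)
```

Keeping the indicator `e == 0.0` alongside the simulated data gives the ground truth needed to compute false discovery proportions for each method.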

Following Gao’s suggestion, we are taking a look at whether Gaussian derivatives can satisfactorily fit synthetic correlated \(N\left(0, 1\right)\) \(z\) scores. As Matthew suggested, it might have something to do with the distribution of eigenvalues in a random matrix, and the Wigner semicircle distribution. It’s also an interesting comparison with Gaussian mixtures.

Large-scale simulation testing of GD-ASH, with simulated effects of different sparsity, strength, and shape, under realistic correlated noise.

Why might the AUC of \(p\) values and BH be different?

AUC-wise, \(p\) values and BH should perform almost the same, since BH doesn’t change the order of \(p\) values, but in practice their performance might differ. It turns out that ties are likely one important reason.
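A small example shows how ties can arise. The sketch below computes BH-adjusted \(p\)-values with the usual running-minimum definition and an AUC in which ties count one half. It is an illustration with made-up numbers, not the code or data from the analysis.

```python
def bh_adjust(pvalues):
    """BH-adjusted p values: running minimum of p * n / rank, from the largest p down."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0                              # adjusted values are capped at 1
    for position, i in enumerate(order):
        rank = n - position                        # rank among ascending p values
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def auc(scores, is_signal):
    """P(signal score < null score) over all signal/null pairs; ties count 1/2."""
    signal = [s for s, t in zip(scores, is_signal) if t]
    null = [s for s, t in zip(scores, is_signal) if not t]
    wins = sum((s < u) + 0.5 * (s == u) for s in signal for u in null)
    return wins / (len(signal) * len(null))

pvalues = [0.01, 0.04, 0.045, 0.5]
is_signal = [True, False, True, False]
adjusted = bh_adjust(pvalues)
auc_p = auc(pvalues, is_signal)
auc_bh = auc(adjusted, is_signal)
print(adjusted, auc_p, auc_bh)
```

The running minimum never reverses the order of the \(p\)-values, but it maps the second and third of them to the same adjusted value. The signal that ranked behind a null now ties with it, so the AUC changes from 0.75 to 0.875 in this example. How a given AUC routine breaks ties then determines the direction and size of the difference.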

Group meeting presentation on GD-ASH

Talked about using Gaussian derivatives to handle correlation at Matthew’s group meeting.

An RNA-seq data set of mouse hearts contains \(23K\)+ genes and only 4 samples, two for each condition. Applying GD-ASH to this data set raises several interesting questions.

When the noise is correlated, the likelihood should be correlated. However, in the GD-ASH model, noise is essentially treated as independently sampled from an empirical null fitted by Gaussian derivatives, so it would make more sense to use Gaussian derivatives, rather than a simple Gaussian, as the likelihood.

A more in-depth look into the primal vs dual problem in Rmosek implementation.

CASH on Poisson thinned real data

Explore the seqgendiff::poisthin function developed by David.

Data analysis: Efron’s BRCA data

A close look at Efron’s classic BRCA data, used in many of his papers to illustrate the empirical Bayes idea which inspires this project.

Using the same framework of Gaussian derivative likelihood, extensive simulations are performed to compare CASH with other methods; smaller simulations are used to facilitate the exploration; and different visualization methods are explored to illustrate the advantages of CASH.

An extensive study of CASH and other methods in different scenarios shows that (1) CASH produces more accurate \(\hat\pi_0\); (2) CASH produces less variable false discovery proportions at the nominal FDR cutoff; (3) CASH tends to overfit when the noise is under-dispersed or looks independent; (4) CASH doesn’t work well when the unimodal assumption (UA) is violated.

CASH on Deconvolution: Accuracy of \(\hat g\)

On the deconvolution front, CASH generates more robust \(\hat g\) than ASH in the correlated noise case, and doesn’t show noticeable overfitting in the independent noise case.

Poster for O’Bayes 2017

First public presentation of CASH.

A very important quantity for the shape of the empirical distribution of a large number of correlated null \(z\) scores is the average pairwise correlation \(=\sqrt{\overline{\rho_{ij}^2}}\).
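The quantity \(\sqrt{\overline{\rho_{ij}^2}}\) is the root mean square of the pairwise correlations over all pairs \(i < j\). A minimal sketch of the computation on a few toy variables follows; the function names and the data are illustrative, and with tens of thousands of genes one would work with a sample of pairs or a matrix library instead.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rms_pairwise_correlation(variables):
    """sqrt of the mean of rho_ij^2 over all pairs i < j."""
    squared = [pearson(variables[i], variables[j]) ** 2
               for i in range(len(variables))
               for j in range(i + 1, len(variables))]
    return math.sqrt(sum(squared) / len(squared))

variables = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
rms = rms_pairwise_correlation(variables)
print(rms)
```

Squaring before averaging means that positive and negative correlations both contribute, so the measure does not cancel out the way a plain average of \(\rho_{ij}\) could.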

A close look at Efron’s classic BRCA data, used in many of his papers to illustrate the empirical null idea.

The coefficients of Gaussian derivatives for correlated null
The underappreciated and surprising robustness of BH to correlation. See more systematic explorations at BHq
CASH: lfsr calculation
Frequentist and Bayesian interpretations of FDR

cashr paper material

Next is a series of investigations of a number of variable selection and false discovery rate control methods in linear regression.

Commonly used design matrix \(X\) simulation schemes and their effects on average pairwise correlation in \(X\) \(=\sqrt{\overline{\rho\left(X_i, X_j\right)^2}}\) and in \(\hat\beta\) \(=\sqrt{\overline{\rho\left(\hat\beta_i, \hat\beta_j\right)^2}}\)

Initial comparison

Scenarios that might be less friendly to knockoff

This reproducible R Markdown analysis was created with workflowr 1.1.1
