Last updated: 2018-05-15
File | Version | Author | Date | Message |
---|---|---|---|---|
html | e05bc83 | LSun | 2018-05-12 | Update to 1.0 |
Rmd | cc0ab83 | Lei Sun | 2018-05-11 | update |
html | 0f36d99 | LSun | 2017-12-21 | Build site. |
html | 853a484 | LSun | 2017-11-07 | Build site. |
Rmd | 7db293d | LSun | 2017-02-14 | index |
html | 7db293d | LSun | 2017-02-14 | index |
html | 7693edf | LSun | 2017-02-14 | pipeline |
Rmd | ab7e085 | LSun | 2017-02-14 | Files commited by wflow_commit. |
html | 3a12124 | LSun | 2017-02-14 | pipeline |
Rmd | b68f251 | LSun | 2017-02-14 | Files commited by wflow_commit. |
Rmd | d623e66 | LSun | 2017-02-13 | pipeline |
Code version: 388e65e
Up to this point we have been using simulated correlated null data for exploration, so it is helpful to examine and document the pipeline that generates these summary statistics, including \(\hat\beta\), \(\hat s\), \(z\)-scores, \(t\)-statistics, and \(p\)-values.
An expression matrix \(X_{G \times N}\) goes through the following steps to produce summary statistics for \(g\) genes: choose the top \(g\) expressed genes; randomly sample \(n\) cases vs \(n\) controls; calculate normalization factors with edgeR::calcNormFactors; convert counts to \(\log_2\) scale; calculate variance weights with limma::voom. We'll go through this pipeline with a toy data set.
It is often very helpful to read the source code of these functions.
library(edgeR)
Loading required package: limma
library(limma)
Suppose the raw counts are stored in a \(10 \times 8\) matrix with \(10\) genes and \(8\) samples, with counts simulated from a Poisson distribution.
set.seed(777)
r = matrix(rpois(n = 80, lambda = 5), nrow = 10)
r
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 6 6 4 7 1 6 6 2
[2,] 5 6 6 6 4 4 3 7
[3,] 4 5 5 7 5 2 7 9
[4,] 12 5 3 4 4 3 0 4
[5,] 6 7 1 4 3 5 8 4
[6,] 1 2 7 4 11 3 2 8
[7,] 4 4 5 6 8 2 2 4
[8,] 3 8 5 4 8 9 3 6
[9,] 9 4 11 6 8 4 9 5
[10,] 3 4 6 6 3 1 5 4
In this example we choose the top \(6\) genes out of the total of \(10\).
lcpm = function(r) {R = colSums(r); t(log2(((t(r) + 0.5) / (R + 1)) * 10^6))}
top_genes_index = function(g, X) {order(rowSums(X), decreasing = TRUE)[1:g]}
Y=lcpm(r)
subset = top_genes_index(6,Y)
r = r[subset,]
r
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 9 4 11 6 8 4 9 5
[2,] 3 8 5 4 8 9 3 6
[3,] 4 5 5 7 5 2 7 9
[4,] 5 6 6 6 4 4 3 7
[5,] 6 7 1 4 3 5 8 4
[6,] 6 6 4 7 1 6 6 2
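The lcpm() helper above is just log2 counts-per-million with a half-count offset in the numerator and a one-count offset in the library size. A quick language-agnostic sketch of the same formula (Python here, since only the arithmetic matters):

```python
import math

def log_cpm(counts_col, lib_size):
    """log2 counts-per-million for one sample's counts, with the same
    +0.5 / +1 offsets as the lcpm() helper above."""
    return [math.log2((c + 0.5) / (lib_size + 1) * 1e6) for c in counts_col]

# e.g. a count of 6 in a library of total size 34:
# log2(6.5 / 35 * 1e6) ≈ 17.50
```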
In this example we use 2 cases vs 2 controls, and label them with condition, based on which a design matrix is generated.
counts = r[,sample(1:ncol(r),2*2)]
counts
[,1] [,2] [,3] [,4]
[1,] 6 5 11 4
[2,] 4 6 5 9
[3,] 7 9 5 2
[4,] 6 7 6 4
[5,] 4 4 1 5
[6,] 7 2 4 6
condition = c(rep(0,2),rep(1,2))
condition
[1] 0 0 1 1
design = model.matrix(~condition)
design
(Intercept) condition
1 1 0
2 1 0
3 1 1
4 1 1
attr(,"assign")
[1] 0 1
edgeR::calcNormFactors
This could be a problem, because the normalization factors are calculated based only on the top \(g\) genes.
1. A DGEList object is generated from counts and condition, using edgeR::DGEList.
2. The normalization factors are calculated by edgeR::calcNormFactors.
DGEList_obj = edgeR::DGEList(counts=counts,group=condition)
DGEList_obj
An object of class "DGEList"
$counts
Sample1 Sample2 Sample3 Sample4
1 6 5 11 4
2 4 6 5 9
3 7 9 5 2
4 6 7 6 4
5 4 4 1 5
6 7 2 4 6
$samples
group lib.size norm.factors
Sample1 0 34 1
Sample2 0 33 1
Sample3 1 32 1
Sample4 1 30 1
dgecounts = edgeR::calcNormFactors(DGEList_obj)
dgecounts
An object of class "DGEList"
$counts
Sample1 Sample2 Sample3 Sample4
1 6 5 11 4
2 4 6 5 9
3 7 9 5 2
4 6 7 6 4
5 4 4 1 5
6 7 2 4 6
$samples
group lib.size norm.factors
Sample1 0 34 1.1201997
Sample2 0 33 0.9397760
Sample3 1 32 0.9051827
Sample4 1 30 1.0494070
limma::voom
limma::voom takes in the counts and normalization factors and gives, among other things, modified library sizes, \(\log_2\) expression values, and a weight for each observation, all for Gaussian-based linear modeling.
The peculiarity of limma::voom lies in the way it calculates the "average" log2-count for each gene: sx <- fit$Amean + mean(log2(lib.size + 1)) - log2(1e+06). Note that for each gene, the count, and hence the log2-count, can vary wildly from sample to sample due to library size, sequencing depth, and/or experimental design, so how to define an "average" is not self-evident.
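To make that formula concrete, here is a sketch (in Python, with hypothetical numbers) of the shift from a gene's average log2-cpm (fit$Amean) to the "average" log2-count scale; it mimics the quoted line, not limma's source:

```python
import math

def avg_log2_count(amean_logcpm, lib_sizes):
    """Shift a gene's average log2-cpm onto an 'average' log2-count scale:
    sx = Amean + mean(log2(lib.size + 1)) - log2(1e6)."""
    mean_log_lib = sum(math.log2(L + 1) for L in lib_sizes) / len(lib_sizes)
    return amean_logcpm + mean_log_lib - math.log2(1e6)

# Sanity check: if every library had exactly 1e6 - 1 reads, cpm and raw
# counts coincide, so the shift is zero.
```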
v = voom(dgecounts, design, plot=FALSE)
v
An object of class "EList"
$targets
group lib.size norm.factors
Sample1 0 38.08679 1.1201997
Sample2 0 31.01261 0.9397760
Sample3 1 28.96585 0.9051827
Sample4 1 31.48221 1.0494070
$E
Sample1 Sample2 Sample3 Sample4
1 17.34340 17.39043 18.54988 17.07992
2 16.81288 17.63144 17.48575 18.15792
3 17.54985 18.17893 17.48575 16.23192
4 17.34340 17.83789 17.72676 17.07992
5 16.81288 17.10093 15.61128 17.36942
6 17.54985 16.25293 17.19625 17.61043
$weights
[,1] [,2] [,3] [,4]
[1,] 1.849269 2.060690 1.849269 1.849269
[2,] 4.229762 2.147872 1.849269 1.849269
[3,] 1.849269 1.849269 1.260232 1.424488
[4,] 1.849269 4.502218 2.095619 2.171379
[5,] 2.133848 1.565478 1.260232 1.260232
[6,] 2.111233 1.462248 2.095619 2.171379
$design
(Intercept) condition
1 1 0
2 1 0
3 1 1
4 1 1
attr(,"assign")
[1] 0 1
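Note that the lib.size column in $targets is simply the original library size multiplied by the TMM normalization factor. A quick arithmetic check against the numbers printed above:

```python
# Values copied from the DGEList $samples output above
lib_sizes = [34, 33, 32, 30]
norm_factors = [1.1201997, 0.9397760, 0.9051827, 1.0494070]

# Effective library sizes, which should match voom's $targets lib.size column
effective = [L * f for L, f in zip(lib_sizes, norm_factors)]
# ≈ [38.08679, 31.01261, 28.96585, 31.48221]
```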
Here is what limma::voom does, as described in Law et al. 2013:
voom: variance modeling at the observation level
The limma-trend pipeline models the variance at the gene level. However, in RNA-seq applications, the count sizes may vary considerably from sample to sample for the same gene. Different samples may be sequenced to different depths, so different count sizes may be quite different even if the cpm values are the same. For this reason, we wish to model the mean-variance trend of the log-cpm values at the individual observation level, instead of applying a gene-level variability estimate to all observations from the same gene.
Our strategy is to estimate non-parametrically the mean-variance trend of the logged read counts and to use this mean-variance relationship to predict the variance of each log-cpm value. The predicted variance is then encapsulated as an inverse weight for the log-cpm value. When the weights are incorporated into a linear modeling procedure, the mean-variance relationship in the log-cpm values is effectively eliminated.
A technical difficulty is that we want to predict the variances of individual observations although there is, by definition, no replication at the observational level from which variances could be estimated. We work around this inconvenience by estimating the mean-variance trend at the gene level, then interpolating this trend to predict the variances of individual observations.
The algorithm proceeds as follows. First, gene-wise linear models are fitted to the normalized log-cpm values, taking into account the experimental design, treatment conditions, replicates and so on. This generates a residual standard deviation for each gene. A robust trend is then fitted to the residual standard deviations as a function of the average log-count for each gene.
Also available from the linear models is a fitted value for each log-cpm observation. Taking the library sizes into account, the fitted log-cpm for each observation is converted into a predicted count. The standard deviation trend is then interpolated to predict the standard deviation of each individual observation based on its predicted count size. Finally, the inverse squared predicted standard deviation for each observation becomes the weight for that observation.
The log-cpm values and associated weights are then input into limma’s standard differential expression pipeline. Most limma functions are designed to accept quantitative weights, providing the ability to perform microarray-like analyses while taking account of the mean-variance relationship of the log-cpm values at the observation level.
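The interpolation step in the passage above can be sketched in a few lines. This is a deliberately simplified illustration in pure Python with made-up numbers, not limma's actual implementation:

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation of the sd trend (xs must be ascending)."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Hypothetical gene-level trend: average log-count vs residual sd
avg_logcount = [2.0, 4.0, 6.0, 8.0]
trend_sd = [1.6, 1.1, 0.8, 0.7]

# An observation's fitted log-count is mapped to a predicted sd, and its
# weight is the inverse squared predicted sd, as the passage describes.
pred_sd = interp(5.0, avg_logcount, trend_sd)  # 0.95
weight = pred_sd ** -2
```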
limma::lmFit
Taking in the EList from limma::voom, limma::lmFit produces \(g\) \(t\)-tests, giving vectors of \(\hat\beta\), \(\hat s\), and degrees of freedom.
Note that $stdev.unscaled \(= \sqrt{\mathrm{diag}\left((X^TWX)^{-1}\right)}\) for each gene, \(W\) being the diagonal matrix of weights calculated by limma::voom.
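To check this, we can recompute \(\sqrt{\mathrm{diag}\left((X^TWX)^{-1}\right)}\) for the first gene by hand, using the design matrix and the gene-1 row of voom's $weights printed above (a Python sketch that inverts the 2×2 matrix directly):

```python
import math

design = [[1, 0], [1, 0], [1, 1], [1, 1]]     # (Intercept), condition
w = [1.849269, 2.060690, 1.849269, 1.849269]  # voom $weights, gene 1

# For this design, X^T W X = [[a, b], [b, b]] with
a = sum(w)                                                # sum of all weights
b = sum(wi for wi, row in zip(w, design) if row[1] == 1)  # weights of the cases

# Diagonal of the 2x2 inverse, then square roots
det = a * b - b * b
stdev_unscaled = [math.sqrt(b / det), math.sqrt(a / det)]
# ≈ [0.5057, 0.7254], matching $stdev.unscaled for gene 1 below
```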
lim = lmFit(v)
lim
An object of class "MArrayLM"
$coefficients
(Intercept) condition
1 17.36819 0.4467125
2 17.08856 0.7332766
3 17.86439 -1.0439088
4 17.69392 -0.2963208
5 16.93478 -0.4444242
6 17.01916 0.3878582
$stdev.unscaled
(Intercept) condition
1 0.5057244 0.7253511
2 0.3959772 0.6535863
3 0.5199780 0.8017827
4 0.3967914 0.6259395
5 0.5199226 0.8167446
6 0.5289983 0.7170746
$sigma
[1] 1.0000276 0.8283184 0.8417619 0.6191636 1.0056442 0.9044428
$df.residual
[1] 2 2 2 2 2 2
$cov.coefficients
(Intercept) condition
(Intercept) 0.5 -0.5
condition -0.5 1.0
$pivot
[1] 1 2
$rank
[1] 2
$Amean
1 2 3 4 5 6
17.59091 17.52200 17.36161 17.49699 16.72363 17.15236
$method
[1] "ls"
$design
(Intercept) condition
1 1 0
2 1 0
3 1 1
4 1 1
attr(,"assign")
[1] 0 1
betahat.lim = lim$coefficients[,2]
betahat.lim
1 2 3 4 5 6
0.4467125 0.7332766 -1.0439088 -0.2963208 -0.4444242 0.3878582
sebetahat.lim = lim$stdev.unscaled[,2]*lim$sigma
sebetahat.lim
1 2 3 4 5 6
0.7253712 0.5413776 0.6749101 0.3875590 0.8213545 0.6485530
t.lim = betahat.lim / sebetahat.lim
df.lim = lim$df.residual
df.lim
[1] 2 2 2 2 2 2
limma::eBayes
Is it a variance shrinkage procedure?
Taking in the summary statistics (\(\hat\beta\), \(\hat s\), d.f.) for each gene, limma::eBayes produces a shrunken standard deviation for each gene, and thus moderated \(t\)-statistics and \(p\)-values.
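The shrinkage behind the moderated statistics is a precision-weighted average of the gene-wise variance and a prior variance. A minimal sketch (Python), assuming the standard formula from limma's hierarchical model (Smyth 2004):

```python
import math

def moderated_var(s2, df, s2_prior, df_prior):
    """Posterior (moderated) variance: a df-weighted average of the
    gene-wise variance s2 and the prior variance s2_prior."""
    if math.isinf(df_prior):
        return s2_prior  # infinite prior df: complete shrinkage to the prior
    return (df_prior * s2_prior + df * s2) / (df_prior + df)

# In the output below, df.prior is Inf, so every gene's s2.post equals
# s2.prior, and the moderated se is stdev.unscaled * sqrt(s2.post).
```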
lim.ebayes = ebayes(lim)
lim.ebayes
$df.prior
[1] Inf
$s2.prior
[1] 0.7679051
$s2.post
[1] 0.7679051 0.7679051 0.7679051 0.7679051 0.7679051 0.7679051
$t
(Intercept) condition
1 39.19105 0.7027908
2 49.24719 1.2802980
3 39.20572 -1.4857719
4 50.88712 -0.5402266
5 37.16953 -0.6209514
6 36.71385 0.6172411
$df.total
[1] 12 12 12 12 12 12
$p.value
(Intercept) condition
1 4.912765e-14 0.4955946
2 3.220173e-15 0.2246354
3 4.890901e-14 0.1631317
4 2.177295e-15 0.5989240
5 9.230571e-14 0.5462455
6 1.069038e-13 0.5486090
$var.prior
[1] 20.83590882 0.01302244
$lods
(Intercept) condition
1 751.8553 -4.601380
2 1196.5418 -4.585889
3 751.9251 -4.583232
4 1277.9980 -4.606774
5 675.1656 -4.601095
6 658.2650 -4.602920
sebetahat.ebayes = lim$stdev.unscaled[, 2] * sqrt(lim.ebayes$s2.post)
sebetahat.ebayes
1 2 3 4 5 6
0.6356266 0.5727390 0.7026037 0.5485120 0.7157149 0.6283739
t.ebayes = betahat.lim / sebetahat.ebayes
t.ebayes
1 2 3 4 5 6
0.7027908 1.2802980 -1.4857719 -0.5402266 -0.6209514 0.6172411
lim.ebayes$t[, 2]
1 2 3 4 5 6
0.7027908 1.2802980 -1.4857719 -0.5402266 -0.6209514 0.6172411
df.ebayes = lim.ebayes$df.total
df.ebayes
[1] 12 12 12 12 12 12
p.ebayes = (1 - pt(abs(t.ebayes), df.ebayes)) * 2
p.ebayes
1 2 3 4 5 6
0.4955946 0.2246354 0.1631317 0.5989240 0.5462455 0.5486090
lim.ebayes$p.value[, 2]
1 2 3 4 5 6
0.4955946 0.2246354 0.1631317 0.5989240 0.5462455 0.5486090
z = qnorm(1 - lim.ebayes$p[, 2] / 2) * sign(lim.ebayes$t[, 2])
z
1 2 3 4 5 6
0.6814377 1.2142943 -1.3946160 -0.5259485 -0.6033956 0.5998458
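The last chunk converts a two-sided \(p\)-value plus the sign of the \(t\)-statistic into a \(z\)-score. The same math in Python, with statistics.NormalDist standing in for qnorm:

```python
from math import copysign
from statistics import NormalDist

def p_to_z(p, t):
    """z-score with the same two-sided p-value as the t-statistic:
    qnorm(1 - p/2) * sign(t)."""
    return copysign(NormalDist().inv_cdf(1 - p / 2), t)

# e.g. the first gene above: p = 0.4955946, t = 0.7027908 -> z ≈ 0.6814
```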
sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.20.2 limma_3.34.4
loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 workflowr_1.0.1 Rcpp_0.12.16
[4] lattice_0.20-35 digest_0.6.15 rprojroot_1.3-2
[7] R.methodsS3_1.7.1 grid_3.4.3 backports_1.1.2
[10] git2r_0.21.0 magrittr_1.5 evaluate_0.10.1
[13] stringi_1.1.6 whisker_0.3-2 R.oo_1.21.0
[16] R.utils_2.6.0 rmarkdown_1.9 tools_3.4.3
[19] stringr_1.3.0 yaml_2.1.18 compiler_3.4.3
[22] htmltools_0.3.6 knitr_1.20
This reproducible R Markdown analysis was created with workflowr 1.0.1