ballgown-release

所属分类:生物医药技术
开发工具:R
文件大小:0KB
下载次数:0
上传日期:2015-11-12 18:59:18
上 传 者sh-1993
说明:  发布版本的“舞会服”生物导体包,
(release version of "ballgown" Bioconductor package,)

文件列表:
.Rbuildignore (130, 2015-11-12)
.travis.yml (570, 2015-11-12)
DESCRIPTION (874, 2015-11-12)
NAMESPACE (1637, 2015-11-12)
NEWS (2167, 2015-11-12)
R/ (0, 2015-11-12)
R/AllClass.R (1962, 2015-11-12)
R/AllGenerics.R (2782, 2015-11-12)
R/annotate_assembly.R (2788, 2015-11-12)
R/assessSim.R (6777, 2015-11-12)
R/ballgown-expr-methods.R (7742, 2015-11-12)
R/ballgown-get-methods.R (7334, 2015-11-12)
R/ballgown-package.R (372, 2015-11-12)
R/ballgown-replace-methods.R (1611, 2015-11-12)
R/ballgown-show-method.R (216, 2015-11-12)
R/ballgown-subset-method.R (4913, 2015-11-12)
R/ballgownrsem.R (10075, 2015-11-12)
R/bg-data.R (360, 2015-11-12)
R/checkAssembledTx.R (5415, 2015-11-12)
R/closestColor.R (155, 2015-11-12)
R/clusterTranscripts.R (2681, 2015-11-12)
R/collapseTranscripts.R (4050, 2015-11-12)
R/constructor.R (15819, 2015-11-12)
R/contains.R (2923, 2015-11-12)
R/exprfilter.R (2427, 2015-11-12)
R/getAttributeField.R (1457, 2015-11-12)
R/getGenes.R (2121, 2015-11-12)
R/gffRead.R (1530, 2015-11-12)
R/gffReadGR.R (1971, 2015-11-12)
R/last.R (401, 2015-11-12)
R/pctOverlap.R (1928, 2015-11-12)
R/plotLatentTranscripts.R (3953, 2015-11-12)
R/plotMeans.R (7801, 2015-11-12)
R/plotTranscripts.R (11713, 2015-11-12)
R/ss.R (151, 2015-11-12)
R/stattest.R (15873, 2015-11-12)
R/writeFiles.R (2379, 2015-11-12)
data/ (0, 2015-11-12)
... ...

# Introduction and Preprocessing Ballgown is a software package designed to facilitate flexible differential expression analysis of RNA-Seq data. It also provides functions to organize, visualize, and analyze the expression measurements for your transcriptome assembly. Before using the Ballgown R package, a few preprocessing steps are necessary: 1. RNA-Seq reads should be aligned to a reference genome. 2. A transcriptome should be assembled, or a reference transcriptome should be downloaded. 3. Expression for the features (transcript, exon, and intron junctions) in the transcriptome should be estimated in a Ballgown readable format. Two sample pipelines for preprocessing are as follows: 1. **Pipeline 1:** _TopHat2_ (1) + _Stringtie_ (2,3) 1. _TopHat2_ [Trapnell et al. (2009)] is built on the ultrafast short read mapping program _Bowtie_ and aligns RNA-Seq reads to a genome while identifying exonic splice junctions. Sample command: ` tophat2 -G reference.gff -o outputDirectory -p 6 referenceIndex reads ` 2. _Stringtie_ [M. Pertea et al. (2015)] is a highly efficient assembler for RNA-Seq alignments using a novel network flow algorithm. It simultaneously assembles and quantifies expression levels for the features of the transcriptome in a Ballgown readable format (by using the option -B). One command to _Stringtie_ satisfies steps 2 and 3 above. Sample command: ` stringtie -B -G reference.gff -p 6 accepted_hits.bam -o stringtie.gff ` 2. **Pipeline 2:** _TopHat2_ (1) + _Cufflinks_ (2) + _Tablemaker_ (3) 1. _Tophat2_ produces alignments as noted above. 2. _Cufflinks_ [Trapnell et al. (2010)] also assembles transcriptomes from RNA-Seq data and quantifies their expression. Sample command: ` cufflinks -g reference.gff -o outputDirectory accepted_hits.bam ` 3. _Tablemaker_ calls _Cufflinks_ to estimate feature expressions in a Ballgown readable format. _Tablemaker_ access and instructions can be found [here](https://github.com/leekgroup/tablemaker). # Installation Ballgown is [available via Bioconductor](http://bioconductor.org/packages/devel/bioc/html/ballgown.html). To install, start R and run: ```r source("http://bioconductor.org/biocLite.R") biocLite("ballgown") ``` #### "alpha" version `ballgown` changed significantly in July 2014. For backwards compatibility with code written before July 2014, we've saved the original version of Ballgown in the `alpha` branch of this repository. You can install this version from GitHub: ```r install.packages("devtools") #if needed devtools::install_github('ballgown', 'alyssafrazee', ref='alpha') ``` # Ballgown readable expression output The files that _Stringtie_ and _Tablemaker_ produce for Ballgown to load is as follows: * `e_data.ctab`: exon-level expression measurements. One row per exon. Columns are `e_id` (numeric exon id), `chr`, `strand`, `start`, `end` (genomic location of the exon), and the following expression measurements for each sample: * `rcount`: reads overlapping the exon * `ucount`: uniquely mapped reads overlapping the exon * `mrcount`: multi-map-corrected number of reads overlapping the exon * `cov` average per-base read coverage * `cov_sd`: standard deviation of per-base read coverage * `mcov`: multi-map-corrected average per-base read coverage * `mcov_sd`: standard deviation of multi-map-corrected per-base coverage * `i_data.ctab`: intron- (i.e., junction-) level expression measurements. One row per intron. Columns are `i_id` (numeric intron id), `chr`, `strand`, `start`, `end` (genomic location of the intron), and the following expression measurements for each sample: * `rcount`: number of reads supporting the intron * `ucount`: number of uniquely mapped reads supporting the intron * `mrcount`: multi-map-corrected number of reads supporting the intron * `t_data.ctab`: transcript-level expression measurements. One row per transcript. Columns are: * `t_id`: numeric transcript id * `chr`, `strand`, `start`, `end`: genomic location of the transcript * `t_name`: Cufflinks-generated transcript id * `num_exons`: number of exons comprising the transcript * `length`: transcript length, including both exons and introns * `gene_id`: gene the transcript belongs to * `gene_name`: HUGO gene name for the transcript, if known * `cov`: per-base coverage for the transcript (available for each sample) * `FPKM`: Cufflinks-estimated FPKM for the transcript (available for each sample) * `e2t.ctab`: table with two columns, `e_id` and `t_id`, denoting which exons belong to which transcripts. These ids match the ids in the `e_data` and `t_data` tables. * `i2t.ctab`: table with two columns, `i_id` and `t_id`, denoting which introns belong to which transcripts. These ids match the ids in the `i_data` and `t_data` tables. # Loading data into R The default directory structure produced by _Stringtie_ and _Tablemaker_ should mirror the `extdata` folder in the Ballgown pacakge: ``` extdata/ sample01/ e2t.ctab e_data.ctab i2t.ctab i_data.ctab t_data.ctab sample02/ e2t.ctab e_data.ctab i2t.ctab i_data.ctab t_data.ctab ... sample20/ e2t.ctab e_data.ctab i2t.ctab i_data.ctab t_data.ctab ``` Data is loaded using the `ballgown` function. If your data is stored in directories matching the above structure (one root folder, subfolders named by sample, and `.ctab` files in the subfolders), you can use the `dataDir` and `samplePattern` arguments to load the data. `samplePattern` takes a regular expressions specifying the subfolders that should be included in the ballgown object: ```r library(ballgown) data_directory = system.file('extdata', package='ballgown') # automatically finds ballgown's installation directory # examine data_directory: data_directory ``` ``` ## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/ballgown/extdata" ``` ```r # make the ballgown object: bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all') bg ``` ``` ## ballgown instance with 100 transcripts and 20 samples ``` If your data is stored in a directory structure other than the one specified above, you can use the `samples` argument in the `ballgown` function: `samples` should be a vector (1-d array) with one entry per sample, where the entry gives the path to the folder containing that sample's `.ctab` files. The result from either of these approaches is an object of class `ballgown` (named `bg` in these examples). In the rest of this document, we use `bg` to refer to the first example, where samples are named `sample01` through `sample20`. A note for large experiments (with many samples or with large genomes): loading the data might require a lot of time and memory. In these cases, it's often useful to do the data loading in non-interactive mode. More specifically, you could create a script called `load.R` that contains these lines: ```R library(ballgown) data_directory = system.file('extdata', package='ballgown') bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all') save(bg, file='bg.rda') ``` You could then run this script non-interactively using `R CMD BATCH`: from the command line, run: ``` R CMD BATCH load.R ``` This may take some time, but when it finishes, the file `bg.rda` will be saved in the current directory, and you can read it back into R using the `load()` function. Rda files are usually only a few Gb on disk, even for large experiments. It is also possible to load only a subset of all the expression measurements by changing the `meas` argument to the `ballgown` function. For example, to only load transcript-level FPKMs, set `meas = 'FPKM'` and to load average coverage values and read counts, set `meas=c('cov', 'rcount').` See `?ballgown` for detailed information on creating Ballgown objects. # Accessing assembly data A `ballgown` object has six slots: `structure`, `expr`, `indexes`, `dirs`, `mergedDate`, and `meas`. #### structure The `structure` slot depends heavily on the `GenomicRanges` Bioconductor package (Lawrence et al. (2013)). The slot specifies the structure, i.e., genomic locations and relationships between exons, introns, and transcripts, of the transcriptome assembly. It is convenient to represent exons and introns as intervals and to represent transcripts as a set of intervals (exons), so assembled exons and introns are available as `GRanges` objects, and the assembled transcripts are available as a `GRangesList` object. This means that useful range operations, such as `findOverlaps` and `reduce`, are readily available for assembled features. Exon, intron, and transcript structures are easily extracted from the main `ballgown` object: ```r structure(bg)$exon ``` ``` ## GRanges object with 633 ranges and 2 metadata columns: ## seqnames ranges strand | id transcripts ## | ## [1] 18 [24412069, 24412331] * | 12 10 ## [2] 22 [17308271, 17308950] + | 55 25 ## [3] 22 [17309432, 17310226] + | 56 25 ## [4] 22 [18121428, 18121652] + | 88 35 ## [5] 22 [18138428, 18138598] + | 89 35 ## ... ... ... ... ... ... ... ## [629] 22 [51221929, 51222113] - | 3777 1294 ## [630] 22 [51221319, 51221473] - | 3782 1297 ## [631] 22 [51221929, 51222162] - | 3783 1297 ## [632] 22 [51221929, 51222168] - | 3784 1301 ## [633] 6 [31248149, 31248334] * | 3794 1312 ## ------- ## seqinfo: 3 sequences from an unspecified genome; no seqlengths ``` ```r structure(bg)$intron ``` ``` ## GRanges object with 536 ranges and 2 metadata columns: ## seqnames ranges strand | id ## | ## [1] 22 [17308951, 17309431] + | 33 ## [2] 22 [18121653, 18138427] + | 57 ## [3] 22 [18138599, 18185008] + | 58 ## [4] 22 [18185153, 18209442] + | 59 ## [5] 22 [18385514, 18387397] - | 72 ## ... ... ... ... ... ... ## [532] 22 [51216410, 51220615] - | 2750 ## [533] 22 [51220776, 51221928] - | 2756 ## [534] 22 [51220780, 51221318] - | 2757 ## [535] 22 [51221474, 51221928] - | 2758 ## [536] 22 [51220780, 51221928] - | 2759 ## transcripts ## ## [1] 25 ## [2] 35 ## [3] 35 ## [4] 35 ## [5] 41 ## ... ... ## [532] c(1294, 1297, 1301) ## [533] 1294 ## [534] 1297 ## [535] 1297 ## [536] 1301 ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` ```r structure(bg)$trans ``` ``` ## GRangesList object of length 100: ## $10 ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | id transcripts ## | ## [1] 18 [24412069, 24412331] * | 12 10 ## ## $25 ## GRanges object with 2 ranges and 2 metadata columns: ## seqnames ranges strand | id transcripts ## [1] 22 [17308271, 17308950] + | 55 25 ## [2] 22 [17309432, 17310226] + | 56 25 ## ## $35 ## GRanges object with 4 ranges and 2 metadata columns: ## seqnames ranges strand | id transcripts ## [1] 22 [18121428, 18121652] + | 88 35 ## [2] 22 [18138428, 18138598] + | 89 35 ## [3] 22 [18185009, 18185152] + | 90 35 ## [4] 22 [18209443, 18212080] + | 91 35 ## ## ... ## <97 more elements> ## ------- ## seqinfo: 3 sequences from an unspecified genome; no seqlengths ``` #### expr The `expr` slot is a list that contains tables of expression data for the genomic features. These tables are very similar to the `*_data.ctab` _Tablemaker_ output files. Ballgown implements the following syntax to access components of the `expr` slot: ```R *expr(ballgown_object_name, ) ``` where `*` is either e for exon, i for intron, t for transcript, or g for gene, and is an expression-measurement column name from the appropriate `.ctab` file. Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene. All of the following are valid ways to extract expression data from the `bg` ballgown object: ```r transcript_fpkm = texpr(bg, 'FPKM') transcript_cov = texpr(bg, 'cov') whole_tx_table = texpr(bg, 'all') exon_mcov = eexpr(bg, 'mcov') junction_rcount = iexpr(bg) whole_intron_table = iexpr(bg, 'all') gene_expression = gexpr(bg) ``` Calculating the gene-level expression measurements can be slow for large experiments. The `*expr` functions return matrices unless `meas = 'all'`, in which case some additional feature metadata is returned and the result is a `data.frame`. #### indexes The `indexes` slot of a ballgown object connects the pieces of the assembly and provides other experimental information. `indexes(bg)` is a list with several components that can be extracted with the `$` operator. Perhaps most importantly, there is a component called `pData` that should hold a data frame of phenotype information for the samples in the experiment. This must be created manually. It is **very important** that the rows of pData are in the correct order. Each row corresponds to a sample, and the rows of pData should be ordered the same as the tables in the `expr` slot. You can check that order by running `sampleNames(bg)`. The `pData` component can be added during construction (you can pass a data frame to the `ballgown` function), or you can add it later: ```r pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10)) ``` The other components of `indexes` are the `e2t` and `i2t` tables described in the _Tablemaker_ section, as well as a `t2g` table denoting which transcripts belong to which genes. There is also a `bamfiles` component, designed to hold paths to the read alignment files for each sample. The `bamfiles` component isn't currently used by any ballgown functions, but it could come in handy for users of `RSamtools` or similar packages. Here are some examples of how to extract `indexes` components from ballgown objects: ```r exon_transcript_table = indexes(bg)$e2t transcript_gene_table = indexes(bg)$t2g head(transcript_gene_table) ``` ``` ## t_id g_id ## 1 10 XLOC_000010 ## 2 25 XLOC_000014 ## 3 35 XLOC_000017 ## 4 41 XLOC_000246 ## 5 45 XLOC_000019 ## 6 67 XLOC_000255 ``` ```r phenotype_table = pData(bg) ``` #### other slots The `dirs` slot gives full filepaths to _Tablemaker_ output: ```r head(bg@dirs) ``` ``` ## sample01 ## "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/ballgown/extdata/sample01" ## sample02 ## "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/ballgown/extdata/sample02" ## sample03 ## "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/ballgown/extdata/sample03" ## sample04 ## "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/ballgown/extdata/sample04" ## sample05 ## "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/ballgown/extdata/sample05" ## sample06 ## "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/ballgown/extdata/sample06" ``` The `mergedDate` slot indicates when the `ballgown` object was created: ```r bg@mergedDate ``` ``` ## [1] "Sun Feb 22 23:51:45 2015" ``` And the `meas` slot gives the expression measurements present in the object: ```r bg@meas ``` ``` ## [1] "all" ``` # Plotting transcript structures Visualization of the assembled transcripts is done with the `plotTranscripts` function. Transcripts or exons can be colored by expression level. This plot colors transcripts by expression level: ```r plotTranscripts(gene='XLOC_000454', gown=bg, samples='sample12', meas='FPKM', colorby='transcript', main='transcripts from gene XLOC_000454: sample 12, FPKM') ``` ![](figure/plotTranscripts-1.png) It is also possible to plot several samples at once: ```r plotTranscripts('XLOC_000454', bg, samples=c('sample01', 'sample06', 'sample12', 'sample19'), meas='FPKM', colorby='transcript') ``` ![](figure/plotTranscripts2-1.png) You can also make side-by-side plots comparing mean abundances between groups (here, 0 and 1): ```r plotMeans('XLOC_000454', bg, groupvar='group', meas='FPKM', colorby='transcript') ``` ![](figure/plotMeans-1.png) # Differential expression analysis Ballgown provides a wide selection of simple, fast statistical methods for testing whether transcripts are differentially expressed between experimental conditions or across a continuous covariate (such as time). The default statistical test in ballgown is a parametric F-test comparing nested linear models; details are available in the Ballgown manuscript (Frazee et al. (2014)). These models are conceptually simialar to the models used by Smyth (2005) in the `limma` package. In `limma`, more sophisticated empirical Bayes shrinkage methods are used, and generally a single linear model is fit per feature instead of doing a nested model comparison, but the flavor is similar (and in fact, `limma` can easily be run on any of the data matrices in a `ballgown` object). Ballgown's statistical models are implem ... ...

近期下载者

相关文件


收藏者