E-MTAB-6141

所属分类:其他
开发工具:R
文件大小:0KB
下载次数:0
上传日期:2020-05-28 12:30:39
上 传 者sh-1993
说明:  数据来自Lewis,Barnes,Blighe等人,Cell Rep.2019年8月27日;28(9):2455–2470.e5。,
(Data from Lewis, Barnes, Blighe et al., Cell Rep. 2019 Aug 27; 28(9): 2455–2470.e5.,)

文件列表:
prep.R (1666, 2020-05-28)
rdata/ (0, 2020-05-28)
rdata/E-MTAB-6141.rdata (13055784, 2020-05-28)
rdata/mat.RDS (13992310, 2020-05-28)
rdata/mat.tsv (28689587, 2020-05-28)
rdata/metadata.RDS (5801, 2020-05-28)
rdata/metadata.tsv (2705, 2020-05-28)
rdata/sig_genes.RDS (36970, 2020-05-28)
rdata/sig_genes.list (17535, 2020-05-28)

A simple tutorial for a complex ComplexHeatmap ================ Kevin Blighe 2020-05-28 # 1, introduction The data used for the heatmap is taken from a large bulk RNA-seq study on rheumatoid arthritis synovial biopsies, a project on which I co-led as Senior Bioinformatician. The work was published as [Molecular Portraits of Early Rheumatoid Arthritis Identify Clinical and Treatment Response Phenotypes](https://pubmed.ncbi.nlm.nih.gov/31461658/) (Lewis et al. (2019)). *ComplexHeatmap* (Gu, Eils, and Schlesner (2016)) is an R Programming Language (R Core Team (2020)) package that is currently listed in the [Bioconductor](https://bioconductor.org/) package repository. # 2, install and load required packages ``` r require(RColorBrewer) require(ComplexHeatmap) require(circlize) require(digest) require(cluster) ``` If all load successfully, proceed to **Part 3**. Otherwise, go through the following code chunks in order to ensure that each package is installed and loaded properly. *BiocManager* (Morgan (2019)) ``` r if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager') ``` *RColorBrewer* (Neuwirth (2014)) ``` r if (!requireNamespace('RColorBrewer', quietly = TRUE)) BiocManager::install('RColorBrewer') require(RColorBrewer) ``` *ComplexHeatmap* (Gu, Eils, and Schlesner (2016)) ``` r if (!requireNamespace('ComplexHeatmap', quietly = TRUE)) BiocManager::install('ComplexHeatmap') require(ComplexHeatmap) ``` *circlize* (Gu et al. (2014)) ``` r if (!requireNamespace('circlize', quietly = TRUE)) BiocManager::install('circlize') require(circlize) ``` *digest* (Antoine Lucas et al. (2020)) ``` r if (!requireNamespace('digest', quietly = TRUE)) BiocManager::install('digest') require(digest) ``` *cluster* (Maechler et al. (2019)) ``` r if (!requireNamespace('cluster', quietly = TRUE)) BiocManager::install('cluster') require(cluster) ``` # 3, obtain the data Sample information and raw data FASTQ files are stored in the public domain under accessions [E-MTAB-6141](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6141/) and [PRJEB23131](https://www.ebi.ac.uk/ena/data/view/PRJEB23131), respectively. I have separately stored an expression matrix and sample metadata in my GitHub repository ([github.com/kevinblighe/](https://github.com/kevinblighe/)). The data will be downloaded to temporary files that will later be deleted by your operating system. The files are presented as uncompressed, plain text in order to ensure compatibility across all platforms outside of a standard build system. ``` r tmpfile <- tempfile() download.file('https://github.com/kevinblighe/E-MTAB-6141/raw/master/rdata/mat.tsv', tmpfile, method = 'auto') mat <- read.table(tmpfile, sep = '\t', row.names = 1, header = TRUE, stringsAsFactors = FALSE) tmpfile <- tempfile() download.file('https://github.com/kevinblighe/E-MTAB-6141/raw/master/rdata/metadata.tsv', tmpfile, method = 'auto') metadata <- read.table(tmpfile, sep = '\t', row.names = 1, header = TRUE, stringsAsFactors = FALSE) tmpfile <- tempfile() download.file('https://github.com/kevinblighe/E-MTAB-6141/raw/master/rdata/sig_genes.list', tmpfile, method = 'auto') sig_genes <- read.table(tmpfile, sep = '\t', header = FALSE, stringsAsFactors = FALSE)[,1] ``` Check the md5 checksums to ensure data integrity / security. The checksums should be: - ‘mat’ object: 1fbbe9568738577a2f3e3dc42e6c75cf - ‘metadata’ object: 542a40ccf8b14c51ffa45361c5d3aed9 - ‘sig\_genes’ object: fdc3e52c9cf0adff3747c7683b69d371 ``` r digest::digest(mat, algo = 'md5') ``` ## [1] "1fbbe9568738577a2f3e3dc42e6c75cf" ``` r digest::digest(metadata, algo = 'md5') ``` ## [1] "542a40ccf8b14c51ffa45361c5d3aed9" ``` r digest::digest(sig_genes, algo = 'md5') ``` ## [1] "fdc3e52c9cf0adff3747c7683b69d371" Take a look at the contents of the data. ``` r # first 5 rows; first 5 columns mat[1:5,1:5] ``` ## SAM9103802 SAM9103803 SAM9103804 SAM9103805 SAM9103806 ## A1BG 0.1288745 0.1637147 -0.1106011 -0.1113405 0.09776126 ## A1CF 1.4491133 1.6378292 1.4676648 1.5119170 1.40215292 ## A2M 15.0932787 14.8324464 14.7205315 14.6949978 14.70800150 ## A2ML1 3.4826292 3.7443431 4.4786253 3.1842529 4.80886450 ## A3GALT2 1.2259417 1.0704200 1.2452696 1.1006621 1.11011460 ``` r # take a peek at the metadata head(metadata) ``` ## Pathotype CD3 CD20 CD68L CD68SL CD138 ## SAM9103802 Lymphoid 2 3 1 3 3 ## SAM9103803 Lymphoid 3 4 4 3 3 ## SAM9103804 Myeloid 3 0 0 4 0 ## SAM9103805 Lymphoid 2 2 3 1 1 ## SAM9103806 Fibroid 0 0 0 1 0 ## SAM9103807 Fibroid 0 0 NA 1 0 ``` r # take a peek at the genes identified as statistically significant head(sig_genes) ``` ## [1] "A2ML1" "AADACL2" "ABCA10" "ABCA12" "AC010646.3" ## [6] "ACACB" ``` r # dimensions of expression data and metadata, and length of sig_genes dim(mat) ``` ## [1] 19279 87 ``` r dim(metadata) ``` ## [1] 87 6 ``` r length(sig_genes) ``` ## [1] 2772 ``` r # verify integrity of metadata and expression matrix: # --check that both objects are aligned by name all(rownames(metadata) == colnames(mat)) ``` ## [1] TRUE ``` r # Subset the expression matrix for the statistically significant genes mat <- mat[sig_genes,] ``` # 4, generate the heatmap #### \[a\] scale the data to Z-scores (by row) This is quite standard when performing clustering and generating a heatmap. ``` r heat <- t(scale(t(mat))) ``` #### \[b\] set colour scheme and choose breaks ``` r myCol <- colorRampPalette(c('dodgerblue', 'black', 'yellow'))(100) myBreaks <- seq(-3, 3, length.out = 100) ``` #### \[c\] create annotation: histo-pathotype and histology scores First, we will just generate some colour mappings for the metadata histology scores. ``` r # CD3 cd3 <- metadata$CD3 cd3 <- cd3[!is.na(cd3)] # remove missing values - we don't want to include these in the mapping pick.col <- brewer.pal(9, 'Greens') col.cd3 <- colorRampPalette(pick.col)(length(unique(cd3))) # CD20 cd20 <- metadata$CD20 cd20 <- cd20[!is.na(cd20)] pick.col <- brewer.pal(9, 'Blues') col.cd20 <- colorRampPalette(pick.col)(length(unique(cd20))) # CD68L cd68L <- metadata$CD68L cd68L <- cd68L[!is.na(cd68L)] pick.col <- brewer.pal(9, 'Reds') col.cd68L <- colorRampPalette(pick.col)(length(unique(cd68L))) # CD68SL cd68SL <- metadata$CD68SL cd68SL <- cd68L[!is.na(cd68L)] pick.col <- brewer.pal(9, 'Oranges') col.cd68SL <- colorRampPalette(pick.col)(length(unique(cd68SL))) # CD138 cd138 <- metadata$CD138 cd138 <- cd138[!is.na(cd138)] pick.col <- brewer.pal(9, 'Purples') col.cd138 <- colorRampPalette(pick.col)(length(unique(cd68SL))) ``` The use of *brewer.pal* and *colorRampPalette* above are to just automatically produce hexidecimal colour codes that will be used later for mapping between each histology score and each colour. ``` r unique(col.cd3) ``` ## [1] "#F7FCF5" "#C7E9C0" "#74C476" "#238B45" "#00441B" ``` r unique(col.cd20) ``` ## [1] "#F7FBFF" "#C6DBEF" "#6BAED6" "#2171B5" "#08306B" ``` r unique(col.cd68L) ``` ## [1] "#FFF5F0" "#FCBBA1" "#FB6A4A" "#CB181D" "#67000D" ``` r unique(col.cd68SL) ``` ## [1] "#FFF5EB" "#FDD0A2" "#FD8D3C" "#D94801" "#7F2704" ``` r unique(col.cd138) ``` ## [1] "#FCFBFD" "#DADAEB" "#9E9AC8" "#6A51A3" "#3F007D" Now let’s build the actual annotation object, i.e., the legend: ``` r # Create an initial data-frame of the annotation that we want to use # In this example, the 'ann' object turns out to be the exact same as 'metadata' ann <- data.frame( Pathotype = metadata$Pathotype, CD3 = metadata$CD3, CD20 = metadata$CD20, CD68L = metadata$CD68L, CD68SL = metadata$CD68SL, CD138 = metadata$CD138, stringsAsFactors = FALSE) # create the colour mapping colours <- list( Pathotype = c('Lymphoid' = 'blue', 'Myeloid' = 'red', 'Fibroid' = 'green3', 'Ungraded' = 'grey'), CD3 = c('0' = '#F7FCF5', '1' = '#C7E9C0', '2' = '#74C476', '3' = '#238B45', '4' = '#00441B'), CD20 = c('0' = '#F7FBFF', '1' = '#C6DBEF', '2' = '#6BAED6', '3' = '#2171B5', '4' = '#08306B'), CD68L = c('0' = '#FFF5F0', '1' = '#FCBBA1', '2' = '#FB6A4A', '3' = '#CB181D', '4' = '#67000D'), CD68SL = c('0' = '#FFF5EB', '1' = '#FDD0A2', '2' = '#FD8D3C', '3' = '#D94801', '4' = '#7F2704'), CD138 = c('0' = '#FCFBFD', '1' = '#DADAEB', '2' = '#9E9AC8', '3' = '#6A51A3', '4' = '#3F007D')) # now create the ComplexHeatmap annotation object # as most of these parameters are self-explanatory, comments will only appear where needed colAnn <- HeatmapAnnotation( df = ann, which = 'col', # 'col' (samples) or 'row' (gene) annotation? na_col = 'white', # default colour for any NA values in the annotation data-frame, 'ann' col = colours, annotation_height = 0.6, annotation_width = unit(1, 'cm'), gap = unit(1, 'mm'), annotation_legend_param = list( Pathotype = list( nrow = 4, # number of rows across which the legend will be arranged title = 'Pathotype', title_position = 'topcenter', legend_direction = 'vertical', title_gp = gpar(fontsize = 12, fontface = 'bold'), labels_gp = gpar(fontsize = 12, fontface = 'bold')), CD3 = list( nrow = 5, title = 'CD3', title_position = 'topcenter', legend_direction = 'vertical', title_gp = gpar(fontsize = 12, fontface = 'bold'), labels_gp = gpar(fontsize = 12, fontface = 'bold')), CD20 = list( nrow = 5, title = 'CD20', title_position = 'topcenter', legend_direction = 'vertical', title_gp = gpar(fontsize = 12, fontface = 'bold'), labels_gp = gpar(fontsize = 12, fontface = 'bold')), CD68L = list( nrow = 5, title = 'CD68L', title_position = 'topcenter', legend_direction = 'vertical', title_gp = gpar(fontsize = 12, fontface = 'bold'), labels_gp = gpar(fontsize = 12, fontface = 'bold')), CD68SL = list( nrow = 5, title = 'CD68SL', title_position = 'topcenter', legend_direction = 'vertical', title_gp = gpar(fontsize = 12, fontface = 'bold'), labels_gp = gpar(fontsize = 12, fontface = 'bold')), CD138 = list( nrow = 5, title = 'CD138', title_position = 'topcenter', legend_direction = 'vertical', title_gp = gpar(fontsize = 12, fontface = 'bold'), labels_gp = gpar(fontsize = 12, fontface = 'bold')))) ``` Believe me, there are many more parameters that can be configured than those which I show here. #### \[d\] create annotation: box-and-whisker plots ``` r boxplotCol <- HeatmapAnnotation( boxplot = anno_boxplot( heat, border = FALSE, gp = gpar(fill = '#CCCCCC'), pch = '.', size = unit(2, 'mm'), axis = TRUE, axis_param = list( gp = gpar(fontsize = 12), side = 'left')), annotation_width = unit(c(2.0), 'cm'), which = 'col') boxplotRow <- HeatmapAnnotation( boxplot = row_anno_boxplot( heat, border = FALSE, gp = gpar(fill = '#CCCCCC'), pch = '.', size = unit(2, 'mm'), axis = TRUE, axis_param = list( gp = gpar(fontsize = 12), side = 'top')), annotation_width = unit(c(2.0), 'cm'), which = 'row') ``` #### \[e\] create annotation: gene labels Many heatmaps are produced from a large number of variables / genes, which result in it being difficult to label each gene in the plot space. Here, we can ‘step through’ the variables / genes and choose to only label a select few. The number of rows (genes) in our object is: 2772 In this code snippet, we ‘step through’ the rownames and only retain every 40th successive label. ``` r genelabels <- rowAnnotation( Genes = anno_mark( at = seq(1, nrow(heat), 40), labels = rownames(heat)[seq(1, nrow(heat), 40)], labels_gp = gpar(fontsize = 10, fontface = 'bold'), padding = 0.75), width = unit(2.0, 'cm') + max_text_width( rownames(heat)[seq(1, nrow(heat), 40)], gp = gpar(fontsize = 10, fontface = 'bold'))) ``` #### \[f\] perform partitioning around medoids (PAM) to identify clusters in the data Performing k-means or PAM on our data can help us to identify internal ‘structure’ in the data that may relate to biologically meaningful pathways, as an example. ``` r pamClusters <- cluster::pam(heat, k = 4) # pre-select k = 4 centers pamClusters$clustering <- paste0('Cluster ', pamClusters$clustering) # fix order of the clusters to have 1 to 4, top to bottom pamClusters$clustering <- factor(pamClusters$clustering, levels = c('Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4')) ``` #### \[g\] create the actual heatmap object ``` r hmap <- Heatmap(heat, # split the genes / rows according to the PAM clusters split = pamClusters$clustering, cluster_row_slices = FALSE, name = 'Gene\nZ-\nscore', col = colorRamp2(myBreaks, myCol), # parameters for the colour-bar that represents gradient of expression heatmap_legend_param = list( color_bar = 'continuous', legend_direction = 'vertical', legend_width = unit(8, 'cm'), legend_height = unit(5.0, 'cm'), title_position = 'topcenter', title_gp=gpar(fontsize = 12, fontface = 'bold'), labels_gp=gpar(fontsize = 12, fontface = 'bold')), # row (gene) parameters cluster_rows = TRUE, show_row_dend = TRUE, #row_title = 'Statistically significant genes', row_title_side = 'left', row_title_gp = gpar(fontsize = 12, fontface = 'bold'), row_title_rot = 90, show_row_names = FALSE, row_names_gp = gpar(fontsize = 10, fontface = 'bold'), row_names_side = 'left', row_dend_width = unit(25,'mm'), # column (sample) parameters cluster_columns = TRUE, show_column_dend = TRUE, column_title = '', column_title_side = 'bottom', column_title_gp = gpar(fontsize = 12, fontface = 'bold'), column_title_rot = 0, show_column_names = FALSE, column_names_gp = gpar(fontsize = 10, fontface = 'bold'), column_names_max_height = unit(10, 'cm'), column_dend_height = unit(25,'mm'), # cluster methods for rows and columns clustering_distance_columns = function(x) as.dist(1 - cor(t(x))), clustering_method_columns = 'ward.D2', clustering_distance_rows = function(x) as.dist(1 - cor(t(x))), clustering_method_rows = 'ward.D2', # specify top and bottom annotations top_annotation = colAnn, bottom_annotation = boxplotCol) ``` #### \[j\] draw the heatmap ``` r draw(hmap + genelabels, heatmap_legend_side = 'left', annotation_legend_side = 'right', row_sub_title_side = 'left') ``` ![Molecular Portraits of Early Rheumatoid Arthritis Identify Clinical and Treatment Response Phenotypes](README_files/figure-gfm/clusterheatmap_fig1-1.png) # 5, extra: change colour scheme, breaks, and do extra clustering on columns Modifying colour and breaks can help to emphasise expression patterns in the heatmap. Columns (samples) can be clustered ‘on the fly’ via *column\_km*. ``` r myCol <- colorRampPalette(c('royalblue', 'white', 'red3'))(100) myBreaks <- seq(-1.5, 1.5, length.out = 100) hmap1 <- Heatmap(heat, name = 'Gene Z-score', col = colorRamp2(myBreaks, myCol), heatmap_legend_param = list( color_bar = 'continuous', legend_direction = 'horizontal', legend_width = unit(8, 'cm'), legend_height = unit(5.0, 'cm'), title_position = 'topcenter', title_gp=gpar(fontsize = 30, fontface = 'bold'), labels_gp=gpar(fontsize = 24, fontface = 'bold')), cluster_rows = TRUE, show_row_dend = TRUE, row_title = 'Statistically significant genes', row_title_side = 'left', row_title_gp = gpar(fontsize = 30, fontface = 'bold'), row_title_rot = 90, show_row_names = FALSE, row_names_gp = gpar(fontsize = 11, fontface = 'bold'), row_names_side = 'left', row_dend_width = unit(25,'mm'), cluster_columns = TRUE, show_column_dend = TRUE, column_title = 'Samples', column_title_side = 'bottom', column_title_gp = gpar(fontsize = 30, fontface = 'bold'), column_title_rot = 0, show_column_names = FALSE, column_names_gp = gpar(fontsize = 8, fontface = 'bold'), column_names_max_height = unit(10, 'cm'), column_dend_height = unit(25,'mm'), clustering_distance_columns = function(x) as.dist(1 - cor(t(x))), clustering_method_columns = 'ward.D2', clustering_distance_rows = function(x) as.dist(1 - cor(t(x))), clustering_method_rows = 'ward.D2') myCol <- colorRampPalette(c('forestgreen', 'black', 'purple'))(100) myBreaks <- seq(-2, 2, length.out = 100) hmap2 <- Heatmap(heat, split = pamClusters$clustering, cluster_row_slices = FALSE, column_km = 6, name = 'Gene Z-score', col = colorRamp2(myBreaks, myCol), heatmap_legend_param = list( color_bar = 'continuous', legend_direction = 'horizontal', legend_width = unit(8, 'cm'), legend_height = unit(5.0, 'cm'), title_position = 'topcenter', title_gp=gpar(fontsize = 30, fontface = 'bold'), labels_gp=gpar(fontsize = 24, fontface = 'bold')), cluster_rows = TRUE, show_row_dend = FALSE, #row_title = 'Statistically significant genes', row_title_side = 'right', row_title_gp = gpar(fontsize = 30, fontface = 'bold'), row_title_rot = 90, show_row_names = FALSE, row_names_gp = gpar(fontsize = 12, fontface = 'bold'), row_names_side = 'left', row_dend_width = unit(25,'mm'), cluster_columns = TRUE, show_column_dend = TRUE, column_title = 'Samples', column_title_side = 'bottom', column_title_gp = gpar(fontsize = 30, fontface = 'bold'), column_title_rot = 0, show_column_names = FALSE, column_names_gp = gpar(fontsize = 8, fontface = 'bold'), column_names_max_height = unit(10, 'cm'), column_dend_height = unit(25,'mm'), clustering_distance_columns = function(x) as.dist(1 - cor(t(x))), clustering_method_columns = 'ward.D2', clustering_distance_rows = function(x) as.dist(1 - cor(t(x))), clustering_method_rows = 'ward.D2') pushViewport(viewport(layout = grid.layout(nr = 1, nc = 2))) pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 1)) draw(hmap1, heatmap_legend_side = 'top', row_sub_title_side = 'left', newpage = FALSE) popViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 2)) draw(hmap2, heatmap_legend_side = 'top', row_sub_title_side = 'right', newpage = FALSE) popViewport() ... ...

近期下载者

相关文件


收藏者