rapids-single-cell-examples: Examples of accelerating single-cell genomics analysis with RAPIDS

  • C3_702526
    Author
  • 14.2MB
    File size
  • zip
    File format
  • 2022-05-27 10:16
    Upload date
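GPU-accelerated single-cell genomics analysis with RAPIDS. This repository contains example notebooks demonstrating how to use RAPIDS for GPU-accelerated analysis of single-cell sequencing data. RAPIDS is a suite of open-source Python libraries that can speed up data science workflows using GPU acceleration. Starting from a single-cell count matrix, RAPIDS libraries can be used to perform data processing, dimensionality reduction, clustering, visualization, and comparison of cell clusters. Several of our examples are inspired by the Scanpy tutorials and based upon the AnnData format. Currently, we provide examples for scRNA-seq and scATAC-seq, and we have scaled up to 1 million cells. We also show how to create GPU-powered interactive, in-browser visualizations to explore single-cell datasets. Dataset sizes for single-cell genomics studies are increasing, presently reaching millions of cells. With RAPIDS, it becomes easy to analyze large datasets interactively and in real time, enabling faster scientific discoveries. Installation. Docker container: a container with all dependencies, notebooks and source code is available at https://hub.docker.com/r/claraparabricks/single-cell-examples_rapids_cuda11.0. Please execute the following commands to start the notebook and follow the URL in the log to open Jupyter We…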
rapids-single-cell-examples-master.zip
  • rapids-single-cell-examples-master
  • images
  • 70k_lung.png
    87.6KB
  • atacworks_notebook_img.png
    108.3KB
  • viz3-2.gif
    8.7MB
  • dashboard_2.png
    360.9KB
  • dashboard.png
    557.7KB
  • 1M_brain.png
    65.1KB
  • 60k_bmmc_dsciATAC.png
    161.7KB
  • Dockerfile
    1KB
  • launch
    12.8KB
  • LICENSE
    11.1KB
  • conda
  • rapidgenomics_cuda10.1.yml
    329B
  • rapidgenomics_cuda10.2.viz.yml
    369B
  • rapidgenomics_cuda11.0.yml
    329B
  • rapidgenomics_cuda10.2.yml
    329B
  • cpu_notebook_env.yml
    314B
  • .gitignore
    139B
  • .dockerignore
    36B
  • README.md
    18.1KB
  • notebooks
  • hlca_lung_gpu_analysis.ipynb
    1.1MB
  • visualize.py
    19.8KB
  • utils.py
    4.3KB
  • coverage.py
    9.7KB
  • hlca_lung_gpu_analysis-visualization.ipynb
    71.2KB
  • dsci_bmmc_60k_cpu.ipynb
    901.8KB
  • hlca_lung_cpu_analysis.ipynb
    1.3MB
  • csv_to_h5ad.ipynb
    3KB
  • 5k_pbmc_coverage_gpu.ipynb
    283.6KB
  • 1M_brain_gpu_analysis_uvm.ipynb
    733.6KB
  • rapids_scanpy_funcs.py
    14.7KB
  • dsci_bmmc_60k_gpu.ipynb
    602.7KB
  • 1M_brain_cpu_analysis.ipynb
    1MB
  • build.sh
    3.3KB
Description
# GPU-Accelerated Single-Cell Genomics Analysis with RAPIDS

This repository contains example notebooks demonstrating how to use [RAPIDS](https://rapids.ai) for GPU-accelerated analysis of single-cell sequencing data.

RAPIDS is a suite of open-source Python libraries that can speed up data science workflows using GPU acceleration. Starting from a single-cell count matrix, RAPIDS libraries can be used to perform data processing, dimensionality reduction, clustering, visualization, and comparison of cell clusters.

Several of our examples are inspired by the [Scanpy tutorials](https://scanpy.readthedocs.io/en/stable/tutorials.html) and based upon the [AnnData](https://anndata.readthedocs.io/en/latest/index.html) format. Currently, we provide examples for scRNA-seq and scATAC-seq, and we have scaled up to 1 million cells. We also show how to create GPU-powered interactive, in-browser visualizations to explore single-cell datasets.

Dataset sizes for single-cell genomics studies are increasing, presently reaching millions of cells. With RAPIDS, it becomes easy to analyze large datasets interactively and in real time, enabling faster scientific discoveries.

## Installation

### Docker container

A container with all dependencies, notebooks and source code is available at https://hub.docker.com/r/claraparabricks/single-cell-examples_rapids_cuda11.0.

Please execute the following commands to start the notebook and follow the URL in the log to open the Jupyter web application.

```bash
docker pull claraparabricks/single-cell-examples_rapids_cuda11.0
docker run --gpus all --rm -v /mnt/data:/data claraparabricks/single-cell-examples_rapids_cuda11.0
```

### conda

All dependencies for these examples can be installed with conda. CUDA versions 10.1 and higher are supported currently.
```bash
conda env create --name rapidgenomics -f conda/rapidgenomics_cuda10.2.yml
conda activate rapidgenomics
python -m ipykernel install --user --display-name "Python (rapidgenomics)"
```

If installing for a system running a CUDA 10.1 driver, use `conda/rapidgenomics_cuda10.1.yml`. For CUDA 11.0, use `conda/rapidgenomics_cuda11.0.yml`.

After installing the necessary dependencies, you can just run `jupyter lab`.

### Launch Script

The launch script (`./launch`) can be used to start example notebooks either on a host or in a Docker container. This script prepares the environment and acquires the dataset for the examples.

```bash
# rapids-single-cell-examples$ ./launch -h
usage: launch <command> [<args>]

Following commands are wrapped by this tool:
   container  : Start Jupyter notebook in a container
   host       : Start Jupyter notebook on the host
   dataset    : Download dataset
   execute    : Execute an example
   create_env : Create conda environment for an example

To execute 'hlca_lung' example in container, please execute the following command:
  ./launch container -d /path/to/store/dataset -e hlca_lung

Example launcher

positional arguments:
  command     Subcommand to run

optional arguments:
  -h, --help  show this help message and exit
```

`./launch host` can be used to create a conda environment for executing any of the examples. To list all supported examples, please execute `./launch host -h`.

`./launch container` can be used to set up a container for the example.

`./launch execute` can be used to run an example in the background. Results are saved in place.

## Configuration

Unified Virtual Memory (UVM) can be used to [oversubscribe](https://developer.nvidia.com/blog/beyond-gpu-memory-limits-unified-memory-pascal/) your GPU memory so that chunks of data will be automatically offloaded to main memory when necessary.
This is a great way to explore data without having to worry about out of memory errors, but it does degrade performance in proportion to the amount of oversubscription. UVM is enabled by default in these examples and can be enabled/disabled in any RAPIDS workflow with the following:

```python
import cupy as cp
import rmm
rmm.reinitialize(managed_memory=True)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)
```

RAPIDS provides a [GPU Dashboard](https://medium.com/rapids-ai/gpu-dashboards-in-jupyter-lab-757b17aae1d5), which contains useful tools to monitor GPU hardware right in Jupyter.

## Example 1: Single-cell RNA-seq of 70,000 Human Lung Cells

<img align="left" width="240" height="200" src="https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/images/70k_lung.png?raw=true">

We use RAPIDS to accelerate the analysis of a ~70,000-cell single-cell RNA sequencing dataset from human lung cells. This example includes preprocessing, dimension reduction, clustering, visualization and gene ranking.

### Example Dataset

The dataset is from [Travaglini et al. 2020](https://www.biorxiv.org/content/10.1101/742320v2). If you wish to run the example notebook using the same data, use the following command to download the count matrix for this dataset and store it in the `data` folder:

```bash
wget -P <path to this repository>/data https://rapids-single-cell-examples.s3.us-east-2.amazonaws.com/krasnow_hlca_10x.sparse.h5ad
```

### Example Code

Follow this [Jupyter notebook](notebooks/hlca_lung_gpu_analysis.ipynb) for RAPIDS analysis of this dataset. In order for the notebook to run, the file [rapids_scanpy_funcs.py](notebooks/rapids_scanpy_funcs.py) needs to be in the same folder as the notebook.

We provide a second notebook with the CPU version of this analysis [here](notebooks/hlca_lung_cpu_analysis.ipynb).

### Acceleration

We report the runtime of these notebooks on various GCP instances below. All runtimes are given in seconds.
Acceleration is given in parentheses. Benchmarking was performed on Dec 16, 2020.

| Step                         | CPU <br> e2-standard-32 <br> 32 vCPUs | GPU <br> n1-standard-16 <br> T4 16 GB GPU <br> (Acceleration) | GPU <br> n1-highmem-8 <br> Tesla V100 16 GB GPU <br> (Acceleration) | GPU <br> a2-highgpu-1g <br> Tesla A100 40 GB GPU <br> (Acceleration) |
|------------------------------|---------------------------|---------------------------|---------------|--------------|
| Preprocessing                | 351   | 87 (4x)     | 78 (5x)     | 91 (4x)     |
| PCA                          | 5.7   | 4.6 (1.2x)  | 3.2 (2x)    | 2.7 (2x)    |
| t-SNE                        | 235   | 3.8 (62x)   | 1.9 (124x)  | 2.2 (107x)  |
| k-means (single iteration)   | 17.6  | 0.55 (32x)  | 0.14 (126x) | 0.09 (196x) |
| KNN                          | 39    | 20.6 (2x)   | 20.9 (2x)   | 5.3 (7x)    |
| UMAP                         | 46    | 0.97 (47x)  | 0.52 (88x)  | 0.63 (73x)  |
| Louvain clustering           | 16.9  | 0.22 (77x)  | 0.19 (89x)  | 0.14 (121x) |
| Leiden clustering            | 16.5  | 0.15 (110x) | 0.12 (138x) | 0.12 (138x) |
| Differential Gene Expression | 108   | 6.9 (16x)   | 2.5 (43x)   | 2.0 (54x)   |
| Re-analysis of subgroup      | 28    | 5.1 (5x)    | 4.3 (7x)    | 4.1 (7x)    |
| End-to-end notebook run      | 883   | 154         | 142         | 125         |
| Price ($/hr)                 | 1.073 | 1.110       | 2.953       | 4.00        |
| Total cost ($)               | 0.263 | 0.047       | 0.116       | 0.139       |

## Example 2: Single-cell RNA-seq of 1 Million Mouse Brain Cells

<img align="left" width="240" height="200" src="https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/images/1M_brain.png?raw=true">

We demonstrate the use of RAPIDS to accelerate the analysis of single-cell R
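
Returning to the Example 1 benchmark table: the README does not say how the "Total cost ($)" row was derived. The sketch below assumes it is the end-to-end runtime converted to hours and multiplied by the hourly instance price; that assumption reproduces the four published values. It also computes the end-to-end speedup over the CPU instance, which the table omits. The function and variable names are illustrative and do not come from the repository.

```python
# End-to-end runtimes (seconds) and hourly prices ($/hr), copied from the
# Example 1 benchmark table. The dictionary keys are illustrative labels.
INSTANCES = {
    "e2-standard-32 (CPU)": (883, 1.073),
    "n1-standard-16 (T4)": (154, 1.110),
    "n1-highmem-8 (V100)": (142, 2.953),
    "a2-highgpu-1g (A100)": (125, 4.00),
}


def run_cost(seconds, price_per_hour):
    """Dollar cost of one run, assuming billing is proportional to runtime."""
    return seconds / 3600 * price_per_hour


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds


cpu_seconds = INSTANCES["e2-standard-32 (CPU)"][0]

# Map each instance to (cost rounded to 3 decimals, speedup rounded to 1 decimal).
summary = {
    name: (round(run_cost(sec, price), 3), round(speedup(cpu_seconds, sec), 1))
    for name, (sec, price) in INSTANCES.items()
}

for name, (cost, factor) in summary.items():
    print(f"{name}: ${cost:.3f} per run, {factor:.1f}x vs CPU")
```

Under this assumption the costs come out to 0.263, 0.047, 0.116 and 0.139 dollars, matching the table, and the end-to-end speedups are roughly 5.7x (T4), 6.2x (V100) and 7.1x (A100). The T4 instance is the cheapest per run even though the A100 is the fastest.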