GenomicsBench


The GenomicsBench benchmark suite consists of 12 representative kernels spanning the major steps in short- and long-read sequence analysis pipelines, such as basecalling, sequence alignment, de-novo assembly, variant calling, and polishing. GenomicsBench includes parallel versions of the source code, with CPU and GPU implementations as applicable, along with representative input datasets of two sizes: small and large.

Of the 12 kernels, 9 run on CPUs and 3 require GPUs. We plan to update GenomicsBench to include GPU implementations of more of the benchmarks. Please check GitHub for the latest updates.


Common Workflows in Genomics


[Figure: pipelines]

The figure above shows the steps performed in some common genomics pipelines that analyze short- and long-read sequencing data: reference-guided assembly, de-novo assembly, and metagenomics classification. All three pipelines start with the raw sequencer output. Please see the paper for details.


Key Computational Kernels


[Table: kernels]

The table above shows the 12 benchmarks and their corresponding parallelism motifs. bsw, phmm, chain, spoa and abea are all based on dynamic programming but differ in important ways. The suite also contains kernels that manipulate hash tables and perform graph construction (dbg, spoa) and index lookup (fmi, kmer-cnt). It covers both regular and irregular control flow and memory accesses. Please see the paper for details.


Getting Started


Download

  • Latest source code
git clone --recursive https://github.com/arun-sub/genomicsbench.git
  • Input datasets
wget https://genomicsbench.eecs.umich.edu/input-datasets.tar.gz
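
The downloaded archive has to be unpacked before the run scripts can be pointed at it. A minimal sketch, assuming a gzip-compressed tar file as the .tar.gz name suggests; unpack_inputs and the destination directory are hypothetical names, not part of GenomicsBench, and the layout inside the archive is not documented here:

```shell
# Hypothetical helper (not part of GenomicsBench): unpack the dataset archive.
# $1 = path to the downloaded archive
# $2 = destination directory (defaults to the current directory)
unpack_inputs() {
    archive=$1
    dest=${2:-.}
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    # Create the destination if needed, then extract into it.
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Example call, assuming the archive sits in the current directory:
# unpack_inputs input-datasets.tar.gz "$HOME/genomicsbench-inputs"
```

The directory that results from the extraction is what would later be passed to the run scripts as the input dataset folder.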

Prerequisites

  • RHEL/Fedora system prerequisites
sudo yum -y install $(cat rhel.prerequisites)
  • Debian system prerequisites
sudo apt-get install $(cat debian.prerequisites)

Python setup (optional: only needed for GPU benchmarks)

To run the Python-based benchmarks (nn-base, nn-variant and abea), follow the steps below:

  • Download and install Miniconda from this link.

  • Follow the steps below to set up a conda environment:

# make sure channels are added in conda
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

# create conda environment named "genomicsbench"
conda create -n genomicsbench -c bioconda clair python==3.6.8
conda activate genomicsbench
conda install deepdish

pip install --upgrade pip
pip install -r requirements.txt
pypy3 -m ensurepip
pypy3 -m pip install --no-cache-dir intervaltree==3.0.2
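
After the environment is set up, it can help to confirm that the commands used above are actually on PATH. A minimal sketch, assuming a POSIX shell; missing_tools is a hypothetical helper, not something shipped with GenomicsBench:

```shell
# Hypothetical helper: report which of the given commands are not on PATH.
# Returns 0 when every command is found, 1 otherwise.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return $status
}

# Example, after `conda activate genomicsbench`:
# missing_tools conda python pip pypy3
```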

Compiling benchmarks

  • CPU benchmarks
    • The MKLROOT and MKL_IOMPS_DIR variables need to be set in the Makefile to run grm. If you don’t want to run grm, comment out the grm-related commands in the Makefile.
    • The VTUNE_HOME variable needs to be set if you want to run any VTune-based analyses.
make -j<num_threads>
  • GPU benchmarks
    • In the Makefile, set CUDA_LIB to /usr/local/cuda or to the path of your local CUDA installation.
    • Also ensure environment variables PATH and LD_LIBRARY_PATH include the path to CUDA binaries and libraries.
make -j<num_threads> gpu
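
The PATH and LD_LIBRARY_PATH requirement above could be met with a small helper like the one below. This is a sketch under assumptions: use_cuda is a hypothetical name, and the bin/ and lib64/ subdirectories reflect the usual layout of a Linux CUDA installation, which may differ on your system.

```shell
# Hypothetical helper: expose a CUDA installation to the current shell.
# $1 = CUDA root; defaults to /usr/local/cuda, the path named above.
# Assumes binaries live in bin/ and libraries in lib64/ under that root.
use_cuda() {
    cuda_root=${1:-/usr/local/cuda}
    # Prepend so this installation takes precedence over any other on the path.
    PATH="$cuda_root/bin${PATH:+:$PATH}"
    LD_LIBRARY_PATH="$cuda_root/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export PATH LD_LIBRARY_PATH
}

# Example: use_cuda /usr/local/cuda && nvcc --version
```

The helper only affects the shell it is called in; CUDA_LIB in the Makefile still has to be set separately.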

Running benchmarks

  • CPU benchmarks
cd scripts
chmod +x ./run_cpu.sh
./run_cpu.sh <path to input dataset folder> <input size to run: small | large>
  • GPU benchmarks
cd scripts
chmod +x ./run_gpu.sh
./run_gpu.sh <path to input dataset folder> <input size to run: small | large>
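
Both run scripts take the same two arguments, so a wrong folder or a mistyped size can be caught before a long run starts. A minimal sketch of such a pre-flight check; check_run_args is a hypothetical helper and not part of the GenomicsBench scripts:

```shell
# Hypothetical pre-flight check (not part of the GenomicsBench scripts).
# $1 = input dataset folder, $2 = input size (small or large).
check_run_args() {
    if [ ! -d "$1" ]; then
        echo "input dataset folder not found: $1" >&2
        return 1
    fi
    case $2 in
        small|large) return 0 ;;
        *) echo "input size must be 'small' or 'large', got: $2" >&2
           return 1 ;;
    esac
}

# Example, run from the scripts directory (the dataset path is illustrative):
# check_run_args ~/input-datasets small && ./run_cpu.sh ~/input-datasets small
```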

Questions / Feedback?


GenomicsBench is under active development, and we appreciate any feedback and suggestions from the community. Feel free to raise an issue or submit a pull request on GitHub. For assistance in using GenomicsBench, please contact Arun Subramaniyan (arunsub@umich.edu), Yufeng Gu (yufenggu@umich.edu) or Timothy Dunn (timdunn@umich.edu).


Acknowledgement


This work was supported in part by Precision Health at the University of Michigan, by the Kahn foundation, by the NSF under the CAREER-1652294 award and the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA.


License and Usage


Each benchmark is individually licensed according to the tool it is extracted from. Please see the README.md of each benchmark on the GitHub page for more details.

If you use GenomicsBench or find GenomicsBench useful, please cite this work:

@inproceedings{genomicsbench,
    title={GenomicsBench: A Benchmark Suite for Genomics},
    author={Subramaniyan, Arun and
            Gu, Yufeng and
            Dunn, Timothy and
            Paul, Somnath and
            Vasimuddin, Md. and
            Misra, Sanchit and
            Blaauw, David and
            Narayanasamy, Satish and
            Das, Reetuparna},
    booktitle={Proceedings of the 
               IEEE International Symposium 
               on Performance Analysis of
               Systems and Software (ISPASS)},
    year={2021}
}