Posted by: shrikantmantri | September 30, 2009

Bioinformatics, Genomes, EC2, and Hadoop

via Amazon Web Services Blog by AWS Evangelist on 9/29/09

I think it is really interesting to see how breakthroughs and process improvements in one scientific or technical discipline can drive that discipline forward while also enabling progress in other seemingly unrelated disciplines.

The bioinformatics field is rife with examples of this pattern. Declining hardware costs, cloud computing, the ability to do parallel processing, and algorithmic advances have driven down the cost and time of gene sequencing by multiple orders of magnitude in the space of a decade or two. Processing that was once measured in years and megabucks is now measured in hours and dollars.

My colleague Deepak Singh pointed out a number of recent AWS-related developments in this space:

JCVI Cloud Bio-Linux

Built on top of a 64-bit Ubuntu distribution, the JCVI Cloud Bio-Linux gives scientists the ability to launch EC2 instances chock-full of the latest bioinformatics packages, including BLAST (Basic Local Alignment Search Tool), glimmer (Microbial Gene-Finding System), hmmer (Biosequence Analysis Using Profile Hidden Markov Models), phylip (Phylogeny Inference Package), rasmol (Molecular Visualization), genespring (statistical analysis, data mining, and visualization tools), clustalw (general purpose multiple sequence alignment), the Celera Assembler (de novo whole-genome shotgun DNA sequence assembler), and the NIH EMBOSS utilities. The Celera Assembler can be used to assemble entire bacterial genome sequences on Amazon EC2 today!

There’s a getting-started guide for the JCVI AMI. Graphical and command-line bioinformatics tools can be launched from a shell window connected to a running instance of the AMI.

CloudBurst


CloudBurst is described as a “new parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping, and personal genomics.”

In layman’s terms, CloudBurst uses Hadoop to implement a linearly scalable search tool. Once loaded with a reference genome, it maps the “short reads” (snippets of sequenced DNA approximately 30 base pairs long) to a location (or locations) on the reference genome. Think of it as a very advanced form of string matching, with support for partial matches, insertions, deletions, and subtle differences. This is a highly parallelizable operation; CloudBurst reduces operations involving millions of short reads from hours to minutes when run on a large-scale cluster of EC2 instances.
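
To make the “advanced string matching” idea concrete, here is a minimal sketch in plain Python of seed-and-extend read mapping arranged as map, shuffle, and reduce steps. It is not CloudBurst’s actual code and does not use Hadoop: every function name is hypothetical, it tolerates mismatches only (no insertions or deletions), and the reduce step looks the full sequences up in memory rather than receiving them through the shuffle as a real cluster job would need to. The trick it relies on is that a read with at most k mismatches, cut into k+1 non-overlapping seeds, must contain at least one seed that matches the reference exactly.

```python
from collections import defaultdict


def map_reference(reference, seed_len):
    """Map step: emit (seed, ("ref", position)) for every window of the reference."""
    for pos in range(len(reference) - seed_len + 1):
        yield reference[pos:pos + seed_len], ("ref", pos)


def map_read(read_id, read, seed_len):
    """Map step: emit (seed, ("read", id, offset)) for each non-overlapping seed."""
    for off in range(0, len(read) - seed_len + 1, seed_len):
        yield read[off:off + seed_len], ("read", read_id, off)


def shuffle(pairs):
    """Shuffle step: group all values that share the same seed (the key)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def reduce_seed(values, reference, reads, max_mismatches):
    """Reduce step: extend each exact seed hit to a full-length comparison."""
    ref_positions = [v[1] for v in values if v[0] == "ref"]
    read_seeds = [(v[1], v[2]) for v in values if v[0] == "read"]
    for read_id, off in read_seeds:
        read = reads[read_id]
        for pos in ref_positions:
            start = pos - off  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            d = count_mismatches(read, window)
            if d <= max_mismatches:
                yield read_id, start, d


def align_reads(reference, reads, max_mismatches):
    """Return sorted (read_id, start, mismatches) for every acceptable alignment."""
    shortest = min(len(r) for r in reads.values())
    seed_len = shortest // (max_mismatches + 1)
    if seed_len < 1:
        raise ValueError("reads are too short for this many mismatches")
    pairs = list(map_reference(reference, seed_len))
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read, seed_len))
    hits = set()  # a read can be found through more than one seed
    for values in shuffle(pairs).values():
        hits.update(reduce_seed(values, reference, reads, max_mismatches))
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    reads = {"r1": "ACGTTTGA", "r2": "ACGTTAGA"}
    print(align_reads(reference, reads, max_mismatches=1))
    # [('r1', 4, 0), ('r2', 4, 1)]
```

Because each seed is handled independently in the reduce step, the work splits naturally across many machines, which is what makes this kind of search a good fit for a Hadoop cluster.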

You can read more about CloudBurst in the research paper. This paper includes benchmarks of CloudBurst on EC2 along with performance and scaling information.

Crossbow



Crossbow was built to do “Whole Genome Resequencing in the Clouds.” It combines Bowtie for ultra-fast short-read alignment and SOAPsnp for sequence assembly and high-quality SNP calling. The Crossbow home page claims that it can sequence an entire genome in an afternoon on EC2, for less than $250. Crossbow is so new that the papers and the code distribution are still a little ways off. There’s a lot of good information in the project’s poster.

Michael Schatz (the principal author of CloudBurst and a co-author of Crossbow) wrote a really interesting note on Hadoop for Computational Biology. He states that “CloudBurst is just the beginning of the story, not the end” and endorses the Map/Reduce model for processing 100+ GB datasets. I will echo Mike’s conclusion to wrap up this somewhat long post:

In short, there is no shortage of opportunities for utilizing MapReduce/Hadoop for computational biology, so if your users are skeptical now, I just ask that they are patient for a little bit longer and reserve judgment on MapReduce/Hadoop until we can publish a few more results.

I really learned a lot while putting this post together and I hope that you will learn something by reading it. If you are using EC2 in a bioinformatics context, I’d love to hear from you. Leave a comment or send me some mail.

— Jeff;



  1. Hi,

    It is a very interesting post. I would like to see more bio tools deployed to cloud platforms, including commercial ones such as EC2 and Azure, as well as others used mostly for research.

    I am also interested in the limitations of those cloud platforms in supporting parallel bio applications compared to HPC.

    If you have any comments or could point me to a few articles or papers, I would very much appreciate it.

