Discussion of Megahit

Megahit was easy to install and it ran very quickly on large datasets.

We thought it seemed like a fine approach for a low-complexity dataset. For my data, though, Megahit assembled only 12% of the reads from one of my samples, and only 3% for the coassembly, using the default settings. Perhaps a better strategy for a high-complexity dataset would be to normalize k-mers first (using, for example, diginorm or stacks) and then run Megahit with the meta-large preset, or even to switch to an assembler with more options.
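To make the normalization idea concrete, here is a minimal sketch of digital normalization: a read is kept only while the median count of its k-mers, among the reads kept so far, is below a cutoff. This is a toy illustration, not the diginorm implementation. The function names, the exact in-memory counting, and the default values of k and cutoff are assumptions chosen for readability, and it ignores reverse complements and paired reads.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_reads(reads, k=20, cutoff=20):
    """Toy digital normalization (hypothetical helper, not a real tool's API).

    A read is kept only if the median count of its k-mers, counted over
    the reads kept so far, is below `cutoff`.
    """
    counts = Counter()  # exact k-mer counts; real tools use compact approximate counters
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read is too short to contain a single k-mer
        if median(counts[x] for x in ks) < cutoff:
            kept.append(read)
            counts.update(ks)  # only kept reads contribute to the counts
    return kept


# Ten copies of a high-coverage read plus one rare read.
reads = ["ACGTTGCAAT"] * 10 + ["GGGCCCTTTA"]
kept = normalize_reads(reads, k=4, cutoff=3)
```

In this example the redundant copies of the abundant read are discarded once its k-mers reach the cutoff, while the rare read is retained. The reduced read set is what would then be passed to the assembler.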

We also discussed other assemblers, and decided that it might be best to pick your assembler based on the dataset in question.



Next Friday, June 24, we’ll discuss this paper:

Li et al.: MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 2015.

Megahit is an assembler for metagenomics data. It was developed to work on large, complex datasets. It's available from GitHub, and it doesn't do any pre-processing for you.

Kraken: taxonomic sequence classification system

We will discuss this paper on Friday, April 22nd from noon-1pm:

Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.

Obtain software from here.

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Previous attempts by other bioinformatics software to accomplish this task have often used sequence alignment or machine learning techniques that were quite slow, leading to the development of less sensitive but much faster abundance estimation programs. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm.
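The classification idea can be sketched in a few lines: look up each k-mer of a read by exact match in a database that maps k-mers to taxa, then pick the taxon whose root-to-taxon path collects the most k-mer hits. The sketch below is a simplified illustration, not Kraken's code. The taxonomy, the k-mer table, the function names, and k=4 are all made-up assumptions for the example, and ties between equally good paths are not handled.

```python
from collections import Counter

# Hypothetical toy taxonomy: each taxon maps to its parent (root has no parent).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "E. coli": "Bacteria",
    "Salmonella": "Bacteria",
}

# Hypothetical k-mer database: each k-mer maps to the lowest common ancestor
# of the genomes that contain it.
KMER_TO_TAXON = {
    "ACGT": "E. coli",
    "CGTA": "E. coli",
    "GTAC": "Bacteria",
    "TTTT": "Salmonella",
}


def path_to_root(taxon):
    """Return the list of taxa from `taxon` up to the root, inclusive."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def classify(read, k=4):
    """Label a read with the hit taxon whose root-to-taxon path has the most k-mer hits."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxon = KMER_TO_TAXON.get(read[i:i + k])  # exact k-mer lookup
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None  # no k-mer matched: the read stays unclassified

    def path_score(taxon):
        # Sum the hits of the taxon and all of its ancestors.
        return sum(hits[t] for t in path_to_root(taxon))

    return max(hits, key=path_score)


label = classify("ACGTAC")        # k-mers hit E. coli twice and Bacteria once
unlabeled = classify("GGGGGG")    # no k-mer is in the database
```

Here the first read is labeled "E. coli" because the hit to its ancestor Bacteria also counts toward that path, and the second read is left unclassified.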

Have any installation/running questions? Ask here.