anvi’o, the binning and visualization pipeline, is supposed to be THE COOLEST.  Here’s their methods paper.

We found that it is ridiculously easy to install on a mac, which is a big help, but it requires approx 40 GB per sample, so it probably won’t run in a timely manner on a laptop. Profiling is the most memory and time intensive part.

To use anvi’o you need to have contigs and mapping before you try to bin into genomes. The tutorials (and other documentation) provided on their website are pretty good, even walking you through the pre-anvio steps.

One really helpful feature is that it plays nice with other software, allowing you to include taxonomic and functional annotations from the method of your choice to your visualization. Anvi’o includes CONCOCT in its pipeline, but you can import bins you made with other software to compare and visualize.

And, bonus, they now have scripts to perform CPR (Candidate Phyla Radiation) searching!!

Some BioBakery

The bioBakery tools are many and varied; we focused on taxonomic (MetaPhlAn) and functional (HUMAnN) annotation of metagenomes.

PICRUSt uses 16S marker gene data to predict metagenomes and thereby functional profiles. It discards unidentified OTUs from, for example, QIIME, so the longer the 16S sequences you use to initially generate taxonomic IDs, the better. Documentation and readability of the output could be improved.

MetaPhlAn matches reference genomes and sequences to classify based on similarity and calculates abundances. MetaPhlAn does have the capability for generating a custom database against which to run reference genomes. However, we found Kraken to be a better use of time as it does the same thing and runs faster.

HUMAnN generates a functional abundance table and assesses the completeness of pathways. HUMAnN pulls the organisms that MetaPhlAn identifies and runs them. It can run without MetaPhlAn data if one runs nonstratified input. Abundances are normalized by gene length and sequencing depth.
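To make the normalization idea concrete, here is a minimal sketch of the arithmetic in Python. It is not HUMAnN's own code: the function names, gene names, and numbers are made up for illustration, and it assumes that "normalized by gene length" means reads per kilobase and "normalized by depth" means rescaling to a total of one million.

```python
def reads_per_kilobase(read_counts, gene_lengths_bp):
    """Divide each gene's read count by its length in kilobases."""
    return {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1000.0)
        for gene in read_counts
    }


def copies_per_million(rpk):
    """Rescale length-normalized values so they sum to one million.

    This removes the effect of sequencing depth, so samples sequenced
    to different depths become comparable.
    """
    total = sum(rpk.values())
    if total == 0:
        return {gene: 0.0 for gene in rpk}
    return {gene: value / total * 1_000_000 for gene, value in rpk.items()}


# Hypothetical example data: read counts and gene lengths in base pairs.
counts = {"geneA": 200, "geneB": 200, "geneC": 50}
lengths = {"geneA": 2000, "geneB": 1000, "geneC": 500}

rpk = reads_per_kilobase(counts, lengths)
cpm = copies_per_million(rpk)
```

Note that geneA and geneB have the same raw count, but geneB ends up with twice the normalized abundance because it is half as long.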

MetaPalette and Bracken

This week, we mostly discussed MetaPalette. Though it can be tricky to install correctly, and I for one have had problems with memory limits, I like it because it pulls genomes from NCBI databases including Bacteria, Archaea, Viruses and Eukaryotes, which is rare in an annotation software workflow.  I also like the fact that it relies on kmers of two sizes (30 and 50) and assigns based on the lowest common ancestor.  A suggestion we posited was to run the program in a virtual server environment.

We also briefly discussed Bracken, which adjusts results from Kraken using genome size and Bayesian statistics. The product is a table that includes the original associated numbers, the adjusted reads, and the final percentages.  Matt described an experiment investigating viral reads in horse cells.  A custom database worked well here, classifying about 60% of the reads of interest.


We had a good (half) conversation about workflows today.

The first thing we started talking about is getting the sample that you want.  It seems like it’s more difficult than it should be to isolate the gunk/biomass of interest, especially if you’re working in a host system.  There’s probably always going to be host contamination, but it’s a waste to sequence all of that.  So, we decided that the best approach is to prepare your sample to get the best yield rather than trying to sort out the sequences you want later.

Library preparation should maybe be done in a separate room/biosafety cabinet.  And clean your pipettes. Perhaps a good model to follow is the procedures used by those who work with ancient human DNA.

Also, make sure to sequence your kit! And use negative controls.  If you get a result from a negative control, should you eliminate taxa?  One approach we discussed was to use multiple water blanks and take a median of the blanks, then compare samples to blanks.  If the blanks have a higher median abundance than a sample, throw out the sample.
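The blank comparison could be sketched like this in Python. This is only one reading of what we discussed: it assumes "abundance" is a single number per sample (say, total reads for a taxon of interest), and the function name and all values are hypothetical.

```python
from statistics import median


def samples_to_discard(sample_abundance, blank_abundances):
    """Return the names of samples whose abundance falls below the blank median.

    sample_abundance: dict of sample name -> abundance
    blank_abundances: list of abundances measured in the water blanks
    """
    blank_median = median(blank_abundances)
    return sorted(
        name for name, value in sample_abundance.items() if value < blank_median
    )


# Hypothetical numbers: three water blanks and three real samples.
blanks = [120, 300, 150]
samples = {"S1": 5000, "S2": 140, "S3": 900}

discard = samples_to_discard(samples, blanks)
```

Using the median rather than the mean keeps one unusually dirty blank from dragging the threshold up.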

Technical replicates are a good idea, but how do you deal with those?  Try comparing the coefficient of variation between biological and technical replicates and samples.
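For anyone who wants to try the replicate comparison, here is a small Python sketch. The replicate values are invented, and it assumes you have one abundance measurement per replicate for whatever feature you care about.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return stdev(values) / m


# Hypothetical abundances of one taxon across replicates.
technical = [100, 102, 98]    # same extract, sequenced three times
biological = [100, 150, 60]   # three separate individuals

cv_technical = coefficient_of_variation(technical)
cv_biological = coefficient_of_variation(biological)
```

If the technical CV comes out close to the biological CV, the measurement noise is about as big as the signal you are trying to detect.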

If you can, fit your whole experiment on one run.  We swapped horror stories of different runs separating when principal component analysis was done 😦

Once you have your sequences, prepare them for downstream analysis by trimming adapters, filtering to a quality you’re comfortable with, and, possibly, merging paired ends.  If you have 16S data, merge first, then QC. With metagenomics, you can merge the high quality reads (after QC).  PEAR and FLASH are two read-joiners we’ve used and liked.
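As a rough picture of what "filtering to a quality you’re comfortable with" means, here is a toy Python sketch of 3' quality trimming. It is not how any particular tool does it: it assumes Phred+33 quality encoding, and the thresholds and function names are made up for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Remove bases from the 3' end until the last base meets min_quality."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def passes_length_filter(sequence, min_length=5):
    """Keep a trimmed read only if enough sequence is left."""
    return len(sequence) >= min_length


# Hypothetical read: 'I' decodes to Q40 and '#' decodes to Q2.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
keep = passes_length_filter(trimmed_seq)
```

Real trimmers usually use sliding windows or running sums rather than a base-by-base check, so treat this as the idea only.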

As far as quality control goes, Matt here at the genome center has a set of tools you could use if you have a known insert size.  Guilluame uses a custom script to trim adapters and remove low quality reads.  Trimmomatic and the FASTX Toolkit do this too.  They’re probably all going to do the same thing, and the differences will be in the run time.

So, now that you have reads you’re comfortable with, the first thing most everyone wants to do appears to be taxonomic assignment.  If you’re merging ends, make sure to use merged reads plus forward reads OR a tool that takes pairing information into account so you’re not double-counting the same read.  Some tools we talked about: Kraken/Bracken, MetaPalette, DiScRIBinATE, MEGAN.

That’s as far as our conversation got in an hour.  We’ll definitely pick up from here in the future!



Discussion of Megahit

Megahit was easy to install and it ran very quickly on large datasets.

We thought it seems like a fine approach for a low-complexity dataset. For my data, though, Megahit assembled 12% of the reads from one of my samples, and only 3% of the coassembly using the default settings. Perhaps a better strategy for a high-complexity dataset would be to normalize k-mers using, for example, diginorm or stacks before running megahit meta-large or even an assembler with more options.
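For those who haven’t seen it, the idea behind digital normalization can be sketched in a few lines of Python. This is a toy version and not the real diginorm implementation: it uses an exact in-memory counter instead of a memory-efficient sketch, ignores reverse complements, and the k and cutoff values are chosen only so the example is small.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only if its median k-mer count so far is below the cutoff.

    Reads from regions that are already well covered get discarded,
    which flattens coverage and shrinks the dataset before assembly.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Hypothetical reads: one sequence seen five times, another seen once.
reads = ["ACGTTGCAAG"] * 5 + ["TTGGCCAATT"]
kept = digital_normalization(reads, k=4, cutoff=2)
```

Here only two of the five identical reads survive, while the rare read is kept, which is the point: high-coverage sequence gets thinned and low-coverage sequence is left alone.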

We also discussed other assemblers, and decided that it might be best to pick your assembler based on the dataset in question.


Next Friday, June 24, we’ll discuss this paper:

Li et al: MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. 2015 Bioinformatics.

Megahit is an assembler for metagenomics data. It was developed to work on large, complex datasets.  It’s available from GitHub, and doesn’t do any pre-processing for you.

Kraken: taxonomic sequence classification system

We will discuss this paper on Friday, April 22nd from noon-1pm:

Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 2014, 15:R46.

Obtain software from here.

Kraken is a system for assigning taxonomic labels to short DNA sequences, usually obtained through metagenomic studies. Previous attempts by other bioinformatics software to accomplish this task have often used sequence alignment or machine learning techniques that were quite slow, leading to the development of less sensitive but much faster abundance estimation programs. Kraken aims to achieve high sensitivity and high speed by utilizing exact alignments of k-mers and a novel classification algorithm.
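To give a feel for k-mer based classification before we read the paper, here is a toy Python sketch. It is a simplification for discussion and not Kraken’s implementation: the taxonomy, the k-mer database, and k=4 are all made up, and ties between equally good paths are broken arbitrarily here. The assumption is that each k-mer in the database maps to the lowest common ancestor (LCA) of the genomes containing it, and that a read is assigned to the taxon whose path to the root collects the most k-mer hits.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root has no parent).
PARENT = {
    "root": None,
    "Bacteria": "root",
    "E. coli": "Bacteria",
    "S. enterica": "Bacteria",
}

# Hypothetical database: k-mer -> LCA of every genome that contains it.
KMER_TO_TAXON = {
    "ACGT": "E. coli",
    "CGTA": "E. coli",
    "GTAC": "Bacteria",     # shared by both species, so it maps to their LCA
    "TTTT": "S. enterica",
}


def path_to_root(taxon):
    """List a taxon and all of its ancestors up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def classify(read, k=4):
    """Assign a read to the taxon whose root path has the most k-mer hits."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxon = KMER_TO_TAXON.get(read[i:i + k])  # exact match lookup only
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None  # unclassified: no k-mer is in the database
    scores = {
        taxon: sum(hits[ancestor] for ancestor in path_to_root(taxon))
        for taxon in hits
    }
    return max(scores, key=scores.get)


label = classify("ACGTAC")
```

Because lookups are exact matches in a table, no alignment is needed, which is where the speed comes from.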

Have any installation/running questions? Ask here.