guppy

Authors: Erick Matsen and Aaron Gallagher
Title: guppy
Version: 1.1
License: GPL v3
Date: September 2011

guppy is a tool for working with, visualizing, and comparing collections of phylogenetic placements, such as those made by pplacer or RAxML’s EPA. “GUPPY” is an acronym: Grand Unified Phylogenetic Placement Yanalyzer.

Introduction

To use the statistical comparison features of guppy, it’s a good idea to have a basic understanding of what the Kantorovich-Rubinstein distance (KR, a.k.a. earth-mover’s distance) is doing, and how the edge PCA and squash clustering algorithms work. There is a gentle introduction in Matsen and Evans, and a fuller treatment in Evans and Matsen.

Here’s a table to demonstrate the relation of guppy concepts to ones which may be more familiar to the reader:

familiar concept       guppy concept
weighted UniFrac       Kantorovich-Rubinstein distance (kr)
UPGMA using UniFrac    “squash” clustering (squash)
PCA using UniFrac      Edge PCA (pca)
OTU alpha diversity    Abundance-weighted phylogenetic diversity measures (fpd)

This table does not show equivalences, but rather a list of hints for further exploration. For example, the KR distance is really a generalization of weighted UniFrac, and edge PCA is a type of PCA that takes advantage of the special structure of phylogenetic placement data. The heat tree (kr_heat) and barycenter (bary) have no analogs in previous types of phylogenetic microbial analysis.

Usage

guppy does lots of different things: it makes heat trees, makes matrices of Kantorovich-Rubinstein distances, does edge PCA, etc. Each of these has its own options. Rather than make a suite of little programs, we have opted for an interface analogous to git and svn: namely, a collection of different actions wrapped up into a single interface.

There are two ways to access the commands: through the command line interface and through batch mode. A list of these commands is below, and can always be found using guppy --cmds.

Command line interface

The general way to invoke guppy is guppy COMMAND [options] placefile[s] where COMMAND is one of the guppy commands. For example:

guppy heat --gray-black coastal.jplace DCM.jplace

These commands are listed in more detail below, and the list can always be displayed using guppy --cmds.

guppy can also be invoked as guppy --quiet COMMAND [...], which prevents the specified command from writing to stdout unless explicitly requested.

Batch mode

It’s easy to run lots of commands at once with batch mode. Unlike running the equivalent set of commands on the command line, batch mode loads each placefile only once per batch file run: guppy loads a given file the first time it is used in a command and reuses it afterwards.

Batch files are files with one guppy command per line, specified exactly as would be written in a shell, except without the leading guppy. Arguments can be enclosed in double quotes to preserve whitespace, and double quotes within quoted strings are quoted by doubling (e.g. "spam ""and"" eggs"). Globbing (e.g. *.jplace) is not allowed. Comments are also allowed in batch files; everything on a line after a # is ignored.

An example batch file:

# Whole-line comment.
pca -o pca -c some.refpkg src/a.jplace src/b.jplace
squash -c some.refpkg -o squash_out src/a.jplace src/b.jplace
classify -c some.refpkg some.jplace  # inline comment

If this were saved as example.batch, it would be invoked from guppy as:

guppy --batch example.batch
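The quoting and comment rules described above can be illustrated with a small Python sketch. This is not guppy's own parser; the function name split_batch_line is hypothetical, and the sketch assumes that a # inside a double-quoted argument is kept as literal text, which the description above does not specify.

```python
def split_batch_line(line):
    """Split one batch-file line into arguments (illustrative sketch only)."""
    args = []          # completed arguments
    current = []       # characters of the argument being built
    in_quotes = False  # True while inside a double-quoted string
    started = False    # True once the current argument has begun
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    current.append('"')  # doubled quote becomes a literal quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                current.append(ch)       # assumption: '#' is literal inside quotes
        elif ch == '#':
            break                        # the rest of the line is a comment
        elif ch == '"':
            in_quotes = True
            started = True
        elif ch.isspace():
            if started:                  # whitespace ends the current argument
                args.append(''.join(current))
                current, started = [], False
        else:
            current.append(ch)
            started = True
        i += 1
    if started:
        args.append(''.join(current))
    return args
```

Under these assumptions, the line classify -c some.refpkg some.jplace  # inline comment splits into four arguments, and a whole-line comment yields an empty argument list.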

Advanced features

Batch files also have two unique features: virtual placefiles, and parameter substitution.

Within a batch file, if a placefile is saved to or loaded from a path beginning with a @, the data will be stored in memory instead of written to disk. For example:

merge -o @merged.jplace src/a.jplace src/b.jplace
info @merged.jplace

will effectively run guppy info on the placefile resulting from merging the two arguments, but without ever writing that merged file to disk.

Additionally, parameters can be passed in from the command line to the batch file. On the command line, parameters are specified as additional arguments to guppy in key=value format. In the batch file, substitutions are done from identifiers in {key} format. For example, when a batch file containing

info {k1} {k2}

is invoked with

guppy --batch example.batch k1=1.jplace k2=2.jplace

the effect will be the same as running

guppy info 1.jplace 2.jplace

Braces can also be quoted by doubling (e.g. {{foo}} will become {foo}).
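As a rough illustration of the substitution rules above, here is a minimal Python sketch. It is not guppy's implementation; the function names parse_params and substitute are hypothetical, and the sketch assumes that a missing key is an error.

```python
import re

def parse_params(argv):
    """Turn ['k1=1.jplace', ...] into {'k1': '1.jplace', ...}."""
    return dict(arg.split("=", 1) for arg in argv)

def substitute(line, params):
    """Replace {key} with params[key]; {{ and }} become literal braces."""
    pattern = re.compile(r"\{\{|\}\}|\{([^{}]+)\}")

    def repl(match):
        text = match.group(0)
        if text == "{{":
            return "{"
        if text == "}}":
            return "}"
        # Assumption: an unknown key raises KeyError rather than passing through.
        return params[match.group(1)]

    return pattern.sub(repl, line)

params = parse_params(["k1=1.jplace", "k2=2.jplace"])
command = substitute("info {k1} {k2}", params)
```

Here command holds the batch line with both placefile names filled in, and substitute("{{foo}}", {}) leaves a single pair of braces around foo.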

Pqueries versus placements

One of the key features of pplacer is that it is able to express uncertainty concerning placement in a reasonable manner. Specifically, if there is uncertainty in a given read’s optimal placement, it returns a collection of placements that are weighted according to likelihood weight or posterior probability. This feature requires a bit of additional terminology. We will use “pquery” to denote a “placed query (sequence)”, i.e. the collection of weighted placements for a sequence. “Placement,” on the other hand, signifies a single location on a tree along with its optimal pendant branch length.
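The distinction can be pictured as a pair of data structures. The following Python sketch is only a mental model, not the layout used by pplacer or guppy; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Placement:
    """A single location on the reference tree (hypothetical field names)."""
    edge: int              # identifier of the edge the read attaches to
    weight: float          # likelihood weight or posterior probability
    distal_length: float   # position of the attachment point along the edge
    pendant_length: float  # optimal pendant branch length

@dataclass
class Pquery:
    """A placed query sequence: all weighted placements for one sequence."""
    name: str
    placements: List[Placement]

    def best(self):
        """Return the placement carrying the highest weight."""
        return max(self.placements, key=lambda p: p.weight)

# One read whose location is uncertain between two edges.
pq = Pquery("read1", [Placement(3, 0.7, 0.01, 0.05),
                      Placement(8, 0.3, 0.02, 0.04)])
```

In this picture a pquery is the unit that corresponds to a sequence, while each placement is one candidate location for it.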

About multiplicities

pplacer and guppy support “multiplicities,” i.e. multiple reads being treated as one. For example, if some reads are identical, they can be treated as a group. Doing so makes guppy operations much faster.

By default, multiplicities are used “as is” for guppy calculations: a single placement with multiplicity four is the same as four reads placed individually. However, if one would like to decrease the impact of multiplicities on downstream analysis (e.g. if PCR artifacts are suspected), one can use the --transform option of mft to choose a transform for the multiplicities before use. Doing so will convert your placements into labeled masses.
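A small arithmetic sketch in Python shows why a transform dampens the influence of multiplicities. The function total_mass is hypothetical, and the logarithmic transform is chosen purely for illustration; the transforms actually offered by mft are described in its own documentation.

```python
import math

def total_mass(multiplicities, transform=lambda m: m):
    """Sum the mass contributed by pqueries with the given multiplicities."""
    return sum(transform(m) for m in multiplicities)

grouped = [4]              # one pquery standing for four identical reads
individual = [1, 1, 1, 1]  # the same four reads placed one by one

# With no transform, both representations carry the same mass.
same_by_default = total_mass(grouped) == total_mass(individual)

# Illustrative damping transform (an assumption, not necessarily mft's):
# the grouped pquery now weighs less than four separate reads would.
damped_grouped = total_mass(grouped, math.log1p)
damped_individual = total_mass(individual, math.log1p)
```

With the identity transform the two totals agree, which is the default behavior described above; with the damping transform the grouped pquery contributes less than four times the mass of a single read.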

‘Split’ placefiles

It’s often convenient to split up placefiles in all sorts of ways, but it’s nice not to have to duplicate the information in the placefiles multiple times. For that reason, we have introduced “split” placefiles. The syntax for these is my.jplace:my.csv, where the CSV file maps from query sequence names to the split placefile name. For example, suppose my.csv contained:

"read1","a"
"read2","b"

and my.jplace has placements for read1 and read2. Then my.jplace:my.csv would act like a list of two placefiles named a and b, containing read1 and read2, respectively. Not every placement name in the placefile needs to appear in the CSV file, so you can use this for subsetting.
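The grouping described above can be sketched in Python as follows. This only mimics the mapping logic on sequence names; it does not read .jplace files, and the function name split_names is hypothetical.

```python
import csv
import io
from collections import defaultdict

def split_names(csv_text, placed_names):
    """Group placed sequence names by the split placefile named in the CSV."""
    # Each CSV row maps a query sequence name to a split placefile name.
    mapping = {row[0]: row[1]
               for row in csv.reader(io.StringIO(csv_text)) if row}
    groups = defaultdict(list)
    for name in placed_names:
        if name in mapping:  # names absent from the CSV are dropped (subsetting)
            groups[mapping[name]].append(name)
    return dict(groups)

csv_text = '"read1","a"\n"read2","b"\n'
groups = split_names(csv_text, ["read1", "read2", "read3"])
```

In this example read3 is placed but not listed in the CSV, so it appears in neither a nor b.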

BIOM files

Any command which expects a placefile (in both guppy and rppr) can also be given a BIOM file. As BIOM files (unlike placefiles) do not contain a tree, a tree must be passed at the same time, with a colon delimiting the two paths. For example, to run guppy info on my.biom with a tree my.tre, the invocation would be guppy info my.tre:my.biom. Trees must be in the Newick format.

As the BIOM format describes counts at leaves, the placements generated by parsing a BIOM file will all have zero distal branch length and zero pendant branch length.

BIOM files will also automatically be split by sample. If a BIOM file has columns Sample1 and Sample2, that file will be interpreted as two placefiles, named respectively Sample1 and Sample2.
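To make the per-sample splitting concrete, here is a Python sketch that works on a simplified, in-memory table in the style of the sparse BIOM 1.0 JSON layout (rows, columns, and data triples of row index, column index, count). It is not how guppy reads BIOM files; the function name and the leaf names are hypothetical.

```python
def split_biom_by_sample(table):
    """Return {sample name: {leaf name: count}} from a sparse BIOM-style table."""
    rows = [r["id"] for r in table["rows"]]     # observations, i.e. tree leaves
    cols = [c["id"] for c in table["columns"]]  # samples
    samples = {name: {} for name in cols}
    for row_idx, col_idx, count in table["data"]:
        if count:                               # skip explicit zero counts
            samples[cols[col_idx]][rows[row_idx]] = count
    return samples

# The tree and the table arrive together as one colon-delimited argument.
tree_path, _, biom_path = "my.tre:my.biom".partition(":")

# Hypothetical table with two leaves and two samples.
table = {
    "rows": [{"id": "leafA"}, {"id": "leafB"}],
    "columns": [{"id": "Sample1"}, {"id": "Sample2"}],
    "data": [[0, 0, 5], [1, 0, 2], [1, 1, 7]],
}
per_sample = split_biom_by_sample(table)
```

Each entry of per_sample plays the role of one placefile, with every count sitting at a leaf and therefore having zero distal and zero pendant branch length.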

phyloXML viewing notes

guppy makes fattened and annotated trees to visualize the results of various analyses. We have chosen to use phyloXML as the format for these trees, as it has width and color tags for edges; if you see .xml files coming out of a guppy analysis, that’s what they are for. We like looking at these trees using the tree viewer archaeopteryx. If you open archaeopteryx with the default settings, you will see nothing interesting: simply the reference tree. You need to click on the “Colorize Branches” and “Use branch-width” check boxes. If you don’t see those check boxes, then use this configuration file (if you are going to copy and paste it, click on “raw” first).

List of subcommands

The following table provides links to more in-depth documentation for each guppy subcommand:

Command    Description
adcl       calculates ADCL for each pquery in a placefile
bary       draws the barycenter of a placement collection on the reference tree
check      checks placefiles for common problems
classify   outputs classification information in SQLite format
compress   compresses a placefile’s pqueries
demulti    splits apart placements with multiplicity, undoing a round procedure
distmat    prints out a pairwise distance matrix between the edges
edpl       calculates the EDPL uncertainty values for a collection of pqueries
epca       performs edge principal components
error      finds the error between two placefiles
fat        makes trees with edges fattened in proportion to the number of reads
filter     filters one or more placefiles by placement name
fpd        calculates various alpha diversity metrics of placefiles
heat       maps an arbitrary vector of the correct length to the tree
indep_c    calculates the independent contrasts of pqueries
info       writes the number of leaves of the reference tree and the number of pqueries
islands    finds the mass islands of one or more pqueries
kr         calculates the Kantorovich-Rubinstein distance and corresponding p-values
kr_heat    makes a heat tree
lpca       performs length principal components
mcl        clusters pqueries using Markov clustering via MCL
merge      merges placefiles together
mft        Multi-Filter and Transform placefiles
ograph     finds the overlap graph of one or more pqueries
placemat   prints out a pairwise distance matrix between placements
pmlpca     performs poor-man’s length principal components
rarefact   calculates phylogenetic rarefaction curves
rarefy     performs rarefaction on collections of placements
redup      restores duplicates to deduped placefiles
round      clusters the placements by rounding branch lengths
sing       makes one tree for each query sequence, showing uncertainty
splitify   writes out differences of masses for the splits of the tree
squash     performs squash clustering
to_csv     turns a placefile into a csv file
to_json    converts old-style .place files to .jplace placement files
to_rdp     converts a reference package to a format RDP wants
tog        makes a tree with each of the reads represented as a pendant edge
trim       trims placefiles down to only containing an informative subset of the mass
unifrac    calculates unifrac on two or more placefiles