guppy ¶

Authors:	Erick Matsen and Aaron Gallagher
Title:	guppy
Version:	1.1
License:	GPL v3
Date:	September 2011

guppy is a tool for working with, visualizing, and comparing collections of phylogenetic placements, such as those made by pplacer or RAxML’s EPA. “GUPPY” is an acronym: Grand Unified Phylogenetic Placement Yanalyzer.

Introduction ¶

To use the statistical comparison features of guppy, it’s a good idea to have a basic understanding of what the Kantorovich-Rubinstein (KR, a.k.a. earth mover’s) distance is doing, and how the edge PCA and squash clustering algorithms work. There is a gentle introduction in Matsen and Evans, and a fuller treatment in Evans and Matsen.

Here’s a table to demonstrate the relation of guppy concepts to ones which may be more familiar to the reader:

familiar concept	guppy concept
weighted UniFrac	Kantorovich-Rubinstein distance (kr)
UPGMA using UniFrac	“squash” clustering (squash)
PCA using UniFrac	Edge PCA (pca)
OTU alpha diversity	Abundance-weighted phylogenetic diversity measures (fpd)

This table does not show equivalences, but rather a list of hints for further exploration. For example, the KR distance is really a generalization of weighted UniFrac, and edge PCA is a type of PCA that takes advantage of the special structure of phylogenetic placement data. The heat tree (kr_heat) and barycenter (bary) have no analogs in previous types of phylogenetic microbial analysis.
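To give a feel for what the KR distance measures, here is a minimal sketch for the simplest case (exponent p = 1), written for illustration only and not taken from guppy. It assumes that each unit of mass sits exactly at a tree node rather than partway along an edge, and the names kr_distance, parent, and length are hypothetical. The distance is then the sum over edges of the edge length times the absolute difference in the mass the two samples have below that edge.

```python
def kr_distance(parent, length, p_mass, q_mass):
    """KR (earth mover's) distance with p = 1 between two mass distributions.

    parent: dict node -> parent node (None for the root)
    length: dict node -> length of the edge from node up to its parent
    p_mass, q_mass: dict node -> non-negative mass sitting at that node
    Assumption of this sketch: mass sits at nodes, never inside an edge.
    """
    def normalize(mass):
        total = sum(mass.values())
        return {node: m / total for node, m in mass.items()}

    p, q = normalize(p_mass), normalize(q_mass)
    # Net mass difference (P minus Q) at each node.
    diff = {node: p.get(node, 0.0) - q.get(node, 0.0) for node in parent}

    def depth(node):
        d = 0
        while parent[node] is not None:
            node = parent[node]
            d += 1
        return d

    total = 0.0
    # Visit deepest nodes first so each subtree is summed before its parent.
    for node in sorted(parent, key=depth, reverse=True):
        up = parent[node]
        if up is None:
            continue
        total += length[node] * abs(diff[node])  # cost of moving the excess across this edge
        diff[up] += diff[node]                   # pass the excess up the tree
    return total


# A root R with two leaves: A at distance 1.0 and B at distance 2.0.
parent = {"R": None, "A": "R", "B": "R"}
length = {"A": 1.0, "B": 2.0}
same = kr_distance(parent, length, {"A": 1}, {"A": 5})
moved = kr_distance(parent, length, {"A": 1}, {"B": 1})
```

In this toy tree, moving all of the mass from A to B costs the full path length between the two leaves, while two samples with the same relative distribution are at distance zero regardless of their total counts.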

Usage ¶

guppy does many different things: it makes heat trees, computes matrices of Kantorovich-Rubinstein distances, performs edge PCA, and so on. Each of these has its own options. Rather than make a suite of little programs, we have opted for an interface analogous to git and svn: a collection of different actions wrapped up into a single interface.

There are two ways to access the commands: through the command line interface, and through batch mode. A list of the commands is below, and can always be found using guppy --cmds.

Command line interface ¶

The general way to invoke guppy is guppy COMMAND [options] placefile[s] where COMMAND is one of the guppy commands. For example:

guppy heat --gray-black coastal.jplace DCM.jplace

These commands are described in more detail below, and can always be listed using guppy --cmds.

guppy can also be invoked as guppy --quiet COMMAND [...], which prevents the specified command from writing to stdout unless explicitly requested.

Batch mode ¶

It’s easy to run lots of commands at once with batch mode. Unlike running the equivalent set of commands one by one on the command line, a batch run loads each placefile only once: guppy loads a given file the first time it is used in a command and reuses it afterwards.

Batch files are files with one guppy command per line, specified exactly as would be written in a shell, except without the leading guppy. Arguments can be enclosed in double quotes to preserve whitespace, and double quotes within quoted strings are quoted by doubling (e.g. "spam ""and"" eggs"). Globbing (e.g. *.jplace) is not allowed. Comments are also allowed in batch files; everything on a line after a # is ignored.

An example batch file:

# Whole-line comment.
pca -o pca -c some.refpkg src/a.jplace src/b.jplace
squash -c some.refpkg -o squash_out src/a.jplace src/b.jplace
classify -c some.refpkg some.jplace  # inline comment

If this were saved as example.batch, it would be run with:

guppy --batch example.batch
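The quoting and comment rules for batch files can be sketched as a small tokenizer. This is an illustration rather than guppy's actual parser: the function name split_batch_line is hypothetical, and the sketch assumes that whitespace separates arguments and that a # inside a quoted string does not start a comment.

```python
def split_batch_line(line):
    """Split one batch-file line into a list of arguments.

    Assumptions of this sketch (not taken from guppy's source):
    whitespace separates arguments, and a # inside double quotes
    is kept as ordinary text instead of starting a comment.
    """
    args, current = [], []
    in_quotes, have_token = False, False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if line[i + 1:i + 2] == '"':
                    current.append('"')  # doubled quote -> one literal quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes, have_token = True, True
        elif ch == "#":
            break                        # rest of the line is a comment
        elif ch.isspace():
            if have_token:
                args.append("".join(current))
                current, have_token = [], False
        else:
            current.append(ch)
            have_token = True
        i += 1
    if have_token:
        args.append("".join(current))
    return args


inline = split_batch_line("classify -c some.refpkg some.jplace  # inline comment")
quoted = split_batch_line('info "spam ""and"" eggs"')
```

Here inline holds the four arguments of the classify line with the comment dropped, and quoted holds info followed by the single argument spam "and" eggs.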

Advanced features ¶

Batch files also have two unique features: virtual placefiles, and parameter substitution.

Within a batch file, if a placefile is saved to or loaded from a path beginning with a @, the data will be stored in memory instead of written to disk. For example:

merge -o @merged.jplace src/a.jplace src/b.jplace
info @merged.jplace

will effectively run guppy info on the placefile resulting from merging the two arguments, but without ever writing that merged file to disk.
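One way to picture load-once behavior together with virtual placefiles is a small in-memory store. The sketch below is purely conceptual: the class PlacefileStore and the load_from_disk and write_to_disk callables are hypothetical and do not describe guppy's internals.

```python
class PlacefileStore:
    """Conceptual model of placefile handling during one batch run."""

    def __init__(self, load_from_disk):
        self._load = load_from_disk  # hypothetical callable: path -> placefile data
        self._cache = {}

    def get(self, path):
        # Each real file is read at most once; later uses hit the cache.
        if path not in self._cache:
            if path.startswith("@"):
                raise KeyError(f"virtual placefile {path} has not been written yet")
            self._cache[path] = self._load(path)
        return self._cache[path]

    def put(self, path, data, write_to_disk):
        # Paths beginning with @ stay in memory; anything else goes to disk.
        if path.startswith("@"):
            self._cache[path] = data
        else:
            write_to_disk(path, data)


loads, written = [], []
store = PlacefileStore(lambda path: loads.append(path) or {"path": path})
store.get("src/a.jplace")
store.get("src/a.jplace")  # second use does not reload the file
store.put("@merged.jplace", {"merged": True}, lambda path, data: written.append(path))
merged = store.get("@merged.jplace")
```

After these calls, loads records a single read of src/a.jplace and written stays empty, because the merged result never touched the disk.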

Additionally, parameters can be passed in from the command line to the batch file. On the command line, parameters are specified as additional arguments to guppy in key=value format. In the batch file, substitutions are done from identifiers in {key} format. For example, when a batch file containing

info {k1} {k2}

is invoked with

guppy --batch example.batch k1=1.jplace k2=2.jplace

the effect will be the same as running

guppy info 1.jplace 2.jplace

Braces can also be quoted by doubling (e.g. {{foo}} will become {foo}).
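The substitution rules can be sketched in a few lines. This is an illustration under assumptions, not guppy's code: parse_params and substitute are hypothetical names, keys are assumed to consist of letters, digits, and underscores, and an unknown key is assumed to be an error.

```python
import re

# Matches an escaped brace pair first, then a {key} placeholder.
_TOKEN = re.compile(r"\{\{|\}\}|\{(\w+)\}")


def parse_params(args):
    """Turn command-line arguments of the form key=value into a dict."""
    return dict(arg.split("=", 1) for arg in args)


def substitute(line, params):
    """Replace {key} with params[key]; {{ and }} become literal braces."""
    def replace(match):
        text = match.group(0)
        if text == "{{":
            return "{"
        if text == "}}":
            return "}"
        return params[match.group(1)]  # raises KeyError for an unknown key

    return _TOKEN.sub(replace, line)


params = parse_params(["k1=1.jplace", "k2=2.jplace"])
expanded = substitute("info {k1} {k2}", params)
escaped = substitute("{{foo}}", params)
```

With the parameters above, expanded is the line info 1.jplace 2.jplace, and escaped is the literal text {foo}.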

Pqueries versus placements ¶

One of the key features of pplacer is that it is able to express uncertainty concerning placement in a reasonable manner. Specifically, if there is uncertainty in a given read’s optimal placement, it returns a collection of placements that are weighted according to likelihood weight or posterior probability. This feature requires a bit of additional wording. We will use “pquery” to denote a “placed query (sequence)”, i.e. the collection of weighted placements for a sequence. “Placement,” on the other hand, signifies a single location on a tree along with its optimal pendant branch length.
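The distinction can be made concrete with a small data-structure sketch. The classes below are illustrative only; the field names are borrowed from the jplace format, and the numeric values are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Placement:
    """A single location on the reference tree."""
    edge_num: int             # edge of the reference tree
    like_weight_ratio: float  # weight of this placement within its pquery
    distal_length: float      # position along the edge
    pendant_length: float     # optimal pendant branch length


@dataclass
class Pquery:
    """A placed query: the weighted collection of placements for a sequence."""
    names: List[str]
    placements: List[Placement]

    def best(self):
        # The placement carrying the most weight.
        return max(self.placements, key=lambda p: p.like_weight_ratio)


# Illustrative values: one read with two candidate locations.
read = Pquery(
    names=["read1"],
    placements=[
        Placement(edge_num=12, like_weight_ratio=0.7, distal_length=0.01, pendant_length=0.05),
        Placement(edge_num=15, like_weight_ratio=0.3, distal_length=0.02, pendant_length=0.06),
    ],
)
total_weight = sum(p.like_weight_ratio for p in read.placements)
```

A pquery with no placement uncertainty would simply hold one placement with weight 1.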

About multiplicities ¶

pplacer and guppy support “multiplicities,” i.e. multiple reads being treated as one. For example, if some reads are identical, they can be treated as a group. Doing so makes guppy operations much faster.

By default, multiplicities are used “as is” for guppy calculations: a single placement with multiplicity four is the same as four reads placed individually. However, if one would like to decrease the impact of multiplicities on downstream analysis (e.g. if PCR artifacts are suspected), one can use the --transform option of mft to choose a transform for the multiplicities before use. Doing so will convert your placements into labeled masses.
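The effect of transforming multiplicities can be illustrated with a toy calculation. The transform names below (identity, unit, log1p) are hypothetical labels chosen for this sketch and are not necessarily the values that mft accepts for --transform; consult the mft documentation for the real choices.

```python
import math

# Hypothetical transforms for illustration; not the actual mft options.
TRANSFORMS = {
    "identity": lambda m: float(m),  # use multiplicities as is
    "unit": lambda m: 1.0,           # every pquery counts once
    "log1p": lambda m: math.log1p(m),  # damp large multiplicities
}


def transformed_masses(multiplicities, transform):
    """Return each pquery's share of the total mass after a transform."""
    weights = {name: TRANSFORMS[transform](m) for name, m in multiplicities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


multiplicities = {"a": 4, "b": 1}
as_is = transformed_masses(multiplicities, "identity")
flat = transformed_masses(multiplicities, "unit")
damped = transformed_masses(multiplicities, "log1p")
```

With multiplicities used as is, pquery a carries four fifths of the mass; a damping transform moves its share closer to one half, which is the value it takes when multiplicities are ignored altogether.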

‘Split’ placefiles ¶

It’s often convenient to split up placefiles in all sorts of ways, but it’s nice not to have to duplicate the information in the placefiles multiple times. For that reason, we have introduced “split” placefiles. The syntax for these is my.jplace:my.csv, where the CSV file maps from query sequence names to the split placefile name. For example, suppose my.csv contains:

"read1","a"
"read2","b"

and my.jplace has placements for read1 and read2. Then my.jplace:my.csv would act like a list of two placefiles named a and b, containing read1 and read2, respectively. Not every placement name in the placefile needs to appear in the CSV file, so you can use this for subsetting.
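The mapping logic can be sketched as follows. This is an illustration of the idea, not guppy's implementation; split_pquery_names is a hypothetical name, and the sketch works on query names only, not on full placement data.

```python
import csv
import io
from collections import defaultdict


def split_pquery_names(names, csv_text):
    """Group query names by the split placefile the CSV assigns them to.

    Names missing from the CSV are dropped, which is what makes
    subsetting possible.
    """
    mapping = {row[0]: row[1] for row in csv.reader(io.StringIO(csv_text)) if row}
    groups = defaultdict(list)
    for name in names:
        if name in mapping:
            groups[mapping[name]].append(name)
    return dict(groups)


my_csv = '"read1","a"\n"read2","b"\n'
groups = split_pquery_names(["read1", "read2", "read3"], my_csv)
```

Here read3 is absent from the CSV, so it appears in neither a nor b.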

BIOM files ¶

Any command which expects a placefile (in both guppy and rppr) can also be given a BIOM file. As BIOM files (unlike placefiles) do not contain a tree, a tree must be passed at the same time, with a colon delimiting the two paths. For example, to run guppy info on my.biom with a tree my.tre, the invocation would be guppy info my.tre:my.biom. Trees must be in the Newick format.

As the BIOM format describes counts at leaves, the placements generated by parsing a BIOM file will all have zero distal branch length and zero pendant branch length.

BIOM files will also automatically be split by sample. If a BIOM file has columns Sample1 and Sample2, that file will be interpreted as two placefiles, named respectively Sample1 and Sample2.
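The per-sample split can be sketched on a BIOM-style table. This is an illustration only, assuming a sparse BIOM 1.0 JSON layout in which rows are observations (here, leaves of the tree), columns are samples, and data holds [row, column, count] triples; split_biom_by_sample and the leaf names are hypothetical.

```python
def split_biom_by_sample(biom):
    """Return {sample name: {leaf name: count}} from a sparse BIOM-style dict.

    Assumes the BIOM 1.0 sparse layout: "rows" and "columns" are lists of
    objects with an "id", and "data" is a list of [row, column, count].
    """
    rows = [row["id"] for row in biom["rows"]]
    cols = [col["id"] for col in biom["columns"]]
    samples = {col: {} for col in cols}
    for r, c, count in biom["data"]:
        if count:  # skip explicit zeros
            samples[cols[c]][rows[r]] = count
    return samples


# Toy table with made-up leaf names and counts.
table = {
    "matrix_type": "sparse",
    "rows": [{"id": "leafA"}, {"id": "leafB"}],
    "columns": [{"id": "Sample1"}, {"id": "Sample2"}],
    "data": [[0, 0, 3], [1, 0, 1], [1, 1, 7]],
}
by_sample = split_biom_by_sample(table)
```

Each entry of by_sample plays the role of one placefile named after its sample, with every count attached to a leaf.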

phyloXML viewing notes ¶

guppy makes fattened and annotated trees to visualize the results of various analyses. We have chosen to use phyloXML as the format for these trees, as it has width and color tags for edges; if you see .xml files coming out of a guppy analysis that’s what they are for. We like looking at these trees using the tree viewer archaeopteryx. If you open archaeopteryx with the default settings, you will see nothing interesting: simply the reference tree. You need to click on the “Colorize Branches” and “Use branch-width” check boxes. If you don’t see those check boxes, then use this configuration file (if you are going to copy and paste it click on “raw” first).

List of subcommands ¶

The following table provides links to more in-depth documentation for each guppy subcommand:

Command	Description
adcl	calculates ADCL for each pquery in a placefile
bary	draws the barycenter of a placement collection on the reference tree
check	checks placefiles for common problems
classify	outputs classification information in SQLite format
compress	compresses a placefile’s pqueries
demulti	splits apart placements with multiplicity, undoing a round procedure
distmat	prints out a pairwise distance matrix between the edges
edpl	calculates the EDPL uncertainty values for a collection of pqueries
epca	performs edge principal components
error	finds the error between two placefiles
fat	makes trees with edges fattened in proportion to the number of reads
filter	filters one or more placefiles by placement name
fpd	calculates various alpha diversity metrics of placefiles
heat	maps an arbitrary vector of the correct length to the tree
indep_c	calculates the independent contrasts of pqueries
info	writes the number of leaves of the reference tree and the number of pqueries
islands	finds the mass islands of one or more pqueries
kr	calculates the Kantorovich-Rubinstein distance and corresponding p-values
kr_heat	makes a heat tree
lpca	performs length principal components
mcl	clusters pqueries using Markov clustering via MCL
merge	merges placefiles together
mft	Multi-Filter and Transform placefiles
ograph	finds the overlap graph of one or more pqueries
placemat	prints out a pairwise distance matrix between placements
pmlpca	performs poor-man’s length principal components
rarefact	calculates phylogenetic rarefaction curves
rarefy	performs rarefaction on collections of placements
redup	restores duplicates to deduped placefiles
round	clusters the placements by rounding branch lengths
sing	makes one tree for each query sequence, showing uncertainty
splitify	writes out differences of masses for the splits of the tree
squash	performs squash clustering
to_csv	turns a placefile into a csv file
to_json	converts old-style .place files to .jplace placement files
to_rdp	converts a reference package to a format RDP wants
tog	makes a tree with each of the reads represented as a pendant edge
trim	trims placefiles down to only containing an informative subset of the mass
unifrac	calculates UniFrac on two or more placefiles
