Authors: | Erick Matsen and Aaron Gallagher |
---|---|
Title: | guppy |
Version: | 1.1 |
License: | GPL v3 |
Date: | September 2011 |
guppy is a tool for working with, visualizing, and comparing collections of phylogenetic placements, such as those made by pplacer or RAxML’s EPA. “GUPPY” is an acronym: Grand Unified Phylogenetic Placement Yanalyzer.
To use the statistical comparison features of guppy, it’s a good idea to have a basic understanding of what the Kantorovich-Rubinstein distance (KR, a.k.a. earth-mover’s distance) is doing, and how the edge PCA and squash clustering algorithms work. There is a gentle introduction in Matsen and Evans, and a fuller treatment in Evans and Matsen.
Here’s a table to demonstrate the relation of guppy concepts to ones which may be more familiar to the reader:
familiar concept | guppy concept |
---|---|
weighted UniFrac | Kantorovich-Rubinstein distance (kr) |
UPGMA using UniFrac | “squash” clustering (squash) |
PCA using UniFrac | Edge PCA (pca) |
OTU alpha diversity | Abundance-weighted phylogenetic diversity measures (fpd) |
This table does not show equivalences, but rather a list of hints for further exploration. For example, the KR distance is really a generalization of weighted UniFrac, and edge PCA is a type of PCA that takes advantage of the special structure of phylogenetic placement data. The heat tree (kr_heat) and barycenter (bary) have no analogs in previous types of phylogenetic microbial analysis.
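To make the KR idea concrete, here is a minimal Python sketch (not guppy's implementation). It assumes a simplified setting: all mass sits on nodes of a rooted tree, each sample's mass sums to one, and the exponent is 1. In that case the distance is the sum, over edges, of branch length times the absolute difference in mass below the edge. Real placements sit along edges and carry pendant branch lengths, which this sketch ignores; the tree encoding and the names `mass_below` and `kr_distance` are made up for illustration.

```python
# Sketch only: KR distance (exponent 1) between two mass distributions that
# sit on the nodes of a rooted tree. The tree encoding is hypothetical.

def mass_below(parent, masses):
    """Total mass at or below each non-root node, i.e. below each edge."""
    below = {node: 0.0 for node in parent}
    for start, mass in masses.items():
        node = start
        while node in parent:          # climb until we reach the root
            below[node] += mass
            node = parent[node][0]
    return below


def kr_distance(parent, p_masses, q_masses):
    """Sum over edges of branch length times the difference in mass below."""
    p_below = mass_below(parent, p_masses)
    q_below = mass_below(parent, q_masses)
    return sum(
        length * abs(p_below[node] - q_below[node])
        for node, (_, length) in parent.items()
    )


# node -> (parent node, length of the edge leading up to that parent)
tree = {"A": ("root", 1.0), "X": ("root", 0.5), "B": ("X", 1.0), "C": ("X", 2.0)}
sample_p = {"A": 1.0}               # all mass on leaf A
sample_q = {"B": 0.5, "C": 0.5}     # mass split between leaves B and C

print(kr_distance(tree, sample_p, sample_q))  # 3.0
```

The result matches the earth-mover picture: moving half of the mass from A to B costs 0.5 × 2.5, and moving the other half from A to C costs 0.5 × 3.5.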
guppy does lots of different things: it makes heat trees, makes matrices of Kantorovich-Rubinstein distances, does edge PCA, etc. Each of these has its own options. Rather than make a suite of little programs, we have opted for an interface analogous to git and svn: namely, a collection of different actions wrapped up into a single interface.
There are two ways to access the commands: through the command line interface, and through the batch mode. A list of these commands is below, and can always be found using guppy --cmds.
The general way to invoke guppy is guppy COMMAND [options] placefile[s] where COMMAND is one of the guppy commands. For example:
guppy heat --gray-black coastal.jplace DCM.jplace
These commands are listed in more detail below, and can always be found using guppy --cmds.
guppy can also be invoked as guppy --quiet COMMAND [...], which prevents the specified command from writing to stdout unless explicitly requested.
It’s easy to run lots of commands at once with batch mode. Unlike running the equivalent set of commands on the command line, a batch file run loads each placefile only once: guppy loads a given file the first time it is used in a command.
Batch files are files with one guppy command per line, specified exactly as would be written in a shell, except without the leading guppy. Arguments can be enclosed in double quotes to preserve whitespace, and double quotes within quoted strings are quoted by doubling (e.g. "spam ""and"" eggs"). Globbing (e.g. *.jplace) is not allowed. Comments are also allowed in batch files; everything on a line after a # is ignored.
An example batch file:
# Whole-line comment.
pca -o pca -c some.refpkg src/a.jplace src/b.jplace
squash -c some.refpkg -o squash_out src/a.jplace src/b.jplace
classify -c some.refpkg some.jplace # inline comment
If this were saved as example.batch, it would be invoked from guppy as:
guppy --batch example.batch
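The quoting and comment rules for batch files can be sketched in a few lines of Python. This is an illustration of the rules as described above, not guppy's actual parser; in particular, treating a # inside double quotes as a literal character is our assumption, and `split_batch_line` is a made-up name.

```python
def split_batch_line(line):
    """Split one batch-file line into arguments (sketch, not guppy's parser)."""
    args, current, in_quotes, started = [], [], False, False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == '"':
                if line[i + 1 : i + 2] == '"':   # doubled quote -> literal "
                    current.append('"')
                    i += 1
                else:
                    in_quotes = False            # closing quote
            else:
                current.append(ch)
        elif ch == '"':
            in_quotes, started = True, True
        elif ch == "#":                          # comment runs to end of line
            break
        elif ch.isspace():
            if started:                          # whitespace ends an argument
                args.append("".join(current))
                current, started = [], False
        else:
            current.append(ch)
            started = True
        i += 1
    if started:
        args.append("".join(current))
    return args


print(split_batch_line("classify -c some.refpkg some.jplace # inline comment"))
print(split_batch_line('info "spam ""and"" eggs"'))
```

The first call yields the four arguments before the comment; the second yields `info` followed by the single argument `spam "and" eggs`.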
Batch files also have two unique features: virtual placefiles, and parameter substitution.
Within a batch file, if a placefile is saved to or loaded from a path beginning with a @, the data will be stored in memory instead of written to disk. For example:
merge -o @merged.jplace src/a.jplace src/b.jplace
info @merged.jplace
will effectively run guppy info on the placefile resulting from merging the two arguments, but without ever writing that merged file to disk.
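One way to picture both behaviors (load each file once, keep @ paths in memory) is a small cache in front of the disk. The Python sketch below is purely illustrative: `PlacefileStore` and the fake read and write helpers are hypothetical and say nothing about how guppy is implemented internally.

```python
class PlacefileStore:
    """Caches loaded placefiles and keeps '@' paths purely in memory."""

    def __init__(self, read_from_disk, write_to_disk):
        self._read = read_from_disk
        self._write = write_to_disk
        self._cache = {}

    def load(self, path):
        if path not in self._cache:
            if path.startswith("@"):
                raise KeyError(f"virtual placefile {path} was never saved")
            self._cache[path] = self._read(path)   # first use: hit the disk
        return self._cache[path]

    def save(self, path, data):
        self._cache[path] = data
        if not path.startswith("@"):
            self._write(path, data)                # only real paths reach disk


# A dictionary stands in for the file system; placefiles are lists of names.
disk = {"src/a.jplace": ["read1"], "src/b.jplace": ["read2"]}
reads = []

def fake_read(path):
    reads.append(path)
    return disk[path]

def fake_write(path, data):
    disk[path] = data

store = PlacefileStore(fake_read, fake_write)
merged = store.load("src/a.jplace") + store.load("src/b.jplace")
store.save("@merged.jplace", merged)     # stays in memory
store.load("src/a.jplace")               # served from the cache, no new read
print(reads, "@merged.jplace" in disk)
```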
Additionally, parameters can be passed in from the command line to the batch file. On the command line, parameters are specified as additional arguments to guppy in key=value format. In the batch file, substitutions are done from identifiers in {key} format. For example, when a batch file containing
info {k1} {k2}
is invoked with
guppy --batch example.batch k1=1.jplace k2=2.jplace
the effect will be the same as running
guppy info 1.jplace 2.jplace
Braces can also be quoted by doubling (e.g. {{foo}} will become {foo}).
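The substitution rules can be sketched as follows. This is a minimal Python illustration, not guppy's code: the function names are made up, and restricting keys to letters, digits, and underscores is our assumption.

```python
import re

# Matches a doubled brace or a {key} placeholder, scanning left to right.
_TOKEN = re.compile(r"\{\{|\}\}|\{(\w+)\}")


def parse_params(argv):
    """Turn ['k1=1.jplace', ...] into {'k1': '1.jplace', ...}."""
    return dict(arg.split("=", 1) for arg in argv)


def substitute(line, params):
    """Replace {key} with its value and collapse doubled braces."""
    def replace(match):
        text = match.group(0)
        if text == "{{":
            return "{"
        if text == "}}":
            return "}"
        return params[match.group(1)]   # KeyError if the key was not supplied
    return _TOKEN.sub(replace, line)


params = parse_params(["k1=1.jplace", "k2=2.jplace"])
print(substitute("info {k1} {k2}", params))   # info 1.jplace 2.jplace
print(substitute("{{foo}}", params))          # {foo}
```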
One of the key features of pplacer is that it is able to express uncertainty concerning placement in a reasonable manner. Specifically, if there is uncertainty in a given read’s optimal placement, it returns a collection of placements that are weighted according to likelihood weight or posterior probability. This feature requires a bit of additional wording. We will use “pquery” to denote a “placed query (sequence)”, i.e. the collection of weighted placements for a sequence. “Placement,” on the other hand, signifies a single location on a tree along with its optimal pendant branch length.
pplacer and guppy support “multiplicities,” i.e. multiple reads being treated as one. For example, if some reads are identical, they can be treated as a group. Doing so makes guppy operations much faster.
By default, multiplicities are used “as is” for guppy calculations: a single placement with multiplicity four is the same as four reads placed individually. However, if one would like to decrease the impact of multiplicities on downstream analysis (e.g. if PCR artifacts are suspected), one can use the --transform option of mft to choose a transform for the multiplicities before use. Doing so will convert your placements into labeled masses.
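As a rough picture of what a transform does, the Python sketch below turns multiplicities into masses under three illustrative transforms. The transforms shown (identity, inverse hyperbolic sine, and a constant) and the normalization to total mass one are assumptions made for this example; the mft documentation lists the choices guppy actually offers.

```python
import math


def to_masses(multiplicities, transform):
    """Apply a transform to each multiplicity, then normalize to total mass 1."""
    weights = {name: transform(m) for name, m in multiplicities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


multiplicities = {"read1": 4, "read2": 1}

as_is = to_masses(multiplicities, float)            # multiplicities used as is
damped = to_masses(multiplicities, math.asinh)      # large counts are damped
flat = to_masses(multiplicities, lambda m: 1.0)     # every pquery counts once

print(as_is)   # {'read1': 0.8, 'read2': 0.2}
print(flat)    # {'read1': 0.5, 'read2': 0.5}
```

With the damping transform, read1 ends up with a share of the mass between the other two cases.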
It’s often convenient to split up placefiles in all sorts of ways, but it’s nice not to have to duplicate the information in the placefiles multiple times. For that reason, we have introduced “split” placefiles. The syntax for these is my.jplace:my.csv, where the CSV file maps from query sequence names to the split placefile name. For example, say my.csv contained:
"read1","a"
"read2","b"
and my.jplace has placements for read1 and read2. Then my.jplace:my.csv would act like a list of two placefiles named a and b, containing read1 and read2, respectively. Not every placement name in the placefile needs to appear in the CSV file, so you can use this for subsetting.
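The grouping that a split placefile describes can be sketched in Python as below. This only illustrates the mapping from names to split placefiles; the function name is hypothetical and the placefile is reduced to a plain list of query names.

```python
import csv
import io
from collections import defaultdict


def split_names(csv_text, placed_names):
    """Group placed query names by the split placefile named in the CSV."""
    mapping = {row[0]: row[1] for row in csv.reader(io.StringIO(csv_text)) if row}
    groups = defaultdict(list)
    for name in placed_names:
        if name in mapping:          # names absent from the CSV are left out
            groups[mapping[name]].append(name)
    return dict(groups)


csv_text = '"read1","a"\n"read2","b"\n'
print(split_names(csv_text, ["read1", "read2", "read3"]))
# {'a': ['read1'], 'b': ['read2']}
```

Here read3 is placed but not listed in the CSV, so it appears in neither split, which is the subsetting behavior described above.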
Any command which expects a placefile (in both guppy and rppr) can also be given a BIOM file. As BIOM files (unlike placefiles) do not contain a tree, one must be passed at the same time, with a colon delimiting the two paths. For example, to run guppy info on my.biom with a tree my.tre, the invocation would be guppy info my.tre:my.biom. Trees must be in the Newick format.
As the BIOM format describes counts at leaves, the placements generated by parsing a BIOM file will all have zero distal branch length and zero pendant branch length.
BIOM files will also automatically be split by sample. If a BIOM file has columns Sample1 and Sample2, that file will be interpreted as two placefiles, named respectively Sample1 and Sample2.
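To illustrate the per-sample split, here is a Python sketch that walks a minimal BIOM-style table with a sparse data matrix (rows are observations, columns are samples, and each data entry is a row index, column index, and count). The output field names, the use of the count as a multiplicity, and the idea that row ids match leaf names of the tree are assumptions made for this example, not a description of guppy's internals.

```python
def split_biom_by_sample(biom):
    """Turn a sparse BIOM-style table into one list of placements per sample."""
    leaves = [row["id"] for row in biom["rows"]]
    sample_ids = [col["id"] for col in biom["columns"]]
    samples = {sample_id: [] for sample_id in sample_ids}
    for row_idx, col_idx, count in biom["data"]:
        samples[sample_ids[col_idx]].append({
            "leaf": leaves[row_idx],
            "multiplicity": count,
            "distal_length": 0.0,     # counts sit exactly at the leaf
            "pendant_length": 0.0,
        })
    return samples


table = {
    "rows": [{"id": "leafA"}, {"id": "leafB"}],
    "columns": [{"id": "Sample1"}, {"id": "Sample2"}],
    "matrix_type": "sparse",
    "data": [[0, 0, 5], [1, 0, 2], [1, 1, 7]],
}

by_sample = split_biom_by_sample(table)
print(sorted(by_sample))                               # ['Sample1', 'Sample2']
print([p["leaf"] for p in by_sample["Sample1"]])       # ['leafA', 'leafB']
```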
guppy makes fattened and annotated trees to visualize the results of various analyses. We have chosen to use phyloXML as the format for these trees, as it has width and color tags for edges; if you see .xml files coming out of a guppy analysis that’s what they are for. We like looking at these trees using the tree viewer archaeopteryx. If you open archaeopteryx with the default settings, you will see nothing interesting: simply the reference tree. You need to click on the “Colorize Branches” and “Use branch-width” check boxes. If you don’t see those check boxes, then use this configuration file (if you are going to copy and paste it click on “raw” first).
The following table provides links to more in-depth documentation for each guppy subcommand:
Command | Description |
---|---|
adcl | calculates ADCL for each pquery in a placefile |
bary | draws the barycenter of a placement collection on the reference tree |
check | checks placefiles for common problems |
classify | outputs classification information in SQLite format |
compress | compresses a placefile’s pqueries |
demulti | splits apart placements with multiplicity, undoing a round procedure |
distmat | prints out a pairwise distance matrix between the edges |
edpl | calculates the EDPL uncertainty values for a collection of pqueries |
epca | performs edge principal components |
error | finds the error between two placefiles |
fat | makes trees with edges fattened in proportion to the number of reads |
filter | filters one or more placefiles by placement name |
fpd | calculates various alpha diversity metrics of placefiles |
heat | maps an arbitrary vector of the correct length to the tree |
indep_c | calculates the independent contrasts of pqueries |
info | writes the number of leaves of the reference tree and the number of pqueries |
islands | finds the mass islands of one or more pqueries |
kr | calculates the Kantorovich-Rubinstein distance and corresponding p-values |
kr_heat | makes a heat tree |
lpca | performs length principal components |
mcl | clusters pqueries using Markov clustering via MCL |
merge | merges placefiles together |
mft | Multi-Filter and Transform placefiles |
ograph | finds the overlap graph of one or more pqueries |
placemat | prints out a pairwise distance matrix between placements |
pmlpca | performs poor-man’s length principal components |
rarefact | calculates phylogenetic rarefaction curves |
rarefy | performs rarefaction on collections of placements |
redup | restores duplicates to deduped placefiles |
round | clusters the placements by rounding branch lengths |
sing | makes one tree for each query sequence, showing uncertainty |
splitify | writes out differences of masses for the splits of the tree |
squash | performs squash clustering |
to_csv | turns a placefile into a csv file |
to_json | converts old-style .place files to .jplace placement files |
to_rdp | converts a reference package to a format RDP wants |
tog | makes a tree with each of the reads represented as a pendant edge |
trim | trims placefiles down to only containing an informative subset of the mass |
unifrac | calculates unifrac on two or more placefiles |