classify

classify outputs classification information in SQLite format.

usage: classify [options] placefile[s]

Options

-c Reference package path. Required.
--sqlite Specify the database file to use. Required.
--seed Set the random seed, an integer > 0. Default is 1.
--classifier Which classifier to use, out of ‘pplacer’, ‘nbc’, ‘hybrid2’, ‘hybrid5’ or ‘rdp’. default: pplacer
--cutoff The default value for the likelihood_cutoff param. Default: 0.90
--bayes-cutoff The default value for the bayes_cutoff param. Default: 1.00
--multiclass-min
 The default value for the multiclass_min param. Default: 0.20
--bootstrap-cutoff
 The default value for the bootstrap_cutoff param. Default: 0.80
--bootstrap-extension-cutoff
 The default value for the bootstrap_extension_cutoff param. Default: 0.40
--pp Use posterior probability for our criteria in the pplacer classifier.
--tax-median-identity-from
 Calculate the median identity for each sequence per-tax_id from the specified alignment.
--mrca-class Classify against a placefile that was generated with MRCA classification
--nbc-sequences
 The query sequences to use for the NBC classifier. Can be specified multiple times for multiple inputs.
--word-length The length of the words used for NBC classification. default: 8
--nbc-rank The desired most specific rank for NBC classification. ‘all’ puts each sequence in the classifier at every rank of its lineage. default: genus
--n-boot The number of times to bootstrap a sequence with the NBC classifier. 0 = no bootstrap. default: 100
-j The number of processes to spawn to do NBC classification. default: 2
--no-pre-mask Don’t pre-mask the sequences for NBC classification.
--nbc-counts Read/write NBC k-mer counts to the given file. File cannot be NFS mounted.
--nbc-as-rdp Do NBC classification like RDP: find the lineage of the full-sequence classification, then bootstrap to find support for it.
--rdp-results The RDP results file for use with the RDP classifier. Can be specified multiple times for multiple inputs.
--blast-results
 The BLAST results file for use with the BLAST classifier. Can be specified multiple times for multiple inputs.
--no-random-tie-break
 Take the first NBC hit even if there are others that are equally good.

Details

This subcommand outputs the classifications made by pplacer in a database.

The current implementation of pplacer classifies with a simple, root-dependent algorithm. We are currently working on improved algorithms. For best results, first taxonomically root the tree in your reference package (so that the root of the tree corresponds to the “deepest” evolutionary event according to the taxonomy). This can be done automatically with the taxit reroot command in taxtastic. (Note that as of 27 May 2011, this requires the dev version of biopython available on github.)

Classification is done simply by containment. Say clade A is the smallest clade of the reference tree that contains a given placement. The most specific classification for that read will be the lowest common ancestor (LCA) of the taxonomic classifications of the leaves of A. If the desired classification is more specific than that, then the desired and the actual classification will differ. For example, if we try to classify at the species level and the clade LCA is a genus, then we will get a genus name. If there is uncertainty in read placement, then there is uncertainty in classification.
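The containment rule can be illustrated with a small sketch. It is not pplacer's implementation: the lineages, leaf names, and function names below are hypothetical, and each lineage is assumed to be a root-first list of taxon names. The LCA of a clade's leaves is then the longest prefix shared by all of their lineages.

```python
from typing import Dict, List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-first prefix shared by all lineages."""
    if not lineages:
        return []
    shared = list(lineages[0])
    for lineage in lineages[1:]:
        n = 0
        for a, b in zip(shared, lineage):
            if a != b:
                break
            n += 1
        shared = shared[:n]
    return shared


def classify_by_containment(
    clade_leaves: Sequence[str], leaf_lineages: Dict[str, List[str]]
) -> List[str]:
    """Classify a placement from the leaves of the smallest clade containing it."""
    return lowest_common_ancestor([leaf_lineages[leaf] for leaf in clade_leaves])


# Hypothetical toy taxonomy: phylum, genus, species (root first).
leaf_lineages = {
    "leaf_a": ["phylum_P", "genus_G1", "species_a"],
    "leaf_b": ["phylum_P", "genus_G1", "species_b"],
    "leaf_c": ["phylum_P", "genus_G2", "species_c"],
}

# A placement inside the clade {leaf_a, leaf_b} resolves only to the genus,
# even if a species-level classification was requested.
print(classify_by_containment(["leaf_a", "leaf_b"], leaf_lineages))
```

If the smallest containing clade also includes leaf_c, the shared prefix shrinks to the phylum, which is why rooting and placement uncertainty both affect the result.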

Classifiers

guppy classify has a variety of classifiers.

pplacer

Takes placefiles as input and classifies using the method described above.

When refining classifications for the multiclass table, first, all of the classifications with a likelihood of less than the value of --multiclass-min are discarded. Next, if the value for --bayes-cutoff is nonzero, ranks below the most specific rank with bayes factor evidence greater than or equal to that cutoff are discarded. Otherwise, the likelihoods of remaining classifications are summed per-rank, and ranks with a likelihood sum of less than --cutoff are discarded.
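The refinement steps above can be sketched as follows. This is an illustration under stated assumptions, not guppy's code: the Candidate type, the rank_order list (least to most specific), and the evidence mapping are hypothetical, and the choice to return nothing when no rank reaches the bayes cutoff is an assumption the text does not settle. The default values mirror the documented defaults for --multiclass-min, --bayes-cutoff, and --cutoff.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    rank: str
    tax_id: str
    likelihood: float


def refine_pplacer(
    candidates: Sequence[Candidate],
    rank_order: Sequence[str],            # least specific first (assumed layout)
    evidence: Optional[Dict[str, float]] = None,  # rank -> bayes factor evidence
    multiclass_min: float = 0.20,
    bayes_cutoff: float = 1.00,
    cutoff: float = 0.90,
) -> List[Candidate]:
    # Step 1: drop classifications whose likelihood is below multiclass_min.
    kept = [c for c in candidates if c.likelihood >= multiclass_min]

    if bayes_cutoff != 0:
        # Step 2a: find the most specific rank with enough evidence and
        # discard every rank below (more specific than) it.
        evidence = evidence or {}
        supported = [
            i for i, rank in enumerate(rank_order)
            if rank in evidence and evidence[rank] >= bayes_cutoff
        ]
        if not supported:
            return []  # assumption: no supported rank means nothing is kept
        allowed = set(rank_order[: max(supported) + 1])
    else:
        # Step 2b: sum likelihoods per rank and keep ranks reaching the cutoff.
        sums: Dict[str, float] = defaultdict(float)
        for c in kept:
            sums[c.rank] += c.likelihood
        allowed = {rank for rank, total in sums.items() if total >= cutoff}

    return [c for c in kept if c.rank in allowed]


rank_order = ["phylum", "genus", "species"]
candidates = [
    Candidate("phylum", "P", 1.00),
    Candidate("genus", "G1", 0.95),
    Candidate("species", "s1", 0.55),
    Candidate("species", "s2", 0.30),
    Candidate("species", "s3", 0.10),
]

# With the bayes cutoff disabled, the species rank sums to about 0.85 after
# s3 is dropped, which is below 0.90, so only phylum and genus survive.
print(refine_pplacer(candidates, rank_order, bayes_cutoff=0.0))
```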

nbc

Takes sequences via the --nbc-sequences flag and classifies them with a naive bayes classifier.

By default, the input sequences must be an alignment: either the query sequences aligned to the reference sequences, or an alignment that also includes the reference sequences (in the same manner as for pplacer). If the --no-pre-mask flag is specified, the input sequences may be unaligned, and must not contain reference sequences.

When refining classifications for the multiclass table, for each rank, the best classification with a bootstrap above the value of --bootstrap-cutoff is selected. Ranks with no such classifications are discarded.
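As a rough sketch of this per-rank selection, assuming each hit is a hypothetical (rank, tax_id, bootstrap) tuple and reading “above” as strictly greater than the cutoff (an assumption), the refinement might look like this. The default of 0.80 mirrors the documented --bootstrap-cutoff default.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (rank, tax_id, bootstrap); hypothetical layout


def refine_by_bootstrap(
    hits: Iterable[Hit], bootstrap_cutoff: float = 0.80
) -> Dict[str, Tuple[str, float]]:
    """Keep, per rank, the best-supported hit whose bootstrap exceeds the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for rank, tax_id, bootstrap in hits:
        if bootstrap <= bootstrap_cutoff:
            continue  # not above the cutoff, so it cannot be selected
        if rank not in best or bootstrap > best[rank][1]:
            best[rank] = (tax_id, bootstrap)
    # Ranks with no qualifying hit never enter the dict, i.e. are discarded.
    return best


hits = [
    ("genus", "G1", 0.92),
    ("genus", "G2", 0.85),
    ("species", "s1", 0.60),
]
print(refine_by_bootstrap(hits))
```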

rdp

Takes the output of Mothur’s classify.seqs via the --rdp-results flag and inserts the classifications into the database. This should be the .taxonomy file.

Refinement for multiclass is done the same as for nbc.

blast

Takes the output of BLAST via the --blast-results flag and inserts the classifications into the database. The output must be in outfmt 6.
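To check a results file before passing it to --blast-results, a minimal reader for tab-separated outfmt 6 is sketched below. It assumes BLAST was run with plain -outfmt 6 and no custom column specifiers, so that each row has the twelve standard columns; whether guppy requires exactly those columns is not stated here. The function name is hypothetical.

```python
import io
from typing import Dict, Iterator, TextIO, Union

# The twelve default columns of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def read_blast6(handle: TextIO) -> Iterator[Dict[str, Union[str, float]]]:
    """Yield one dict per hit from a default-column outfmt 6 file."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != len(BLAST6_COLUMNS):
            raise ValueError(f"expected 12 tab-separated fields, got {len(fields)}")
        record: Dict[str, Union[str, float]] = dict(zip(BLAST6_COLUMNS, fields))
        for key in ("pident", "evalue", "bitscore"):
            record[key] = float(record[key])
        yield record


# Hypothetical single-hit example.
sample = "read1\tref7\t98.50\t200\t3\t0\t1\t200\t10\t209\t1e-100\t360\n"
for hit in read_blast6(io.StringIO(sample)):
    print(hit["qseqid"], hit["sseqid"], hit["pident"])
```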

Refinement for multiclass is done the same as for nbc.

hybrid

Still in flux.

SQLite

guppy classify writes its output into a sqlite3 database. The argument to the --sqlite flag is the sqlite3 database into which the results should be put. This database must first have been initialized using rppr prep_db.
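The two-step workflow (initialize the database, then classify) could be scripted along these lines. This sketch only builds the command lines; it does not run them. The file names are hypothetical, and the assumption that rppr prep_db takes the same -c and --sqlite flags as guppy classify should be checked against the rppr documentation.

```python
import shlex
from typing import List, Sequence


def prep_db_command(refpkg: str, sqlite_db: str) -> List[str]:
    # Assumption: rppr prep_db accepts -c and --sqlite like guppy classify does.
    return ["rppr", "prep_db", "-c", refpkg, "--sqlite", sqlite_db]


def classify_command(
    refpkg: str,
    sqlite_db: str,
    placefiles: Sequence[str],
    classifier: str = "pplacer",
) -> List[str]:
    # -c, --sqlite and --classifier are the flags listed under Options.
    return [
        "guppy", "classify",
        "-c", refpkg,
        "--sqlite", sqlite_db,
        "--classifier", classifier,
        *placefiles,
    ]


# Hypothetical paths, for illustration only.
commands = [
    prep_db_command("my.refpkg", "results.db"),
    classify_command("my.refpkg", "results.db", ["sample.jplace"]),
]
for cmd in commands:
    print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```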

The following tables are populated by guppy classify:

  • runs – describes each separate invocation of guppy classify; exactly one row will be added for each invocation.
  • placements – describes groups of sequences. Each row will represent one or more sequences and indicate which classifier was used.
  • placement_names – indicates which sequences are in this group of sequences and where each sequence came from.
  • placement_classifications – indicates tax_id and likelihood for the pplacer and hybrid classifiers.
  • placement_evidence – indicates bayes factor evidence for the pplacer and hybrid classifiers.
  • placement_position – indicates placement position for the pplacer and hybrid classifiers.
  • placement_median_identities – indicates sequence median percent identity for the pplacer and hybrid classifiers when run with the --tax-median-identity-from flag.
  • placement_nbc – indicates tax_id and bootstrap value for the nbc, rdp, blast, and hybrid classifiers.
  • multiclass – indicates the best classification and rank of classification from any classifier for a given sequence name and desired rank of classification. There might be multiple classifications for a particular sequence and desired rank, but only when using the pplacer or hybrid classifiers.
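A quick way to confirm that a run populated these tables is to count their rows with Python's sqlite3 module. The sketch below assumes only the table names listed above and makes no assumption about their columns; the function name is hypothetical.

```python
import sqlite3
from typing import Dict

# Table names as listed above.
CLASSIFY_TABLES = (
    "runs",
    "placements",
    "placement_names",
    "placement_classifications",
    "placement_evidence",
    "placement_position",
    "placement_median_identities",
    "placement_nbc",
    "multiclass",
)


def table_row_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Return row counts for whichever of the listed tables exist."""
    existing = {
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    counts: Dict[str, int] = {}
    for table in CLASSIFY_TABLES:
        if table in existing:
            # Table names come from the fixed tuple above, not from user input.
            counts[table] = conn.execute(
                f'SELECT COUNT(*) FROM "{table}"'
            ).fetchone()[0]
    return counts
```

For example, `table_row_counts(sqlite3.connect("results.db"))` (with a hypothetical database path) should show one more row in runs after each invocation of guppy classify.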