:tocdepth: 3

.. _guppy_classify:

========
classify
========

`classify` outputs classification information in SQLite format.

::

  usage: classify [options] placefile[s]

Options
=======

-c                            Reference package path. Required.
--sqlite                      Specify the database file to use. Required.
--seed                        Set the random seed, an integer > 0. Default is 1.
--classifier                  Which classifier to use, out of 'pplacer', 'nbc', 'hybrid2', 'hybrid5' or 'rdp'. default: pplacer
--cutoff                      The default value for the likelihood_cutoff param. Default: 0.90
--bayes-cutoff                The default value for the bayes_cutoff param. Default: 1.00
--multiclass-min              The default value for the multiclass_min param. Default: 0.20
--bootstrap-cutoff            The default value for the bootstrap_cutoff param. Default: 0.80
--bootstrap-extension-cutoff  The default value for the bootstrap_extension_cutoff param. Default: 0.40
--pp                          Use posterior probability for our criteria in the pplacer classifier.
--tax-median-identity-from    Calculate the median identity for each sequence per-tax_id from the specified alignment.
--mrca-class                  Classify against a placefile that was generated with MRCA classification.
--nbc-sequences               The query sequences to use for the NBC classifier. Can be specified multiple times for multiple inputs.
--word-length                 The length of the words used for NBC classification. default: 8
--nbc-rank                    The desired most specific rank for NBC classification. 'all' puts each sequence in the classifier at every rank of its lineage. default: genus
--n-boot                      The number of times to bootstrap a sequence with the NBC classifier. 0 = no bootstrap. default: 100
-j                            The number of processes to spawn to do NBC classification. default: 2
--no-pre-mask                 Don't pre-mask the sequences for NBC classification.
--nbc-counts                  Read/write NBC k-mer counts to the given file. File cannot be NFS mounted.
--nbc-as-rdp                  Do NBC classification like RDP: find the lineage of the full-sequence classification, then bootstrap to find support for it.
--rdp-results                 The RDP results file for use with the RDP classifier. Can be specified multiple times for multiple inputs.
--blast-results               The BLAST results file for use with the BLAST classifier. Can be specified multiple times for multiple inputs.
--no-random-tie-break         Take the first NBC hit even if there are others that are equally good.

Details
=======

This subcommand outputs the classifications made by pplacer in a database.
*The classifications made by the current implementation of pplacer are done with a simple, root-dependent algorithm.
We are currently working on improved algorithms.*
For best results, first taxonomically root the tree in your reference package (so that the root of the tree corresponds to the "deepest" evolutionary event according to the taxonomy).
This can be done automatically with the `taxit reroot` command in taxtastic.
(Note that as of 27 May 2011, this requires the dev version of biopython available on github.)

The classifications are simply done by containment.
Say clade *A* of the reference tree is the smallest clade that contains a given placement.
The most specific classification for that read will be the lowest common ancestor of the taxonomic classifications for the leaves of *A*.
If the desired classification is more specific than that, then we get a disconnect between the desired and the actual classification.
For example, if we try to classify at the species level and the clade LCA is a genus, then we will get a genus name.
If there is uncertainty in read placement, then there is uncertainty in classification.

Classifiers
===========

``guppy classify`` has a variety of classifiers.

.. glossary::

   pplacer
      Takes placefiles as input and classifies using the method described
      above.

      When refining classifications for the ``multiclass`` table, first, all
      of the classifications with a likelihood of less than the value of
      ``--multiclass-min`` are discarded.
      Next, if the value for ``--bayes-cutoff`` is nonzero, ranks below the
      most specific rank with bayes factor evidence greater than or equal to
      that cutoff are discarded. Otherwise, the likelihoods of remaining
      classifications are summed per-rank, and ranks with a likelihood sum of
      less than ``--cutoff`` are discarded.

   nbc
      Takes sequences via the ``--nbc-sequences`` flag and classifies them
      with a naive Bayes classifier. By default, the input sequences must be
      an alignment, either aligned to the reference sequences or including an
      alignment of the reference sequences (in the same manner pplacer does).
      If the ``--no-pre-mask`` flag is specified, the input sequences may be
      unaligned, and must not also contain reference sequences.

      When refining classifications for the ``multiclass`` table, for each
      rank, the best classification with a bootstrap above the value of
      ``--bootstrap-cutoff`` is selected. Ranks with no such classifications
      are discarded.

   rdp
      Takes the output of Mothur's `classify.seqs`_ via the ``--rdp-results``
      flag and inserts the classifications into the database. This should be
      the ``.taxonomy`` file.

      Refinement for ``multiclass`` is done the same as for :term:`nbc`.

   blast
      Takes the output of BLAST_ via the ``--blast-results`` flag and inserts
      the classifications into the database. The output must be in outfmt 6.

      Refinement for ``multiclass`` is done the same as for :term:`nbc`.

   hybrid
      Still in flux.

Sqlite
======

``guppy classify`` writes its output into a sqlite3 database. The argument to
the ``--sqlite`` flag is the sqlite3 database into which the results should be
put. This database must have first been initialized using
:ref:`rppr prep_db `.

The following tables are populated by ``guppy classify``:

* ``runs`` -- describes each separate invocation of ``guppy classify``;
  exactly one row will be added for each invocation.
* ``placements`` -- describes groups of sequences. Each row will represent one
  or more sequences and indicate which classifier was used.
* ``placement_names`` -- indicates which sequences are in this group of
  sequences and where each sequence came from.
* ``placement_classifications`` -- indicates tax_id and likelihood for the
  :term:`pplacer` and :term:`hybrid` classifiers.
* ``placement_evidence`` -- indicates bayes factor evidence for the
  :term:`pplacer` and :term:`hybrid` classifiers.
* ``placement_position`` -- indicates placement position for the
  :term:`pplacer` and :term:`hybrid` classifiers.
* ``placement_median_identities`` -- indicates sequence median percent
  identity for the :term:`pplacer` and :term:`hybrid` classifiers when run
  with the ``--tax-median-identity-from`` flag.
* ``placement_nbc`` -- indicates tax_id and bootstrap value for the
  :term:`nbc`, :term:`rdp`, :term:`blast`, and :term:`hybrid` classifiers.
* ``multiclass`` -- indicates the best classification and rank of
  classification from any classifier for a given sequence name and desired
  rank of classification. There might be multiple classifications for a
  particular sequence and desired rank, but only when using the
  :term:`pplacer` or :term:`hybrid` classifiers.

.. _classify.seqs: http://www.mothur.org/wiki/Classify.seqs
.. _BLAST: http://www.ncbi.nlm.nih.gov/books/NBK1763/
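
The containment rule in the Details section (classify a read as the lowest
common ancestor of the taxonomic classifications of the leaves of clade *A*)
can be illustrated with a small Python sketch. This is not pplacer's
implementation: the function name, the root-first list representation of a
lineage, and the example taxa are assumptions made for illustration only.

.. code-block:: python

   def lowest_common_ancestor(lineages):
       """Return the deepest taxon shared by every root-first lineage."""
       if not lineages:
           raise ValueError("need at least one lineage")
       shared = []
       # Walk all lineages in parallel from the root and stop at the
       # first position where they disagree.
       for taxa in zip(*lineages):
           if all(taxon == taxa[0] for taxon in taxa):
               shared.append(taxa[0])
           else:
               break
       return shared[-1] if shared else None

   # Hypothetical, simplified lineages for the leaves of clade A.
   clade_a_leaves = [
       ["Bacteria", "Firmicutes", "Lactobacillus", "Lactobacillus iners"],
       ["Bacteria", "Firmicutes", "Lactobacillus", "Lactobacillus crispatus"],
   ]
   print(lowest_common_ancestor(clade_a_leaves))

Here the two leaves disagree at the species level, so the most specific
classification available is the genus, which matches the "desired species,
actual genus" disconnect described in the Details section.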
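
The ``multiclass`` refinement described for the :term:`pplacer` classifier can
be sketched in Python as follows. This is a reading of the prose above, not
the actual code: the function name, the tuple layout, the ``rank_order``
argument and the example values are all hypothetical. The sketch also assumes
that "ranks below" means more specific ranks, and that nothing is kept when no
rank reaches the bayes cutoff; the documentation does not specify either
point.

.. code-block:: python

   from collections import defaultdict

   def refine_pplacer(classifications, evidence, rank_order,
                      multiclass_min=0.20, bayes_cutoff=1.00, cutoff=0.90):
       """Filter (rank, tax_id, likelihood) tuples.

       evidence maps rank -> bayes factor evidence; rank_order lists ranks
       from least to most specific.
       """
       # Step 1: drop classifications below the multiclass_min likelihood.
       kept = [c for c in classifications if c[2] >= multiclass_min]

       if bayes_cutoff != 0:
           # Step 2a: find the most specific rank whose evidence reaches
           # the cutoff, and discard every rank more specific than it.
           supported = [r for r in rank_order
                        if evidence.get(r, 0.0) >= bayes_cutoff]
           if not supported:
               return []  # assumption: no supported rank, keep nothing
           last = rank_order.index(supported[-1])
           allowed = set(rank_order[:last + 1])
       else:
           # Step 2b: sum likelihoods per rank and keep ranks whose sum
           # reaches the likelihood cutoff.
           sums = defaultdict(float)
           for rank, _tax_id, likelihood in kept:
               sums[rank] += likelihood
           allowed = {r for r, total in sums.items() if total >= cutoff}

       return [c for c in kept if c[0] in allowed]

   # Hypothetical input for a single sequence.
   ranks = ["phylum", "genus", "species"]
   calls = [("genus", "G1", 1.0), ("species", "S1", 0.55),
            ("species", "S2", 0.40), ("species", "S3", 0.05)]
   bayes = {"genus": 5.0, "species": 0.3}

   print(refine_pplacer(calls, bayes, ranks))
   print(refine_pplacer(calls, bayes, ranks, bayes_cutoff=0))

With the default bayes cutoff only the genus call survives in this example;
with the cutoff set to zero the two species calls whose likelihoods sum above
``--cutoff`` are kept as well.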
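
To check which of the tables listed above were populated by a run, the
database can be inspected with Python's standard ``sqlite3`` module. The
sketch below relies only on the table names given in this document and makes
no assumption about their columns. The helper name is hypothetical, and the
in-memory database with a one-column ``runs`` table is a stand-in so that the
example runs on its own; it is not the real schema.

.. code-block:: python

   import sqlite3

   # Table names as listed in this document.
   CLASSIFY_TABLES = [
       "runs", "placements", "placement_names",
       "placement_classifications", "placement_evidence",
       "placement_position", "placement_median_identities",
       "placement_nbc", "multiclass",
   ]

   def table_row_counts(conn):
       """Return {table: row count} for the listed tables that exist."""
       existing = {
           row[0] for row in conn.execute(
               "SELECT name FROM sqlite_master WHERE type = 'table'")
       }
       counts = {}
       for table in CLASSIFY_TABLES:
           if table in existing:
               query = f'SELECT COUNT(*) FROM "{table}"'
               counts[table] = conn.execute(query).fetchone()[0]
       return counts

   # Stand-in database; the single column here is a placeholder only.
   conn = sqlite3.connect(":memory:")
   conn.execute("CREATE TABLE runs (placeholder INTEGER)")
   conn.execute("INSERT INTO runs VALUES (1)")
   print(table_row_counts(conn))

For a real run, replace the in-memory connection with
``sqlite3.connect(path)``, where ``path`` is the file passed to ``--sqlite``.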