:tocdepth: 3

.. _guppy_classify:

========
classify
========

`classify` outputs classification information in SQLite format.

::

  usage: classify [options] placefile[s]

Options
=======

-c  Reference package path. Required.
--sqlite  Specify the database file to use. Required.
--seed  Set the random seed, an integer > 0. Default is 1.
--classifier  Which classifier to use, out of 'pplacer', 'nbc', 'hybrid2', 'hybrid5' or 'rdp'. default: pplacer
--cutoff  The default value for the likelihood_cutoff param. Default: 0.90
--bayes-cutoff  The default value for the bayes_cutoff param. Default: 1.00
--multiclass-min  The default value for the multiclass_min param. Default: 0.20
--bootstrap-cutoff  The default value for the bootstrap_cutoff param. Default: 0.80
--bootstrap-extension-cutoff  The default value for the bootstrap_extension_cutoff param. Default: 0.40
--pp  Use posterior probability for our criteria in the pplacer classifier.
--tax-median-identity-from  Calculate the median identity for each sequence per-tax_id from the specified alignment.
--mrca-class  Classify against a placefile that was generated with MRCA classification.
--nbc-sequences  The query sequences to use for the NBC classifier. Can be specified multiple times for multiple inputs.
--word-length  The length of the words used for NBC classification. default: 8
--nbc-rank  The desired most specific rank for NBC classification. 'all' puts each sequence in the classifier at every rank of its lineage. default: genus
--n-boot  The number of times to bootstrap a sequence with the NBC classifier. 0 = no bootstrap. default: 100
-j  The number of processes to spawn to do NBC classification. default: 2
--no-pre-mask  Don't pre-mask the sequences for NBC classification.
--nbc-counts  Read/write NBC k-mer counts to the given file. File cannot be NFS mounted.
--nbc-as-rdp  Do NBC classification like RDP: find the lineage of the full-sequence classification, then bootstrap to find support for it.
--rdp-results  The RDP results file for use with the RDP classifier. Can be specified multiple times for multiple inputs.
--blast-results  The BLAST results file for use with the BLAST classifier. Can be specified multiple times for multiple inputs.
--no-random-tie-break  Take the first NBC hit even if there are others that are equally good.

Details
=======

This subcommand writes the classifications made by pplacer to a database.

*The classifications made by the current implementation of pplacer are done with a simple, root-dependent algorithm.
We are currently working on improved algorithms.*
For best results, first taxonomically root the tree in your reference package (so that the root of the tree corresponds to the "deepest" evolutionary event according to the taxonomy).
This can be done automatically with the `taxit reroot` command in taxtastic.
(Note that as of 27 May 2011, this requires the dev version of biopython available on github.)

The classifications are simply done by containment.
Say clade *A* is the smallest clade of the reference tree that contains a given placement.
The most specific classification for that read will be the lowest common ancestor of the taxonomic classifications for the leaves of *A*.
If the desired classification is more specific than that, then we get a disconnect between the desired and the actual classification.
For example, if we try to classify at the species level and the clade LCA is a genus, then we will get a genus name.
If there is uncertainty in read placement, then there is uncertainty in classification.
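
The containment rule can be sketched in a few lines of Python. This is an
illustration only, not pplacer's implementation: ``lineage_lca`` is a
hypothetical helper name, and the lineages are example data written out by
hand, with the root-most taxon first.

.. code-block:: python

    def lineage_lca(lineages):
        """Return the longest prefix shared by all root-first lineages."""
        lca = []
        for taxa in zip(*lineages):
            # Stop at the first rank where the leaves disagree.
            if len(set(taxa)) != 1:
                break
            lca.append(taxa[0])
        return lca

    # Example lineages for the leaves of clade A (illustrative data only).
    clade_a = [
        ["Bacteria", "Firmicutes", "Lactobacillus", "L. iners"],
        ["Bacteria", "Firmicutes", "Lactobacillus", "L. crispatus"],
    ]
    most_specific = lineage_lca(clade_a)[-1]

Here the leaves agree down to the genus and differ at the species, so a read
placed in this clade would receive a genus name even if a species-level
classification was requested.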

Classifiers
===========

``guppy classify`` has a variety of classifiers.

.. glossary::

    pplacer
      Takes placefiles as input and classifies using the method described
      above.

      When refining classifications for the ``multiclass`` table, first, all of
      the classifications with a likelihood of less than the value of
      ``--multiclass-min`` are discarded. Next, if the value for
      ``--bayes-cutoff`` is nonzero, ranks below the most specific rank with
      bayes factor evidence greater than or equal to that cutoff are discarded.
      Otherwise, the likelihoods of remaining classifications are summed
      per-rank, and ranks with a likelihood sum of less than ``--cutoff`` are
      discarded.

    nbc
      Takes sequences via the ``--nbc-sequences`` flag and classifies them with
      a naive bayes classifier.

      By default, the input sequences must be aligned: either aligned to the
      reference sequences, or supplied together with an alignment of the
      reference sequences (in the same manner as for pplacer). If the
      ``--no-pre-mask`` flag is specified, the input sequences may be
      unaligned, and must not also contain reference sequences.

      When refining classifications for the ``multiclass`` table, for each
      rank, the best classification with a bootstrap above the value of
      ``--bootstrap-cutoff`` is selected. Ranks with no such classifications
      are discarded.

    rdp
      Takes the output of Mothur's `classify.seqs`_ via the ``--rdp-results``
      flag and inserts the classifications into the database. This should be
      the ``.taxonomy`` file.

      Refinement for ``multiclass`` is done the same as for :term:`nbc`.

    blast
      Takes the output of BLAST_ via the ``--blast-results`` flag and inserts
      the classifications into the database. The output must be in outfmt 6.

      Refinement for ``multiclass`` is done the same as for :term:`nbc`.

    hybrid
      Still in flux.
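
The ``multiclass`` refinement rules described for :term:`pplacer` and
:term:`nbc` can be sketched as follows. This is a reading of the prose above,
not the actual implementation. The function names, the tuple layout, and the
``rank_order`` argument (root-most rank first) are all assumptions made for
illustration, as is returning nothing when no rank has enough bayes factor
evidence.

.. code-block:: python

    def refine_pplacer(rows, evidence, rank_order,
                       multiclass_min=0.20, bayes_cutoff=1.00, cutoff=0.90):
        """rows: (rank, tax_id, likelihood) tuples for one sequence.
        evidence: {rank: bayes factor evidence}.
        rank_order: list of ranks, root-most first (assumed layout)."""
        # Discard classifications below the minimum likelihood.
        kept = [r for r in rows if r[2] >= multiclass_min]
        if bayes_cutoff != 0:
            # Find the most specific rank with enough evidence and drop
            # every rank that is more specific than it.
            supported = [i for i, rank in enumerate(rank_order)
                         if evidence.get(rank, float("-inf")) >= bayes_cutoff]
            if not supported:
                return []  # assumption: no supported rank, nothing kept
            allowed = set(rank_order[:supported[-1] + 1])
        else:
            # Sum likelihoods per rank and keep ranks that reach the cutoff.
            sums = {}
            for rank, _tax_id, likelihood in kept:
                sums[rank] = sums.get(rank, 0.0) + likelihood
            allowed = {rank for rank, total in sums.items() if total >= cutoff}
        return [r for r in kept if r[0] in allowed]


    def refine_nbc(rows, bootstrap_cutoff=0.80):
        """rows: (rank, tax_id, bootstrap) tuples for one sequence.
        Returns {rank: (tax_id, bootstrap)} with the best hit per rank."""
        best = {}
        for rank, tax_id, bootstrap in rows:
            if bootstrap <= bootstrap_cutoff:
                continue  # not above the cutoff
            if rank not in best or bootstrap > best[rank][1]:
                best[rank] = (tax_id, bootstrap)
        return best

The keyword defaults mirror the defaults listed under Options. Ranks missing
from the dictionary returned by ``refine_nbc`` correspond to ranks that are
discarded.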


Sqlite
======

``guppy classify`` writes its output into a sqlite3 database. The argument to
the ``--sqlite`` flag is the sqlite3 database into which the results should be
put. This database must have first been initialized using :ref:`rppr prep_db
<rppr_prep_db>`.

The following tables are populated by ``guppy classify``:

* ``runs`` -- describes each separate invocation of ``guppy classify``; exactly
  one row will be added for each invocation.
* ``placements`` -- describes groups of sequences. Each row will represent one
  or more sequences and indicate which classifier was used.
* ``placement_names`` -- indicates which sequences are in this group of
  sequences and where each sequence came from.
* ``placement_classifications`` -- indicates tax_id and likelihood for the
  :term:`pplacer` and :term:`hybrid` classifiers.
* ``placement_evidence`` -- indicates bayes factor evidence for the
  :term:`pplacer` and :term:`hybrid` classifiers.
* ``placement_position`` -- indicates placement position for the
  :term:`pplacer` and :term:`hybrid` classifiers.
* ``placement_median_identities`` -- indicates sequence median percent identity
  for the :term:`pplacer` and :term:`hybrid` classifiers when run with the
  ``--tax-median-identity-from`` flag.
* ``placement_nbc`` -- indicates tax_id and bootstrap value for the
  :term:`nbc`, :term:`rdp`, :term:`blast`, and :term:`hybrid` classifiers.
* ``multiclass`` -- indicates the best classification and rank of
  classification from any classifier for a given sequence name and desired rank
  of classification. There might be multiple classifications for a particular
  sequence and desired rank, but only when using the :term:`pplacer` or
  :term:`hybrid` classifiers.
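
A quick way to check which of these tables a run populated is to count rows
per table with Python's standard ``sqlite3`` module. The sketch below makes no
assumptions about column names: it reads the table list from
``sqlite_master``. The helper name ``table_row_counts`` is hypothetical.

.. code-block:: python

    import sqlite3


    def table_row_counts(db_path):
        """Return {table_name: row_count} for every table in the database."""
        conn = sqlite3.connect(db_path)
        try:
            names = [row[0] for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name")]
            # Table names cannot be bound as parameters, so quote them.
            return {
                name: conn.execute(
                    'SELECT COUNT(*) FROM "{}"'.format(name)).fetchone()[0]
                for name in names
            }
        finally:
            conn.close()

Calling ``table_row_counts`` on the file passed to ``--sqlite`` before and
after a run should show, for example, the ``runs`` count increasing by one.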


.. _classify.seqs: http://www.mothur.org/wiki/Classify.seqs
.. _BLAST: http://www.ncbi.nlm.nih.gov/books/NBK1763/