scripts

The pplacer package comes with a few scripts to perform common tasks on reference packages and placements.

Installing

All scripts can be used by specifying the full path. For convenience, a Python setup.py file is provided, which will install them globally. To install, run:

$ python setup.py install

from the scripts/ subdirectory, prefixed with sudo if the python installation directory is not writable.

refpkg_align.py

refpkg_align.py works with reference package alignments and alignment profiles, providing methods to align sequences to a reference package alignment, and extract an alignment from a reference package.

refpkg_align.py depends on BioPython, as well as the external tools used for alignment: HMMER3, Infernal, and PyNAST.

List of subcommands

align

refpkg_align.py align aligns sequences to a reference package alignment for use with pplacer. For reference packages built with Infernal, cmalign is used for alignment. For packages built using HMMER3, alignment is performed with hmmalign. Reference packages lacking a profile entry are aligned using PyNAST. By default, an alignment mask is applied if it exists.

The output format varies: Stockholm for Infernal- and HMMER-based reference packages, FASTA for all others.

For Infernal-based reference packages, MPI may be used.
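The selection rules above can be summarized in a few lines. This is only an illustrative sketch: the function name choose_aligner and the profile_type labels are hypothetical and are not taken from refpkg_align.py itself.

```python
def choose_aligner(profile_type):
    """Return (tool, output_format) for a reference package profile type.

    profile_type is a hypothetical label: "INFERNAL", "HMMER3", or None
    when the reference package has no profile entry.
    """
    tools = {
        "INFERNAL": ("cmalign", "stockholm"),
        "HMMER3": ("hmmalign", "stockholm"),
    }
    # Packages without a profile entry fall back to PyNAST, which gives FASTA.
    return tools.get(profile_type, ("PyNAST", "fasta"))


print(choose_aligner("INFERNAL"))
print(choose_aligner(None))
```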

usage: refpkg_align.py align [options] refpkg seqfile outfile

Options

-h, --help            show this help message and exit
--align-opts OPTS     Alignment options, such as "--mapali $aln_sto". '$'
                      characters will need to be escaped if using template
                      variables. Available template variables are $aln_sto,
                      $profile. Defaults are as follows for the different
                      profiles: (PyNAST: "-l 150 -f /dev/null -g /dev/null")
                      (INFERNAL: "-1 --hbanded --sub --dna")
--alignment-method {PyNAST,HMMER3,INFERNAL}
                      Profile version to use. [default: Guess. PyNAST is
                      used if a valid CM or HMM is not found in the
                      reference package.]
--no-mask             Do not trim the alignment to unmasked columns.
                      [default: apply mask if it exists]
--debug               Enable debug output
--verbose             Enable verbose output

MPI Options

--use-mpi             Use MPI [infernal only]
--mpi-arguments MPI_ARGUMENTS
                      Arguments to pass to mpirun
--mpi-run MPI_RUN     Name of mpirun executable

extract

extract extracts a reference alignment from a reference package, applying a mask by default if one exists.

usage: refpkg_align.py extract [options] refpkg output_file

Options

positional arguments:
  refpkg                Reference package directory
  output_file           Destination

optional arguments:
  -h, --help            show this help message and exit
  --output-format OUTPUT_FORMAT
                        output format [default: stockholm]
  --no-mask             Do not apply mask to alignment [default: apply mask
                        if it exists]

mask

Warning: masking is experimental and we may change our mind about how it gets implemented.

Alignment masks may be specified through an entry named “mask” in the CONTENTS.json file of a reference package. The entry points to a file containing a comma-delimited list of the 0-based alignment column indices to keep after masking.

For example, a mask specification of:

0,1,2,3,4,5,6,28,29

would discard all columns in an alignment except for 0-6, 28, and 29.
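
As a minimal sketch of what such a mask does, the following parses the comma-delimited indices and keeps only those columns of an aligned sequence. The function names and the example sequence are hypothetical and are not part of refpkg_align.py.

```python
def parse_mask(text):
    """Parse a comma-delimited list of 0-based column indices."""
    return [int(tok) for tok in text.strip().split(",") if tok.strip()]


def apply_mask(aligned_seq, keep):
    """Return only the columns of aligned_seq whose indices are in keep."""
    return "".join(aligned_seq[i] for i in keep)


keep = parse_mask("0,1,2,3,4,5,6,28,29")
# A made-up 30-column aligned sequence: 7 bases, 21 gaps, 2 bases.
seq = "ACGTACG" + "-" * 21 + "TT"
print(apply_mask(seq, keep))  # ACGTACGTT
```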

sort_placefile.py

sort_placefile.py takes a placefile and sorts and formats its contents so that placefiles can be compared with a visual diff. Output is written to stdout by default.

usage: sort_placefile.py [-h] [-o FILE] infile

update_refpkg.py

update_refpkg.py updates a reference package from the 1.0 format to the 1.1 format. It takes the CONTENTS.json file in the reference package as its parameter and updates it in place, after making a backup copy.

usage: update_refpkg.py [-h] CONTENTS.json

check_placements.py

check_placements.py checks a placefile for potential issues, including:

  • Any like_weight_ratio being equal to 0.
  • The sum of the like_weight_ratios not being equal to 1.
  • Any post_prob being equal to 0.
  • The sum of the post_probs being equal to 0.
  • The sum of the post_probs not being equal to 1.

usage: check_placements.py example.jplace
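
The checks listed above could be sketched as follows. This is not the code of check_placements.py. It assumes an already-parsed placefile with a top-level "fields" list and a "p" list of rows per placement, and the tolerance tol is an arbitrary choice made for this example.

```python
import math


def check_placements(jplace, tol=1e-5):
    """Yield (placement_index, message) for each potential issue found."""
    fields = jplace["fields"]
    for idx, placement in enumerate(jplace["placements"]):
        # Turn each positional row into a dict keyed by field name.
        rows = [dict(zip(fields, row)) for row in placement["p"]]
        for key in ("like_weight_ratio", "post_prob"):
            values = [r[key] for r in rows if r.get(key) is not None]
            if not values:
                continue  # field absent from this placefile
            if any(v == 0 for v in values):
                yield idx, f"{key} equal to 0"
            total = sum(values)
            if key == "post_prob" and total == 0:
                # Reported on its own rather than also as "not equal to 1".
                yield idx, "sum of post_prob equal to 0"
            elif not math.isclose(total, 1.0, abs_tol=tol):
                yield idx, f"sum of {key} not equal to 1"


# Made-up data: the first placement is clean, the second is not.
example = {
    "fields": ["edge_num", "like_weight_ratio", "post_prob"],
    "placements": [
        {"p": [[0, 0.6, 0.5], [1, 0.4, 0.5]]},
        {"p": [[2, 0.0, 0.0], [3, 0.5, 0.0]]},
    ],
}
for idx, message in check_placements(example):
    print(idx, message)
```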

deduplicate_sequences.py

deduplicate_sequences.py deduplicates a sequence file and produces a dedup file suitable for use with guppy redup -m. See the redup documentation for details.
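
The core idea of deduplication can be sketched as grouping sequence names by identical sequence. The sketch assumes exact, case-sensitive string comparison, which may differ from what deduplicate_sequences.py does, and it does not attempt to reproduce the dedup file layout (see the redup documentation for that). The function name and the records are hypothetical.

```python
def deduplicate(records):
    """Group (name, sequence) pairs by identical sequence.

    Returns a list of (kept_name, all_names) in first-seen order, where
    kept_name is the first name observed for each distinct sequence.
    """
    groups = {}
    for name, seq in records:
        groups.setdefault(seq, []).append(name)
    return [(names[0], names) for names in groups.values()]


# Made-up records: "a" and "b" share a sequence.
records = [("a", "ACGT"), ("b", "ACGT"), ("c", "GGGG")]
print(deduplicate(records))  # [('a', ['a', 'b']), ('c', ['c'])]
```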

pca_for_qiime.py

pca_for_qiime.py converts the trans file output by guppy pca into the tab-delimited format expected by QIIME’s plotting functions.

usage: pca_for_qiime.py [-h] trans tsv

extract_taxonomy_from_biom.py

extract_taxonomy_from_biom.py extracts the taxonomy information from a BIOM file, producing seqinfo and taxonomy files which can then be placed into a reference package.

usage: extract_taxonomy_from_biom.py [-h] biom taxtable seqinfo

hrefpkg_query.py

hrefpkg_query.py classifies sequences using a hrefpkg. The output is a sqlite database with the same schema as created by rppr prep_db.

usage: hrefpkg_query.py [options] hrefpkg query_seqs classification_db

positional arguments:
  hrefpkg               hrefpkg to classify using
  query_seqs            input query sequences
  classification_db     output sqlite database

optional arguments:
  -h, --help            show this help message and exit
  -j CORES, --ncores CORES
                        number of cores to tell commands to use
  -r RANK, --classification-rank RANK
                        rank to perform the initial NBC classification at
  --workdir DIR         directory to write intermediate files to (default: a
                        temporary directory)
  --disable-cleanup     don't remove the work directory as the final step
  --use-mpi             run refpkg_align with MPI
  --alignment {align-each,merge-each,none}
                        respectively: align each input sequence; subset an
                        input stockholm alignment and merge each sequence to a
                        reference alignment; only subset an input stockholm
                        alignment (default: align-each)
  --cmscores FILE       in align-each mode, write out a file containing the
                        cmalign scores

external binaries:
  --pplacer PROG        pplacer binary to call
  --guppy PROG          guppy binary to call
  --rppr PROG           rppr binary to call
  --refpkg-align PROG   refpkg_align binary to call
  --cmalign PROG        cmalign binary to call

multiclass_concat.py

multiclass_concat.py takes a database which has been classified using guppy classify and creates a view named multiclass_concat. This view has the same schema as multiclass, with the addition of an id_count column. However, instead of yielding multiple rows when a sequence has multiple classifications at a rank, it yields a single row whose tax_id column contains all of the tax_ids concatenated together, delimited by commas.

To keep it easy to join multiclass_concat to the taxa table, a row is inserted into the taxa table for each concatenated tax_id present in multiclass_concat. Each such row has a tax_name created by concatenating the names of all the constituent tax_ids.
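
The concatenation idea can be illustrated with SQLite's GROUP_CONCAT on a toy table. The table below is a hypothetical, heavily simplified stand-in for multiclass (only name, want_rank, and tax_id columns, with made-up values), and the view is named multiclass_concat_demo to make clear it is not the view the script creates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the multiclass table.
con.execute("CREATE TABLE multiclass (name TEXT, want_rank TEXT, tax_id TEXT)")
con.executemany(
    "INSERT INTO multiclass VALUES (?, ?, ?)",
    [
        ("seq1", "species", "t1"),
        ("seq1", "species", "t2"),
        ("seq2", "species", "t3"),
    ],
)
# One row per (name, want_rank), with tax_ids joined by commas and counted.
con.execute(
    """
    CREATE VIEW multiclass_concat_demo AS
    SELECT name,
           want_rank,
           GROUP_CONCAT(tax_id, ',') AS tax_id,
           COUNT(*) AS id_count
      FROM multiclass
     GROUP BY name, want_rank
    """
)
rows = con.execute(
    "SELECT * FROM multiclass_concat_demo ORDER BY name"
).fetchall()
for row in rows:
    print(row)
```

Here seq1 collapses to a single row with an id_count of 2. SQLite does not guarantee the order of the concatenated values.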

usage: multiclass_concat.py [options] database

positional arguments:
  database    sqlite database (output of `rppr prep_db` after `guppy
              classify`)

optional arguments:
  -h, --help  show this help message and exit

split_qiime.py

split_qiime.py takes sequences in QIIME’s preprocessed FASTA format and generates a FASTA file which contains the original sequence names. Optionally, a specimen map can also be written out which maps from the original sequence names to their specimens as listed in the QIIME file.

For example, an incoming sequence identified by >PC.634_1 FLP3FBN01ELBSX will be written out as >FLP3FBN01ELBSX with an entry in the specimen_map of FLP3FBN01ELBSX,PC.634.
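
The header transformation in this example can be sketched as below. It assumes the specimen is everything before the last underscore of the first header token, which matches the example above but is otherwise a guess about the QIIME naming scheme. The function name is hypothetical.

```python
def split_header(header):
    """Split a QIIME FASTA header into (original_name, specimen).

    Assumes the form ">SPECIMEN_N ORIGINALNAME ...".
    """
    qiime_id, original_name = header.lstrip(">").split()[:2]
    # Drop the trailing "_N" counter to recover the specimen name.
    specimen = qiime_id.rsplit("_", 1)[0]
    return original_name, specimen


print(split_header(">PC.634_1 FLP3FBN01ELBSX"))  # ('FLP3FBN01ELBSX', 'PC.634')
```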

usage: split_qiime.py [-h] [qiime] [fasta] [specimen_map]

Extract the original sequence names from a QIIME FASTA file.

positional arguments:
  qiime         input QIIME file (default: stdin)
  fasta         output FASTA file (default: stdout)
  specimen_map  if specified, output specimen map (default: don't write)

optional arguments:
  -h, --help    show this help message and exit