The pplacer package comes with a few scripts to perform common tasks on reference packages and placements:
All scripts can be used by specifying the full path. For convenience, a Python setup.py file is provided, which will install them globally. To install, run:
$ python setup.py install
from the scripts/ subdirectory, prefixed with sudo if the python installation directory is not writable.
refpkg_align.py works with reference package alignments and alignment profiles, providing methods to align sequences to a reference package alignment, and extract an alignment from a reference package.
refpkg_align.py depends on BioPython, as well as the external tools used for alignment: HMMER3, Infernal, and PyNAST.
refpkg_align.py align aligns sequences to a reference package alignment for use with pplacer. For reference packages built with Infernal, cmalign is used for alignment. For packages built using HMMER3, alignment is performed with hmmalign. Reference packages lacking a profile entry are aligned using PyNAST. By default, an alignment mask is applied if it exists.
The output format varies: Stockholm for Infernal- and HMMER-based reference packages, FASTA for all others.
For Infernal-based reference packages, MPI may be used.
usage: refpkg_align.py align [options] refpkg seqfile outfile
-h, --help show this help message and exit
--align-opts OPTS Alignment options, such as "--mapali $aln_sto". '$'
characters will need to be escaped if using template
variables. Available template variables are $aln_sto,
$profile. Defaults are as follows for the different
profiles: (PyNAST: "-l 150 -f /dev/null -g /dev/null")
(INFERNAL: "-1 --hbanded --sub --dna")
--alignment-method {PyNAST,HMMER3,INFERNAL}
Profile version to use. [default: Guess. PyNAST is
used if a valid CM or HMM is not found in the
reference package.]
--no-mask Do not trim the alignment to unmasked columns.
[default: apply mask if it exists]
--debug Enable debug output
--verbose Enable verbose output
MPI Options
--use-mpi Use MPI [infernal only]
--mpi-arguments MPI_ARGUMENTS
Arguments to pass to mpirun
--mpi-run MPI_RUN Name of mpirun executable
extract extracts a reference alignment from a reference package, apply a mask if it exists by default.
usage: refpkg_align.py extract [options] refpkg output_file
positional arguments:
refpkg Reference package directory
output_file Destination
optional arguments:
-h, --help show this help message and exit
--output-format OUTPUT_FORMAT
output format [default: stockholm]
--no-mask Do not apply mask to alignment [default: apply mask
if it exists]
Warning: masking is experimental and we may change our mind about how it gets implemented.
Alignment masks may be specified through an entry named “mask” in the CONTENTS.json file of a reference package pointing to a file with a comma-delimited set of 0-based indices in an alignment to keep after masking.
For example, a mask specification of:
0,1,2,3,4,5,6,28,29
Would discard all columns in an alignment except for 0-7, 28, and 29.
sort_placefile.py takes a placefile and sorts and formats its contents for then performing a visual diff of placefiles. Output defaults to being emitted to stdout.
usage: sort_placefile.py [-h] [-o FILE] infile
update_refpkg.py updates a reference package from the 1.0 format to the 1.1 format. It takes the CONTENTS.json file in the reference package as its parameter and updates it in place, after making a backup copy.
usage: update_refpkg.py [-h] CONTENTS.json
check_placements.py checks a placefile for potential issues, including:
- Any like_weight_ratio being equal to 0.
- The sum of the like_weight_ratios not being equal to 1.
- Any post_prob being equal to 0.
- The sum of the post_probs being equal to 0.
- The sum of the post_probs not being equal to 1.
usage: check_placements.py example.jplace
deduplicate_sequences.py deduplicates a sequence file and produces a dedup file suitable for use with guppy redup -m. See the redup documentation for details.
pca_for_qiime.py converts the trans file output by guppy pca into the tab-delimited format expected by QIIME’s plotting functions.
usage: pca_for_qiime.py [-h] trans tsv
extract_taxonomy_from_biom.py extracts the taxonomy information from a BIOM file, producing seqinfo and taxonomy files which can then be placed into a reference package.
usage: extract_taxonomy_from_biom.py [-h] biom taxtable seqinfo
hrefpkg_query.py classifies sequences using a hrefpkg. The output is a sqlite database with the same schema as created by rppr prep_db.
usage: hrefpkg_query.py [options] hrefpkg query_seqs classification_db
positional arguments:
hrefpkg hrefpkg to classify using
query_seqs input query sequences
classification_db output sqlite database
optional arguments:
-h, --help show this help message and exit
-j CORES, --ncores CORES
number of cores to tell commands to use
-r RANK, --classification-rank RANK
rank to perform the initial NBC classification at
--workdir DIR directory to write intermediate files to (default: a
temporary directory)
--disable-cleanup don't remove the work directory as the final step
--use-mpi run refpkg_align with MPI
--alignment {align-each,merge-each,none}
respectively: align each input sequence; subset an
input stockholm alignment and merge each sequence to a
reference alignment; only subset an input stockholm
alignment (default: align-each)
--cmscores FILE in align-each mode, write out a file containing the
cmalign scores
external binaries:
--pplacer PROG pplacer binary to call
--guppy PROG guppy binary to call
--rppr PROG rppr binary to call
--refpkg-align PROG refpkg_align binary to call
--cmalign PROG cmalign binary to call
multiclass_concat.py takes a database which has been classified using guppy classify and creates a view multiclass_concat. This view has the same schema as multiclass, with the addition of an id_count column. However, instead of getting multiple rows when a sequence has multiple classifications at a rank, the tax_id column will be all of the tax_ids concatenated together, delimited by ,.
To ensure that it’s still easy to join multiclass_concat to the taxa table, rows are inserted into the taxa table for each concatenated tax_id present in the multiclass_concat table which have a tax_name created by concatenating the names of all the constituent tax_ids.
usage: multiclass_concat.py [options] database
positional arguments:
database sqlite database (output of `rppr prep_db` after `guppy
classify`)
optional arguments:
-h, --help show this help message and exit
split_qiime.py takes sequences in QIIME’s preprocessed FASTA format and generates a FASTA file which contains the original sequence names. Optionally, a specimen map can also be written out which maps from the original sequence names to their specimens as listed in the QIIME file.
For example, an incoming sequence identified by >PC.634_1 FLP3FBN01ELBSX will be written out as >FLP3FBN01ELBSX with an entry in the specimen_map of FLP3FBN01ELBSX,PC.634.
usage: split_qiime.py [-h] [qiime] [fasta] [specimen_map]
Extract the original sequence names from a QIIME FASTA file.
positional arguments:
qiime input QIIME file (default: stdin)
fasta output FASTA file (default: stdout)
specimen_map if specified, output specimen map (default: don't write)
optional arguments:
-h, --help show this help message and exit