scripts
The pplacer package comes with a few scripts to perform common tasks on reference packages and placements:
Installing
All scripts can be used by specifying the full path. For convenience, a Python
setup.py file is provided, which will install them globally. To install, run:
$ python setup.py install
from the scripts/ subdirectory, prefixed with sudo if the Python installation
directory is not writable.
refpkg_align.py
refpkg_align.py works with reference package alignments and alignment
profiles, providing methods to align sequences to a reference package
alignment and to extract an alignment from a reference package.
refpkg_align.py depends on BioPython, as well as the external tools used for
alignment: HMMER3, Infernal, and PyNAST.
List of subcommands
align
refpkg_align.py align aligns sequences to a reference package alignment for
use with pplacer. For reference packages built with Infernal, cmalign is
used for alignment. For packages built using HMMER3, alignment is performed
with hmmalign. Reference packages lacking a profile entry are aligned
using PyNAST. By default, an alignment mask is applied if it exists.
The output format varies: Stockholm for Infernal- and HMMER-based reference packages, FASTA for all others.
For Infernal-based reference packages, MPI may be used.
usage: refpkg_align.py align [options] refpkg seqfile outfile
Options
  -h, --help            show this help message and exit
  --align-opts OPTS     Alignment options, such as "--mapali $aln_sto". '$'
                        characters will need to be escaped if using template
                        variables. Available template variables are $aln_sto,
                        $profile. Defaults are as follows for the different
                        profiles: (PyNAST: "-l 150 -f /dev/null -g /dev/null")
                        (INFERNAL: "-1 --hbanded --sub --dna")
  --alignment-method {PyNAST,HMMER3,INFERNAL}
                        Profile version to use. [default: Guess. PyNAST is
                        used if a valid CM or HMM is not found in the
                        reference package.]
  --no-mask             Do not trim the alignment to unmasked columns.
                        [default: apply mask if it exists]
  --debug               Enable debug output
  --verbose             Enable verbose output

MPI Options
  --use-mpi             Use MPI [infernal only]
  --mpi-arguments MPI_ARGUMENTS
                        Arguments to pass to mpirun
  --mpi-run MPI_RUN     Name of mpirun executable
extract
extract extracts a reference alignment from a reference package, applying a
mask by default if one exists.
usage: refpkg_align.py extract [options] refpkg output_file
Options
positional arguments:
  refpkg                Reference package directory
  output_file           Destination

optional arguments:
  -h, --help            show this help message and exit
  --output-format OUTPUT_FORMAT
                        output format [default: stockholm]
  --no-mask             Do not apply mask to alignment [default: apply mask
                        if it exists]
mask
Warning: masking is experimental and we may change our mind about how it gets implemented.
An alignment mask may be specified through an entry named “mask” in the
CONTENTS.json file of a reference package. The entry points to a file
containing a comma-delimited set of 0-based column indices to keep after
masking.
For example, a mask specification of:
0,1,2,3,4,5,6,28,29
would discard all columns in an alignment except for 0-6, 28, and 29.
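The effect of such a mask can be illustrated with a minimal Python sketch. This is not the code used by refpkg_align.py; the function names parse_mask and apply_mask and the 30-column sequence are hypothetical, chosen only to show how a list of 0-based indices selects columns.

```python
def parse_mask(text):
    """Parse a comma-delimited string of 0-based column indices."""
    return [int(field) for field in text.strip().split(",") if field.strip()]


def apply_mask(aligned_seq, keep):
    """Return only the alignment columns listed in keep, in the order given."""
    return "".join(aligned_seq[i] for i in keep)


# The example mask from the text above.
mask = parse_mask("0,1,2,3,4,5,6,28,29")

# A hypothetical 30-column aligned sequence: 7 bases, 21 gaps, 2 bases.
row = "ACGTACG" + "-" * 21 + "TT"
masked = apply_mask(row, mask)
```

Here masked keeps the first seven columns plus columns 28 and 29, so every gap column is dropped.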
sort_placefile.py
sort_placefile.py takes a placefile and sorts and formats its contents so that
placefiles can be compared with a visual diff. By default, output is written
to stdout.
usage: sort_placefile.py [-h] [-o FILE] infile
update_refpkg.py
update_refpkg.py updates a reference package from the 1.0 format to the 1.1
format. It takes the CONTENTS.json file in the reference package as its
parameter and updates it in place, after making a backup copy.
usage: update_refpkg.py [-h] CONTENTS.json
check_placements.py
check_placements.py checks a placefile for potential issues, including:
- Any like_weight_ratio being equal to 0.
- The sum of the like_weight_ratios not being equal to 1.
- Any post_prob being equal to 0.
- The sum of the post_probs being equal to 0.
- The sum of the post_probs not being equal to 1.
usage: check_placements.py example.jplace
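The checks listed above can be sketched in a few lines of Python. This is only an illustration, not the script's implementation: it assumes a jplace document already loaded with json, laid out with a top-level "fields" list and a "placements" list whose entries carry their rows under "p". The function names, the messages, and the tolerance tol are hypothetical.

```python
import math


def check_pquery(fields, rows, tol=1e-5):
    """Return a list of problems found in the placement rows of one entry."""
    problems = []
    for name in ("like_weight_ratio", "post_prob"):
        if name not in fields:
            continue
        col = fields.index(name)
        # Skip missing values, e.g. when posterior probabilities were not computed.
        values = [row[col] for row in rows if row[col] is not None]
        if not values:
            continue
        if any(v == 0 for v in values):
            problems.append("%s equal to 0" % name)
        total = sum(values)
        if total == 0:
            problems.append("sum of %s is 0" % name)
        elif not math.isclose(total, 1.0, abs_tol=tol):
            problems.append("sum of %s is not 1" % name)
    return problems


def check_jplace(jplace):
    """Map the index of each problematic placement entry to its problems."""
    report = {}
    for index, entry in enumerate(jplace["placements"]):
        problems = check_pquery(jplace["fields"], entry["p"])
        if problems:
            report[index] = problems
    return report
```

A floating-point tolerance is used for the sum checks because ratios that should add up to 1 rarely do so exactly after serialization; the value chosen here is arbitrary.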
deduplicate_sequences.py
deduplicate_sequences.py deduplicates a sequence file and produces a dedup
file suitable for use with guppy redup -m. See the redup documentation for
details.
pca_for_qiime.py
pca_for_qiime.py converts the trans file output by guppy pca into
the tab-delimited format expected by QIIME’s plotting functions.
usage: pca_for_qiime.py [-h] trans tsv
extract_taxonomy_from_biom.py
extract_taxonomy_from_biom.py extracts the taxonomy information from a BIOM
file, producing seqinfo and taxonomy files which can then be placed into a
reference package.
usage: extract_taxonomy_from_biom.py [-h] biom taxtable seqinfo
hrefpkg_query.py
hrefpkg_query.py classifies sequences using a hrefpkg. The output is a
sqlite database with the same schema as created by rppr prep_db.
usage: hrefpkg_query.py [options] hrefpkg query_seqs classification_db
positional arguments:
  hrefpkg               hrefpkg to classify using
  query_seqs            input query sequences
  classification_db     output sqlite database

optional arguments:
  -h, --help            show this help message and exit
  -j CORES, --ncores CORES
                        number of cores to tell commands to use
  -r RANK, --classification-rank RANK
                        rank to perform the initial NBC classification at
  --workdir DIR         directory to write intermediate files to (default: a
                        temporary directory)
  --disable-cleanup     don't remove the work directory as the final step
  --use-mpi             run refpkg_align with MPI
  --alignment {align-each,merge-each,none}
                        respectively: align each input sequence; subset an
                        input stockholm alignment and merge each sequence to a
                        reference alignment; only subset an input stockholm
                        alignment (default: align-each)
  --cmscores FILE       in align-each mode, write out a file containing the
                        cmalign scores

external binaries:
  --pplacer PROG        pplacer binary to call
  --guppy PROG          guppy binary to call
  --rppr PROG           rppr binary to call
  --refpkg-align PROG   refpkg_align binary to call
  --cmalign PROG        cmalign binary to call
multiclass_concat.py
multiclass_concat.py takes a database which has been classified using
guppy classify and creates a view named multiclass_concat. This view has the
same schema as multiclass, with the addition of an id_count column. However,
instead of producing multiple rows when a sequence has multiple
classifications at a rank, it produces a single row whose tax_id column holds
all of the tax_ids concatenated together, delimited by a comma (,).
To ensure that it’s still easy to join multiclass_concat to the taxa table, a
row is inserted into the taxa table for each concatenated tax_id present in
multiclass_concat, with a tax_name created by concatenating the names of all
the constituent tax_ids.
usage: multiclass_concat.py [options] database
positional arguments:
  database    sqlite database (output of `rppr prep_db` after `guppy
              classify`)

optional arguments:
  -h, --help  show this help message and exit
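The row-collapsing behaviour of the multiclass_concat view can be illustrated in plain Python. This sketch does not touch a database and is not how the script builds the view. The simplified (name, rank, tax_id) rows, the function names, the sorting of tax_ids before joining, and the name separator sep are all assumptions made for illustration.

```python
from collections import defaultdict


def concat_classifications(rows):
    """Collapse (name, rank, tax_id) rows into one entry per (name, rank).

    Returns a dict mapping (name, rank) to (joined_tax_id, id_count).
    """
    grouped = defaultdict(list)
    for name, rank, tax_id in rows:
        grouped[(name, rank)].append(tax_id)
    # Sorting is an assumption; it only makes the joined id reproducible here.
    return {
        key: (",".join(sorted(ids)), len(ids))
        for key, ids in grouped.items()
    }


def concat_tax_name(joined_tax_id, tax_names, sep="/"):
    """Build a name for a concatenated tax_id from its constituents' names."""
    return sep.join(tax_names[tax_id] for tax_id in joined_tax_id.split(","))


# Hypothetical classifications: seq1 is ambiguous at the species rank.
rows = [
    ("seq1", "species", "1301"),
    ("seq1", "species", "1305"),
    ("seq2", "species", "1301"),
]
collapsed = concat_classifications(rows)
names = {"1301": "taxon A", "1305": "taxon B"}
joined_name = concat_tax_name(collapsed[("seq1", "species")][0], names)
```

In this example seq1 ends up with a single entry whose tax_id is "1301,1305" and whose id_count is 2, while seq2 keeps its single tax_id with an id_count of 1.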
split_qiime.py
split_qiime.py takes sequences in QIIME’s preprocessed FASTA format and
generates a FASTA file which contains the original sequence names. Optionally,
a specimen map can also be written out which maps from the original sequence
names to their specimens as listed in the QIIME file.
For example, an incoming sequence identified by >PC.634_1 FLP3FBN01ELBSX
will be written out as >FLP3FBN01ELBSX with an entry in the specimen_map of
FLP3FBN01ELBSX,PC.634.
usage: split_qiime.py [-h] [qiime] [fasta] [specimen_map]
Extract the original sequence names from a QIIME FASTA file.

positional arguments:
  qiime         input QIIME file (default: stdin)
  fasta         output FASTA file (default: stdout)
  specimen_map  if specified, output specimen map (default: don't write)

optional arguments:
  -h, --help    show this help message and exit
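The header transformation described in the example above can be sketched as follows. This is an illustration rather than the script's code: the function name split_qiime_header is hypothetical, and the sketch assumes that the specimen is everything before the last underscore of the first header token.

```python
def split_qiime_header(header):
    """Split a QIIME FASTA header into (original_name, specimen).

    Assumes the header looks like '>SPECIMEN_N ORIGINAL_NAME [extra fields]'.
    """
    qiime_id, original_name = header.lstrip(">").split()[:2]
    # Drop the trailing '_N' counter to recover the specimen name.
    specimen = qiime_id.rsplit("_", 1)[0]
    return original_name, specimen


original_name, specimen = split_qiime_header(">PC.634_1 FLP3FBN01ELBSX")
fasta_header = ">" + original_name
specimen_map_line = "%s,%s" % (original_name, specimen)
```

Splitting on the last underscore, rather than the first, keeps specimen names that themselves contain underscores intact.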