redup
redup restores duplicates to deduped placefiles.
usage: redup -d dupfile placefile
Options
-o | Specify the filename to write to. |
--out-dir | Specify the directory to write files to. |
--prefix | Specify a string to be prepended to filenames. |
-d | The dedup file to use to restore duplicates. |
-m | If specified, redup with counts instead of a name list. |
--as-mass | If specified, add mass instead of names to each pquery. |
Details
From placefiles generated by running pplacer on a deduplicated sequence file, restore duplicated sequences to the placefiles.
A script, deduplicate_sequences.py, is included to deduplicate sequences:
scripts/deduplicate_sequences.py --deduplicated-sequences-file sample.dedup sample.fasta sample_deduped.fasta
pplacer -c sample.refpkg sample_deduped.fasta
guppy redup -d sample.dedup --prefix reduped_ sample_deduped.jplace
If you wish to use ‘Split’ placefiles in guppy analysis, specify a map using --split-map when deduplicating. At least one read from each group will then be added to the dedup file and restored by guppy redup. By default, duplicate sequences are reduced to a single sequence per split, with mass totalling all the identical sequences within the split. If you want to retain all sequence IDs after reduplicating, use the --keep-ids flag.
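The default and --keep-ids behaviours described above can be illustrated with a minimal Python sketch. It is not the code of deduplicate_sequences.py: the function name dedup_rows, its arguments, and the choice of the first name seen as the kept name are assumptions made for illustration. Each row it yields has three fields: the name kept in the deduplicated file, the name to restore, and the mass.

```python
def dedup_rows(sequences, split_map=None, keep_ids=False):
    """Yield (kept_name, restored_name, mass) rows for a dedup file.

    sequences: iterable of (name, sequence) pairs.
    split_map: optional dict mapping a read name to its split (group) label.
    keep_ids:  if True, emit one row per original name with mass 1.

    Hypothetical helper for illustration only.
    """
    split_map = split_map or {}
    # sequence string -> {split label -> [names]}
    # (plain dicts preserve insertion order in Python 3.7+)
    groups = {}
    for name, seq in sequences:
        by_split = groups.setdefault(seq, {})
        by_split.setdefault(split_map.get(name), []).append(name)

    for by_split in groups.values():
        # Assumption: the first name seen represents the unique sequence.
        kept = next(iter(by_split.values()))[0]
        for names in by_split.values():
            if keep_ids:
                for name in names:
                    yield (kept, name, 1)
            else:
                # One row per split; mass is the number of identical reads.
                yield (kept, names[0], len(names))


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "ACGT"), ("r4", "GGCC")]
splits = {"r1": "s1", "r2": "s1", "r3": "s2", "r4": "s1"}
default_rows = list(dedup_rows(reads, splits))
all_id_rows = list(dedup_rows(reads, splits, keep_ids=True))
```

In this sketch, r1 and r2 collapse into a single row of mass 2 in split s1, while r3 keeps its own row because it belongs to a different split.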
The format for dedup files is very simple, for ease of reading and writing. Each dedup file is just a CSV file with three columns: the sequence name in the deduplicated file, the name to put in the reduplicated file, and the mass associated with the latter name. For most purposes, the ingoing mass will simply be the number of reads. For example, with the following dedup file:
A_0,A_0,1
A_0,A_1,3
A_0,A_2,1
B_0,B_0,1
C_0,C_0,2
C_0,C_1,2
For one set of name/mass pairs, the transformation done is:
[A_0, 1] -> [A_0, 1; A_1, 3; A_2, 1]
[B_0, 1] -> [B_0, 1]
[C_0, 1] -> [C_0, 2; C_1, 2]
And a more complicated example:
[A_0, 2; B_0, 1] -> [A_0, 2; A_1, 6; A_2, 2; B_0, 1]
[C_0, 0.5] -> [C_0, 1; C_1, 1]
[D_0, 1] -> [D_0, 1]
Sequences with no changes need not be present in the dedup file.
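The transformation shown in the examples can be sketched in a few lines of Python. This is an illustration inferred from the examples above, not guppy's implementation; the names read_dedup and redup are hypothetical. The rule it assumes is that each name found in the dedup file is replaced by its restored names, with the original mass multiplied by each dedup mass, and that names absent from the file pass through unchanged.

```python
import csv
import io
from collections import defaultdict

DEDUP_CSV = """\
A_0,A_0,1
A_0,A_1,3
A_0,A_2,1
B_0,B_0,1
C_0,C_0,2
C_0,C_1,2
"""


def read_dedup(handle):
    """Parse a dedup CSV into {kept_name: [(restored_name, mass), ...]}."""
    mapping = defaultdict(list)
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        kept, restored, mass = row
        mapping[kept].append((restored, float(mass)))
    return mapping


def redup(pairs, mapping):
    """Expand a list of (name, mass) pairs using the dedup mapping."""
    out = []
    for name, mass in pairs:
        if name in mapping:
            # Scale the incoming mass by each restored name's dedup mass.
            out.extend((restored, mass * m) for restored, m in mapping[name])
        else:
            # Names not listed in the dedup file are left unchanged.
            out.append((name, mass))
    return out


mapping = read_dedup(io.StringIO(DEDUP_CSV))
expanded = redup([("A_0", 2), ("B_0", 1)], mapping)
# expanded == [("A_0", 2.0), ("A_1", 6.0), ("A_2", 2.0), ("B_0", 1.0)]
```

Calling redup([("D_0", 1)], mapping) returns the pair unchanged, matching the note that sequences with no changes need not appear in the dedup file.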