redup

redup restores duplicates to deduped placefiles.

usage: redup -d dupfile placefile

Options

-o Specify the filename to write to.
--out-dir Specify the directory to write files to.
--prefix Specify a string to be prepended to filenames.
-d The dedup file to use to restore duplicates.
-m If specified, redup with counts instead of a name list.
--as-mass If specified, add mass instead of names to each pquery.

Details

Given placefiles generated by running pplacer on a deduplicated sequence file, redup restores the duplicated sequences to those placefiles.

A script, deduplicate_sequences.py, is included to deduplicate sequences:

scripts/deduplicate_sequences.py --deduplicated-sequences-file sample.dedup sample.fasta sample_deduped.fasta
pplacer -c sample.refpkg sample_deduped.fasta
guppy redup -d sample.dedup --prefix reduped_ sample_deduped.jplace

If you wish to use ‘Split’ placefiles in guppy analysis, specify a map using --split-map when deduplicating. At least one read from each group will then be added to the dedup file and restored by guppy redup. By default, duplicate sequences are reduced to a single sequence per split, with a mass equal to the total of all the identical sequences within the split. If you want to retain all sequence IDs after reduplicating, use the --keep-ids flag.

The format for dedup files is very simple, for ease of reading and writing. Each dedup file is a CSV file with three columns: the sequence name in the deduplicated file, the name to put in the reduplicated file, and the mass associated with the latter name. For most purposes, the mass will simply be the number of reads. For example, consider the following dedup file:

A_0,A_0,1
A_0,A_1,3
A_0,A_2,1
B_0,B_0,1
C_0,C_0,2
C_0,C_1,2

For one set of name/mass pairs, the transformation is as follows; each incoming mass is multiplied by the mass listed in the dedup file:

[A_0, 1] -> [A_0, 1; A_1, 3; A_2, 1]
[B_0, 1] -> [B_0, 1]
[C_0, 1] -> [C_0, 2; C_1, 2]

And a more complicated example:

[A_0, 2; B_0, 1] -> [A_0, 2; A_1, 6; A_2, 2; B_0, 1]
[C_0, 0.5] -> [C_0, 1; C_1, 1]
[D_0, 1] -> [D_0, 1]

Sequences with no changes need not be present in the dedup file.
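
The transformation above can be sketched in a few lines of Python. This is an illustrative reimplementation inferred from the examples on this page, not the code guppy itself uses. The function names read_dedup and redup_pairs are hypothetical, and the sketch assumes that each incoming mass is multiplied by the mass in the dedup file and that names absent from the dedup file pass through unchanged:

```python
import csv
import io
from collections import defaultdict


def read_dedup(handle):
    """Map each deduplicated name to a list of (restored name, mass) pairs."""
    mapping = defaultdict(list)
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        kept_name, restored_name, mass = row
        mapping[kept_name].append((restored_name, float(mass)))
    return mapping


def redup_pairs(pairs, mapping):
    """Expand one set of (name, mass) pairs using the dedup mapping.

    Assumption: the incoming mass is multiplied by the dedup-file mass.
    Names not listed in the mapping are passed through unchanged.
    """
    restored = []
    for name, mass in pairs:
        if name in mapping:
            restored.extend((new, mass * m) for new, m in mapping[name])
        else:
            restored.append((name, mass))
    return restored


# The example dedup file from this page.
DEDUP = """\
A_0,A_0,1
A_0,A_1,3
A_0,A_2,1
B_0,B_0,1
C_0,C_0,2
C_0,C_1,2
"""

mapping = read_dedup(io.StringIO(DEDUP))
print(redup_pairs([("A_0", 2), ("B_0", 1)], mapping))
print(redup_pairs([("C_0", 0.5)], mapping))
print(redup_pairs([("D_0", 1)], mapping))
```

The three calls at the end correspond to the three rows of the more complicated example, including the D_0 case, which has no entry in the dedup file.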