:tocdepth: 3

.. _guppy_redup:

=====
redup
=====

`redup` restores duplicates to deduped placefiles.
::

  usage: redup -d dupfile placefile

Options
=======

-o         Specify the filename to write to.
--out-dir  Specify the directory to write files to.
--prefix   Specify a string to be prepended to filenames.
-d         The dedup file to use to restore duplicates.
-m         If specified, redup with counts instead of a name list.
--as-mass  If specified, add mass instead of names to each pquery.

Details
=======

From placefiles generated by running ``pplacer`` on a deduplicated sequence file, restore duplicated sequences to the placefiles.

A script is included to deduplicate sequences: :ref:`deduplicate-sequences`::

    scripts/deduplicate_sequences.py --deduplicated-sequences-file sample.dedup sample.fasta sample_deduped.fasta
    pplacer -c sample.refpkg sample_deduped.fasta
    guppy redup -d sample.dedup --prefix reduped_ sample_deduped.jplace

If you wish to use :ref:`split-placefiles` in ``guppy`` analysis, specify a map using ``--split-map``. At least one read from each group will then be added to the dedup file and added back by ``guppy redup``. By default, duplicate sequences are reduced to a single sequence per split, with mass totalling all the identical sequences within the split. If you want to retain all sequence IDs after reduplicating, use the ``--keep-ids`` flag.

The format for dedup files is deliberately simple, for ease of reading and writing. Each dedup file is a CSV file with three columns: the sequence name in the deduplicated file, the name to put in the reduplicated file, and the mass associated with the latter name. For most purposes, the ingoing mass will simply be the number of reads.

For example, with the following dedup file::

    A_0,A_0,1
    A_0,A_1,3
    A_0,A_2,1
    B_0,B_0,1
    C_0,C_0,2
    C_0,C_1,2

For one set of name/mass pairs, the transformation done is::

    [A_0, 1] -> [A_0, 1; A_1, 3; A_2, 1]
    [B_0, 1] -> [B_0, 1]
    [C_0, 1] -> [C_0, 2; C_1, 2]

And a more complicated example::

    [A_0, 2; B_0, 1] -> [A_0, 2; A_1, 6; A_2, 2; B_0, 1]
    [C_0, 0.5] -> [C_0, 1; C_1, 1]
    [D_0, 1] -> [D_0, 1]

In both examples, each output mass is the input mass multiplied by the mass listed in the dedup file. Sequences with no changes need not be present in the dedup file.
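The transformation illustrated above can be sketched in a few lines of Python. This is only an illustration of the mapping as described on this page, not the code ``guppy`` itself runs. The function names ``read_dedup`` and ``redup`` are hypothetical, and the sketch assumes that each incoming mass is multiplied by the mass in the dedup file, which is what the examples imply.

```python
import csv
import io
from collections import defaultdict


def read_dedup(handle):
    """Map each deduplicated name to its list of (restored name, mass) pairs."""
    mapping = defaultdict(list)
    for row in csv.reader(handle):
        if not row:
            continue  # tolerate blank lines
        kept_name, restored_name, mass = row
        mapping[kept_name].append((restored_name, float(mass)))
    return mapping


def redup(pairs, mapping):
    """Expand (name, mass) pairs using the dedup mapping.

    Assumption: the restored mass is the incoming mass times the dedup mass.
    Names absent from the mapping pass through unchanged.
    """
    out = []
    for name, mass in pairs:
        if name not in mapping:
            out.append((name, mass))
            continue
        for restored_name, dedup_mass in mapping[name]:
            out.append((restored_name, mass * dedup_mass))
    return out


# The example dedup file from this page.
DEDUP_CSV = """\
A_0,A_0,1
A_0,A_1,3
A_0,A_2,1
B_0,B_0,1
C_0,C_0,2
C_0,C_1,2
"""

mapping = read_dedup(io.StringIO(DEDUP_CSV))
print(redup([("A_0", 2), ("B_0", 1)], mapping))
print(redup([("C_0", 0.5)], mapping))
print(redup([("D_0", 1)], mapping))
```

Under these assumptions the three calls reproduce the rows of the more complicated example, including ``D_0``, which is not listed in the dedup file and so passes through untouched.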