Filter & Dedup

Filter bam

samtools view -F 1804 -f 2 -u -q 30 xxx.PE2SE.bam | sambamba sort -n  /dev/stdin -o /output_dir/xxx.PE2SE.dupmark.bam

Filtered by:

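  • Improper-mapping flags (-F 1804): drops reads that are unmapped, have an unmapped mate, are not primary alignments, fail platform/vendor quality checks, or are marked as duplicates
  • Poor mapping quality (-q 30): drops reads with MAPQ < 30, which includes multi-mapped reads
  • Unmated reads (-f 2): keeps only properly paired reads, i.e. pairs in which both the forward and reverse reads are mapped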
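
To make the 1804 value less opaque, here is a minimal sketch that decodes a SAM flag into its component bits using plain shell arithmetic. The function name decode_flag is made up for illustration; the bit values come from the SAM format, and the labels follow the names samtools uses.

```shell
#!/bin/sh
# Decode a SAM flag value into the names of the bits that are set.
# decode_flag is a hypothetical helper, not part of samtools.
decode_flag() {
    flag=$1
    out=""
    for pair in 1:PAIRED 2:PROPER_PAIR 4:UNMAP 8:MUNMAP \
                16:REVERSE 32:MREVERSE 64:READ1 128:READ2 \
                256:SECONDARY 512:QCFAIL 1024:DUP 2048:SUPPLEMENTARY
    do
        bit=${pair%%:*}     # numeric bit value before the colon
        name=${pair#*:}     # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    echo "$out"
}

decode_flag 1804   # prints UNMAP,MUNMAP,SECONDARY,QCFAIL,DUP
```

If samtools is installed, `samtools flags 1804` should report the same set of bits.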

output:

  • [u]ncompressed bam
  • Sorted by name (-n) to prepare for the deduplication step
samtools fixmate -r xxx.PE2SE.dupmark.bam (tmp)  xxx.PE2SE.dupmark.bam.fixmate.bam (tmp)
  • Fills in mate coordinates, ISIZE (insert size), and mate-related flags from the name-sorted BAM, and removes secondary and unmapped reads (-r)
  • fixmate tries to compute several attributes, for example columns 7 and 8 (RNEXT and PNEXT) and tags such as MQ, Q2, and R2
samtools view -F 1804 -f 2 -u xxx.PE2SE.dupmark.bam.fixmate.bam | sambamba sort /dev/stdin -o xxx.PE2SE.filt.bam
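  • Remove reads with improper-mapping flags again. A likely reason: the MAPQ filter in the first step can remove one read of a pair, and fixmate then updates the surviving read's mate-related flags, so this second pass can drop the resulting orphan reads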
  • Sorted by coordinate (the sambamba sort default)
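
The effect of the second filter pass on an orphaned read can be sketched with flag arithmetic alone. This is an illustration under an assumption, not a description of samtools internals: it assumes that fixmate, on a paired read whose mate is no longer in the file, sets the "mate unmapped" bit (8) and clears the "proper pair" (2) and "mate reverse" (32) bits. Both function names are hypothetical.

```shell
#!/bin/sh
# Succeeds (exit 0) if a read with this flag would survive
# `samtools view -F 1804 -f 2`.
passes_filter() {
    [ $(( $1 & 1804 )) -eq 0 ] && [ $(( $1 & 2 )) -eq 2 ]
}

# ASSUMPTION: models what fixmate does to a paired read whose mate is gone:
# set mate-unmapped (8), clear proper-pair (2) and mate-reverse (32).
orphan_flag_after_fixmate() {
    echo $(( ($1 | 8) & ~2 & ~32 ))
}

before=99   # paired, proper pair, mate reverse, first in pair
after=$(orphan_flag_after_fixmate "$before")

for f in "$before" "$after"; do
    if passes_filter "$f"; then
        echo "flag $f: kept"
    else
        echo "flag $f: removed"
    fi
done
```

Under that assumption, a read that passed the first filter with flag 99 ends up with flag 73 after losing its mate, and the repeated -F 1804 -f 2 filter removes it.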

Dedup

  1. Use Picard (MarkDuplicates) to mark duplicates and collect QC metrics
    • The MarkDuplicates algorithm is described here
    • output: xxx.PE2SE.filt.dupmark.bam
  2. Then remove duplicates *
  3. Output the final bam file: PE2SE.nodup.bam
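
The three dedup steps could be sketched as below. This is one possible way to run them, not necessarily the exact commands behind these notes. Assumptions: the Picard jar location (PICARD_JAR), the metrics file name (xxx.PE2SE.dup.qc), the xxx prefix on the final file, and the helper names run and dedup_bam are all made up for illustration. Duplicates are removed by reusing the -F 1804 filter, since 1804 includes the duplicate bit (1024).

```shell
#!/bin/sh
# ASSUMPTION: path to the Picard jar; override via the environment.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Mark duplicates with Picard, then drop them with samtools.
# $1 is the sample prefix (the "xxx" in the file names above).
dedup_bam() {
    prefix=$1
    # Step 1: mark (not remove) duplicates and write a metrics file for QC.
    run java -jar "$PICARD_JAR" MarkDuplicates \
        INPUT="${prefix}.PE2SE.filt.bam" \
        OUTPUT="${prefix}.PE2SE.filt.dupmark.bam" \
        METRICS_FILE="${prefix}.PE2SE.dup.qc" \
        VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true REMOVE_DUPLICATES=false &&
    # Steps 2 and 3: remove marked duplicates and write the final BAM.
    run samtools view -F 1804 -f 2 -b \
        -o "${prefix}.PE2SE.nodup.bam" "${prefix}.PE2SE.filt.dupmark.bam"
}
```

Calling `DRY_RUN=1 dedup_bam xxx` prints the two commands without running them, which is a cheap way to check the file names before starting a long job.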