Scripts for Mapping

Contents

Preparation
Before Mapping
Mapping Methods
Helper Scripts
- sam2fas.pl
- sam_filter.pl
Examples

Preparation

Our mapping pipeline currently adopts two mapping programs, i.e., Bowtie2 and BLAT, a mapping file manipulation tool, i.e., SAMtools, and a set of perl scripts. So here is a checklist:

BioPerl is installed.
Bowtie2 is installed and the enviornment variable PATH is pointing to its directory.
BLAT is installed and the enviornment variable PATH is pointing to its directory.
SAMtools is installed and the enviornment variable PATH is pointing to its directory.
The enviornment variable PATH is pointing to RACKJ subdirectories "scripts" and "scripts/pipeline".
The enviornment variable INC is created and containing RACKJ subdirectory "scripts/pipeline".

For bash users, setting on environment variables corresponds to adding something like the following lines into ~/.bash_profile.

PATH="/path/to/Bowtie2/executable:$PATH"
PATH="/path/to/BLAT/executable:$PATH"
PATH="/path/to/SAMtools/executable:$PATH"
PATH="/path/to/rackJ/scripts/:$PATH"
PATH="/path/to/rackJ/scripts/pipeline:$PATH"

INC="/path/to/rackJ/scripts/pipeline:$PATH"
export INC

Before Mapping

Upper- or lower-cased sequence headers

Although our mapping scripts work well regardless of capitalization of sequence headers, we found that it may affect presentation of visualization tools. For example, reads mapped to "chrm" may not been shown in some visualization tools simply because its corresponding chromosome inside was named "ChrM". So time would be saved if some check and/or modifications could have been done on (1) reference sequence headers, (2) chromosome names in GFF3 or canonical GFF files, and (3) the visualization tool you use.

preindex.pl

In some circumstance, especially when running multiple mapping tasks under a cluster environment like the PBS, we need to build indexes at first to avoid race condition on index building, which may cause some error when accessing reference sequences. This script simply builds indexes using Bio::DB::Fasta and faidx of SAMtools.

Usage

preindex.pl <inputFasta>

Parameters

inputFastq: the FASTA file to be indexed

fixFasta.pl

The faidx program of SAMtools may have problem when handling a FASTA file with sequence lines of inconsistent lengths. This script fixes a FASTA file so that the faidx program shall index the fixed version without any problem.

Usage

fixFasta.pl <inFasta> <outFasta>

Parameters

inFasta: the FASTA file to be fixed
outFasta: the FASTA file fixed

Mapping Methods

Mapping target

All our mapping scripts read three fixed parameters from the command line: (1) target, (2) query, and (3) output. The target is a string indicating which genome to be mapped but not a file nor a path to index prefix. Before executing Bowtie2 or BLAT, the target string will be translated into an appropriate parameter so that users can always use the same target string for all our mapping scripts.

To do this, a mapping table containing the translation of the target and actual target file (or index prefix) should be maintained and indicated by the environment variable TargetTableFile. Below is an example mapping table, in which <TAB> means the tab character.

#target<TAB>blat<TAB>bowtie2
TAIR10<TAB>/myDB/TAIR10.fasta<TAB>/myDB/TAIR10
HG19<TAB>/myDB/HG19.fasta<TAB>/myDB/HG19

In this example, /myDB/TAIR10.fasta will be used when calling BLAT with target string "TAIR10", and /myDB/HG19 will be used when calling Bowtie2 with target string "HG19".

MappingBlat.pl

This script calls BLAT and does some post processing on the mapping results.

Usage

MappingBlat.pl [options] <target> <query> <SAMout> <BLAT parameters>

Options/Parameters

-ID <threshold>: a float number of identity threshold for filtering SAM output. (default: -1, for no filtering; 0 for filtering unmapped reads)
-SamTmp <filename>: temporary SAM file for filtered queries. One record per query. (default: the null device)
-target <target>: user-defined target FASTA file, to replace that in the mapping table
-unmap: apply this to include unmapped records in the output
<target>: the target string
<query>: the query file in FASTA format
<SAMout>: the output file in SAM format (not BAM)
<BLAT parameters>: parameters to be passed to BLAT

Notes

This script automatically adds -out=pslx for BLAT. DO NOT add any other format options for BLAT.
Options -q=type -t=type were suggested for BLAT. For example, -q=rna -t=dna would be good for RNAseq, -q=dna -t=dna would be good for DNAseq, and -q=dnax -t=dnax would be good for translation-level mapping (but time-consuming).

MappingBowtie.pl

This script calls Bowtie2 and does some post processing on the mapping results.

Usage

MappingBowtie.pl [options] <target> <query> <SAMout> <Bowtie2 parameters>

Options/Parameters

-ID <threshold>: a float number of identity threshold for filtering SAM output. (default: -1, for no filtering; 0 for filtering unmapped reads)
-SamTmp <filename>: temporary SAM file for filtered queries. One record per query. (default: the null device)
-target <target>: user-defined target index prefix, to replace that in the mapping table
<target>: the target string
<query>: the query file in FASTA format
<SAMout>: the output file in SAM format (not BAM)
<Bowtie2 parameters>: parameters to be passed to Bowtie2

Notes

This script automatically adds -f --quiet for Bowtie2, so the query file must be in FASTA format.
The filtering procedure exactly processes what outputted by Bowtie2. So, for example, -ID 0.9 -SamTmp tmp.sam would let all reads without alignment of at least 0.9 identity recorded be in tmp.sam, including reads without any alignment. And -ID 0.9 -SamTmp tmp.sam plus --no-unal for Bowtie2 would let only aligned reads without alignment at least 0.9 identity recorded be in tmp.sam.

MappingBowtieBlat.pl

This script wraps MappingBowtie.pl and MappingBlat.pl. All query reads will be mapped by Bowtie through MappingBowtie.pl, and rest reads (reads not passing the identity filter) will be mapped by BLAT through MappingBlat.pl.

Usage

MappingBowtieBlat.pl <target> <query> <SAMout> -bowtie <MappingBowtie.pl options> -blat <MappingBlat.pl options>

Options/Parameters

<target>: the target string
<query>: the query file in FASTA format
<SAMout>: the output file in SAM format (not BAM)
<MappingBowtie.pl options>: options to be passed to MappingBowtie.pl, including those to be passed to Bowtie2
<MappingBlat.pl options>: options to be passed to MappingBlat.pl, including those to be passed to BLAT

Notes

This script automatically applies -SamTmp when invoking MappingBowtie.pl, which means, for MappingBowtie.pl, -ID should be appropriately set and -SamTmp should NOT be applied manually.
This script will append -ID 0 to <MappingBowtie.pl options> when there is no -ID option.

MappingBowtieBlat2.pl

This script wraps MappingBowtie.pl and MappingBlat.pl. All query reads will be mapped by Bowtie through MappingBowtie.pl, and rest reads (reads not passing the identity filter) will be processed by a two-pass application of MappingBlat.pl.

Usage

MappingBowtieBlat2.pl <target> <query> <SAMout> -bowtie <MappingBowtie.pl options> -blat1 <MappingBlat.pl 1st-pass options> -blat2 <MappingBlat.pl 2nd-pass options>

Options/Parameters

<target>: the target string
<query>: the query file in FASTA format
<SAMout>: the output file in SAM format (not BAM)
<MappingBowtie.pl options>: options to be passed to MappingBowtie.pl, including those to be passed to Bowtie2
<MappingBlat.pl 1st-pass options>: options to be passed to the first pass of MappingBlat.pl, including those to be passed to BLAT
<MappingBlat.pl 2nd-pass options>: options to be passed to the second pass of MappingBlat.pl, including those to be passed to BLAT

Notes

This script automatically applies -SamTmp when invoking MappingBowtie.pl and first pass of MappingBlat.pl, which means, for these two invocations, -ID should be appropriately set and -SamTmp should NOT be applied manually.
This script will append -ID 0 to <MappingBowtie.pl options> and <MappingBlat.pl 1st-pass options> when there is no -ID option.
A reasonable usage of this script is to apply different search strategies on the two BLAT invocations. For example, default parameters for the first, and finer parameters (e.g. -tilSize=9 -stepSize=3) for the second.

Mapping.pl -- the wrapper

This script is a super-wrapper of above four mapping scripts, it (1) splits an input FASTA/FASTQ file into the specified number of parts, (2) concurrently maps all parts using the specified mapping script, and (3) outputs mapping results in SAM or BAM (binary SAM) format.

Usage

Mapping.pl [options] <target> <query> <SAMfile> <MethodScript> <method parameters>

Options/Parameters

<target>: the target string
<query>: the query file in FASTA or FASTQ format
<SAMfile>: the output file in SAM or BAM (binary SAM) format, the format is decided according to the extension filename. The output format will be BAM if not decidable.
<MethodScript>: One of above four scripts, i.e., MappingBlat.pl, MappingBowtie.pl, MappingBowtieBlat.pl, and MappingBowtieBlat2.pl.
<method parameters>: options to be passed to the <MethodScript>
-space <X>: replace space characters in read IDs to <X> (default: '_', underline)
-slash <Y>: replace slash characters ('/') in read IDs to <Y> (default: '_', underline)
-split <Z>: split the query FASTA or FASTQ file into <Z> shares of FASTA files for mapping (default: 1, no splitting)
-table <altTable>: replace the mapping table by <altTable>
-tmpdir <tmpDir>: the place to generate a temporary directory for storing all intermediate files of mapping (default: current working directory)
-keep: keep all intermediate files; otherwise, those files, including the temporary direcotry containing them, will be removed
-sortname: sort SAM records in the output file by read IDs. This is the default sorting order.
-sortpos: sort SAM records in the output file by mapping coordinates
-showcmd: show the command of running this script and those commands invoked by this script

Notes

The -split option makes this script generates <Z> processes to run <MethodScript> for those <Z> parts of reads. So it is important not to set <Z> to exceed your computer's capacity.
It may not be necessary to apply -split option when taking MappingBowtie.pl as the method script because the Bowtie2 option -p can be applied for multi-threading.
No -SamTmp option for the method script is allowed when <Z> (-split) is greater than 1 because this will cause two or more processes writing to one same temporary SAM file.

Helper Scripts

sam2fas.pl

This script translates a SAM file into a FASTA file.

Usage

sam2fas.pl <inSAM> <outFASTA>

Options/Parameters

<inSAM>: the input SAM file
<outFASTA>: the output FASTA file

Notes

Sequences of reversed-and-complemented SAM records will be their reversed-and-complemented SEQ. That is, all sequences in the output FASTA file follow their very original direction.
This script exacly generates one sequence for each SAM records, regardless of redundancy.

sam_filter.pl

This script reads SAM records from the standard input and write reads with qualified records and reads without qualified records into two separate SAM files.

Usage

sam_filter.pl [-ID <idThreshold>] <passedSAM> <filteredSAM>

Options/Parameters

-ID <idThreshold>: identity threshold for filtering. Setting this threshold 0 will let those unmapped records filtered. (default: 0.9)
<passedSAM>: the output SAM file for reads with only qualified records
<filteredSAM>: the output SAM file for reads without any qualified records

Notes

This script works properly with SAM files whose records of the same read ID are consecutive. Generally speaking, SAM or BAM files made directly by most aligners and BAM files sorted by names should fit this requirement.
There will be only one record for each read in <filteredSAM>.
SAM headers from the input will be written to both output SAM files.

Examples

Mapping RNAseq reads using BLAT

With mapping table set as described above and read sequences in file MyQuery.fastq, the following command performs these operations:

splits input MyQuery.fastq into 20 FASTA files,
maps 20 FASTA files using 20 processes of BLAT,
translates output PSLX files into SAM files,
translates SAM files into BAM files,
merges BAM files into one single BAM file, and
sorts the BAM file by name.

Mapping.pl -split 20 TAIR10 MyQuery.fastq Result.bam MappingBlat.pl -q=rna -t=dna

Note that it would be better to make sure these 20 BLAT processes can be run concurrently without much context-switch, this can be observed by process mointors like top under UN*X.

Mapping.pl -split 20 x MyQuery.fastq Result.bam MappingBlat.pl -q=rna -t=dna -target /myDB/TAIR10.fasta

Alternatively, without mapping table, we may leave a dummy string for the <target> parameter and assign the genome file by -target option in <method parameters>.

Mapping RNAseq reads using Bowtie2 and then BLAT

The following command performs these operations:

maps reads using Bowtie2 with options -p 8 -k 30 --sensitive,
filters ailgnment results with identity threshold 0.9, and
maps unqualified reads using BLAT.
translates, merges, and sorts mapping results into one single BAM file which is sorted by name.

Mapping.pl TAIR10 MyQuery.fastq Result.bam MappingBowtieBlat.pl -bowtie -p 8 -k 30 --sensitive -ID 0.9 -blat -q=rna -t=dna

Again, it would be better to make sure that -p 8 for Bowtie2 can be performed appropriately. Also note that this command creates only one process for BLAT. The following steps do the same thing and run BLAT faster.

Mapping.pl TAIR10 MyQuery.fastq Result.bowtie.bam MappingBowtie.pl -p 8 -k 30 --sensitive -ID 0.9 -SamTmp Result.bowtieTmp.sam

Map reads with bowtie, filter alignments with identity threshold 0.9, and save unqualified reads/records in Result.bowtieTmp.sam. It is important not to apply the Bowtie2 option --no-unal so that unmapped reads would be also unqualified.

sam2fas.pl Result.bowtieTmp.sam Result.bowtieTmp.fasta

Translate those unqualified reads into a FASTA file.

Mapping.pl -split 20 TAIR10 Result.bowtieTmp.fasta Result.blat.bam MappingBlat.pl -q=rna -t=dna

Mapping with 20 processes of BLAT.

samtools merge -n Result.bam Result.blat.bam Result.bowtie.bam

Merge mapping results made by Bowtie2 and BLAT. cat of SAMtools would be much faster than merge but not sorted by name.

Translation level mapping using BLAT

The following command runs translation level BLAT against transcript sequences of TAIR10. Note that BLAT options -stepSize=1 minIdentity=15 -minScore=20 refine default BLAT parameters and cost much more running time. These BLAT parameters are all adjustable.

Mapping.pl -split 10 x MyQuery.fasta Result.bam MappingBlat.pl -q=dnax -t=dnax -stepSize=1 -minIdentity=15 -minScore=20 -target TAIR10_cdna_20101214_updated