Contents
- Preparation
- Align reads against the genome sequence
- Building genome annotation file: .cgff and .model
- RPKM computation and exon-level counting
- Comparing two samples
- Visualization
System Requirement:
- Java: Be sure that you have Java Runtime Environment (JRE) installed in your computer. You may download it here.
- Memory requirement: In practice, we found that RACKJ programs work well on 64-bit Linux machines. In 32-bit machines, better run java with -Xmx option.
RACKJ is specifically designed for analyzing reads of NGS (Next-Generation Sequencing) technologies longer than 50bps. Taking the advantage of "not-short" lengths, it is no longer a requirement to map reads to an "expanded genome". All we have to do in the mapping stage is to align reads against the genome using a tool like BLAT -- just be sure that (1) the genome and the genome annotation file (in GFF3 or a seq_gene.md file) are of the same version, and (2) the chromosome sequence names are the same with those in the genome annotation file.
Suppose that we are using BLAT to align reads against the genome, the commands we would apply are:
blat -q=rna -t=dna -out=pslx <GenomeFasta> <ShortReadFasta> <BlatOutputFile>
and
pslReps -minNearTopSize=70 <BlatOutputFile> <BetterOutputFile> <BlatPsrOutputFile>
Note that, in the second command, the parameter 70 is for reads of 75bp-long, adjust it if necessary. Also note that, in order to shorten the response time for the first command, you may split the short-read file for mapping them at the same time, and then merge the output files. This kind of tasks now can be easily done using our mapping scripts. For example, with Bowtie2 index built and mapping table written.
wdlin@head2:/RAID2/Projects/20130503_example$ cat DB/MappingTable
#target bowtie2 blat
tair10 /RAID2/Projects/20130503_example/DB/TAIR10_chr_all /RAID2/Projects/20130503_example/DB/TAIR10_chr_all.fas
wdlin@head2:/RAID2/Projects/20130503_example$ Mapping.pl -table DB/MappingTable tair10 ES.fastq ES.bam MappingBowtieBlat.pl -bowtie -p 10 --sensitive -k 10 -ID 0.95 -blat -q=rna -t=dna
Total query reads: 5925048
([Bowtie]68257_Query1.fasta) SAM ID filter (0.95) reads: 5925048 -> 4232249
(SAM to FASTA) 69031_bowtieTmp.sam -> 69031_BlatQuery.fas
Loaded 119667750 letters in 7 sequences
Searched 126959925 bases in 1692799 sequences
[samopen] SAM header is present: 7 sequences.
[bam_sort_core] merging from 2 files...
Total SAM output reads: 5852391
This command runs Bowtie2 for mapping, filters alignments with identity threshold 0.95, and then maps filtered reads using BLAT for mapping them with intronic gaps. In this example, 10 threads were created by Bowtie2 for mapping and only one process was used by BLAT. To speed up, you may try the faster commands as in our example, which create multiple processes for BLAT. Or, alternatively, the following command gives exactly the same output by creating 8 processes for running Bowtie and then BLAT separately.
wdlin@head2:/RAID2/Projects/20130503_example$ Mapping.pl -split 8 -table DB/MappingTable tair10 ES.fastq ES.bam MappingBowtieBlat.pl -bowtie --sensitive -k 10 -ID 0.95 -blat -q=rna -t=dna
Split processing...done
Total query reads: 5925048
([Bowtie]89437_Query3.fasta) SAM ID filter (0.95) reads: 740631 -> 531480
([Bowtie]89437_Query8.fasta) SAM ID filter (0.95) reads: 740631 -> 521930
(SAM to FASTA) 90233_bowtieTmp.sam -> 90233_BlatQuery.fas
Loaded 119667750 letters in 7 sequences
(SAM to FASTA) 90238_bowtieTmp.sam -> 90238_BlatQuery.fas
([Bowtie]89437_Query5.fasta) SAM ID filter (0.95) reads: 740631 -> 539655
Loaded 119667750 letters in 7 sequences
(SAM to FASTA) 90235_bowtieTmp.sam -> 90235_BlatQuery.fas
Loaded 119667750 letters in 7 sequences
([Bowtie]89437_Query1.fasta) SAM ID filter (0.95) reads: 740631 -> 511117
([Bowtie]89437_Query2.fasta) SAM ID filter (0.95) reads: 740631 -> 523996
([Bowtie]89437_Query7.fasta) SAM ID filter (0.95) reads: 740631 -> 530496
([Bowtie]89437_Query4.fasta) SAM ID filter (0.95) reads: 740631 -> 537080
(SAM to FASTA) 90231_bowtieTmp.sam -> 90231_BlatQuery.fas
([Bowtie]89437_Query6.fasta) SAM ID filter (0.95) reads: 740631 -> 536495
(SAM to FASTA) 90232_bowtieTmp.sam -> 90232_BlatQuery.fas
Loaded 119667750 letters in 7 sequences
(SAM to FASTA) 90237_bowtieTmp.sam -> 90237_BlatQuery.fas
(SAM to FASTA) 90234_bowtieTmp.sam -> 90234_BlatQuery.fas
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
(SAM to FASTA) 90236_bowtieTmp.sam -> 90236_BlatQuery.fas
Loaded 119667750 letters in 7 sequences
Loaded 119667750 letters in 7 sequences
Searched 15073200 bases in 200976 sequences
Searched 15686325 bases in 209151 sequences
Searched 16402575 bases in 218701 sequences
Searched 15266325 bases in 203551 sequences
Searched 15310200 bases in 204136 sequences
Searched 15760125 bases in 210135 sequences
Searched 16247625 bases in 216635 sequences
Searched 17213550 bases in 229514 sequences
[samopen] SAM header is present: 7 sequences.
[samopen] SAM header is present: 7 sequences.
[samopen] SAM header is present: 7 sequences.
[samopen] SAM header is present: 7 sequences.
[samopen] SAM header is present: 7 sequences.
[samopen] SAM header is present: 7 sequences.
[samopen] SAM header is present: 7 sequences.
[samopen] SAM header is present: 7 sequences.
[bam_sort_core] merging from 2 files...
Total SAM output reads: 5852391
Instead of GFF3 files and seq_gene.md files, RACKJ uses a very simple data format for storing genome annotations, namely, where is the range of one gene and where are its exons. RACKJ includes two modlues that translate GFF3 files and seq_gene.md files into this very simple data format, respectively. In this section, we show the GFF3 case for Arabidopsis thaliana. For the case of seq_gene.md files, please refer the SeqGeneMd program or download prebuild files.
Because a GFF3 file downloaded anywhere would contain any kind of values like "transposable_element_gene" analog to "gene" and and any kind of values like "ncRNA" analog to "mRNA" in the "type" field (column 3), we do not expect to have a fully-automatic program that correctly processes all kinds of GFF3 files. Instead of a fully-automatic solution, we give a semi-automatic solution. The first thing we have to do is to "observe" the contents of the GFF3 file we have.
wdlin@head2:/RAID2/Projects/20130503_example/DB$ java -classpath /home/wdlin/Project1/rackJ/rackj.jar misc.GffTree -I TAIR10_GFF3_genes_transposons.gff
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: GffTree
input GFF file (-I): TAIR10_GFF3_genes_transposons.gff
ID attribute (-idAttr): ID
parent attribute list (-parentAttr): [Parent, Derives_from]
wdlin@head2:/RAID2/Projects/20130503_example/DB$
In the above example, the GffTree program reads file TAIR10_GFF3_genes_transposons.gff and outputs a .features file, which tells the type-hierarchy in the GFF3 file. By observing the .features file, it is much easier to find which types are corresponding to genes, mRNA, and exon.
wdlin@head2:/RAID2/Projects/20130503_example/DB$ cat TAIR10_GFF3_genes_transposons.gff.features
GffRoot
chromosome*
gene*
mRNA*
protein*
exon
five_prime_UTR
CDS
three_prime_UTR
miRNA*
exon
tRNA*
exon
ncRNA*
exon
snoRNA*
exon
snRNA*
exon
rRNA*
exon
pseudogene*
pseudogenic_transcript*
pseudogenic_exon
transposable_element_gene*
mRNA*
exon
transposable_element*
transposon_fragment
Note that, in a .features file, asterisked type names are records with ID attributes. It means that records of these types are feasible to be something like genes or transcripts.
After observing the GFF3 file using GffTree, for this example, we decide to generate a .cgff file using the following command, where a .cgff file records genomic location of a "gene" and locations of its exons.
wdlin@head2:/RAID2/Projects/20130503_example/DB$ java -classpath /home/wdlin/Project1/rackJ/rackj.jar misc.CanonicalGFF -I TAIR10_GFF3_genes_transposons.gff -O tair10.strand.cgff -GE gene exon -GE pseudogene pseudogenic_exon -GE transposable_element_gene exon
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): TAIR10_GFF3_genes_transposons.gff
output filename (-O): tair10.strand.cgff
gene-exon feature pairs (-GE): [[gene, exon], [pseudogene, pseudogenic_exon], [transposable_element_gene, exon]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
wdlin@head2:/RAID2/Projects/20130503_example/DB$
This command treats records of type gene (pseudogene, transposable_element_gene) as genes, and exon (pseudogenic_exon, exon) as corresponding exons. For every gene-exon assignment (by -GE option), the CanonicalGFF program traverses the entire hierarchy described by the input GFF3 file, and looks for defined genes and their defined exon regions. Similarly, a .model file, which records defined transcripts and their defined exon regions, can be generated using the following command.
wdlin@head2:/RAID2/Projects/20130503_example/DB$ java -classpath /home/wdlin/Project1/rackJ/rackj.jar misc.CanonicalGFF -I TAIR10_GFF3_genes_transposons.gff -O tair10.strand.model -GE mRNA:miRNA:tRNA:ncRNA:snoRNA:snRNA:rRNA exon -GE pseudogenic_transcript pseudogenic_exon
ID attribute (-idAttr) not assigned, using default: ID
parent attribute list (-parentAttr) not assigned, using default: [Parent, Derives_from]
program: CanonicalGFF
input GFF file (-I): TAIR10_GFF3_genes_transposons.gff
output filename (-O): tair10.strand.model
gene-exon feature pairs (-GE): [[mRNA:miRNA:tRNA:ncRNA:snoRNA:snRNA:rRNA, exon], [pseudogenic_transcript, pseudogenic_exon]]
ID attribute (-idAttr): ID
parent attribute list (-parentAttr)): [Parent, Derives_from]
wdlin@head2:/RAID2/Projects/20130503_example/DB$
Note that the input option -GE mRNA:miRNA:tRNA:ncRNA:snoRNA:snRNA:rRNA exon equals to -GE mRNA exon -GE miRNA exon -GE tRNA exon -GE ncRNA exon -GE snoRNA exon -GE snRNA exon -GE rRNA exon, you may just shrink a number of input options like this if you find some types got the same type-mate. Also note that the major difference between a .cgff file and a .model file described here is that, a .model file describes transcripts and their exons, and a .cgff file "combines" multiple transcripts if they belong to the same gene. In the following picture, the two blue transcripts are separately recorded in the .model file, and the gray gene they belong to is recorded in the .cgff file with "combined" exons.
RPKM computation and exon-level read counting are pretty easy if mapping results (in this example, file ES.bam), the .cgff file, and the .model file are available. Just use the following command.
wdlin@head2:/RAID2/Projects/20130503_example$ java -classpath /home/wdlin/Project1/rackJ/rackj.jar:/home/wdlin/Tools/sam-1.89.jar rnaseq.RPKMComputer -GFF DB/tair10.strand.cgff -model DB/tair10.strand.model -M SAM ES.bam -O ES
program: RPKMComputer
canonical GFF filename (-GFF): DB/tair10.strand.cgff
mapping method/filename (-M):
SAM : ES.bam
output prefix (-O): ES
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): true
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): true
model filename (-model): DB/tair10.strand.model
#uniq reads: 5030009
#multi reads: 172286
#mapped reads: 5353305
wdlin@head2:/RAID2/Projects/20130503_example$
The most important task of the PRKMComputer program is to see which read belongs to which gene based on mapping results and genome annotation, and compute RPKM values (proposed by ERANGE). RACKJ considers a read as mapped to the genome if its best alignment identity of its alignments is above the identify cutoff (defined by -ID), and only alignments with the best identity are involved in the following computation. RACKJ also considers an alignment as a combination of alignment blocks without insertion/deletion, and two nearby blocks will be joined if the distance between them are less than or equal to the block join factor (defined by -J). Then, each alignments are examined using the following four parameters:
- -exon: using either only exonic regions or gene regions (including intronic region) to qualify alignments.
- -contain: qualifying alignments based on containment- or intersection-relationships between alignment blocks and qualifying regions.
- -min: minimum overlapping size for defining containment or intersection.
- -ALL: require all alignment blocks to be qualified?
For genomes of model organisms that are considered as with trustworthy genome annotations, the default setting of the above four parameters (respectively true, false, 8, and true) sould be enough. And we exemplify each option by visual comparison.
parameters: -exon true -contain true -min 6 -ALL true
|
|
parameters: -exon false -contain true -min 6 -ALL true
|
|
description: red-circled reads are included because changing qualifying regions from exonic regions to gene regions.
|
parameters: -exon true -contain true -min 6 -ALL true
|
|
parameters: -exon true -contain false -min 6 -ALL true
|
|
description: red-circled reads are included because qualifying alignments by intersection but not by containment.
|
parameters: -exon true -contain true -min 6 -ALL true
|
|
parameters: -exon true -contain true -min 10 -ALL true
|
|
description: blue-circled reads are excluded because the minimum overlapping size is extended from 6 to 10.
|
parameters: -exon true -contain true -min 6 -ALL true
|
|
parameters: -exon true -contain true -min 6 -ALL false
|
|
description: blue-circled reads are included because they have at least one alignment block contained by the exons.
|
The RPKMComputer computes not only RPKM values but also counts for exons and splicing events. In this example, these outputs will be stored in three tab-delimited text files: ES.geneRPKM, ES.exonCount, and ES.spliceCount, where these files are prefixed by the string "ES" (specified by the -O option).
A .geneRPKM file contains data of five columns, i.e., GeneID, Length(in Kbps), Number of reads, the RPKM value, and the ratio of multi-reads versus all reads. All computations follow the procedure described in the paper of ERANGE.
An example of .geneRPKM file |
GeneID | Length(Kbps) | #Reads | RPKM | multi/ALL |
AT1G01010 | 1.688 | 119.00 | 13.38 | 0 |
AT1G01020 | 1.774 | 56.00 | 5.99 | 0 |
AT1G01030 | 1.905 | 0.00 | 0.00 | 0 |
AT1G01040 | 6.254 | 265.99 | 8.07 | 0.003732696 |
AT1G01046 | 0.207 | 1.10 | 1.01 | 0.092840284 |
AT1G01050 | 0.976 | 393.90 | 76.60 | 0.002296971 |
AT1G01060 | 3.234 | 255.00 | 14.96 | 0 |
AT1G01070 | 1.45 | 84.00 | 10.99 | 0 |
AT1G01073 | 0.111 | 0.00 | 0.00 | 0 |
AT1G01080 | 1.322 | 27.00 | 3.88 | 0 |
AT1G01090 | 1.627 | 1153.00 | 134.49 | 0 |
AT1G01100 | 0.679 | 1677.54 | 468.89 | 3.23E-04 |
AT1G01110 | 1.851 | 52.00 | 5.33 | 0 |
In addition to RPKM values, the .exonCount and the .spliceCount files record counts of reads that are belonging to exons and splicing events in exon level, respectively. Their formats are quite common, the first three columns are GeneID, exon (or exon pair), and read count. For the .exonCount file, we append the exon length and the ratio of multi-reads versus all reads for each exon.
An example of .exonCount file |
GeneID | exonNo | #reads | exonLen | multi/ALL |
AT1G05350 | 1 | 28.0 | 362 | 0.0 |
AT1G05350 | 2 | 22.0 | 53 | 0.0 |
AT1G05350 | 3 | 34.0 | 119 | 0.0 |
AT1G05350 | 4 | 35.0 | 87 | 0.0 |
AT1G05350 | 5 | 30.0 | 75 | 0.0 |
AT1G05350 | 6 | 25.0 | 75 | 0.0 |
AT1G05350 | 7 | 30.0 | 83 | 0.0 |
AT1G05350 | 8 | 18.0 | 37 | 0.0 |
AT1G05350 | 9 | 42.0 | 156 | 0.0 |
AT1G05350 | 10 | 21.0 | 69 | 0.0 |
AT1G05350 | 11 | 38.0 | 129 | 0.0 |
AT1G05350 | 12 | 26.0 | 114 | 0.0 |
AT1G05350 | 13 | 15.0 | 69 | 0.0 |
AT1G05350 | 14 | 17.0 | 199 | 0.0 |
For the .spliceCount file, we append the "jumping" column for checking whether the splicing event is crossing some other exon, and the "novel" column for checking whether the splicing event is recorded in the given .model file (this column will be appended if the .model file is given).
An example of .spliceCount file. |
GeneID | exonPair | #reads | jumping | novel |
AT5G23020 | 1<=>2 | 41.0 |
AT5G23020 | 2<=>3 | 46.0 |
AT5G23020 | 3<=>4 | 34.0 |
AT5G23020 | 4<=>5 | 34.0 |
AT5G23020 | 5<=>6 | 41.0 |
AT5G23020 | 6<=>7 | 64.0 |
AT5G23020 | 7<=>8 | 39.0 |
AT5G23020 | 8<=>9 | 34.0 |
AT5G23020 | 8<=>10 | 3.0 | V | V |
AT5G23020 | 9<=>10 | 57.0 |
The following command outputs read counts of splicing events in nucleotide level in a .fineSplice file in addition to the .spliceCount file.
wdlin@head2:/RAID2/Projects/20130503_example$ java -classpath /home/wdlin/Project1/rackJ/rackj.jar:/home/wdlin/Tools/sam-1.89.jar rnaseq.FineSpliceCounter -GFF DB/tair10.strand.cgff -model DB/tair10.strand.model -M SAM ES.bam -O ES
program: FineSpliceCounter
canonical GFF filename (-GFF): DB/tair10.strand.cgff
mapping method/filename (-M):
SAM : ES.bam
output prefix (-O): ES
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): false
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): false
model filename (-model): DB/tair10.strand.model
#uniq reads: 5100972
wdlin@head2:/RAID2/Projects/20130503_example$
The format of the output .fineSplice file is very similar with that of the .exonCount and .spliceCount files except that the second column is for splicing pattern and the fourth column is for checking whether the splicing pattern is novel or not.
An example of .fineSplice file. |
GeneID | splice | #reads | novel |
AT5G23020 | 1(0)-2(0) | 41.0 |
AT5G23020 | 2(0)-3(0) | 46.0 |
AT5G23020 | 3[106-109] | 1.0 | V |
AT5G23020 | 3(0)-4(0) | 33.0 |
AT5G23020 | 3(+3)-4(-3) | 1.0 | V |
AT5G23020 | 4(0)-5(0) | 34.0 |
AT5G23020 | 5(0)-6(0) | 39.0 |
AT5G23020 | 5(+1)-6(-2) | 2.0 | V |
AT5G23020 | 6(0)-7(0) | 64.0 |
AT5G23020 | 7(0)-8(0) | 39.0 |
AT5G23020 | 8(0)-9(0) | 34.0 |
AT5G23020 | 8(0)-10(0) | 3.0 | V |
AT5G23020 | 9(-1)-10(+2) | 1.0 | V |
AT5G23020 | 9(0)-10(0) | 56.0 |
Also with the following command, we may compute a .intronCount file that gives read numbers related with intron retention, and is like a .exonCount file. Note that, with default parameters -exon false -contain false, an intron read which overlaps some other gene regions will not be considered as uniq to the intron. This can facilitate downstream computation.
wdlin@head2:/RAID2/Projects/20130503_example$ java -classpath /home/wdlin/Project1/rackJ/rackj.jar:/home/wdlin/Tools/sam-1.89.jar rnaseq.ExonCounter -GFF DB/tair10.strand.cgff -M SAM ES.bam -O ES -intronic true
program: ExonCounter
canonical GFF filename (-GFF): DB/tair10.strand.cgff
mapping method/filename (-M):
SAM : ES.bam
output prefix (-O): ES
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): false
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): true
count for intron (-intronic): true
#uniq reads: 5041042
#multi reads: 178070
#mapped reads: 5353305
wdlin@head2:/RAID2/Projects/20130503_example$
An example of .intronCount file |
GeneID | intronNo | #reads | intronLen | multi/ALL |
AT1G05350 | 1 | 0.0 | 103 | 0 |
AT1G05350 | 2 | 0.0 | 117 | 0 |
AT1G05350 | 3 | 0.0 | 105 | 0 |
AT1G05350 | 4 | 0.0 | 120 | 0 |
AT1G05350 | 5 | 0.0 | 134 | 0 |
AT1G05350 | 6 | 0.0 | 177 | 0 |
AT1G05350 | 7 | 0.0 | 81 | 0 |
AT1G05350 | 8 | 0.0 | 172 | 0 |
AT1G05350 | 9 | 0.0 | 278 | 0 |
AT1G05350 | 10 | 0.0 | 84 | 0 |
AT1G05350 | 11 | 0.0 | 96 | 0 |
AT1G05350 | 12 | 0.0 | 78 | 0 |
AT1G05350 | 13 | 1.0 | 274 | 0.0 |
Given two samples, suppose that corresponding .geneRPKM, .exonCount, .spliceCount, and .fineSplice files are ready. It is usually very easy to compare RPKM values of each gene in the two samples using software like Excel: just combine the two RPKM table and use some formula to compare two RPKM values for each gene. For example, we may apply the Z-statistic (p1-p2)/SQRT(p0*(1-p0)/500,000), where p0, p1, and p2 are (RPKM1+RPKM2)/2,000,000, RPKM1/1,000,000, and PRKM2/1,000,000, respectively (see Kal et al. 1999, PubMed ID: PMC25383). This Z-statistic should follow the standard normal distribution and the corresponding P-value can be easily computed using Excel with a formula like 2*(1-NORMSDIST(ABS(Z))).
In addition to comparing RPKM values, RACKJ provides a set of scripts that compare alternative splicing events in specified samples and thus find sample-sensitive alternative splicing events. In this example, we assume that the two samples are named ES and Fe, respectively. We also assume that these .geneRPKM, .exonCount, .spliceCount, and .fineSplice files are placed at the same directory and prefixed by "ES" and "Fe", respectively. The following command computes sample-sensitive exon-skipping events using SSEScompare.pl.
wdlin@head2:/RAID2/Projects/20130503_example$ SSEScompare.pl SSES.xls 75 ES Fe
This command outputs records like the following one, which means an exon-skipping event on exon 3 (count from 5' end of the genome, not gene) could be preferred in the "Fe" sample.
GeneID | exonPair | ES | Fe | depthES | depthFe | chiSquared | xES | xFe | xChiSquared |
AT5G59610 | 2<=>4 | 0 | 3 | 2.44 | 2.75 | 2.38 | 2 | 0 | 56.95 |
The following command computes sample-sensitive intron-retention events using SSIRcompare.pl.
wdlin@head2:/RAID2/Projects/20130503_example$ SSIRcompare.pl SSIR.xls ES ES Fe Fe
This command outputs records like the following one, which means an intron-retention event on intron 2 (count from 5' end of the genome, not gene) could be preferred in the "Fe" sample.
GeneID | intronNo | intronLength | intronES | intronFe | exonES | exonFe | chiSquared | uniqES | uniqFe | uniChiSquared | spliceES | spliceFe | spliceChiSquared |
AT5G15630 | 2 | 90 | 0 | 6 | 33 | 40 | 4.95 | 70 | 86 | 4.88 | 6 | 3 | 12.00 |
Finally, the following command computes sample-sensitive alternative donor/acceptor events using SSDAcompare.pl.
wdlin@head2:/RAID2/Projects/20130503_example$ SSDAcompare.pl SSDA.xls ES.fineSplice ES.geneRPKM ES Fe.fineSplice Fe.geneRPKM Fe /home/wdlin/Project1/rackJ/rackj.jar 0.2 DB/tair10.strand.cgff
Its output records is like the following one, which means a donor change event on intron 1 (count from 5' end of the genome, not gene) could be preferred in the "Fe" sample.
Gene | Splice1 | Database | Splice2 | Database | Donor/Acceptor Change | ES Splice1 | ES Splice2 | Fe Splice1 | Fe Splice2 | p-value | ES Uniq | Fe Uniq | p-value |
AT3G23090 | 1(0)-2(-12) | V | 1(0)-2(0) | V | D | 34 | 4 | 39 | 23 | 0.0048 | 549 | 745 | 0.0048 |
One of the simplest ways to visualize read mapping on the genome with genome annotation is to use a visualization tool like IGV to browse SAM files sorted and indexed using SAMtools. Alternatively, RACKJ provide two data generation programs and one visualization program that enable mapping visualizations in a batch.
After surveying potentially sample-sensitive alternative splicing events in last section, the first thing we may want to do is to see how reads mapped to those alternatively spliced genes. The visualization component of RACKJ contains two steps: data generation and visualization.
wdlin@head2:/RAID2/Projects/20130503_example$ java -classpath /home/wdlin/Project1/rackJ/rackj.jar:/home/wdlin/Tools/sam-1.89.jar rnaseq.GeneTracer -GFF DB/tair10.strand.cgff -model DB/tair10.strand.model -exon false -M SAM ES.bam -O Plot/ES -trace AT5G59610 -trace AT5G15630 -trace AT3G23090
program: GeneTracer
canonical GFF filename (-GFF): DB/tair10.strand.cgff
mapping method/filename (-M):
SAM : ES.bam
output prefix (-O): Plot/ES
block join factor (-J): 2
identity cutoff (-ID): 0.95
use exon region (-exon): false
check by containing (-contain, FALSE for by intersecting): false
minimum overlap (-min): 8
check all alignment blocks (-ALL): true
model filename (-model): DB/tair10.strand.model
tracing gene list (-trace): [AT5G59610, AT5G15630, AT3G23090]
traceing gene list in file (-traceFile): null
buffer size per gene (-buffer): 10240
#uniq reads: 5041042
#multi reads: 178070
#mapped reads: 5353305
wdlin@head2:/RAID2/Projects/20130503_example$
In this example, the GeneTracer program reads the mapping results (ES.bam) and records mapping locations of "qualified" reads for the three specified genes (AT5G59610, AT5G15630, and AT3G23090) at the same time. You may use option -trace to specify as many genes as you like, or use option -traceFile to specify a file that contains genes to be traced. Note that reads will be "qualified" as described in the RPKM computation section, and the default values of the four key parameters are true, false, 8, and true, respectively. In this example, we set -exon false for getting intronic reads as well as exonic reads.
Next, the ReadViz program reads a file outputted by GeneTracer and makes a picture of all gene models and all collected reads.
wdlin@head2:/RAID2/Projects/20130503_example$ java -classpath /home/wdlin/Project1/rackJ/rackj.jar:/home/wdlin/Tools/sam-1.89.jar graphics.ReadViz -I Plot/ES.AT5G59610 -O Plot/AT5G59610.ES.jpg
program: ReadViz
input filename (-I): Plot/ES.AT5G59610
output filename (-O): Plot/AT5G59610.ES.jpg
output mode (-mode): mapping
read height (-h): 2
include splice reads (-splice): true
include uniq reads (-uniq): true
include multi reads (-multi): true
direction (-direction): forward
auto setting (-auto): true
use first model (-F): true
plot size (-size): 1000
font (-font): Courier
auto start (-start): 24013100
auto stop (-stop): 24015100
auto plot scale (-scale): 200
wdlin@head2:/RAID2/Projects/20130503_example$
By doing the same steps for the Fe sample, we may visually compare the pictures of the same gene in the two samples and find an exon-skipping event in one sample.
Similarly, we shall see the second intron of one other gene is retained in the Fe sample.
wdlin@head2:/RAID2/Projects/20130503_example$ java -classpath /home/wdlin/Project1/rackJ/rackj.jar:/home/wdlin/Tools/sam-1.89.jar graphics.ReadViz -mode depthlog -I Plot/Fe.AT5G15630 -O Plot/AT5G15630.Fe.jpg
program: ReadViz
input filename (-I): Plot/Fe.AT5G15630
output filename (-O): Plot/AT5G15630.Fe.jpg
output mode (-mode): depthlog
read height (-h): 2
include splice reads (-splice): true
include uniq reads (-uniq): true
include multi reads (-multi): true
direction (-direction): forward
auto setting (-auto): true
use first model (-F): true
plot size (-size): 1000
font (-font): Courier
auto start (-start): 5084500
auto stop (-stop): 5087000
auto plot scale (-scale): 500
wdlin@head2:/RAID2/Projects/20130503_example$