Add-On Scripts

For scripts placed at RACKJ subdirectory "scripts/special".

Cooperation with the MEME suite
Other scripts
- PTCdetection.pl
- Interval2igv.pl

Cooperation with the MEME suite

One of the next important tasks after comparing gene expression levels and/or alternative splicing events is to find corresponding regulatory motifs in promoter regions or splicing sites. In this section, we describe a set of scripts that cooperate with the MEME suite. With these scripts, it shall be easy to generate input sequences for MEME and parse outputs of MAST. Further, the comparison script MastCompare.pl is capable of finding important motifs that are enirched in the input dataset against a given background dataset.

Notes

A number of scripts described in this section require BioPerl installed.

top

Generating promoter sequences

Promoter sequences can be generated using the java program special.PromoterCGFF and then the perl script SeqGen.pl. Conventionally, the mdust program from the (T)GICL project may be used for low-complexity removal.

top

Using MAST

After MEME computation, computed motif profiles may be used to scan input sequences (and background sequences, if any) for getting all potential motif sites. The following options of MAST are recommended for generating machine-readable and sufficient results for downstream analysis.

-hit_list: for machine-readable output
-mt 0.001: generate motif occurrences with P-value threhold 0.001. Lowering this threshold may cause unreasonably few occurrences of short motifs.
-norc: not to scan the reverse complement strands

top

MastAnnotation.pl

This script is used to append two additional columns, (1) distance between the hit site and squence end and (2) actuall sequence of the hit site, and translate a MAST output file into a tab-delimited file so that the MAST output data can be operated using software like Excel.

Usage

MastAnnotation.pl <mastIn> <fastaIn> <outFilename>

Options/Parameters

<mastIn>: the output file by MAST
<fastaIn>: the input FASTA file of MAST for <mastIn>
<outFilename>: the output filename

top

SeqGenAS.pl

This script is used to generate sequences around specified junctions of genes. A junction of a gene here is a pair of genome positions, and its notation is in one of the following forms.

form	position1	position2
exonA(relativePosA)-exonB(relativePosB)	stop of exonA + relativePosA	start of exonB - relativePosB
exonX[relativePosA-relativePosB]	start of exonX + relativePosA	start of exonX + relativePosB
exonA(relativePosA)-genomicPosB	stop of exonA + relativePosA	genomicPosB
genomicPosA-exonB(relativePosB)	genomicPosA	start of exonB - relativePosB
genomicPosA-genomicPosB	genomicPosA	genomicPosB
exonA<=>exonB	stop of exonA	start of exonB
intronX	this equals "intronX<=>intronX+1"

Note that, in all these forms, the numbering of exons/introns and relative positions follow the genome orientation but not gene orientation. This is for adopting results made by those sample-sensitive alternative splicing scripts, which are based on tables generated by rnaseq.RPKMComputer, rnaseq.ExonCounter, and rnaseq.FineSpliceCounter.

Given a junction defined, sequence extraction following the gene orientation is controlled by five parameters.

The first sequence will be the concatenation of (1) upstream w₁ base-pairs (including the 5' junction point) and (2) downstream w₂ base-pairs (excluding the 5' junction point).
The second sequence will be the concatenation of (1) upstream w₃ base-pairs (excluding the 3' junction point) and (2) downstream w₄ base-pairs (including the 3' junction point).
The two sequences will be merged into one single sequence if (they overlap each other AND the merge option is ON) OR (both w₂ and w₃ are -1).

Usage

SeqGenAS.pl <spliceList> <w1> <w2> <w3> <w4> <overlapMerge> <strandCGFF> <genomeFasta> <outFilename>

Options/Parameters

<spliceList>: a text file containing the junction list in two columns. The first column should be gene IDs and the second column should be in one of aforementioned forms.
<w1> <w2> <w3> <w4>: the four aforementioned controlling parameters w₁, w₂, w₃, and w₄.
<overlapMerge>: the aforementioned merge option. 1 for ON and 0 for OFF.
<strandCGFF>: a canonical GFF file with strand information
<genomeFasta>: the genome FASTA file
<outFilename>: the output FASTA filename

top

MastAnnotationAS.pl

After running MEME with sequences generated using SeqGenAS.pl and scanning these sequence using MAST with recommended options, this script appends the following five columns and saves the output into a tab-delimited file:

actuall sequence of the hit site,
start and stop positions of the hit on the genome, and
relative position of the hit start(stop) to the 5'(3') junction point.

Usage

MastAnnotationAS.pl <spliceList> <w1> <w2> <w3> <w4> <overlapMerge> <mastIn> <strandCGFF> <genomeFasta> <outFilename>

Options/Parameters

<spliceList> <w1> <w2> <w3> <w4> <overlapMerge> <strandCGFF> <genomeFasta>: must be the same with those when using SeqGenAS.pl
<mastIn>: the output file by MAST
<outFilename>: the output filename

top

MastCompare.pl

After scanning target sequences and background sequences using the same set of motif profiles made by MEME, this script compares the two MAST output files with the following assumption.

By selecting reasonable target sequences and background sequences, there should be an appropriate scanning P-value threshold that enriches occurrences of a motif in targets but not backgrounds if the motif does related with targets. By "appropriate" it means looser(more stringent) thresholds would result in more(less) occurrences in targets, and thus worse enrichment levels.

With this understanding, we define that a sequence is hit by a motif under MAST P-value threshold X if there is a hit on the sequence with P-value lower than or equal to X. In so doing, this script computs the following four numbers for every motif with a series of thresholds:

number of hit target sequences,
number of hit background sequences,
number of non-hit target sequences, and
number of non-hit background sequences.

With these four numbers, the fisher exact test tells the enrichment levels of a motif under the series of thresholds, and a motif shows low-high-low enirchment levels along the threshold axis should be a "related" motif. The following plot shows one such example (x-axis: MAST P-value threshold, y-axis: fisher exact test P-value), where this motif was found by MEME with E-value 14,000.

Usage

MastComapre.pl <targetMast> <backgroundMast> <numberTarget> <numberBackground> <backgroundContainingTarget> <rackjJARpath>

Options/Parameters

<targetMast>: MAST output file by scanning target sequences
<backgroundMast>: MAST output file by scanning background sequences
<numberTarget>: number of target sequences
<numberBackground>: number of background sequences
<backgroundContainingTarget>: 1 for background containing target and 0 for otherwise
<rackjJARpath>: the path to RACKJ JAR file

Notes

This script writes output to the standard output, redirecting it would be convenient.

top

Other scripts

top

PTCdetection.pl

This script computes whether a retained intron would cause premature termination codon (PTC). Its output table contains the following columns:

gene ID,
the intron to be retained,
transcript,
size of retained intron of the isoform,
intron size modulo 3,
whether the intron retention causes PTC,
whether a PTC is in the retained intron,
AA sequence translated from the transcript with the retained intron, and
AA sequence translated from the transcript until the retained intron.

Usage

PTCdetection.pl <strandCGFF> <cdsCGFF> <IDtable> <intronList> <genomeFasta> <outFilename>

Options/Parameters

<strandCGFF>: a canonical GFF file with strand information
<cdsCGFF>: a canonical GFF file gives coding transcripts and their coding regions
<IDtable>: a text file gives relationships between transcripts and genes. The first column should be transcripts and the second column should be gene IDs.
<intronList>: a text file containing the intron list in two columns. The first column should be gene IDs and the second column should intron numbers.
<genomeFasta>: the genome FASTA file
<outFilename>: the output filename

Notes

This script required BioPerl installed.
The numbering of introns follows genome orientation.
For example, <cdsCGFF> can be a file generated by misc.CanonicalGFF with the option -GE mRNA CDS, given the GFF3 file as in our RNAseq example.

top

Interval2igv.pl

This script translates interval data with item names into the IGV format for visualization using IGV so that overlapping intervals are associated with different values and can be viewed separately. The input data should be a tab-delimited file containing the four columns: (1) item name of the interval, (2) chromosome of the interval, (3) start point, and (4) stop point.

Usage

Interval2igv.pl <inFile> <rackjJARpath> <trackName>

Options/Parameters

<inFile>: the input interval file
<rackjJARpath>: the path to RACKJ JAR file
<trackName>: track name to be shown in IGV

Notes

This script writes output to the standard output, redirecting it would be convenient.

Contents

Notes

Usage

Options/Parameters

Usage

Options/Parameters

Usage

Options/Parameters

Usage

Options/Parameters

Notes

Usage

Options/Parameters

Notes

Usage

Options/Parameters

Notes