home

Add-On Scripts

For scripts placed at RACKJ subdirectory "scripts/special".

Contents


top

Cooperation with the MEME suite

One of the next important tasks after comparing gene expression levels and/or alternative splicing events is to find corresponding regulatory motifs in promoter regions or splicing sites. In this section, we describe a set of scripts that cooperate with the MEME suite. With these scripts, it shall be easy to generate input sequences for MEME and parse outputs of MAST. Further, the comparison script MastCompare.pl is capable of finding important motifs that are enirched in the input dataset against a given background dataset.

Notes

A number of scripts described in this section require BioPerl installed.


top

Generating promoter sequences

Promoter sequences can be generated using the java program special.PromoterCGFF and then the perl script SeqGen.pl. Conventionally, the mdust program from the (T)GICL project may be used for low-complexity removal.


top

Using MAST

After MEME computation, computed motif profiles may be used to scan input sequences (and background sequences, if any) for getting all potential motif sites. The following options of MAST are recommended for generating machine-readable and sufficient results for downstream analysis.
top

MastAnnotation.pl

This script is used to append two additional columns, (1) distance between the hit site and squence end and (2) actuall sequence of the hit site, and translate a MAST output file into a tab-delimited file so that the MAST output data can be operated using software like Excel.

Usage

MastAnnotation.pl <mastIn> <fastaIn> <outFilename>

Options/Parameters

  1. <mastIn>: the output file by MAST
  2. <fastaIn>: the input FASTA file of MAST for <mastIn>
  3. <outFilename>: the output filename

top

SeqGenAS.pl

This script is used to generate sequences around specified junctions of genes. A junction of a gene here is a pair of genome positions, and its notation is in one of the following forms.

formposition1position2
exonA(relativePosA)-exonB(relativePosB)
stop of exonA + relativePosA start of exonB - relativePosB
exonX[relativePosA-relativePosB]
start of exonX + relativePosA start of exonX + relativePosB
exonA(relativePosA)-genomicPosB
stop of exonA + relativePosA genomicPosB
genomicPosA-exonB(relativePosB)
genomicPosA start of exonB - relativePosB
genomicPosA-genomicPosB
genomicPosA genomicPosB
exonA<=>exonB
stop of exonA start of exonB
intronX
this equals "intronX<=>intronX+1"

Note that, in all these forms, the numbering of exons/introns and relative positions follow the genome orientation but not gene orientation. This is for adopting results made by those sample-sensitive alternative splicing scripts, which are based on tables generated by rnaseq.RPKMComputer, rnaseq.ExonCounter, and rnaseq.FineSpliceCounter.

Given a junction defined, sequence extraction following the gene orientation is controlled by five parameters.

  1. The first sequence will be the concatenation of (1) upstream w1 base-pairs (including the 5' junction point) and (2) downstream w2 base-pairs (excluding the 5' junction point).
  2. The second sequence will be the concatenation of (1) upstream w3 base-pairs (excluding the 3' junction point) and (2) downstream w4 base-pairs (including the 3' junction point).
  3. The two sequences will be merged into one single sequence if (they overlap each other AND the merge option is ON) OR (both w2 and w3 are -1).

Usage

SeqGenAS.pl <spliceList> <w1> <w2> <w3> <w4> <overlapMerge> <strandCGFF> <genomeFasta> <outFilename>

Options/Parameters

  1. <spliceList>: a text file containing the junction list in two columns. The first column should be gene IDs and the second column should be in one of aforementioned forms.
  2. <w1> <w2> <w3> <w4>: the four aforementioned controlling parameters w1, w2, w3, and w4.
  3. <overlapMerge>: the aforementioned merge option. 1 for ON and 0 for OFF.
  4. <strandCGFF>: a canonical GFF file with strand information
  5. <genomeFasta>: the genome FASTA file
  6. <outFilename>: the output FASTA filename

top

MastAnnotationAS.pl

After running MEME with sequences generated using SeqGenAS.pl and scanning these sequence using MAST with recommended options, this script appends the following five columns and saves the output into a tab-delimited file:

  1. actuall sequence of the hit site,
  2. start and stop positions of the hit on the genome, and
  3. relative position of the hit start(stop) to the 5'(3') junction point.

Usage

MastAnnotationAS.pl <spliceList> <w1> <w2> <w3> <w4> <overlapMerge> <mastIn> <strandCGFF> <genomeFasta> <outFilename>

Options/Parameters

  1. <spliceList> <w1> <w2> <w3> <w4> <overlapMerge> <strandCGFF> <genomeFasta>: must be the same with those when using SeqGenAS.pl
  2. <mastIn>: the output file by MAST
  3. <outFilename>: the output filename

top

MastCompare.pl

After scanning target sequences and background sequences using the same set of motif profiles made by MEME, this script compares the two MAST output files with the following assumption.

By selecting reasonable target sequences and background sequences, there should be an appropriate scanning P-value threshold that enriches occurrences of a motif in targets but not backgrounds if the motif does related with targets. By "appropriate" it means looser(more stringent) thresholds would result in more(less) occurrences in targets, and thus worse enrichment levels.

With this understanding, we define that a sequence is hit by a motif under MAST P-value threshold X if there is a hit on the sequence with P-value lower than or equal to X. In so doing, this script computs the following four numbers for every motif with a series of thresholds:

  1. number of hit target sequences,
  2. number of hit background sequences,
  3. number of non-hit target sequences, and
  4. number of non-hit background sequences.
With these four numbers, the fisher exact test tells the enrichment levels of a motif under the series of thresholds, and a motif shows low-high-low enirchment levels along the threshold axis should be a "related" motif. The following plot shows one such example (x-axis: MAST P-value threshold, y-axis: fisher exact test P-value), where this motif was found by MEME with E-value 14,000.

Usage

MastComapre.pl <targetMast> <backgroundMast> <numberTarget> <numberBackground> <backgroundContainingTarget> <rackjJARpath>

Options/Parameters

  1. <targetMast>: MAST output file by scanning target sequences
  2. <backgroundMast>: MAST output file by scanning background sequences
  3. <numberTarget>: number of target sequences
  4. <numberBackground>: number of background sequences
  5. <backgroundContainingTarget>: 1 for background containing target and 0 for otherwise
  6. <rackjJARpath>: the path to RACKJ JAR file

Notes

This script writes output to the standard output, redirecting it would be convenient.


top

Other scripts

top

PTCdetection.pl

This script computes whether a retained intron would cause premature termination codon (PTC). Its output table contains the following columns:

  1. gene ID,
  2. the intron to be retained,
  3. transcript,
  4. size of retained intron of the isoform,
  5. intron size modulo 3,
  6. whether the intron retention causes PTC,
  7. whether a PTC is in the retained intron,
  8. AA sequence translated from the transcript with the retained intron, and
  9. AA sequence translated from the transcript until the retained intron.

Usage

PTCdetection.pl <strandCGFF> <cdsCGFF> <IDtable> <intronList> <genomeFasta> <outFilename>

Options/Parameters

  1. <strandCGFF>: a canonical GFF file with strand information
  2. <cdsCGFF>: a canonical GFF file gives coding transcripts and their coding regions
  3. <IDtable>: a text file gives relationships between transcripts and genes. The first column should be transcripts and the second column should be gene IDs.
  4. <intronList>: a text file containing the intron list in two columns. The first column should be gene IDs and the second column should intron numbers.
  5. <genomeFasta>: the genome FASTA file
  6. <outFilename>: the output filename

Notes

  1. This script required BioPerl installed.
  2. The numbering of introns follows genome orientation.
  3. For example, <cdsCGFF> can be a file generated by misc.CanonicalGFF with the option -GE mRNA CDS, given the GFF3 file as in our RNAseq example.

top

Interval2igv.pl

This script translates interval data with item names into the IGV format for visualization using IGV so that overlapping intervals are associated with different values and can be viewed separately. The input data should be a tab-delimited file containing the four columns: (1) item name of the interval, (2) chromosome of the interval, (3) start point, and (4) stop point.

Usage

Interval2igv.pl <inFile> <rackjJARpath> <trackName>

Options/Parameters

  1. <inFile>: the input interval file
  2. <rackjJARpath>: the path to RACKJ JAR file
  3. <trackName>: track name to be shown in IGV

Notes

This script writes output to the standard output, redirecting it would be convenient.