home

Useful Scripts

Contents


top

Sequence Processing

top

FASTQ => FASTA

perl -pe 's|@|>|;s|.*||s if $.%4==3 || $.%4==0;close $ARGV if eof' Your.*.fastq > output.fasta

This perl one-liner commend helps transfer FASTQ file(s) into a FASTA file.
top

FastqFilter.pl

Used to filter sequences in a FASTQ file according to base calling scores.

Usage

FastqFilter.pl <inputFastq> <trimLength> <qualityType> <qualityFilter> <qualifiedRatio> <outputFastq>

Parameters

  1. inputFastq: input filename of the FASTQ file
  2. trimLength: trim length, every sequence will be trimmed to this length before filtering if this is set greater than 0.
  3. qualityType: type of quality encoding, currently available: Sanger, Solexa/illumina1.0, illumina1.3, illumina1.5, illumina1.8. See here for reference.
  4. qualityFilter: a quality score as a filter for deciding a base is qualified or not. 20 would be a good choice.
  5. qualifiedRatio: a sequence is qualified if it has more than or equal to this ratio of bases are qualified (above qualityFilter).
  6. outputFastq: output filename

Notes

This script accepts only four-line-per-sequence FASTQ files.
top

ShuffleFasta.pl

Sometimes mate-pair reads are stored in two files separated for each ends, this script is used to shuffle them for further processing. RACKJ's mate-pair facility works only for suffled sequence data.

Usage

ShuffleFasta.pl <inFasta1> <inFasta2> <outFasta>

Parameters

  1. inFasta1: input FASTA file of one end
  2. inFasta2: input FASTA file of the other end
  3. outFasta: output FASTA file

top

ShuffleFastq.pl

Sometimes mate-pair reads are stored in two files separated for each ends, this script is used to shuffle them for further processing. RACKJ's mate-pair facility works only for suffled sequence data.

Usage

ShuffleFastq.pl <inFastq1> <inFastq2> <outFastq>

Parameters

  1. inFastq1: input FASTQ file of one end
  2. inFastq2: input FASTQ file of the other end
  3. outFastq: output FASTQ file

top

TrimFasta.pl

Simply trim sequences in a FASTA file.

Usage

TrimFasta.pl <inFasta> <trimLength> <outFasta>

Parameters

  1. inFasta: input FASTA file
  2. trimLength: trim length
  3. outFasta: output FASTA file

Notes

This script can also be used for trimming a FASTQ file.


top

SplitFasta.pl

Split one single FASTA file into multiple FASTA files. This would be useful if the splitted files are to be mapped to a genome at the same time.

Usage

SplitFasta.pl <inFasta> <numSeq>

Parameters

  1. inFasta: input FASTA file
  2. numSeq: number of sequences per output file

top

SplitFastq.pl

Split one single FASTQ file into multiple FASTQ files. This would be useful if the splitted files are to be mapped to a genome at the same time.

Usage

SplitFastq.pl <inFastq> <numSeq>

Parameters

  1. inFastq: input FASTQ file
  2. numSeq: number of sequences per output file

top

Alignment Manipulation

top

psl2sam.pl

This psl2sam.pl is modified from the psl2sam.pl in the SAMtools project. We modified it for the following improvements:
  1. implementation of SQ headers,
  2. precise reflection of BLAT alignment blocks, original psl2sam.pl would transfer an alignment with CIGAR 66M2I1D6M to be with CIGAR 67M1I6M, which misses one mismatch,
  3. implementation of NM and XM tags, where the latter is implemented by Bowtie2 for number of mismatches,
  4. implementation of MD tags, and
  5. translation of PSL output made by the BLAT program with options -q=dnax -t=dnax (translation-level BLAT).

Usage

psl2sam.pl [-u <chrMappingFile>] [-md] <genomeFasta> <inPSL>

Parameters/options

  1. -u <chrMappingFile>: a chromosome ID mapping file used when translating PSL records into SAM records. Earlier rnaseq.AlignmentFilter would produce lower-cased chromosome IDs which may result in problems with visualization tools like IGV. This problem is now corrected.
  2. -md: apply this option if MD tags are needed.
  3. genomeFasta: the genome FASTA file for mapping reads
  4. inPSL: a PSL format file of alignment records

Notes

  1. This script requires BioPerl installed.
  2. This script outputs to standard output.
  3. This script does not store SEQ fields. See samGetSEQ.pl and samGetSEQfast.pl for filling SEQ.

top

samGetSEQ.pl

This script is for adding SEQ fields to SAM records.

Usage

samGetSEQ.pl <inSAM> [<FASTA>]+

Parameters/options

  1. inSAM: input SAM file. Use STDIN for reading from standard input.
  2. FASTA: FASTA file of reads, at least one FASTA file should be inputted.

Notes

  1. This script outputs to standard output.
  2. It is a good idea to pipe psl2sam.pl and samGetSEQ.pl.

top

samGetSEQfast.pl

This script is for adding SEQ fields to SAM records, a faster version of samGetSEQ.pl.

Usage

samGetSEQfast.pl [-unmap] <inSAM> [<FASTA>]+

Parameters/options

  1. -unmap: apply this option to include unaligned reads into the output.
  2. inSAM: input SAM file. Use STDIN for reading from standard input.
  3. FASTA: FASTA file of reads, at least one FASTA file should be inputted.

Notes

  1. This script outputs to standard output.
  2. This script guarantees filling all SEQ fields of the input SAM if read IDs in the SAM is a subsequence of those in FASTA files.
  3. It is a good idea to pipe psl2sam.pl and samGetSEQfast.pl.

top

GFF/CGFF Manipulation

top

GFFExtractor.pl

Extract specified records from a GFF3 file. This script extracts records with the specified feature (column 3 of the GFF file) with specified attribute values (column 9 of the GFF file).

Usage

GFFExtractor.pl <GffFilename> <outputFilename> [<feature> [<attr>]]

Parameters

  1. GffFilename: input GFF3 file
  2. outputFilename: output filename
  3. feature: the type of the feature to be extracted. This will be "gene" if not specified.
  4. attr: the attribute to be extracted. This will be "ID" if not specified.

top

CGFFStatistics.pl

This script gives some basic statistics of intronic intervals and intergenic intervals in a canonical GFF file (made by misc.CanonicalGFF or misc.SeqGeneMd).

Usage

CGFFStatistics.pl <inputCGFF> [<step> <maximum> <outputFileName>]

If step, maximum, and outputFileName are given, the following five columns will be reported in the output file: (1) distance, controlled by step and maximum, (2) ratio of intergenic intervals shorter than the distance, (3) ratio of intronic intervals shorter than the distance, (4) precision of itme 2, and (5) corresponding F-measure.

Parameters

  1. inputCGFF: input canonical GFF file
  2. step: step of distance in the report file
  3. maximum: maximum distance in the report file
  4. outputFileName: output filename

top

SeqGen.pl

Extract sequences from a FASTA file based on genomic locations in a canonical GFF file (made by misc.CanonicalGFF or misc.SeqGeneMd).

Usage

SeqGen.pl [-n <lineLength>] <outputFasta> <genomeFasta> <CGFF> [<strandFile>]

Parameters

  1. -n <lineLength>: output FASTA will contain this many characters per line if this option is specified.
  2. outputFasta: output file
  3. genomeFasta: a FASTA file of genome sequences
  4. CGFF: the canonical GFF file
  5. strandFile: a 2-column file containing gene IDs and strands (+ and -). If this file is given or the CGFF file is with strand information, output sequences will follow the specified strands.

Notes

This script requires BioPerl installed.
top

IntronicCGFF.pl

Transfer a canonical GFF file (made by misc.CanonicalGFF or misc.SeqGeneMd, which usually describe exon regions) into an intronic one. This would be useful for sample-sensitive intron-retention computation.

Usage

IntronicCGFF.pl <inputCGFF> <outputCGFF>

Parameters

  1. inputCGFF: input CGFF file
  2. outputCGFF: output CGFF file

Notes


top

Comparisons

top

RPKMcorrelation.pl

Compute pearson correlation coeficient for all pairs of given samples.

Usage

RPKMcorrelation.pl [-read] [-uniq] [-log] <geneRPKM_1> [<geneRPKM_i>]+

Parameters

  1. -read: apply this option to compute correlation based on read counts but not RPKM values
  2. -uniq: apply this option to compute correlation based on uniq-read counts
  3. -log: apply this option to compute correlation based on log values, zeros will be replaced by one half of the minimum of all non-zero values
  4. geneRPKM_i: .geneRPKM filename of i-th sample

Notes

This script requires the Statistics::Basic module.
top

SSEScompare.pl

Sample-sensitive exon-skipping (ES) comparison.

Usage

SSEScompare.pl <outputFilename> <readLength> <prefix_1> [<prefix_i>]+

Suppose that a number of executions of rnaseq.RPKMComputer had been done in the working directory with different output prefixes (respectively for different samples), this script performs a sample-sensitive ES comparison for samples specified by these prefixes. Suppose n prefixes are inputted in the command line, this script outputs a table that contains the following information:
  1. the ES event: gene ID in column 1 and an exon pair in column 2,
  2. n columns of numbers of reads supporting the ES event,
  3. n columns of coverage depths (number of times that the gene been covered by its reads),
  4. a chi-squared value (goodness-of-fit) comparing item B taking item C as the background,
  5. n columns of numbers of splice-reads that involve exactly one skipped exon and one of the exon pair, and
  6. a minimum chi-squared value (goodness-of-fit) of (i) comparing item B taking item E as the background and (ii) comparing item E taking item B as the background.

Parameters

  1. outputFilename: output file
  2. readLength: read length
  3. prefix_i: output prefix of the rnaseq.RPKMComputer execution of i-th sample

Notes

With the output table, an Excel function like CHIDIST() would help getting corresponding P-values for aforementioned item F (degree of freedom = number of samples -1), which helps find ES events with statistical significance. But be aware that it is still very possible to have false positive ES events reported. The following approach would help get rid of those false positive ES events:
  1. Use larger -min for rnaseq.RPKMComputer, with some loss of sensitivity.
  2. Take a look at corresponding .fineSplice file, a true positive may come with agreements with known splicing sites and/or be supported by reads with various splicing positions on them.
  3. Of course, visualization would give fast judgement than looking those .fineSplice files.

top

SSEScompare_ex.pl

Sample-sensitive exon-skipping (ES) comparison, the computation script. This script is warpped by SSEScompare.pl. Use this script when there is an inconsistency between prefixes of .spliceCount and .geneRPKM files of some sample.

Usage

SSEScompare_ex.pl <outputFilename> <readLength> <spliceCount_1> <geneRPKM_1> <name_1> [<spliceCount_i> <geneRPKM_i> <name_i>]+

Parameters

  1. outputFilename: output file
  2. readLength: read length
  3. spliceCount_i: .spliceCount filename of i-th sample
  4. genePRKM_i: .geneRPKM filename of i-th sample
  5. name_i: name of i-th sample

Notes

See SSEScompare.pl for detailed description.
top

SSIRcompare.pl

Sample-sensitive intron-retention (IR) comparison.

Usage

SSIRcompare.pl <outputFilename> <intronPrefix_1> <exonPrefix_1> [<intronPrefix_i> <exonPrefix_i>]+

Suppose that a number of executions of rnaseq.RPKMComputer have been done in the working directory, and the same number of executions of rnaseq.ExonCounter for .intronCount files have been done in the same working directory. This script performs a sample-sensitive IR comparison for samples specified by the intronPrefixes and exonPrefixes. Suppose n intronPrefixes and n exonPrefixes are inputted in the command line, this script outputs a table contains the following information:
  1. the IR event: gene ID in column 1, intron number in column 2, and intron length in column 3,
  2. n columns of numbers of reads supporting the IR event,
  3. n columns of numbers of reads belonging to neighboring exons,
  4. a chi-squared value (goodness-of-fit) comparing item B taking item C as the background,
  5. n columns of numbers of reads belonging to the gene,
  6. a chi-squared value (goodness-of-fit) comparing item B taking item E as the background,
  7. n columns of numbers of splicing reads spanning the intron, and
  8. a chi-squared value (goodness-of-fit) comparing item B taking item G as the background.

Parameters

  1. outputFilename: output file
  2. readLength: read length
  3. intronPrefix_i: output prefix of rnaseq.ExonCounter execution of i-th sample for intron reads
  4. exonPrefix_i: output prefix of rnaseq.RPKMComputer execution of i-th sample

Notes

With the output table, an Excel function like CHIDIST() would help getting corresponding P-values for aforementioned items D and F (degree of freedom = number of samples -1), which means IR events with significant statistical interference. But be aware that it is still very possible to have false positive IR events. One of the reasons might be incorrectness of genome annotation: sometimes an actual exon appears in an intronic interval of the database, and differential expression may make the "IR" event significant. It is a good idea to get visualization before doing any experiment.
top

SSIRcompare_ex.pl

Sample-sensitive intron-retention (IR) comparison, the computation script. This script is warpped by SSIRcompare.pl. Use this script when there is an inconsistency between prefixes of .intronCount, .exonCount, .geneRPKM, and .spliceCount files of some sample.

Usage

SSIRcompare_ex.pl <outputFilename> <intronCount_1> <exonCount_1> <geneRPKM_1> <spliceCount_1> <name_1> [<intronCount_i> <exonCount_i> <geneRPKM_i> <spliceCount_i> <name_i>]+

Parameters

  1. outputFilename: output file
  2. intronCount_i: .intronCount filename of i-th sample
  3. exonCount_i: .exonCount filename of i-th sample
  4. genePRKM_i: .geneRPKM filename of i-th sample
  5. spliceCount_i: .spliceCount filename of i-th sample
  6. name_i: name of i-th sample

Notes

See SSIRcompare.pl for detailed description.
top

SSDAcompare.pl

Sample-sensitive alternative donor/acceptor (DA) site comparison.

Usage

SSDAcompare.pl <outputFilename> <fineSplice_1> <geneRPKM_1> <name_1> <fineSplice_2> <geneRPKM_2> <name_2> <rackjJARpath> <PvalTh> [<strandCGFF>]

Suppose that read data of two samples have been processed by rnaseq.RPKMComputer (or plus by rnaseq.FineSpliceCounter). For each gene, for every pair of splicing patterns that sharing the same exon pair, this script tests if there is some preference for one of the splicing pattern using the fisher exact test. The output table contains:
  1. gene ID
  2. splicing pattern 1: the first splicing pattern, plus one additional column indicating this pattern in the database or not if the novel column is present in the input .fineSplice files.
  3. splicing pattern 2: the second splicing pattern, plus one additional column indicating this pattern in the database or not if the novel column is present in the input .fineSplice files.
  4. a string indicating the change between the two splicing pattern is of donor and/or acceptor if strandCGFF is given.
  5. numbers of reads that supporting the first (or second) splicing pattern in the first (or second) sample.
  6. P-value of fisher exact test based on the four read numbers.
  7. numbers of uniq-reads of this gene in the two samples.
  8. P-value of fisher exact test based on numbers of uniq-reads of the gene in the two samples and numbers of splice-reads of the only splicing pattern NOT in database (or the splicing pattern with fewer total reads).

Parameters

  1. outputFilename: output file
  2. fineSplice_i: .fineSplice filename of i-th sample
  3. genePRKM_i: .geneRPKM filename of i-th sample
  4. name_i: name of i-th sample
  5. rackjJARpath: the path to RACKJ JAR file
  6. PvalTh: P-value threshold. Records with P-value greater than this threshold will not be reported.
  7. strandCGFF: a canonical GFF file with strand information. The string indicating the change between the two splicing pattern is of donor and/or acceptor will be reported if this parameter is given.

top

SSDAcompare_chisqr.pl

Sample-sensitive alternative donor/acceptor (DA) site comparison, the chi-squared version.

Usage

SSDAcompare_chisqr.pl [-minSplice <minSpliceReads>] [-minDB <minDBmatch>] <outputFilename> <chiSqrTh> <strandCGFF> <fineSplice_1> <geneRPKM_1> <name_1> [<fineSplice_i> <geneRPKM_i> <name_i>]+

This chi-squared version of SSDAcompare.pl supports the comparison of multiple libraries (can be more than two). The major difference between this script and SSDAcompare.pl is that this script gives chi-squared values (goodness-of-fit) but not P-values made by fisher exact tests.

Parameters

  1. -minSplice <minSpliceReads>: set to filter out those records of any splicing pattern with total splice-reads less than minSpliceReads
  2. -minDB <minDBmatch>: set to filter out those records with less than minDBmatch splicing patterns in database
  3. outputFilename: output file
  4. chiSqrTh: records with chi-squared values less than this threshold will not be reported
  5. strandCGFF: a canonical GFF file with strand information
  6. fineSplice_i: .fineSplice filename of i-th sample
  7. genePRKM_i: .geneRPKM filename of i-th sample
  8. name_i: name of i-th sample

Notes

With the output table, an Excel function like CHIDIST() would help getting corresponding P-values (degree of freedom = number of samples -1), which means alternative DA events with significant statistical interference.
top

CoveredRegionDiff.pl

Given two .geneCoverage files, this script computes the maximum difference between the first(last) base-pair of getting transcription evidence for each gene.

Usage

CoveredRegionDiff.pl <outputFilename> <strandCGFF> <prefix1> <prefix2>

The output file contains the following columns:
  1. gene ID
  2. gene start position (5' of the genome)
  3. strand of the gene
  4. number of reads in the first sample
  5. number of reads in the second sample
  6. the first base-pair of getting transcription evidence (5' of the genome) in the first sample
  7. the last base-pair of getting transcription evidence (3' of the genome) in the first sample
  8. the first base-pair of getting transcription evidence (5' of the genome) in the second sample
  9. the last base-pair of getting transcription evidence (3' of the genome) in the second sample
  10. the maximum difference between the first(last) base-pair of getting transcription evidence
  11. the ratio between the maximum difference and the transcription length

Parameters

  1. outputFilename: output filename
  2. strandCGFF: a canonical GFF file with strand information
  3. prefix1: prefix of the first .geneCoverage file
  4. prefix2: prefix of the second .geneCoverage file

top

Visualization

top

Visualization.pl

Help doing visualization tasks automatically.

Usage

We strongly suggest to make one individual copy of this script to the working directory (even rename it), and make necessary changes according to comments inside the script. That means, this script need to be edited to work well. It automatically calls rnaseq.RegionTracer (or rnaseq.GeneTracer) and graphics.ReadViz for making plots.