Contents
perl -pe 's|@|>|;s|.*||s if $.%4==3 || $.%4==0;close $ARGV if eof' Your.*.fastq > output.fasta
This perl one-liner commend helps transfer FASTQ file(s) into a FASTA file.
Used to filter sequences in a FASTQ file according to base calling scores.
Usage
FastqFilter.pl <inputFastq> <trimLength> <qualityType> <qualityFilter> <qualifiedRatio> <outputFastq>
Parameters
- inputFastq: input filename of the FASTQ file
- trimLength: trim length, every sequence will be trimmed to this length before filtering if this is set greater than 0.
- qualityType: type of quality encoding, currently available: Sanger, Solexa/illumina1.0, illumina1.3, illumina1.5, illumina1.8. See here for reference.
- qualityFilter: a quality score as a filter for deciding a base is qualified or not. 20 would be a good choice.
- qualifiedRatio: a sequence is qualified if it has more than or equal to this ratio of bases are qualified (above qualityFilter).
- outputFastq: output filename
Notes
This script accepts only four-line-per-sequence FASTQ files.
Sometimes mate-pair reads are stored in two files separated for each ends, this script is used to shuffle them for further processing. RACKJ's mate-pair facility works only for suffled sequence data.
Usage
ShuffleFasta.pl <inFasta1> <inFasta2> <outFasta>
Parameters
- inFasta1: input FASTA file of one end
- inFasta2: input FASTA file of the other end
- outFasta: output FASTA file
Sometimes mate-pair reads are stored in two files separated for each ends, this script is used to shuffle them for further processing. RACKJ's mate-pair facility works only for suffled sequence data.
Usage
ShuffleFastq.pl <inFastq1> <inFastq2> <outFastq>
Parameters
- inFastq1: input FASTQ file of one end
- inFastq2: input FASTQ file of the other end
- outFastq: output FASTQ file
Simply trim sequences in a FASTA file.
Usage
TrimFasta.pl <inFasta> <trimLength> <outFasta>
Parameters
- inFasta: input FASTA file
- trimLength: trim length
- outFasta: output FASTA file
Notes
This script can also be used for trimming a FASTQ file.
Split one single FASTA file into multiple FASTA files. This would be useful if the splitted files are to be mapped to a genome at the same time.
Usage
SplitFasta.pl <inFasta> <numSeq>
Parameters
- inFasta: input FASTA file
- numSeq: number of sequences per output file
Split one single FASTQ file into multiple FASTQ files. This would be useful if the splitted files are to be mapped to a genome at the same time.
Usage
SplitFastq.pl <inFastq> <numSeq>
Parameters
- inFastq: input FASTQ file
- numSeq: number of sequences per output file
This psl2sam.pl is modified from the psl2sam.pl in the SAMtools project. We modified it for the following improvements:
- implementation of SQ headers,
- precise reflection of BLAT alignment blocks, original psl2sam.pl would transfer an alignment with CIGAR 66M2I1D6M to be with CIGAR 67M1I6M, which misses one mismatch,
- implementation of NM and XM tags, where the latter is implemented by Bowtie2 for number of mismatches,
- implementation of MD tags, and
- translation of PSL output made by the BLAT program with options -q=dnax -t=dnax (translation-level BLAT).
Usage
psl2sam.pl [-u <chrMappingFile>] [-md] <genomeFasta> <inPSL>
Parameters/options
- -u <chrMappingFile>: a chromosome ID mapping file used when translating PSL records into SAM records. Earlier rnaseq.AlignmentFilter would produce lower-cased chromosome IDs which may result in problems with visualization tools like IGV. This problem is now corrected.
- -md: apply this option if MD tags are needed.
- genomeFasta: the genome FASTA file for mapping reads
- inPSL: a PSL format file of alignment records
Notes
- This script requires BioPerl installed.
- This script outputs to standard output.
- This script does not store SEQ fields. See samGetSEQ.pl and samGetSEQfast.pl for filling SEQ.
This script is for adding SEQ fields to SAM records.
Usage
samGetSEQ.pl <inSAM> [<FASTA>]+
Parameters/options
- inSAM: input SAM file. Use STDIN for reading from standard input.
- FASTA: FASTA file of reads, at least one FASTA file should be inputted.
Notes
- This script outputs to standard output.
- It is a good idea to pipe psl2sam.pl and samGetSEQ.pl.
This script is for adding SEQ fields to SAM records, a faster version of samGetSEQ.pl.
Usage
samGetSEQfast.pl [-unmap] <inSAM> [<FASTA>]+
Parameters/options
- -unmap: apply this option to include unaligned reads into the output.
- inSAM: input SAM file. Use STDIN for reading from standard input.
- FASTA: FASTA file of reads, at least one FASTA file should be inputted.
Notes
- This script outputs to standard output.
- This script guarantees filling all SEQ fields of the input SAM if read IDs in the SAM is a subsequence of those in FASTA files.
- It is a good idea to pipe psl2sam.pl and samGetSEQfast.pl.
Extract specified records from a GFF3 file. This script extracts records with the specified feature (column 3 of the GFF file) with specified attribute values (column 9 of the GFF file).
Usage
GFFExtractor.pl <GffFilename> <outputFilename> [<feature> [<attr>]]
Parameters
- GffFilename: input GFF3 file
- outputFilename: output filename
- feature: the type of the feature to be extracted. This will be "gene" if not specified.
- attr: the attribute to be extracted. This will be "ID" if not specified.
This script gives some basic statistics of intronic intervals and intergenic intervals in a canonical GFF file (made by misc.CanonicalGFF or misc.SeqGeneMd).
Usage
CGFFStatistics.pl <inputCGFF> [<step> <maximum> <outputFileName>]
If step, maximum, and outputFileName are given, the following five columns will be reported in the output file: (1) distance, controlled by step and maximum, (2) ratio of intergenic intervals shorter than the distance, (3) ratio of intronic intervals shorter than the distance, (4) precision of itme 2, and (5) corresponding F-measure.
Parameters
- inputCGFF: input canonical GFF file
- step: step of distance in the report file
- maximum: maximum distance in the report file
- outputFileName: output filename
Extract sequences from a FASTA file based on genomic locations in a canonical GFF file (made by misc.CanonicalGFF or misc.SeqGeneMd).
Usage
SeqGen.pl [-n <lineLength>] <outputFasta> <genomeFasta> <CGFF> [<strandFile>]
Parameters
- -n <lineLength>: output FASTA will contain this many characters per line if this option is specified.
- outputFasta: output file
- genomeFasta: a FASTA file of genome sequences
- CGFF: the canonical GFF file
- strandFile: a 2-column file containing gene IDs and strands (+ and -). If this file is given or the CGFF file is with strand information, output sequences will follow the specified strands.
Notes
This script requires BioPerl installed.
Transfer a canonical GFF file (made by misc.CanonicalGFF or misc.SeqGeneMd, which usually describe exon regions) into an intronic one. This would be useful for sample-sensitive intron-retention computation.
Usage
IntronicCGFF.pl <inputCGFF> <outputCGFF>
Parameters
- inputCGFF: input CGFF file
- outputCGFF: output CGFF file
Notes
- In the output canonical GFF file, Gene regions are kept while exon regions are transferred to be intronic.
- This script is now less useful because rnaseq.ExonCounter is capable of giving counts of intronic reads.
Compute pearson correlation coeficient for all pairs of given samples.
Usage
RPKMcorrelation.pl [-read] [-uniq] [-log] <geneRPKM_1> [<geneRPKM_i>]+
Parameters
- -read: apply this option to compute correlation based on read counts but not RPKM values
- -uniq: apply this option to compute correlation based on uniq-read counts
- -log: apply this option to compute correlation based on log values, zeros will be replaced by one half of the minimum of all non-zero values
- geneRPKM_i: .geneRPKM filename of i-th sample
Notes
This script requires the Statistics::Basic module.
Sample-sensitive exon-skipping (ES) comparison.
Usage
SSEScompare.pl <outputFilename> <readLength> <prefix_1> [<prefix_i>]+
Suppose that a number of executions of rnaseq.RPKMComputer had been done in the working directory with different output prefixes (respectively for different samples), this script performs a sample-sensitive ES comparison for samples specified by these prefixes. Suppose n prefixes are inputted in the command line, this script outputs a table that contains the following information:
- the ES event: gene ID in column 1 and an exon pair in column 2,
- n columns of numbers of reads supporting the ES event,
- n columns of coverage depths (number of times that the gene been covered by its reads),
- a chi-squared value (goodness-of-fit) comparing item B taking item C as the background,
- n columns of numbers of splice-reads that involve exactly one skipped exon and one of the exon pair, and
- a minimum chi-squared value (goodness-of-fit) of (i) comparing item B taking item E as the background and (ii) comparing item E taking item B as the background.
Parameters
- outputFilename: output file
- readLength: read length
- prefix_i: output prefix of the rnaseq.RPKMComputer execution of i-th sample
Notes
With the output table, an Excel function like CHIDIST() would help getting corresponding P-values for aforementioned item F (degree of freedom = number of samples -1), which helps find ES events with statistical significance. But be aware that it is still very possible to have false positive ES events reported. The following approach would help get rid of those false positive ES events:
- Use larger -min for rnaseq.RPKMComputer, with some loss of sensitivity.
- Take a look at corresponding .fineSplice file, a true positive may come with agreements with known splicing sites and/or be supported by reads with various splicing positions on them.
- Of course, visualization would give fast judgement than looking those .fineSplice files.
Sample-sensitive exon-skipping (ES) comparison, the computation script. This script is warpped by SSEScompare.pl. Use this script when there is an inconsistency between prefixes of .spliceCount and .geneRPKM files of some sample.
Usage
SSEScompare_ex.pl <outputFilename> <readLength> <spliceCount_1> <geneRPKM_1> <name_1> [<spliceCount_i> <geneRPKM_i> <name_i>]+
Parameters
- outputFilename: output file
- readLength: read length
- spliceCount_i: .spliceCount filename of i-th sample
- genePRKM_i: .geneRPKM filename of i-th sample
- name_i: name of i-th sample
Notes
See SSEScompare.pl for detailed description.
Sample-sensitive intron-retention (IR) comparison.
Usage
SSIRcompare.pl <outputFilename> <intronPrefix_1> <exonPrefix_1> [<intronPrefix_i> <exonPrefix_i>]+
Suppose that a number of executions of rnaseq.RPKMComputer have been done in the working directory, and the same number of executions of rnaseq.ExonCounter for .intronCount files have been done in the same working directory. This script performs a sample-sensitive IR comparison for samples specified by the intronPrefixes and exonPrefixes. Suppose n intronPrefixes and n exonPrefixes are inputted in the command line, this script outputs a table contains the following information:
- the IR event: gene ID in column 1, intron number in column 2, and intron length in column 3,
- n columns of numbers of reads supporting the IR event,
- n columns of numbers of reads belonging to neighboring exons,
- a chi-squared value (goodness-of-fit) comparing item B taking item C as the background,
- n columns of numbers of reads belonging to the gene,
- a chi-squared value (goodness-of-fit) comparing item B taking item E as the background,
- n columns of numbers of splicing reads spanning the intron, and
- a chi-squared value (goodness-of-fit) comparing item B taking item G as the background.
Parameters
- outputFilename: output file
- readLength: read length
- intronPrefix_i: output prefix of rnaseq.ExonCounter execution of i-th sample for intron reads
- exonPrefix_i: output prefix of rnaseq.RPKMComputer execution of i-th sample
Notes
With the output table, an Excel function like CHIDIST() would help getting corresponding P-values for aforementioned items D and F (degree of freedom = number of samples -1), which means IR events with significant statistical interference. But be aware that it is still very possible to have false positive IR events. One of the reasons might be incorrectness of genome annotation: sometimes an actual exon appears in an intronic interval of the database, and differential expression may make the "IR" event significant. It is a good idea to get visualization before doing any experiment.
Sample-sensitive intron-retention (IR) comparison, the computation script. This script is warpped by SSIRcompare.pl. Use this script when there is an inconsistency between prefixes of .intronCount, .exonCount, .geneRPKM, and .spliceCount files of some sample.
Usage
SSIRcompare_ex.pl <outputFilename> <intronCount_1> <exonCount_1> <geneRPKM_1> <spliceCount_1> <name_1> [<intronCount_i> <exonCount_i> <geneRPKM_i> <spliceCount_i> <name_i>]+
Parameters
- outputFilename: output file
- intronCount_i: .intronCount filename of i-th sample
- exonCount_i: .exonCount filename of i-th sample
- genePRKM_i: .geneRPKM filename of i-th sample
- spliceCount_i: .spliceCount filename of i-th sample
- name_i: name of i-th sample
Notes
See SSIRcompare.pl for detailed description.
Sample-sensitive alternative donor/acceptor (DA) site comparison.
Usage
SSDAcompare.pl <outputFilename> <fineSplice_1> <geneRPKM_1> <name_1> <fineSplice_2> <geneRPKM_2> <name_2> <rackjJARpath> <PvalTh> [<strandCGFF>]
Suppose that read data of two samples have been processed by rnaseq.RPKMComputer (or plus by rnaseq.FineSpliceCounter). For each gene, for every pair of splicing patterns that sharing the same exon pair, this script tests if there is some preference for one of the splicing pattern using the fisher exact test. The output table contains:
- gene ID
- splicing pattern 1: the first splicing pattern, plus one additional column indicating this pattern in the database or not if the novel column is present in the input .fineSplice files.
- splicing pattern 2: the second splicing pattern, plus one additional column indicating this pattern in the database or not if the novel column is present in the input .fineSplice files.
- a string indicating the change between the two splicing pattern is of donor and/or acceptor if strandCGFF is given.
- numbers of reads that supporting the first (or second) splicing pattern in the first (or second) sample.
- P-value of fisher exact test based on the four read numbers.
- numbers of uniq-reads of this gene in the two samples.
- P-value of fisher exact test based on numbers of uniq-reads of the gene in the two samples and numbers of splice-reads of the only splicing pattern NOT in database (or the splicing pattern with fewer total reads).
Parameters
- outputFilename: output file
- fineSplice_i: .fineSplice filename of i-th sample
- genePRKM_i: .geneRPKM filename of i-th sample
- name_i: name of i-th sample
- rackjJARpath: the path to RACKJ JAR file
- PvalTh: P-value threshold. Records with P-value greater than this threshold will not be reported.
- strandCGFF: a canonical GFF file with strand information. The string indicating the change between the two splicing pattern is of donor and/or acceptor will be reported if this parameter is given.
Sample-sensitive alternative donor/acceptor (DA) site comparison, the chi-squared version.
Usage
SSDAcompare_chisqr.pl [-minSplice <minSpliceReads>] [-minDB <minDBmatch>] <outputFilename> <chiSqrTh> <strandCGFF> <fineSplice_1> <geneRPKM_1> <name_1> [<fineSplice_i> <geneRPKM_i> <name_i>]+
This chi-squared version of SSDAcompare.pl supports the comparison of multiple libraries (can be more than two). The major difference between this script and SSDAcompare.pl is that this script gives chi-squared values (goodness-of-fit) but not P-values made by fisher exact tests.
Parameters
- -minSplice <minSpliceReads>: set to filter out those records of any splicing pattern with total splice-reads less than minSpliceReads
- -minDB <minDBmatch>: set to filter out those records with less than minDBmatch splicing patterns in database
- outputFilename: output file
- chiSqrTh: records with chi-squared values less than this threshold will not be reported
- strandCGFF: a canonical GFF file with strand information
- fineSplice_i: .fineSplice filename of i-th sample
- genePRKM_i: .geneRPKM filename of i-th sample
- name_i: name of i-th sample
Notes
With the output table, an Excel function like CHIDIST() would help getting corresponding P-values (degree of freedom = number of samples -1), which means alternative DA events with significant statistical interference.
Given two .geneCoverage files, this script computes the maximum difference between the first(last) base-pair of getting transcription evidence for each gene.
Usage
CoveredRegionDiff.pl <outputFilename> <strandCGFF> <prefix1> <prefix2>
The output file contains the following columns:
- gene ID
- gene start position (5' of the genome)
- strand of the gene
- number of reads in the first sample
- number of reads in the second sample
- the first base-pair of getting transcription evidence (5' of the genome) in the first sample
- the last base-pair of getting transcription evidence (3' of the genome) in the first sample
- the first base-pair of getting transcription evidence (5' of the genome) in the second sample
- the last base-pair of getting transcription evidence (3' of the genome) in the second sample
- the maximum difference between the first(last) base-pair of getting transcription evidence
- the ratio between the maximum difference and the transcription length
Parameters
- outputFilename: output filename
- strandCGFF: a canonical GFF file with strand information
- prefix1: prefix of the first .geneCoverage file
- prefix2: prefix of the second .geneCoverage file
Help doing visualization tasks automatically.
Usage
We strongly suggest to make one individual copy of this script to the working directory (even rename it), and make necessary changes according to comments inside the script. That means, this script need to be edited to work well. It automatically calls rnaseq.RegionTracer (or rnaseq.GeneTracer) and graphics.ReadViz for making plots.