Command Line Manual


Genome Annotation Manipulation

misc.GffTree

This program reads a file in GFF3 format and outputs its type hierarchy. The output file is named <inFilename>.features. An asterisk after a type name means that records of that type are annotated with ID attributes. The following example was made by using this file as input with default parameters.

GffRoot
  chromosome*
  gene*
    mRNA*
      protein*
      exon
      five_prime_UTR
      CDS
      three_prime_UTR
    miRNA*
      exon
    tRNA*
      exon
    ncRNA*
      exon
    snoRNA*
      exon
    snRNA*
      exon
    rRNA*
      exon
  pseudogene*
    pseudogenic_transcript*
      pseudogenic_exon
  transposable_element_gene*
    mRNA*
      exon
  transposable_element*
    transposon_fragment

With this output, it is easy to tell that DNA-level objects such as loci are annotated with the types "gene", "pseudogene", and "transposable_element_gene", and that RNA-level objects such as isoforms are annotated with the types "*RNA" and "pseudogenic_transcript".
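As a rough illustration of how such a hierarchy can be derived, the following Python sketch links each record's type to the type of the record named in its Parent attribute. It is not the misc.GffTree implementation; the function names (parse_attributes, type_hierarchy, render) are hypothetical, and the sketch simplifies by keying children on the type name alone.

```python
from collections import defaultdict

def parse_attributes(field):
    """Parse the ninth GFF3 column (key=value pairs separated by ';')."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def type_hierarchy(gff_lines):
    """Return (children, has_id): parent type -> child types, and types carrying IDs."""
    id_to_type = {}
    records = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 record
        attrs = parse_attributes(cols[8])
        records.append((cols[2], attrs))
        if "ID" in attrs:
            id_to_type[attrs["ID"]] = cols[2]
    children = defaultdict(set)
    has_id = set()
    for rec_type, attrs in records:
        if "ID" in attrs:
            has_id.add(rec_type)
        parents = attrs.get("Parent")
        if parents is None:
            children["GffRoot"].add(rec_type)
            continue
        for parent_id in parents.split(","):
            # Assumption: records with an unknown parent hang off the root.
            children[id_to_type.get(parent_id, "GffRoot")].add(rec_type)
    return children, has_id

def render(children, has_id, node="GffRoot", depth=0):
    """Render the hierarchy as indented lines, marking ID-bearing types with '*'."""
    lines = ["  " * depth + node + ("*" if node in has_id else "")]
    for child in sorted(children.get(node, ())):
        lines.extend(render(children, has_id, child, depth + 1))
    return lines
```

Because children are keyed by type name only, a type such as mRNA shows the same sub-types wherever it appears. The real output above distinguishes mRNA under gene from mRNA under transposable_element_gene, so the actual program evidently tracks the full path.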

Parameters

See also

See here for an example in RNAseq analysis.


misc.CanonicalGFF

This program reads a GFF3 file and outputs a .cgff (or .model) file, which contains gene (or transcript) regions and the corresponding exon regions. The option -GE controls which objects are reported by naming two types from the type hierarchy made by misc.GffTree: objects of the ancestral type are reported as individual entries, and objects of the descendant type are collected as their sub-regions. For example, locus AT3G62190 generates two isoforms, AT3G62190.1 and AT3G62190.2, each of which contains the blue exons in the following picture. With option -GE mRNA exon, misc.CanonicalGFF generates two entries (AT3G62190.1 and AT3G62190.2). With option -GE gene exon, misc.CanonicalGFF generates one entry (AT3G62190) with gray sub-regions, which are the combined regions of the blue exons.
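The "combined regions" produced by -GE gene exon can be pictured as a union of exon intervals across isoforms. The Python sketch below shows one way to compute such a union. It is an illustration only: the coordinates and isoform names are invented, and treating directly adjacent exons as one region is an assumption, not documented RACKJ behavior.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent closed intervals [start, stop]."""
    merged = []
    for start, stop in sorted(intervals):
        # Assumption: intervals that touch (stop + 1 == next start) are merged too.
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(interval) for interval in merged]

# Hypothetical exon coordinates for two isoforms of one locus.
isoform_exons = {
    "isoform.1": [(100, 200), (300, 400)],
    "isoform.2": [(150, 250), (300, 350), (500, 600)],
}

# Gene-level sub-regions: the union of all exons over all isoforms.
gene_regions = merge_intervals(
    exon for exons in isoform_exons.values() for exon in exons
)
```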

Parameters

See also

See here for an example in RNAseq analysis.


misc.SeqGeneMd

This program reads a seq_gene.md file and processes it into two files: a .cgff file that contains gene regions and exon regions, and a .model file that contains transcript regions and exon regions. Problematic records, such as genes with repetitive records, are skipped and saved in a .skipped file so that you may process them manually. For prebuilt .cgff and .model files, please refer here.

Parameters


special.PromoterCGFF

This program reads a list of gene IDs and forms their promoter regions (or regions around transcription stop sites) in .cgff format, based on the given canonical GFF file.

Parameters

Notes


rnaseq.MappingInfoRecover

This program reads a number of mapping result files and outputs the following files: (1) a .exonInfo file, containing genomic intervals that are mapped by reads, (2) a .intronInfo file, containing genomic intervals that are spanned by splicing reads, and (3) a .matePairInfo file, containing pairs of intervals from the .exonInfo file, where each pair is mapped by at least one mate pair of reads. These files are the inputs of the rnaseq.TranscriptomeRecover program, which is used to recover unannotated transcriptome regions based on RNAseq evidence.

Parameters


rnaseq.TranscriptomeRecover

This program reads the outputs of rnaseq.MappingInfoRecover and recovers unannotated transcriptome regions based on RNAseq evidence. The recovered information is stored in three files: (1) a .extdCGFF file, in the same format as made by misc.CanonicalGFF or misc.SeqGeneMd, storing gene regions and corresponding exon regions, (2) a .extdReport file, recording extension information of existing genes in the given .cgff file (assigned by -GFF), and (3) a .novelReport file, recording novel transcriptome regions.

Parameters


Alignment Manipulation

rnaseq.AlignmentFilter

This program filters alignments at three levels: (1) identity filtering: a read is filtered out if it has no alignment with better identity than the threshold, (2) gene model filtering: a read is filtered out if it matches no genes, and (3) mate-pair filtering: a pair of reads qualifies if the two reads have corresponding alignments matching the same gene.

Parameters

Notes


Computation of Numbers

rnaseq.RPKMComputer

This program computes RPKM values and counts reads for exons and splicing events. It reports three files:

Each file is listed by its extension and number of columns, followed by the column descriptions.

.geneRPKM (five columns)
  1. gene ID
  2. length in Kbp
  3. number of mapped reads, including fractions of multi-reads
  4. RPKM value
  5. ratio of multi-reads versus all reads
.exonCount (five columns)
  1. gene ID
  2. exon number (counted from the 5' end of the genome, not of the gene)
  3. number of mapped reads, including fractions of multi-reads
  4. exon length
  5. ratio of multi-reads versus all reads
.spliceCount (five or six columns)
  1. gene ID
  2. exon pair, in the form of a<=>b
  3. number of reads supporting this splicing event
  4. is there any other exon between a and b?
  5. is this splicing event novel? (optional, given only if -model is specified)
  6. splicing position frequency of reads
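For orientation, the RPKM value in the .geneRPKM file follows the usual definition: reads per kilobase of gene length per million mapped reads. The Python sketch below applies that standard formula to the columns listed above. It is not taken from the RACKJ source; in particular, which total read count RACKJ uses for normalization is an assumption here, and the function name rpkm is hypothetical.

```python
def rpkm(read_count, length_kbp, total_mapped_reads):
    """Reads per kilobase per million mapped reads (standard definition).

    read_count may be fractional, since multi-reads contribute fractions.
    total_mapped_reads is assumed to be the number of mapped reads in the sample.
    """
    if length_kbp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / (length_kbp * millions_of_reads)

# Hypothetical gene: 500 reads on a 2 Kbp gene in a sample of 10 million mapped reads.
example_value = rpkm(500, 2.0, 10_000_000)
```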

Parameters

See also

See here for a detailed explanation of read counting.

Notes


rnaseq.ExonCounter

This program reports read counts of each exon, which are also reported by rnaseq.RPKMComputer. The only difference is that you may use this program to get read counts of introns by setting the option -intronic true.

Parameters

Notes


rnaseq.FineSpliceCounter

This program reports splicing events at nucleotide resolution. The reported .fineSplice file contains four or five columns: (1) gene ID, (2) splicing pattern (see below), (3) number of reads supporting this splicing event, (4) is this splicing event novel? (optional, given only if -model is specified), and (5) splicing position frequency of reads.

The second column, splicing pattern, is used to describe the relationships between two consecutive alignment blocks and corresponding exons:

Each form is listed first, followed by its description.

exonA(relativePosA)-exonB(relativePosB)
  The two alignment blocks overlap two different exons, where relativePosA (relativePosB) is the position of the actual splicing site of the first (second) block relative to the splicing site of exonA (exonB) in the database. A negative (positive) relative position means that the splicing site is inside (outside) the exon, and a relative position of zero means that the splicing site agrees with the one in the database.

exonX[relativePosA-relativePosB]
  The two alignment blocks overlap the same exon, exonX, where their splicing positions, relative to the start point of exonX, are relativePosA and relativePosB.

exonA(relativePosA)-genomicPosB
or
genomicPosA-exonB(relativePosB)
or
genomicPosA-genomicPosB
  At least one alignment block overlaps no exon. For each alignment block that overlaps no exon, its splicing position is given in genomic coordinates.

Parameters

Notes


rnaseq.GeneCoverageArray

This program reports an array of coverage depths for each gene in the specified .cgff file (-GFF). In the .geneCoverage file, each row represents one gene, and its columns are: (1) gene ID, (2) chromosome and start position of this gene (from the 5' end of the genome), (3) number of reads, and (4) the array of coverage depths.

Parameters

Notes


Statistics

statistics.CountComparator

This program reads a table (.exonCount, .spliceCount, or .fineSplice) as a base table and compares it with any number of other tables in the same format. The output file, named <ControlTable>-<TreatmentTable>, has the following format:

Group  GeneID1  1.897405712778585E-38
1   317    152    2   1.8974057128587125E-38
2   10589  17530  1   1.897405712778585E-38
Group  GeneID2  5.315100671928383E-32
1   3835   4530   10  5.315100671962806E-32
2   2229   2545   10  5.660985366966761E-25
3   1570   1621   10  4.615877710327863E-15
4   2785   3028   10  4.817867736707222E-22
5   1694   2002   10  6.45000050068055E-26
6   1594   1825   10  1.1376381066404553E-22
7   913    1052   10  1.9015496347218025E-18
8   1298   1354   10  1.212634909931947E-14
9   878    721    1   3.4675426999854235E-11
10  1279   848    1   5.315100671928383E-32

In each group, each exon (column 1) is listed with its read count in the control sample (column 2) and in the treatment sample (column 3). Fisher's exact test is applied to these two counts together with those of each other exon of the same gene; the most significant P-value is put in column 5 and the corresponding exon in column 4. The P-value assigned to the group is the most significant P-value among its exons. In this example, the 10th exon is the one most often reported as differentially expressed when compared with the other exons of GeneID2.

NOTE: This program was designed for discovering significant differences in exon usage (or isoform usage) between samples by statistical inference, but it does not seem to work very well. We now have a number of scripts for detecting sample-sensitive alternative-splicing events.

Parameters


statistics.GeneCoverageDiff

This program reads two .geneCoverage files and reports, for each gene, the difference between the two coverage distributions (one from each file) in the following columns: (1) gene ID, (2) start point (from the 5' end of the genome), (3) length of the gene in the genome, (4) and (5) numbers of uniquely mapped reads of the gene in the two files, say n1 and n2, (6) position of the maximum distribution difference, (7) the maximum distribution difference D, and (8) the adjusted D value. The distribution difference D is computed based on the concept of the KS-test, and the adjusted D is D*SQRT(n1*n2 / (n1+n2)).
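The computation of D and the adjusted D can be sketched in Python as below. This is an illustration under stated assumptions rather than the program's actual code: it assumes each coverage array is turned into a cumulative distribution by dividing running sums by the array total, and that the reported position is the 0-based offset of the first maximum. The function name coverage_difference is hypothetical; only the adjusted-D formula comes from the description above.

```python
import math
from itertools import accumulate

def coverage_difference(cov1, cov2, n1, n2):
    """Return (position, D, adjusted D) for two coverage arrays of one gene.

    cov1, cov2: per-base coverage depths of equal length.
    n1, n2: numbers of uniquely mapped reads of the gene in the two files.
    """
    if len(cov1) != len(cov2):
        raise ValueError("coverage arrays must have the same length")
    total1, total2 = sum(cov1), sum(cov2)
    if total1 == 0 or total2 == 0 or n1 + n2 == 0:
        raise ValueError("both samples need non-zero coverage and reads")
    # Assumption: cumulative coverage fractions play the role of the
    # empirical distribution functions in the KS statistic.
    cdf1 = [c / total1 for c in accumulate(cov1)]
    cdf2 = [c / total2 for c in accumulate(cov2)]
    diffs = [abs(a - b) for a, b in zip(cdf1, cdf2)]
    d = max(diffs)
    position = diffs.index(d)  # 0-based offset of the first maximum
    adjusted = d * math.sqrt(n1 * n2 / (n1 + n2))
    return position, d, adjusted
```

The scaling factor SQRT(n1*n2 / (n1+n2)) is the same one used in the two-sample KS statistic, which makes D values from genes with different read counts more comparable.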

Parameters


special.ReadStatistics

This program gives some statistics on uniq-reads, where a read is defined as a uniq-read if it has exactly one best alignment. For uniq-reads that map to one single gene, the following statistics are reported: (1) number of reads contained by exons, (2) number of reads contained by introns, (3) number of reads crossing exon-intron boundaries, and (4) number of splice-reads matching two or more exons.

Parameters


Visualization

rnaseq.GeneTracer

This program reports reads and their alignments for specified genes. A file named OutPrefix.geneID is produced for each gene, and each such file can be used for visualization of the mapping with graphics.ReadViz. Note that whether a read belongs to a gene is controlled by four parameters: -exon, -contain, -min, and -ALL.

Parameters

See also

See here for example.

Notes


rnaseq.RegionTracer

This program reports reads and their alignments belonging to specified regions. A file named OutPrefix.regionString is produced for each specified region, and each such file can be used for visualization of the mapping with graphics.ReadViz. A regionString should be exactly a gene ID or a triple of chromosome, start position, and stop position.

Note that the major difference between this program and rnaseq.GeneTracer is the classification of multiply mapped reads: rnaseq.GeneTracer classifies a read as multiply mapped if it maps to multiple genes, even if it is uniquely mapped to the genome, whereas this program classifies a read as multiply mapped only if it has multiple alignments to the genome. In addition, rnaseq.GeneTracer decides which reads belong to a gene based on four parameters: -exon, -contain, -min, and -ALL.

Parameters

Notes


graphics.ReadViz

This program reads a file that is made by rnaseq.GeneTracer or rnaseq.RegionTracer, and produces a picture of mapping of reads.

Parameters

See also

See here for example.


Interact with Scripts

Programs in this sub-section were created to give fast answers to specific questions, such as interval queries and Fisher's exact test computations. These programs are interactive so that they can be operated manually or programmatically. See here for example Perl code.

statistics.FisherExactTest

This program can be operated in three modes, depending on the number of command-line parameters:
  1. four parameters: the four parameters are treated as n11, n12, n21, and n22. A Fisher's exact test is then carried out, and the left-tail, right-tail, and two-tail probabilities are reported.
  2. two parameters: the two parameters are treated as the input file and the output file. The input file should contain four numbers per line, and a Fisher's exact test is computed for each line. The output probabilities are stored in one line for each test.
  3. no parameters: every line of the standard input should contain four numbers, and the three probabilities are written to the standard output.
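To make the three reported probabilities concrete, here is a small Python sketch of Fisher's exact test on a 2x2 table, computed directly from the hypergeometric distribution. It is not the module used by statistics.FisherExactTest. The two-tail definition (summing all tables no more probable than the observed one) is a common convention assumed here, and the tab-separated output of run_lines is an invented format for illustration.

```python
from math import comb

def fisher_exact(n11, n12, n21, n22):
    """Return (left-tail, right-tail, two-tail) probabilities for a 2x2 table."""
    row1 = n11 + n12
    col1 = n11 + n21
    total = n11 + n12 + n21 + n22
    denom = comb(total, col1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k.
        return comb(row1, k) * comb(total - row1, col1 - k) / denom

    lowest = max(0, col1 - (total - row1))
    highest = min(row1, col1)
    probs = {k: prob(k) for k in range(lowest, highest + 1)}
    observed = probs[n11]
    left = sum(p for k, p in probs.items() if k <= n11)
    right = sum(p for k, p in probs.items() if k >= n11)
    # Assumed two-tail convention: tables at most as probable as the observed one.
    two = sum(p for p in probs.values() if p <= observed * (1 + 1e-7))
    return left, right, min(two, 1.0)

def run_lines(lines):
    """Sketch of the line-by-line modes: one test per line of four integers."""
    for line in lines:
        tokens = line.split()
        if len(tokens) != 4:
            continue  # assumption: malformed lines are skipped
        yield "\t".join(f"{p:.6g}" for p in fisher_exact(*map(int, tokens)))
```

For example, list(run_lines(["3 1 1 3"])) yields one line with the three probabilities for the table (3, 1, 1, 3).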

Special thanks

The computing module of the fisher exact test was modified from a java script of Dr. Oyvind Langsrud (permission granted, see here for his on-line fisher exact test).

special.GeneQuery

This program reads genomic position (or interval) queries and reports the matching genic objects described in a given .cgff file. Every line in the input file is treated as a query, and its first token should be a genomic position or interval in one of the following formats:
  1. <chromosome>:<position>
  2. <chromosome>(<start>,<stop>)
Second to fourth tokens are optional:
  1. if the second token exists: it is treated as a boolean value that decides, for every gene, whether the query is tested against exonic regions only or against the whole gene region (including intronic regions)
  2. if the third token exists: it is treated as a boolean value that decides how the query is tested. If true, by containment; otherwise, by intersection.
  3. if the fourth token exists: it is treated as an integer giving the minimum length required to qualify the containment or the intersection

Parameters


special.IntervalQuery

This program maintains interval tree data structures so that interval queries can be answered quickly. It reads two commands from the standard input:
  1. insert <sequence> <start> <stop>: insert the interval [start,stop] (both ends included) on sequence
  2. query <sequence> <start> <stop>: query the intervals overlapping [start,stop] on sequence. The output is written on one line, with the intervals comma-separated.
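The two commands can be modeled with the Python sketch below. For brevity it deliberately swaps the interval tree for a plain per-sequence list with a linear scan, so it shows the command semantics but not the program's performance characteristics. The class and function names are hypothetical, and the start-stop text used for each reported interval is an assumed format.

```python
from collections import defaultdict

class IntervalStore:
    """Naive stand-in for an interval tree: one list of intervals per sequence."""

    def __init__(self):
        self._by_sequence = defaultdict(list)

    def insert(self, sequence, start, stop):
        self._by_sequence[sequence].append((start, stop))

    def query(self, sequence, start, stop):
        # Closed intervals overlap when each one starts before the other ends.
        return [
            (s, e)
            for s, e in self._by_sequence.get(sequence, [])
            if s <= stop and start <= e
        ]

def handle(store, line):
    """Process one command line; return the output line for a query, else None."""
    tokens = line.split()
    if len(tokens) != 4:
        raise ValueError(f"expected '<command> <sequence> <start> <stop>': {line!r}")
    command, sequence = tokens[0], tokens[1]
    start, stop = int(tokens[2]), int(tokens[3])
    if command == "insert":
        store.insert(sequence, start, stop)
        return None
    if command == "query":
        hits = store.query(sequence, start, stop)
        # Assumed output format: comma-separated "start-stop" pairs on one line.
        return ",".join(f"{s}-{e}" for s, e in hits)
    raise ValueError(f"unknown command: {command!r}")
```

A script driving the real program would write the same insert and query lines to its standard input and read one output line per query.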

Supported formats of mapping results

RACKJ now supports the PSL/PSLX format and the SAM format. The SAM portion was done by incorporating the Picard program (download sam-x.yz.jar here).

Supported formats and their method strings

Here is a list of method strings (for option -M of all the programs above) and the corresponding formats:

Notes