Skip to content

Reference Manual

maketrnadb.py

maketrnadb.py creates the reference database that is used by processsamples.py for analyzing small RNA-seq data. It uses Bowtie2 to build indexes with tRNA sequences and reference genome sequence for read alignments. It also uses Infernal to create tRNA sequence alignments for annotating the positions of sequencing reads and misincorporations. The tRNA annotations required as inputs can be downloaded from GtRNAdb or generated by running tRNAscan-SE. The reference genome sequence can be downloaded from the UCSC Genome Browser.

Usage

maketrnadb.py --databasename=dbname --genomefile=genome.fa 
--trnascanfile=trnascan.txt [--gtrnafafile=db-tRNAs.fa --namemap=db-namemap.txt]

Options

  • --databasename = dbname (required)
    Directory and database name that will be used for the reference database
  • --genomefile = genome.fa (required)
    FASTA file of reference genome
  • --trnascanfile = trnascan.txt (required)
    tRNAscan-SE output file (*.out file in GtRNAdb downloaded tarball)
  • --gtrnafafile = db-tRNAs.fa (optional)
    tRNA sequence FASTA file downloaded from GtRNAdb
  • --namemap = namemap.txt (optional)
    Map file that converts tRNAscan-SE IDs to GtRNAdb gene symbols
  • --orgmode = mode (optional)
    Sequence source of reference
    Default is "euk" for eukaryotes. Other option includes "mito" for mitochondria, "arch" for archaea, and "bact" for bacteria.
    (Archaeal and bacterial options are currently under development. The mito option may not work properly for degenerative mitochondrial tRNAs. We suggest downloading the pre-built reference databases with the use of quickdb.bash instead.)
  • --forcecca = mode (optional)
    Force addition of 3' CCA to tRNA for bacteria and archaea. By default, 3' CCA will be added automatically to eukaryotic mature tRNAs.

Outputs

The following files are generated upon completion:

  • dbname-tRNAgenome.*: FASTA file of tRNA and reference genome sequences with Bowtie2 indexes
  • dbname-trnaalign.stk: Alignments of mature tRNA isodecoder sequences in Stockholm file format
  • dbname-trnaloci.stk: Alignment of tRNA gene sequences in Stockholm file format
  • dbname-trnatable.txt: Tab-delimited file with tRNA isodecoders and tRNA genes map
  • dbname-maturetRNAs.fa: FASTA file of mature tRNA isodecoder sequences
  • dbname-maturetRNAs.bed: tRNA isodecoders in BED file format
  • dbname-trnaloci.bed: tRNA genes in BED file format
  • dbname-dbinfo.txt: Database creation information and log
  • dbname-alignnum.txt : tRNA position numbering table for tRNA
  • dbname-locusnum.txt : tRNA position numbering table for pre-tRNAs

trimadapters.py

trimadapters.py removes adapter sequences from raw sequencing data as the preprocessing step before data analysis. For single end reads, Cutadapt is used for performing the adapter trimming. For paired end reads, the two read ends have to be merged into single end reads in order to allow correct read alignments by Bowtie2. Therefore, SeqPrep is used that will trim adapter sequences and merge the read ends simultaneously. The tool was designed to work with raw sequencing data generated by Illumina platform and has not been tested with data from other sequencing platform.

Note

The options for supporting Unique Molecular Identifiers (UMI) are experimental and may only be applicable for specific designs. Researchers who are interested in utilizing the tool with sequencing data containing UMI can contact us for feasibility inquiries.

Usage

trimadapters.py --runname=runname --runfile=runfile.txt [--firadapter=FIRADAPTER --secadapter=SECADAPTER --minlength=MINLENGTH --singleend --cores=CORES]

Options

  • --runname = runname (required)
    Name used for output files
  • --runfile = runfile (required)
    Tab or whitespace-delimited file with sample and FASTQ file information (described below)
  • --firadapter = FIRADAPTER (optional)
    3' adapter sequence used in single end reads or forward primer/adapter sequence used in paired end reads
    Default value is AGATCGGAAGAGCACACGTC
  • --secadapter = SECADAPTER (optional)
    Reverse primer/adapter sequence used in paired end reads
    Default value is GATCGTCGGACTGTAGAACTC
  • --minlength = MINLENGTH (optional)
    Minimum length of sequencing reads to be retained after trimming and/or merging
    Default value is 15
  • --singleend
    Flag to specify single end reads
  • --cores = CORES (optional)
    Number of cores to be used
    Default value is the number of available cores on the system
  • --umilength = length (optional)
    Length of UMI
    Default value is 0
  • --umithreeprime (optional)
    Flag to indicate that UMI is located at the 3' end of the reads instead of 5' end. It will have no effect if --umilength is 0.

Input files

Raw sequencing files

These files should be in FASTQ file format and can be compressed by gzip. They are specified in the runfile.

trimadapters.py runfile

This is a two or three-column tab-delimited text file depending on single or paired end reads. The columns are defined as:

  • column 1: sample name
  • column 2: single-end FASTQ file name or read 1 FASTQ file name for paired end reads
  • column 3: read 2 FASTQ file name for paired end reads

The FASTQ file names may include the file paths where the files are located.

Outputs

The following files are always generated upon completion:

  • runname_log.txt: Output log containing statistics generated by Cutadapt or SeqPrep
  • runname_manifest.txt: List of sample names and trimmed/merged FASTQ files

For single end reads, the following files are created:

  • samplename_trim.fastq.gz: FASTQ file containing trimmed sequencing reads and can be used by processsamples.py
  • runname_ca.pdf: histogram summarizing the amount of trimmed, untrimmed, and discarded reads
  • runname_ca.txt: Data source for the histogram in runname_ca.pdf

For paired end reads, the following files are created:

  • samplename_merged.fastq.gz: FASTQ file containing trimmed and merged sequencing reads and can be used by processsamples.py
  • samplename_left.fastq.gz: FASTQ file containing Read 1 that do not merge
  • samplename_right.fastq.gz: FASTQ file containing Read 2 that do not merge
  • runname_sp.pdf: histogram summarizing the amount of merged, unmerged, and discarded reads
  • runname_sp.txt: Data source for the histogram in runname_sp.pdf

processamples.py

processamples.py is the main tool that performs analysis on the small RNA sequencing data. It includes aligning reads to the reference database using Bowtie2, estimating transcript abundance, performing differential expression comparison across samples, and detecting misincorporations as potential RNA modifications. Multiple outputs in text files and images are generated during different steps of the analysis. For details of the analysis methods used in processamples.py, please refer to the Methods section of our publication.

Usage

processsamples.py --experimentname=expname --databasename=dbname --ensemblgtf=genes.gtf.gz --samplefile=samplefile.txt [--exppairs=samplespairs.txt --bedfile=file.bed --minnontrnasize=SIZE --nofrag --lazyremap --makehub --olddeseq --dumpother]

Options

  • --experimentname = expname (required)
    Experiment name that will be used for results
  • --databasename = dbname (required)
    Directory and name of reference database generated by maketrnadb.py
  • --ensemblgtf = genes.gtf.gz (required)
    Non-tRNA annotation file in Ensembl GTF file format
  • --samplefile = samplefile.txt (required)
    Tab or whitespace-delimited file with sample and pre-trimmed FASTQ file information (described below)
  • --exppairs = samplespairs.txt (optional but recommended)
    Tab or whitespace-delimited file with sample pairs for differential expression comparison (described below)
    If not provided, all vs all pairwise comparison will be performed.
  • --bedfile = file.bed (optional)
    Other non-tRNA gene annotations in BED file format
    Multiple files can be specified.
  • --minnontrnasize = SIZE (optional)
    Minimum read length to be retained for non-tRNA genes
    Defaults value is 20
  • --nofrag
    Flag to skip tRNA fragment determination
    Design for full length tRNA sequencing experiments
  • --lazyremap
    Skip read mapping step when BAM files already exist
  • --makehub
    Convert BAM files by interpolating tRNA read alignments to the reference genome and create read coverage tracks in track hub to be used in UCSC Genome Browser
  • --olddeseq
    Use DESeq1 instead of DESeq2
  • --dumpother
    Output read alignments that do not have an associated gene feature into BAM files (files end in "nofeat.bam")
  • --cores
    Number of processing cores to be used (default = 8)

Input files

Trimmed sequencing files

These files should be in FASTQ file format and can be compressed by gzip. They are specified in the samplefile. Adaptor sequences should have been removed from the sequencing reads. Only single end reads are supported. If paired end reads are used in sequencing, the two read ends that always overlap in small RNA-seq should be merged into single end reads. These files can be output files generated by trimadapters.py or other preferred tools.

Sample File

samplefile is a three-column tab-delimited file of sample meta data. The columns are:

  • column 1: unique sample replicate name
  • column 2: sample or group name common for replicates
  • column 3: preprocessed FASTQ file name

An example sample file TestSamples.txt (used for TestRun.bash) can be found with the tRAX source code.

Sample pair file

exppairs is a two-column tab-delimited file consists of sample pairs to be compared with each other. The columns are:

  • column 1: sample name
  • column 2: another sample name

The sample names must match with column 2 values in the sample file above.

Outputs

The results generated by processamples.py are distributed into multiple subfolders, outlined as follows:

file structure

Analysis run info

  • expname-runinfo.txt: Information on the tRAX run, including input parameters, run date, run command, and tRAX version

Quality assessment report

  • expname-qa.html: HTML document showing quality assesment of sequencing reads and alignments

Read alignments

  • sample_name.bam: Sorted BAM file with read alignments of sample_name
  • sample_name.bam.bai: BAM index file of sample_name.bam
  • sample_name-genome.bam: BAM file with read alignments of sample_name mapped to reference genome coordinates, used for UCSC Genome Browser track creation
  • sample_name_nofeat.bam: BAM file without gene features, generated when using --dumpother option
  • expname-mapstats.txt: Bowtie2 command and raw output statistics for each sample
  • expname-mapinfo.txt: Mapping statistics for each sample, used as input for expname-mapinfo.pdf
  • expname-mapinfo.pdf: Histogram of mapping rate for each sample

Summary of read coverage statistics

  • expname-typecounts.txt: Normalized read counts of small RNA types per sample
  • expname-typecounts.pdf: Normalized read distribution of small RNA types per sample
  • expname-typerealcounts.txt: Raw read counts of small RNA types per sample
  • expname-typerealcounts.pdf: Raw read distribution of small RNA types per sample
  • expname-aminocounts.txt: Normalized read counts of tRNA isotypes per sample
  • expname-aminocounts.pdf: Normalized read distribution of tRNA isotypes per sample
  • expname-combinedcoverage.pdf: Read coverage summary of tRNA isotypes per sample
  • expname-readlengths.txt: Counts of read lengths for tRNAs, pre-tRNAs, and other gene types per sample
  • expname-readlengths.pdf: Distribution of read lengths for tRNAs, pre-tRNAs, and other gene types per sample

Abundance of tRNA, tDRs, and other gene types

  • sample1_sample2-typescatter.pdf: Sample pair comparisons of normalized read counts for each gene type, including tRNA fragment type when --nofrag is not used
  • sample1_sample2-aminoscatter.pdf: Sample pair comparisons of normalized read counts for each tRNA fragment type when --nofrag is not used or for tRNAs when --nofrag is used
  • expname-readcounts.txt: Raw read counts for each transcript and tRNA fragment
  • expname-normalizedreadcounts.txt: Normalized read counts for each transcript and tRNA fragment
  • expname-coverage.txt: Per-base read coverage data of all tRNA isodecoders per sample, using Sprinzl tRNA positioning
  • expname-coverage.pdf: Per-base read coverage plot of all tRNA isodecoders per sample colored by specificity of mapping
  • expname-<isotype>_cov.pdf: Individual isotype per-base read coverage plots of tRNA isodecoders per sample colored by specificity of mapping
  • expname-SizeFactors.txt: DESeq2 size factors for each sample
  • expname-trnaendcounts.txt: Counts of tRNA "CCA" tail types

Differential expression comparison

  • expname-logvals.txt: Log2 fold change values of sample pairs generated by DESeq2
  • expname-padjs.txt: Adjusted P-values of sample pairs generated by DESeq2
  • expname-combine.txt: Tab-delimited file including adjusted P-values, log2 fold change values, and normalized read counts
  • sample1_sample2-volcano.pdf: Volcano plots of sample pair with log2 fold change values and adjusted P-values Top 10 results are labeled
  • sample1_sample2-volcano_tRNA.pdf: Volcano plots of sample pair with log2 fold change values and adjusted P-values of only tRNAs and tRNA fragments. Top 10 results are labeled
  • expname-pca.pdf: Principle component analysis of samples, based on normalized read counts of all gene features
  • expname-pcatrna.pdf: Principle component analysis of samples, based on normalized read counts of tRNAs and tDRs

Base mismatch estimation

  • mismatch/expname-trnapositionmismatches.pdf: Maximum mismimatch percentage among all samples for each tRNA position
  • mismatch/expname-trnapositionfiveprime.pdf: Maximum percentage of start position among all samples for each tRNA position
  • mismatch/expname-<isotype>_mismatch.pdf: Per-base coverage of mismatches for tRNA isodecoders
  • mismatch/expname-<isotype>_delete.pdf: Per-base coverage of deletion for tRNA isodecoders
  • mismatch/expname-<isotype>_fiveprimeends.pdf: Per-base coverage of 5' start position for tRNA isodecoders
  • mismatch/expname-<isotype>_mismatchheatmap.pdf: Heatmap of mismatches per base for tRNA isodecoders of specific isotype
  • mismatch/expname-<isotype>_deletionheatmap.pdf: Heatmap of deletions per base for tRNA isodecoders of specific isotype
  • mismatch/expname-<isotype>_threeprimeheatmap.pdf: Heatmap of 3' end position for tRNA isodecoders of specific isotype
  • mismatch/expname-<isotype>_fiveprimeheatmap.pdf: Heatmap of 5' start postion for tRNA isodecoders of specific isotype
  • mismatch/expname-<isotype>_fiveprimecounts.pdf: Distribution of 5' start position of all tRNA transcripts for each sample
  • mismatch/expname-<isotype>_threeprimecounts.pdf: Distribution of 3' end position of all tRNA transcripts for each sample
  • mismatch/expname-<isotype>_trnapositionmismatches.pdf: Maximum mismimatch percentage among all samples for each tRNA position by anticodon
  • mismatch/expname-<position>_posaminomismatches.pdf: Mismatch percentage at position per tRNA isotype
  • mismatch/expname-<position>_possamplemismatches.pdf: Mismatch percentage at position per sample
  • mismatch/expname-<position>_posaminodeletions.pdfs: Deletion percentage at position per tRNA isotype
  • mismatch/expname-<position>_possampledeletions.pdf: Deletion percentage at position per sample
  • mismatch/expname-<position>_possamplereadstarts.pdf: Read start percentage at position per sample

Pre-tRNA read coverage

  • pretRNAs/expname-pretRNAcombinedcoverage.pdf: Read coverage summary of tRNA isotypes per sample, with upstream and downstream regions
  • pretRNAs/expname-pretRNAcoverage.txt: Per-base read coverage data of all tRNA isodecoders per sample, with upstream and downstream regions
  • pretRNAs/expname-pretRNAcoverage.pdf: Per-base read coverage plot of all tRNA isodecoders per sample, with upstream and downstream regions

Unique read coverage

  • unique/expname-trnauniquecounts.txt: Counts of uniquely mapped reads for all transcripts per sample
  • unique/expname-<isotype>_uniqueonlycov.pdf: Individual isotype per-base unique read coverage plots of tRNA isodecoders per sample

Genome browser track hub

  • trackhub/<sample>.Plus.bw: Read coverage BigWig file of specific sample on forward strand
  • trackhub/<sample>.Minus.bw: Read coverage BigWig file of specific sample on reverse strand
  • trackhub/trackdb.txt: UCSC Genome Browser track information in track database definition format

gettranscripts.bash

gettranscripts.bash is a tool for generating transcript predictions from a bam file. It is intended to be used on the output from the --dumpother option of processamples.py after merging with samtools. This requires bedtools to be installed to function

Usage

gettranscripts.bash mergedoutput.bam 30 > output.bed

Arguments

  • First argument is the bam file generated by processamples.py with --dumpother option.
  • Second argument is the minimum number of transcripts.

Output

  • output.bed: BED file with possible unannotated transcripts found in the BAM file
Back to top