Reference Manual

maketrnadb.py

maketrnadb.py creates the reference database that is used by processsamples.py for analyzing small RNA-seq data. It uses Bowtie2 to build indexes with tRNA sequences and reference genome sequence for read alignments. It also uses Infernal to create tRNA sequence alignments for annotating the positions of sequencing reads and misincorporations. The tRNA annotations required as inputs can be downloaded from GtRNAdb or generated by running tRNAscan-SE. The reference genome sequence can be downloaded from the UCSC Genome Browser.

Usage

maketrnadb.py --databasename=dbname --genomefile=genome.fa 
--trnascanfile=trnascan.txt [--gtrnafafile=db-tRNAs.fa --namemap=db-namemap.txt]

Options

--databasename = dbname (required)
Directory and database name that will be used for the reference database
--genomefile = genome.fa (required)
FASTA file of reference genome
--trnascanfile = trnascan.txt (required)
tRNAscan-SE output file (*.out file in GtRNAdb downloaded tarball)
--gtrnafafile = db-tRNAs.fa (optional)
tRNA sequence FASTA file downloaded from GtRNAdb
--namemap = namemap.txt (optional)
Map file that converts tRNAscan-SE IDs to GtRNAdb gene symbols
--orgmode = mode (optional)
Sequence source of reference
Default is "euk" for eukaryotes. Other option includes "mito" for mitochondria, "arch" for archaea, and "bact" for bacteria.
(Archaeal and bacterial options are currently under development. The mito option may not work properly for degenerative mitochondrial tRNAs. We suggest downloading the pre-built reference databases with the use of quickdb.bash instead.)
--forcecca = mode (optional)
Force addition of 3' CCA to tRNA for bacteria and archaea. By default, 3' CCA will be added automatically to eukaryotic mature tRNAs.

Outputs

The following files are generated upon completion:

dbname-tRNAgenome.*: FASTA file of tRNA and reference genome sequences with Bowtie2 indexes
dbname-trnaalign.stk: Alignments of mature tRNA isodecoder sequences in Stockholm file format
dbname-trnaloci.stk: Alignment of tRNA gene sequences in Stockholm file format
dbname-trnatable.txt: Tab-delimited file with tRNA isodecoders and tRNA genes map
dbname-maturetRNAs.fa: FASTA file of mature tRNA isodecoder sequences
dbname-maturetRNAs.bed: tRNA isodecoders in BED file format
dbname-trnaloci.bed: tRNA genes in BED file format
dbname-dbinfo.txt: Database creation information and log
dbname-alignnum.txt : tRNA position numbering table for tRNA
dbname-locusnum.txt : tRNA position numbering table for pre-tRNAs

trimadapters.py

trimadapters.py removes adapter sequences from raw sequencing data as the preprocessing step before data analysis. For single end reads, Cutadapt is used for performing the adapter trimming. For paired end reads, the two read ends have to be merged into single end reads in order to allow correct read alignments by Bowtie2. Therefore, SeqPrep is used that will trim adapter sequences and merge the read ends simultaneously. The tool was designed to work with raw sequencing data generated by Illumina platform and has not been tested with data from other sequencing platform.

Note

The options for supporting Unique Molecular Identifiers (UMI) are experimental and may only be applicable for specific designs. Researchers who are interested in utilizing the tool with sequencing data containing UMI can contact us for feasibility inquiries.

Usage

trimadapters.py --runname=runname --runfile=runfile.txt [--firadapter=FIRADAPTER --secadapter=SECADAPTER --minlength=MINLENGTH --singleend --cores=CORES]

Options

--runname = runname (required)
Name used for output files
--runfile = runfile (required)
Tab or whitespace-delimited file with sample and FASTQ file information (described below)
--firadapter = FIRADAPTER (optional)
3' adapter sequence used in single end reads or forward primer/adapter sequence used in paired end reads
Default value is AGATCGGAAGAGCACACGTC
--secadapter = SECADAPTER (optional)
Reverse primer/adapter sequence used in paired end reads
Default value is GATCGTCGGACTGTAGAACTC
--minlength = MINLENGTH (optional)
Minimum length of sequencing reads to be retained after trimming and/or merging
Default value is 15
--singleend
Flag to specify single end reads
--cores = CORES (optional)
Number of cores to be used
Default value is the number of available cores on the system
--umilength = length (optional)
Length of UMI
Default value is 0
--umithreeprime (optional)
Flag to indicate that UMI is located at the 3' end of the reads instead of 5' end. It will have no effect if --umilength is 0.

Input files

Raw sequencing files

These files should be in FASTQ file format and can be compressed by gzip. They are specified in the runfile.

trimadapters.py `runfile`

This is a two or three-column tab-delimited text file depending on single or paired end reads. The columns are defined as:

column 1: sample name
column 2: single-end FASTQ file name or read 1 FASTQ file name for paired end reads
column 3: read 2 FASTQ file name for paired end reads

The FASTQ file names may include the file paths where the files are located.

Outputs

The following files are always generated upon completion:

runname_log.txt: Output log containing statistics generated by Cutadapt or SeqPrep
runname_manifest.txt: List of sample names and trimmed/merged FASTQ files

For single end reads, the following files are created:

samplename_trim.fastq.gz: FASTQ file containing trimmed sequencing reads and can be used by processsamples.py
runname_ca.pdf: histogram summarizing the amount of trimmed, untrimmed, and discarded reads
runname_ca.txt: Data source for the histogram in runname_ca.pdf

For paired end reads, the following files are created:

samplename_merged.fastq.gz: FASTQ file containing trimmed and merged sequencing reads and can be used by processsamples.py
samplename_left.fastq.gz: FASTQ file containing Read 1 that do not merge
samplename_right.fastq.gz: FASTQ file containing Read 2 that do not merge
runname_sp.pdf: histogram summarizing the amount of merged, unmerged, and discarded reads
runname_sp.txt: Data source for the histogram in runname_sp.pdf

processamples.py

processamples.py is the main tool that performs analysis on the small RNA sequencing data. It includes aligning reads to the reference database using Bowtie2, estimating transcript abundance, performing differential expression comparison across samples, and detecting misincorporations as potential RNA modifications. Multiple outputs in text files and images are generated during different steps of the analysis. For details of the analysis methods used in processamples.py, please refer to the Methods section of our publication.

Usage

processsamples.py --experimentname=expname --databasename=dbname --ensemblgtf=genes.gtf.gz --samplefile=samplefile.txt [--exppairs=samplespairs.txt --bedfile=file.bed --minnontrnasize=SIZE --nofrag --lazyremap --makehub --olddeseq --dumpother]

Options

--experimentname = expname (required)
Experiment name that will be used for results
--databasename = dbname (required)
Directory and name of reference database generated by maketrnadb.py
--ensemblgtf = genes.gtf.gz (required)
Non-tRNA annotation file in Ensembl GTF file format
--samplefile = samplefile.txt (required)
Tab or whitespace-delimited file with sample and pre-trimmed FASTQ file information (described below)
--exppairs = samplespairs.txt (optional but recommended)
Tab or whitespace-delimited file with sample pairs for differential expression comparison (described below)
If not provided, all vs all pairwise comparison will be performed.
--bedfile = file.bed (optional)
Other non-tRNA gene annotations in BED file format
Multiple files can be specified.
--minnontrnasize = SIZE (optional)
Minimum read length to be retained for non-tRNA genes
Defaults value is 20
--nofrag
Flag to skip tRNA fragment determination
Design for full length tRNA sequencing experiments
--lazyremap
Skip read mapping step when BAM files already exist
--makehub
Convert BAM files by interpolating tRNA read alignments to the reference genome and create read coverage tracks in track hub to be used in UCSC Genome Browser
--olddeseq
Use DESeq1 instead of DESeq2
--dumpother
Output read alignments that do not have an associated gene feature into BAM files (files end in "nofeat.bam")
--cores
Number of processing cores to be used (default = 8)

Input files

Trimmed sequencing files

These files should be in FASTQ file format and can be compressed by gzip. They are specified in the samplefile. Adaptor sequences should have been removed from the sequencing reads. Only single end reads are supported. If paired end reads are used in sequencing, the two read ends that always overlap in small RNA-seq should be merged into single end reads. These files can be output files generated by trimadapters.py or other preferred tools.

Sample File

samplefile is a three-column tab-delimited file of sample meta data. The columns are:

column 1: unique sample replicate name
column 2: sample or group name common for replicates
column 3: preprocessed FASTQ file name

An example sample file TestSamples.txt (used for TestRun.bash) can be found with the tRAX source code.

Sample pair file

exppairs is a two-column tab-delimited file consists of sample pairs to be compared with each other. The columns are:

column 1: sample name
column 2: another sample name

The sample names must match with column 2 values in the sample file above.

Outputs

The results generated by processamples.py are distributed into multiple subfolders, outlined as follows:

file structure

Analysis run info

expname-runinfo.txt: Information on the tRAX run, including input parameters, run date, run command, and tRAX version

Quality assessment report

expname-qa.html: HTML document showing quality assesment of sequencing reads and alignments

Read alignments

sample_name.bam: Sorted BAM file with read alignments of sample_name
sample_name.bam.bai: BAM index file of sample_name.bam
sample_name-genome.bam: BAM file with read alignments of sample_name mapped to reference genome coordinates, used for UCSC Genome Browser track creation
sample_name_nofeat.bam: BAM file without gene features, generated when using --dumpother option
expname-mapstats.txt: Bowtie2 command and raw output statistics for each sample
expname-mapinfo.txt: Mapping statistics for each sample, used as input for expname-mapinfo.pdf
expname-mapinfo.pdf: Histogram of mapping rate for each sample

Summary of read coverage statistics

expname-typecounts.txt: Normalized read counts of small RNA types per sample
expname-typecounts.pdf: Normalized read distribution of small RNA types per sample
expname-typerealcounts.txt: Raw read counts of small RNA types per sample
expname-typerealcounts.pdf: Raw read distribution of small RNA types per sample
expname-aminocounts.txt: Normalized read counts of tRNA isotypes per sample
expname-aminocounts.pdf: Normalized read distribution of tRNA isotypes per sample
expname-combinedcoverage.pdf: Read coverage summary of tRNA isotypes per sample
expname-readlengths.txt: Counts of read lengths for tRNAs, pre-tRNAs, and other gene types per sample
expname-readlengths.pdf: Distribution of read lengths for tRNAs, pre-tRNAs, and other gene types per sample

Abundance of tRNA, tDRs, and other gene types

sample1_sample2-typescatter.pdf: Sample pair comparisons of normalized read counts for each gene type, including tRNA fragment type when --nofrag is not used
sample1_sample2-aminoscatter.pdf: Sample pair comparisons of normalized read counts for each tRNA fragment type when --nofrag is not used or for tRNAs when --nofrag is used
expname-readcounts.txt: Raw read counts for each transcript and tRNA fragment
expname-normalizedreadcounts.txt: Normalized read counts for each transcript and tRNA fragment
expname-coverage.txt: Per-base read coverage data of all tRNA isodecoders per sample, using Sprinzl tRNA positioning
expname-coverage.pdf: Per-base read coverage plot of all tRNA isodecoders per sample colored by specificity of mapping
expname-<isotype>_cov.pdf: Individual isotype per-base read coverage plots of tRNA isodecoders per sample colored by specificity of mapping
expname-SizeFactors.txt: DESeq2 size factors for each sample
expname-trnaendcounts.txt: Counts of tRNA "CCA" tail types

Differential expression comparison

expname-logvals.txt: Log2 fold change values of sample pairs generated by DESeq2
expname-padjs.txt: Adjusted P-values of sample pairs generated by DESeq2
expname-combine.txt: Tab-delimited file including adjusted P-values, log2 fold change values, and normalized read counts
sample1_sample2-volcano.pdf: Volcano plots of sample pair with log2 fold change values and adjusted P-values Top 10 results are labeled
sample1_sample2-volcano_tRNA.pdf: Volcano plots of sample pair with log2 fold change values and adjusted P-values of only tRNAs and tRNA fragments. Top 10 results are labeled
expname-pca.pdf: Principle component analysis of samples, based on normalized read counts of all gene features
expname-pcatrna.pdf: Principle component analysis of samples, based on normalized read counts of tRNAs and tDRs

Base mismatch estimation

mismatch/expname-trnapositionmismatches.pdf: Maximum mismimatch percentage among all samples for each tRNA position
mismatch/expname-trnapositionfiveprime.pdf: Maximum percentage of start position among all samples for each tRNA position
mismatch/expname-<isotype>_mismatch.pdf: Per-base coverage of mismatches for tRNA isodecoders
mismatch/expname-<isotype>_delete.pdf: Per-base coverage of deletion for tRNA isodecoders
mismatch/expname-<isotype>_fiveprimeends.pdf: Per-base coverage of 5' start position for tRNA isodecoders
mismatch/expname-<isotype>_mismatchheatmap.pdf: Heatmap of mismatches per base for tRNA isodecoders of specific isotype
mismatch/expname-<isotype>_deletionheatmap.pdf: Heatmap of deletions per base for tRNA isodecoders of specific isotype
mismatch/expname-<isotype>_threeprimeheatmap.pdf: Heatmap of 3' end position for tRNA isodecoders of specific isotype
mismatch/expname-<isotype>_fiveprimeheatmap.pdf: Heatmap of 5' start postion for tRNA isodecoders of specific isotype
mismatch/expname-<isotype>_fiveprimecounts.pdf: Distribution of 5' start position of all tRNA transcripts for each sample
mismatch/expname-<isotype>_threeprimecounts.pdf: Distribution of 3' end position of all tRNA transcripts for each sample
mismatch/expname-<isotype>_trnapositionmismatches.pdf: Maximum mismimatch percentage among all samples for each tRNA position by anticodon
mismatch/expname-<position>_posaminomismatches.pdf: Mismatch percentage at position per tRNA isotype
mismatch/expname-<position>_possamplemismatches.pdf: Mismatch percentage at position per sample
mismatch/expname-<position>_posaminodeletions.pdfs: Deletion percentage at position per tRNA isotype
mismatch/expname-<position>_possampledeletions.pdf: Deletion percentage at position per sample
mismatch/expname-<position>_possamplereadstarts.pdf: Read start percentage at position per sample

Pre-tRNA read coverage

pretRNAs/expname-pretRNAcombinedcoverage.pdf: Read coverage summary of tRNA isotypes per sample, with upstream and downstream regions
pretRNAs/expname-pretRNAcoverage.txt: Per-base read coverage data of all tRNA isodecoders per sample, with upstream and downstream regions
pretRNAs/expname-pretRNAcoverage.pdf: Per-base read coverage plot of all tRNA isodecoders per sample, with upstream and downstream regions

Unique read coverage

unique/expname-trnauniquecounts.txt: Counts of uniquely mapped reads for all transcripts per sample
unique/expname-<isotype>_uniqueonlycov.pdf: Individual isotype per-base unique read coverage plots of tRNA isodecoders per sample

Genome browser track hub

trackhub/<sample>.Plus.bw: Read coverage BigWig file of specific sample on forward strand
trackhub/<sample>.Minus.bw: Read coverage BigWig file of specific sample on reverse strand
trackhub/trackdb.txt: UCSC Genome Browser track information in track database definition format

gettranscripts.bash

gettranscripts.bash is a tool for generating transcript predictions from a bam file. It is intended to be used on the output from the --dumpother option of processamples.py after merging with samtools. This requires bedtools to be installed to function

Usage

gettranscripts.bash mergedoutput.bam 30 > output.bed

Arguments

First argument is the bam file generated by processamples.py with --dumpother option.
Second argument is the minimum number of transcripts.

Output

output.bed: BED file with possible unannotated transcripts found in the BAM file

Reference Manual

maketrnadb.py

Usage

Options

Outputs

trimadapters.py

Usage

Options

Input files

Raw sequencing files

trimadapters.py runfile

Outputs

processamples.py

Usage

Options

Input files

Trimmed sequencing files

Sample File

Sample pair file

Outputs

Analysis run info

Quality assessment report

Read alignments

Summary of read coverage statistics

Abundance of tRNA, tDRs, and other gene types

Differential expression comparison

Base mismatch estimation

Pre-tRNA read coverage

Unique read coverage

Genome browser track hub

gettranscripts.bash

Usage

Arguments

Output

trimadapters.py `runfile`