tDRnamer standalone software

The standalone version of tDRnamer is available for researchers who have large data sets and the resources and experience to work in a Linux/Unix environment.

Installation

System requirements

tDRnamer must be run on a Linux/Unix system with at least 8 cores and 16 GB of memory. If you are working with small RNA sequencing data, we do not recommend running tDRnamer on a regular desktop or laptop.
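As a rough way to compare a machine against these minimums, the sketch below reads the core count and total memory on Linux. The function name check_resources and the rounding rule are choices made for this example and are not part of tDRnamer.

```shell
#!/usr/bin/env bash
# Sketch: compare this machine against the stated minimum (8 cores, 16 GB).

# Print "ok" if both values meet the minimum, otherwise "insufficient".
# $1: number of cores; $2: total memory in kB
check_resources() {
    # Round to the nearest GB, because the kernel reports slightly less
    # than the installed memory (this rounding is a choice made here).
    mem_gb=$(( ($2 + 524288) / 1048576 ))
    if [ "$1" -ge 8 ] && [ "$mem_gb" -ge 16 ]; then
        echo "ok"
    else
        echo "insufficient"
    fi
}

# Linux only: nproc is part of GNU coreutils; MemTotal is given in kB.
if command -v nproc >/dev/null 2>&1 && [ -r /proc/meminfo ]; then
    check_resources "$(nproc)" "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
fi
```

The result is advisory only: some systems reserve part of the installed memory, so a machine sold as 16 GB may still be reported as insufficient.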

Using Docker Image

To avoid installing the dependencies yourself, you can download the Docker image from our DockerHub repository using the command

docker pull ucsclowelab/tdrnamer

Using Conda Environment

If you prefer to use conda, you can create the environment using the command

conda env create -f tdrnamer_env.yaml

Getting source code

The source code can be downloaded from GitHub at https://github.com/UCSC-LoweLab/tDRnamer. tDRnamer was developed with Python and Perl, and does not require compilation or installation.

To run tDRnamer from source code, the dependencies listed below must be installed.

Dependencies

Python 2.7 or higher
pysam Python library (latest version; older versions have a memory leak)
Bowtie2
NCBI BLAST+ 2.3 or higher
EMBOSS 6.6
Samtools 1.9 or higher
Infernal 1.1.2 or higher
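Before running from source, it may help to confirm that the dependencies are on the PATH. The sketch below is one way to do that. The executable names (python, bowtie2, blastn, samtools, cmsearch) are assumptions about how each package is typically installed, and EMBOSS is left out because this page does not say which of its programs tDRnamer calls.

```shell
#!/usr/bin/env bash
# Print the name of every command in the argument list not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed executable names: bowtie2 (Bowtie2), blastn (BLAST+),
# samtools (Samtools), cmsearch (Infernal).
missing_tools python bowtie2 blastn samtools cmsearch

# pysam is a Python library, so test it with an import instead.
if command -v python >/dev/null 2>&1; then
    python -c 'import pysam' 2>/dev/null || echo "pysam"
fi
```

An empty output means every checked tool was found. The check does not verify version numbers.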

Tutorial

Test run

To try out tDRnamer with small data sets, we provide the script test_run.bash. It downloads sample data, the GRCh38/hg38 reference genome, and GtRNAdb tRNA annotations from our server, builds the tDRnamer reference database, and performs five tDRnamer runs:
1. Search and annotate tDRs from ARM-seq sample data in FASTQ file format
2. Name and annotate tDR sequences provided in a FASTA file with default mode
3. Name and annotate tDR sequences provided in a FASTA file with maximum sensitivity mode
4. Name and annotate tDR sequences provided in a FASTA file, including nucleotide variations if any exist
5. Search and annotate tDR sequences from provided tDR names

The ARM-seq sample data were described in the following publication.

Cite

Cozen AE, Quartley E, Holmes AD, et al. (2015) ARM-seq: AlkB-facilitated RNA methylation sequencing reveals a complex landscape of modified tRNA fragments. Nature Methods 12:879–884.

The test run takes approximately five minutes to complete, and sample outputs can be downloaded from our server for comparison.

Components

tDRnamer contains two main tools:

create_tDRnamer_db - Build a reference database for naming tDRs or finding tDR sequences
tDRnamer - Naming/annotating tDRs or finding tDR sequences

Data needed as inputs

For naming tDRs from sequences
- Preprocessed sequencing data from an Illumina platform in FASTQ file format, or
- tDR sequences in FASTA file format
For finding tDR sequences from tDR names
- A single-column text file containing tDR names in the defined format
tRNAscan-SE outputs for the target genome, downloaded from GtRNAdb
Genome sequence FASTA file (eukaryotic genomes can be downloaded from the UCSC Genome Browser)

Note

The chromosome names must be the same across all the input files. For example, if chr1 is used for chromosome 1 in the tRNA annotations, the same chromosome name must be used in the genome sequence FASTA file. If the genome sequence file is obtained from NCBI, Ensembl, or ENA, the chromosome names in the FASTA file have to be updated to match the tRNA annotations.
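One possible way to update the chromosome names is to rewrite the FASTA headers from a two-column map of old and new names. The sketch below assumes a tab-separated map file that you prepare yourself. The function name rename_fasta_headers and all file names are hypothetical and not part of tDRnamer.

```shell
#!/usr/bin/env bash
# Rewrite FASTA sequence IDs using a tab-separated map (old name, new name).
# $1: map file; $2: genome FASTA file. The result is written to stdout.
rename_fasta_headers() {
    awk -F'\t' '
        NR == FNR { map[$1] = $2; next }   # first file: load the name map
        /^>/ {
            # Split the header into the sequence ID and any description.
            id = substr($0, 2); rest = ""
            sp = index(id, " ")
            if (sp > 0) { rest = substr(id, sp); id = substr(id, 1, sp - 1) }
            if (id in map) id = map[id]    # IDs not in the map are kept
            print ">" id rest
            next
        }
        { print }                          # sequence lines pass through
    ' "$1" "$2"
}

# Example call (hypothetical file names):
# rename_fasta_headers name_map.tsv genome_ncbi.fa > genome.fa
```

For a genome downloaded from NCBI, a map line could pair the RefSeq accession NC_000001.11 with chr1. The map file must not be empty, otherwise awk treats the FASTA file as the map.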

How to Run

Step 1: Build custom reference database

Before naming tDRs or searching for tDR sequences, a custom reference database has to be built. Pre-built databases for model organisms have been made available for download here.

Reference databases can also be built using the create_tDRnamer_db tool.

create_tDRnamer_db --db dbname --genome genome.fa --trna trnascan.out --ss trnascan.ss --namemap trna_name_map.txt --source source

dbname is the output directory and name that will be used for the reference database
genome.fa is a FASTA file of the reference genome
trnascan.out is the output file generated by tRNAscan-SE and can be downloaded from GtRNAdb
trnascan.ss is the secondary structure file generated by tRNAscan-SE and can be downloaded from GtRNAdb
trna_name_map.txt is the map file that converts the tRNAscan-SE IDs to GtRNAdb gene symbols. It is also included in the GtRNAdb downloaded tarball.
source is the sequence source of the reference and can be euk for eukaryotes (default), bact for bacteria, or arch for archaea.

If create_tDRnamer_db is run from within the tDRnamer source directory, include the path where the reference database should be created as part of dbname, for example, /db_path/hg38.
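Because chromosome names must agree across the input files, it can be useful to compare them before building the database. The sketch below lists sequence names that appear in the tRNAscan-SE output but not in the genome FASTA file. It assumes the tabular output starts with three header lines and carries the sequence name in the first column. The function name missing_chroms is hypothetical.

```shell
#!/usr/bin/env bash
# List sequence names used in the tRNAscan-SE output that are absent
# from the genome FASTA file.
# $1: tRNAscan-SE .out file; $2: genome FASTA file
missing_chroms() {
    awk '
        # First file (FASTA): remember each sequence ID from the headers.
        NR == FNR { if (/^>/) seen[substr($1, 2)] = 1; next }
        # Second file (.out): skip the three assumed header lines, then
        # report each unknown sequence name once.
        FNR > 3 && !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$2" "$1"
}

# Example call (file names as in the command above):
# missing_chroms trnascan.out genome.fa
```

An empty output suggests the names are consistent. Any name printed needs to be reconciled before running create_tDRnamer_db.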

Step 2: Naming tDRs or finding tDR sequences

Naming tDRs from sequences

Researchers can provide a FASTA file with possible tDR sequences as input. Alternatively, preprocessed small RNA-seq data in a FASTQ file can be supplied. Raw sequencing data must first be preprocessed to remove sequencing adapters and merge paired-end reads into single-end reads. trimadapters.py in the tRAX software package can be used for this purpose. Gzip-compressed files can be used.

Note

Both the forward and reverse strands of input sequences are searched.

To start the tDR naming process, run the following command:

tDRnamer --mode seq --seq tdrs --db dbname --source source --output output_dir/prefix

tdrs is the input FASTA or FASTQ file
dbname is the directory and name of the reference database generated by create_tDRnamer_db
source is the sequence source of the tDRs and can be euk for eukaryotes (default), bact for bacteria, or arch for archaea.
output_dir/prefix is the directory and prefix for output files
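When several samples need to be named, the command above can be generated once per input file. The sketch below only prints the commands (a dry run) so they can be reviewed before execution. The database path, output directory, and function name build_cmd are placeholders chosen for this example, and deriving the output prefix from the file name is a convention assumed here.

```shell
#!/usr/bin/env bash
# Placeholder settings; adjust to your own database and output location.
db="/db_path/hg38"
src="euk"
outdir="tdr_results"

# Print the tDRnamer command for one input file.
# The output prefix is the file name without .gz and its last extension.
build_cmd() (
    input=$1
    base=$(basename "$input")
    base=${base%.gz}
    base=${base%.*}
    printf 'tDRnamer --mode seq --seq %s --db %s --source %s --output %s/%s\n' \
        "$input" "$db" "$src" "$outdir" "$base"
)

# Print one command per file given on the command line.
for f in "$@"; do
    build_cmd "$f"
done
```

Saved as a script and called with a list of FASTQ or FASTA files, this prints one command per file. File names containing spaces would need quoting before the printed commands are executed.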

Finding tDR sequences by names

The input is a single-column text file of tDR names without a column header.

Example

tDR-31:76-Asp-GTC-2-G15C
tDR-38:76-Gln-CTG-1-D15U
tDR-1:41-Lys-CTT-1
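A names file can be given a basic sanity check before it is passed to tDRnamer. The sketch below flags lines that have more than one column, are blank, or do not begin with the tDR- prefix seen in the example. It does not validate the full naming format, and the function name check_tdr_names is hypothetical.

```shell
#!/usr/bin/env bash
# Report lines that are not a single column starting with "tDR-".
# $1: text file of tDR names. Exit status is 1 if any line is flagged.
check_tdr_names() {
    awk '
        NF != 1 || $0 !~ /^tDR-/ {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example call (hypothetical file name):
# check_tdr_names tdr_names.txt && echo "names file looks fine"
```

A header line such as "name" would be reported as line 1, which matches the requirement that the file has no column header.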

To find and annotate tDR sequences, run the following command:

tDRnamer --mode name --name tdrs --db dbname --source domain --output output_dir/prefix

tdrs is the input single-column text file with tDR names
dbname is the directory and name of the reference database generated by create_tDRnamer_db
domain is the sequence source of the tDRs and can be euk for eukaryotes (default), bact for bacteria, or arch for archaea.
output_dir/prefix is the directory and prefix for output files