Reference Manual
create_tDRnamer_db
create_tDRnamer_db creates the reference database that tDRnamer uses to name tDRs or to find tDR sequences by name. It uses Bowtie2 to build the indexes for sequence similarity search in the default search mode. A nucleotide BLAST database is also built with NCBI BLAST+ for the initial scan of the maximum sensitivity search mode. In addition, the tool uses Infernal to create tRNA sequence alignments for annotating the positions of tDRs. The tRNA annotations required as inputs can be downloaded from GtRNAdb or generated by running tRNAscan-SE. The chromosome/sequence names in the reference genome sequences must match the sequence names in the tRNA annotations. Eukaryotic genomes can be downloaded from the UCSC Genome Browser.
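Because the sequence names in the genome FASTA file and in the tRNA annotations must agree, it can help to compare them before building a database. The following is a minimal sketch, not part of tDRnamer. It assumes that the tRNAscan-SE *.out file is whitespace-delimited with the sequence name in the first column and header lines that start with "Sequence", "Name", or dashes; the function names and sample data are hypothetical.

```python
import io


def fasta_names(handle):
    """Collect sequence names (first word after '>') from a FASTA stream."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def trnascan_seq_names(handle):
    """Collect first-column sequence names from a tRNAscan-SE .out stream.

    Assumption: header lines start with 'Sequence', 'Name', or dashes.
    """
    names = set()
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("Sequence", "Name") or fields[0].startswith("-"):
            continue
        names.add(fields[0])
    return names


def missing_from_genome(genome_handle, trna_handle):
    """Return annotation sequence names that are absent from the genome."""
    return sorted(trnascan_seq_names(trna_handle) - fasta_names(genome_handle))


# Hypothetical in-memory data standing in for genome.fa and trnascan.out
genome = io.StringIO(">chr1 test\nACGT\n>chr2\nGGCC\n")
trna = io.StringIO(
    "Sequence\ttRNA\tBounds\n"
    "Name\ttRNA #\tBegin\n"
    "--------\t------\t-----\n"
    "chr1\t1\t100\t172\tGly\tGCC\t0\t0\t70.1\n"
    "chrM\t1\t5\t76\tPhe\tGAA\t0\t0\t55.0\n"
)
print(missing_from_genome(genome, trna))  # ['chrM']
```

An empty list suggests that every annotated sequence name is present in the genome file.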
By default, possible pseudogenes and predicted tRNA genes with undetermined isotype are excluded from the database. For multicellular eukaryotes with high confidence tRNA genes defined, only high confidence/scoring tRNAs and filtered genes that have a score >= 50 bits, an isotype model score >= 80 bits, and a consistent anticodon/isotype model are included. Researchers can skip the filtering step completely by using the --skipfilter option. The cutoffs for the tRNA score and the isotype model score can also be adjusted using the --score and --isoscore options.
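The filtering rule for multicellular eukaryotes can be illustrated with a small sketch. This is not the tool's implementation: the TrnaGene record, its field names, and the use of "Undet" as the label for an undetermined isotype are assumptions made for illustration, while the default cutoffs (50 and 80) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class TrnaGene:
    # Hypothetical record; field names are illustrative only
    name: str
    score: float            # tRNAscan-SE score (bits)
    isotype_score: float    # isotype model score (bits)
    anticodon_isotype: str  # isotype implied by the anticodon
    model_isotype: str      # isotype of the best-scoring isotype model
    pseudogene: bool = False


def keep_gene(gene, score_cutoff=50.0, isoscore_cutoff=80.0, skip_filter=False):
    """Decide whether a gene would pass the filter described above."""
    if skip_filter:
        return True  # mirrors --skipfilter: keep everything provided
    if gene.pseudogene or gene.anticodon_isotype == "Undet":
        return False
    return (
        gene.score >= score_cutoff
        and gene.isotype_score >= isoscore_cutoff
        and gene.anticodon_isotype == gene.model_isotype
    )


genes = [
    TrnaGene("gene-a", 70.1, 95.0, "Gly", "Gly"),
    TrnaGene("gene-b", 45.0, 90.0, "Ala", "Ala"),       # below score cutoff
    TrnaGene("gene-c", 60.0, 85.0, "Ile", "Met"),       # inconsistent isotype
    TrnaGene("gene-d", 80.0, 90.0, "Leu", "Leu", True),  # possible pseudogene
]

print([g.name for g in genes if keep_gene(g)])                   # ['gene-a']
print([g.name for g in genes if keep_gene(g, score_cutoff=40)])  # ['gene-a', 'gene-b']
print(len([g for g in genes if keep_gene(g, skip_filter=True)])) # 4
```

Lowering score_cutoff corresponds to passing a smaller value to --score, and skip_filter corresponds to --skipfilter.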
For more information regarding the classification of high confidence/scoring tRNA genes, please check out our tRNAscan-SE paper:
Cite
Chan PP, Lin BY, Mak AJ, and Lowe TM. (2021) tRNAscan-SE 2.0: Improved Detection and Functional Classification of Transfer RNA Genes. Nucleic Acids Res. 49:9077–9096.
Usage
create_tDRnamer_db --db dbname --genome genome.fa --trna trnascan.out --ss trnascan.ss --namemap trna_name_map.txt [--source source]
[--force] [--skipfilter] [--score score] [--isoscore score]
Options
- --db or -d : dbname (required)
Directory and database name that will be used for the reference database
- --genome or -g : genome.fa (required)
FASTA file of reference genome
- --trna or -t : trnascan.out (required)
tRNAscan-SE output file (*.out file in GtRNAdb downloaded tarball)
- --ss or -s : trnascan.ss (required)
tRNAscan-SE secondary structure output file (*.ss file in GtRNAdb downloaded tarball)
- --namemap or -n : trna_name_map.txt (required)
Map file that converts tRNAscan-SE IDs to GtRNAdb gene symbols (*_name_map.txt file in GtRNAdb downloaded tarball)
- --source or -r : source (optional)
Sequence source of reference
Default is euk for eukaryotes. Other values include arch for archaea and bact for bacteria.
- --force or -q (optional)
Force overwriting of output files if they exist.
- --skipfilter (optional)
Skip the filtering step to include all provided tRNAs in the database
- --score : score (optional)
tRNAscan-SE score cutoff for filtering multicellular eukaryotic tRNA genes (default = 50)
- --isoscore : score (optional)
Isotype model score cutoff for filtering multicellular eukaryotic tRNA genes (default = 80)
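When database creation is scripted, the command line can be assembled from the options listed above. The helper below is a hypothetical sketch that only builds and prints the argument list; it uses no flags beyond those documented here, and the file names are placeholders.

```python
import shlex


def build_create_db_command(db, genome, trna, ss, namemap, source="euk",
                            force=False, skipfilter=False,
                            score=None, isoscore=None):
    """Assemble a create_tDRnamer_db argument list from documented options."""
    if source not in ("euk", "arch", "bact"):
        raise ValueError("source must be one of: euk, arch, bact")
    cmd = [
        "create_tDRnamer_db",
        "--db", db,
        "--genome", genome,
        "--trna", trna,
        "--ss", ss,
        "--namemap", namemap,
        "--source", source,
    ]
    if force:
        cmd.append("--force")
    if skipfilter:
        cmd.append("--skipfilter")
    if score is not None:
        cmd += ["--score", str(score)]
    if isoscore is not None:
        cmd += ["--isoscore", str(isoscore)]
    return cmd


# Placeholder file names; substitute the files from a GtRNAdb tarball
cmd = build_create_db_command(
    "refdb/mygenome", "genome.fa", "trnascan.out", "trnascan.ss",
    "trna_name_map.txt", score=55,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a system where create_tDRnamer_db is installed.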
Outputs
The following files are generated upon completion:
- dbname-tRNAgenome.* : FASTA file of tRNA sequences with Bowtie2 indexes and BLAST database
- dbname-trnaalign.stk : Alignments of mature tRNA sequences in Stockholm file format
- dbname-trnaconvert.stk : Alignments of mature tRNA sequences in Stockholm file format
- dbname-trnaloci.stk : Alignment of tRNA gene sequences in Stockholm file format
- dbname-trnatable.txt : Tab-delimited file with tRNA transcripts and tRNA genes map
- dbname-maturetRNAs.fa : FASTA file of mature tRNA sequences
- dbname-maturetRNAs.bed : tRNA transcripts in BED file format
- dbname-tRNAloci.fa : FASTA file of tRNA gene sequences
- dbname-trnaloci.bed : tRNA genes in BED file format
- dbname-filtered-tRNAs.out : tRNAscan-SE output file format with filtered tRNA genes used for database creation
- dbname-dbinfo.txt : Database creation information
- dbname-create_tDRnamer_db.log : Database creation log file
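A quick way to confirm that a database build completed is to check for the files listed above. The sketch below is a hypothetical helper, not part of tDRnamer. It assumes that the value passed to --db (directory plus name) is used directly as the file prefix.

```python
import tempfile
from pathlib import Path

# Suffixes taken from the output list above
EXPECTED_SUFFIXES = [
    "-trnaalign.stk", "-trnaconvert.stk", "-trnaloci.stk", "-trnatable.txt",
    "-maturetRNAs.fa", "-maturetRNAs.bed", "-tRNAloci.fa", "-trnaloci.bed",
    "-filtered-tRNAs.out", "-dbinfo.txt", "-create_tDRnamer_db.log",
]


def missing_db_files(db):
    """Return the expected database file names that are not present."""
    db_path = Path(db)
    parent, name = db_path.parent, db_path.name
    missing = [name + suffix for suffix in EXPECTED_SUFFIXES
               if not (parent / (name + suffix)).is_file()]
    # dbname-tRNAgenome.* covers the FASTA file, Bowtie2 indexes, and BLAST files
    if not any(parent.glob(name + "-tRNAgenome.*")):
        missing.append(name + "-tRNAgenome.*")
    return missing


# Demonstration in a temporary directory with a single file present
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "mydb-dbinfo.txt").touch()
    missing = missing_db_files(Path(tmp) / "mydb")
    print("mydb-dbinfo.txt" in missing)    # False
    print("mydb-tRNAgenome.*" in missing)  # True
```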
tDRnamer
tDRnamer is the main tool that annotates tDRs. When naming tDRs by sequence, it aligns the input sequences to the reference database using Bowtie2, computes tDR positions relative to the source tRNAs, assigns names to the tDRs, and groups tDRs with their source tRNAs based on the alignments. Both the forward and reverse strands of the input sequences are searched. When tDR names are provided as input, the tool searches for the corresponding tDR sequences in the reference database and annotates the tDRs with the identified sequences. tDRs derived from both mature tRNAs and precursor tRNAs are identified when the sequence source is set to euk (eukaryotes). Only tDRs derived from mature tRNAs are identified when the sequence source is set to bact (bacteria) or arch (archaea).
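Searching both strands means that an input sequence can match the reference either as given or as its reverse complement. The aligner handles this during the search; the sketch below only illustrates what the reverse strand of an input sequence looks like and is not taken from the tDRnamer source. Treating U like T and leaving N unchanged are assumptions.

```python
# Complement table for DNA, with U treated like T (assumption)
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence as DNA."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("GCATTGGTGG"))  # CCACCAATGC
```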
Usage
tDRnamer --mode mode [--seq filename or --name filename] --db dbname --output output_dir/prefix [--source source] [--force] [--max] [--var] [--minread reads] [--minlen length] [--maxlen length] [--maxmismatch percentage] [--cores cores]
Options
- --mode or -m : mode (required)
tDRnamer search mode
Default is seq, search by sequences. The other value is name, search by tDR names.
- --seq or -s : filename (required)
Input sequence file, only applicable with --mode as seq
Can be a FASTA file with possible tDR sequences or a FASTQ file with preprocessed small RNA-seq reads. Gzip compressed files are supported.
- --name or -n : filename (required)
Input tDR name file, only applicable with --mode as name
Single-column text file without column header
- --db or -d : dbname (required)
Directory and name of reference database generated by create_tDRnamer_db
- --output or -o : output_dir/prefix (required)
Directory and prefix for output files
- --source or -r : source (optional)
Sequence source of reference
Default is euk for eukaryotes. Other values include arch for archaea, bact for bacteria, and mito for mitochondria. Please note that mito can only be used with the pre-built mitochondrial genome reference databases available for download.
- --force or -q (optional)
Force overwriting of output files if they exist.
- --max (optional)
Search with maximum sensitivity (slowest speed)
- --var (optional)
Include nucleotide variation (if it exists) as part of a tDR name
Only applicable with --mode as seq
- --minread : reads (optional)
Minimum number of identical sequencing reads to be considered as a possible tDR (default = 10)
Only applicable with --mode as seq and --seq as FASTQ file
- --minlen : length (optional)
Minimum sequence length (nt) to be considered as a tDR (default = 15)
To skip the minimum sequence length constraint, specify the value as 0.
- --maxlen : length (optional)
Maximum sequence length (nt) to be considered as a tDR (default = 70)
Only applicable with --mode as seq
To skip the maximum sequence length constraint, specify the value as 0.
- --maxmismatch : percentage (optional)
Maximum percentage of mismatches by sequence length (default = 10)
Only applicable with --max option
Maximum acceptable value is 20
- --cores : cores (optional)
Number of processing cores to be used for sequence search (default = 4)
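Several of the options above depend on one another: --seq goes with --mode seq, --name goes with --mode name, and --maxmismatch applies only with --max and may not exceed 20. The following hypothetical helper sketches how a wrapper script might check these constraints before assembling a command. It uses only the documented flags; the checks are an illustration and not the tool's own validation code, and the file names are placeholders.

```python
import shlex

VALID_SOURCES = ("euk", "arch", "bact", "mito")


def build_tdrnamer_command(mode, infile, db, output, source="euk",
                           max_sensitivity=False, maxmismatch=None,
                           cores=None):
    """Validate option combinations and assemble a tDRnamer argument list."""
    if mode not in ("seq", "name"):
        raise ValueError("mode must be 'seq' or 'name'")
    if source not in VALID_SOURCES:
        raise ValueError("source must be one of: " + ", ".join(VALID_SOURCES))
    if maxmismatch is not None:
        if not max_sensitivity:
            raise ValueError("--maxmismatch is only applicable with --max")
        if not 0 <= maxmismatch <= 20:
            raise ValueError("--maxmismatch must be between 0 and 20")
    # The input flag follows the search mode
    input_flag = "--seq" if mode == "seq" else "--name"
    cmd = ["tDRnamer", "--mode", mode, input_flag, infile,
           "--db", db, "--output", output, "--source", source]
    if max_sensitivity:
        cmd.append("--max")
    if maxmismatch is not None:
        cmd += ["--maxmismatch", str(maxmismatch)]
    if cores is not None:
        cmd += ["--cores", str(cores)]
    return cmd


# Placeholder paths for illustration
cmd = build_tdrnamer_command("seq", "reads.fastq.gz", "refdb/mygenome",
                             "results/sample1", max_sensitivity=True,
                             maxmismatch=15, cores=8)
print(shlex.join(cmd))
```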
Input files
Sequence file
Researchers can provide a FASTA file with possible tDR sequences as input. Alternatively, preprocessed small RNA-seq data in a FASTQ file can be supplied. Raw sequencing data must be preprocessed to remove sequencing adapters and to merge paired-end reads into single-end reads. trimadapters.py in the tRAX software package can be used for this purpose. Gzip compressed files can be used. Please check out test_run.bash, obtainable with the source code, for examples.
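For FASTQ input, identical reads are counted, and the --minread option sets how many identical reads are needed for a sequence to be considered a possible tDR. The sketch below illustrates that idea on preprocessed reads. It is not the tool's implementation, and it assumes plain four-line FASTQ records; the function names are hypothetical.

```python
import gzip
import io
from collections import Counter


def open_text(path):
    """Open a plain or gzip compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)


def read_fastq_sequences(handle):
    """Yield the sequence line of each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def collapse_reads(handle, minread=10):
    """Count identical reads and keep those seen at least minread times."""
    counts = Counter(read_fastq_sequences(handle))
    return {seq: n for seq, n in counts.items() if n >= minread}


# Small in-memory example: three identical reads and one singleton
reads = ["GCATTGGTGGTTCAGTGG"] * 3 + ["TTTTTTTTTTTTTTTT"]
fastq = "".join(
    "@read%d\n%s\n+\n%s\n" % (i, seq, "I" * len(seq))
    for i, seq in enumerate(reads)
)
print(collapse_reads(io.StringIO(fastq), minread=2))
# {'GCATTGGTGGTTCAGTGG': 3}
```

For a file on disk, collapse_reads(open_text("reads.fastq.gz")) would apply the default threshold of 10.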
tDR name file
A single-column text file without a column header that contains tDR names is used as input. An example file, ExampleNames.txt, can be downloaded from here.
Outputs
tDR annotations
- prefix-tDR.fa : FASTA file with tDR names and sequences
- prefix-tDR-info.txt : Tab-delimited file with tDR annotations including tDR names and sequences, source tRNAs, Sprinzl positions of tDRs relative to source tRNAs, original tRNA isotype and anticodon, sequence variation counts, and group ID
- prefix-tDR-groups.txt : Text file containing queried tDRs that are grouped together by sequence alignments. Alignments are arranged in Stockholm format that includes primary sequence and secondary structure information. Details about Stockholm format can be found in the Infernal User Guide.
- prefix-found-seq.fa : FASTA file with tDR sequences when searching by tDR names. This file is generated during the initial round of sequence search before the annotation process.
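The FASTA outputs can be loaded for downstream analysis with a few lines of code. The reader below is a generic sketch that assumes the first word of each header line is the record name; the record names in the example are placeholders, not real tDR names.

```python
import io


def read_fasta(handle):
    """Read FASTA records into a dict mapping name to sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join multi-line sequences into single strings
    return {n: "".join(parts) for n, parts in records.items()}


# Placeholder records standing in for the contents of prefix-tDR.fa
example = io.StringIO(">name_A\nGCATTGGTGG\nTTCAGTGG\n>name_B\nTCCCTGGTGG\n")
print(read_fasta(example))
# {'name_A': 'GCATTGGTGGTTCAGTGG', 'name_B': 'TCCCTGGTGG'}
```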
tDR alignments
- prefix-tDRs.stk : Alignments of identified tDRs derived from mature tRNAs with reference tRNA sequences in Stockholm file format
- prefix-pre-tDRs.stk : Alignments of identified tDRs derived from precursor tRNAs with reference tRNA sequences in Stockholm file format. This file is only generated when --source is euk.
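Stockholm alignments are plain text and can be inspected with a small reader. The sketch below handles a single alignment block: it collects the aligned sequences and the "#=GC SS_cons" consensus structure line and ignores other markup. It is a simplified illustration; the Infernal User Guide describes the full format, and the record names and sequences in the example are made up.

```python
import io


def read_stockholm(handle):
    """Read aligned sequences and SS_cons from one Stockholm alignment."""
    seqs = {}
    ss_cons = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("//"):
            continue  # blank line or end-of-alignment marker
        fields = line.split()
        if line.startswith("#=GC SS_cons"):
            if len(fields) >= 3:
                ss_cons.append(fields[2])
        elif line.startswith("#"):
            continue  # header and other annotation lines
        elif len(fields) >= 2:
            # Interleaved blocks are concatenated per sequence name
            seqs[fields[0]] = seqs.get(fields[0], "") + fields[1]
    return seqs, "".join(ss_cons)


# Made-up alignment for illustration
example = io.StringIO(
    "# STOCKHOLM 1.0\n"
    "ref_tRNA      GCAUUGGUGG\n"
    "query_1       GCAUUGG...\n"
    "#=GC SS_cons  <<<<..>>>>\n"
    "//\n"
)
seqs, structure = read_stockholm(example)
print(seqs["query_1"])  # GCAUUGG...
print(structure)        # <<<<..>>>>
```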
Other output files
- prefix-unique-seq.fa : FASTA file with unique sequences in the provided FASTQ file
- prefix-filtered-seq.fa : FASTA file with sequences after filtering by length constraint and error checking. This is only generated when --mode is seq.
- prefix-reformatted-seq.fa : FASTA file with sequences after converting RNA sequences to DNA sequences if applicable
- prefix-filtered-names.txt : List of tDR names after filtering by error checking. This is only generated when --mode is name.
- prefix-tDR-list.txt : Intermediate file generated during the tDR annotation process
- prefix-clusters.txt : Intermediate file generated during the tDR grouping process
- prefix-pre-clusters.txt : Intermediate file generated during the tDR grouping process. This file is only generated when --source is euk.
- prefix-tDRs.sam : Intermediate alignment file generated during the tDR searching/annotation process with the --max option.
- prefix-find-tdrs.log : Intermediate log file for searching tDR sequences by names. This is only generated when --mode is name.
- prefix_tDRnamer.log : Log file of the tDRnamer run
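The filtered and reformatted sequence files reflect two simple steps: applying the --minlen and --maxlen constraints (where 0 disables a bound) and converting RNA sequences to DNA. The sketch below illustrates both steps with the documented defaults. It is not the tool's implementation; treating the bounds as inclusive and upper-casing the sequence are assumptions.

```python
def to_dna(seq):
    """Convert an RNA sequence to DNA by replacing U with T."""
    return seq.upper().replace("U", "T")


def passes_length(seq, minlen=15, maxlen=70):
    """Apply length constraints; a value of 0 disables that bound."""
    if minlen and len(seq) < minlen:
        return False
    if maxlen and len(seq) > maxlen:
        return False
    return True


candidates = ["GCAUUGGUGGUUCAGUGG", "GCAUUGG", "A" * 80]
kept = [to_dna(s) for s in candidates if passes_length(s)]
print(kept)  # ['GCATTGGTGGTTCAGTGG']

# Disabling both bounds keeps every sequence
print(len([s for s in candidates if passes_length(s, minlen=0, maxlen=0)]))  # 3
```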