De novo assemble RNA-seq sequence
Intro
A research paper published on biorxiv determined a new coronavirus subgenus, I would like to figure out is there any changes on protease. However, the sequence data has not been publish.
Fortunately, the similar sequence is do available on NCBI, unfortunately, only RNA-seq data is available.
So I need to assemble the RNA-seq reads first, and BLAST the sequence I need with the assembled data.
TL;DR
Setup the environment with conda:
conda create -n sra_env -y conda activate sra_env conda install -c bioconda -y sra-tools trinity transdecoder blast fastp fastqc
Fetch the data:
prefetch SRR11301086 fasterq-dump SRR11301086 mkdir -p analysis_results/SRR11301086/{raw_fastqc,clean_fastqc,fastp,trinity,transdecoder,blast_results}
Data quality check
fastqc SRR11301086_1.fastq SRR11301086_2.fastq \ -o analysis_results/SRR11301086/raw_fastqc \ -t 28 \ 2>&1 | tee analysis_results
Quality control using fastp
fastp -i SRR11301086_1.fastq \ -I SRR11301086_2.fastq \ -o analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \ -O analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \ --qualified_quality_phred 20 \ --length_required 50 \ --thread 28 \ --html analysis_results/SRR11301086/fastp/SRR11301086_fastp.html \ --json analysis_results/SRR11301086/fastp/SRR11301086_fastp.json \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_fastp.log
Data quality check (post-cleaning data)
fastqc analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \ analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \ -o analysis_results/SRR11301086/clean_fastqc \ -t 28 \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_fastqc_cleaned.log
Assemble with Trinity
nohup Trinity --seqType fq \ --left analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \ --right analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \ --CPU 28 --max_memory 48G \ --output analysis_results/SRR11301086/trinity \ 2>&1 > analysis_results/SRR11301086/logs/SRR11301086_trinity.log & mv analysis_results/SRR11301086/trinity.Trinity.fasta* analysis_results/SRR11301086/trinity/
Check the Trinity result:
TrinityStats.pl analysis_results/SRR11301086/trinity/trinity.Trinity.fasta \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_trinity_stats.txt
BLAST sequence of interest
Put your sequence in query.fasta.
vi query.fasta
Make BLAST database and run:
makeblastdb -in analysis_results/SRR11301086/trinity/trinity.Trinity.fasta -dbtype nucl -out analysis_results/SRR11301086/trinity/SRR11301086_nuc_db tblastn -query query.fasta \ -db analysis_results/SRR11301086/trinity/SRR11301086_nuc_db \ -out analysis_results/SRR11301086/blast_results/SRR11301086_nuc_blast.txt \ -evalue 1e-5 \ -word_size 2 \ -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq" \ -num_threads 28
Check the BLAST result:
cat analysis_results/SRR11301086/blast_results/SRR11301086_nuc_blast.txt
Extract the sequence from
trinity.Trinity.fasta
# Build index samtools faidx analysis_results/SRR11301086/trinity/trinity.Trinity.fasta # Extract the sequence samtools faidx analysis_results/SRR11301086/trinity/trinity.Trinity.fasta TRINITY_DN284_c0_g1_i7 > analysis_results/SRR11301086/blast_results/SRR11301086_TTRINITY_DN284_c0_g1_i7.fasta
Tail
You can also blast with the Predicted sequence:
TransDecoder.LongOrfs -t analysis_results/SRR11301086/trinity/trinity.Trinity.fasta\ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_transdecoder_long.log TransDecoder.Predict -t analysis_results/SRR11301086/trinity/trinity.Trinity.fasta \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_transdecoder_predict.log mv *.transdecoder* analysis_results/SRR11301086/transdecoder/
Make BLAST database and run:
makeblastdb -in analysis_results/SRR11301086/transdecoder/trinity.Trinity.fasta.transdecoder.pep \ -dbtype prot \ -out analysis_results/SRR11301086/transdecoder/SRR11301086_db \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_makeblastdb.log blastp -query query.fasta \-db analysis_results/SRR11301086/transdecoder/SRR11301086_db \ -out analysis_results/SRR11301086/blast_results/SRR11301086_prot_blast.txt \ -evalue 1e-5 \ -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq" \ -num_threads 28 \ 2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_blast.log