A research paper published on biorxiv determined a new coronavirus subgenus, I would like to figure out is there any changes on protease. However, the sequence data has not been publish.
Fortunately, the similar sequence is do available on NCBI, unfortunately, only RNA-seq data is available.
So I need to assemble the RNA-seq reads first, and BLAST the sequence I need with the assembled data.
Setup the environment with conda:
conda create -n sra_env -y
conda activate sra_env
conda install -c bioconda -y sra-tools trinity transdecoder blast fastp fastqc
Fetch the data:
prefetch SRR11301086
fasterq-dump SRR11301086
mkdir -p analysis_results/SRR11301086/{raw_fastqc,clean_fastqc,fastp,trinity,transdecoder,blast_results}
Data quality check
fastqc SRR11301086_1.fastq SRR11301086_2.fastq \
-o analysis_results/SRR11301086/raw_fastqc \
-t 28 \
2>&1 | tee analysis_results
Quality control using fastp
fastp -i SRR11301086_1.fastq \
-I SRR11301086_2.fastq \
-o analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \
-O analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \
--qualified_quality_phred 20 \
--length_required 50 \
--thread 28 \
--html analysis_results/SRR11301086/fastp/SRR11301086_fastp.html \
--json analysis_results/SRR11301086/fastp/SRR11301086_fastp.json \
2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_fastp.log
Data quality check (post-cleaning data)
fastqc analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \
analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \
-o analysis_results/SRR11301086/clean_fastqc \
-t 28 \
2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_fastqc_cleaned.log
Assemble with Trinity
Check the Trinity result:
BLAST sequence of interest
Put your sequence in query.fasta.
Make BLAST database and run:
Check the BLAST result:
Extract the sequence from trinity.Trinity.fasta
You can also blast with the Predicted sequence:
Make BLAST database and run:
nohup Trinity --seqType fq \
--left analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \
--right analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \
--CPU 28 --max_memory 48G \
--output analysis_results/SRR11301086/trinity \
2>&1 > analysis_results/SRR11301086/logs/SRR11301086_trinity.log &
mv analysis_results/SRR11301086/trinity.Trinity.fasta* analysis_results/SRR11301086/trinity/
TrinityStats.pl analysis_results/SRR11301086/trinity/trinity.Trinity.fasta \
2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_trinity_stats.txt
vi query.fasta
makeblastdb -in analysis_results/SRR11301086/trinity/trinity.Trinity.fasta -dbtype nucl -out analysis_results/SRR11301086/trinity/SRR11301086_nuc_db
tblastn -query query.fasta \
-db analysis_results/SRR11301086/trinity/SRR11301086_nuc_db \
-out analysis_results/SRR11301086/blast_results/SRR11301086_nuc_blast.txt \
-evalue 1e-5 \
-word_size 2 \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq" \
-num_threads 28
cat analysis_results/SRR11301086/blast_results/SRR11301086_nuc_blast.txt
# Build index
samtools faidx analysis_results/SRR11301086/trinity/trinity.Trinity.fasta
# Extract the sequence
samtools faidx analysis_results/SRR11301086/trinity/trinity.Trinity.fasta TRINITY_DN284_c0_g1_i7 > analysis_results/SRR11301086/blast_results/SRR11301086_TTRINITY_DN284_c0_g1_i7.fasta
TransDecoder.LongOrfs -t analysis_results/SRR11301086/trinity/trinity.Trinity.fasta\
2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_transdecoder_long.log
TransDecoder.Predict -t analysis_results/SRR11301086/trinity/trinity.Trinity.fasta \
2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_transdecoder_predict.log
mv *.transdecoder* analysis_results/SRR11301086/transdecoder/
makeblastdb -in analysis_results/SRR11301086/transdecoder/trinity.Trinity.fasta.transdecoder.pep \
-dbtype prot \
-out analysis_results/SRR11301086/transdecoder/SRR11301086_db \
2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_makeblastdb.log
blastp -query query.fasta \-db analysis_results/SRR11301086/transdecoder/SRR11301086_db \
-out analysis_results/SRR11301086/blast_results/SRR11301086_prot_blast.txt \
-evalue 1e-5 \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qseq sseq" \
-num_threads 28 \
2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_blast.log