Intro
A research paper published on bioRxiv described a new coronavirus subgenus, and I wanted to find out whether its protease has changed. However, the sequence data has not been published.
Fortunately, a similar sequence is available on NCBI; unfortunately, only as RNA-seq data.
So I need to assemble the RNA-seq reads first, then BLAST my sequence of interest against the assembly.
TL;DR
Set up the environment with conda:
    conda create -n sra_env -y
    conda activate sra_env
    conda install -c bioconda -y sra-tools trinity transdecoder blast fastp fastqc

Fetch the data:
    prefetch SRR11301086
    fasterq-dump SRR11301086
    # "logs" is included because the fastp step below tees its output into it
    mkdir -p analysis_results/SRR11301086/{raw_fastqc,clean_fastqc,fastp,trinity,transdecoder,blast_results,logs}

Data quality check
    fastqc SRR11301086_1.fastq SRR11301086_2.fastq \
        -o analysis_results/SRR11301086/raw_fastqc \
        -t 28 \
        2>&1 | tee analysis_results

Quality control using fastp
    fastp -i SRR11301086_1.fastq \
        -I SRR11301086_2.fastq \
        -o analysis_results/SRR11301086/fastp/SRR11301086_1.clean.fastq \
        -O analysis_results/SRR11301086/fastp/SRR11301086_2.clean.fastq \
        --qualified_quality_phred 20 \
        --length_required 50 \
        --thread 28 \
        --html analysis_results/SRR11301086/fastp/SRR11301086_fastp.html \
        --json analysis_results/SRR11301086/fastp/SRR11301086_fastp.json \
        2>&1 | tee analysis_results/SRR11301086/logs/SRR11301086_fastp.log

Data quality check (post-cleaning data)
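A minimal sketch of this step, assuming it mirrors the raw-data FastQC call but reads the fastp output and writes to the `clean_fastqc` directory. The function name `fastqc_clean` and the log file name are hypothetical, not from the original post.

```shell
# Hypothetical helper: run FastQC on the fastp-cleaned reads for one accession.
# Usage: fastqc_clean SRR11301086 [threads]
fastqc_clean() {
    local srr="$1"
    local threads="${2:-28}"
    local base="analysis_results/$srr"
    # Make sure the report and log directories exist before writing to them.
    mkdir -p "$base/clean_fastqc" "$base/logs"
    fastqc "$base/fastp/${srr}_1.clean.fastq" "$base/fastp/${srr}_2.clean.fastq" \
        -o "$base/clean_fastqc" \
        -t "$threads" \
        2>&1 | tee "$base/logs/${srr}_clean_fastqc.log"   # assumed log name
}
```

Call it as `fastqc_clean SRR11301086`, then compare the HTML reports in `clean_fastqc` with those in `raw_fastqc`.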
Assemble with Trinity
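A minimal sketch of a paired-end Trinity run on the cleaned reads. The function name `assemble_trinity`, the 50G memory cap, and the log file name are assumptions; the thread count follows the 28 used in the earlier steps.

```shell
# Hypothetical helper: de novo assembly of the fastp-cleaned paired-end reads.
# Usage: assemble_trinity SRR11301086 [threads] [max_memory]
assemble_trinity() {
    local srr="$1"
    local threads="${2:-28}"
    local mem="${3:-50G}"   # assumed memory cap; adjust to your machine
    local base="analysis_results/$srr"
    mkdir -p "$base/logs"
    # Trinity requires the output directory name to contain "trinity".
    Trinity --seqType fq \
        --left "$base/fastp/${srr}_1.clean.fastq" \
        --right "$base/fastp/${srr}_2.clean.fastq" \
        --CPU "$threads" \
        --max_memory "$mem" \
        --output "$base/trinity" \
        2>&1 | tee "$base/logs/${srr}_trinity.log"   # assumed log name
}
```

Recent Trinity releases write the final assembly next to the output directory as `<output>.Trinity.fasta`, which would give `analysis_results/SRR11301086/trinity.Trinity.fasta` here; older releases write `Trinity.fasta` inside the output directory, so check which one your version produces.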
Check the Trinity result:
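One quick way to sanity-check an assembly is to count the contigs and look at their lengths. The sketch below does that with plain `awk`; the function name `fasta_stats` is hypothetical, and this is not necessarily the check the original post used.

```shell
# Hypothetical helper: print contig count, total length and longest contig
# for a FASTA file (sequence lines may be wrapped).
# Usage: fasta_stats analysis_results/SRR11301086/trinity.Trinity.fasta
fasta_stats() {
    awk '
        /^>/ { if (len > max) max = len; n++; len = 0; next }
             { len += length($0); total += length($0) }
        END  { if (len > max) max = len
               printf "contigs\t%d\ntotal_bp\t%d\nlongest_bp\t%d\n", n, total, max }
    ' "$1"
}
```

Trinity also ships a `TrinityStats.pl` utility that reports N50 and related statistics, if it is on your `PATH`.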
BLAST sequence of interest
Put your sequence in query.fasta.
Make BLAST database and run:
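A minimal sketch of this step, assuming `query.fasta` holds a nucleotide sequence and the assembly sits at `analysis_results/<accession>/trinity.Trinity.fasta`. The function name `blast_assembly`, the database name, the e-value cutoff, and the output file name are all assumptions.

```shell
# Hypothetical helper: index the Trinity assembly and search it with a query.
# Usage: blast_assembly SRR11301086 query.fasta [threads]
blast_assembly() {
    local srr="$1"
    local query="$2"
    local threads="${3:-28}"
    local base="analysis_results/$srr"
    local db="$base/blast_results/trinity_db"   # assumed database name
    mkdir -p "$base/blast_results"
    # Build a nucleotide database, then search only if indexing succeeded.
    makeblastdb -in "$base/trinity.Trinity.fasta" -dbtype nucl -out "$db" &&
    blastn -query "$query" -db "$db" \
        -outfmt 6 -evalue 1e-5 -num_threads "$threads" \
        -out "$base/blast_results/${srr}_blastn.tsv"
}
```

If the query is a protein sequence (for example the protease itself), `tblastn` takes the same options and searches the nucleotide database in all six reading frames.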
Check the BLAST result:
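Assuming the search was written in BLAST tabular format (`-outfmt 6`), the best hits can be listed by sorting on the bit score in column 12. The function name `top_hits` is hypothetical.

```shell
# Hypothetical helper: show the best hits from BLAST tabular output (-outfmt 6).
# Columns: qseqid sseqid pident length mismatch gapopen
#          qstart qend sstart send evalue bitscore
# Usage: top_hits results.tsv [n]
top_hits() {
    # Sort on column 12 (bit score), highest first, and keep the first n rows.
    sort -t "$(printf '\t')" -k12,12nr "$1" | head -n "${2:-10}"
}
```

The second column of the top row is the ID of the assembled contig that best matches the query.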
Extract the sequence from trinity.Trinity.fasta:
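A minimal sketch for pulling one contig out of the assembly by its ID, using plain `awk`. The function name `extract_seq` is hypothetical, and the contig ID in the usage line is a placeholder, not a real hit.

```shell
# Hypothetical helper: print one FASTA record by its ID
# (the first word of the header line, without the leading ">").
# Usage: extract_seq TRINITY_DN0_c0_g1_i1 trinity.Trinity.fasta > hit.fasta
extract_seq() {
    awk -v id="$1" '
        /^>/ { keep = (substr($1, 2) == id) }   # exact match on the ID only
        keep                                    # print header and sequence lines
    ' "$2"
}
```

The ID is matched exactly, so a contig whose name merely starts with the requested ID is not printed by mistake.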
Tail
You can also BLAST against the predicted coding sequences:
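A minimal sketch of the prediction step, assuming it is done with TransDecoder (installed in the conda environment) on an assembly at `analysis_results/<accession>/trinity.Trinity.fasta`. The function name `predict_orfs` is hypothetical.

```shell
# Hypothetical helper: predict coding regions in the Trinity assembly.
# Usage: predict_orfs SRR11301086
predict_orfs() {
    local srr="$1"
    local base="analysis_results/$srr"
    local assembly
    assembly="$(pwd)/$base/trinity.Trinity.fasta"   # assumed assembly path
    mkdir -p "$base/transdecoder"
    # TransDecoder writes into the current directory, so run it in a
    # subshell to leave the caller's working directory unchanged.
    (
        cd "$base/transdecoder" &&
        TransDecoder.LongOrfs -t "$assembly" &&
        TransDecoder.Predict -t "$assembly"
    )
}
```

After a successful run the `transdecoder` directory should contain `trinity.Trinity.fasta.transdecoder.cds` (nucleotide) and `trinity.Trinity.fasta.transdecoder.pep` (protein) files.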
Make BLAST database and run:
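A minimal sketch of this step, assuming a nucleotide query and a TransDecoder CDS file at `analysis_results/<accession>/transdecoder/trinity.Trinity.fasta.transdecoder.cds`. The function name `blast_predicted`, the database name, the e-value cutoff, and the output file name are assumptions.

```shell
# Hypothetical helper: search the predicted coding sequences with a query.
# Usage: blast_predicted SRR11301086 query.fasta [threads]
blast_predicted() {
    local srr="$1"
    local query="$2"
    local threads="${3:-28}"
    local base="analysis_results/$srr"
    # Assumed location of the TransDecoder CDS output.
    local cds="$base/transdecoder/trinity.Trinity.fasta.transdecoder.cds"
    local db="$base/blast_results/transdecoder_cds_db"   # assumed database name
    mkdir -p "$base/blast_results"
    makeblastdb -in "$cds" -dbtype nucl -out "$db" &&
    blastn -query "$query" -db "$db" \
        -outfmt 6 -evalue 1e-5 -num_threads "$threads" \
        -out "$base/blast_results/${srr}_blastn_cds.tsv"
}
```

For a protein query, build the database from the `.pep` file with `-dbtype prot` and search it with `blastp` instead.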