How to Make Blast Results Easy to View

Background

Identifying and aligning similar DNA and protein sequences is one of the most common tasks in computational biology, and BLAST (Basic Local Alignment Search Tool) still represents one of the most widely used and effective tools. While many users are familiar with running BLAST searches through the NCBI webserver, it is also possible and often preferable to download BLAST software and databases and run searches on a local computer or server. Advantages of running local BLAST include…

The ability to use customized databases including your own unpublished genome and transcriptome sequences that are not available on GenBank
The ability to submit searches for thousands of query sequences simultaneously (e.g., all contigs from a de novo transcriptome assembly)
Integration into custom scripts and pipelines

There are numerous specialized "flavors" of BLAST, but the following five programs represent the classic and most widely used methods. They differ based on whether the inputs are nucleotide or amino acid sequences and whether the alignments are based on nucleotide or (translated) amino acid sequences.

Program	Query Sequences	Database Sequences	Alignment
blastn	nucleotide	nucleotide	nucleotide
blastp	amino acid	amino acid	amino acid
blastx	nucleotide	amino acid	amino acid*
tblastn	amino acid	nucleotide	amino acid*
tblastx	nucleotide	nucleotide	amino acid*

*blastx, tblastn, and tblastx translate the input query and/or database sequences in all six possible reading frames before performing alignment.

Objectives

The goal of this exercise will be to make sequence comparisons between entire sets of proteins and entire genomes of closely related bacteria. This will involve generating databases and running searches with local BLAST, parsing the output with BioPerl scripts, and plotting data with R.

Software and Dependencies

BLAST+ Executables (installed and in your PATH)
Custom Perl scripts (BLASTN_parse.pl and BLASTP_parse.pl)
R
BBEdit or a text editor of your choice
Cyberduck or file transfer software of your choice
Microsoft Excel

[Note: If you do not have local BLAST and/or BioPerl installed, intermediate files are stored in the ~/TodosSantos/local_blast/prerun/ directory so that you can follow along.]

Protocol

1. Make BLAST databases

Prior to running a local BLAST search, you must first download or create a BLAST database. Familiar databases like "nr" or "nt" can be downloaded directly from NCBI for use in local searches, but you can also create a custom BLAST database from any input file in FASTA format. In this exercise, we will make two BLAST databases. One will be made from the entire set of protein sequences from a particular strain of Escherichia coli, and the other will consist of a single nucleotide sequence of the entire E. coli genome.

First, open a Terminal session and connect to your account on our workshop server. Once you have done that, enter the following command to move into the directory with the relevant files for this exercise.

            cd ~/TodosSantos/local_blast/

The directory contains FASTA files with E. coli protein sequences (Ecoli.proteins.fas) and the E. coli genome sequence (Ecoli.genome.fas). To convert these files into BLAST databases run the following commands. As you can see, the -dptype flag is required so that you can specify whether the sequences are nucleotide (nucl) or amino acid (prot) sequences.

            makeblastdb -in Ecoli.proteins.fas -dbtype prot  makeblastdb -in Ecoli.genome.fas -dbtype nucl

Enter the ls command and note that you have now generated three files with additional extensions for each of the FASTA input files. These are the actual BLAST database files that will be used to run searches.

Ecoli.genome.fas
Ecoli.genome.fas.nhr
Ecoli.genome.fas.nin
Ecoli.genome.fas.nsq

Ecoli.proteins.fas
Ecoli.proteins.fas.phr
Ecoli.proteins.fas.pin
Ecoli.proteins.fas.psq

2.Run BLASTP search

Now use the BLASTP program to run a search for every protein sequence from a strain of Salmonella enterica (Salmonella.proteins.fas) against the E. coli protein database you just made. Enter the following command:

            blastp -query Salmonella.proteins.fas -db Ecoli.proteins.fas -evalue 1e-6 -num_threads 4 -out blastp.txt

Note that the -evalue option sets the significance threshold for reporting hits. This can be modified depending on how stringent/permissive you would like to be with your search. There are 5027 proteins in the Salmonella file, so this will run 5027 separate BLAST searches. The -num_threads option tells the program to use multiple cores (in this case 4) to make the search go a little faster. It should take about 1 min. For more information on BLAST settings you can type: blastp -help

Once the search is complete, view the output file with the following command:

            less blastp.txt

This is a plain-text version of the information you would see when you run a BLAST search online. To exit viewing the file, type q.

3. Summarize BLAST results by parsing output file with a BioPerl script

The output text file from BLAST is easy to read, but when you do large numbers of searches it is far too much to view all of it. Extracting key information from BLAST output with Perl or Python scripts can be very valuable. Here, we will use a Perl script that relies on modules from BioPerl that make it easy to parse BLAST output.

            perl BLASTP_parse.pl blastp.txt > blastp.parsed.txt

Lets look at the last few lines of the output file with the tail command.

            tail blastp.parsed.txt

            ## zipA zipA    76.3848396501458 ## zntA zntA    84.9655172413793 ## zntB ydaN    92.6829268292683 ## zntR zntR    92.9577464788732 ## znuA znuA    86.984126984127 ## znuB znuB    96.1832061068702 ## znuC znuC    95.2380952380952 ## zraP zraP    63.2 ## zur  zur 93.0232558139535 ## zwf  zwf 97.3577235772358

Each line corresponds to a Salmonella protein and reports its top E. coli hit and the percent amino acid identity between the query and the hit. You can view all of the parsed output file (blastp.parsed.txt) from the command line with less or by opening it in Microsoft Excel.

Note that we have only extracted a very small amount of information from the BLAST output (name of the query sequence, name of the top hit sequence, and percent identity of the top hit). But depending on your needs, you can extract essentially any of the information you could want from the file, including alignment lengths, positions, orientations, sequences, scores, etc.

Open the BLASTP_parse.pl script in BBEdit (or a command-line editor like nano). The Perl code is complicated, but look at it to get a general sense of what it is doing.

Now try changing line 25 from…

$value = $hsp_obj->percent_identity;

to…

$value = $hsp_obj->evalue;

If you save these changes and repeat the parsing of the BLAST output, you will find the parsed output file now reports the e-value for each hit rather than the percent amino-acid identity.

4. Run BLASTN search

Now try performing a nucleotide-based search to compare the entire Salmonella genome against the entire E. coli genome by entering the following command:

            blastn -task blastn  -query Salmonella.genome.fas -db Ecoli.genome.fas -evalue 1e-20 -num_threads 4 -out blastn.txt

Note the use of the -task blastn flag. That may seem redundant because the program itself is BLASTN, but this program defaults to use a version of BLAST called MEGABLAST, which has very stringent default parameters (including a word size of 28). Therefore, for many or most applications it is necessary to either reduce the word size in MEGABLAST or to run classic BLASTN by using the task flag. Also note that we are using and unusually low e-value (1e-20). This will eliminate a large number of significant but short alignments from our output, which we are not interested in for this exercise. For more information on BLAST settings you can type: blastn -help

Once the search is complete, view the output file with the following command.

            less blastn.txt

This is a plain-text version of the information you would see when you run a BLAST search online. To exit viewing the file, type q. As above, we will use a BioPerl script to parse and summarize the BLAST output. But in this case, we are going to use a script that will report much more information about each alignment.

            perl BLASTN_parse.pl blastn.txt > blastn.parsed.txt

The parsed output file (blastn.parsed.txt) has numerous columns so open it in Microsoft Excel to view it (you will have to transfer it back to your local machine first). Note that the output file reports the positions of both the query and the hit for each BLAST alignment. Try using these to make a "dot-plot".

5. Plot data in R

Launch an R session on the workshop server by simply typing R.

            blastnData = read.table ("blastn.parsed.txt", sep = '\t', header=TRUE)

And then make an XY scatter plot showing the locations of the BLASTN hits in the two genomes with the following command. Note that the cex options simply sets the size of the points. We will save the plot to a PDF file by calling pdf() and the closing the file with dev.off(). The output should look like the image below. To view the output PDF file you will neen to transfer my_dotplot.pdf to your own computer with Cyberduck.

            pdf ("my_dotplot.pdf") plot (blastnData$Query_Start, blastnData$Hit_Start, cex = .25) dev.off() quit()

What is this plot telling us about the similarities and differences between the genome structures of E. coli and Salmonella? How can you explain the downward-sloping set of points that ends at ~2 Mb on the x-axis?

parkerhimathat.blogspot.com

Source: https://dbsloan.github.io/TS2019/exercises/local_blast.html

How to Make Blast Results Easy to View

Background

Objectives

Software and Dependencies

Protocol

1. Make BLAST databases

2.Run BLASTP search

3. Summarize BLAST results by parsing output file with a BioPerl script

4. Run BLASTN search

5. Plot data in R

0 Response to "How to Make Blast Results Easy to View"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel