Entry | BLAST |
Definition |
name = BLAST | latest release version = 2.8.1+ | latest release date = 26 November 2018 | operating system = UNIX, Linux, Mac, MS-Windows | genre = Bioinformatics tool | license = Public domain | website = http://blast.ncbi.nlm.nih.gov/Blast.cgi

In bioinformatics, BLAST (basic local alignment search tool) is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences and identify library sequences that resemble the query sequence above a certain threshold. Different types of BLAST are available, depending on the query sequences. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST algorithm and program were designed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman at the National Institutes of Health. The work was published in the Journal of Molecular Biology in 1990 and has been cited over 75,000 times.[1]

Background
BLAST is one of the most widely used bioinformatics programs for sequence searching.[2] It addresses a fundamental problem in bioinformatics research. The heuristic algorithm it uses is much faster than other approaches, such as calculating an optimal alignment. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster. Before BLAST, FASTA was developed by David J. Lipman and William R.
Pearson in 1985.[3] Before fast algorithms such as BLAST and FASTA were developed, database searches for protein or nucleic acid sequences were very time consuming because a full alignment procedure (e.g., the Smith-Waterman algorithm) was used. While BLAST is faster than any Smith-Waterman implementation in most cases, it cannot "guarantee the optimal alignments of the query and database sequences" as the Smith-Waterman algorithm does. The optimality of Smith-Waterman "ensured the best performance on accuracy and the most precise results" at the expense of time and computer power. BLAST is more time-efficient than FASTA because it searches only for the more significant patterns in the sequences, yet it offers comparable sensitivity. The reason becomes clearer from the description of the BLAST algorithm given below. Examples of other questions that researchers use BLAST to answer are:
BLAST is also often used as part of other algorithms that require approximate sequence matching. The BLAST algorithm and the computer program that implements it were developed by Stephen Altschul, Warren Gish, and David Lipman at the U.S. National Center for Biotechnology Information (NCBI), Webb Miller at the Pennsylvania State University, and Gene Myers at the University of Arizona. It is available on the web on the NCBI website. Alternative implementations include AB-BLAST (formerly known as WU-BLAST), FSA-BLAST (last updated in 2006), and ScalaBLAST.[4][5] The original paper by Altschul et al.[1] was the most highly cited paper published in the 1990s.[6]

Input
Input sequences (in FASTA or GenBank format) and a weight matrix.

Output
BLAST output can be delivered in a variety of formats, including HTML, plain text, and XML. On NCBI's web page, the default output format is HTML. When a BLAST search is performed on NCBI, the results are given as a graphical overview of the hits found, a table listing sequence identifiers for the hits with scoring-related data, and alignments between the sequence of interest and the hits, with the corresponding BLAST scores. The easiest to read and most informative of these is probably the table. For a search with a proprietary sequence, or against sequences that are not in databases available to the general public through sources such as NCBI, a BLAST program can be downloaded to any computer at no cost. It is distributed as the BLAST+ executables. There are also commercial programs available for purchase. Databases can be obtained from the NCBI site, as well as from the Index of BLAST databases (FTP).

Process
Using a heuristic method, BLAST finds similar sequences by locating short matches between the two sequences. This process of finding short matches is called seeding. It is after this first match that BLAST begins to make local alignments.
When searching for similarity between sequences, sets of common letters, known as words, are very important. For example, suppose that the sequence contains the stretch of letters GLKFA. Under default conditions, the word size would be 3 letters. In this case, the searched words would be GLK, LKF, and KFA. The heuristic algorithm of BLAST locates all common three-letter words between the sequence of interest and the hit sequence or sequences from the database. This result is then used to build an alignment. After the words for the sequence of interest have been made, the neighborhood words are also assembled. These words must have a score of at least the threshold T when compared with the query word using a scoring matrix. One commonly used scoring matrix for BLAST searches is BLOSUM62, although the optimal scoring matrix depends on sequence similarity. Once both words and neighborhood words are assembled and compiled, they are compared to the sequences in the database in order to find matches. The threshold score T determines whether or not a particular word will be included in the alignment. Once seeding has been conducted, the alignment, which is only 3 residues long, is extended in both directions by the algorithm used by BLAST. Each extension changes the score of the alignment, either increasing or decreasing it. If this score is higher than a pre-determined T, the alignment will be included in the results given by BLAST. However, if this score is lower than this pre-determined T, the alignment will cease to extend, preventing areas of poor alignment from being included in the BLAST results.
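The word and neighborhood-word construction described above can be sketched roughly as follows. This is a minimal illustration, not the NCBI implementation: the function names are hypothetical, and `toy_score` (+5 for identical residues, -1 otherwise) is an assumed stand-in for a real substitution matrix such as BLOSUM62.

```python
from itertools import product

# The 20 standard amino acids, one letter each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def words(sequence, size=3):
    """Return every overlapping word of length `size`, in order."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def toy_score(a, b):
    """Hypothetical scoring: a stand-in for a matrix such as BLOSUM62."""
    return 5 if a == b else -1


def neighborhood(word, threshold, score=toy_score):
    """Return all words whose score against `word` is at least `threshold`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


if __name__ == "__main__":
    print(words("GLKFA"))                 # the three query words
    print(len(neighborhood("GLK", 9)))    # neighborhood size at T = 9
    print(len(neighborhood("GLK", 11)))   # neighborhood size at T = 11
```

With this toy scoring, raising the threshold from 9 to 11 shrinks the neighborhood of GLK from 58 words (the word itself plus every single-residue substitution) to the word itself, which is the trade-off between search space and speed that the threshold controls.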
Note that increasing the T score limits the amount of space available to search, decreasing the number of neighborhood words, while at the same time speeding up the BLAST process.

Algorithm
To run the software, BLAST requires a query sequence to search for, and either a sequence to search against (also called the target sequence) or a sequence database containing multiple such sequences. BLAST will find subsequences in the database which are similar to subsequences in the query. In typical usage, the query sequence is much smaller than the database; for example, the query may be one thousand nucleotides while the database is several billion nucleotides. The main idea of BLAST is that a statistically significant alignment often contains High-scoring Segment Pairs (HSPs). BLAST searches for high-scoring sequence alignments between the query sequence and the existing sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. The exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is less accurate than the Smith-Waterman algorithm but over 50 times faster.[8] The speed and relatively good accuracy of BLAST are among the key technical innovations of the BLAST programs. An overview of the BLAST algorithm (a protein to protein search) is as follows:[7]
Parallel BLAST
Parallel BLAST versions of split databases are implemented using MPI and Pthreads, and have been ported to various platforms including Windows, Linux, Solaris, Mac OS X, and AIX. Popular approaches to parallelizing BLAST include query distribution, hash table segmentation, computation parallelization, and database segmentation (partition). Databases are split into equal-sized pieces and stored locally on each node. Each query is run on all nodes in parallel, and the resultant BLAST output files from all nodes are merged to yield the final output.{{citation needed|date=December 2018}}

Program
The BLAST program can either be downloaded and run as a command-line utility, "blastall", or accessed for free over the web. The BLAST web server, hosted by the NCBI, allows anyone with a web browser to perform similarity searches against constantly updated databases of proteins and DNA that include most of the newly sequenced organisms. The BLAST source code is openly available, so anyone can access it and modify it. This has led to the creation of several BLAST "spin-offs". There are now a handful of different BLAST programs available, which can be chosen according to the task and the type of data. These programs vary in query sequence input, the database being searched, and what is being compared. BLAST is actually a family of programs (all included in the blastall executable). These include:[9]
By including related proteins in the search, PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than a standard protein-protein BLAST.
Of these programs, BLASTn and BLASTp are the most commonly used[citation needed] because they use direct comparisons and do not require translations. However, since protein sequences are better conserved evolutionarily than nucleotide sequences, tBLASTn, tBLASTx, and BLASTx produce more reliable and accurate results when dealing with coding DNA. They also make it possible to see the function of the protein sequence directly, since translating the sequence of interest before searching often yields annotated protein hits.

Alternative versions
A version designed for comparing large genomes or DNA is BLASTZ. CS-BLAST (Context-Specific BLAST) is an extended version of BLAST for searching protein sequences that finds twice as many remotely related sequences as BLAST at the same speed and error rate. In CS-BLAST, the mutation probabilities between amino acids depend not only on the single amino acid, as in BLAST, but also on its local sequence context. Washington University produced an alternative version of NCBI BLAST, called WU-BLAST. The rights have since been acquired by Advanced Biocomputing, LLC. In 2009, NCBI released a new set of BLAST executables, the C++ based BLAST+,[10] and continued to release the previous executables in parallel until version 2.2.26. Starting with version 2.2.27 (April 2013), only BLAST+ executables are available. Among the changes is the replacement of the

Accelerated versions
TimeLogic offers an FPGA-accelerated implementation of the BLAST algorithm called Tera-BLAST that is hundreds of times faster. Other formerly supported versions include:
Alternatives to BLAST
The predecessor to BLAST, FASTA, can also be used for protein and DNA similarity searching. FASTA provides a similar set of programs for comparing proteins to protein and DNA databases, and DNA to DNA and protein databases, and includes additional programs for working with unordered short peptides and DNA sequences. In addition, the FASTA package provides SSEARCH, a vectorized implementation of the rigorous Smith-Waterman algorithm. FASTA is slower than BLAST, but provides a much wider range of scoring matrices, making it easier to tailor a search to a specific evolutionary distance.

An extremely fast but considerably less sensitive alternative to BLAST is BLAT (BLAST-Like Alignment Tool). While BLAST does a linear search, BLAT relies on k-mer indexing the database, and can thus often find seeds faster.[15] Another software alternative similar to BLAT is PatternHunter.

Advances in sequencing technology in the late 2000s have made searching for very similar nucleotide matches an important problem. New alignment programs tailored for this use typically use BWT-indexing of the target database (typically a genome). Input sequences can then be mapped very quickly, and output is typically in the form of a BAM file. Example alignment programs are BWA, SOAP, and Bowtie.

For protein identification, searching for known domains (for instance from Pfam) by matching with Hidden Markov Models is a popular alternative; HMMER is one such tool.

An alternative to BLAST for comparing two banks of sequences is PLAST. PLAST provides a high-performance, general-purpose bank-to-bank sequence similarity search tool relying on the PLAST[16] and ORIS[17] algorithms. Results of PLAST are very similar to those of BLAST, but PLAST is significantly faster and capable of comparing large sets of sequences with a small memory (i.e. RAM) footprint.
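The idea of k-mer indexing mentioned above for BLAT can be illustrated with a small sketch. This is a simplified assumption of how such an index might look, not BLAT's actual data layout: the function names and the tiny example database are hypothetical, and every overlapping k-mer of the database is stored in an ordinary dictionary.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map each k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Look up every k-mer of the query; each hit is a candidate seed."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, name, dpos))
    return seeds


if __name__ == "__main__":
    database = {"chr1": "ACGTACGT"}      # hypothetical toy database
    index = build_kmer_index(database, k=4)
    print(find_seeds("TACGT", index, k=4))
```

Because the index is built once and each query word is then found by a dictionary lookup, seed finding does not require scanning the whole database for every query, which is the advantage the text attributes to indexing over a linear search.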
For applications in metagenomics, where the task is to compare billions of short DNA reads against tens of millions of protein references, DIAMOND[18] runs at up to 20,000 times the speed of BLASTX while maintaining a high level of sensitivity. The open-source software MMseqs2[19] is an alternative to BLAST/PSI-BLAST which improves on current search tools over the full range of the speed-sensitivity trade-off, achieving sensitivities better than PSI-BLAST at more than 400 times its speed.

Comparing BLAST and the Smith-Waterman Process
While both Smith-Waterman and BLAST are used to find homologous sequences by searching and comparing a query sequence with those in the databases, they do have their differences. Because BLAST is based on a heuristic algorithm, the hits it returns may not be the best possible results, and it will not report every hit within the database. BLAST misses hard-to-find matches. To obtain the best possible results, the Smith-Waterman algorithm is the better alternative. It differs from the BLAST method in two areas, accuracy and speed. The Smith-Waterman option provides better accuracy, in that it finds matches that BLAST cannot, because it does not miss any information. It is therefore needed for detecting remote homology. However, compared to BLAST, it is more time consuming and requires large amounts of computing time and memory. Technologies that speed up the Smith-Waterman process, including FPGA chips and SIMD technology, have been found to reduce the time necessary to perform a search dramatically. To obtain better results from BLAST, the settings can be changed from their defaults. However, there is no fixed recipe for changing these settings that gives the best results for a given sequence.
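As a point of comparison for the exhaustive approach discussed in this section, the following is a minimal sketch of Smith-Waterman local alignment scoring. It assumes a simple linear gap penalty and illustrative match, mismatch, and gap values that are not taken from the source; it returns only the best local score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    Every cell of the (len(a) + 1) x (len(b) + 1) table is filled, which
    is why the method is exhaustive and slow on large databases.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The nested loops visit every pair of positions in the two sequences, so the work grows with the product of their lengths. BLAST avoids most of that work by extending only from seeds, at the cost of possibly missing alignments that contain no seed.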
The settings available for change are E-value, gap costs, filters, word size, and substitution matrix. Note that the algorithm used for BLAST was developed from the algorithm used for Smith-Waterman. BLAST employs an alignment which finds "local alignments between sequences by finding short matches and from these initial matches (local) alignments are created".

BLAST output visualization
To help users interpret BLAST results, various software tools are available. According to installation and use, analysis features and technology, here are some available tools:[20]
Uses of BLAST
BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.
References
1. Altschul, Stephen; Gish, Warren; Miller, Webb; Myers, Eugene; Lipman, David (1990). "Basic local alignment search tool". Journal of Molecular Biology. 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712. http://www.blastalgorithm.com
2. Casey, R. M. (2005). "BLAST Sequences Aid in Genomics and Proteomics". Business Intelligence Network. http://www.b-eye-network.com/view/1730
3. Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein similarity searches". Science. 227 (4693): 1435–41. doi:10.1126/science.2983426. PMID 2983426.
4. Oehmen, C.; Nieplocha, J. (2006). "ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis". IEEE Transactions on Parallel and Distributed Systems. 17 (8): 740. doi:10.1109/TPDS.2006.112.
5. Oehmen, C. S.; Baxter, D. J. (2013). "ScalaBLAST 2.0: Rapid and robust BLAST calculations on multiprocessor systems". Bioinformatics. 29 (6): 797–798. doi:10.1093/bioinformatics/btt013. PMID 23361326. PMC 3597145.
6. "Sense from Sequences: Stephen F. Altschul on Bettering BLAST". ScienceWatch. July–August 2000. Original: http://www.sciencewatch.com/july-aug2000/sw_july-aug2000_page3.htm. Archived October 7, 2007: https://web.archive.org/web/20071007132448/http://www.sciencewatch.com/july-aug2000/sw_july-aug2000_page3.htm
7. Mount, D. W. (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor Press. ISBN 978-0-87969-712-9. http://www.bioinformaticsonline.org/
8. Adapted from Biological Sequence Analysis I, Current Topics in Genome Analysis. https://www.youtube.com/watch?v=SAweFv8I8ow
9. "Program Selection Tables of the Blast NCBI web site". http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=ProgSelectionGuide
10. Camacho, C.; Coulouris, G.; Avagyan, V.; Ma, N.; Papadopoulos, J.; Bealer, K.; Madden, T. L. (2009). "BLAST+: Architecture and applications". BMC Bioinformatics. 10: 421. doi:10.1186/1471-2105-10-421. PMID 20003500. PMC 2803857.
11. Vouzis, P. D.; Sahinidis, N. V. (2010). "GPU-BLAST: using graphics processors to accelerate protein sequence alignment". Bioinformatics. 27 (2): 182–8. doi:10.1093/bioinformatics/btq644. PMID 21088027. PMC 3018811. http://bioinformatics.oxfordjournals.org/content/27/2/182
12. Liu W, Schmidt B, Müller-Wittig W (2011). "CUDA-BLASTP: accelerating BLASTP on CUDA-enabled graphics hardware". IEEE/ACM Trans Comput Biol Bioinform. 8 (6): 1678–84. doi:10.1109/TCBB.2011.33. PMID 21339531.
13. Zhao K, Chu X (May 2014). "G-BLASTN: accelerating nucleotide alignment by graphics processors". Bioinformatics. 30 (10): 1384–91. doi:10.1093/bioinformatics/btu047. PMID 24463183.
14. Loh PR, Baym M, Berger B (July 2012). "Compressive genomics". Nat. Biotechnol. 30 (7): 627–30. doi:10.1038/nbt.2241. PMID 22781691.
15. Kent, W. James (2002-04-01). "BLAT—The BLAST-Like Alignment Tool". Genome Research. 12 (4): 656–664. doi:10.1101/gr.229202. ISSN 1088-9051. PMID 11932250. PMC 187518. http://genome.cshlp.org/content/12/4/656
16. Lavenier, D. (2009). "PLAST: parallel local alignment search tool for database comparison". BMC Bioinformatics. 10: 329. doi:10.1186/1471-2105-10-329. PMID 19821978. PMC 2770072. http://www.biomedcentral.com/1471-2105/10/329
17. Lavenier, D. (2009). "Ordered index seed algorithm for intensive DNA sequence comparison". 2008 IEEE International Symposium on Parallel and Distributed Processing. pp. 1–8. doi:10.1109/IPDPS.2008.4536172. ISBN 978-1-4244-1693-6. http://www.hicomb.org/papers/HICOMB2008-01.pdf
18. Buchfink, Xie and Huson (2015). "Fast and sensitive protein alignment using DIAMOND". Nature Methods. 12 (1): 59–60. doi:10.1038/nmeth.3176. PMID 25402007.
19. Steinegger, Martin; Soeding, Johannes (2017-10-16). "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets". Nature Biotechnology. 35 (11): 1026–1028. doi:10.1038/nbt.3988. PMID 29035372.
20. Neumann, Kumar and Shalchian-Tabrizi (2014). "BLAST output visualization in the new sequencing era". Briefings in Bioinformatics. 15 (4): 484–503. doi:10.1093/bib/bbt009. PMID 23603091.
Categories (5): Bioinformatics algorithms | Computational phylogenetics | Bioinformatics software | Laboratory software | Public-domain software |