Basic Linear Algebra Subprograms

Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. These routines are the de facto standard low-level routines for linear algebra libraries and have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. Such implementations take advantage of special floating-point hardware such as vector registers or SIMD instructions.

It originated as a Fortran library in 1979^[1] and its interface was standardized by the BLAS Technical (BLAST) Forum, whose latest BLAS report can be found on the netlib website.^[2] This Fortran library is known as the reference implementation (sometimes confusingly referred to as the BLAS library) and is not optimized for speed but is in the public domain.^[3]^[4]

Most libraries that offer linear algebra routines conform to the BLAS interface, allowing library users to develop programs that are agnostic of the BLAS library being used. Examples of BLAS libraries include AMD Core Math Library (ACML), ATLAS, Intel Math Kernel Library (MKL), and OpenBLAS. ACML is no longer supported by its producer.^[5] ATLAS is a portable library that automatically optimizes itself for an arbitrary architecture. MKL is a freeware^[6] but proprietary^[7] vendor library optimized for x86 and x86-64, with a performance emphasis on Intel processors.^[8] OpenBLAS is an open-source library that is hand-optimized for many of the popular architectures. The LINPACK benchmarks rely heavily on the BLAS routine gemm for their performance measurements.

Many numerical software applications use BLAS-compatible libraries to do linear algebra computations, including Armadillo, LAPACK, LINPACK, GNU Octave, Mathematica,^[9] MATLAB,^[10] NumPy,^[11] R, and Julia.

Background

With the advent of numerical programming, sophisticated subroutine libraries became useful. These libraries would contain subroutines for common high-level mathematical operations such as root finding, matrix inversion, and solving systems of equations. The language of choice was FORTRAN. The most prominent numerical programming library was IBM's Scientific Subroutine Package (SSP).^[12] These subroutine libraries allowed programmers to concentrate on their specific problems and avoid re-implementing well-known algorithms. The library routines would also be better than average implementations; matrix algorithms, for example, might use full pivoting to get better numerical accuracy. The libraries would also offer more efficient routines for special cases; for example, a library may include a routine to solve a system of equations whose matrix is upper triangular. The libraries would include single-precision and double-precision versions of some algorithms.

Initially, these subroutines used hard-coded loops for their low-level operations. For example, if a subroutine needed to perform a matrix multiplication, then the subroutine would have three nested loops. Linear algebra programs have many common low-level operations (the so-called "kernel" operations, not related to operating systems).^[13] Between 1973 and 1977, several of these kernel operations were identified (BLAST Forum 2001, p. 1). These kernel operations became defined subroutines that math libraries could call. The kernel calls had advantages over hard-coded loops: the library routine would be more readable, there were fewer chances for bugs, and the kernel implementation could be optimized for speed. A specification for these kernel operations using scalars and vectors, the level-1 Basic Linear Algebra Subprograms (BLAS), was published in 1979.^[1] BLAS was used to implement the linear algebra subroutine library LINPACK.
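To make the contrast concrete, here is a minimal pure-Python sketch of the same matrix product written twice: once with three hard-coded nested loops, and once as repeated calls to a Level 1 style kernel `axpy` ($y \leftarrow \alpha x + y$). The function names are hypothetical and the row-oriented formulation is one choice among several; real BLAS kernels are compiled and machine-tuned.

```python
from typing import List

Matrix = List[List[float]]


def matmul_loops(A: Matrix, B: Matrix) -> Matrix:
    """C = A B with three hard-coded nested loops."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i][j] += A[i][p] * B[p][j]
    return C


def axpy(alpha: float, x: List[float], y: List[float]) -> None:
    """Hypothetical kernel: y <- alpha * x + y, updated in place."""
    for i in range(len(y)):
        y[i] += alpha * x[i]


def matmul_kernel(A: Matrix, B: Matrix) -> Matrix:
    """C = A B, where the innermost loop is delegated to the axpy kernel."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            # Row i of C accumulates A[i][p] times row p of B.
            axpy(A[i][p], B[p], C[i])
    return C
```

Both functions compute the same product; the difference is that in the second version the innermost loop lives in one reusable routine that can be optimized once for a given machine.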

The BLAS abstraction allows customization for high performance. For example, LINPACK is a general-purpose library that can be used on many different machines without modification. LINPACK could use a generic version of BLAS. To gain performance, different machines might use tailored versions of BLAS. As computer architectures became more sophisticated, vector machines appeared. BLAS for a vector machine could use the machine's fast vector operations. (While vector processors eventually fell out of favor, vector instructions in modern CPUs are essential for optimal performance in BLAS routines.)
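The separation between a portable library and a tailored kernel can be sketched in Python: a hypothetical library-level routine `matvec` is written only in terms of a `dot` kernel that is passed in, and two interchangeable kernels are supplied. `dot_generic` plays the role of the portable version; `dot_unrolled`, which unrolls the loop by four, merely stands in for a machine-specific version. All names here are illustrative assumptions, not part of any real BLAS.

```python
from typing import Callable, List

Vector = List[float]
DotKernel = Callable[[Vector, Vector], float]


def dot_generic(x: Vector, y: Vector) -> float:
    """Portable reference kernel: one multiply-add per iteration."""
    s = 0.0
    for a, b in zip(x, y):
        s += a * b
    return s


def dot_unrolled(x: Vector, y: Vector) -> float:
    """Stand-in for a tuned kernel: loop unrolled by four partial sums."""
    n = len(x)
    s0 = s1 = s2 = s3 = 0.0
    i = 0
    while i + 4 <= n:
        s0 += x[i] * y[i]
        s1 += x[i + 1] * y[i + 1]
        s2 += x[i + 2] * y[i + 2]
        s3 += x[i + 3] * y[i + 3]
        i += 4
    for j in range(i, n):  # clean-up loop for the leftover elements
        s0 += x[j] * y[j]
    return (s0 + s1) + (s2 + s3)


def matvec(A: List[Vector], x: Vector, dot: DotKernel = dot_generic) -> Vector:
    """Library-level routine y = A x, written only against the dot kernel."""
    return [dot(row, x) for row in A]
```

Swapping `dot_unrolled` in for `dot_generic` changes nothing in `matvec`. Because the unrolled version sums in a different order, the two can round differently for general inputs, just as results from different BLAS builds may differ in the last bits.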

Other machine features became available and could also be exploited. Consequently, BLAS was augmented from 1984 to 1986 with level-2 kernel operations that concerned matrix-vector operations. Memory hierarchy was also recognized as something to exploit. Many computers have cache memory that is much faster than main memory; keeping matrix manipulations localized allows better usage of the cache. In 1987 and 1988, the level-3 BLAS were identified to do matrix-matrix operations. The level-3 BLAS encouraged block-partitioned algorithms. The LAPACK library uses level-3 BLAS (BLAST Forum 2001, pp. 1–2).

The original BLAS concerned only densely stored vectors and matrices. Further extensions to BLAS, such as for sparse matrices, have since been addressed (BLAST Forum 2001, p. 2).

ATLAS

Automatically Tuned Linear Algebra Software (ATLAS) attempts to make a BLAS implementation with higher performance. ATLAS defines many BLAS operations in terms of some core routines and then tries to automatically tailor the core routines to have good performance. A search is performed to choose good block sizes, which may depend on the computer's cache size and architecture. Tests are also made to see whether copying arrays and vectors improves performance. For example, it may be advantageous to copy arguments so that they are cache-line aligned, which allows user-supplied routines to use SIMD instructions.
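The block-size search can be illustrated with a toy autotuner in Python: it times a blocked matrix multiply for a few candidate block sizes and keeps the fastest. This is only a sketch of the idea, assuming square matrices and a small hand-picked candidate list; ATLAS itself generates and times compiled kernels and searches many more parameters. The function names are hypothetical.

```python
import random
import time
from typing import List, Sequence

Matrix = List[List[float]]


def blocked_matmul(A: Matrix, B: Matrix, nb: int) -> Matrix:
    """Return C = A B for square n x n matrices, processed in nb x nb tiles."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                # Accumulate one tile product into the matching tile of C.
                for i in range(ii, min(ii + nb, n)):
                    c_row, a_row = C[i], A[i]
                    for k in range(kk, min(kk + nb, n)):
                        a_ik, b_row = a_row[k], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            c_row[j] += a_ik * b_row[j]
    return C


def tune_block_size(n: int, candidates: Sequence[int], repeats: int = 3) -> int:
    """Time blocked_matmul for each candidate and return the fastest block size."""
    rng = random.Random(0)  # fixed seed so every candidate sees the same data
    A = [[rng.random() for _ in range(n)] for _ in range(n)]
    B = [[rng.random() for _ in range(n)] for _ in range(n)]
    best_nb, best_time = candidates[0], float("inf")
    for nb in candidates:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(A, B, nb)
            timings.append(time.perf_counter() - start)
        if min(timings) < best_time:  # best-of-N reduces timer noise
            best_nb, best_time = nb, min(timings)
    return best_nb
```

A call such as `tune_block_size(64, [8, 16, 32, 64])` returns whichever candidate ran fastest on the current machine. In pure Python the interpreter overhead dominates, so the winner mostly reflects loop overhead rather than cache behaviour; the structure of the search, not the measured numbers, is the point.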

Functionality

BLAS functionality is categorized into three sets of routines called "levels", which correspond both to the chronological order of definition and publication and to the degree of the polynomial in the complexities of algorithms: Level 1 BLAS operations typically take linear time, $O(n)$, Level 2 operations quadratic time, and Level 3 operations cubic time.^[15] Modern BLAS implementations typically provide all three levels.
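A rough count for square $n \times n$ operands shows where these degrees come from and why the levels differ in how well they can use a memory hierarchy. The figures below are leading-order counts for the straightforward algorithms, where "flops" means floating-point operations (one multiply and one add per inner step) and "elements" means distinct array entries read or written; lower-order terms are ignored.

```latex
\begin{aligned}
\text{Level 1 (axpy):}\quad & y \leftarrow \alpha x + y, && 2n \text{ flops on } 2n \text{ elements}, && \text{ratio } O(1)\\
\text{Level 2 (gemv):}\quad & y \leftarrow \alpha A x + \beta y, && \approx 2n^2 \text{ flops on } n^2 + 2n \text{ elements}, && \text{ratio } O(1)\\
\text{Level 3 (gemm):}\quad & C \leftarrow \alpha A B + \beta C, && \approx 2n^3 \text{ flops on } 3n^2 \text{ elements}, && \text{ratio } \tfrac{2n}{3} = O(n)
\end{aligned}
```

Only Level 3 performs asymptotically more arithmetic than it has data, so each element brought into cache can be reused about $n$ times; this is what block-partitioned algorithms exploit.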

Level 1

This level consists of all the routines described in the original presentation of BLAS (1979),^[1] which defined only vector operations on strided arrays: dot products, vector norms, a generalized vector addition of the form

$$y \leftarrow \alpha x + y$$

(called "axpy", "a x plus y"), and several other operations.
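The strided Level 1 update $y \leftarrow \alpha x + y$ (axpy) and the dot product can be mimicked in Python. This sketch follows the argument order of the Fortran routines `DAXPY(N, DA, DX, INCX, DY, INCY)` and `DDOT(N, DX, INCX, DY, INCY)`, but it is a pure-Python illustration rather than a binding, and it assumes positive increments and 0-based list indexing (the real routines also accept negative increments).

```python
from typing import List


def axpy(n: int, alpha: float, x: List[float], incx: int,
         y: List[float], incy: int) -> None:
    """y <- alpha * x + y over n strided elements, updating y in place.

    Assumes incx > 0 and incy > 0.
    """
    if n <= 0 or alpha == 0.0:
        return  # quick return: nothing would change
    for i in range(n):
        y[i * incy] += alpha * x[i * incx]


def dot(n: int, x: List[float], incx: int, y: List[float], incy: int) -> float:
    """Return the sum of x[i*incx] * y[i*incy] for i in range(n)."""
    return sum(x[i * incx] * y[i * incy] for i in range(n))
```

A stride greater than one is how BLAS callers pass, for example, a row of a column-major matrix: the increment is the matrix's leading dimension.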

Level 2

This level contains matrix-vector operations including, among other things, a generalized matrix-vector multiplication (gemv):

$$y \leftarrow \alpha A x + \beta y$$

as well as a solver for $x$ in the linear equation $T x = y$, with $T$ being triangular. Design of the Level 2 BLAS started in 1984, with results published in 1988.^[15] The Level 2 subroutines are especially intended to improve performance of programs using BLAS on vector processors, where Level 1 BLAS are suboptimal "because they hide the matrix-vector nature of the operations from the compiler."^[14]
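A pure-Python sketch of two Level 2 operations: gemv, $y \leftarrow \alpha\,\mathrm{op}(A)\,x + \beta y$ with an optional transpose, and a triangular solve of $T x = b$ restricted, as an assumption, to lower-triangular $T$. The names and signatures are illustrative only; the real routines (for example `DGEMV` and `DTRSV`) also take dimensions, leading dimensions, increments, and, for the solver, upper/lower and unit-diagonal flags, and they overwrite the right-hand side in place.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def gemv(alpha: float, A: Matrix, x: Vector, beta: float, y: Vector,
         trans: bool = False) -> None:
    """Update y in place: y <- alpha * op(A) x + beta * y, op(A) = A or A^T."""
    m, n = len(A), len(A[0])
    out_len, in_len = (n, m) if trans else (m, n)
    for i in range(out_len):
        if trans:
            s = sum(A[p][i] * x[p] for p in range(in_len))  # column i of A
        else:
            s = sum(A[i][p] * x[p] for p in range(in_len))  # row i of A
        # With beta == 0, y is not read, so NaN or garbage in y does not propagate.
        y[i] = alpha * s + (beta * y[i] if beta != 0.0 else 0.0)


def trsv_lower(T: Matrix, b: Vector) -> Vector:
    """Solve T x = b for lower-triangular, nonsingular T by forward substitution."""
    n = len(T)
    x = [0.0] * n
    for i in range(n):
        s = b[i] - sum(T[i][j] * x[j] for j in range(i))
        x[i] = s / T[i][i]
    return x
```

Treating $\beta = 0$ as "do not read $y$" follows the convention documented for the reference routines, which lets callers pass an uninitialized output vector.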

Level 3

This level, formally published in 1990,^[15] contains matrix-matrix operations, including a "general matrix multiplication" (gemm), of the form

$$C \leftarrow \alpha A B + \beta C$$

where $A$ and $B$ can optionally be transposed or Hermitian-conjugated inside the routine, and all three matrices may be strided. The ordinary matrix multiplication $A B$ can be performed by setting $\alpha$ to one and $C$ to an all-zeros matrix of the appropriate size (or, equivalently, $\beta$ to zero).
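The semantics of gemm, $C \leftarrow \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$, fit in a short reference-style Python sketch. It handles real matrices only (so the Hermitian-conjugate option is omitted), stores matrices as lists of rows instead of strided column-major arrays, and uses hypothetical names; it pins down what is computed, not how to compute it quickly.

```python
from typing import List

Matrix = List[List[float]]


def _op(M: Matrix, trans: bool) -> Matrix:
    """Return M itself, or its transpose as a new list of rows."""
    return [list(col) for col in zip(*M)] if trans else M


def gemm(alpha: float, A: Matrix, B: Matrix, beta: float, C: Matrix,
         trans_a: bool = False, trans_b: bool = False) -> None:
    """Update C in place: C <- alpha * op(A) op(B) + beta * C (real entries only)."""
    opA, opB = _op(A, trans_a), _op(B, trans_b)
    m, k, n = len(opA), len(opB), len(opB[0])
    if len(opA[0]) != k or len(C) != m or len(C[0]) != n:
        raise ValueError("incompatible matrix shapes")
    for i in range(m):
        for j in range(n):
            s = sum(opA[i][p] * opB[p][j] for p in range(k))
            # With beta == 0, C is not read, matching the usual BLAS convention.
            C[i][j] = alpha * s + (beta * C[i][j] if beta != 0.0 else 0.0)
```

Calling it with `alpha=1.0` and `beta=0.0` gives the ordinary product $AB$; a nonzero `beta` adds the scaled previous contents of `C`.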

Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS,^[16] and because faster algorithms exist beyond the obvious repetition of matrix-vector multiplication, gemm is a prime target of optimization for BLAS implementers. For example, by decomposing one or both of $A$ and $B$ into block matrices, gemm can be implemented recursively. This is possibly one of the motivations for including the $\beta$ parameter, since it allows the results of previous blocks to be accumulated. Note that this decomposition requires the special case $\beta = 1$, which many implementations optimize for, thereby eliminating one multiplication for each value of $C$. This decomposition allows for better locality of reference, both in space and in time, for the data used in the product. This, in turn, takes advantage of the cache on the system.^[17] For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as ATLAS. More recently, implementations by Kazushige Goto have shown that blocking only for the L2 cache, combined with careful amortizing of copying to contiguous memory to reduce TLB misses, is superior to ATLAS.^[18] Highly tuned implementations based on these ideas are part of GotoBLAS, OpenBLAS and BLIS.
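The block decomposition and the role of $\beta = 1$ can be sketched in Python. The driver splits $C \leftarrow \alpha A B + \beta C$ into `nb` × `nb` tiles and calls a block kernel; only the first block along the shared (inner) dimension applies the caller's $\beta$, and every later block accumulates with $\beta = 1$. Names are hypothetical, and this single level of blocking omits the packing into contiguous buffers and the multi-level cache blocking that tuned libraries use.

```python
from typing import List

Matrix = List[List[float]]


def gemm_block(alpha: float, A: Matrix, B: Matrix, beta: float, C: Matrix,
               rows: range, cols: range, inner: range) -> None:
    """C[rows, cols] <- alpha * A[rows, inner] B[inner, cols] + beta * C[rows, cols]."""
    for i in rows:
        for j in cols:
            s = 0.0
            for p in inner:
                s += A[i][p] * B[p][j]
            C[i][j] = alpha * s + (beta * C[i][j] if beta != 0.0 else 0.0)


def blocked_gemm(alpha: float, A: Matrix, B: Matrix, beta: float, C: Matrix,
                 nb: int) -> None:
    """C <- alpha * A B + beta * C computed tile by tile (assumes non-empty inputs)."""
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, nb):
        rows = range(i0, min(i0 + nb, m))
        for j0 in range(0, n, nb):
            cols = range(j0, min(j0 + nb, n))
            for p0 in range(0, k, nb):
                inner = range(p0, min(p0 + nb, k))
                # Only the first inner block applies the caller's beta; every
                # later block adds onto the partial result, i.e. uses beta = 1.
                block_beta = beta if p0 == 0 else 1.0
                gemm_block(alpha, A, B, block_beta, C, rows, cols, inner)
```

In a compiled implementation `nb` would be chosen so that the tiles being worked on stay in cache while they are reused.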

Implementations

Similar libraries not compatible with BLAS

Sparse BLAS

Several extensions to BLAS for handling sparse matrices have been suggested over the course of the library's history; a small set of sparse matrix kernel routines was finally standardized in 2002.^[52]
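A typical sparse kernel, multiplying a sparse matrix by a dense vector, can be sketched in Python using the compressed sparse row (CSR) layout. CSR is an assumption chosen for illustration; this is not the interface of the standardized Sparse BLAS, and the function name is hypothetical.

```python
from typing import List


def csr_matvec(values: List[float], col_idx: List[int], row_ptr: List[int],
               x: List[float]) -> List[float]:
    """Return y = A x for A stored in compressed sparse row (CSR) format.

    Row i's nonzeros are values[row_ptr[i]:row_ptr[i + 1]], located in columns
    col_idx[row_ptr[i]:row_ptr[i + 1]]; zero entries are never touched.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```

For the matrix with rows `[10, 0, 0]`, `[0, 0, 2]` and `[3, 4, 0]`, the CSR arrays are `values = [10, 2, 3, 4]`, `col_idx = [0, 2, 0, 1]` and `row_ptr = [0, 1, 2, 4]`; the work is proportional to the number of stored nonzeros rather than to the full matrix size.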

See also

References

1. Lawson, C. L.; Hanson, R. J.; Kincaid, D.; Krogh, F. T. (1979). "Basic Linear Algebra Subprograms for FORTRAN usage". ACM Trans. Math. Softw. 5 (3): 308–323. Algorithm 539. doi:10.1145/355841.355847.
2. "BLAS Technical Forum". netlib.org. http://netlib.org/blas/blast-forum. Retrieved 2017-07-07.
3. blaseman. Archived October 12, 2016, at https://web.archive.org/web/20161012014431/http://www.lahey.com/docs/blaseman_lin62.pdf. "The products are the implementations of the public domain BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage), which have been developed by groups of people such as Prof. Jack Dongarra, University of Tennessee, USA and all published on the WWW (URL: http://www.netlib.org/)."
4. Dongarra, Jack; Golub, Gene; Grosse, Eric; Moler, Cleve; Moore, Keith. "Netlib and NA-Net: building a scientific computing community". netlib.org. http://www.netlib.org/utk/people/JackDongarra/PAPERS/netlib-history6.pdf. Retrieved 2016-02-13. "The Netlib software repository was created in 1984 to facilitate quick distribution of public domain software routines for use in scientific computation."
5. "ACML – AMD Core Math Library". AMD. 2013. http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/. Archived from the original on 5 September 2015 (https://web.archive.org/web/20150905190558/http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/). Retrieved 26 August 2015.
6. "No Cost Options for Intel Math Kernel Library (MKL), Support yourself, Royalty-Free". Intel. 2015. http://software.intel.com/articles/free_mkl. Retrieved 31 August 2015.
7. "Intel® Math Kernel Library (Intel® MKL)". Intel. 2015. http://software.intel.com/intel-mkl. Retrieved 25 August 2015.
8. "Optimization Notice". Intel. 2012. http://software.intel.com/articles/optimization-notice. Retrieved 10 April 2013.
9. Quinney, Douglas (2003). "So what's new in Mathematica 5.0?". MSOR Connections 3 (4). The Higher Education Academy. http://78.158.56.101/archive/msor/headocs/34mathematica5.pdf. Archived from the original on 2013-10-29 (https://web.archive.org/web/20131029204826/http://78.158.56.101/archive/msor/headocs/34mathematica5.pdf).
10. Moler, Cleve (2000). "MATLAB Incorporates LAPACK". MathWorks. http://www.mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html. Retrieved 26 October 2013.
11. van der Walt, Stéfan; Colbert, S. Chris; Varoquaux, Gaël (2011). "The NumPy array: a structure for efficient numerical computation". Computing in Science and Engineering 13 (2): 22–30. arXiv:1102.1523. Bibcode:2011arXiv1102.1523V. doi:10.1109/MCSE.2011.37.
12. Boisvert, Ronald F. (2000). "Mathematical software: past, present, and future". Mathematics and Computers in Simulation 54 (4–5): 227–241. arXiv:cs/0004004. doi:10.1016/S0378-4754(00)00185-3.
13. Even the SSP (which appeared around 1966) had some basic routines such as RADD (add rows), CADD (add columns), SRMA (scale row and add to another row), and RINT (row interchange). These routines apparently were not used as kernel operations to implement other routines such as matrix inversion. See IBM (1970). System/360 Scientific Subroutine Package, Version III, Programmer's Manual (5th ed.). International Business Machines. GH20-0205-4.
14. Dongarra, Jack J.; Du Croz, Jeremy; Hammarling, Sven; Hanson, Richard J. (1988). "An extended set of FORTRAN Basic Linear Algebra Subprograms". ACM Trans. Math. Softw. 14: 1–17. CiteSeerX 10.1.1.17.5421. doi:10.1145/42288.42291.
15. Dongarra, Jack J.; Du Croz, Jeremy; Hammarling, Sven; Duff, Iain S. (1990). "A set of level 3 basic linear algebra subprograms". ACM Transactions on Mathematical Software 16 (1): 1–17. ISSN 0098-3500. doi:10.1145/77626.79170.
16. Goto, Kazushige; van de Geijn, Robert (2008). "High-performance implementation of the level-3 BLAS". ACM Transactions on Mathematical Software 35 (1): 1–14. doi:10.1145/1377603.1377607. ftp://ftp.cs.utexas.edu/pub/techreports/tr06-23.pdf.
17. Golub, Gene H.; Van Loan, Charles F. (1996). Matrix Computations (3rd ed.). Johns Hopkins. ISBN 978-0-8018-5414-9.
18. Goto, Kazushige; van de Geijn, Robert A. (2008). "Anatomy of High-performance Matrix Multiplication". ACM Trans. Math. Softw. 34 (3): 12:1–12:25. ISSN 0098-3500. CiteSeerX 10.1.1.111.3873. doi:10.1145/1356052.1356053.
19. "Guides and Sample Code". developer.apple.com. https://developer.apple.com/library/mac/#releasenotes/Performance/RN-vecLib/. Retrieved 2017-07-07.
20. "Guides and Sample Code". developer.apple.com. https://developer.apple.com/library/ios/#documentation/Accelerate/Reference/AccelerateFWRef/. Retrieved 2017-07-07.
21. "Archived copy". http://developer.amd.com/acml.aspx. Archived from the original on 2005-11-30 (https://web.archive.org/web/20051130022536/http://developer.amd.com/acml.aspx). Retrieved 2005-10-26.
22. "C++ AMP BLAS Library". CodePlex. http://ampblas.codeplex.com/. Retrieved 2017-07-07.
23. "Automatically Tuned Linear Algebra Software (ATLAS)". math-atlas.sourceforge.net. http://math-atlas.sourceforge.net/. Retrieved 2017-07-07.
24. "blis: BLAS-like Library Instantiation Software Framework". flame. 2017-06-30. https://github.com/flame/blis. Retrieved 2017-07-07.
25. "cuBLAS". NVIDIA Developer. 2013-07-29. http://developer.nvidia.com/cublas. Retrieved 2017-07-07.
26. "NVBLAS". NVIDIA Developer. 2018-05-15. https://docs.nvidia.com/cuda/nvblas/index.htmls. Retrieved 2018-05-15.
27. "clBLAS: a software library containing BLAS functions written in OpenCL". clMathLibraries. 2017-07-03. https://github.com/clMathLibraries/clBLAS. Retrieved 2017-07-07.
28. Nugteren, Cedric (2017-07-05). "CLBlast: Tuned OpenCL BLAS". https://github.com/CNugteren/CLBlast. Retrieved 2017-07-07.
29. http://publib.boulder.ibm.com/infocenter/clresctr/index.jsp?topic=/com.ibm.cluster.essl.doc/esslbooks.html
30. "Archived copy". http://www.tacc.utexas.edu/tacc-projects/gotoblas2/. Archived from the original on 2012-05-17 (https://web.archive.org/web/20120517132718/http://www.tacc.utexas.edu/tacc-projects/gotoblas2). Retrieved 2012-05-24.
31. "Intel® Math Kernel Library (Intel® MKL) | Intel® Software". software.intel.com. http://software.intel.com/en-us/intel-mkl/. Retrieved 2017-07-07.
32. NEC. "MathKeisan". www.mathkeisan.com. http://www.mathkeisan.com/. Retrieved 2017-07-07.
33. "BLAS (Basic Linear Algebra Subprograms)". www.netlib.org. http://www.netlib.org/blas/. Retrieved 2017-07-07.
34. "BLAS (Basic Linear Algebra Subprograms)". www.netlib.org. http://www.netlib.org/blas. Retrieved 2017-07-07.
35. "OpenBLAS : An optimized BLAS library". www.openblas.net. http://www.openblas.net/. Retrieved 2017-07-07.
36. "Archived copy". http://www.nec.co.jp/hpc/mediator/sxm_e/software/61.html. Archived from the original on 2007-02-22 (https://web.archive.org/web/20070222154031/http://www.nec.co.jp/hpc/mediator/sxm_e/software/61.html). Retrieved 2007-05-20.
37. "Archived copy". http://www.sgi.com/products/software/scsl.html. Archived from the original on 2007-05-13 (https://web.archive.org/web/20070513173030/http://www.sgi.com/products/software/scsl.html). Retrieved 2007-05-20.
38. "Oracle Developer Studio". www.oracle.com. http://www.oracle.com/technetwork/server-storage/solarisstudio/overview/index.html. Retrieved 2017-07-07.
39. "Armadillo: C++ linear algebra library". arma.sourceforge.net. http://arma.sourceforge.net/. Retrieved 2017-07-07.
40. "Archived copy". http://developer.amd.com/tools-and-sdks/opencl-zone/acl-amd-compute-libraries/. Archived from the original on 2016-11-16 (https://web.archive.org/web/20161116145528/http://developer.amd.com/tools-and-sdks/opencl-zone/acl-amd-compute-libraries/). Retrieved 2016-10-25.
41. "clSPARSE: a software library containing Sparse functions written in OpenCL". clMathLibraries. 2017-07-03. https://github.com/clMathLibraries/clSPARSE. Retrieved 2017-07-07.
42. "clFFT: a software library containing FFT functions written in OpenCL". clMathLibraries. 2017-07-06. https://github.com/clMathLibraries/clFFT. Retrieved 2017-07-07.
43. "clRNG: an OpenCL based software library containing random number generation functions". clMathLibraries. 2017-06-25. https://github.com/clMathLibraries/clRNG. Retrieved 2017-07-07.
44. "Eigen". eigen.tuxfamily.org. http://eigen.tuxfamily.org. Retrieved 2017-07-07.
45. "Elemental: distributed-memory dense and sparse-direct linear algebra and optimization — Elemental". libelemental.org. http://libelemental.org/. Retrieved 2017-07-07.
46. "HASEM". SourceForge. http://sourceforge.net/projects/hasem/. Retrieved 2017-07-07.
47. "Archived copy". http://z.cs.utexas.edu/wiki/flame.wiki/FrontPage. Archived from the original on 2010-08-03 (https://web.archive.org/web/20100803003649/http://z.cs.utexas.edu/wiki/flame.wiki/FrontPage). Retrieved 2011-02-21.
48. http://icl.eecs.utk.edu/magma/
49. "Dlang Numerical and System Libraries". https://github.com/libmir.
50. "ICL". icl.eecs.utk.edu. http://icl.eecs.utk.edu/. Retrieved 2017-07-07.
51. "Boost Basic Linear Algebra - 1.60.0". www.boost.org. http://www.boost.org/doc/libs/1_60_0/libs/numeric/ublas/doc/index.html. Retrieved 2017-07-07.
52. Duff, Iain S.; Heroux, Michael A.; Pozo, Roldan (2002). "An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum". TOMS 28 (2): 239–267. doi:10.1145/567806.567810.