FASTA format

  1. Original format & overview
  2. Description line
       NCBI identifiers
  3. Sequence representation
  4. FASTA file
       Filename extension
       Compression
       Encryption
  5. Extended Format
  6. Working with FASTA files
  7. See also
  8. References
  9. External links

{{Infobox file format
| name = FASTA format
| developer = David J. Lipman, William R. Pearson[1][2]
| released = 1985
| genre = Bioinformatics
| extended_from = ASCII for FASTA
| extended_to = FASTQ format[3]
| url = {{URL|https://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml}}
}}

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The format originates from the FASTA software package, but has now become a near-universal standard in the field of bioinformatics.[3]

The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like the R programming language, Python, Ruby, and Perl.

Original format & overview

The original FASTA/Pearson format is described in the documentation for the FASTA suite of programs. It can be downloaded with any free distribution of FASTA (see fasta20.doc, fastaVN.doc or fastaVN.me—where VN is the Version Number).

In the original format, a sequence was represented as a series of lines, each of which was no longer than 120 characters and usually did not exceed 80 characters. This was probably to allow for preallocation of fixed line sizes in software: at the time, most users relied on Digital Equipment Corporation (DEC) VT220 (or compatible) terminals, which could display 80 or 132 characters per line.[citation needed] Most people preferred the bigger font of the 80-character mode, so it became the recommended practice to use 80 characters or fewer (often 70) in FASTA lines. Also, the width of a standard printed page is 70 to 80 characters (depending on the font). Hence, 80 characters became the norm.[citation needed]

The first line in a FASTA file started with either a ">" (greater-than) symbol or, less frequently, a ";" (semicolon),[citation needed] and was taken as a comment. Subsequent lines starting with a semicolon would be ignored by software. Since the only comment used was the first, it quickly became used to hold a summary description of the sequence, often starting with a unique library accession number. Over time it has become commonplace to always use ">" for the first line and not to use ";" comments (which would otherwise be ignored).

Following the initial line (used for a unique description of the sequence) is the actual sequence itself, written as a string of standard one-letter codes. Anything other than a valid character would be ignored (including spaces, tabs, asterisks, etc.). Originally it was also common to end the sequence with an "*" (asterisk) character (by analogy with its use in PIR-formatted sequences) and, for the same reason, to leave a blank line between the description and the sequence. A few sample sequences:

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

A multiple-sequence FASTA file (also known as multi-FASTA format) is obtained by concatenating several single-sequence FASTA files into one file. This does not contradict the format, because only the first line in a FASTA file may start with a ";" or ">"; every subsequent sequence must therefore start with a ">" in order to be recognized as a separate record (which in turn reserves ">" exclusively for the sequence definition line). Thus, the examples above can also be read as a multi-sequence (i.e. multi-FASTA) file if taken together.
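As a rough illustration of these rules, a multi-FASTA parser can be sketched in a few lines of Python. The function name `parse_fasta` is hypothetical, and the treatment of ";" lines and of the trailing "*" is an assumption based on the original Pearson conventions described in this section, not the behavior of any particular tool.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text.

    Assumptions made for this sketch:
    - a line starting with '>' begins a new record;
    - a ';' line seen before any description is used as the description,
      any other ';' line is treated as a comment and skipped;
    - blank lines and whitespace are ignored, and a trailing '*' is dropped.
    """
    description = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if description is not None:
                yield description, "".join(chunks).rstrip("*")
            description = line[1:].strip()
            chunks = []
        elif line.startswith(";"):
            if description is None:
                description = line[1:].strip()
        else:
            if description is None:
                description = ""  # sequence data without any header line
            chunks.append("".join(line.split()))
    if description is not None:
        yield description, "".join(chunks).rstrip("*")
```

Because the function is a generator, a large file can be processed one record at a time, for example with `for description, sequence in parse_fasta(text): ...`.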

Modern bioinformatics programs that rely on the FASTA format expect sequence headers to be preceded by ">". The sequence itself is generally "interleaved", i.e. split over multiple lines as in the examples above, but may also be "sequential", with the full sequence on a single line. Users often need to convert between sequential and interleaved FASTA in order to run different bioinformatics programs.
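The conversion between the two layouts is mechanical. The following Python sketch shows one way to do it; the function names `to_sequential` and `to_interleaved` and the default width of 60 are illustrative choices, and the code assumes ">"-style headers only (no ";" comment lines).

```python
def to_sequential(fasta_text):
    """Join each record's sequence lines into one line per record."""
    out = []
    seq = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq:
                out.append("".join(seq))
                seq = []
            out.append(line)
        else:
            seq.append(line)
    if seq:
        out.append("".join(seq))
    return "\n".join(out) + "\n" if out else ""


def to_interleaved(fasta_text, width=60):
    """Re-wrap each record's sequence at `width` characters per line."""
    out = []
    for line in to_sequential(fasta_text).splitlines():
        if line.startswith(">"):
            out.append(line)
        else:
            out.extend(line[i:i + width] for i in range(0, len(line), width))
    return "\n".join(out) + "\n" if out else ""
```

Re-wrapping goes through the sequential form first, so input that is already wrapped at a different width is handled the same way as single-line input.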

Description line

The description line (defline) or header/identifier line, which begins with '>', gives a name and/or a unique identifier for the sequence, and may also contain additional information. In a deprecated practice, the header line sometimes contained more than one header, separated by a ^A (Control-A) character. In the original Pearson FASTA format, one or more comments, distinguished by a semicolon at the beginning of the line, may occur after the header. Some databases and bioinformatics applications do not recognize these comments and follow the NCBI FASTA specification (https://www.ncbi.nlm.nih.gov/blast/fasta.shtml). An example of a multiple sequence FASTA file follows:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

NCBI identifiers

The NCBI defined a standard for the unique identifier used for the sequence (SeqID) in the header line. This allows a sequence that was obtained from a database to be labelled with a reference to its database record. The database identifier format is understood by the NCBI tools like makeblastdb and table2asn. The following list describes the NCBI FASTA defined format for sequence identifiers.[4]

Type                                Format(s)                                       Example(s)
local (i.e. no database reference)  lcl|integer                                     lcl|123
                                    lcl|string                                      lcl|hmm271
GenInfo backbone seqid              bbs|integer                                     bbs|123
GenInfo backbone moltype            bbm|integer                                     bbm|123
GenInfo import ID                   gim|integer                                     gim|123
GenBank                             gb|accession|locus                              gb|M73307|AGMA13GT
EMBL                                emb|accession|locus                             emb|CAM43271.1|
PIR                                 pir|accession|name                              pir||G36364
SWISS-PROT                          sp|accession|name                               sp|P01013|OVAX_CHICK
patent                              pat|country|patent|sequence-number              pat|US|RE33188|1
pre-grant patent                    pgp|country|application-number|sequence-number  pgp|EP|0238993|7
RefSeq                              ref|accession|name                              ref|NM_010450.1|
general database reference          gnl|database|integer                            gnl|taxon|9606
(a database not in this list)       gnl|database|string                             gnl|PID|e1632
GenInfo integrated database         gi|integer                                      gi|21434723
DDBJ                                dbj|accession|locus                             dbj|BAC85684.1|
PRF                                 prf|accession|name                              prf||0806162C
PDB                                 pdb|entry|chain                                 pdb|1I4L|D
third-party GenBank                 tpg|accession|name                              tpg|BK003456|
third-party EMBL                    tpe|accession|name                              tpe|BN000123|
third-party DDBJ                    tpd|accession|name                              tpd|FAA00017|
TrEMBL                              tr|accession|name                               tr|Q90RT2|Q90RT2_9HIV1

GenBank: https://www.ncbi.nlm.nih.gov/Genbank/index.html
RefSeq: https://www.ncbi.nlm.nih.gov/projects/RefSeq

The vertical bars ("|") in the above list are not separators in the sense of the Backus–Naur form, but are part of the format. Multiple identifiers can be concatenated, also separated by vertical bars.
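To show how concatenated identifiers can be taken apart, here is a minimal Python sketch. The name `parse_defline` and the `SEQID_FIELDS` table are hypothetical; the field counts are read off the list above, and the code is not the parser used by the NCBI tools.

```python
# Number of '|'-separated fields that follow each tag, per the list above.
SEQID_FIELDS = {
    "lcl": 1, "bbs": 1, "bbm": 1, "gim": 1, "gi": 1,
    "gb": 2, "emb": 2, "dbj": 2, "pir": 2, "sp": 2, "ref": 2,
    "gnl": 2, "prf": 2, "pdb": 2, "tpg": 2, "tpe": 2, "tpd": 2, "tr": 2,
    "pat": 3, "pgp": 3,
}


def parse_defline(defline):
    """Split a description line into (identifiers, free_text).

    `identifiers` is a list of (tag, fields) tuples. Parsing stops at the
    first token that is not a known tag, so a plain name such as
    'SEQUENCE_1' yields an empty identifier list.
    """
    body = defline[1:] if defline.startswith(">") else defline
    id_part, _, free_text = body.partition(" ")
    tokens = id_part.split("|")
    identifiers = []
    i = 0
    while i < len(tokens):
        count = SEQID_FIELDS.get(tokens[i])
        if count is None:
            break
        fields = tokens[i + 1:i + 1 + count]
        fields += [""] * (count - len(fields))  # pad fields that are missing
        identifiers.append((tokens[i], fields))
        i += 1 + count
    return identifiers, free_text.strip()
```

Empty fields, as in `pir||G36364` or the trailing bar of `ref|NM_010450.1|`, are kept as empty strings so that the position of each field stays meaningful.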

Sequence representation

Following the header line, the actual sequence is represented. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence. The nucleic acid codes supported are:[5][6]

Nucleic Acid Code  Meaning                             Mnemonic
A                  A                                   Adenine
C                  C                                   Cytosine
G                  G                                   Guanine
T                  T                                   Thymine
U                  U                                   Uracil
R                  A or G                              puRine
Y                  C, T or U                           pYrimidines
K                  G, T or U                           bases which are Ketones
M                  A or C                              bases with aMino groups
S                  C or G                              Strong interaction
W                  A, T or U                           Weak interaction
B                  not A (i.e. C, G, T or U)           B comes after A
D                  not C (i.e. A, G, T or U)           D comes after C
H                  not G (i.e. A, C, T or U)           H comes after G
V                  neither T nor U (i.e. A, C or G)    V comes after U
N                  A C G T U                           Nucleic acid
-                  gap of indeterminate length
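The ambiguity codes map naturally onto a lookup table. The Python sketch below encodes the nucleic acid codes listed above as sets of bases and uses them to test whether a concrete sequence is compatible with an ambiguous pattern. The names `IUPAC_NUCLEOTIDES`, `code_matches` and `pattern_matches` are illustrative, and the gap character is deliberately left out of the table.

```python
# Each code maps to the concrete bases it can stand for (from the table above).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CTU", "K": "GTU", "M": "AC",
    "S": "CG", "W": "ATU",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG",
    "N": "ACGTU",
}


def code_matches(code, base):
    """True if the concrete `base` is one of the bases `code` stands for."""
    return base.upper() in IUPAC_NUCLEOTIDES.get(code.upper(), "")


def pattern_matches(pattern, sequence):
    """True if `sequence` is compatible with `pattern`, position by position."""
    if len(pattern) != len(sequence):
        return False
    return all(code_matches(c, b) for c, b in zip(pattern, sequence))
```

Lower-case input is accepted and upper-cased before lookup, in line with the rule that lower-case letters are mapped into upper case.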

The amino acid codes supported (22 amino acids, 3 ambiguity codes and 3 special codes) are:

Amino Acid Code  Meaning
A                Alanine
B                Aspartic acid (D) or Asparagine (N)
C                Cysteine
D                Aspartic acid
E                Glutamic acid
F                Phenylalanine
G                Glycine
H                Histidine
I                Isoleucine
J                Leucine (L) or Isoleucine (I)
K                Lysine
L                Leucine
M                Methionine/Start codon
N                Asparagine
O                Pyrrolysine
P                Proline
Q                Glutamine
R                Arginine
S                Serine
T                Threonine
U                Selenocysteine
V                Valine
W                Tryptophan
Y                Tyrosine
Z                Glutamic acid (E) or Glutamine (Q)
X                any
*                translation stop
-                gap of indeterminate length
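The two code tables, together with the rules on case and digits given at the start of this section, are enough to write a simple checker. The Python sketch below is illustrative only: the names `NUCLEIC_CODES`, `AMINO_CODES` and `normalize_sequence` are hypothetical, and reporting invalid characters rather than silently dropping them is a design choice of this example.

```python
import string

# Alphabets taken from the two tables above, plus the gap character.
NUCLEIC_CODES = set("ACGTURYKMSWBDHVN-")
AMINO_CODES = set(string.ascii_uppercase) | set("*-")


def normalize_sequence(seq, alphabet):
    """Upper-case `seq`, drop whitespace, and list characters not in `alphabet`.

    Returns (cleaned_sequence, sorted_invalid_characters). Digits, which some
    databases use as position markers, show up in the invalid list.
    """
    cleaned = "".join(seq.split()).upper()
    invalid = sorted(set(cleaned) - alphabet)
    return cleaned, invalid
```

Note that every letter of the nucleic acid alphabet is also a valid amino acid code, so a check of this kind cannot by itself tell a nucleotide sequence from a protein sequence.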

FASTA file

Filename extension

There is no standard filename extension for a text file containing FASTA formatted sequences. The table below lists commonly used extensions and their meanings.

Extension  Meaning                           Notes
fasta      generic FASTA                     Any generic FASTA file. See below for other common FASTA file extensions.
fna        FASTA nucleic acid                Used generically to specify nucleic acids.
ffn        FASTA nucleotide of gene regions  Contains coding regions for a genome.
faa        FASTA amino acid                  Contains amino acid sequences. A multiple protein FASTA file can have the more specific extension mpfa.
frn        FASTA non-coding RNA              Contains non-coding RNA regions for a genome, in DNA alphabet, e.g. tRNA, rRNA.
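Since the extension is only a convention, software can at most use it as a hint. A small Python sketch of such a lookup follows; the names `EXTENSION_HINTS` and `describe_fasta_file` are hypothetical, and the mapping simply restates the table above.

```python
from pathlib import Path

# Extension -> meaning, restating the table above (a hint, not a guarantee).
EXTENSION_HINTS = {
    ".fasta": "generic FASTA",
    ".fna": "FASTA nucleic acid",
    ".ffn": "FASTA nucleotide of gene regions",
    ".faa": "FASTA amino acid",
    ".mpfa": "multiple protein FASTA",
    ".frn": "FASTA non-coding RNA",
}


def describe_fasta_file(filename):
    """Return the conventional meaning of a file's extension, or 'unknown'."""
    suffix = Path(filename).suffix.lower()
    return EXTENSION_HINTS.get(suffix, "unknown")
```

A robust program would still inspect the file contents, because nothing prevents a protein file from being named with a nucleotide extension.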

Compression

Effective compression of FASTA files calls for a dedicated compressor that handles both channels of information: identifiers and sequence. For better compression, these are usually split into two streams that are compressed independently of each other. For example, the algorithm MFCompress[7] performs lossless compression of these files using context modelling and arithmetic encoding. For a benchmark of FASTA compression algorithms, see [8].
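The two-stream idea can be illustrated with a general-purpose compressor. The Python sketch below separates headers from sequence data and compresses each stream with `zlib`. This is a stand-in chosen for brevity: it is not MFCompress, it does not use context modelling or arithmetic encoding, and the function names are hypothetical. It also discards the original line wrapping, so it preserves the records but not the exact file layout.

```python
import zlib


def compress_two_streams(fasta_text):
    """Compress headers and sequences as two independent zlib streams.

    Assumes '>'-style headers and at least one record in the input.
    """
    headers, sequences = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            sequences.append("")
        elif sequences:
            sequences[-1] += line  # join wrapped lines of the same record
    header_blob = zlib.compress("\n".join(headers).encode("ascii"))
    sequence_blob = zlib.compress("\n".join(sequences).encode("ascii"))
    return header_blob, sequence_blob


def decompress_two_streams(header_blob, sequence_blob):
    """Rebuild the list of (header, sequence) records from the two streams."""
    headers = zlib.decompress(header_blob).decode("ascii").split("\n")
    sequences = zlib.decompress(sequence_blob).decode("ascii").split("\n")
    return list(zip(headers, sequences))
```

Keeping the streams apart lets each one be modelled on its own terms: headers are free text with repeated prefixes, while sequence data draws on a very small alphabet.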

Encryption

The encryption of FASTA files has mostly been addressed with a dedicated encryption tool, Cryfa.[9][10] Cryfa uses AES encryption and can also compact the data in addition to encrypting it. It can handle FASTQ files as well.

Extended Format

FASTA format was extended by FASTQ format from the Sanger Centre in Cambridge.[11]

Working with FASTA files

Many user-friendly scripts are available from the community for manipulating FASTA files. Online toolboxes are also available, such as FaBox[12] or the FASTX-Toolkit within Galaxy servers.[13] For instance, these can be used to segregate sequence headers/identifiers, rename them, shorten them, or extract sequences of interest from large FASTA files based on a list of wanted identifiers (among other available functions). A tree-based approach to sorting multi-FASTA files (TREE2FASTA[14]) also exists, based on the coloring and/or annotation of sequences of interest in the FigTree viewer. Additionally, the Bioconductor Biostrings package can be used to read and manipulate FASTA files in R.[15]
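Extracting sequences by identifier, one of the tasks mentioned above, needs very little code. The following Python sketch is not taken from any of the cited tools; the function name `extract_records` is hypothetical, and it assumes that the identifier is the first whitespace-delimited word of each ">" header line.

```python
def extract_records(fasta_text, wanted_ids):
    """Return the FASTA records whose identifier is in `wanted_ids`.

    Matching records are copied unchanged, so their line wrapping is kept.
    """
    wanted = set(wanted_ids)
    out = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            keep = identifier in wanted
        if keep:
            out.append(line)
    return "\n".join(out) + "\n" if out else ""
```

Holding the wanted identifiers in a set keeps each lookup at constant time, which matters when a short list is filtered against a very large file.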

Several online format converters exist to rapidly reformat multi-FASTA files to different formats (e.g. NEXUS, PHYLIP) for use with different phylogenetic programs, such as the converter available on phylogeny.fr.[16]
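As an illustration of what such a converter does, the Python sketch below turns an aligned multi-FASTA file into sequential PHYLIP. It assumes the strict PHYLIP layout, in which a first line gives the number of sequences and the alignment length and each name occupies exactly ten characters. The function name `fasta_to_phylip` is hypothetical, and this is not the converter offered by phylogeny.fr.

```python
def fasta_to_phylip(fasta_text):
    """Convert aligned multi-FASTA text to strict sequential PHYLIP.

    Names are the first word of each header, truncated or padded to ten
    characters. Raises ValueError if the sequences differ in length.
    """
    records = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            records.append([fields[0] if fields else "", ""])
        elif records:
            records[-1][1] += line
    if not records:
        raise ValueError("no FASTA records found")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("PHYLIP requires aligned sequences of equal length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"
```

Truncating names to ten characters can make two long identifiers collide, which is one reason real converters often rename sequences or offer a relaxed variant of the format.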

See also

  • The FASTQ format, used to represent DNA sequencer reads along with quality scores.
  • The SAM format, used to represent genome sequencer reads that have been aligned to genome sequences.
  • The GVF format (Genome Variation Format), an extension based on the GFF3 format.

References

1. Lipman DJ, Pearson WR (March 1985). "Rapid and sensitive protein similarity searches". Science. 227 (4693): 1435–41. doi:10.1126/science.2983426. PMID 2983426.
2. Pearson WR, Lipman DJ (April 1988). "Improved tools for biological sequence comparison". Proceedings of the National Academy of Sciences of the United States of America. 85 (8): 2444–8. doi:10.1073/pnas.85.8.2444. PMC 280013. PMID 3162770.
3. "What is FASTA Format?". zhanglab.ccmb.med.umich.edu. http://zhanglab.ccmb.med.umich.edu/FASTA/. Explains the FASTA format.
4. NCBI C++ Toolkit Book. National Center for Biotechnology Information. https://ncbi.github.io/cxx-toolkit/pages/ch_demo#ch_demo.id1_fetch.html_ref_fasta. Accessed 2018-12-19.
5. Tao Tao (2011-08-24). "Single Letter Codes for Nucleotides". NCBI Learning Center. National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/staff/tao/tools/tool_lettercode.html. Accessed 2012-03-15.
6. "IUPAC code table". NIAS DNA Bank. http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html. Archived at https://web.archive.org/web/20110811073845/http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html on 2011-08-11.
7. Pinho AJ, Pratas D (January 2014). "MFCompress: a compression tool for FASTA and multi-FASTA data". Bioinformatics. 30 (1): 117–8. doi:10.1093/bioinformatics/btt594. PMC 3866555. PMID 24132931.
8. Hosseini M, Pratas D, Pinho A (2016). "A survey on data compression methods for biological sequences". Information. 7 (4): 56.
9. Pratas D, Hosseini M, Pinho A (2017). "Cryfa: a tool to compact and encrypt FASTA files". 11th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB). Advances in Intelligent Systems and Computing. Vol. 616. Springer. pp. 305–312. doi:10.1007/978-3-319-60816-7_37. ISBN 978-3-319-60815-0.
10. Hosseini M, Pratas D, Pinho A (2018). "Cryfa: a secure encryption tool for genomic data". Bioinformatics. 35: 146–148. doi:10.1093/bioinformatics/bty645.
11. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (April 2010). "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants". Nucleic Acids Research. 38 (6): 1767–71. doi:10.1093/nar/gkp1137. PMC 2847217. PMID 20015970.
12. Villesen P (April 2007). "FaBox: an online toolbox for fasta sequences". Molecular Ecology Resources. 7 (6): 965–968. doi:10.1111/j.1471-8286.2007.01821.x.
13. Blankenberg D, Von Kuster G, Bouvier E, Baker D, Afgan E, Stoler N, Galaxy Team, Taylor J, Nekrutenko A (2014). "Dissemination of scientific software with Galaxy ToolShed". Genome Biology. 15 (2): 403. doi:10.1186/gb4161. PMC 4038738. PMID 25001293.
14. Sauvage T, Plouviez S, Schmidt WE, Fredericq S (March 2018). "TREE2FASTA: a flexible Perl script for batch extraction of FASTA sequences from exploratory phylogenetic trees". BMC Research Notes. 11 (1): 403. doi:10.1186/s13104-018-3268-y. PMC 5838971. PMID 29506565.
15. Pagès H, Aboyoun P, Gentleman R, DebRoy S (2018). "Biostrings: Efficient manipulation of biological strings". Bioconductor.org. R package version 2.48.0. https://bioconductor.org/packages/release/bioc/html/Biostrings.html.
16. Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard JF, Guindon S, Lefort V, Lescot M, Claverie JM, Gascuel O (July 2008). "Phylogeny.fr: robust phylogenetic analysis for the non-specialist". Nucleic Acids Research. 36 (Web Server issue): W465–9. doi:10.1093/nar/gkn180. PMC 2447785. PMID 18424797.

External links

  • Bioconductor
  • FASTX-Toolkit
  • FigTree viewer
  • Phylogeny.fr

