Entry | Sentence boundary disambiguation |
Definition |
Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, identifying sentence boundaries can be challenging because punctuation marks are often ambiguous. In written English, a period may indicate the end of a sentence, or it may denote an abbreviation, a decimal point, an ellipsis, or part of an email address, among other possibilities. About 47% of the periods in the Wall Street Journal corpus denote abbreviations.[1] Question marks and exclamation marks can be similarly ambiguous because of their use in emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.

Strategies

The standard 'vanilla' approach to locating the end of a sentence:
(a) If it's a period, it ends a sentence.
(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
(c) If the next token is capitalized, then it ends a sentence.

This strategy gets about 95% of sentences correct.[2] The remaining 5% often involve shortened names such as "D. H. Lawrence" (with whitespace between the initials), idiosyncratic spellings used for stylistic purposes (often referring to a single concept, e.g. an entertainment product title like ".hack//SIGN"), and non-standard punctuation or non-standard use of punctuation.

Another approach is to automatically learn a set of rules from a set of documents in which the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model.[3] The SATZ architecture (https://web.archive.org/web/20070922132340/http://elib.cs.berkeley.edu/src/satz/) uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
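The three-rule 'vanilla' heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rules are read as a cascade in which each later rule overrides the earlier ones, tokens are obtained by splitting on whitespace, and the abbreviation list and the function name `split_sentences` are hypothetical and not taken from any cited work.

```python
# Hypothetical hand-compiled abbreviation list (lowercase, trailing period removed).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "vs", "inc", "e.g", "i.e"}


def split_sentences(text):
    """Split text into sentences using the three 'vanilla' rules.

    Assumption: the rules form a cascade, so (b) overrides (a)
    and (c) overrides (b).
    """
    tokens = text.split()
    sentences = []
    current = []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue  # only a period is considered as a boundary candidate

        ends_sentence = True  # rule (a): a period ends a sentence

        if token[:-1].lower() in ABBREVIATIONS:
            ends_sentence = False  # rule (b): known abbreviation, no boundary

        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if next_token is None or next_token[0].isupper():
            ends_sentence = True  # rule (c): capitalized next token (or end of text)

        if ends_sentence:
            sentences.append(" ".join(current))
            current = []

    if current:  # trailing text without a final period
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    print(split_sentences(
        "The price rose 3.5 percent vs. last year. Analysts were surprised."
    ))
    print(split_sentences("He read D. H. Lawrence."))
```

On the first input the sketch keeps "3.5" and "vs." inside the first sentence and yields two sentences. On the second it wrongly breaks after each initial of "D. H. Lawrence", which is the kind of case the text places in the remaining 5%.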
References

1. E. Stamatatos, N. Fakotakis, and G. Kokkinakis. "Automatic Extraction of Rules for Sentence Boundary Disambiguation". University of Patras. http://www.ling.gu.se/~lager/Mutbl/Papers/sent_bound.ps. Retrieved 2009-01-03.
2. John O'Neil. "Doing Things with Words, Part Two: Sentence Boundary Detection". http://www.attivio.com/attivio/blog/263-doing-things-with-words-part-two-sentence-boundary-detection.html. Retrieved 2009-01-03.
3. JC Reynar and A Ratnaparkhi. "A Maximum Entropy Approach to Identifying Sentence Boundaries". http://www.aclweb.org/anthology/A/A97/A97-1004.pdf. Retrieved 2009-01-03.
1 : Tasks of natural language processing |