Entry | Optical character recognition |
Definition |
Widely used as a form of information entry from printed paper data records – whether passport documents, invoices, bank statements, computerised receipts, business cards, mail, printouts of static data, or any suitable documentation – optical character recognition (OCR) is a common method of digitising printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.
Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems that achieve a high degree of recognition accuracy for most fonts are now common, and they support a variety of digital image file formats as input.[2] Some systems are capable of reproducing formatted output that closely approximates the original page, including images, columns, and other non-textual components.
History
See also: Timeline of optical character recognition
Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for the blind.[3] In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code.[citation needed] Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that, when moved across a printed page, produced tones that corresponded to specific letters or characters.[4]
In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931 he was granted U.S. Patent number 1,838,389 for the invention. The patent was acquired by IBM.
With the advent of smartphones and smartglasses, OCR can be used in internet-connected mobile device applications that extract text captured using the device's camera. Devices that do not have OCR functionality built into the operating system will typically use an OCR API to extract the text from the image file captured and provided by the device.[5][6] The OCR API returns the extracted text, along with information about the location of the detected text in the original image, to the device app for further processing (such as text-to-speech) or display.
Blind and visually impaired users
In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognise text printed in virtually any font. (Kurzweil is often credited with inventing omni-font OCR, but it was in use by companies, including CompuScan, in the late 1960s and 1970s.[3][7]) Kurzweil decided that the best application of this technology would be to create a reading machine for the blind, which would allow blind people to have a computer read text to them out loud. This device required the invention of two enabling technologies – the CCD flatbed scanner and the text-to-speech synthesiser. On January 13, 1976, the successful finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind.[citation needed]
In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercialising paper-to-computer text conversion.
Xerox eventually spun it off as Scansoft, which merged with Nuance Communications.[citation needed] The research group headed by A. G. Ramakrishnan at the Medical intelligence and language engineering lab, Indian Institute of Science, has developed the PrintToBraille tool, an open source GUI frontend[8] that can be used by any OCR to convert scanned images of printed books to Braille books.
In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone.
Various commercial and open source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters.
Applications
OCR engines have been developed into many kinds of domain-specific OCR applications, such as receipt OCR, invoice OCR, check OCR, and legal billing document OCR. They can be used for:
Types
OCR is generally an "offline" process, which analyses a static document. Handwriting movement analysis can be used as input to handwriting recognition.[13] Instead of merely using the shapes of glyphs and words, this technique is able to capture motions, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it. This additional information can make the end-to-end process more accurate. This technology is also known as "on-line character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition".
Techniques
Pre-processing
OCR software often "pre-processes" images to improve the chances of successful recognition. Techniques include:[14]
Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace between letters can sometimes be greater than that between words, and vertical lines can intersect more than one character.[23]
Character recognition
There are two basic types of core OCR algorithm, which may produce a ranked list of candidate characters.[22]
Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching", "pattern recognition", or "image correlation". This relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. This technique works best with typewritten text and does not work well when new fonts are encountered. This is the technique that early physical photocell-based OCR implemented, rather directly.
Feature extraction decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process computationally efficient. These features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and indeed most modern OCR software.[23] Nearest neighbour classifiers such as the k-nearest neighbors algorithm are used to compare image features with stored glyph features and choose the nearest match.[24]
Software such as Cuneiform and Tesseract uses a two-pass approach to character recognition.
The second pass is known as "adaptive recognition" and uses the letter shapes recognised with high confidence on the first pass to better recognise the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded).[25]
The OCR result can be stored in the standardised ALTO format, a dedicated XML schema maintained by the United States Library of Congress.
For a list of optical character recognition software, see Comparison of optical character recognition software.
Post-processing
OCR accuracy can be increased if the output is constrained by a lexicon – a list of words that are allowed to occur in a document.[14] This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy.[25]
The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.
"Near-neighbor analysis" can make use of co-occurrence frequencies to correct errors, by noting that certain words are often seen together.[26] For example, "Washington, D.C." is generally far more common in English than "Washington DOC".
Knowledge of the grammar of the language being scanned can also help determine whether a word is likely to be a verb or a noun, for example, allowing greater accuracy.
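The lexicon constraint described above can be illustrated with a minimal Python sketch. It is not how any particular OCR engine works: the function names `levenshtein` and `snap_to_lexicon`, the sample lexicon, and the `max_distance` threshold are all assumptions made for illustration. A recognised word that is not in the lexicon is replaced by the closest lexicon entry, measured by edit distance, only when that entry is within the threshold; otherwise the word is left unchanged, which is one way to limit damage to out-of-lexicon words such as proper nouns.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def snap_to_lexicon(word: str, lexicon: set, max_distance: int = 1) -> str:
    """Return the nearest lexicon entry, or the word itself if none is close.

    Hypothetical helper: the threshold and tie-breaking rule are assumptions.
    """
    if not lexicon or word in lexicon:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda entry: (levenshtein(word, entry), entry))
    return best if levenshtein(word, best) <= max_distance else word


if __name__ == "__main__":
    sample_lexicon = {"invoice", "total", "amount"}  # illustrative only
    for raw in ["tota1", "inv0ice", "Kurzweil"]:
        print(raw, "->", snap_to_lexicon(raw, sample_lexicon))
```

A small threshold keeps the correction conservative; a larger one fixes more errors but is more likely to overwrite legitimate words that the lexicon does not contain.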
The Levenshtein distance algorithm has also been used in OCR post-processing to further optimise results from an OCR API.[27]
Application-specific optimisations
In recent years,[when?] the major OCR technology providers began to tweak OCR systems to deal more efficiently with specific types of input. Beyond an application-specific lexicon, better performance may be had by taking into account business rules, standard expression,[clarification needed] or rich information contained in color images. This strategy is called "Application-Oriented OCR" or "Customised OCR", and has been applied to OCR of license plates, invoices, screenshots, ID cards, driver licenses, and automobile manufacturing.
The New York Times has adapted OCR technology into a proprietary tool called Document Helper, which enables its interactive news team to accelerate the processing of documents that need to be reviewed. The team notes that it enables them to process as many as 5,400 pages per hour in preparation for reporters to review the contents.[28]
Workarounds
There are several techniques for solving the problem of character recognition by means other than improved OCR algorithms.
Forcing better input
Special fonts like OCR-A, OCR-B, or MICR fonts, with precisely specified sizing, spacing, and distinctive character shapes, allow a higher accuracy rate during transcription in bank check processing. Ironically, however, several prominent OCR engines were designed to capture text in popular fonts such as Arial or Times New Roman, and are incapable of capturing text in these specialised fonts, which differ considerably from popularly used fonts. Because Google Tesseract can be trained to recognise new fonts, it can recognise OCR-A, OCR-B and MICR fonts.[29]
"Comb fields" are pre-printed boxes that encourage humans to write more legibly – one glyph per box.[26] These are often printed in a "dropout color" which can be easily removed by the OCR system.[26]
Palm OS used a special set of glyphs, known as "Graffiti", which are similar to printed English characters but simplified or modified for easier recognition on the platform's computationally limited hardware. Users would need to learn how to write these special glyphs.
Zone-based OCR restricts the image to a specific part of a document. This is often referred to as "Template OCR".
Crowdsourcing
Crowdsourcing character recognition to humans can process images quickly, like computer-driven OCR, but with higher accuracy than is obtained with computers. Practical systems include the Amazon Mechanical Turk and reCAPTCHA. The National Library of Finland has developed an online interface for users to correct OCRed texts in the standardised ALTO format.[30] Crowdsourcing has also been used not to perform character recognition directly but to invite software developers to develop image processing algorithms, for example, through the use of rank-order tournaments.[31]
Accuracy
Commissioned by the U.S. Department of Energy (DOE), the Information Science Research Institute (ISRI) had the mission to foster the improvement of automated technologies for understanding machine-printed documents, and it conducted the most authoritative Annual Test of OCR Accuracy from 1992 to 1996.[32]
Recognition of Latin-script, typewritten text is still not 100% accurate even where clear imaging is available. One study based on recognition of 19th- and early 20th-century newspaper pages concluded that character-by-character OCR accuracy for commercial OCR software varied from 81% to 99%;[33] total accuracy can be achieved by human review or Data Dictionary Authentication.
Other areas – including recognition of hand printing, cursive handwriting, and printed text in other scripts (especially those East Asian language characters which have many strokes for a single character) – are still the subject of active research. The MNIST database is commonly used for testing systems' ability to recognise handwritten digits.
Accuracy rates can be measured in several ways, and how they are measured can greatly affect the reported accuracy rate. For example, if word context (basically a lexicon of words) is not used to correct software finding non-existent words, a character error rate of 1% (99% accuracy) may result in an error rate of 5% (95% accuracy) or worse if the measurement is based on whether each whole word was recognised with no incorrect letters.[34]
An example of the difficulties inherent in digitising old text is the inability of OCR to differentiate between the "long s" and "f" characters.[35]
Web-based OCR systems for recognising hand-printed text on the fly have become well known as commercial products in recent years[when?] (see Tablet PC history). Accuracy rates of 80% to 90% on neat, clean hand-printed characters can be achieved by pen computing software, but that accuracy rate still translates to dozens of errors per page, making the technology useful only in very limited applications.[citation needed]
Recognition of cursive text is an active area of research, with recognition rates even lower than those for hand-printed text. Higher rates of recognition of general cursive script will likely not be possible without the use of contextual or grammatical information. For example, recognising entire words from a dictionary is easier than trying to parse individual characters from script. Reading the Amount line of a cheque (which is always a written-out number) is an example where using a smaller dictionary can increase recognition rates greatly.
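The relationship between character-level and word-level accuracy noted earlier in this section (a 1% character error rate yielding roughly a 5% word error rate) can be checked with a short calculation. The sketch below rests on two simplifying assumptions that are not stated in the source: character errors are independent of one another, and every word has the same length. The function name `word_accuracy` is illustrative.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character in a word is recognised correctly.

    Assumes independent character errors and a fixed word length.
    """
    return char_accuracy ** word_length


if __name__ == "__main__":
    for length in (3, 5, 8):
        accuracy = word_accuracy(0.99, length)
        print(f"{length}-letter words: word accuracy {accuracy:.1%}, "
              f"word error rate {1 - accuracy:.1%}")
```

Under these assumptions, five-letter words at 99% character accuracy are fully correct about 95.1% of the time, which matches the 5% word error figure quoted above; longer words fare worse.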
The shapes of individual cursive characters themselves simply do not contain enough information to recognise all handwritten cursive script accurately (greater than 98%).[citation needed]
Most programs allow users to set "confidence rates". This means that if the software does not achieve the desired level of accuracy, a user can be notified for manual review.
Unicode
Main article: Optical Character Recognition (Unicode block)
Characters to support OCR were added to the Unicode Standard in June 1993, with the release of version 1.1. Some of these characters are mapped from fonts specific to MICR, OCR-A or OCR-B.
References
1. HPE Haven OnDemand. "OCR Document". https://dev.havenondemand.com/apis/ocrdocument#overview
2. HPE Haven OnDemand. https://dev.havenondemand.com/docs/ImageFormats.html
3. Schantz, Herbert F. The history of OCR, optical character recognition. [Manchester Center, Vt.]: Recognition Technologies Users Association, 1982. ISBN 9780943072012.
4. d'Albe, E. E. F. "On a Type-Reading Optophone". Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 90 (619): 373–375, 1 July 1914. doi:10.1098/rspa.1914.0061. Bibcode:1914RSPSA..90..373D.
5. "Extracting text from images using OCR on Android". 27 June 2015. https://community.havenondemand.com/t5/Blog/Extracting-text-from-images-using-OCR-on-Android/ba-p/1883
6. "[Tutorial] OCR on Google Glass". 23 October 2014. https://community.havenondemand.com/t5/Blog/Tutorial-OCR-on-Google-Glass/ba-p/1164
7. "The History of OCR". Data Processing Magazine, 12: 46, 1970.
8. PrintToBraille Tool. "ocr-gui-frontend". MILE Lab, Dept of EE, IISc. https://code.google.com/p/ocr-gui-frontend/ Archived December 25, 2014 at https://web.archive.org/web/20141225115650/https://code.google.com/p/ocr-gui-frontend/ Retrieved 7 December 2014.
9. "[javascript] Using OCR and Entity Extraction for LinkedIn Company Lookup". 22 July 2014. https://community.havenondemand.com/t5/Blog/javascript-Using-OCR-and-Entity-Extraction-for-LinkedIn-Company/ba-p/460
10. "How To Crack Captchas". andrewt.net, 2006-06-28. http://www.andrewt.net/blog/how-to-crack-captchas/ Retrieved 2013-06-16.
11. "Breaking a Visual CAPTCHA". Cs.sfu.ca, 2002-12-10. http://www.cs.sfu.ca/~mori/research/gimpy/ Retrieved 2013-06-16.
12. John Resig. "John Resig – OCR and Neural Nets in JavaScript". Ejohn.org, 2009-01-23. http://ejohn.org/blog/ocr-and-neural-nets-in-javascript/ Retrieved 2013-06-16.
13. Tappert, C. C.; Suen, C. Y.; Wakahara, T. "The state of the art in online handwriting recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence, 12 (8): 787, 1990. doi:10.1109/34.57669.
14. "Optical Character Recognition (OCR) – How it works". Nicomsoft.com. https://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/ Retrieved 2013-06-16.
15. Sezgin, Mehmet; Sankur, Bulent. "Survey over image thresholding techniques and quantitative performance evaluation". Journal of Electronic Imaging, 13 (1): 146, 2004. doi:10.1117/1.1631315. Bibcode:2004JEI....13..146S. http://webdocs.cs.ualberta.ca/~nray1/CMPUT605/track3_papers/Threshold_survey.pdf Retrieved 2 May 2015.
16. Gupta, Maya R.; Jacobson, Nathaniel P.; Garcia, Eric K. "OCR binarisation and image pre-processing for searching historical documents". Pattern Recognition, 40 (2): 389, 2007. doi:10.1016/j.patcog.2006.04.043. http://www.rfai.li.univ-tours.fr/fr/ressources/_dh/DOC/DocOCR/OCRbinarisation.pdf Retrieved 2 May 2015.
17. Trier, Oeivind Due; Jain, Anil K. "Goal-directed evaluation of binarisation methods". IEEE Transactions on Pattern Analysis and Machine Intelligence, 17 (12): 1191–1201, 1995. doi:10.1109/34.476511. http://heim.ifi.uio.no/inf386/trier2.pdf Retrieved 2 May 2015.
18. Milyaev, Sergey; Barinova, Olga; Novikova, Tatiana; Kohli, Pushmeet; Lempitsky, Victor. "Image binarisation for end-to-end text understanding in natural images". Document Analysis and Recognition (ICDAR) 2013, 12th International Conference on, 2013. http://research.microsoft.com/en-us/um/people/pkohli/papers/mbnlk_icdar2013.pdf Retrieved 2 May 2015.
19. "Image Optimization". Grooper Document Capture, September 20, 2018 (bottom of the web page). https://grooper.com/image-optimization.html Retrieved September 20, 2018.
20. Pati, P.B.; Ramakrishnan, A.G. "Word Level Multi-script Identification". Pattern Recognition Letters, 29 (9): 1218–1229, 1987-05-29. doi:10.1016/j.patrec.2008.01.027.
21. "Basic OCR in OpenCV | Damiles". Blog.damiles.com, 2008-11-20. http://blog.damiles.com/2008/11/20/basic-ocr-in-opencv.html Retrieved 2013-06-16.
22. "OCR Introduction". Dataid.com. http://www.dataid.com/aboutocr.htm Retrieved 2013-06-16.
23. "How OCR Software Works". OCRWizard. http://ocrwizard.com/ocr-software/how-ocr-software-works.html Retrieved 2013-06-16.
24. "The basic pattern recognition and classification with openCV | Damiles". Blog.damiles.com, 2008-11-14. http://blog.damiles.com/2008/11/14/the-basic-patter-recognition-and-classification-with-opencv.html Retrieved 2013-06-16.
25. Ray Smith. "An Overview of the Tesseract OCR Engine". 2007. http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf Retrieved 2013-05-23.
26. "How does OCR document scanning work?". Explain that Stuff, 2012-01-30. http://www.explainthatstuff.com/how-ocr-works.html Retrieved 2013-06-16.
27. "How to optimize results from the OCR API when extracting text from an image? - Haven OnDemand Developer Community". https://community.havenondemand.com/t5/Wiki/How-to-optimize-results-from-the-OCR-API-when-extracting-text/ta-p/1656
28. Fehr, Tiff. "How We Sped Through 900 Pages of Cohen Documents in Under 10 Minutes". Times Insider, The New York Times, March 26, 2019. https://www.nytimes.com/2019/03/26/reader-center/times-documents-reporters-cohen.html?rref=collection%2Fsectioncollection%2Freader-center&action=click&contentCollection=reader-center&region=rank&module=package&version=highlights&contentPlacement=2&pgtype=sectionfront
29. "Train Your Tesseract". Train Your Tesseract, September 20, 2018. http://trainyourtesseract.com/ Retrieved September 20, 2018.
30. "What is the point of an online interactive OCR text editor? - Fenno-Ugrica". 2014-02-21. http://blogs.helsinki.fi/fennougrica/2014/02/21/ocr-text-editor/
31. Riedl, C.; Zanibbi, R.; Hearst, M. A.; Zhu, S.; Menietti, M.; Crusan, J.; Metelsky, I.; Lakhani, K. "Detecting Figures and Part Labels in Patents: Competition-Based Development of Image Processing Algorithms". International Journal on Document Analysis and Recognition, 19 (2): 155, 20 February 2016. doi:10.1007/s10032-016-0260-8. arXiv:1410.6751.
32. "Code and Data to evaluate OCR accuracy, originally from UNLV/ISRI". Google Code Archive. https://code.google.com/p/isri-ocr-evaluation-tools/
33. Holley, Rose. "How Good Can It Get? Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs". D-Lib Magazine, April 2009. http://www.dlib.org/dlib/march09/holley/03holley.html Retrieved 5 January 2014.
34. Suen, C.Y.; Plamondon, R.; Tappert, A.; Thomassen, A.; Ward, J.R.; Yamamoto, K. "Future Challenges in Handwriting and Computer Applications". 3rd International Symposium on Handwriting and Computer Applications, Montreal, May 29, 1987. http://users.erols.com/rwservices/pens/biblio88.html#Suen88 Retrieved 2008-10-03.
35. Sarantos Kapidakis, Cezary Mazurek, Marcin Werla. Research and Advanced Technology for Digital Libraries. Springer, 2015, p. 257. ISBN 9783319245928. https://books.google.com/?id=kEyGCgAAQBAJ&dq=OCR+and+long+s Retrieved 3 April 2018.
Categories: Artificial intelligence applications | Applications of computer vision | Automatic identification and data capture | Computational linguistics | Optical character recognition | Unicode | Symbols | Machine learning task