Tesseract (software)

  1. History

  2. Features

  3. Version 4

  4. User interfaces

  5. Reception

  6. See also

  7. References

  8. External links

{{Infobox Software
| name = Tesseract
| logo = File:Tesseract OCR logo (Google).png
| logo size = 250px
| screenshot = File:Tesseract v3.02.png
| screenshot size = 250px
| caption = Tesseract 3.02 running on Gnome Terminal 3.8.0. "input_image.tif" is the input document which will be rendered as "output_text.txt" by Tesseract.
| collapsible =
| author = Ray Smith, Hewlett-Packard[1]
| developer = Google
| released =
| latest release version = 4.0.0
| latest release date = {{Start date and age|2018|10|29}}[2]
| latest preview version =
| latest preview date =
| programming language = C and C++
| operating system = Linux, Windows, and macOS (x86)
| platform =
| language = Interface: English; recognition: Afrikaans, Albanian, Arabic, Azerbaijani, Basque, Belarusian, Bengali, Bulgarian, Catalan, Czech, Cherokee, Croatian, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Italian, Japanese, Kannada, Korean, Latvian, Lithuanian, Malayalam, Macedonian, Maltese, Malay, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian and Vietnamese (more can be added using included training files)
| status = Active
| genre = Optical character recognition
| license = Apache License v2.0
| repo =
| website =
}}

Tesseract is an optical character recognition engine for various operating systems.[3] It is free software, released under the Apache License, Version 2.0,[1][4][5] and development has been sponsored by Google since 2006.[6]

In 2006, Tesseract was considered one of the most accurate open-source OCR engines then available.[5][7]

History

The Tesseract engine was originally developed as proprietary software at Hewlett-Packard laboratories in Bristol, England, and Greeley, Colorado, between 1985 and 1994. Further changes were made in 1996 to port it to Windows, and part of the code was migrated from C to C++ in 1998. Much of the code was originally written in C and some later in C++; all of it has since been converted to at least compile with a C++ compiler.[4] Very little work was done in the following decade. Hewlett-Packard and the University of Nevada, Las Vegas (UNLV) released it as open source in 2005, and Google has sponsored its development since 2006.[6]

Features

Tesseract was among the top three OCR engines in terms of character accuracy in 1995.[8] It is available for Linux, Windows and Mac OS X; however, because of limited resources, developers rigorously test it only under Windows and Ubuntu.[4][5]

Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as inputs. These early versions did not include layout analysis, and so inputting multi-columned text, images, or equations produced garbled output. Since version 3.00 Tesseract has supported output text formatting, hOCR[9] positional information and page-layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportionally spaced.[5]

The initial versions of Tesseract could only recognize English-language text. Tesseract v2 added six additional Western languages (French, Italian, German, Spanish, Brazilian Portuguese, Dutch). Version 3 extended language support significantly to include ideographic (Chinese and Japanese) and right-to-left (e.g. Arabic, Hebrew) languages, as well as many more scripts. New languages included Arabic, Bulgarian, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, German (Fraktur script), Greek, Finnish, Hebrew, Hindi, Hungarian, Indonesian, Japanese, Korean, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak (standard and Fraktur script), Slovenian, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian and Vietnamese. V3.04, released in July 2015, added 39 further language/script combinations, bringing the total count of supported languages to over 100. New language codes included: amh (Amharic), asm (Assamese), aze_cyrl (Azerbaijani in Cyrillic script), bod (Tibetan), bos (Bosnian), ceb (Cebuano), cym (Welsh), dzo (Dzongkha), fas (Persian), gle (Irish), guj (Gujarati), hat (Haitian and Haitian Creole), iku (Inuktitut), jav (Javanese), kat (Georgian), kat_old (Old Georgian), kaz (Kazakh), khm (Central Khmer), kir (Kyrgyz), kur (Kurdish), lao (Lao), lat (Latin), mar (Marathi), mya (Burmese), nep (Nepali), ori (Oriya), pan (Punjabi), pus (Pashto), san (Sanskrit), sin (Sinhala), srp_latn (Serbian in Latin script), syr (Syriac), tgk (Tajik), tir (Tigrinya), uig (Uyghur), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek in Cyrillic script), yid (Yiddish).[10]

Tesseract can be trained to work in other languages too.[5]

Tesseract can process right-to-left text such as Arabic or Hebrew, many Indic scripts, and CJK quite well. Accuracy rates are given in Ray Smith's presentation for the Tesseract tutorial at DAS 2016 in Santorini.[11]

Tesseract is suitable for use as a backend and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus.[12]

Tesseract's output is of very poor quality if the input images are not preprocessed to suit it. Images (especially screenshots) must be scaled up so that the text x-height is at least 20 pixels.[13] Any rotation or skew must be corrected, or no text will be recognized. Low-frequency changes in brightness must be high-pass filtered, or Tesseract's binarization stage will destroy much of the page. Dark borders must be removed manually, or they will be misinterpreted as characters.[14]
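Two of the preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not part of Tesseract: the helper names `upscale_factor` and `crop_dark_border`, the list-of-rows grayscale representation, and the darkness threshold of 64 are all assumptions made for the example. Only the 20-pixel minimum x-height comes from the text.

```python
MIN_X_HEIGHT_PX = 20  # minimum x-height stated in the text above


def upscale_factor(measured_x_height_px, target_px=MIN_X_HEIGHT_PX):
    """Return the factor by which to enlarge an image so that its
    text x-height reaches target_px (never less than 1.0)."""
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1.0, target_px / measured_x_height_px)


def crop_dark_border(pixels, dark_threshold=64):
    """Strip outer rows and columns made up entirely of dark pixels.

    pixels is a list of equal-length rows of grayscale values
    (0 = black, 255 = white). The threshold is an assumed value.
    """
    def is_dark(line):
        return all(value < dark_threshold for value in line)

    rows = [list(row) for row in pixels]
    # Remove dark rows from the top and the bottom.
    while rows and is_dark(rows[0]):
        rows.pop(0)
    while rows and is_dark(rows[-1]):
        rows.pop()
    if not rows:
        return []

    # Find the first and last columns that are not entirely dark.
    left, right = 0, len(rows[0])
    while left < right and is_dark([row[left] for row in rows]):
        left += 1
    while right > left and is_dark([row[right - 1] for row in rows]):
        right -= 1
    return [row[left:right] for row in rows]
```

For example, a screenshot whose x-height measures 8 pixels would need `upscale_factor(8)`, that is a 2.5-fold enlargement, before recognition. Deskewing and high-pass filtering are not shown here; an image-processing library would normally handle them.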

Version 4

Version 4 adds an LSTM-based OCR engine and models for many additional languages and scripts, bringing the total to 116 languages.[15]

Additionally, scripts for 37 languages are supported, so a language can be recognized by way of the script in which it is written.

User interfaces

Tesseract is executed from the command-line interface.[16] While Tesseract is not supplied with a GUI, there are many separate projects which provide a GUI for it.[17] One common example is OCRFeeder.[18]
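As a hedged illustration of the command-line usage described above, the following Python sketch assembles a `tesseract` invocation without running it. The helper name `build_tesseract_command` is hypothetical. The file names come from the screenshot caption, and the `-l` language option, `+`-joined language codes and `hocr` config name are assumed to match the installed Tesseract version.

```python
import shlex


def build_tesseract_command(image_path, output_base, languages=("eng",), hocr=False):
    """Build the argument list for a Tesseract command-line run.

    output_base is the output file name without extension; Tesseract
    appends the extension itself (assumed: .txt, or .hocr for hOCR).
    """
    command = ["tesseract", image_path, output_base, "-l", "+".join(languages)]
    if hocr:
        # Request hOCR positional output instead of plain text.
        command.append("hocr")
    return command


if __name__ == "__main__":
    command = build_tesseract_command("input_image.tif", "output_text")
    # Show the shell form of the command; nothing is executed here.
    print(shlex.join(command))
```

To actually run the recognition, the list can be passed to `subprocess.run(command, check=True)` on a system where Tesseract is installed and on the search path.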

Reception

In a July 2007 article on Tesseract, Anthony Kay of Linux Journal termed it "a quirky command-line tool that does an outstanding job". At that time he noted "Tesseract is a bare-bones OCR engine. The build process is a little quirky, and the engine needs some additional features (such as layout detection), but the core feature, text recognition, is drastically better than anything else I've tried from the Open Source community. It is reasonably easy to get excellent recognition rates using nothing more than a scanner and some image tools, such as The GIMP and Netpbm."[3]

See also

  • Libtiff

References

1. ^ Google (2008). "tesseract-ocr". https://github.com/tesseract-ocr/tesseract/. Retrieved 2016-03-08.
2. ^ "Releases - tesseract-ocr/tesseract". GitHub. https://github.com/tesseract-ocr/tesseract/releases. Retrieved 29 October 2018.
3. ^ Kay, Anthony (July 2007). "Tesseract: an Open-Source Optical Character Recognition Engine". Linux Journal. http://www.linuxjournal.com/article/9676. Retrieved 28 September 2011.
4. ^ Vincent, Luc (August 2006). "Announcing Tesseract OCR". http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html. Archived from the original at https://web.archive.org/web/20061026075310/http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html on October 26, 2006. Retrieved 2008-06-26.
5. ^ Canonical Ltd. (February 2011). "OCR". https://help.ubuntu.com/community/OCR. Retrieved 2011-02-11.
6. ^ Announcing Tesseract OCR - The official Google blog
7. ^ Willis, Nathan (September 2006). "Google's Tesseract OCR engine is a quantum leap forward". http://www.linux.com/articles/57222. Retrieved 2008-07-18.
8. ^ Rice, Stephen V., Frank R. Jenkins, and Thomas A. Nartker. The Fourth Annual Test of OCR Accuracy. expervision.com. Retrieved 21 May 2013.
9. ^ Tesseract Project (February 2011). "Issue 263: patch to enable hOCR output". http://code.google.com/p/tesseract-ocr/issues/detail?id=263. Archived from the original at https://web.archive.org/web/20121113065732/http://code.google.com/p/tesseract-ocr/issues/detail?id=263 on November 13, 2012. Retrieved 26 February 2011.
10. ^ "langdata - Source training data for Tesseract for lots of languages". https://github.com/tesseract-ocr/langdata. Retrieved 6 November 2016.
11. ^ "Training LSTM networks on 100 languages and test results". https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf. Retrieved 18 March 2018.
12. ^ Announcing the OCRopus Open Source OCR System (Thomas Breuel, OCRopus Project Leader).
13. ^ "FAQ - tesseract-ocr - Frequently Asked Questions - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. - Google Project Hosting". Code.google.com. http://code.google.com/p/tesseract-ocr/wiki/FAQ#Is_there_a_Minimum_Text_Size?_%28It_won%27t_read_screen_text!%29. Retrieved 2014-05-30.
14. ^ "ImproveQuality - tesseract-ocr - Advice on improving the quality of your output. - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google. - Google Project Hosting". Code.google.com. 2014-01-27. http://code.google.com/p/tesseract-ocr/wiki/ImproveQuality. Retrieved 2014-05-30.
15. ^ "TESSERACT(1) Manual Page". https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc. Retrieved 15 March 2018.
16. ^ Google Code – Tesseract Readme
17. ^ "3rdParty - tesseract-ocr - GUIs and Other Projects using Tesseract OCR." github.com. https://github.com/tesseract-ocr/tesseract/wiki/User-Projects-%E2%80%93-3rdParty. Retrieved 2017-03-30.
18. ^ "OCRFeeder". GNOME wiki. https://wiki.gnome.org/Apps/OCRFeeder. Retrieved 12 January 2019.

External links

{{Commons category}}
  • {{Official website|https://github.com/tesseract-ocr}}
  • Hacking Tesseract V0.04 – C/C++ structure of Tesseract extracted from Doxyfied source code (based on Tesseract V1.03)
  • [https://github.com/tesseract-ocr/docs/blob/master/tesseracticdar2007.pdf Tesseract OCR Engine] An Overview of the Tesseract OCR Engine.
{{OCR}}

Categories: Free software programmed in C, Free software programmed in C++, Optical character recognition, Google software, Formerly proprietary software, Software using the Apache license
