Entry | Scene text |
Definition |
Scene text is text that appears in an image captured by a camera in an outdoor environment. The detection and recognition of scene text in camera-captured images are computer vision tasks that became important once smartphones with good cameras became ubiquitous. The text in scene images varies in shape, font, colour and position, and its recognition is sometimes further complicated by non-uniform illumination and focus. To promote progress in scene text recognition, the International Conference on Document Analysis and Recognition (ICDAR) conducts a robust reading competition. The competition was held in 2003 and 2005,[1][2][3] and has been held at every ICDAR conference since.[4][5][6] The International Association for Pattern Recognition (IAPR) maintains a list of datasets for reading systems.[7]

Text detection

Text detection is the process of locating the text present in an image and surrounding it with a rectangular bounding box. It can be carried out using image-based or frequency-based techniques. In image-based techniques, the image is segmented into multiple segments, each a connected component of pixels with similar characteristics. The statistical features of the connected components are used to group them into text regions. Machine learning approaches such as support vector machines and convolutional neural networks are used to classify the components as text or non-text. In frequency-based techniques, the discrete Fourier transform (DFT) or discrete wavelet transform (DWT) is used to extract high-frequency coefficients. The assumption is that text has strong high-frequency components, so keeping only the high-frequency coefficients separates the text from the non-text regions of the image.

Word recognition

In word recognition, the text is assumed to be already detected and located, so a rectangular bounding box containing the text is available.
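The grouping step of the image-based technique can be sketched with a minimal connected-component labeller. This is an illustrative, pure-Python sketch (the function names and the 0/1 input format are assumptions, not from any cited system): it finds connected components in a binarised image and computes the bounding box of each, which is the raw material a text/non-text classifier would then operate on.

```python
from collections import deque

def connected_components(binary, connectivity=4):
    """Label connected components in a binary image (list of 0/1 rows).

    Returns a list of components, each a list of (row, col) pixels.
    Each component is a candidate character whose statistics (size,
    aspect ratio, stroke width) can later be classified as text or
    non-text.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and not seen[r][c]:
                # Breadth-first flood fill from an unvisited text pixel.
                queue, comp = deque([(r, c)]), []
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in offsets:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components

def bounding_box(component):
    """Axis-aligned bounding box (top, left, bottom, right) of a component."""
    ys = [p[0] for p in component]
    xs = [p[1] for p in component]
    return min(ys), min(xs), max(ys), max(xs)
```

In a real detector, the input would come from binarising a camera image, and neighbouring component boxes with similar height and stroke width would be merged into word-level boxes.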
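The frequency-based idea can likewise be illustrated with a one-level 2-D Haar DWT, the simplest wavelet transform. This is a toy sketch under assumed conventions (even image dimensions, unnormalised averaging filters); a detector would threshold the detail sub-bands, where text strokes produce strong responses and smooth background produces weak ones.

```python
def haar_dwt_2d(image):
    """One-level 2-D Haar DWT of an image given as a list of
    equal-length rows with even dimensions.

    Returns the (LL, LH, HL, HH) sub-bands: LL is the smoothed
    approximation, the other three hold the high-frequency detail
    coefficients used by frequency-based text detectors.
    """
    def split_rows(rows):
        # Pairwise average (low-pass) and difference (high-pass)
        # along each row.
        low, high = [], []
        for row in rows:
            low.append([(row[i] + row[i + 1]) / 2
                        for i in range(0, len(row), 2)])
            high.append([(row[i] - row[i + 1]) / 2
                         for i in range(0, len(row), 2)])
        return low, high

    def transpose(m):
        return [list(col) for col in zip(*m)]

    # Filter along rows, then along columns of each result.
    low, high = split_rows(image)
    ll, lh = (transpose(b) for b in split_rows(transpose(low)))
    hl, hh = (transpose(b) for b in split_rows(transpose(high)))
    return ll, lh, hl, hh
```

A flat background patch yields zero detail coefficients, while a patch containing an edge (such as a character stroke boundary) yields large ones, which is exactly the property the filtering step exploits.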
The word present in the bounding box needs to be recognized. The available methods can be broadly classified into top-down and bottom-up approaches. In top-down approaches, a set of words from a dictionary is used to identify which word best suits the given image.[8][9][10] The image is not segmented in most of these methods, so the top-down approach is sometimes referred to as segmentation-free recognition. In bottom-up approaches, the image is segmented into multiple components and the segmented image is passed through a recognition engine.[11][12][13] Either an off-the-shelf optical character recognition (OCR) engine[14][15][16] or a custom-trained one is used to recognise the text.

References

1. Lucas, S. M. ICDAR 2005 text locating competition results. In Proc. 8th ICDAR, pages 80–84, 2005. doi:10.1109/ICDAR.2005.231.
2. ICDAR 2005 Robust Reading Competitions. http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2005_Robust_Reading_Competitions
3. Lucas, S. M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R., et al. ICDAR 2003 Robust Reading Competitions: Entries, Results, and Future Directions. International Journal of Document Analysis and Recognition (IJDAR), 7(2–3):105–122, 2005. doi:10.1007/s10032-004-0134-3.
4. ICDAR 2013. http://www.icdar2013.org
5. ICDAR 2017. http://u-pat.org/ICDAR2017/
6. ICDAR 2011 Robust Reading Competition. http://www.cvc.uab.es/icdar2011competition/
7. IAPR TC11 Reading Systems – Datasets List. http://www.iapr-tc11.org/mediawiki/index.php?title=Datasets
8. Weinman, J. J., Learned-Miller, E., and Hanson, A. R. Scene text recognition using similarity and a lexicon with sparse belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10):1733–1746, 2009. doi:10.1109/TPAMI.2009.38.
9. Mishra, A., Alahari, K., and Jawahar, C. V. Scene Text Recognition using Higher Order Language Priors. In Proc. BMVC, 2012. http://www.bmva.org/bmvc/2012/BMVC/paper127/abstract127.pdf
10. Novikova, T., Barinova, O., Kohli, P., and Lempitsky, V. Large-Lexicon Attribute-Consistent Text Recognition in Natural Images. In Proc. 12th ECCV, pages 752–765, 2012. doi:10.1007/978-3-642-33783-3_54.
11. Kumar, D. and Ramakrishnan, A. G. Power-law transformation for enhanced recognition of born-digital word images. In Proc. 9th SPCOM, pages 1–5, 2012. doi:10.1109/SPCOM.2012.6290009.
12. Kumar, D., Anil Prasad, M. N., and Ramakrishnan, A. G. MAPS: Midline analysis and propagation of segmentation. In Proc. 8th ICVGIP, 2012.
13. Kumar, D., Anil Prasad, M. N., and Ramakrishnan, A. G. NESP: Nonlinear enhancement and selection of plane for optimal segmentation and recognition of scene word images. In Proc. 20th DRR, 2013. https://pdfs.semanticscholar.org/a87a/eca2aa95fcc16a6e63a6d69c322732767736.pdf
14. ABBYY FineReader. http://www.abbyy.com/
15. Nuance OmniPage. http://www.nuance.com/
16. Tesseract OCR Engine. http://code.google.com/p/tesseract-ocr/
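The top-down, lexicon-driven matching described under word recognition can be illustrated with a minimal sketch: given a noisy raw transcription, return the dictionary word at the smallest edit distance from it. The function names and the toy lexicon are illustrative assumptions, not taken from any of the cited systems, which score full word models against the image rather than a raw string.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_lexicon_match(raw, lexicon):
    """Return the lexicon word closest to a noisy recognition result."""
    return min(lexicon, key=lambda w: edit_distance(raw, w))
```

For example, a raw result such as "c0ffee" would be snapped to "coffee" in a lexicon containing it, which is why a lexicon makes top-down methods robust to individual character errors.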