Natural language processing
Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

History

The history of natural language processing generally started in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten years of research had failed to fulfill expectations, funding for machine translation was dramatically reduced. Little further research in machine translation was conducted until the late 1980s, when the first statistical machine translation systems were developed.

Some notably successful natural language processing systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?".

During the 1970s, many programmers began to write "conceptual ontologies", which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky.

Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power (see Moore's law) and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to the existing hand-written rules. However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models.
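A minimal sketch of the hidden-Markov-model approach to part-of-speech tagging mentioned above. The tag set, vocabulary, and probability values are invented for illustration and do not come from any real corpus; Viterbi decoding then selects the most probable tag sequence for an input sentence.

```python
# Minimal sketch of HMM part-of-speech tagging with Viterbi decoding.
# All probabilities below are invented toy values for illustration only.

TAGS = ["NOUN", "VERB"]

# P(tag_i | tag_{i-1}), with "<s>" as the start state.
transition = {
    "<s>":  {"NOUN": 0.7, "VERB": 0.3},
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}

# P(word | tag)
emission = {
    "NOUN": {"dogs": 0.4, "bark": 0.1, "sleep": 0.1},
    "VERB": {"dogs": 0.05, "bark": 0.5, "sleep": 0.4},
}

def viterbi(words):
    """Return the most probable tag sequence for a list of words."""
    # trellis[i][tag] = (probability of the best path ending in `tag` at position i, backpointer)
    trellis = [{}]
    for tag in TAGS:
        trellis[0][tag] = (transition["<s>"][tag] * emission[tag].get(words[0], 1e-6), None)
    for i in range(1, len(words)):
        trellis.append({})
        for tag in TAGS:
            best_prev, best_p = max(
                ((prev, trellis[i - 1][prev][0] * transition[prev][tag]) for prev in TAGS),
                key=lambda x: x[1],
            )
            trellis[i][tag] = (best_p * emission[tag].get(words[i], 1e-6), best_prev)
    # Trace the best path back from the final position.
    last = max(TAGS, key=lambda t: trellis[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(trellis[i][path[-1]][1])
    return list(reversed(path))

print(viterbi(["dogs", "bark"]))  # expected: ['NOUN', 'VERB']
```

In a real tagger the transition and emission probabilities would be estimated from an annotated corpus rather than written by hand.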
Such statistical models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.

Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in their success. As a result, a great deal of research has gone into methods of learning more effectively from limited amounts of data.

Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms can learn from data that has not been hand-annotated with the desired answers, or from a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical.

In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing, due in part to a flurry of results showing that such techniques[4][5] can achieve state-of-the-art results in many natural language tasks, for example in language modeling,[6] parsing,[7][8] and many others. Popular techniques include the use of word embeddings to capture semantic properties of words, and an increase in end-to-end learning of a higher-level task (e.g., question answering) instead of relying on a pipeline of separate intermediate tasks (e.g., part-of-speech tagging and dependency parsing). In some areas, this shift has entailed substantial changes in how NLP systems are designed, such that deep neural network-based approaches may be viewed as a new paradigm distinct from statistical natural language processing. For instance, the term neural machine translation (NMT) emphasizes the fact that deep learning-based approaches to machine translation directly learn sequence-to-sequence transformations, obviating the need for intermediate steps such as word alignment and language modeling that were used in statistical machine translation (SMT).

Rule-based vs. statistical NLP

In the early days, many language-processing systems were designed by hand-coding a set of rules,[9][10] e.g. by writing grammars or devising heuristic rules for stemming. However, this approach is rarely robust to natural language variation. Since the so-called "statistical revolution"[11][12] in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning.
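As a concrete illustration of such hand-coded rules, here is a minimal sketch of a heuristic suffix-stripping stemmer. The rules and the length guard are invented for illustration (this is not the Porter stemmer or any other real algorithm); the last line shows how easily hand-written rules misfire on words they were not written for.

```python
# Minimal sketch of the hand-written-rule approach: a heuristic
# suffix-stripping stemmer with invented rules, for illustration only.

SUFFIX_RULES = [
    ("sses", "ss"),   # "classes"  -> "class"
    ("ies",  "y"),    # "parties"  -> "party"
    ("ing",  ""),     # "walking"  -> "walk"
    ("ed",   ""),     # "jumped"   -> "jump"
    ("s",    ""),     # "cats"     -> "cat"
]

def stem(word: str) -> str:
    """Apply the first matching suffix rule, if any."""
    for suffix, replacement in SUFFIX_RULES:
        # Length guard: only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

print(stem("walking"), stem("parties"))  # walk party
print(stem("agreed"), stem("flies"))     # agre flie; hand-written rules misfire here
```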
The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples (a corpus, plural "corpora", is a set of documents, possibly with human or computer annotations).

Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks. These algorithms take as input a large set of "features" that are generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system. More generally, systems based on machine-learning algorithms have many advantages over hand-produced rules.
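A minimal sketch of such a feature-weighted statistical model, assuming scikit-learn (version 1.0 or later) as the library; the library choice and the toy sentiment data are illustrative assumptions, not part of the article. A logistic regression learns one real-valued weight per bag-of-words feature and outputs class probabilities rather than hard if-then decisions.

```python
# Minimal sketch of a statistical model that attaches real-valued weights to
# input features and makes soft, probabilistic decisions.
# Assumes scikit-learn >= 1.0; the toy sentiment data is invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts  = ["great movie", "wonderful acting", "terrible plot", "awful movie"]
labels = [1, 1, 0, 0]   # 1 = positive, 0 = negative

# Each word becomes a feature; the model learns one real-valued weight per feature.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = LogisticRegression()
clf.fit(X, labels)

# Soft decision: a probability for each class, not a hard yes/no rule.
test = vectorizer.transform(["great acting, terrible plot"])
print(dict(zip(vectorizer.get_feature_names_out(), clf.coef_[0])))  # per-feature weights
print(clf.predict_proba(test))  # [[P(negative), P(positive)]]
```

The probabilities make it possible to pass the model's relative certainty on to downstream components instead of committing to a single answer.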
Major evaluations and tasks

The following is a list of some of the most commonly researched tasks in natural language processing. Note that some of these tasks have direct real-world applications, while others more commonly serve as subtasks that are used to aid in solving larger tasks. Though natural language processing tasks are closely intertwined, they are frequently subdivided into categories for convenience. A coarse division is given below.

Syntax
Semantics
Discourse
Speech
Dialogue

The first published work by an artificial intelligence, 1 the Road, was published in 2018; marketed as a novel, it contains sixty million words.

History of NLP

The first major study was conducted in 2013, on the occasion of the anniversary of the Association for Computational Linguistics (ACL), with a workshop called "Rediscovering 50 Years of Discoveries in Natural Language Processing".[19] The same year, the NLP4NLP project was started with the aim of discovering which terms were introduced over the years, with details concerning the authors and the conferences involved.[20] The project was later extended in other directions, eventually covering 34 conferences in speech and NLP. A full synthesis of the NLP4NLP project was published in 2019 as a double publication in Frontiers in Research Metrics and Analytics, covering 50 years of publications.[21][22]
References

1. Kongthon, Alisa; Sangkeettrakarn, Chatchawal; Kongyoung, Sarawoot; Haruechaiyasak, Choochart (2009). "Implementing an online help desk system based on conversational agent". Proceedings of the International Conference on Management of Emergent Digital EcoSystems (MEDES '09). ACM, New York, NY, USA. ISBN 978-1-60558-829-2. doi:10.1145/1643823.1643908.
2. Hutchins, J. (2005). "The history of machine translation in a nutshell" (self-published). http://www.hutchinsweb.me.uk/Nutshell-2005.pdf
3. Chomskyan linguistics encourages the investigation of "corner cases" that stress the limits of its theoretical models (comparable to pathological phenomena in mathematics), typically created using thought experiments, rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in corpus linguistics. The creation and use of such corpora of real-world data is a fundamental part of machine-learning algorithms for natural language processing. In addition, theoretical underpinnings of Chomskyan linguistics such as the so-called "poverty of the stimulus" argument entail that general learning algorithms, as are typically used in machine learning, cannot be successful in language processing. As a result, the Chomskyan paradigm discouraged the application of such models to language processing.
4. Goldberg, Yoav (2016). "A Primer on Neural Network Models for Natural Language Processing". Journal of Artificial Intelligence Research 57: 345–420. https://www.jair.org/media/4992/live-4992-9623-jair.pdf
5. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning. MIT Press. http://www.deeplearningbook.org/
6. Jozefowicz, Rafal; Vinyals, Oriol; Schuster, Mike; Shazeer, Noam; Wu, Yonghui (2016). "Exploring the Limits of Language Modeling". https://arxiv.org/abs/1602.02410
7. Choe, Do Kook; Charniak, Eugene (2016). "Parsing as Language Modeling". EMNLP 2016. https://aclanthology.coli.uni-saarland.de/papers/D16-1257/d16-1257
8. Vinyals, Oriol; et al. (2015). "Grammar as a Foreign Language". NIPS 2015. https://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf
9. Winograd, Terry (1971). Procedures as a Representation for Data in a Computer Program for Understanding Natural Language. http://hci.stanford.edu/winograd/shrdlu/
10. Schank, Roger C.; Abelson, Robert P. (1977). Scripts, Plans, Goals, and Understanding: An Inquiry into Human Knowledge Structures.
11. Johnson, Mark (2009). "How the statistical revolution changes (computational) linguistics". Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics.
12. Resnik, Philip (2011). "Four revolutions". Language Log, February 5, 2011.
13. Klein, Dan; Manning, Christopher D. (2002). "Natural language grammar induction using a constituent-context model". Advances in Neural Information Processing Systems.
14. Kishorjit, N.; Vidya Raj, R. K.; Nirmal, Y.; Sivaji, B. (2012). "Manipuri Morpheme Identification". Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP), COLING 2012, Mumbai, December 2012, pp. 95–108.
15. Duan, Yucong; Cruz, Christophe (2011). "Formalizing Semantic of Natural Language through Conceptualization from Existence". International Journal of Innovation, Management and Technology 2 (1): 37–42. Archived at https://web.archive.org/web/20111009135952/http://www.ijimt.org/abstract/100-E00187.htm
16. Mittal et al. (2011). "Versatile question answering systems: seeing in synthesis". IJIIDS 5 (2): 119–142. https://www.academia.edu/2475776/Versatile_question_answering_systems_seeing_in_synthesis
17. PASCAL Recognizing Textual Entailment Challenge (RTE-7). https://tac.nist.gov//2011/RTE/
18. Yi, Chucai; Tian, Yingli (2012). "Assistive Text Reading from Complex Background for Blind Persons". Camera-Based Document Analysis and Recognition. Springer Berlin Heidelberg, pp. 15–28. doi:10.1007/978-3-642-29364-1_2. ISBN 9783642293634.
19. Radev, Dragomir R.; Muthukrishnan, Pradeep; Qazvinian, Vahed; Abu-Jbara, Amjad (2013). "The ACL Anthology Network Corpus". Language Resources and Evaluation 47: 919–944. Springer.
20. Francopoulo, Gil; Mariani, Joseph; Paroubek, Patrick (2015). "The Cobbler's Children Won't Go Unshod". D-Lib Magazine. http://www.dlib.org/dlib/november15/francopoulo/11francopoulo.html
21. Mariani, Joseph; Francopoulo, Gil; Paroubek, Patrick (2019). "The NLP4NLP Corpus (I): 50 Years of Publication, Collaboration and Citation in Speech and Language Processing". Frontiers in Research Metrics and Analytics. https://doi.org/10.3389/frma.2018.00036
22. Mariani, Joseph; Francopoulo, Gil; Paroubek, Patrick; Vernier, Frédéric (2019). "The NLP4NLP Corpus (II): 50 Years of Research in Speech and Language Processing". Frontiers in Research Metrics and Analytics. https://doi.org/10.3389/frma.2018.00037