
Word embedding

  1. Development of technique

  2. Limitations

  3. For biological sequences: BioVectors

  4. Thought vectors

  5. Software

      Examples of application  

  6. See also

  7. References

{{machine learning bar}}

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.
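
A minimal sketch of this idea in Python (the toy vocabulary, the three-dimensional vectors, and their values are invented purely for illustration): each word is mapped to a short dense vector, and closeness in the embedding space is typically measured with cosine similarity.

    import numpy as np

    # A toy vocabulary of 5 words. A one-hot encoding would need one
    # dimension per word (5 here, millions in practice); an embedding
    # maps each word to a much shorter dense vector instead.
    embeddings = {
        "king":  np.array([0.50, 0.68, -0.59]),
        "queen": np.array([0.54, 0.71, -0.55]),
        "man":   np.array([0.10, 0.85, -0.20]),
        "woman": np.array([0.12, 0.88, -0.15]),
        "apple": np.array([-0.70, 0.05, 0.61]),
    }

    def cosine(u, v):
        """Cosine similarity, the usual measure of closeness in the embedding space."""
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(embeddings["king"], embeddings["queen"]))   # high: related words
    print(cosine(embeddings["king"], embeddings["apple"]))   # low: unrelated words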

Methods to generate this mapping include neural networks,[1] dimensionality reduction on the word co-occurrence matrix,[2][3][4] probabilistic models,[5] explainable knowledge-base methods,[6] and explicit representations in terms of the contexts in which words appear.[7]
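
As a rough illustration of the co-occurrence-matrix family of methods, the sketch below builds a word–word co-occurrence matrix from a made-up three-sentence corpus and reduces its dimensionality with a truncated SVD; the corpus, the sentence-wide co-occurrence window, and the choice of k are assumptions made only for the example, not a prescribed recipe.

    import numpy as np
    from itertools import combinations

    # Toy corpus (an assumption for illustration only).
    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "the cat chased the dog",
    ]

    # Build a symmetric word-word co-occurrence matrix over whole sentences.
    vocab = sorted({w for line in corpus for w in line.split()})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for line in corpus:
        words = line.split()
        for a, b in combinations(words, 2):
            M[index[a], index[b]] += 1
            M[index[b], index[a]] += 1

    # Reduce dimensionality with a truncated SVD: each row of U * S
    # (keeping k columns) becomes a dense k-dimensional word vector.
    k = 2
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    word_vectors = U[:, :k] * S[:k]
    for word, vec in zip(vocab, word_vectors):
        print(word, vec.round(2))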

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing[8] and sentiment analysis.[9]

Development of technique

In linguistics, word embeddings were discussed in the research area of distributional semantics, which aims to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was popularized by Firth.[10]

The technique of representing words as vectors has roots in the 1960s with the development of the vector space model for information retrieval. Reducing the number of dimensions using singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s.[11]

In 2000, Bengio et al. provided in a series of papers "neural probabilistic language models" to reduce the high dimensionality of word representations in contexts by "learning a distributed representation for words" (Bengio et al., 2003).[12] Word embeddings come in two different styles: one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of the linguistic contexts in which the words occur; these different styles are studied in (Lavelli et al., 2004).[13] Roweis and Saul published in Science a method for using "locally linear embedding" (LLE) to discover representations of high-dimensional data structures.[14] The area developed gradually and took off after 2010, partly because of important advances made since then in the quality of the vectors and in the training speed of the models.

There are many branches and many research groups working on word embeddings. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit that can train vector space models faster than previous approaches.[15] Most new word embedding techniques rely on a neural network architecture instead of more traditional n-gram models and unsupervised learning.[16] A brief usage sketch through the Gensim library follows.
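
The sketch below assumes the Gensim 4.x API; the toy sentences and hyperparameter values are placeholders chosen only so the snippet runs.

    from gensim.models import Word2Vec

    # Toy, pre-tokenized corpus (placeholder data for illustration).
    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["the", "dog", "chases", "the", "cat"],
    ]

    # Train a skip-gram model (sg=1); CBOW is sg=0.
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

    # Look up a word's vector and its nearest neighbours in the learned space.
    vec = model.wv["king"]
    print(model.wv.most_similar("king", topn=3))

With a corpus this small the neighbours are not meaningful; on realistic corpora, a vector_size of roughly 100 to 300 is typical and min_count is raised to prune rare words.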

Limitations

One of the main limitations of word embeddings (and of word vector space models in general) is that the possible meanings of a word are conflated into a single representation (a single vector in the semantic space). Sense embeddings[17] have been proposed as a solution to this problem: the individual meanings of a word are represented as distinct vectors in the space.

For biological sequences: BioVectors

Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications have been proposed by Asgari and Mofrad.[18] Named bio-vectors (BioVec) for biological sequences in general, protein-vectors (ProtVec) for proteins (amino-acid sequences), and gene-vectors (GeneVec) for gene sequences, these representations can be widely used in applications of deep learning in proteomics and genomics. The results presented by Asgari and Mofrad[18] suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.
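
A rough sketch of the underlying idea: overlapping 3-grams of an amino-acid sequence are treated as the "words" of a sentence, and standard word embeddings are trained over them. The sequences, the n-gram size, and the Gensim-based training below are illustrative assumptions, not the authors' exact pipeline.

    from gensim.models import Word2Vec

    def to_ngrams(seq, n=3):
        """Split a biological sequence into overlapping n-grams ('words')."""
        return [seq[i:i + n] for i in range(len(seq) - n + 1)]

    # Toy amino-acid sequences (placeholders, not real proteins).
    proteins = ["MKTAYIAKQR", "MKTAYLAKQR", "GAVLIMCFYW"]
    sentences = [to_ngrams(p) for p in proteins]

    # Train embeddings over the n-gram "vocabulary", in the spirit of BioVec/ProtVec.
    model = Word2Vec(sentences, vector_size=20, window=5, min_count=1, sg=1, epochs=200)
    print(model.wv.most_similar("MKT", topn=3))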

Thought vectors

Thought vectors are an extension of word embeddings to entire sentences or even documents. Some researchers hope that these can improve the quality of machine translation.[19]

Software

Software for training and using word embeddings includes Tomas Mikolov's Word2vec, Stanford University's GloVe,[20] AllenNLP's ELMo,[21] fastText, Gensim,[22] Indra[23] and Deeplearning4j. Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) are both used to reduce the dimensionality of word vector spaces and to visualize word embeddings and clusters.[24]
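
A small visualization sketch along these lines, assuming scikit-learn and matplotlib are available; the word list and the random 100-dimensional vectors below merely stand in for vectors taken from a trained model.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # Placeholder: in practice these vectors would come from a trained model,
    # e.g. the model.wv vectors of a Gensim Word2Vec model.
    words = ["king", "queen", "man", "woman", "apple", "orange"]
    vectors = np.random.RandomState(0).randn(len(words), 100)

    # Project the 100-dimensional vectors down to 2-D for plotting.
    # t-SNE (sklearn.manifold.TSNE) is a common alternative for larger sets.
    points = PCA(n_components=2).fit_transform(vectors)

    plt.scatter(points[:, 0], points[:, 1])
    for word, (x, y) in zip(words, points):
        plt.annotate(word, (x, y))
    plt.title("Word embeddings projected to 2-D with PCA")
    plt.show()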

Examples of application

For instance, fastText is used to compute word embeddings for the text corpora in Sketch Engine that are available online.[25]
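
A hedged sketch of what fastText-style subword embeddings look like in code, here via Gensim's FastText implementation rather than Sketch Engine itself (the corpus and hyperparameters are placeholders): because vectors are composed from character n-grams, even a word unseen during training can be assigned a vector.

    from gensim.models import FastText

    # Toy, pre-tokenized corpus (placeholder data).
    sentences = [
        ["word", "embeddings", "map", "words", "to", "vectors"],
        ["fasttext", "builds", "vectors", "from", "character", "ngrams"],
    ]

    # fastText learns vectors for character n-grams (min_n..max_n), so it can
    # compose a vector even for words never seen during training.
    model = FastText(sentences, vector_size=30, window=3, min_count=1,
                     min_n=3, max_n=5, epochs=100)

    print(model.wv["embeddings"][:5])   # vector of a seen word (first 5 dims)
    print(model.wv["embedding"][:5])    # out-of-vocabulary word, built from its n-grams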

See also

  • Latent semantic analysis
  • Brown clustering
  • GloVe
  • word2vec
  • fastText
  • Gensim

References

1. ^{{cite arXiv |eprint=1310.4546 |last1=Mikolov |first1=Tomas |title=Distributed Representations of Words and Phrases and their Compositionality |last2=Sutskever |first2=Ilya |last3=Chen |first3=Kai |last4=Corrado |first4=Greg |last5=Dean |first5=Jeffrey |class=cs.CL| year=2013}}
2. ^{{Cite journal|arxiv=1312.5542 |last1=Lebret |first1=Rémi |title=Word Embeddings through Hellinger PCA |journal=Conference of the European Chapter of the Association for Computational Linguistics (EACL) |volume=2014 |last2=Collobert |first2=Ronan |year=2013|bibcode=2013arXiv1312.5542L }}
3. ^{{Cite conference |url=http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf |title=Neural Word Embedding as Implicit Matrix Factorization |last=Levy |first=Omer |conference=NIPS |year=2014 |last2=Goldberg |first2=Yoav}}
4. ^{{Cite conference |url=http://ijcai.org/papers15/Papers/IJCAI15-513.pdf |title=Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective |last=Li |first=Yitan |conference=Int'l J. Conf. on Artificial Intelligence (IJCAI) |year=2015 |last2=Xu |first2=Linli}}
5. ^{{Cite journal|last=Globerson|first=Amir|date=2007|title=Euclidean Embedding of Co-occurrence Data|url=http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34951.pdf|journal=Journal of Machine Learning Research|doi=|pmid=|access-date=}}
6. ^{{Cite journal|last=Qureshi|first=M. Atif|last2=Greene|first2=Derek|date=2018-06-04|title=EVE: explainable vector based embedding technique using Wikipedia|journal=Journal of Intelligent Information Systems|language=en|doi=10.1007/s10844-018-0511-x|issn=0925-9902|arxiv=1702.06891}}
7. ^{{cite conference |last1=Levy |first1=Omer |last2=Goldberg |first2=Yoav |title=Linguistic Regularities in Sparse and Explicit Word Representations |conference=CoNLL |pages=171–180 |year=2014 |url=https://levyomer.files.wordpress.com/2014/04/linguistic-regularities-in-sparse-and-explicit-word-representations-conll-2014.pdf}}
8. ^{{cite conference |last1=Socher |first1=Richard |last2=Bauer |first2=John |last3=Manning |first3=Christopher |last4=Ng |first4=Andrew |title=Parsing with compositional vector grammars |conference=Proc. ACL Conf. |year=2013 |url=http://www.socher.org/uploads/Main/SocherBauerManningNg_ACL2013.pdf}}
9. ^{{cite conference |last1=Socher |first1=Richard |last2=Perelygin |first2=Alex |last3=Wu |first3=Jean |last4=Chuang |first4=Jason |last5=Manning |first5=Chris |last6=Ng |first6=Andrew |last7=Potts |first7=Chris |title=Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank |conference=EMNLP |year=2013 |url=http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf}}
10. ^{{cite journal | last = Firth| first = J.R. | year = 1957 | title = A synopsis of linguistic theory 1930-1955 | journal = Studies in Linguistic Analysis | pages = 1–32 | ref = harv }} Reprinted in {{cite book | editor = F.R. Palmer | title = Selected Papers of J.R. Firth 1952-1959 | publisher = London: Longman | year = 1968}}
11. ^{{cite web |url=https://www.linkedin.com/pulse/brief-history-word-embeddings-some-clarifications-magnus-sahlgren/ |first=Magnus|last=Sahlgren | title=A brief history of word embeddings}}
12. ^{{cite book|title=A Neural Probabilistic Language Model|doi=10.1007/3-540-33486-6_6 | journal=Studies in Fuzziness and Soft Computing|volume=194 |pages=137–186|year=2006|last1=Bengio|first1=Yoshua|last2=Schwenk |first2=Holger |last3=Senécal |first3=Jean-Sébastien |last4=Morin |first4=Fréderic |last5=Gauvain |first5=Jean-Luc |isbn=978-3-540-30609-2 }}
13. ^{{cite conference |year=2004|last1=Lavelli |first1=Alberto |last2=Sebastiani |first2=Fabrizio |last3=Zanoli |first3=Roberto|title=Distributional term representations: an experimental comparison| conference=13th ACM International Conference on Information and Knowledge Management|pages=615–624|doi=10.1145/1031171.1031284 }}
14. ^{{cite journal|title=Nonlinear Dimensionality Reduction by Locally Linear Embedding|journal=Science|volume=290|issue=5500|pages=2323–6|bibcode=2000Sci...290.2323R|last1=Roweis|first1=Sam T.|last2=Saul|first2=Lawrence K.|year=2000|doi=10.1126/science.290.5500.2323|pmid=11125150|citeseerx=10.1.1.111.3313}}
15. ^[https://code.google.com/archive/p/word2vec/ word2vec]
16. ^{{cite journal|title=A Scalable Hierarchical Distributed Language Model|pages=1081–1088|url=http://papers.nips.cc/paper/3583-a-scalable-hierarchical-distributed-language-model|publisher=Curran Associates, Inc.|year=2009}}
17. ^{{cite conference |last1=Camacho-Collados |first1=Jose |last2=Pilehvar |first2=Mohammad Taher |title= From Word to Sense Embeddings: A Survey on Vector Representations of Meaning | year=2018 |arxiv=1805.04032|bibcode=2018arXiv180504032C }}
18. ^{{cite journal|last1=Asgari|first1=Ehsaneddin|last2=Mofrad|first2=Mohammad R.K.|title=Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics|journal=PLOS ONE|date=2015|volume=10|issue=11|page=e0141287|doi=10.1371/journal.pone.0141287|pmid=26555596|pmc=4640716|bibcode=2015PLoSO..1041287A|arxiv=1503.05140}}
19. ^{{cite arXiv|title=skip-thought vectors|eprint=1506.06726|last1=Kiros|first1=Ryan|last2=Zhu|first2=Yukun|last3=Salakhutdinov|first3=Ruslan|last4= Zemel|first4=Richard S.|last5=Torralba|first5=Antonio|last6=Urtasun|first6=Raquel|last7=Fidler|first7=Sanja|class=cs.CL|year=2015}}
20. ^{{cite web |url=http://nlp.stanford.edu/projects/glove/ |title=GloVe}}
21. ^{{cite web |url=https://allennlp.org/elmo |title=Elmo}}
22. ^{{cite web |url=http://radimrehurek.com/gensim/ |title=Gensim}}
23. ^{{cite web |url=https://github.com/Lambda-3/Indra |title=Indra|date=2018-10-25}}
24. ^{{Cite journal|last=Ghassemi|first=Mohammad|last2=Mark|first2=Roger|last3=Nemati|first3=Shamim|date=2015|title=A Visualization of Evolving Clinical Sentiment Using Vector Representations of Clinical Notes|url=http://www.cinc.org/archives/2015/pdf/0629.pdf|journal=Computing in Cardiology|doi=|pmid=|access-date=}}
25. ^{{cite web |url=https://embeddings.sketchengine.co.uk/ |title=Embedding Viewer |author= |date= |website=Embedding Viewer |publisher=Lexical Computing |access-date=7 Feb 2018 |quote=}}

Categories: Language modeling | Artificial neural networks | Natural language processing | Computational linguistics
