Entry | Word2vec |
Definition |
Word2vec was created by a team of researchers led by Tomas Mikolov at Google and patented.[2] The algorithm has subsequently been analysed and explained by other researchers.[3][4] Embedding vectors created using the Word2vec algorithm have many advantages compared to earlier algorithms[1] such as latent semantic analysis.

CBOW and skip-grams

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words.[1][5] According to the authors' note,[6] CBOW is faster, while skip-gram is slower but does a better job for infrequent words.

Parametrization

Results of word2vec training can be sensitive to parametrization. The following are some important parameters in word2vec training.

Training algorithm

A Word2vec model can be trained with hierarchical softmax and/or negative sampling. To approximate the conditional log-likelihood a model seeks to maximize, the hierarchical softmax method uses a Huffman tree to reduce computation. The negative sampling method, on the other hand, approaches the maximization problem by minimizing the log-likelihood of sampled negative instances. According to the authors, hierarchical softmax works better for infrequent words, while negative sampling works better for frequent words and with low-dimensional vectors.[6] As training epochs increase, hierarchical softmax stops being useful.[7]

Sub-sampling

High-frequency words often provide little information. Words with frequency above a certain threshold may be subsampled to increase training speed.[8]

Dimensionality

The quality of the word embedding increases with higher dimensionality, but after some point the marginal gain diminishes.[1] Typically, the dimensionality of the vectors is set to be between 100 and 1,000.

Context window

The size of the context window determines how many words before and after a given word are included as context words of that word. According to the authors' note, the recommended value is 10 for skip-gram and 5 for CBOW.[6]
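The parameters above map directly onto the options exposed by common reimplementations of the algorithm. As a minimal sketch (not the reference C tool), assuming the open-source gensim library (see reference [19]) with its 4.x parameter names, a model could be configured roughly as follows; the toy corpus and every keyword value here are purely illustrative, not recommendations from the original authors:

```python
# Illustrative sketch only: maps the parameters discussed above onto
# gensim's Word2Vec keyword arguments (gensim 4.x names; older releases
# use size/iter instead of vector_size/epochs).
from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # tiny built-in toy corpus

model = Word2Vec(
    sentences=common_texts,  # replace with a real tokenized corpus
    sg=1,             # 1 = continuous skip-gram, 0 = CBOW
    hs=0,             # 1 enables hierarchical softmax
    negative=5,       # >0 enables negative sampling with this many noise words
    sample=1e-3,      # sub-sampling threshold for high-frequency words
    vector_size=100,  # embedding dimensionality (typically 100 to 1,000)
    window=10,        # context window (~10 suggested for skip-gram, ~5 for CBOW)
    min_count=1,      # keep every word; the toy corpus is tiny
    epochs=5,
)

print(model.wv["computer"])  # the learned 100-dimensional vector for "computer"
```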
Extensions

An extension of word2vec to construct embeddings from entire documents (rather than the individual words) has been proposed.[9] This extension is called paragraph2vec or doc2vec and has been implemented in the C, Python[10][11] and Java/Scala[12] tools (see below), with the Java and Python versions also supporting inference of document embeddings on new, unseen documents.

Word vectors for bioinformatics: BioVectors

An extension of word vectors for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications has been proposed by Asgari and Mofrad.[13] Named bio-vectors (BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of machine learning in proteomics and genomics. The results suggest that BioVectors can characterize biological sequences in terms of biochemical and biophysical interpretations of the underlying patterns.[13] A similar variant, dna2vec, has shown that there is a correlation between the Needleman-Wunsch similarity score and the cosine similarity of dna2vec word vectors.[14]

Word vectors for radiology: Intelligent Word Embedding (IWE)

An extension of word vectors for creating a dense vector representation of unstructured radiology reports has been proposed by Banerjee et al.[15] One of the biggest challenges with Word2vec is how to handle unknown or out-of-vocabulary (OOV) words and morphologically similar words. This can particularly be an issue in domains like medicine, where synonyms and related words may be used depending on the preferred style of the radiologist, and where words may have been used infrequently in a large corpus. If the word2vec model has not encountered a particular word before, it will be forced to use a random vector, which is generally far from its ideal representation. IWE combines Word2vec with a semantic dictionary mapping technique to tackle the major challenges of information extraction from clinical texts, which include the ambiguity of the free-text narrative style, lexical variations, the use of ungrammatical and telegraphic phrases, the arbitrary ordering of words, and the frequent appearance of abbreviations and acronyms. Of particular interest, the IWE model (trained on one institutional dataset) successfully translated to a different institutional dataset, which demonstrates good generalizability of the approach across institutions.

Analysis

The reasons for successful word embedding learning in the word2vec framework are poorly understood. Goldberg and Levy point out that the word2vec objective function causes words that occur in similar contexts to have similar embeddings (as measured by cosine similarity) and note that this is in line with J. R. Firth's distributional hypothesis. However, they note that this explanation is "very hand-wavy" and argue that a more formal explanation would be preferable.[3] Levy et al. (2015)[16] show that much of the superior performance of word2vec or similar embeddings in downstream tasks is not a result of the models per se, but of the choice of specific hyperparameters. Transferring these hyperparameters to more 'traditional' approaches yields similar performance in downstream tasks. Arora et al. (2016)[17] explain word2vec and related algorithms as performing inference for a simple generative model for text, which involves a random-walk generation process based on a log-linear topic model. They use this to explain some properties of word embeddings, including their use to solve analogies.

Preservation of semantic and syntactic relationships

The word embedding approach is able to capture multiple different degrees of similarity between words. Mikolov et al. (2013)[18] found that semantic and syntactic patterns can be reproduced using vector arithmetic. Patterns such as “Man is to Woman as Brother is to Sister” can be generated through algebraic operations on the vector representations of these words, such that the vector representation of “Brother” - “Man” + “Woman” produces a result which is closest to the vector representation of “Sister” in the model. Such relationships can be generated for a range of semantic relations (such as Country–Capital) as well as syntactic relations (e.g. present tense–past tense).
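As a concrete illustration of this vector arithmetic, the sketch below queries pretrained vectors loaded through gensim's downloader API. The model name "word2vec-google-news-300" (a large, roughly 1.6 GB download) and the exact words returned are assumptions about the reader's setup and the pretrained vocabulary, not results reported in the cited papers:

```python
# Illustrative sketch: "Brother" - "Man" + "Woman" should land near "Sister"
# when measured by cosine similarity over pretrained word2vec vectors.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # KeyedVectors, 300-dimensional

# most_similar adds the 'positive' vectors, subtracts the 'negative' ones,
# and returns the vocabulary words closest (by cosine similarity) to the result.
print(wv.most_similar(positive=["woman", "brother"], negative=["man"], topn=3))
# Expected to rank "sister" at or near the top.

# The same machinery can probe semantic relations such as country-capital:
print(wv.most_similar(positive=["Paris", "Germany"], negative=["France"], topn=3))
# Expected to rank "Berlin" at or near the top.
```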
Assessing the quality of a model

Mikolov et al. (2013)[1] developed an approach to assessing the quality of a word2vec model which draws on the semantic and syntactic patterns discussed above. They developed a set of 8,869 semantic relations and 10,675 syntactic relations which they use as a benchmark to test the accuracy of a model. When assessing the quality of a vector model, a user may draw on this accuracy test, which is implemented in word2vec,[19] or develop their own test set which is meaningful to the corpora which make up the model. This approach offers a more challenging test than simply arguing that the words most similar to a given test word are intuitively plausible.[1]

Parameters and model quality

The use of different model parameters and different corpus sizes can greatly affect the quality of a word2vec model. Accuracy can be improved in a number of ways, including the choice of model architecture (CBOW or skip-gram), increasing the training data set, increasing the number of vector dimensions, and increasing the window size of words considered by the algorithm. Each of these improvements comes at the cost of increased computational complexity and therefore increased model generation time.[1]

In models using large corpora and a high number of dimensions, the skip-gram model yields the highest overall accuracy, and consistently produces the highest accuracy on semantic relationships, as well as yielding the highest syntactic accuracy in most cases. However, CBOW is less computationally expensive and yields similar accuracy results.[1] Accuracy increases overall as the number of words used increases, and as the number of dimensions increases. Mikolov et al.[1] report that doubling the amount of training data results in an increase in computational complexity equivalent to doubling the number of vector dimensions.

Altszyler et al. (2017)[20] studied Word2vec performance in two semantic tests for different corpus sizes. They found that Word2vec has a steep learning curve, outperforming another word-embedding technique (LSA) when trained with a medium to large corpus (more than 10 million words). However, with a small training corpus LSA showed better performance. Additionally, they showed that the best parameter setting depends on the task and the training corpus. Nevertheless, for skip-gram models trained on medium-sized corpora, 50 dimensions, a window size of 15, and 10 negative samples seem to be a good parameter setting.

Implementations
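Several open-source implementations expose both training and the analogy accuracy test described above. The sketch below uses the gensim implementation (reference [19]) with the skip-gram setting reported by Altszyler et al.; "corpus.txt" is a hypothetical placeholder for a reasonably large plain-text corpus, and "questions-words.txt" refers to the analogy question set bundled with gensim's test data. On a small corpus, few analogy questions will be answerable and the score will be correspondingly unreliable.

```python
# Illustrative end-to-end sketch, not the reference C implementation:
# train a skip-gram model with the settings mentioned above, then run the
# word-analogy accuracy test on Mikolov et al.'s question set.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.test.utils import datapath

# "corpus.txt" is a hypothetical plain-text file, one pre-tokenized sentence per line.
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences=sentences,
    sg=1,             # skip-gram
    vector_size=50,   # 50 dimensions, as in the medium-corpus setting above
    window=15,        # window size of 15
    negative=10,      # 10 negative samples
    workers=4,
)

# evaluate_word_analogies returns (overall_accuracy, per-section results).
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print(f"analogy accuracy: {score:.3f}")
```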
See also
References

1. Mikolov, Tomas; et al. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781 [cs.CL].
2. "Computing numeric representations of words in a high-dimensional space". US Patent 9037464B1. https://patents.google.com/patent/US9037464B1/en
3. Goldberg, Yoav; Levy, Omer (2014). "word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method". arXiv:1402.3722 [cs.CL].
4. Řehůřek, Radim. "Word2vec and friends" (YouTube video). https://www.youtube.com/watch?v=wTp3P2UnTfQ. Retrieved 2015-08-14.
5. Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). "Distributed representations of words and phrases and their compositionality". Advances in Neural Information Processing Systems. arXiv:1310.4546.
6. "Google Code Archive - Long-term storage for Google Code Project Hosting". code.google.com. https://code.google.com/archive/p/word2vec/. Retrieved 2016-06-13.
7. "Parameter (hs & negative)". Google Groups. https://groups.google.com/forum/#!msg/word2vec-toolkit/WUWad9fL0jU/LdbWy1jQjUIJ. Retrieved 2016-06-13.
8. "Visualizing Data using t-SNE". Journal of Machine Learning Research, 2008, Vol. 9, p. 2595. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf. Retrieved 2017-03-18.
9. Le, Quoc; et al. (2014). "Distributed Representations of Sentences and Documents". arXiv:1405.4053 [cs.CL].
10. "Doc2Vec tutorial using Gensim". https://medium.com/@klintcho/doc2vec-tutorial-using-gensim-ab3ac03d3a1. Retrieved 2015-08-02.
11. "Doc2vec for IMDB sentiment analysis". https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb. Retrieved 2016-02-18.
12. "Doc2Vec and Paragraph Vectors for Classification". http://deeplearning4j.org/doc2vec.html. Retrieved 2016-01-13.
13. Asgari, Ehsaneddin; Mofrad, Mohammad R.K. (2015). "Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics". PLOS ONE 10 (11): e0141287. doi:10.1371/journal.pone.0141287. PMID 26555596. PMC 4640716. arXiv:1503.05140.
14. Ng, Patrick (2017). "dna2vec: Consistent vector representations of variable-length k-mers". arXiv:1701.06279 [q-bio.QM].
15. Banerjee, Imon; Chen, Matthew C.; Lungren, Matthew P.; Rubin, Daniel L. (2018). "Radiology report annotation using intelligent word embeddings: Applied to multi-institutional chest CT cohort". Journal of Biomedical Informatics 77: 11–20. doi:10.1016/j.jbi.2017.11.012. PMID 29175548. PMC 5771955.
16. Levy, Omer; Goldberg, Yoav; Dagan, Ido (2015). "Improving Distributional Similarity with Lessons Learned from Word Embeddings". Transactions of the Association for Computational Linguistics. http://www.aclweb.org/anthology/Q15-1016
17. Arora, S.; et al. (2016). "A Latent Variable Model Approach to PMI-based Word Embeddings". Transactions of the Association for Computational Linguistics 4: 385–399. http://aclweb.org/anthology/Q16-1028
18. Mikolov, Tomas; Yih, Wen-tau; Zweig, Geoffrey (2013). "Linguistic Regularities in Continuous Space Word Representations". HLT-NAACL: 746–751.
19. "Gensim - Deep learning with word2vec". https://radimrehurek.com/gensim/models/word2vec.html. Retrieved 10 June 2016.
20. Altszyler, E.; Ribeiro, S.; Sigman, M.; Fernández Slezak, D. (2017). "The interpretation of dream meaning: Resolving ambiguity using Latent Semantic Analysis in a small corpus of text". Consciousness and Cognition 56: 178–187. doi:10.1016/j.concog.2017.09.004. PMID 28943127. https://www.sciencedirect.com/science/article/pii/S1053810017301034