Word Embeddings for NLP

Introduction
Natural Language Processing (NLP) is a discipline at the intersection of computer science, artificial intelligence, and linguistics. Its central concern is the interaction between people and computers through language, and a fundamental step in that interaction is converting text into forms that machines can work with. Among the most notable milestones in this field are word embeddings: dense vector representations of words that capture semantic meaning, syntactic characteristics, and relations between words, learned from large corpora of raw text. This article looks at word embeddings in detail, including their historical background, construction techniques, applications, and directions for future research.

Historical Context and Motivation

Early Representations of Words
Before the arrival of word embeddings, words were usually represented by sparse vectors, as in one-hot encoding and Term Frequency-Inverse Document Frequency (TF-IDF). These methods had significant limitations:
- One-Hot Encoding: Each word in the vocabulary is represented by a binary vector that is all zeros except at the index of that word. This yields extremely high-dimensional vectors (equal in length to the vocabulary size) and encodes no relationship between words. For example, 'cat' and 'dog' end up exactly as dissimilar from 'animal' as they are from 'car'.
- TF-IDF: This method weights a word by its frequency in a document relative to its frequency across the entire collection, so distinctive terms score highly. Although an improvement over one-hot encoding, TF-IDF vectors are still sparse and high-dimensional and cannot capture semantic relations between words.
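To make the contrast concrete, here is a minimal sketch of both representations; the tiny corpus is a placeholder and the exact values depend on the vectorizer settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]
vocab = sorted({w for doc in corpus for w in doc.split()})

# One-hot encoding: each word becomes a binary vector with a single 1
# at its own index, so every pair of distinct words is equally dissimilar.
def one_hot(word):
    return [1 if w == word else 0 for w in vocab]

print(vocab)
print("cat:", one_hot("cat"))
print("dog:", one_hot("dog"))

# TF-IDF: sparse document vectors that weight each word by how distinctive
# it is for a document relative to the whole corpus.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
print(tfidf.get_feature_names_out())
```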
The Emergence of Word Embeddings
The difficulties caused by sparse representations led researchers to look for denser, more informative word vectors. Word embeddings emerged as an effective solution, providing dense, low-dimensional vectors that preserve semantic relations between words. The underlying idea is the Distributional Hypothesis, advanced by Zellig Harris and J. R. Firth in the 1950s: words that occur in similar contexts tend to have similar meanings. Several approaches to building word embeddings exist, each with advantages and disadvantages depending on the goals of the NLP application. The most notable include:
- Word2Vec
- GloVe
- FastText
- Contextual Pre-trained Embeddings (e.g., ELMo, BERT)
Word2Vec
Word2Vec, created in 2013 by a team led by Tomas Mikolov at Google, is one of the most popular models for producing word embeddings. It comes in two flavors: Continuous Bag of Words (CBOW) and Skip-gram.
- CBOW: This model predicts the target word from the context words that surround it within a window, averaging the context word vectors to make the prediction. It is generally faster to train and performs slightly better on frequent words.
- Skip-gram: This model predicts the surrounding context words given the target word. It works well even with modest amounts of training data and represents rare words better than CBOW. A minimal training sketch covering both variants follows below.
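This is a small sketch using the gensim library; the toy corpus, vector size, and window are illustrative choices rather than recommended settings.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "animals"],
]

# sg=0 selects CBOW; sg=1 selects Skip-gram.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Each trained model maps a word to a dense vector and supports similarity queries.
print(skipgram.wv["cat"][:5])             # first 5 dimensions of the 'cat' vector
print(skipgram.wv.similarity("cat", "dog"))
```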
Both models are shallow neural networks whose training objective maximizes the probability of observing the context words given the target word (Skip-gram) or the target word given its context (CBOW). The resulting embeddings encode semantic regularities such as synonymy and analogy relationships.

GloVe (Global Vectors for Word Representation)
GloVe, developed at Stanford University, is another algorithm for creating word embeddings, with a noticeably different approach. Unlike Word2Vec, which uses local context windows, GloVe exploits global co-occurrence statistics of words across the corpus. It factorizes the word co-occurrence matrix to obtain the word vectors: the training objective minimizes the difference between the dot product of two word vectors and the logarithm of their co-occurrence count. In this way GloVe encodes both local and global semantics.

FastText
Proposed by Facebook's AI Research (FAIR) lab as an extension of Word2Vec, FastText makes use of subword information. Rather than operating purely at the word level, it breaks each word into character n-grams and represents the word as the sum of the vectors of those n-grams. Using constituent character n-grams makes FastText better suited to morphologically rich languages and allows the model to produce embeddings for unseen, out-of-vocabulary words.

Contextual Word Embeddings
Word2Vec, GloVe, and FastText are non-contextual: they assign each word a single fixed vector that does not change with context. Yet the meaning of a word depends on the sentence it appears in. Contextual word embeddings address this by producing a different vector for the same word in different sentences. Notable models include:
- ELMo (Embeddings from Language Models): ELMo, developed at the Allen Institute for AI, produces contextualized word embeddings from a deep bidirectional LSTM trained as a language model, predicting words from both the left and the right context.
- BERT (Bidirectional Encoder Representations from Transformers): Developed by Google, BERT creates contextual embeddings with a transformer-based architecture. It is pre-trained with a masked language model objective: some of the input words are masked and the model learns to predict them from the surrounding context. Unlike earlier left-to-right language models, BERT is deeply bidirectional, which lets it capture richer context and makes it suitable for a wide range of NLP tasks.
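As an illustration, the sketch below uses the Hugging Face transformers library to extract contextual vectors for the word 'bank' in two different sentences; the checkpoint name and the use of the last hidden layer are illustrative choices.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual vector of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # shape: (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v1 = bank_vector("She deposited the money at the bank.")
v2 = bank_vector("They had a picnic on the bank of the river.")

# The same word receives different vectors in different contexts.
cos = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {cos:.3f}")
```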
Mathematical Foundations
Understanding the objectives and mathematical background behind these learning methods helps in applying word embeddings effectively and in recognizing their limitations.

Word2Vec: Skip-gram and CBOW
The Word2Vec models are trained by optimizing neural-network objectives. The key mathematical concepts include:
- Negative Sampling: Because computing the full softmax over the vocabulary is expensive, Word2Vec uses negative sampling, which updates the vectors of only a few "negative" words (words that are unlikely to occur in the given context) alongside the observed context words.

FastText: Subword Information
FastText extends the Word2Vec model by taking subword information into account. The key concepts include:
- Character N-grams: Words are treated as bags of character n-grams, so the model captures morphological information.
- Training: The training process mirrors Word2Vec, but instead of learning word vectors directly, FastText learns vectors for character n-grams, and a word's vector is the sum of its n-gram vectors.
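A minimal sketch with gensim's FastText implementation is shown below; the toy corpus and hyperparameters are illustrative. Note that a vector is produced even for a word never seen during training.

```python
from gensim.models import FastText

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# min_n/max_n control the range of character n-gram lengths.
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

# 'cats' never appears in the corpus, but its character n-grams
# (e.g. "<ca", "cat", "ats") overlap with words that do appear,
# so FastText can still assemble a vector for it.
print(model.wv["cats"][:5])
print(model.wv.similarity("cats", "cat"))
```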
Contextual Embeddings: ELMo and BERT
Contextual embeddings rely on more complex architectures, namely LSTMs and transformers. Key concepts include:
- ELMo: ELMo employs a deep bidirectional LSTM, trained as a language model, to create contextualized embeddings of words.
- Next Sentence Prediction (NSP): This task helps BERT model the relationship between two sentences. During pre-training the model is given pairs of sentences and must decide whether the second sentence actually follows the first in the corpus; a short example follows below.
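As a rough illustration, the snippet below scores a sentence pair with the Hugging Face BertForNextSentencePrediction head; the sentences and model checkpoint are arbitrary examples.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The weather was sunny all afternoon."
sentence_b = "We decided to have a picnic in the park."

encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits

# Index 0 of the logits corresponds to "sentence B follows sentence A".
prob_is_next = torch.softmax(logits, dim=1)[0, 0].item()
print(f"P(B follows A) = {prob_is_next:.3f}")
```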
Applications of Word Embeddings
Word embeddings have advanced NLP by making it possible to process textual data far more effectively. Here are some key applications:
- Text Classification
Word embeddings improve text classification because they supply representations that reflect semantic similarity between words. Used as features for classifiers such as logistic regression, support vector machines, or deep neural networks, they increase accuracy on tasks such as sentiment analysis, topic categorization, and spam detection. A small sketch follows below.
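For instance, averaged word vectors can serve as document features for a scikit-learn classifier, as in this minimal sketch; the embedding model, texts, and labels are placeholders.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Placeholder training data: tokenized texts and binary sentiment labels.
texts = [["great", "movie", "loved", "it"],
         ["terrible", "plot", "boring", "film"],
         ["wonderful", "acting", "great", "story"],
         ["awful", "boring", "waste", "of", "time"]]
labels = [1, 0, 1, 0]

# Train (or load) word embeddings; here a tiny Word2Vec model for illustration.
w2v = Word2Vec(texts, vector_size=50, min_count=1)

def doc_vector(tokens):
    """Average the vectors of known tokens to get one fixed-size document feature."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([doc_vector(t) for t in texts])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([doc_vector(["great", "story"])]))
```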
- Named Entity Recognition (NER)
NER aims to locate and categorize entities (for instance, person names, organizations, or locations) in text. Word embeddings supply context-aware features that improve the accuracy with which such entities are recognized; a sketch with a pre-trained model follows below.
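As one hedged example, a pre-trained transformer NER model can be applied through the Hugging Face pipeline API; the checkpoint selected by the default pipeline is an assumption and may vary between library versions.

```python
from transformers import pipeline

# Load a pre-trained token-classification (NER) pipeline.
ner = pipeline("ner", aggregation_strategy="simple")

text = "Tomas Mikolov worked on Word2Vec at Google in 2013."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```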
- Machine Translation
Word embeddings support machine translation because continuous word vectors capture nuanced relationships between words and their meanings. Approaches such as sequence-to-sequence models with attention build on these embeddings to translate sentences from one language to another more accurately.
- Question Answering
Word embeddings are very useful in question-answering systems for relating a question to candidate answers. Contextual embeddings such as those produced by BERT are particularly well suited to matching questions with appropriate answer spans; see the sketch below.
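For illustration, the Hugging Face question-answering pipeline applies a transformer model fine-tuned for extractive QA; the default checkpoint and the example passage are assumptions.

```python
from transformers import pipeline

qa = pipeline("question-answering")

context = ("Word2Vec was created in 2013 by a team led by Tomas Mikolov at Google. "
           "GloVe was later developed at Stanford University.")
result = qa(question="Who led the team that created Word2Vec?", context=context)

print(result["answer"], round(result["score"], 3))
```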
- Semantic Search and Information Retrieval
Word embeddings improve semantic search because documents can be retrieved according to the meaning of the words rather than exact keyword matches. Embeddings let a system surface documents whose content is similar to the query even when the specific search terms differ; a simple ranking sketch follows below.
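This is a very small sketch of embedding-based retrieval, assuming a pre-trained gensim KeyedVectors model is available under the placeholder name wv; the documents and query are toy examples.

```python
import numpy as np

def embed(tokens, wv):
    """Average word vectors to obtain a single query/document vector."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def search(query_tokens, documents, wv):
    """Rank tokenized documents by cosine similarity to the query embedding."""
    q = embed(query_tokens, wv)
    scored = [(cosine(q, embed(doc, wv)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Example usage (wv would be, e.g., a model loaded via gensim.downloader):
# results = search(["cheap", "flights"], [["affordable", "airfare"], ["dog", "food"]], wv)
```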
- Text Generation
Word embeddings are also used in text-generation tasks such as chatbots and automatic text completion to produce contextually appropriate responses. Embeddings help models such as GPT-3 generate text that resembles human writing in response to a given prompt.

Evaluating Word Embeddings
Assessing the quality of word embeddings is essential to ensure that they encode useful semantic information. Common evaluation methods include:
- Intrinsic Evaluation
Intrinsic evaluation measures embedding quality on standalone tasks that do not depend on a downstream application. These tasks test how well an embedding captures semantic and syntactic regularities.
- Word Similarity: This task measures how well the embeddings preserve word similarity by comparing the cosine similarity of word pairs against gold-standard human similarity scores.
- Word Analogy: This task tests how well the embeddings solve analogy questions through vector arithmetic (for instance, "king" is to "queen" as "man" is to "woman"); a short sketch follows below.
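As a brief illustration, both intrinsic checks can be run with gensim on pre-trained vectors; the checkpoint used here is one of several available through gensim's downloader, and it is fetched on first use.

```python
import gensim.downloader as api

# Downloads pre-trained GloVe vectors on first use (a sizeable download).
wv = api.load("glove-wiki-gigaword-100")

# Word similarity: cosine similarity between word vectors.
print(wv.similarity("cat", "dog"))

# Word analogy: king - man + woman should land near "queen".
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```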
- Extrinsic Evaluation
Extrinsic evaluation measures the effect of the embeddings on downstream NLP tasks. The embeddings are used as input features for tasks such as text classification, NER, or machine translation, and the resulting performance is compared across embedding choices.
Challenges and Limitations
While word embeddings have significantly advanced NLP, they come with challenges and limitations:
- Bias and Fairness
Word embeddings can preserve and even amplify biases present in their training data. For example, embeddings may associate particular professions with specific genders or ethnicities, leading to prejudiced outputs in applications such as hiring algorithms or sentiment analysis.
- Out-of-Vocabulary Words
Classic word embeddings cannot produce a vector for a word that was never seen during training, the out-of-vocabulary (OOV) problem. Methods such as FastText partially address this by incorporating subword information.
- Contextual Variability
Static word vectors also fail to express how a word's meaning varies with context. For instance, 'bank' can refer to an institution that handles deposits and money or to the edge of a river. Contextual embeddings such as those from BERT solve this problem, but at the cost of much higher computational requirements.
- Computational Resources
Training high-quality word embeddings, and contextual embeddings in particular, requires substantial time, computational power, and text data. This can be a barrier for researchers and practitioners without access to high-end hardware.

Future Directions
The field of word embeddings and NLP continues to evolve rapidly, with several promising directions for future research and development:
- Improved Contextual Embeddings
Future studies will likely aim to make contextual embeddings more efficient while capturing the contextual nuances of language use even more fully. This includes exploring new architectures, new training objectives, and better fine-tuning techniques.
- Cross-lingual and Multilingual Embeddings
As NLP applications are developed and deployed in more than one language, multilingual embeddings become necessary. Cross-lingual and multilingual embeddings map words from different languages into a single shared vector space, which is far more convenient for tasks such as translation than working with separate, language-specific spaces.
- Addressing Bias and Fairness
Ongoing work also seeks to overcome the bias problems inherent in word embeddings. This involves developing ways of detecting, measuring, and mitigating bias, thereby promoting fairness in NLP applications.
- Interpretability and Explainability
As NLP models grow more complex, the need for interpretability and explainability increases. Understanding how embeddings encode semantic information, and how that affects a model's predictions, is vital for building trustworthy and explainable AI systems.
- Efficient Training and Deployment
Making the pre-training and distribution of word embeddings more efficient would go a long way toward making NLP usable at scale. This includes developing techniques for reducing computation and improving the efficiency of embedding models.
Conclusion
Word embeddings are arguably the most important development in NLP of the last decade, enabling highly informative vector representations of words and their relationships. Methods such as Word2Vec, GloVe, FastText, and contextual embeddings have driven improvements across NLP tasks, including text categorization, machine translation, and more. However, issues such as bias, contextual variability, and heavy computational demands persist. Future work aims to resolve these issues and continue the evolution of word embeddings toward more capable and fairer NLP.