Transformers can be used for many learning tasks, and the only difference comes from the way we prepare the data

"Transformers can be used for many learning tasks, and the only difference comes from the way we prepare the data, the modeling head we choose, and the loss function we use to optimize the model.

With Causal Language Modeling, the model learns the statistics of a language by focusing on predicting the next word in a sequence. This is the most common way to perform language modeling nowadays, and it has been the approach taken since GPT-1. Causality is ensured by applying a mask to the attention matrices computed within the attention layers: to prevent a position from attending to words later in the sequence, the attention weights for those words are set to 0 (in practice, the corresponding scores are masked out before the softmax). To train this model, we create the labels by shifting the inputs, removing the first word, so that each position is trained to predict the word that follows it.
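As a minimal sketch (assuming PyTorch, with toy tensors standing in for a real model; all shapes and values are illustrative), the causal mask and the label shifting could look like this:

```python
import torch
import torch.nn.functional as F

# Toy batch of token ids: (batch, seq_len). Values are arbitrary.
input_ids = torch.tensor([[5, 8, 2, 9, 3]])
seq_len = input_ids.size(1)

# Causal mask: position i may only attend to positions <= i.
# Scores for later positions are set to -inf before the softmax,
# which drives their attention weights to 0.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = torch.randn(seq_len, seq_len)              # stand-in attention scores
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn_weights = F.softmax(scores, dim=-1)            # zeros above the diagonal

# Label shifting: position t is trained to predict token t+1,
# so the labels are the inputs with the first word removed.
labels = input_ids[:, 1:]
predicting_positions = input_ids[:, :-1]
```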

For text classification, we want to associate the input text with some category. For example, in the context of sentiment analysis, we may want to classify the input sentence into one of three categories: [POSITIVE], [NEGATIVE], and [NEUTRAL]. In text classification we only need one prediction vector, and the typical strategy is to choose one of the hidden states and project it into the prediction space. This works because, although there are as many hidden states as there are input tokens, after passing through multiple transformer blocks each of them encodes an entangled representation of the entire sentence. To train the model, we compare the prediction vectors to the categorical labels using a loss function such as cross-entropy.
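A minimal sketch of such a classification head, again assuming PyTorch with random tensors in place of the transformer's hidden states (the shapes, label values, and the choice of pooling the first token are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: batch of 2 sentences, 10 tokens, hidden size 768.
hidden_states = torch.randn(2, 10, 768)   # output of the transformer blocks
labels = torch.tensor([0, 2])             # e.g. 0=POSITIVE, 1=NEGATIVE, 2=NEUTRAL

# Choose one hidden state per sentence (here the first token) and
# project it into the 3-class prediction space.
classifier = nn.Linear(768, 3)
pooled = hidden_states[:, 0, :]           # (batch, hidden)
logits = classifier(pooled)               # (batch, num_classes)

# Compare the prediction vectors to the categorical labels.
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```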

The token classification learning task is often used for applications such as Named Entity Recognition (NER). Here we want to categorize each of the tokens in the input sentence; for example, we may want to associate each word with its grammatical category: [NOUN], [VERB], or [ADJECTIVE]. For each token in the sequence, we need a prediction vector whose size is the number of categories we want to predict. At training time, we compare that prediction matrix for all the tokens to their categories in the labels with a cross-entropy loss function and update the model weights.
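A minimal sketch along the same lines, again assuming PyTorch with random tensors in place of a real model and three illustrative categories:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 2 sentences, 10 tokens each, hidden size 768,
# 3 categories (e.g. 0=NOUN, 1=VERB, 2=ADJECTIVE).
hidden_states = torch.randn(2, 10, 768)   # one hidden state per token
labels = torch.randint(0, 3, (2, 10))     # one category label per token

# One prediction vector of size num_categories for each token.
token_classifier = nn.Linear(768, 3)
logits = token_classifier(hidden_states)  # (batch, seq_len, num_categories)

# Cross-entropy expects (N, C), so flatten the batch and sequence
# dimensions so that every token contributes to the loss.
loss = nn.functional.cross_entropy(
    logits.view(-1, 3),                   # (batch * seq_len, num_categories)
    labels.view(-1),                      # (batch * seq_len,)
)
loss.backward()
```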

Author

ABN ASIA was founded by people with deep roots in academia, with work experience in the US, Holland, Hungary, Japan, South Korea, Singapore, and Vietnam. ABN Asia is where academia and technology meet opportunity. With our cutting-edge solutions and competent software development services, we're helping businesses level up and take on the global scene. Our commitment: Faster. Better. More reliable. In most cases: Cheaper as well.

Feel free to reach out to us whenever you require IT services, digital consulting, off-the-shelf software solutions, or if you'd like to send us requests for proposals (RFPs). You can contact us at contact@abnasia.org. We're ready to assist you with all your technology needs.

ABNAsia.org

© ABN ASIA