Using Machine Learning Models to Analyze and Leverage Text Data
Every organization relies on text to describe, promote, and improve its services or products. Natural Language Processing (NLP), a subfield of artificial intelligence and computer science, focuses on extracting meaning and information from text by applying machine learning algorithms.
With machine learning algorithms and techniques, organizations can solve common text data problems such as segmenting users into categories, detecting the intent of a text, and accurately classifying user reviews and feedback. Once text data has been analyzed with deep learning models, appropriate responses can be generated.
Apply the following techniques to understand text data and solve common text problems for your services and products.
1. Organize your Data
IT teams have to deal with a large volume of data daily. The first step in leveraging this text and solving the problems related to it is to gather and organize the data based on its relevance.
For instance, let’s use a dataset with the keyword "Fight." In organizing datasets such as tweets or social media posts with this keyword, we will need to categorize them based on their contextual relevance. The goal here is to report cases of physical assault to local authorities.
Therefore, the data needs to be differentiated based on the context of the word. Does the word in context suggest an organized sport such as a boxing match, or does its contextual meaning imply an argument or a quarrel that does not involve physical assault? The word may suggest a brawl or physical struggle, which is our target text, or it may indicate a struggle to overcome a social ill; for example, “a fight for justice.”
This creates a need for labels to identify the texts that are relevant (that suggest a physical fight or brawl) and the irrelevant texts (every other context for the keyword). Labeling data and training a deep learning model, therefore, produces faster and simpler results in solving problems with textual data.
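To make this concrete, here is a minimal labeling sketch in Python. The posts and labels below are hypothetical and stand in for a real annotated dataset.

```python
# A minimal labeling sketch (hypothetical data): posts containing the keyword
# "fight" are tagged as relevant (physical fight) or irrelevant (any other sense).
posts = [
    "Huge fight broke out outside the stadium last night",  # physical -> relevant
    "Excited for the big title fight on pay-per-view",       # boxing match -> irrelevant
    "We will never give up the fight for justice",           # figurative -> irrelevant
]

# Labels assigned by a human annotator: 1 = relevant (physical assault), 0 = irrelevant.
labels = [1, 0, 0]

# The labeled pairs become the training set for a supervised classifier.
training_data = list(zip(posts, labels))
print(training_data[0])
```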
2. Clean your Data
After gathering your data, it needs to be cleaned for effective and seamless model training. The reason is simple: clean data is easier for a deep learning model to process and analyze. Here are some ways to clean your data (a combined cleaning sketch follows the list):
- Get rid of non-alphanumeric characters: Although non-alphanumerics such as symbols (currency signs, punctuation marks) may hold significant information, they can make data difficult for many models to analyze. One of the best ways to address this is to eliminate them or restrict them to text-dependent usages, such as the hyphen in the word “full-time.”
- Use Tokenization: Tokenization involves breaking a sequence of strings into several pieces called tokens. The tokens selected could be sentences (sentence tokenization) or words (word tokenization). In sentence tokenization (also known as sentence segmentation), a string of text is broken into its component sentences, while word tokenization breaks down a text into its component words.
- Use Lemmatization: Lemmatization is an effective way of cleaning data that uses vocabulary and morphological analysis to reduce related words to their common grammatical base form, known as the lemma. For instance, lemmatization removes inflections to return a word to its base or dictionary form.
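Here is a combined cleaning sketch covering the three steps above. It assumes the NLTK library (any comparable toolkit, such as spaCy, would work equally well), and the example sentence is hypothetical.

```python
# A cleaning sketch using NLTK (an assumption; spaCy or similar libraries work too).
# Requires: pip install nltk, plus the 'punkt' and 'wordnet' resources
# (newer NLTK versions may also need 'punkt_tab').
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

text = "The fighters were fighting outside; witnesses called 911!"

# 1. Strip non-alphanumeric characters (hyphens inside words are kept).
cleaned = re.sub(r"[^A-Za-z0-9\s-]", "", text)

# 2. Word tokenization: split the string into individual word tokens.
tokens = word_tokenize(cleaned.lower())

# 3. Lemmatization: reduce each token to its dictionary form (lemma).
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token, pos="v") for token in tokens]

print(lemmas)  # e.g. ['the', 'fighters', 'be', 'fight', 'outside', 'witness', 'call', '911']
```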
3. Use Accurate Data Representation
Algorithms cannot analyze data in raw text form, so the data has to be represented as lists of numbers that algorithms can process. This is called vectorization.
A natural way to do this might be to encode each character as a number so that the classifier learns the structure of each word in a dataset; however, this is rarely practical. A more effective method of representing data to our systems or to a classifier is to associate a unique number with each word. Consequently, each sentence is represented by a list of numbers.
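A minimal sketch of this word-to-index vectorization, in plain Python with hypothetical sentences, might look like this:

```python
# Each unique word gets an integer id, and each sentence becomes a list of those ids.
sentences = [
    "a fight broke out",
    "the fight for justice",
]

# Build the vocabulary: one unique id per word.
vocabulary = {}
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Represent each sentence as a list of word ids.
encoded = [[vocabulary[word] for word in sentence.split()] for sentence in sentences]

print(vocabulary)  # {'a': 0, 'fight': 1, 'broke': 2, 'out': 3, 'the': 4, 'for': 5, 'justice': 6}
print(encoded)     # [[0, 1, 2, 3], [4, 1, 5, 6]]
```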
In a representational model called a Bag of Words (BOW), only the frequency of known words is considered, not the order or sequence of the words in the text. All you need to do is decide on an effective way to design the vocabulary of tokens (known words) and how to score their presence in the text.
The BOW method is based on an assumption that the more frequently a word appears in a text, the more strongly it represents its meaning.
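Here is a Bag of Words sketch using scikit-learn's CountVectorizer (an assumption; the same idea can be implemented with a plain Python dictionary of counts). The corpus is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "a fight broke out at the park",
    "the fight for justice continues",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # sparse matrix of word counts

print(vectorizer.get_feature_names_out())  # the vocabulary of known words
print(bow.toarray())  # word frequencies per document; word order is discarded
```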
4. Classify your Data
Unstructured texts are ubiquitous; they are in emails, chats, messages, survey responses, etc. Extracting relevant information from unstructured text data can be a daunting task, and one way to combat this is through text classification.
Text classification (also called text categorization or text tagging) structures a text by assigning tags or categories to its components according to their content. For example, product reviews can be categorized by intent, articles can be categorized by topic, and conversations in a chatbot can be classified by urgency. Text classification also underpins spam detection and sentiment analysis.
Text classification can be done in two ways: manually or automatically. In manual text classification, a human annotates the text, interprets it and categorizes it accordingly. Of course, this method is time-consuming. The automatic method uses machine learning models and techniques to classify a text according to certain criteria.
Using the BOW model, a text classifier can detect patterns and sentiment in a text based on the frequency of a set of words.
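A short sketch of this idea, using a Bag of Words representation feeding a Naive Bayes classifier via scikit-learn (an assumption; the tiny labeled dataset below is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "a violent fight broke out at the bar",        # relevant: physical fight
    "two men were brawling in the street",         # relevant
    "the title fight is scheduled for saturday",   # irrelevant: boxing match
    "the fight for justice continues peacefully",  # irrelevant: figurative
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

# Bag of Words features + Naive Bayes classifier in one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["a violent brawl broke out at the stadium"]))  # -> ['relevant'] on this toy data
```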
5. Inspect your Data
After you have processed and interpreted your data using machine learning models, it is important to inspect the results for errors. An effective way of visualizing predictions for inspection is a confusion matrix, so named because it shows whether the system is confusing two labels, for example, the relevant and irrelevant classes.
A confusion matrix, also called an error matrix, allows you to visualize the output performance of an algorithm. It presents the data in a table layout, where each row of the matrix represents instances of a predicted class and each column represents instances of an actual class.
In our example, we trained the classifier to distinguish between physical fights and non-physical fights (such as a non-violent civil rights movement). Assuming a sample of 22 events, 12 physical fights and 10 non-physical fights, a confusion matrix would represent the results in a table layout as below:
|                         | Actual: Physical | Actual: Non-physical |
|-------------------------|------------------|----------------------|
| Predicted: Physical     | 5 (TP)           | 3 (FP)               |
| Predicted: Non-physical | 7 (FN)           | 7 (TN)               |
In this confusion matrix, of the 12 actual physical fights, the algorithm predicted that 7 were non-violent fights or protests. Furthermore, of the 10 actual protests, the system predicted that 3 were physical fights. The correct predictions, the 5 true positives (TP) and 7 true negatives (TN), sit on the diagonal of the table. The other results are the 7 false negatives (FN) and 3 false positives (FP).
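The same matrix can be computed with scikit-learn (an assumption; the label vectors below are hypothetical and simply reproduce the counts above, with 1 = physical fight and 0 = non-physical event):

```python
from sklearn.metrics import confusion_matrix

# 12 actual physical fights followed by 10 actual non-physical events.
y_true = [1] * 12 + [0] * 10
# The classifier caught 5 of the physical fights and 7 of the non-physical events.
y_pred = [1] * 5 + [0] * 7 + [1] * 3 + [0] * 7

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # 5 3 7 7
```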
So, in interpreting and validating the predictions of this model, we must choose the most appropriate words as classification features. Suitable words for identifying non-physical fights in a text include protest, march, non-violent, peaceful, and demonstration.
Upon properly analyzing the written data, systems can then effectively produce appropriate responses.
Leveraging Text Data to Generate Responses: A Case for Chatbots
After cleaning up, analyzing, and interpreting text data, the next step is returning an appropriate response. This is the science employed by chatbots.
Response models used in chatbots are typically of two types: retrieval-based models and generative models. Retrieval-based models use a set of predetermined responses, one of which is automatically retrieved based on the input, using a form of heuristic to select the most appropriate response. Generative models, on the other hand, do not use predefined responses; instead, new responses are generated using machine translation algorithms.
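A minimal retrieval-based sketch, using a common heuristic (TF-IDF vectors and cosine similarity via scikit-learn, an assumption; the patterns and canned responses are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Predefined input patterns and their canned responses.
patterns = [
    "what are your opening hours",
    "how do I reset my password",
    "where is my order",
]
responses = [
    "We are open from 9am to 5pm, Monday to Friday.",
    "You can reset your password from the account settings page.",
    "You can track your order from the 'My Orders' section.",
]

vectorizer = TfidfVectorizer()
pattern_vectors = vectorizer.fit_transform(patterns)

def retrieve_response(user_input: str) -> str:
    # Score the input against every known pattern and return the closest response.
    input_vector = vectorizer.transform([user_input])
    scores = cosine_similarity(input_vector, pattern_vectors)[0]
    return responses[scores.argmax()]

print(retrieve_response("when are you open"))  # -> the opening hours response
```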
Both methods have their pros and cons and have valid use-cases. First, being predefined and pre-written, retrieval-based methods do not make grammatical errors; however, if there has been no pre-registered output for an unseen input (such as a name), these methods may not produce ideal responses.
Generative methods are more advanced and “smarter” as responses are generated on-the-go and based on the context of the input. However, since they require intense training and responses are not pre-written, they may make grammatical errors.
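For contrast, a generative response sketch using the Hugging Face transformers library and a small pretrained model (an assumption; any sequence-generation model could stand in here, and the prompt is hypothetical):

```python
from transformers import pipeline

# Load a small pretrained language model for text generation.
generator = pipeline("text-generation", model="gpt2")

prompt = "User: I just moved to a new city and I feel a bit lost.\nBot:"
reply = generator(prompt, max_new_tokens=30, num_return_sequences=1)

# The reply is generated on the fly from the input context rather than retrieved
# from a fixed list, so it may be fluent but is not guaranteed to be grammatical
# or appropriate.
print(reply[0]["generated_text"])
```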
For both methods of response generation, conversation length can present challenges. The longer the input or the conversation, the more difficult it is to automate the responses. In open domains, the conversation is unrestricted and the input can take any turn; therefore, an open-domain system cannot realistically be built on a retrieval-based chatbot. However, in a closed domain, where there is a limit on inputs and outputs (you can ask only a limited set of questions), retrieval-based bots work best.
Generative-based chat systems can handle closed domains but may require a smart machine to handle longer conversations in an open domain.
The challenges that come with long or open-ended conversations include the following:
- Incorporating linguistic and physical context: In long conversations, people keep track of what has been said, and this may be difficult for the system to process if such information is reused later in the conversation. It therefore requires incorporating context into each response generated, which can be challenging.
- Maintaining Semantic Coherence: While many systems are trained to generate a response to a particular question or input, they may not be able to produce a similar or consistent response if the input is rephrased. For example, you want the same answer to “what do you do?” and “what’s your occupation?”. It may be difficult to train generative systems to do this.
- Detecting Intent: To ensure a response is relevant to the input and its context, the system needs to understand the intent of the user, and this is difficult. As a result, many systems produce a generic response where it is not appropriate. For example, “that’s great!” may be an inappropriate generic response to an input such as “I live alone, outside the yard”.
For these reasons, retrieval-based methods are still easier to use for chatbots and other conversational platforms.
Conclusion
Leveraging text data for conversational systems requires deep learning models to effectively interpret the written data and generate a response to it. The pre-processing techniques of organizing, cleaning, representing, classifying, and inspecting the data not only improve the performance of these models at analyzing the data, but also improve the accuracy of the output or response. Interested in learning more about how AI can help transform your enterprise? Contact our experts.