A workflow for obtaining the sentiment of speech in an audio file.

Notes

Workflow: ASR -> Sentiment Analysis

Introduction

This workflow takes audio as input, runs an Automatic Speech Recognition (ASR) model on it, and passes the resulting text to a text sentiment model.

Because the input audio is fed into Facebook's Wav2Vec 2.0 ASR model, which was trained on audio sampled at 16 kHz, make sure that your speech audio files are also sampled at 16 kHz.
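
One way to verify the sample rate before submitting a file is to read it from the file header. The sketch below assumes the input is an uncompressed WAV file and uses only Python's standard `wave` module. The helper names `wav_sample_rate` and `is_16khz_wav` are illustrative and are not part of the workflow itself.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # rate the ASR model was trained on, per the note above


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz_wav(path: str) -> bool:
    """Hypothetical helper: True if the WAV file is sampled at 16 kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE_HZ
```

If the check fails, resample the file to 16 kHz with an audio tool of your choice before running the workflow.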

Pro Tip

Note that the sentiment analysis model is suitable for English only. If you need to run this workflow on audio in other languages, you can create a derived custom workflow and insert a text translation model between the ASR model and the sentiment analysis model. This ensures that the sentiment analysis model receives its input text in English.

Signal Flow

  1. Audio input is fed into Facebook's Wav2Vec 2.0 ASR model.
  2. The output of Wav2Vec 2.0 is a block of text, which is then fed as input to the RoBERTa text sentiment analysis model.
  3. The output of the text sentiment analysis model is one of three labels denoting the sentiment of the text:
    • 0 -> Negative
    • 1 -> Neutral
    • 2 -> Positive
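
The signal flow above can be sketched as a simple function chain. This is a minimal illustration, not the workflow's actual implementation: the `transcribe`, `classify`, and `translate` callables are hypothetical stand-ins for the ASR, sentiment, and (optional) translation models, and the fake stages exist only so the example runs without any model.

```python
from typing import Callable, Optional

# Label IDs as listed in the signal flow above.
SENTIMENT_LABELS = {0: "Negative", 1: "Neutral", 2: "Positive"}


def run_asr_sentiment(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    classify: Callable[[str], int],
    translate: Optional[Callable[[str], str]] = None,
) -> dict:
    """Chain ASR -> (optional translation) -> sentiment classification."""
    text = transcribe(audio)
    if translate is not None:
        # Optional stage for non-English audio, as described in the Pro Tip.
        text = translate(text)
    label_id = classify(text)
    return {
        "text": text,
        "label_id": label_id,
        "sentiment": SENTIMENT_LABELS[label_id],
    }


# Stand-in stages (assumptions for illustration only).
def fake_transcribe(audio: bytes) -> str:
    return "i love this product"


def fake_classify(text: str) -> int:
    return 2 if "love" in text else 1


result = run_asr_sentiment(b"\x00\x01", fake_transcribe, fake_classify)
print(result["sentiment"])  # Positive
```

Passing the stages in as callables keeps the chain independent of any particular model API, so a translation step can be added without changing the other two stages.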

Limitations

The social media text sentiment model is trained on Twitter text, so it may fail to identify the sentiment of other kinds of text, such as literary text.

Facebook Wav2Vec 2.0

Introduction

Facebook Automatic Speech Recognition provides models and workflows built by Facebook (now known as Meta) that you can use in your apps to carry out automatic speech recognition (ASR).

You can use this model to easily and quickly convert English audio to English text (speech-to-text transcription). Simply upload an English audio file from your local computer or add a publicly accessible audio URL, and the model will output the transcribed text.

Facebook ASR models will help you to effectively transcribe audio content into written words without having to type them manually. These models are also valuable to persons with disabilities who cannot use a keyboard.

Wav2Vec2-Large-Robust finetuned on Switchboard

This model is a fine-tuned version of the wav2vec2-large-robust model. The base model was pretrained on:

  • Libri-Light: Open-source audio books from the LibriVox project – clean read-out audio data.
  • CommonVoice: Crowd-source collected audio data – read-out text snippets.
  • Switchboard: Telephone speech corpus – noisy telephone data.
  • Fisher: Conversational telephone speech – noisy telephone data.

The model was subsequently fine-tuned on 300 hours of Switchboard, a telephone speech corpus of noisy telephone data.

The audio data used to pretrain and fine-tune this model was sampled at 16 kHz.

More Info

Paper

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

Abstract

Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training data differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training.

Twitter-roBERTa-base for Sentiment Analysis

This is a roBERTa-base model trained on ~58M tweets and finetuned for sentiment analysis with the TweetEval benchmark. This model is suitable for English.

The output of the text sentiment analysis model is one of three labels denoting the sentiment of the text:

  • 0 -> Negative
  • 1 -> Neutral
  • 2 -> Positive
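
Classifiers of this kind typically produce one score per class and report the class with the highest score. The sketch below assumes the model yields three raw scores (logits) ordered Negative, Neutral, Positive; that ordering follows the label IDs above, while the score values and function names are made up for illustration.

```python
import math

LABELS = ["Negative", "Neutral", "Positive"]  # index matches label ID 0, 1, 2


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(scores):
    """Return (label ID, label name, probability) for the top-scoring class."""
    probs = softmax(scores)
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return label_id, LABELS[label_id], probs[label_id]


label_id, name, prob = predict_label([-1.2, 0.3, 2.1])  # illustrative scores
print(label_id, name)  # 2 Positive
```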

Pro Tip

Note that the sentiment analysis model is suitable for English text only. If you need to run this workflow on text in other languages, you can create a derived custom workflow and insert a text translation model before the sentiment analysis model. This ensures that the sentiment analysis model receives its input text in English.

More Info

Papers

Original

TWEETEVAL: Unified Benchmark and Comparative Evaluation for Tweet Classification

Authors: Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, Luis Espinosa-Anke

Abstract

The experimental landscape in natural language processing for social media is too fragmented. Each year, new shared tasks and datasets are proposed, ranging from classics like sentiment analysis to irony detection or emoji prediction. Therefore, it is unclear what the current state of the art is, as there is no standardized evaluation protocol, neither a strong set of baselines trained on such domain-specific data. In this paper, we propose a new evaluation framework (TWEETEVAL) consisting of seven heterogeneous Twitter-specific classification tasks. We also provide a strong set of baselines as starting point, and compare different language modeling pre-training strategies. Our initial experiments show the effectiveness of starting off with existing pretrained generic language models, and continue training them on Twitter corpora.

Latest

TimeLMs: Diachronic Language Models from Twitter

Authors: Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, Jose Camacho-Collados

Abstract

Despite its importance, the time variable has been largely neglected in the NLP and language model literature. In this paper, we present TimeLMs, a set of language models specialized on diachronic Twitter data. We show that a continual learning strategy contributes to enhancing Twitter-based language models’ capacity to deal with future and out-of-distribution tweets, while making them competitive with standardized and more monolithic benchmarks. We also perform a number of qualitative analyses showing how they cope with trends and peaks in activity involving specific named entities or concept drift.

  • Workflow ID
    asr-sentiment
  • Description
    A workflow for obtaining the sentiment of speech in an audio file.
  • Last Updated
    Aug 01, 2022
  • Privacy
    PUBLIC