Real Time Speech Recognition
Tags: ASR, SPEECH, STREAMING
Introduction
Automatic speech recognition (ASR), the conversion of spoken language to text, is an important and thriving area of machine learning. ASR algorithms run on practically every smartphone and are becoming increasingly embedded in professional workflows, such as digital assistants for nurses and doctors. Because ASR algorithms are designed to be used directly by customers and end users, it is important to validate that they behave as expected when confronted with a wide variety of speech patterns (different accents, pitches, and background audio conditions).
Using gradio, you can easily build a demo of your ASR model and share that with a testing team, or test it yourself by speaking through the microphone on your device.
This tutorial will show how to take a pretrained speech-to-text model and deploy it with a Gradio interface. We will start with a full-context model, in which the user speaks the entire audio before the prediction runs. Then we will adapt the demo to make it streaming, meaning that the model will transcribe the audio as you speak. The streaming demo that we create will look something like this (try it!):
Real-time ASR is inherently stateful, meaning that the model's predictions change depending on what words the user previously spoke. So, in this tutorial, we will also cover how to use state with Gradio demos.
Prerequisites
Make sure you have the gradio Python package already installed. You will also need a pretrained speech recognition model. In this tutorial, we will build demos from 2 ASR libraries:
- Transformers (for this, pip install transformers and pip install torch)
- DeepSpeech (pip install deepspeech==0.8.2)
Make sure you have at least one of these installed so that you can follow along.
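If you want to confirm that your environment is ready before continuing (assuming you chose the Transformers route), a quick check is to import the packages and print their versions:

import gradio, torch, transformers
print(gradio.__version__, torch.__version__, transformers.__version__)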
Step 1 — Setting up the Transformers ASR Model
First, you will need an ASR model: either one that you have trained yourself, or a pretrained model that you download. In this tutorial, we will start by using a pretrained ASR model from the Hugging Face Model Hub, Wav2Vec2.
Here is the code to load Wav2Vec2 from Hugging Face transformers.
from transformers import pipeline
p = pipeline("automatic-speech-recognition")
That's it! By default, the automatic speech recognition pipeline loads Facebook's facebook/wav2vec2-base-960h model.
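Before wiring the pipeline into a demo, you can sanity-check it directly on an audio file. The filename below is just a placeholder for any short speech clip you have locally:

# the pipeline accepts a filepath and returns a dictionary with the transcription under "text"
result = p("sample.wav")
print(result["text"])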
Step 2 — Creating a Full-Context ASR Demo
We will start by creating a full-context ASR demo, in which the user speaks the full audio before using the ASR model to run inference. This is very easy with Gradio -- we simply create a function around the pipeline object above.

We will use gradio's built-in Audio component, configured to take input from the user's microphone and return a filepath for the recorded audio. The output component will be a plain Textbox.
import gradio as gr

def transcribe(audio):
    # "audio" is a filepath to the recorded clip; the pipeline returns a dict with a "text" key
    text = p(audio)["text"]
    return text

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text").launch()
Let's break this down. The transcribe function takes a single parameter, audio, which is a filepath to the audio file that the user recorded through the Gradio GUI. The pipeline object expects a filepath and converts the speech in that file to text, which the function returns and which is then rendered in the output Textbox.

Because this is a full-context demo, inference only runs once the user has stopped recording, and there is no state to keep track of yet.
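The introduction mentioned sharing the demo with a testing team. As a small aside (a sketch using standard Gradio options, not something specific to ASR), passing share=True to launch() serves the same interface through a temporary public link that you can send to testers:

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text").launch(share=True)  # generates a temporary public URL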
Step 3 — Creating a Streaming ASR Demo
Now let's adapt the demo so that the model transcribes speech as the user speaks. Real-time ASR is stateful: each chunk of audio streamed from the microphone contains only a fraction of what the user has said, so the function needs to remember what it has already transcribed.
In this case, our function takes in two values: an audio input and a state input. The corresponding input components in gradio are a streaming Audio component and "state". To create a stateful Gradio demo, we must pass in a parameter to represent the state, and we set the default value of this parameter to be the initial value of the state (in this case, the empty string, since nothing has been transcribed when the demo starts).
The function also returns two values: the transcription so far, which we will display using a "textbox" output component, and the updated state, for which we use the "state" output component type. In stateful Gradio demos, we must return the updated state at the end of the function.
Note that the "state" input and output components are not displayed. We also set live=True so that the interface re-runs the function automatically as new audio arrives from the microphone.
gr.Interface(fn=transcribe,
             inputs=[gr.Audio(source="microphone", type="filepath", streaming=True), "state"],
             outputs=["textbox", "state"],
             live=True).launch()
This produces the following interface, which you can try right here in your browser (allow microphone access, start speaking, and watch the transcription build up as you talk):
And you're done! That's all the code you need to build a web-based demo for your ASR model. Here are some references that you may find useful:
- Gradio's "Getting Started" guide
- The final streaming ASR demo and complete code (on Hugging Face Spaces)