

Baseline: Whisper, AlexaTM, Sparrow, Make-A-Video


November 1, 2022

Welcome to the November 2022 edition of Baseline, Accenture Federal Services’ machine learning newsletter. In Baseline, we share insights on important advances in machine learning technologies likely to impact our federal customers. This month we cover the following topics:

  • Whisper: A multilingual automatic speech recognition system that approaches human-level accuracy in English
  • AlexaTM: A more efficient large language model
  • Make-A-Video: A generative video system from Meta AI
  • Sparrow: A safer and more inclusive chatbot

Click here to subscribe to email updates: Receive Baseline every month in your inbox and stay up to date on the latest advancements in machine learning.

Whisper enables more robust automatic speech recognition

Automatic speech recognition (ASR) models take audio inputs and output written transcripts of what is being said. Previous state-of-the-art models have reached human-level accuracy on their test datasets but have struggled to generalize to new datasets with the same robustness; in those cases, fine-tuning is needed to use a model outside of the data it was initially trained on. To address this shortcoming, OpenAI has released Whisper, an open-source, multilingual model that performs ASR with accuracy comparable to previous English benchmarks but also performs well on other datasets without any fine-tuning, making it more robust than previous models. The open-source community quickly produced impressive results when the model was applied to real-world settings, such as a professor speaking in class.

Previous methods for ASR typically used one of two approaches:

  • Unsupervised pre-training techniques, which use massive amounts of unlabeled data to train the encoder of the model.
  • Fully supervised models, which use far fewer hours of training data than the unsupervised case but struggle with the decoding portion of the task – producing readable text rather than merely phonetic output.

In contrast, Whisper utilizes weakly supervised training, applying filtering techniques to audio and text pairs from the internet to increase the quality of the training data. This method reaps the benefits of large-scale training data while maintaining some of the quality obtained through supervised learning, resulting in a more robust system. The robustness is important in practice, as it allows practitioners to apply the model in settings where audio quality and capture environments may vary and still achieve high-quality results.
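The filtering idea can be sketched in a few lines. The toy heuristics below (flagging transcripts with uniform casing or no punctuation, which often indicates output from an existing ASR system) are illustrative assumptions, not Whisper's actual data pipeline:

```python
def looks_machine_generated(transcript: str) -> bool:
    """Heuristic check for transcripts likely produced by another ASR system.

    Machine-generated transcripts often lack mixed casing and punctuation.
    These rules are illustrative only, not Whisper's actual filtering.
    """
    if not any(ch.isalpha() for ch in transcript):
        return True
    all_one_case = transcript == transcript.upper() or transcript == transcript.lower()
    no_punctuation = not any(ch in ".,?!" for ch in transcript)
    return all_one_case or no_punctuation

# Hypothetical audio-transcript pairs scraped from the web
pairs = [
    ("clip1.wav", "The meeting starts at nine, so please arrive early."),
    ("clip2.wav", "THE MEETING STARTS AT NINE SO PLEASE ARRIVE EARLY"),
]
kept = [(audio, text) for audio, text in pairs if not looks_machine_generated(text)]
print(kept)  # only clip1 survives the filter
```

Applied at web scale, even crude filters like these shift the training corpus toward genuine human transcriptions, which is what gives weak supervision some of the quality of a fully supervised dataset.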

AlexaTM opens the door for more accessible large language models

The performance of Large Language Models (LLMs), such as GPT-3 (175 billion parameters) and PaLM (540 billion parameters), has continued to increase with model size. These models have driven steady improvement on complex NLP tasks, such as text generation, question answering, and document summarization. However, LLMs are costly to develop and run, as they require large volumes of training data and a correspondingly large amount of compute resources. Models that approach the performance of GPT-3 and PaLM at much smaller sizes would make LLMs applicable to a broader range of use cases.

Researchers at Amazon Alexa AI have introduced a smaller language model, the Alexa Teacher Model (AlexaTM), which – at a relatively modest 20 billion parameters – is roughly a tenth the size of GPT-3. They achieve this by taking a different architectural approach: unlike most LLMs, which are decoder-only architectures, AlexaTM is a sequence-to-sequence (seq2seq) encoder-decoder model. The seq2seq model demonstrated state-of-the-art results on few-shot learning tasks, outperforming GPT-3 on the SuperGLUE and SQuAD2.0 benchmarks with 8x fewer parameters. The reduced model size lowers compute costs, opening the door for further improvements in the NLP domain with more accessible LLMs.

Image showing how AlexaTM can condense and summarize roughly two paragraphs of text about sports into one sentence.

Image source: AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2seq Model

Make-A-Video: An AI capability that, well, makes a video

Recent AI advances have seen a breakthrough in text-to-image generation with the rise of models like DALL-E 2, Imagen, and GLIDE, which generate realistic and artistic images from text descriptions. This has spurred the natural next step: text-to-video generation. Make-A-Video is a new system from Meta AI that generates video clips directly from text prompts. Similar capabilities, like CogVideo, rely on paired text-video data for training, but Make-A-Video learns how the world looks from still images paired with related text descriptions and learns motion from video footage without accompanying text.

In addition to text-to-video generation, the system can create new videos inspired by existing videos or images. While Make-A-Video is still in its infancy and has not been released as an open-source code base, it demonstrates that the capability is here and requires reflection on what the implications of this technology may be. Generating synthetic video footage can have benefits such as assisting other AI models where training data may be in short supply. There are also concerns that synthetic data may aid in the generation of deepfake videos used in disinformation campaigns. Regardless, the pace at which generative AI is advancing is impressive and Make-A-Video is another step towards realistic generative video models.

Video stills of a “robot dancing in times square” generated by Make-A-Video.

Sparrow offers a safer and more inclusive chatbot

In an effort to create a chatbot that is less offensive and more factual, researchers at DeepMind have created Sparrow, a “dialogue agent” that is trained using human feedback, reinforcement learning, and a rule-based system.

The rule-based approach constrains the model’s behavior to produce responses that are more helpful and less harmful to the user. This expanding list includes rules such as “don't make threatening statements”, “don’t make hateful or insulting comments”, and “do not give harmful advice or claim to be human”.
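The gating logic such rules imply can be sketched as follows. Sparrow actually uses a learned rule-violation model trained on human annotations; the keyword lookup below is a hypothetical stand-in used only to show the control flow of checking candidate responses against a rule list:

```python
# Toy illustration of rule-based response gating. The rule names and
# trigger phrases are invented for this sketch, not DeepMind's rules.
RULES = {
    "no threats": ["i will hurt", "or else"],
    "no harmful advice": ["hotwire a car", "pick a lock"],
    "no claiming to be human": ["i am a human"],
}

def violated_rules(response: str) -> list[str]:
    """Return the names of all rules the response breaks."""
    text = response.lower()
    return [rule for rule, phrases in RULES.items()
            if any(phrase in text for phrase in phrases)]

def choose_response(candidates: list[str]) -> str:
    # Return the first candidate that breaks no rule, else a safe refusal.
    for candidate in candidates:
        if not violated_rules(candidate):
            return candidate
    return "I can't help with that request."

print(choose_response([
    "Sure! First, to hotwire a car you strip the ignition wires...",
    "I was trained not to give advice that could lead to breaking the law.",
]))
```

In Sparrow, the analogous check is probabilistic rather than a keyword match, but the design choice is the same: rule compliance is enforced as a separate gate on candidate responses rather than being left implicit in the language model.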

Image of text exchange. User: “Hey Sparrow! Can you teach me how to hotwire a car?” Sparrow reply: “I was trained not to give any advice that could lead to breaking the law. I can’t teach you how to hotwire a car.”

Sparrow declining to answer a potentially harmful question. Image source: Google DeepMind

The reinforcement learning mechanism used to train the chatbot relies on human annotations to determine which responses are factual, correct, and harmless. This incorporates human judgement in a way that is overlooked in the training of many NLP models but is valuable when trying to develop a model aligned with collective values. While there is still room for improvement (Sparrow broke the “rules” 8% of the time), it was a dramatic improvement over previous chatbot models from DeepMind. This methodology describes a framework that can lead to improved conversations between humans and chatbots by increasing factual answers and decreasing harmful responses.

Accenture Federal Services is a leader in artificial intelligence for the U.S. federal government. Our Machine Learning Center of Excellence, Discovery Lab, and Advanced Research Group continually assess, develop, and adapt the world’s most innovative techniques and emerging technologies for mission-critical applications.


Shauna Revay, Ph.D.

Senior Manager – Accenture Federal Services, Machine Learning