Welcome to the September 2022 edition of Baseline, Accenture Federal Services’ machine learning newsletter, where we share thoughts on important advances in machine learning technologies impacting federal agencies.
For the past few months, we’ve been working side-by-side with our Summer Analysts (interns), teaching them new skills and technologies while also learning from their fresh perspectives and creative thinking. Through our internship program and other initiatives, such as our partnership with AI4ALL, we work to expand and democratize access to AI education and training to promote compelling career paths.
Last month, we began inviting Summer Analysts to serve as guest editors for Baseline – check out August’s edition here. Summer Analysts once again led this month’s edition, sharing their analysis of recent significant machine learning updates. They selected:
- A language-agnostic approach to natural language processing that represents text as images
- A standardized language for the precise definition of machine learning datasets
- A technique that improves efficiency when training transformer-based language models
Best of luck to our Summer Analysts as they wrap up their internships – we can’t wait to see where you go next.
Click here to subscribe to email updates: Receive Baseline every month in your inbox and stay up-to-date on the latest advancements in machine learning.
Language Modeling with PIXEL
Language models such as BERT and GPT-3 have pushed the boundaries of natural language processing and advanced the way computers use human language. Each of these models is built on a finite vocabulary that covers every word or word piece the model can recognize. To support multiple languages, this vocabulary must expand to cover more potential inputs, and it grows larger with every language added. This is not sustainable: the model's input embedding and output layers scale with vocabulary size, so large vocabularies create computational bottlenecks during training and inference.
To address this problem, a new method called PIXEL replaces the traditional vocabulary with a visual text representation: text is rendered as an image, which is then split into a sequence of fixed-size patches. As a result, orthographically similar words from different languages have similar visual representations, which in principle lets the model handle any language that can be rendered as text and makes it more robust to noisy inputs. PIXEL outperformed BERT on syntactic and semantic processing tasks for scripts not found in its training data. Given the inequalities in coverage and performance of language models across the world’s languages, this work helps bridge the gap by offering a way to train powerful, language-agnostic language models without an ever-growing vocabulary.
Figure: Illustrative examples of text rendered by PIXEL into image patches.
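To make the idea of a patch-based input concrete, here is a minimal sketch in Python. It is not PIXEL's actual renderer: `toy_glyph` is a hypothetical stand-in for a real font engine that draws each character as a tiny bitmap derived from its character code, and the patch size of 4 pixels is an assumption chosen for readability. The point is only that the model's input becomes a sequence of fixed-size image patches rather than vocabulary IDs.

```python
from typing import List, Tuple

PATCH = 4        # assumed patch side length in pixels (chosen small for readability)
GLYPH_WIDTH = 2  # assumed width of each toy glyph; two glyphs fit in one patch

Bitmap = List[List[int]]


def toy_glyph(ch: str) -> Bitmap:
    """Hypothetical stand-in for a font renderer.

    Spreads the low 8 bits of the character code over a PATCH x GLYPH_WIDTH
    bitmap, so it only distinguishes characters by those bits.
    """
    bits = ord(ch) & 0xFF
    return [
        [(bits >> (row * GLYPH_WIDTH + col)) & 1 for col in range(GLYPH_WIDTH)]
        for row in range(PATCH)
    ]


def render(text: str) -> Bitmap:
    """Lay glyphs side by side, then pad the width to a multiple of PATCH."""
    image: Bitmap = [[] for _ in range(PATCH)]
    for ch in text:
        glyph = toy_glyph(ch)
        for row in range(PATCH):
            image[row].extend(glyph[row])
    padding = (-len(image[0])) % PATCH
    for row in image:
        row.extend([0] * padding)
    return image


def to_patches(image: Bitmap) -> List[Tuple[int, ...]]:
    """Cut the image into a left-to-right sequence of flattened PATCH x PATCH squares."""
    width = len(image[0])
    return [
        tuple(image[row][col] for row in range(PATCH) for col in range(x, x + PATCH))
        for x in range(0, width, PATCH)
    ]


cat = to_patches(render("cat"))
car = to_patches(render("car"))
# The shared prefix "ca" yields an identical first patch; only the second differs.
print(cat[0] == car[0], cat[1] == car[1])
```

Because the patches depend only on what the text looks like, no vocabulary lookup is involved, and words that look alike produce overlapping patch sequences regardless of language.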
A Domain-Specific Language for Dataset Description
With machine learning being applied to an increasing number of tasks, the issue of model bias, where biases inherent in the training data can negatively influence outcomes, becomes more critical. This has led many to shift towards data-centric approaches focused on improving training data quality to minimize bias and improve performance. To enable this shift, standardized practices for training data collection, processing, and reporting are needed.
To address this need, researchers at the Internet Interdisciplinary Institute have developed a Data Descriptive Domain Specific Language (DSL). It standardizes dataset naming conventions and captures information on data collection procedures, labeling processes, and data types, providing a common language for reporting ML datasets. Most importantly, the DSL offers a standardized way to report a dataset's limitations and data quality without sacrificing convenience: descriptions are easy to write, and the language is available as a VS Code plugin that practitioners can quickly fill in. The DSL can help ML practitioners locate quality datasets that align with the specific goals of their project, helping improve model quality and fairness.
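As a rough illustration of what a structured dataset description buys you, the Python sketch below models the kinds of information mentioned above as a simple record and flags sections that were left empty. This is not the DSL's actual syntax or tooling: the class `DatasetDescription`, its field names, and the helper `missing_sections` are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class DatasetDescription:
    """Hypothetical record of the information a dataset description might capture."""
    name: str
    collection_procedure: str = ""
    labeling_process: str = ""
    data_types: Dict[str, str] = field(default_factory=dict)
    known_limitations: List[str] = field(default_factory=list)


def missing_sections(desc: DatasetDescription) -> List[str]:
    """Return the names of sections that are still empty, in declaration order."""
    return [f.name for f in fields(desc) if not getattr(desc, f.name)]


# A made-up description with only some sections filled in.
draft = DatasetDescription(name="toy-survey", data_types={"age": "integer"})
print(missing_sections(draft))
```

Once descriptions share one structure, gaps such as an undocumented labeling process become something a tool can check rather than something a reader has to notice.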
Confident Adaptive Language Modeling (CALM)
Transformer-based language models have revolutionized natural language processing (NLP), improving performance on complex tasks such as text prediction and text generation. However, transformer models are often much larger and require much more compute time than other models, forcing researchers and engineers to make tradeoffs between output quality and cost.
Early exiting is an approach where the number of layers used by the model is decided dynamically on an input-by-input basis, which can be used to decrease the compute costs. However, a few questions remain. What factors go into deciding at which layer to exit? What are the tradeoffs between accuracy and computational efficiency being made when one exits early?
Recent research by Google Research and MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) presents a principled approach to early exits called Confident Adaptive Language Modeling (CALM). CALM uses an early exit classifier to estimate, at a given layer, how likely it is that exiting there would still produce an accurate prediction. This structure lets the model stop computing sooner while letting the user specify the required output quality (by controlling the boundary conditions the classifier uses to decide whether a prediction is accurate enough). Because CALM is an approach and not itself a model, it can be applied to any language task that uses transformers, allowing engineers to save both time and compute resources on a given language project.
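The early-exit loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not CALM's implementation: each "layer" is just a function that updates a list of per-token scores, and the confidence signal is the gap between the two largest softmax probabilities, a simple stand-in for the early exit classifier mentioned above. The names `decode_token`, `confidence`, and `threshold` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[List[float]], List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(scores: List[float]) -> float:
    """Assumed confidence signal: gap between the top two softmax probabilities."""
    probs = sorted(softmax(scores), reverse=True)
    return probs[0] - probs[1]


def decode_token(layers: Sequence[Layer], scores: List[float],
                 threshold: float) -> Tuple[int, int]:
    """Run layers until confident enough; return (token index, layers used)."""
    if not layers:
        raise ValueError("layers must be non-empty")
    for depth, layer in enumerate(layers, start=1):
        scores = layer(scores)
        # Exit early when confident, or unconditionally at the final layer.
        if confidence(scores) >= threshold or depth == len(layers):
            break
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, depth


# Toy stack: every layer doubles the scores, sharpening the distribution.
toy_layers: List[Layer] = [lambda s: [2 * x for x in s]] * 4
start = [1.0, 0.5, 0.0]

print(decode_token(toy_layers, start, threshold=0.7))   # lenient: exits early
print(decode_token(toy_layers, start, threshold=0.95))  # stricter: runs deeper
```

Raising `threshold` trades compute for caution: the stricter setting runs more layers before committing to a token, which mirrors the quality control the user is given over early exits.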
Join us next year?
Accenture was recently recognized as one of the nation’s top 100 internship programs. Learn how you can join us with information on Accenture’s internship and student opportunities found HERE.
Accenture Federal Services is a leader in artificial intelligence for the U.S. federal government. Our Machine Learning Center of Excellence, Discovery Lab, and Advanced Research Group continually assess, develop, and adapt the world’s most innovative techniques and emerging technologies for mission-critical applications.