My work aims to co-optimize natural language processing (NLP) at the algorithm, hardware architecture, and solid-state layers. This endeavor has so far led to the following efforts:
1) A novel, hardware-friendly, floating-point-based encoding datatype, AdaptivFloat, which enables highly resilient and energy-efficient quantized computations. AdaptivFloat dynamically maximizes and optimally clips its available dynamic range, at tensor granularity, to create faithful encodings of neural network parameters. As a result, AdaptivFloat achieves higher inference accuracy at low bit precision than many other prominent datatypes used in deep learning computations.
AdaptivFloat quantization scheme based on an exponent bias shift derived from the maximum absolute tensor value
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. “Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference.” In DAC 2020, Best Paper Award.
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. “AdaptivFloat: A Floating-point based Data Type for Resilient Deep Learning Inference.” arXiv:1909.13271, 11/2019.
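The per-tensor exponent bias idea described in item 1 can be illustrated with a small sketch. This is only an approximation written for this summary, not the reference implementation: the function name `adaptivfloat_quantize`, the default bit widths, the round-half-to-even behavior of Python's `round`, and the handling of values below the smallest representable magnitude are all assumptions. It assumes one sign bit, `n_exp` exponent bits, no denormals, and that the lowest code point is given up to represent zero.

```python
import math

def adaptivfloat_quantize(values, n_bits=8, n_exp=3):
    """Quantize a list of floats onto an AdaptivFloat-style grid (illustrative sketch)."""
    n_mant = n_bits - 1 - n_exp              # one bit is reserved for the sign
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0 for _ in values]

    # Per-tensor exponent bias: align the largest exponent with the largest magnitude.
    exp_max = math.frexp(max_abs)[1] - 1     # equals floor(log2(max_abs))
    exp_bias = exp_max - (2 ** n_exp - 1)
    step = 2.0 ** -n_mant
    value_min = 2.0 ** exp_bias * (1 + step) # smallest nonzero magnitude (assumed)
    value_max = 2.0 ** exp_max * (2 - step)  # largest magnitude; larger values are clipped

    quantized = []
    for v in values:
        sign = -1.0 if v < 0 else 1.0
        mag = abs(v)
        if mag < value_min:
            # Too small to encode: snap to zero or to the smallest magnitude.
            q = 0.0 if mag < value_min / 2 else value_min
        else:
            mag = min(mag, value_max)        # clip to the available dynamic range
            exp = math.frexp(mag)[1] - 1     # exponent of this value
            scale = 2.0 ** (exp - n_mant)    # spacing of the grid at this exponent
            q = min(round(mag / scale) * scale, value_max)
        quantized.append(sign * q)
    return quantized

# 4-bit example: 1 sign bit, 2 exponent bits, 1 mantissa bit.
print(adaptivfloat_quantize([2.9, -1.2, 0.3, 0.05, 0.0], n_bits=4, n_exp=2))
# → [3.0, -1.0, 0.375, 0.0, 0.0]
```

Because the bias is derived from the tensor's own maximum, the same bit pattern can cover very different ranges in different layers, which is the property the description above attributes to AdaptivFloat.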
2) A 16nm system-on-chip for noise-robust speech recognition and NLP, featuring hardware acceleration of attention-based sequence-to-sequence DNNs with AdaptivFloat-based processing elements, and unsupervised Bayesian speech denoising. The test chip executes a full speech-enhancing ASR pipeline while consuming 2.24mJ of energy per frame with an end-to-end latency of 18ms, enabling real-time throughput.
Thierry Tambe, En-Yu Yang, Glenn G. Ko, Yuji Chai, Coleman Hooper, Marco Donato, Paul N. Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. “A 25mm² SoC for IoT Devices with 18ms Noise Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET.” In International Solid-State Circuits Conference (ISSCC'21).
3) As newer Transformer-based pre-trained models continue to generate impressive breakthroughs in language modeling, they characteristically exhibit complexities that levy hefty latency, memory, and energy taxes on resource-constrained, battery-operated devices. EdgeBERT provides a principled latency-driven methodology to alleviate these computational challenges at both the algorithm and hardware architecture layers. EdgeBERT uses first-layer early exit prediction to perform sentence-level dynamic voltage-frequency scaling (DVFS), minimizing energy consumption while adhering to a prescribed target latency.
Memory and latency optimizations for minimizing multi-task NLP energy consumption
Thierry Tambe, Coleman Hooper, Lillian Pentecost, Tianyu Jia, En-Yu Yang, Marco Donato, Victor Sanh, Paul Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. “EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference.” In 54th IEEE/ACM International Symposium on Microarchitecture (MICRO 2021).
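The sentence-level DVFS decision described in item 3 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not EdgeBERT's actual controller: the function `pick_operating_point`, the uniform per-layer cycle count, the voltage/frequency table, and all numeric values are hypothetical. It assumes the first layer runs at a nominal frequency, an exit layer is then predicted, and the remaining layers run at the slowest operating point that still meets the latency target.

```python
def pick_operating_point(predicted_exit_layer, cycles_per_layer,
                         target_latency_s, nominal_freq_hz, vf_table):
    """Pick the slowest (voltage, frequency) pair that still meets the target.

    vf_table is a list of (voltage_V, frequency_Hz) pairs sorted by
    ascending frequency; every value here is a hypothetical placeholder.
    """
    # The first layer runs at the nominal frequency before the prediction exists.
    elapsed_s = cycles_per_layer / nominal_freq_hz
    budget_s = target_latency_s - elapsed_s
    remaining_cycles = (predicted_exit_layer - 1) * cycles_per_layer

    for voltage, freq in vf_table:
        if budget_s > 0 and remaining_cycles / freq <= budget_s:
            return voltage, freq
    # Target unreachable: fall back to the fastest available point.
    return vf_table[-1]

# Hypothetical operating points and workload.
VF_TABLE = [(0.5, 100e6), (0.65, 400e6), (0.8, 1000e6)]

easy = pick_operating_point(2, 1_000_000, 0.05, 1000e6, VF_TABLE)
hard = pick_operating_point(12, 1_000_000, 0.05, 1000e6, VF_TABLE)
print(easy)  # → (0.5, 100000000.0)
print(hard)  # → (0.65, 400000000.0)
```

A sentence predicted to exit early leaves more slack in the latency budget, so it can run at a lower voltage and frequency, which is where the energy saving in this kind of scheme comes from.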