My work aims to co-optimize natural language processing (NLP) at the algorithm, hardware architecture, and solid-state layers. This endeavor has so far led to the following efforts:
1) A novel and hardware-friendly floating-point based encoding datatype, AdaptivFloat, which enables highly resilient and energy-efficient quantized computations. AdaptivFloat dynamically maximizes and optimally clips its available dynamic range, at a tensor granularity, in order to create faithful encodings of neural network parameters. In doing so, AdaptivFloat produces higher inference accuracies at low bit precision compared to many other prominent datatypes used in deep learning computations.
AdaptivFloat quantization scheme based on exponential bias shift from maximum absolute tensor value
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. “Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference.” In DAC 2020, Best Paper Award.
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. “AdaptivFloat: A Floating-point based Data Type for Resilient Deep Learning Inference” arXiv (1909.13271), 11/2019.
2) A 16nm system-on-chip for noise-robust speech recognition featuring hardware acceleration of attention-based DNNs with AdaptivFloat-based processing elements, and Bayesian-based speech denoising.
Thierry Tambe, En-Yu Yang, Glenn G. Ko, Yuji Chai, Coleman Hooper, Marco Donato, Paul N. Whatmough, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. “A 25mm2 SoC for IoT Devices with 18ms Noise Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET.” In International Solid-State Circuits Conference (ISSCC'21)
3) As newer Transformer-based pre-trained models continue to generate impressive breakthroughs in language modeling, they characteristically exhibit complexities that levy hefty latency, memory, and energy taxes on resource-constrained embedded platforms. EdgeBERT provides an in-depth and principled methodology to alleviate these computational challenges in both the algorithm and hardware architecture layers. Furthermore, EdgeBERT investigates to what extent the very dense, albeit stochastic, storage capabilities of emerging non-volatile memories can be exploited in satisfying the always-on and intermediate computing requirements of fully on-chip multi-task NLP.
Memory and latency optimizations incorporated under the EdgeBERT methodology
Thierry Tambe, Coleman Hooper, Lillian Pentecost, En-Yu Yang, Marco Donato, Victor Sanh, Alexander M. Rush, David Brooks, and Gu-Yeon Wei. “EdgeBERT: Optimizing On-Chip Inference for Multi-Task NLP” arXiv (2011.14203), 11/28/2020.