My work aims to co-optimize natural language processing (NLP) at the algorithm, hardware architecture, and solid-state layers. This endeavor has so far led to the following efforts:


1) AdaptivFloat, a novel, hardware-friendly, floating-point-based encoding datatype that enables highly resilient and energy-efficient quantized computations. AdaptivFloat dynamically maximizes and optimally clips its available dynamic range, at tensor granularity, in order to create faithful encodings of neural network parameters. In doing so, AdaptivFloat achieves higher inference accuracy at low bit precision than many other prominent datatypes used in deep learning computations.

[Figure: AdaptivFloat quantization. The scheme is based on an exponent bias shift derived from the maximum absolute tensor value.]
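
The idea of shifting the exponent bias according to the largest magnitude in a tensor can be illustrated with a minimal sketch. This is a simplified illustration under stated assumptions, not the reference implementation: the function name `adaptivfloat_quantize`, the default bit widths, and the choices to use no denormals and to snap values below the smallest representable magnitude to either zero or that magnitude are all assumptions made for this example.

```python
import math

def adaptivfloat_quantize(values, n_bits=8, n_exp=3):
    """Quantize a flat list of floats to an AdaptivFloat-like format (sketch).

    Returns (quantized_values, exp_bias). All names and defaults here are
    hypothetical and chosen only for illustration.
    """
    n_mant = n_bits - n_exp - 1  # one bit is reserved for the sign
    if n_mant < 0:
        raise ValueError("n_bits too small for the requested exponent width")

    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0] * len(values), 0

    # Shift the exponent range so that its top matches the largest magnitude.
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) is e - 1.
    exp_max = math.frexp(max_abs)[1] - 1
    exp_bias = exp_max - (2 ** n_exp - 1)

    step = 2.0 ** -n_mant                      # mantissa resolution
    value_min = 2.0 ** exp_bias * (1 + step)   # assumed smallest nonzero magnitude
    value_max = 2.0 ** exp_max * (2 - step)    # largest representable magnitude

    quantized = []
    for v in values:
        mag = abs(v)
        if mag < value_min:
            # Below the representable range: pick the closer of 0 and value_min.
            mag_q = 0.0 if mag < value_min / 2 else value_min
        elif mag >= value_max:
            # Above the representable range: clip to the largest magnitude.
            mag_q = value_max
        else:
            exp = math.frexp(mag)[1] - 1       # floor(log2(mag))
            mant = mag / 2.0 ** exp            # normalized mantissa in [1, 2)
            mant_q = round(mant / step) * step # Python rounds ties to even
            mag_q = mant_q * 2.0 ** exp
        quantized.append(math.copysign(mag_q, v))
    return quantized, exp_bias


weights = [0.75, -0.3, 0.02, 1.5, -0.001]
q_weights, bias = adaptivfloat_quantize(weights, n_bits=8, n_exp=3)
```

Because the bias is recomputed per tensor, the same bit budget covers a layer with small weights and a layer with large weights equally well, which is the property the paragraph above attributes to AdaptivFloat.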

2) A 16 nm system-on-chip for noise-robust speech recognition and NLP, featuring hardware acceleration of sequence-to-sequence attention-based DNNs with AdaptivFloat-based processing elements, and unsupervised Bayesian speech denoising. The test chip executes a full speech-enhancing ASR pipeline while consuming 2.24 mJ of energy per frame with an end-to-end latency of 18 ms, enabling real-time throughput.



3) As newer Transformer-based pre-trained models continue to generate impressive breakthroughs in language modeling, they exhibit complexities that levy hefty latency, memory, and energy taxes on resource-constrained, battery-operated devices. EdgeBERT provides a principled latency-driven methodology to alleviate these computational challenges at both the algorithm and hardware architecture layers. EdgeBERT adopts first-layer early-exit prediction to perform sentence-level dynamic voltage-frequency scaling (DVFS), minimizing energy consumption while adhering to a prescribed target latency.


[Figure: EdgeBERT HW/SW co-design. Memory and latency optimizations for minimizing multi-task NLP energy consumption.]
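
The latency-driven DVFS decision described above can be sketched as follows. This is a hedged illustration, not EdgeBERT's actual controller: it assumes the exit layer has already been predicted after the first layer, that every layer costs the same number of cycles, and that a small table of voltage/frequency operating points is available. The names `OperatingPoint` and `select_operating_point` and all numeric values are hypothetical.

```python
from typing import NamedTuple, Sequence

class OperatingPoint(NamedTuple):
    voltage_v: float  # supply voltage in volts (hypothetical values)
    freq_hz: float    # clock frequency in hertz

def select_operating_point(
    predicted_exit_layer: int,
    cycles_per_layer: int,
    target_latency_s: float,
    elapsed_s: float,
    points: Sequence[OperatingPoint],
) -> OperatingPoint:
    """Pick the slowest operating point that still meets the latency target.

    Assumes layer 1 has already run (taking elapsed_s seconds) and that the
    sentence is predicted to exit after predicted_exit_layer.
    """
    ordered = sorted(points, key=lambda p: p.freq_hz)
    remaining_layers = max(predicted_exit_layer - 1, 0)
    remaining_cycles = remaining_layers * cycles_per_layer
    time_left_s = target_latency_s - elapsed_s
    if time_left_s <= 0:
        # Budget already exhausted: fall back to the fastest point.
        return ordered[-1]

    required_hz = remaining_cycles / time_left_s
    for point in ordered:
        if point.freq_hz >= required_hz:
            return point  # lowest frequency that satisfies the deadline
    return ordered[-1]    # deadline cannot be met; run as fast as possible


# Hypothetical operating-point table for illustration only.
table = [
    OperatingPoint(0.50, 100e6),
    OperatingPoint(0.65, 400e6),
    OperatingPoint(0.80, 1e9),
]
choice = select_operating_point(
    predicted_exit_layer=4,
    cycles_per_layer=1_000_000,
    target_latency_s=0.010,
    elapsed_s=0.002,
    points=table,
)
```

In this sketch, a sentence predicted to exit early needs fewer remaining cycles, so a lower frequency (and, by assumption, a lower voltage) suffices to meet the same target latency, which is where the energy saving comes from.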