Publications

Working Paper
Working Paper. “Multi-Agent Reinforcement Learning for Efficient DNN Model Inference.” Coming soon.
Follow-up work on Multi-Agent RL for Architecture DSE.
2023
Srivatsan Krishnan, Amir Yazdanbakhsh, Shvetank Prakash, Jason Jabbour, Ikechukwu Uchendu, Susobhan Ghosh, Behzad Boroujerdian, Daniel Richins, Devashree Tripathy, Aleksandra Faust, and Vijay Janapa Reddi. 2023. “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design.” Proceedings of the 50th Annual International Symposium on Computer Architecture. Orlando: IEEE.
2022
Srivatsan Krishnan, Zishen Wan, Kshitij Bhardwaj, Paul Whatmough, Aleksandra Faust, Sabrina Neuman, Gu-Yeon Wei, David Brooks, and Vijay Janapa Reddi. 2022. “Automatic Domain-Specific SoC Design for Autonomous Unmanned Aerial Vehicles.” 55th ACM/IEEE International Symposium on Microarchitecture (MICRO 2022). Chicago: IEEE.
Srivatsan Krishnan, Natasha Jaques, Shayegan Omidshafiei, Dan Zhang, Izzeddin Gur, Vijay Janapa Reddi, and Aleksandra Faust. 2022. “Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration.” In Workshop on ML for Systems @ 36th Conference on Neural Information Processing Systems (NeurIPS 2022). New Orleans: http://mlforsystems.org/neurips2022/#collapse62282.
Srivatsan Krishnan, Sharad Chitlangia, Maximilian Lam, Zishen Wan, Gabriel Barth-Maron, Aleksandra Faust, and Vijay Janapa Reddi. 2022. “QuaRL: Quantization for Fast and Environmentally Sustainable Reinforcement Learning.” Transactions on Machine Learning Research.
Sabrina Neuman, Brian Plancher, Bart Duisterhof, Srivatsan Krishnan, Colby Banbury, Mark Mazumder, Shvetank Prakash, Jason Jabbour, Aleksandra Faust, Guido CHE de Croon, and Vijay Janapa Reddi. 2022. “Tiny Robot Learning: Challenges and Directions for Machine Learning in Resource-Constrained Robots.” 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS).
2021
Srivatsan Krishnan, Behzad Boroujerdian, Will Fu, Aleksandra Faust, and Vijay Janapa Reddi. 2021. “Air learning: a deep reinforcement learning gym for autonomous aerial robot visual navigation.” Machine Learning Journal, 110, Pp. 2501–2540.
2020
Srivatsan Krishnan, Sharad Chitlangia, Maximilian Lam, Zishen Wan, Aleksandra Faust, and Vijay Janapa Reddi. 3/5/2020. “Quantized Reinforcement Learning (QUARL).” In Workshop on Resource-Constrained Machine Learning at MLSys 2020. Austin.

Quantization techniques continue to play a significant role in serving deep learning applications on resource-constrained devices. However, whether these prior techniques, applied traditionally to image-based models, work with the same efficacy to the sequential decision-making process in reinforcement learning remains an unanswered question. To address this void, we conduct the first comprehensive empirical study that quantifies the effects of quantization on various deep reinforcement learning policies with the intent to reduce their computational resource demands. We apply techniques such as post-training quantization and quantization aware training to a spectrum of reinforcement learning tasks (such as Pong, Breakout, BeamRider, and more) and training algorithms (such as PPO, A2C, DDPG, and DQN). Across this spectrum of tasks and learning algorithms, we show that policies can be quantized to 6-8 bits of precision without loss of accuracy. We also show that certain tasks and reinforcement learning algorithms yield policies that are more difficult to quantize due to their effect of widening the models’ distribution of weights and that quantization aware training consistently improves results over post-training quantization and oftentimes even over the full precision baseline. Finally, we demonstrate real-world applications of quantization for reinforcement learning. We use mixed/half-precision training to train a Pong model 50% faster, and deploy a quantized reinforcement learning-based robot navigation policy onto an embedded system, achieving an 18× speedup and a 4× reduction in memory usage over an unquantized policy.

Srivatsan Krishnan, Suchit Subhaschandra, Pratik Marolia, and Brent Thomas. 1/7/2020. “Methods and apparatus to provide user-level access authorization for cloud-based field-programmable gate arrays.” United States of America 10528768 (USPTO).
Methods and apparatus to provide user-level access authorization for cloud-based field-programmable gate arrays are disclosed. An example apparatus includes a field-programmable gate array (FPGA) including a first memory and a second memory different from the first memory. The first memory stores a bitstream. The second memory stores a first user tag associated with the bitstream. The example apparatus further includes a kernel having an FPGA driver operatively coupled to the FPGA. The FPGA driver is to receive a command associated with accessing the FPGA from a user-executed application. The FPGA driver is further to identify a second user tag associated with the command. The FPGA driver is further to determine whether the command is to be accepted based on the second user tag.
Srivatsan Krishnan, Colby Banbury, Bart Duisterhof, Aleksandra Faust, and Vijay Janapa Reddi. 2020. “Air Learning: An End to End Learning Gym for Aerial Robots.” Third Conference on Machine Learning and Systems (Demo Track). Austin, TX.
Air Learning is an interdisciplinary, open-source research infrastructure that aims to soften the boundaries between aerial robotics, machine learning, controls, and systems architecture. It provides the tooling and various components necessary for developing an end-to-end learning-based application for aerial robotics, starting from simulation to deployment on a real aerial robot. By having all the key components tightly integrated, we hope researchers can use this tool to develop novel solutions to several open problems in these domains. We also hope researchers can use it to explore and understand various trade-offs of their solution due to cross-domain interactions between algorithms and systems. Also, since the infrastructure is open-source, the research community can add new features, thus modifying it according to their own requirements and use cases.
Srivatsan Krishnan, Zishen Wan, Kshitij Bhardwaj, Paul Whatmough, Aleksandra Faust, Gu-Yeon Wei, David Brooks, and Vijay Janapa Reddi. 2020. “The Sky Is Not the Limit: A Visual Performance Model for Cyber-Physical Co-Design in Autonomous Machines.” IEEE Computer Architecture Letters.
We introduce the "Formula-1" (F-1) roofline model to understand the role of computing in aerial autonomous machines. The model provides insights by exploiting the fundamental relationships between various components in an aerial robot, such as sensor framerate, compute performance, and body dynamics (physics). The model serves as a tool that can aid computer and cyber-physical system architects to understand the optimal design (or selection) of various components in the development of autonomous machines.
2019
Bardienus Duisterhof, Srivatsan Krishnan, Jonathan J. Cruz, Colby R. Banbury, William Fu, Aleksandra Faust, Guido C. H. E. de Croon, and Vijay Janapa Reddi. 9/25/2019. Learning to Seek: Autonomous Source Seeking with Deep Reinforcement Learning Onboard a Nano Drone Microcontroller.
Fully autonomous navigation using nano drones has numerous applications in the real world, ranging from search and rescue to source seeking. Nano drones are well-suited for source seeking because of their agility, low price, and ubiquitous character. Unfortunately, their constrained form factor limits flight time, sensor payload, and compute capability. These challenges are a crucial limitation for the use of source-seeking nano drones in GPS-denied and highly cluttered environments. Hereby, we introduce a fully autonomous deep reinforcement learning-based light-seeking nano drone. The 33-gram nano drone performs all computation on-board the ultra-low-power microcontroller (MCU). We present the method for efficiently training, converting, and utilizing deep reinforcement learning policies. Our training methodology and novel quantization scheme allow fitting the trained policy in 3 kB of memory. The quantization scheme uses representative input data and input scaling to arrive at a full 8-bit model. Finally, we evaluate the approach in simulation and flight tests using a Bitcraze CrazyFlie, achieving 80% success rate on average in a highly cluttered and randomized test environment. Even more, the drone finds the light source in 29% fewer steps compared to a baseline simulation (obstacle avoidance without source information). To our knowledge, this is the first deep reinforcement learning method that enables source seeking within a highly constrained nano drone demonstrating robust flight behavior. Our general methodology is suitable for any (source seeking) highly constrained platform using deep reinforcement learning.
Behzad Boroujerdian, Hasan Genc, Srivatsan Krishnan, Bardienus Pieter Duisterhof, Brian Plancher, Kayvan Mansoorshahi, Marcelino Almeida, Wenzhi Cui, Aleksandra Faust, and Vijay Janapa Reddi. 6/24/2019. The Role of Compute in Autonomous Aerial Vehicles.
Autonomous and mobile cyber-physical machines are becoming an inevitable part of our future. In particular, unmanned aerial vehicles have seen a resurgence in activity. With multiple use cases, such as surveillance, search and rescue, package delivery, and more, these unmanned aerial systems are on the cusp of demonstrating their full potential. Despite such promises, these systems face many challenges, one of the most prominent of which is their low endurance caused by their limited onboard energy. Since the success of a mission depends on whether the drone can finish it within such duration and before it runs out of battery, improving both the time and energy associated with the mission is of high importance. Such improvements have traditionally been arrived at through the use of better algorithms. But our premise is that more powerful and efficient onboard compute can also address the problem. In this paper, we investigate how the compute subsystem, in a cyber-physical mobile machine, such as a Micro Aerial Vehicle (MAV), can impact mission time and energy. Specifically, we pose the question as “what is the role of computing for cyber-physical mobile robots?” We show that compute and motion are tightly intertwined, and as such a close examination of cyber and physical processes and their impact on one another is necessary. We show different “impact paths” through which compute impacts mission metrics and examine them using a combination of analytical models, simulation, micro and end-to-end benchmarking. To enable similar studies, we open sourced MAVBench, our tool-set, which consists of (1) a closed-loop real-time feedback simulator and (2) an end-to-end benchmark suite comprised of state-of-the-art kernels. By combining MAVBench, analytical modeling, and an understanding of various compute impacts, we show up to 2X and 1.8X improvements for mission time and mission energy for two optimization case studies.
Our investigations, as well as our optimizations, show that cyber-physical co-design, a methodology with which both the cyber and physical processes/quantities of the robot are developed with consideration of one another, similar to hardware-software co-design, is necessary for arriving at the design of the optimal robot.
Srivatsan Krishnan, Behzad Boroujerdian, William Fu, Aleksandra Faust, and Vijay Janapa Reddi. 6/9/2019. Air Learning: An AI Research Platform for Algorithm-Hardware Benchmarking of Autonomous Aerial Robots.
We introduce Air Learning, an AI research platform for benchmarking algorithm-hardware performance and energy efficiency trade-offs. We focus in particular on deep reinforcement learning (RL) interactions in autonomous unmanned aerial vehicles (UAVs). Equipped with a random environment generator, Air Learning exposes a UAV to a diverse set of challenging scenarios. Users can specify a task, train different deep RL policies and evaluate their performance and energy efficiency on a variety of hardware platforms. To show how Air Learning can be used, we seed it with Deep Q Networks (DQN) and Proximal Policy Optimization (PPO) to solve a point-to-point obstacle avoidance task in three different environments, generated using our configurable environment generator. We train the two algorithms using curriculum learning and non-curriculum learning. Air Learning assesses the trained policies’ performance, under a variety of quality-of-flight (QoF) metrics, such as the energy consumed, endurance and the average trajectory length, on resource-constrained embedded platforms like a Raspberry Pi. We find that the trajectories on an embedded Ras-Pi are vastly different from those predicted on a high-end desktop system, resulting in up to 79.43% longer trajectories in one of the environments. To understand the source of such discrepancies, we use Air Learning to artificially degrade high-end desktop performance to mimic what happens on a low-end embedded system. QoF metrics with hardware-in-the-loop characterize those differences and expose how the choice of onboard compute affects the aerial robot’s performance. We also conduct reliability studies to demonstrate how Air Learning can help understand how sensor failures affect the learned policies. All put together, Air Learning enables a broad class of deep RL studies on UAVs. More information and source code for the Air Learning project can be found online here: http://bit.ly/2JNAVb6.
Srivatsan Krishnan, Behzad Boroujerdian, Aleksandra Faust, and Vijay Janapa Reddi. 5/24/2019. Toward Exploring End-to-End Learning Algorithms for Autonomous Aerial Machines.
We develop AirLearning, a tool suite for end-to-end closed-loop UAV analysis, equipped with a customized yet randomized environment generator in order to expose the UAV to a diverse set of challenges. We take Deep Q networks (DQN) as an example deep reinforcement learning algorithm and use curriculum learning to train a point-to-point obstacle avoidance policy. While we determine the best policy based on the success rate, we evaluate it under strict resource constraints on an embedded platform such as RasPi 3. Using a hardware-in-the-loop methodology, we quantify the policy’s performance with quality of flight metrics such as energy consumed, endurance and the average length of the trajectory. We find that the trajectories produced on the embedded platform are very different from those predicted on the desktop, resulting in up to 26.43% longer trajectories. Quality of flight metrics with hardware in the loop characterize those differences in simulation, thereby exposing how the choice of onboard compute contributes to shortening or widening of the ‘Sim2Real’ gap.
2018
Behzad Boroujerdian, Hasan Genc, Srivatsan Krishnan, Wenzhi Cui, Aleksandra Faust, and Vijay Janapa Reddi. 10/2018. “MAVBench: Micro Aerial Vehicle Benchmarking.” In The 51st Annual IEEE/ACM International Symposium on Microarchitecture. Fukuoka, Japan: IEEE/ACM.
Unmanned Aerial Vehicles (UAVs) are getting closer to becoming ubiquitous in everyday life. Among them, Micro Aerial Vehicles (MAVs) have seen an outburst of attention recently, specifically in the area with a demand for autonomy. A key challenge standing in the way of making MAVs autonomous is that researchers lack a comprehensive understanding of how performance, power, and computational bottlenecks affect MAV applications. MAVs must operate under a stringent power budget, which severely limits their flight endurance time. As such, there is a need for new tools, benchmarks, and methodologies to foster the systematic development of autonomous MAVs. In this paper, we introduce the MAVBench framework, which consists of a closed-loop simulator and an end-to-end application benchmark suite. A closed-loop simulation platform is needed to probe and understand the intra-system (application data flow) and inter-system (system and environment) interactions in MAV applications to pinpoint bottlenecks and identify opportunities for hardware and software co-design and optimization. In addition to the simulator, MAVBench provides a benchmark suite, the first of its kind, consisting of a variety of MAV applications designed to enable computer architects to perform characterization and develop future aerial computing systems. Using our open source, end-to-end experimental platform, we uncover a hidden, and thus far unexpected, compute to total system energy relationship in MAVs. Furthermore, we explore the role of compute by presenting three case studies targeting performance, energy and reliability. These studies confirm that an efficient system design can improve MAV's battery consumption by up to 1.8X.
Eriko Nurvitadhi, Ganesh Venkatesh, Srivatsan Krishnan, Suchit Subhaschandra, and Deborah Marr. 7/5/2018. “Hardware accelerator architecture and template for web-scale k-means clustering.” United States of America 15396515 (USPTO).
Hardware accelerator architectures for clustering are described. A hardware accelerator includes sparse tiles and very/hyper sparse tiles. The sparse tile(s) execute operations for a clustering task involving a matrix. Each sparse tile includes a first plurality of processing units to operate upon a first plurality of blocks of the matrix that have been streamed to one or more random access memories of the sparse tiles over a high bandwidth interface from a first memory unit. Each of the very/hyper sparse tiles is to execute operations for the clustering task involving the matrix. Each of the very/hyper sparse tiles includes a second plurality of processing units to operate upon a second plurality of blocks of the matrix that have been randomly accessed over a low-latency interface from a second memory unit.
Duncan Moss, Srivatsan Krishnan, Eriko Nurvitadhi, Piotr Ratuszniak, Chris Johnson, Jaewoong Sim, Asit Mishra, Debbie Marr, Suchit Subhaschandra, and Philip HW Leong. 2/15/2018. “A Customizable Matrix Multiplication Framework For The Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study.” In 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. California, USA: IEEE/ACM.
General Matrix to Matrix multiplication (GEMM) is the cornerstone for a wide gamut of applications in high performance computing (HPC), scientific computing (SC) and more recently, deep learning. In this work, we present a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that includes support for both traditional single precision floating point and reduced precision workloads. Our framework supports arbitrary size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software and (2) a highly customizable hardware template. The API provides both compile and runtime options for controlling key aspects of the hardware template including dynamic precision switching; interleaving and block size control; and fused deep learning specific operations. The framework currently supports single precision floating point (FP32), 16, 8, 4 and 2 bit Integer and Fixed Point (INT16, INT8, INT4, INT2) and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary, BinaryxBinary. We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet and ResNet) we illustrate that reduced precision representations such as binary achieve the best performance, and that the HARPv2 enables fine-grained partitioning of computations over both the Xeon and FPGA. We observe up to 50x improvement in execution time compared to single precision floating point, and that runtime configuration options can improve the efficiency of certain layers in AlexNet up to 4x, achieving an overall 1.3x improvement over the entire network.

Behzad Boroujerdian, Hasan Genc, Srivatsan Krishnan, Aleksandra Faust, and Vijay Janapa Reddi. 2018. “Why Compute Matters for UAV Energy Efficiency?” 2nd International Symposium on Aerial Robotics.
Unmanned Aerial Vehicles (UAVs) are getting closer to becoming ubiquitous in everyday life. Although researchers in the robotics domain have made rapid progress in recent years, hardware and software architects in the computer architecture community lack a comprehensive understanding of how performance, power, and computational bottlenecks affect UAV applications. Such an understanding enables system architects to design microchips tailored for aerial agents. This paper is an attempt by computer architects to initiate the discussion between the two academic domains by investigating the underlying compute systems’ impact on aerial robotic applications. To do so, we identify performance and energy constraints and examine the impact of various compute knobs such as processor cores and frequency on these constraints. Our experiments show that such knobs allow for up to 5X speedup for a wide class of applications.
2017
Jack Yinger, Eriko Nurvitadhi, Davor Capalija, Andrew Ling, Debbie Marr, Srivatsan Krishnan, Duncan Moss, and Suchit Subhaschandra. 12/11/2017. “Customizable FPGA OpenCL matrix multiply design template for deep neural networks.” In 2017 International Conference on Field Programmable Technology (ICFPT). Melbourne, VIC, Australia: IEEE.
Deep neural networks (DNNs) have gained popularity for their state-of-the-art accuracy and relative ease of use. DNNs rely on a growing variety of matrix multiply operations (i.e., dense to sparse, FP32 to N-bit). We propose an OpenCL-based matrix multiply design template, which enables automated design exploration to generate optimized FPGA matrix accelerators for DNN applications. Given the desired matrix operations (e.g., sparsity, data types), our template rapidly produces performance and area estimates for a variety of design variants and/or FPGA platforms. Upon identifying compelling design points and target platforms, FPGA implementations can then be generated automatically using the Intel® OpenCL™ FPGA SDK. We show the effectiveness of the template with a comparison to hand-tuned RTL, a design space exploration, and a DNN case study.
