Hardware acceleration can revolutionize robotics, enabling new applications by speeding up robot response times while remaining power-efficient. However, the diversity of acceleration options makes it difficult for roboticists to easily deploy accelerated systems without expertise in each specific hardware platform. In this work, we address this challenge with RobotCore, an architecture to integrate hardware acceleration in the widely-used ROS 2 robotics software framework. This architecture is target-agnostic (supports edge, workstation, data center, or cloud targets) and accelerator-agnostic (supports both FPGAs and GPUs). It builds on top of the common ROS 2 build system and tools and is easily portable across different research and commercial solutions through a new firmware layer. We also leverage the Linux Tracing Toolkit next generation (LTTng) for low-overhead real-time tracing and benchmarking. To demonstrate the acceleration enabled by this architecture, we use it to deploy a ROS 2 perception computational graph on a CPU and FPGA.
We employ our integrated tracing and benchmarking to analyze bottlenecks, uncovering insights that guide us to improve FPGA communication efficiency. In particular, we design an intra-FPGA ROS 2 node communication queue to enable faster data flows, and use it in conjunction with FPGA-accelerated nodes to achieve a 24.42% speedup over a CPU.
Building domain-specific accelerators is becoming increasingly paramount to meet the high-performance requirements under stringent power and real-time constraints. However, emerging application domains like autonomous vehicles are complex systems, where the constraints extend beyond just the computing stack. Manually selecting and navigating the design space to design custom and efficient domain-specific SoCs (DSSoC) is tedious and expensive. As such, there is a need for automated DSSoC design methodologies. In this paper, we use agile and autonomous UAVs as a case study for understanding how to automate the design of domain-specific SoCs for autonomous vehicles. Architecting a UAV DSSoC requires considering parameters such as sensor rate, compute throughput, and other physical characteristics (e.g., payload weight, thrust-to-weight ratio) that affect overall performance. Iterating over the many component choices results in a combinatorial explosion of the number of possible combinations: from 10s of thousands to billions, depending on implementation details. To navigate the DSSoC design space efficiently, we introduce AutoPilot, a framework that automates full-system UAV co-design. AutoPilot uses machine learning to navigate the large DSSoC design space and automatically select a combination of autonomy algorithm and hardware accelerator while considering the cross-product effect across different UAV components. AutoPilot consistently outperforms general-purpose hardware selections like Xavier NX and Jetson TX2, as well as dedicated hardware accelerators built for autonomous UAVs. DSSoC designs generated by AutoPilot increase the number of missions on average by up to 2.25x, 1.62x, and 1.43x for nano, micro, and mini-UAVs, respectively, over baselines. We also discuss how AutoPilot can be extended to other closely related autonomous vehicles.
For application domains with a range of different deployment scenarios, such as robotics, it can be advantageous to parameterize accelerator designs so that they can be easily re-designed and customized per-deployment. Previous work prescribed a systematic methodology to parameterize a class of robotics accelerators according to the physical features of the robot, and manually implemented an accelerator for robot rigid body dynamics gradients for a manipulator arm. However, to automate the process of accelerator re-deployment for future work, it has been necessary to elaborate on the tool infrastructure to create a fully parameterized flow. Without delving into the details of our particular architectural framework, this case study informally provides a retrospective of some of the challenges with tools and software that we have encountered while parameterizing our domain-specific accelerator design.
Automating robust walking gaits for legged robots has been a long-standing challenge. Previous work has achieved robust locomotion gaits on sophisticated quadruped hardware platforms through the use of reinforcement learning and imitation learning. However, these approaches do not consider the strict constraints of ultra-low-cost robot platforms with limited computing resources, few sensors, and restricted actuation. These constrained robot platforms require special attention to successfully transfer skills learned in simulation to reality. As a step toward robust learning pipelines for these constrained robot platforms, we demonstrate how existing state-of-the-art imitation learning pipelines can be modified and augmented to support low-cost, limited hardware. By reducing our model’s observational space, leveraging TinyML to quantize our model, and adjusting the model outputs through post-processing, we are able to learn and deploy successful walking gaits on an 8-DoF, $299 (USD) toy quadruped robot that has reduced actuation and sensor feedback, as well as limited computing resources. A video of our current results can be found at: https://youtu.be/jloya0TOzWA.
We introduce GRiD: a GPU-accelerated library for computing rigid body dynamics with analytical gradients. GRiD was designed to accelerate the nonlinear trajectory optimization subproblem used in state-of-the-art robotic planning, control, and machine learning. Each iteration of nonlinear trajectory optimization requires tens to hundreds of naturally parallel computations of rigid body dynamics and their gradients. GRiD leverages URDF parsing and code generation to deliver optimized dynamics kernels that not only expose GPU-friendly computational patterns, but also take advantage of both fine-grained parallelism within each computation and coarse-grained parallelism between computations. Through this approach, when performing multiple computations of rigid body dynamics algorithms, GRiD provides as much as a 7.6x speedup over a state-of-the-art, multi-threaded CPU implementation, and maintains as much as a 2.6x speedup when accounting for I/O overhead. We release GRiD as an open-source library, so that it can be leveraged by the robotics community to easily and efficiently accelerate rigid body dynamics on the GPU.
Machine learning (ML) has become a pervasive tool across computing systems. An emerging application that stress-tests the challenges of ML system design is tiny robot learning, the deployment of ML on resource-constrained low-cost autonomous robots. Tiny robot learning lies at the intersection of embedded systems, robotics, and ML, compounding the challenges of these domains. Tiny robot learning is subject to challenges from size, weight, area, and power (SWAP) constraints; sensor, actuator, and compute hardware limitations; end-to-end system tradeoffs; and a large diversity of possible deployment scenarios. Tiny robot learning requires ML models to be designed with these challenges in mind, providing a crucible that reveals the necessity of holistic ML system design and automated end-to-end design tools for agile development. This paper gives a brief survey of the tiny robot learning space, elaborates on key challenges, and proposes promising opportunities for future work in ML system design.
Computing the gradient of rigid body dynamics is a central operation in many state-of-the-art planning and control algorithms in robotics. Parallel computing platforms such as GPUs and FPGAs can offer performance gains for algorithms with hardware-compatible computational structures. In this paper, we detail the designs of three faster than state-of-the-art implementations of the gradient of rigid body dynamics on a CPU, GPU, and FPGA. Our optimized FPGA and GPU implementations provide as much as a 3.0x end-to-end speedup over our optimized CPU implementation by refactoring the algorithm to exploit its computational features, e.g., parallelism at different granularities. We also find that the relative performance across hardware platforms depends on the number of parallel gradient evaluations required.
Robotics applications have hard time constraints and heavy computational burdens that can greatly benefit from domain-specific hardware accelerators. For the latency-critical problem of robot motion planning and control, there exists a performance gap of at least an order of magnitude between joint actuator response rates and state-of-the-art software solutions. Hardware acceleration can close this gap, but it is essential to define automated hardware design flows to keep the design process agile as applications and robot platforms evolve. To address this challenge, we introduce robomorphic computing: a methodology to transform robot morphology into a customized hardware accelerator morphology. We (i) present this design methodology, using robot topology and structure to exploit parallelism and matrix sparsity patterns in accelerator hardware; (ii) use the methodology to generate a parameterized accelerator design for the gradient of rigid body dynamics, a key kernel in motion planning; (iii) evaluate FPGA and synthesized ASIC implementations of this accelerator for an industrial manipulator robot; and (iv) describe how the design can be automatically customized for other robot models. Our FPGA accelerator achieves speedups of 8x and 86x over CPU and GPU when executing a single dynamics gradient computation. It maintains speedups of 1.9x to 2.9x over CPU and GPU, including computation and I/O round-trip latency, when deployed as a coprocessor to a host CPU for processing multiple dynamics gradient computations. ASIC synthesis indicates an additional 7.2x speedup for single computation latency. We describe how this principled approach generalizes to more complex robot platforms, such as quadrupeds and humanoids, as well as to other computational kernels in robotics, outlining a path forward for future robomorphic computing accelerators.
Rigid body dynamics calculations are needed for many tasks in robotics, including online control. While there currently exist several competing software implementations that are sufficient for use in traditional control approaches, emerging sophisticated motion control techniques such as nonlinear model predictive control demand orders of magnitude more frequent dynamics calculations. Current software solutions are not fast enough to meet that demand for complex robots. The goal of this work is to examine the performance of current dynamics software libraries in detail. In this paper, we (i) survey current state-of-the-art software implementations of the key rigid body dynamics algorithms (RBDL, Pinocchio, RigidBodyDynamics.jl, and RobCoGen), (ii) establish a methodology for benchmarking these algorithms, and (iii) characterize their performance through real measurements taken on a modern hardware platform. With this analysis, we aim to provide direction for future improvements that will need to be made to enable emerging techniques for real-time robot motion control. To this end, we are also releasing our suite of benchmarks to enable others to help contribute to this important task.
Power and thermal limitations make it impossible to run all cores on a multicore system at their maximum frequency. Therefore, modern systems require careful power management. These systems must manage complex tradeoffs between energy, power, and frequency, choosing which cores to accelerate to achieve good performance while maintaining energy efficiency or operating under a power budget. Navigating these tradeoffs is especially hard with multi-threaded applications, where performance depends on the relative progress of parallel worker threads between synchronization points. Prior work on chip-level power management for multi-threaded applications has largely relied on indirect heuristics and metrics calculated from low-level performance counters to estimate each thread’s progress. However, these indirect metrics are often inaccurate. Instead, we propose to gather progress information directly from software itself. We present ThreadBeats, a simple application-level annotation framework that directly and accurately conveys thread progress information to hardware. We design DVFS controllers that exploit ThreadBeats information for two purposes: (i) improving performance by equalizing thread progress and (ii) minimizing runtime under a power budget constraint. These controllers reduce wait time at barriers by 77% on average and improve energy-delay product under a power budget by 23% over prior work.
This paper described recent improvements to the Graphite simulator designed to help explore current and emerging research topics. With these improvements, Graphite is ideally suited to explore both power and performance in future multicore and manycore processors, especially those incorporating dynamic runtime monitoring and adaptation. Separate validation of Graphite has shown performance results within about 6% on average (18% worst case) of a cycle-level simulator and normalized power trends are predicted to within 10%. This makes Graphite accurate enough for medium- to long-term studies while maintaining very high performance. Graphite is freely available for anyone to use: https://github.com/mit-carbon/Graphite.
This paper presents a self-aware processor with energy monitoring circuits that can measure actual energy consumption of the key blocks. The monitors are embedded into on-chip DC/DC converters and generate results within 10% of accuracy with minimal power (<0.1%) and area (<1%) overhead. Our system, which is implemented in 0.18um technology, is designed to be voltage scalable from 1.8V down to 0.6V. Low-voltage SRAM operation is made possible through the use of 8T bit-cells and write-assists. The d-caches are designed to be re-configurable in associativity and size to adapt to compute- versus cache-bound phases of applications. Cache configuration is performed in < 3 clock cycles including tag invalidation. These hardware features enable a software self-aware computation engine (SEEC) to dynamically adapt the processor to meet performance and energy goals. Measurement results show that up to 8.4x energy savings can be achieved with DVFS and self-adaptation.
Henry Hoffmann, Jim Holt, George Kurian, Eric Lau, Martina Maggio, Jason E Miller, Sabrina M. Neuman, Mahmut Sinangil, Yildiz Sinangil, Anant Agarwal, Anantha P Chandrakasan, and Srinivas Devadas. 2012. “Self-aware computing in the Angstrom processor.” In Proceedings of the 49th Annual Design Automation Conference (DAC), Pp. 259–264. Full TextAbstract
Addressing the challenges of extreme scale computing requires holistic design of new programming models and systems that support those models. This paper discusses the Angstrom processor, which is designed to support a new Self-aware Computing (SEEC) model. In SEEC, applications explicitly state goals, while other systems components provide actions that the SEEC runtime system can use to meet those goals. Angstrom supports this model by exposing sensors and adaptations that traditionally would be managed independently by hardware. This exposure allows SEEC to coordinate hardware actions with actions specified by other parts of the system, and allows the SEEC runtime system to meet application goals while reducing costs (e.g., power consumption).
Smart Grid and Smart Meter initiatives seek to enable energy providers and consumers to intelligently manage their energy needs through real-time monitoring, analysis, and control. We have developed an inexpensive FPGA implementation of a spectral envelope preprocessor. This FPGA permits cost-effective and richly detailed power consumption monitoring for individual loads or collections of loads. It permits a flexible trade-off between data transmission, storage, and computation requirements in a power monitoring or control system. The information from the FPGA can be used to coordinate the operation of power electronic controls.