Hardware acceleration can revolutionize robotics, enabling new applications by speeding up robot response times while remaining power-efficient. However, the diversity of acceleration options makes it difficult for roboticists to easily deploy accelerated systems without expertise in each specific hardware platform. In this work, we address this challenge with RobotCore, an architecture to integrate hardware acceleration in the widely-used ROS 2 robotics software framework. This architecture is target-agnostic (supports edge, workstation, data center, or cloud targets) and accelerator-agnostic (supports both FPGAs and GPUs). It builds on top of the common ROS 2 build system and tools and is easily portable across different research and commercial solutions through a new firmware layer. We also leverage the Linux Tracing Toolkit next generation (LTTng) for low-overhead real-time tracing and benchmarking. To demonstrate the acceleration enabled by this architecture, we use it to deploy a ROS 2 perception computational graph on a CPU and FPGA.
We employ our integrated tracing and benchmarking to analyze bottlenecks, uncovering insights that guide us to improve FPGA communication efficiency. In particular, we design an intra-FPGA ROS 2 node communication queue to enable faster data flows, and use it in conjunction with FPGA-accelerated nodes to achieve a 24.42% speedup over a CPU.
Building domain-specific accelerators is becoming increasingly paramount to meet the high-performance requirements under stringent power and real-time constraints. However, emerging application domains like autonomous vehicles are complex systems, where the constraints extend beyond just the computing stack. Manually selecting and navigating the design space to design custom and efficient domain-specific SoCs (DSSoC) is tedious and expensive. As such, there is a need for automated DSSoC design methodologies. In this paper, we use agile and autonomous UAVs as a case study for understanding how to automate the design of domain-specific SoCs for autonomous vehicles. Architecting a UAV DSSoC requires considering parameters such as sensor rate, compute throughput, and other physical characteristics (e.g., payload weight, thrust-to-weight ratio) that affect overall performance. Iterating over the many component choices results in a combinatorial explosion of the number of possible combinations: from 10s of thousands to billions, depending on implementation details. To navigate the DSSoC design space efficiently, we introduce AutoPilot, a framework that automates full-system UAV co-design. AutoPilot uses machine learning to navigate the large DSSoC design space and automatically select a combination of autonomy algorithm and hardware accelerator while considering the cross-product effect across different UAV components. AutoPilot consistently outperforms general-purpose hardware selections like Xavier NX and Jetson TX2, as well as dedicated hardware accelerators built for autonomous UAVs. DSSoC designs generated by AutoPilot increase the number of missions on average by up to 2.25x, 1.62x, and 1.43x for nano, micro, and mini-UAVs, respectively, over baselines. We also discuss how AutoPilot can be extended to other closely related autonomous vehicles.
For application domains with a range of different deployment scenarios, such as robotics, it can be advantageous to parameterize accelerator designs so that they can be easily re-designed and customized per-deployment. Previous work prescribed a systematic methodology to parameterize a class of robotics accelerators according to the physical features of the robot, and manually implemented an accelerator for robot rigid body dynamics gradients for a manipulator arm. However, to automate the process of accelerator re-deployment for future work, it has been necessary to elaborate on the tool infrastructure to create a fully parameterized flow. Without delving into the details of our particular architectural framework, this case study informally provides a retrospective of some of the challenges with tools and software that we have encountered while parameterizing our domain-specific accelerator design.
Automating robust walking gaits for legged robots has been a long-standing challenge. Previous work has achieved robust locomotion gaits on sophisticated quadruped hardware platforms through the use of reinforcement learning and imitation learning. However, these approaches do not consider the strict constraints of ultra-low-cost robot platforms with limited computing resources, few sensors, and restricted actuation. These constrained robot platforms require special attention to successfully transfer skills learned in simulation to reality. As a step toward robust learning pipelines for these constrained robot platforms, we demonstrate how existing state-of-the-art imitation learning pipelines can be modified and augmented to support low-cost, limited hardware. By reducing our model’s observational space, leveraging TinyML to quantize our model, and adjusting the model outputs through post-processing, we are able to learn and deploy successful walking gaits on an 8-DoF, $299 (USD) toy quadruped robot that has reduced actuation and sensor feedback, as well as limited computing resources. A video of our current results can be found at: https://youtu.be/jloya0TOzWA.
We introduce GRiD: a GPU-accelerated library for computing rigid body dynamics with analytical gradients. GRiD was designed to accelerate the nonlinear trajectory optimization subproblem used in state-of-the-art robotic planning, control, and machine learning. Each iteration of nonlinear trajectory optimization requires tens to hundreds of naturally parallel computations of rigid body dynamics and their gradients. GRiD leverages URDF parsing and code generation to deliver optimized dynamics kernels that not only expose GPU-friendly computational patterns, but also take advantage of both fine-grained parallelism within each computation and coarse-grained parallelism between computations. Through this approach, when performing multiple computations of rigid body dynamics algorithms, GRiD provides as much as a 7.6x speedup over a state-of-the-art, multi-threaded CPU implementation, and maintains as much as a 2.6x speedup when accounting for I/O overhead. We release GRiD as an open-source library, so that it can be leveraged by the robotics community to easily and efficiently accelerate rigid body dynamics on the GPU.
Machine learning (ML) has become a pervasive tool across computing systems. An emerging application that stress-tests the challenges of ML system design is tiny robot learning, the deployment of ML on resource-constrained low-cost autonomous robots. Tiny robot learning lies at the intersection of embedded systems, robotics, and ML, compounding the challenges of these domains. Tiny robot learning is subject to challenges from size, weight, area, and power (SWAP) constraints; sensor, actuator, and compute hardware limitations; end-to-end system tradeoffs; and a large diversity of possible deployment scenarios. Tiny robot learning requires ML models to be designed with these challenges in mind, providing a crucible that reveals the necessity of holistic ML system design and automated end-to-end design tools for agile development. This paper gives a brief survey of the tiny robot learning space, elaborates on key challenges, and proposes promising opportunities for future work in ML system design.
Computing the gradient of rigid body dynamics is a central operation in many state-of-the-art planning and control algorithms in robotics. Parallel computing platforms such as GPUs and FPGAs can offer performance gains for algorithms with hardware-compatible computational structures. In this paper, we detail the designs of three faster than state-of-the-art implementations of the gradient of rigid body dynamics on a CPU, GPU, and FPGA. Our optimized FPGA and GPU implementations provide as much as a 3.0x end-to-end speedup over our optimized CPU implementation by refactoring the algorithm to exploit its computational features, e.g., parallelism at different granularities. We also find that the relative performance across hardware platforms depends on the number of parallel gradient evaluations required.
Robotics applications have hard time constraints and heavy computational burdens that can greatly benefit from domain-specific hardware accelerators. For the latency-critical problem of robot motion planning and control, there exists a performance gap of at least an order of magnitude between joint actuator response rates and state-of-the-art software solutions. Hardware acceleration can close this gap, but it is essential to define automated hardware design flows to keep the design process agile as applications and robot platforms evolve. To address this challenge, we introduce robomorphic computing: a methodology to transform robot morphology into a customized hardware accelerator morphology. We (i) present this design methodology, using robot topology and structure to exploit parallelism and matrix sparsity patterns in accelerator hardware; (ii) use the methodology to generate a parameterized accelerator design for the gradient of rigid body dynamics, a key kernel in motion planning; (iii) evaluate FPGA and synthesized ASIC implementations of this accelerator for an industrial manipulator robot; and (iv) describe how the design can be automatically customized for other robot models. Our FPGA accelerator achieves speedups of 8x and 86x over CPU and GPU when executing a single dynamics gradient computation. It maintains speedups of 1.9x to 2.9x over CPU and GPU, including computation and I/O round-trip latency, when deployed as a coprocessor to a host CPU for processing multiple dynamics gradient computations. ASIC synthesis indicates an additional 7.2x speedup for single computation latency. We describe how this principled approach generalizes to more complex robot platforms, such as quadrupeds and humanoids, as well as to other computational kernels in robotics, outlining a path forward for future robomorphic computing accelerators.
Rigid body dynamics calculations are needed for many tasks in robotics, including online control. While there currently exist several competing software implementations that are sufficient for use in traditional control approaches, emerging sophisticated motion control techniques such as nonlinear model predictive control demand orders of magnitude more frequent dynamics calculations. Current software solutions are not fast enough to meet that demand for complex robots. The goal of this work is to examine the performance of current dynamics software libraries in detail. In this paper, we (i) survey current state-of-the-art software implementations of the key rigid body dynamics algorithms (RBDL, Pinocchio, RigidBodyDynamics.jl, and RobCoGen), (ii) establish a methodology for benchmarking these algorithms, and (iii) characterize their performance through real measurements taken on a modern hardware platform. With this analysis, we aim to provide direction for future improvements that will need to be made to enable emerging techniques for real-time robot motion control. To this end, we are also releasing our suite of benchmarks to enable others to help contribute to this important task.