Imaging systems consist of cameras to encode visual information about the world and perception models to interpret this encoding. Cameras contain (1) illumination sources, (2) optical elements, and (3) sensors, while perception models use (4) algorithms. Directly searching over all combinations of these four building blocks to design an imaging system is challenging due to the size of the search space. Moreover, cameras and perception models are often designed independently, leading to sub-optimal task performance. In this paper, we formulate these four building blocks of imaging systems as a context-free grammar (CFG), which can be automatically searched over with a learned camera designer to jointly optimize the imaging system with task-specific perception models. By transforming the CFG to a state-action space, we then show how the camera designer can be implemented with reinforcement learning to intelligently search over the combinatorial space of possible imaging system configurations. We demonstrate our approach on two tasks, depth estimation and camera rig design for autonomous vehicles, showing that our method yields camera rigs that outperform industry-standard designs. We believe that our proposed approach is an important step towards automating imaging system design.
DISeR: Designing Imaging Systems with Reinforcement Learning
Tzofi Klinghoffer*, Kushagra Tiwary*, Nikhil Behari, Bhavya Agrawalla, Ramesh Raskar
Designing camera systems is non-trivial due to the vast number of engineering decisions to be made, including which parameters to use for illumination, optics, and sensors. We define a language for imaging system design using a context-free grammar (CFG), which allows imaging systems to be represented as strings. The CFG serves as a search space over which search algorithms can automate imaging system design. We refer to such an algorithm as a camera designer (CD) and implement it with reinforcement learning (RL). RL allows us to search over imaging systems without relying on differentiable simulators and can scale to the combinatorially large search space of the CFG. Inspired by how animal eyes and brains are tightly integrated, our approach jointly trains the CD and a perception model (PM), using the accuracy of the PM to inform how the CD is updated during training. Because searching over the entire CFG is infeasible with available simulators, we take the first step of validating that RL can be used to search over subsets of the CFG, including the number of cameras, their poses, fields of view (FoV), and light intensity. First, we apply our method to depth estimation, demonstrating the viability of jointly learning imaging and perception. Next, we tackle the practical problem of designing a camera rig for autonomous vehicles (AVs) and show that our approach can create rigs that lead to higher perception accuracy than industry-standard rig designs.
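To make the grammar idea concrete, the snippet below sketches how a small, illustrative subset of such a CFG could be written down and sampled in Python. The non-terminals, production rules, and parameter values (e.g., light(intensity=low), lens(fov=90)) are assumptions made for illustration, not the exact grammar from the paper; sampling a derivation yields a string that describes one candidate imaging system.

```python
import random

# Each non-terminal maps to a list of possible productions; any symbol not in
# the table is treated as a terminal token. All rules here are illustrative.
GRAMMAR = {
    "RIG":    [["CAMERA"], ["CAMERA", "RIG"]],            # a rig is one or more cameras
    "CAMERA": [["ILLUM", "OPTICS", "SENSOR"]],            # camera = illumination + optics + sensor
    "ILLUM":  [["light(off)"], ["light(intensity=low)"], ["light(intensity=high)"]],
    "OPTICS": [["lens(fov=60)"], ["lens(fov=90)"], ["lens(fov=120)"]],
    "SENSOR": [["rgb(pose=front)"], ["rgb(pose=left)"], ["rgb(pose=right)"], ["rgb(pose=rear)"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol into a flat list of terminal tokens."""
    if symbol not in GRAMMAR:                              # terminal: return as-is
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])               # pick one production rule
    tokens = []
    for sym in production:
        tokens.extend(expand(sym, rng))
    return tokens

if __name__ == "__main__":
    rng = random.Random(0)
    # Each derivation is a string describing one candidate imaging system,
    # e.g. "light(off) lens(fov=90) rgb(pose=front)".
    print(" ".join(expand("RIG", rng)))
```

In this sketch the productions are chosen uniformly at random; the camera designer replaces those random choices with a learned policy, which is where RL comes in.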
We validate our approach with two tasks: (1) designing AV camera rigs for bird's eye view (BEV) segmentation, and (2) depth estimation.
AV Camera Rig Design: We apply our method to optimize an AV camera rig for the perception task of BEV segmentation by jointly training the camera designer (CD) and perception model (PM). The CD must choose the number of cameras as well as the placement and FoV of each. The PM is trained on the output images from each candidate camera rig in simulation (CARLA), and its test accuracy (intersection over union, IoU) is used as the reward to update the CD (a simplified sketch of this loop is given below). We find that the rigs created with our approach lead to higher BEV segmentation accuracy in our environment than the industry-standard nuScenes rig. The camera rigs and resulting images from both our approach and nuScenes are visualized above. When tasked with designing AV camera rigs, the CD learns:
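As a rough illustration of the feedback loop described above, here is a minimal REINFORCE-style sketch in PyTorch. The policy network, the (yaw, FoV, stop) action parameterization, and the perception_reward placeholder are our own simplifications for illustration; in the actual system the reward comes from training the BEV segmentation PM on images rendered from the candidate rig in CARLA and measuring its test IoU.

```python
# Hedged sketch of the joint CD/PM loop: the CD proposes a rig, a reward
# standing in for PM test IoU scores it, and a policy-gradient step updates the CD.
import torch
import torch.nn as nn

MAX_CAMERAS = 6

class CameraDesigner(nn.Module):
    """Toy policy: given the camera index, propose (yaw, FoV) and a stop decision."""
    def __init__(self, hidden=32):
        super().__init__()
        # outputs: mean yaw, mean FoV, shared log-std, stop logit
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 4))

    def forward(self, state):
        out = self.net(state)
        return out[:2], out[2].exp(), out[3]   # mean, std, stop logit

def perception_reward(rig):
    """Stand-in for: render the rig, train the PM, and return its test IoU.
    Here: a dummy score that rewards total angular FoV coverage."""
    return min(sum(fov for _, fov in rig) / 360.0, 1.0)

cd = CameraDesigner()
opt = torch.optim.Adam(cd.parameters(), lr=1e-3)

for episode in range(200):
    rig, log_probs = [], []
    for i in range(MAX_CAMERAS):
        state = torch.tensor([float(i), float(MAX_CAMERAS)])
        mean, std, stop_logit = cd(state)
        cam_dist = torch.distributions.Normal(mean, std)
        stop_dist = torch.distributions.Bernoulli(logits=stop_logit)
        action, stop = cam_dist.sample(), stop_dist.sample()
        log_probs.append(cam_dist.log_prob(action).sum() + stop_dist.log_prob(stop))
        yaw, fov = action.tolist()
        rig.append((yaw % 360.0, min(max(fov, 30.0), 120.0)))   # clamp to plausible FoVs
        if stop.item() == 1.0 and i > 0:                        # CD chooses the number of cameras
            break
    reward = perception_reward(rig)                             # stand-in for PM test IoU
    loss = -reward * torch.stack(log_probs).sum()               # REINFORCE policy-gradient update
    opt.zero_grad(); loss.backward(); opt.step()
```

The key design point carried over from the paper is that the gradient signal for the camera designer comes only from a scalar task reward, so the simulator and perception pipeline never need to be differentiable.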
@inproceedings{tzofi2023diser,
  author    = {Klinghoffer, Tzofi and Tiwary, Kushagra and Behari, Nikhil and Agrawalla, Bhavya and Raskar, Ramesh},
  title     = {DISeR: Designing Imaging Systems with Reinforcement Learning},
  booktitle = {International Conference on Computer Vision},
  year      = {2023}
}
We thank Siddharth Somasundaram for his diligent proofreading of the paper. KT was supported by the SMART Contract IARPA Grant #2021-20111000004. We also thank Systems & Technology Research (STR).