For M19-D to be fully autonomous, it needs to be able to perceive the world, much like we do with our own eyes. Our driverless car senses the world using visible light (cameras), infrared light (LiDAR) and radio waves (GPS). MEMS (Micro-Electro-Mechanical Systems) sensors provide additional information on the vehicle state: the IMU (Inertial Measurement Unit) combines accelerometers and gyroscopes to measure the accelerations and angular rates the car experiences, and a magnetometer determines the heading of the car based on the Earth's magnetic field. The challenging part of perception is combining all the information from these sensors into a useful and accurate estimate of the vehicle's state and the cone positions in the world. To do this we currently use an Extended Kalman Filter (EKF) based SLAM (Simultaneous Localisation and Mapping).
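
As a small illustration of one of these measurements, a heading estimate can be pulled out of the horizontal components of the magnetic field. This is only a minimal sketch: the sign convention and the lack of tilt compensation are assumptions for illustration, not our actual implementation.

```python
import math

def heading_from_magnetometer(mx: float, my: float, declination_rad: float = 0.0) -> float:
    """Estimate heading (radians) from the horizontal magnetometer components mx, my.

    Assumes the sensor is level (no tilt compensation) -- a real implementation
    would first rotate the raw field vector using roll/pitch from the IMU.
    """
    heading = math.atan2(-my, mx)   # sign convention is an assumption
    heading += declination_rad      # optional magnetic -> true north correction
    # wrap to [-pi, pi)
    return (heading + math.pi) % (2 * math.pi) - math.pi
```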

The camera pipeline has two major steps: cone detection using a neural network, and stereo matching for depth estimation. Together, these steps turn a pair of 2D stereo images into the 3D coordinates of the cones relative to the car.
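
The depth part of that comes straight from the stereo geometry: a cone that appears shifted between the left and right images by some disparity sits at a depth of Z = f·B/d. The calibration numbers in this sketch are placeholders, not our camera's actual focal length or baseline.

```python
def stereo_depth(disparity_px: float, focal_length_px: float, baseline_m: float) -> float:
    """Depth of a matched point from its horizontal disparity between
    the left and right images: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a valid match")
    return focal_length_px * baseline_m / disparity_px

# Placeholder calibration values (assumptions, not M19-D's camera):
# a cone matched with a 35-pixel disparity, f = 1400 px, baseline = 0.12 m
# sits roughly 4.8 m ahead of the car.
print(stereo_depth(35.0, 1400.0, 0.12))
```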

The convolutional neural network used is a slightly modified variant of Tiny-YOLOv3, with two detection layers and a higher spatial resolution. The 'You Only Look Once', or YOLO, family of networks offers an excellent trade-off between accuracy and computational cost, and is ideally suited to our task. We train this network in PyTorch, a Python deep-learning framework developed by Facebook for machine learning research. To make inference as fast as possible, we deploy the network on the car using NVIDIA's TensorRT library. Combined with the NVIDIA Xavier kindly provided by Xenon, our neural network runs at over 100 fps with a 704 by 704 pixel input resolution, leaving plenty of computational headroom for stereo matching.
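
A typical way to get a PyTorch detector into TensorRT, sketched below, is to export the trained model to ONNX and then build an engine from it on the Xavier. The model here is a tiny stand-in class, and this is only an outline of the general workflow, not our exact build scripts.

```python
import torch
import torch.nn as nn

class TinyDetectorStub(nn.Module):
    """Stand-in for the real Tiny-YOLOv3 variant (hypothetical, for illustration only)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.backbone(x)

model = TinyDetectorStub().eval()

# Export a fixed 704x704 input graph to ONNX so TensorRT can optimise it.
dummy_input = torch.zeros(1, 3, 704, 704)
torch.onnx.export(
    model,
    dummy_input,
    "cone_detector.onnx",
    input_names=["image"],
    output_names=["detections"],
    opset_version=11,
)

# On the Xavier, the ONNX file can then be turned into a TensorRT engine, e.g.:
#   trtexec --onnx=cone_detector.onnx --fp16 --saveEngine=cone_detector.engine
```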

One of the issues we're facing is the rolling shutter of our current camera. With a rolling shutter, the camera does not capture the whole frame at once: it reads out the 1st row of pixels starting from one side, then the 2nd, and so on down to the 1080th row. At low speeds, the time difference between capturing the top and bottom rows has an insignificant effect. At higher speeds, however, this becomes a critical failure, as the distortion it induces has a significant impact on the stereo matching. This can be seen in the following video.
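
A rough back-of-the-envelope calculation shows why. The per-row readout time below is an illustrative assumption, not a measured value for our camera, but it gives a feel for the scale of the distortion.

```python
# Rough illustration of rolling-shutter distortion (numbers are assumptions,
# not measured values for our camera).
rows = 1080
row_readout_time_s = 30e-6                   # assumed ~30 microseconds per row
frame_readout_s = rows * row_readout_time_s  # ~32 ms from first row to last row

for speed_mps in (2.0, 15.0):                # slow trundle vs. ~54 km/h
    displacement_m = speed_mps * frame_readout_s
    print(f"at {speed_mps:4.1f} m/s the car moves {displacement_m * 100:.1f} cm "
          "between the first and last row of the frame")
```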

The LiDAR emits infrared light and measures the return time-of-flight, giving us distances to the objects in the surrounding area. Unlike cameras, the LiDAR is robust to varying lighting and weather conditions, and it lets us detect cones reliably over a larger field of view. We use a bespoke cone detection algorithm that filters out the information we don't want and accepts a detection as a cone only if it appears in two layers, i.e. only if two horizontal LiDAR laser beams hit the cone. The vertical resolution of our current LiDAR limits how far away we can detect cones, because the layers diverge with distance. Consequently, we can only detect cones out to a maximum distance of about 8-10 m.
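
The arithmetic behind that range limit is simple: once the vertical gap between two adjacent beams grows larger than the cone, only one layer can hit it. The cone height and angular spacing below are typical values used purely for illustration, not our exact sensor specs.

```python
import math

cone_height_m = 0.325     # assumed height of a small track cone
layer_spacing_deg = 2.0   # assumed vertical angle between adjacent LiDAR layers

# Beyond this range, the vertical spread between adjacent beams exceeds the
# cone height, so the two-layer requirement can no longer be met.
max_range_m = cone_height_m / math.tan(math.radians(layer_spacing_deg))
print(f"two layers can both hit the cone out to roughly {max_range_m:.1f} m")  # ~9.3 m
```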

[Figure: LiDAR layers]

The cones surrounded by blue boxes appear in 4 and 2 layers of the LiDAR scan and are detected as cones. The cone surrounded by red appears in only 1 layer, so the algorithm can't be sure it's a cone and ignores it.

The LiDAR and camera cone detectors pass cone positions and type (cone colour and size) to the EKF SLAM algorithm. These are fused with GPS and magnetometer measurements, which give the absolute position and heading of the car (helping to counteract drift), and IMU measurements, which sense the transient motion the car experiences. Everything is connected together using the ROS framework. The SLAM node then outputs the state of the car in the world and all the cones detected, ready to be used by the path planning node.
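
Below is a heavily simplified sketch of the fusion idea: a 2D pose state is predicted forward from speed and yaw rate, then corrected by an absolute position-and-heading measurement. The real SLAM filter also carries every mapped cone in its state, and all the noise values here are placeholders, so this is only an illustration of the predict/update cycle, not our actual filter.

```python
import numpy as np

class MiniPoseEKF:
    """Toy EKF over the car pose [x, y, heading] only (the real SLAM state also
    contains the cone landmarks). Noise magnitudes below are placeholders."""

    def __init__(self):
        self.x = np.zeros(3)                   # [x_m, y_m, heading_rad]
        self.P = np.eye(3)                     # state covariance
        self.Q = np.diag([0.05, 0.05, 0.01])   # process noise (assumed)
        self.R = np.diag([0.5, 0.5, 0.05])     # GPS/magnetometer noise (assumed)

    def predict(self, v, omega, dt):
        """Propagate the pose with a unicycle model driven by speed and yaw rate."""
        x, y, th = self.x
        self.x = np.array([x + v * dt * np.cos(th),
                           y + v * dt * np.sin(th),
                           th + omega * dt])
        # Jacobian of the motion model with respect to the state
        F = np.array([[1.0, 0.0, -v * dt * np.sin(th)],
                      [0.0, 1.0,  v * dt * np.cos(th)],
                      [0.0, 0.0,  1.0]])
        self.P = F @ self.P @ F.T + self.Q

    def update_gps_heading(self, z):
        """Correct with an absolute [x, y, heading] measurement (GPS + magnetometer)."""
        H = np.eye(3)                          # measurement observes the state directly
        innovation = z - H @ self.x
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ innovation
        self.P = (np.eye(3) - K @ H) @ self.P

ekf = MiniPoseEKF()
ekf.predict(v=5.0, omega=0.2, dt=0.02)                 # IMU/odometry step
ekf.update_gps_heading(np.array([0.11, 0.02, 0.005]))  # absolute fix
print(ekf.x)
```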

by Georgia Ovenden and Jack Coleman