Waypoint

A Question of Perception — Talking Autonomous Robot Navigation with Prof. Davide Scaramuzza

The head of the University of Zurich’s Robotics and Perception Group, Professor Davide Scaramuzza, works at the intersection of computer vision and control—using cutting-edge sensors and algorithms to help terrestrial and flying robots (and, possibly, future spaceships) navigate 100% autonomously. Waypoint caught up with him to learn more.

Hello professor, why don’t you start by telling us a little about yourself?

Sure. I am Assistant Professor of Robotics at the University of Zurich, where I lead the Robotics and Perception Group (RPG). This group is currently made up of 12 people, all with engineering and science backgrounds. I received my PhD at ETH Zurich under the supervision of Roland Siegwart and then did a postdoc at the University of Pennsylvania under the supervision of Kostas Daniilidis and Vijay Kumar.

From 2009 to 2012, I led the European project SFLY, which introduced the world’s first autonomous navigation of micro quadrotors in GPS-denied environments, using vision as the main sensor modality. The project involved five research groups across Europe; the popular Pixhawk drone autopilot and Ascending Technologies’ Firefly hexacopter were among SFLY’s key outcomes.

Professor Davide Scaramuzza leads the University of Zurich’s Robotics and Perception Group (RPG), which works at the intersection of computer vision and control. (Photo: Davide Scaramuzza.)

Could you give us a brief overview of RPG’s focus areas?

My main research interest is computer vision, applied to the autonomous navigation of visually-guided ground and micro flying robots. I’m interested in machines that actively move and interact with the environment, driven by visual input.

Another area I research is low-latency vision. The past fifty years of research have been dedicated to standard vision sensors, which output frames at regular time intervals. ‘Event-based’ vision sensors are a new class of sensors that imitate the human eye: every pixel is independent of the others and asynchronously sends information whenever its intensity signal changes over time.
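
To make that contrast concrete, here is an editor’s sketch (not based on any particular sensor’s API, with made-up event values) of what an asynchronous event stream looks like compared with fixed-rate frames. Each event carries only a timestamp, pixel coordinates and a polarity:

```python
from collections import namedtuple

# One asynchronous event: fired by a single pixel when its intensity changes
# by more than a threshold. (Hypothetical values, for illustration only.)
Event = namedtuple("Event", ["t_us", "x", "y", "polarity"])  # polarity: +1 brighter, -1 darker

event_stream = [
    Event(t_us=10, x=120, y=45, polarity=+1),
    Event(t_us=13, x=121, y=45, polarity=+1),
    Event(t_us=27, x=64, y=200, polarity=-1),
]

# A 30 fps frame-based camera delivers a full image only every ~33,000 us;
# an event camera lets the algorithm react to each pixel change within microseconds.
for ev in event_stream:
    # Process each event as soon as it arrives, e.g. to update a feature tracker.
    print(f"pixel ({ev.x},{ev.y}) changed at t={ev.t_us} us, polarity={ev.polarity}")
```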

These event-based sensors are really the technology of the future and will drastically change robot perception by enabling a new class of algorithms with latencies of just microseconds. They will enable robots to execute agile maneuvers that, so far, only birds could perform.

Event-based sensors are really the technology of the future

When you’re talking about event-based vision sensors, is your research looking at developing the sensors themselves or the algorithms used to control those sensors, or both?

Just the algorithms. We buy the sensors themselves from a company called iniLabs, actually a start-up out of the University of Zurich and ETH Zurich. It’s a very new sensor, completely different from today’s state-of-the-art cameras. It doesn’t stream frames, just a sequence of events, so very little theory has been developed for using this sensor.

We are living in an exciting time where basically we are developing new algorithms for a new sensor which, hopefully, will change robotics in the next five to ten years.

We are living in an exciting time, where basically we are developing new algorithms for a new sensor which, hopefully, will change robotics in the next five to ten years

It’s more than a fast camera: the sensor has virtually zero delay, a latency of just a few microseconds, so you can use it in applications where a fast response is needed, such as drones. Drones are among the most agile robot platforms you can use. They can accelerate faster than a car, and whereas a car is constrained to a road, a quadrotor can make very sharp changes of direction that a car never could. So, if I had to suggest another potential flying application for such technology in the future, besides drones, it would be spaceships! The automotive industry will also benefit strongly from this new sensor; the low latency would allow a car to detect obstacles, or people suddenly stepping into the street, more quickly than any camera, laser or radar could, thus saving more lives.

The Robotics and Perception Group is working with unmanned aerial vehicles, like that above, carrying a cutting-edge event-based vision sensor. (Photo: Alain Herzog)

If I had to suggest a potential application for such technology in the future, besides drones, it would be spaceships!

Why computer vision and why robots? What was the path that led you from school through to where you are today?

I started down the path that led me to study and do research in robotics when I was a kid. I was always fascinated by robot movies. My father used to tell me bedtime stories about robots. This led me to study electrical engineering and then do a PhD in robotics.

Most computer vision research has been dedicated to passive perception: perceiving the world from a user-defined set of camera views. Robot vision instead aims to actively control the robot, and thus the cameras, to accomplish a given task, as humans do every day. One of the main reasons we do not yet have autonomous robots is that they still rely on user-defined parameters and are not yet able to learn by themselves.

If we can start ‘up in the air’, let’s talk about your flying robot research. We understand that this is pushing autonomous operation forward by working on cutting-edge vision strategies, such as SLAM or VSLAM (Visual Simultaneous Localization and Mapping). Do you want to explain why vision is important and maybe help our less technical readers to understand what these techniques are all about? How do you explain SLAM, for example, to the person on the street?

Vision is the main sensing modality of all mammals and insects. Half of the primate cerebral cortex is dedicated to visual processing. And vision is so fascinating for researchers because we still don’t fully understand how it works!

Most mobile robotics research of the last 30 years has been dedicated to using exteroceptive sensors, in other words vision and lasers, to build maps of the environment. The reason is intuitive: when we visit a new place, we use a map to know where we are and how to get somewhere. A robot needs the same thing in order to travel autonomously from one place to another.

But how do we build a map if we don’t have one yet? And another question: given a map, how do we know where we are in that map? The answer to the former is ‘mapping’, while the answer to the latter is called ‘localisation’.

How do we build a map if we don’t have one yet? And given a map, how do we know where we are in that map?

We all know how to build a blueprint map of a house, using tape and a goniometer [an angle measuring instrument]. But translating this into a computer algorithm is not so simple.

In fact, when you start combining all your metric and angular measurements, you need to know exactly where those measurements were taken (localisation). But how do you know where in the map they were taken if you are still building the map? This chicken-and-egg problem is solved by alternating localisation and mapping, which is called ‘Simultaneous Localisation and Mapping’, or SLAM.
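
As a rough, editor-made illustration of that alternation, here is a deliberately naive toy in one dimension: a robot drives along a corridor with noisy odometry, corrects its pose using beacons it has already mapped, and adds newly seen beacons using its current pose estimate. Real SLAM systems use probabilistic estimators rather than this simple averaging; the numbers are made up.

```python
import random

# Toy 1D SLAM sketch: a robot moves along a corridor, measuring the signed
# offset to beacons whose positions are initially unknown.
true_beacons = [2.0, 5.0, 9.0]          # ground truth, unknown to the robot
beacon_map = {}                          # beacon id -> estimated position
pose = 0.0                               # estimated robot position

for step in range(1, 9):
    odom = 1.0 + random.gauss(0, 0.05)   # noisy "I moved about 1 m" reading
    pose += odom                         # predict the pose from odometry
    true_pose = step * 1.0

    for i, b in enumerate(true_beacons):
        if abs(b - true_pose) < 3.0:     # beacon is within sensor range
            r = (b - true_pose) + random.gauss(0, 0.02)  # noisy signed offset
            if i in beacon_map:
                # Localisation: correct the pose using a beacon already in the map.
                pose = 0.5 * pose + 0.5 * (beacon_map[i] - r)
            else:
                # Mapping: place a new beacon using the current pose estimate.
                beacon_map[i] = pose + r

print("estimated beacon positions:", beacon_map)
```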

We all know how to build a blueprint map of a house, using tape and a goniometer, but translating this into a computer algorithm is not so simple

If this is clear, the next question is, how can we use just a standard camera to build a map? The theory, called photogrammetry, has been around for more than a century.

And how do you explain photogrammetry itself?

To understand photogrammetry, just do this experiment: place your two thumbs in front of your eyes at different distances. Now, close one eye and move your head in front of the thumbs. You’ll see that the thumb that is closer to you appears to move more than the thumb that is farther away. This is known as the parallax effect. It means there is a relation between the apparent motion of objects and their relative distances.

So, for a camera, we need an algorithm to measure (track) the displacement of every pixel as the camera moves. Then we can use algebra and geometry to recover the positions of those points in space, which represent the map!
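
In the simplest case, that relation reduces to a one-line formula: for a pinhole camera translating sideways by a known baseline, a point’s depth is inversely proportional to how far its pixel moves. A small sketch under that assumption (focal length and baseline values are made up for illustration):

```python
# Depth from parallax under a pinhole-camera model: a point's pixel moves by
# 'disparity' pixels when the camera translates sideways by 'baseline' metres.
# depth = focal_length_px * baseline / disparity, so closer points move more.
focal_length_px = 500.0    # focal length expressed in pixels (illustrative)
baseline_m = 0.10          # sideways camera motion in metres (illustrative)

for disparity_px in [50.0, 10.0, 2.0]:
    depth_m = focal_length_px * baseline_m / disparity_px
    print(f"pixel displacement {disparity_px:4.0f} px  ->  depth {depth_m:5.1f} m")
```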

Back in April 2012, as mentioned in your TedX talk of the same year, you ran a trial project with Zurich firefighters using a swarm of three autonomous micro helicopters to survey an area and localise a victim. Have you carried out more such trials since then and how have your SLAM-related innovations advanced between then and now?

We’ve made tremendous progress since 2012! Now, my group and I are focusing on several new problems.

Most autonomous drones nowadays are still confined to controlled environments and slow trajectories. We are departing from these assumptions and working on algorithms that are robust to changes in illumination and wind (YouTube) and that can fly drones at unprecedented speeds of up to 70 km/h (YouTube).

There is a big issue in drone research nowadays. Pick any video of quadrotor drones performing aerobatic maneuvers: there are drones that have been shown playing ping pong, or you have Raffaello D’Andrea with 30 drones dancing around his head with lights, making a very nice show. But none of those are autonomous drones. They actually rely on external infrastructure such as GPS, external cameras or radio beacons on the floor.

In order to explore areas where you have never been before, like disaster zones, you cannot go there and install and calibrate external cameras; that is simply not possible. To get drones out of the research labs, we have to work on software, that is, algorithms that run on a computer onboard the drone, and they have to be fast enough to make decisions before the drone collides with an obstacle.

In order to explore areas where you have never been before, like disaster zones, you cannot go there and install and calibrate external cameras

So, on this front, we are currently working on lots of different things. One is localisation and mapping: where is the drone, and where is it relative to the other drones? With SLAM, each drone has to be able to build a map of its environment in which it can localise itself and the other drones.

Another thing we are working on is collision avoidance and interaction with the environment. You need to build a map that is dense enough, with all the holes filled in, that you can actually ‘see’ whether you can navigate through a certain area or, alternatively, whether you can land there. Typically in robotics, most people rely on sparse maps, with just a few points here and there, but those do not contain enough information to tell you whether you can actually move through that space or not.
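
As a toy illustration of why density matters (an editor’s sketch with a made-up grid, not the group’s actual mapping code): a fully filled-in occupancy grid lets a planner check directly whether a corridor is free, which a handful of sparse points could never tell you.

```python
import numpy as np

# Toy occupancy grid: 0 = free, 1 = occupied (cell size is illustrative).
grid = np.zeros((10, 10), dtype=int)
grid[4, 2:8] = 1          # a wall blocking part of row 4

def corridor_is_free(grid, row, col_start, col_end):
    """True if every cell along this straight corridor is known to be free."""
    return not grid[row, col_start:col_end].any()

print(corridor_is_free(grid, 4, 0, 10))  # False: the wall blocks this path
print(corridor_is_free(grid, 6, 0, 10))  # True: this row is entirely free
```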

Another thing we are working on is perception-aware motion planning: when the robot “doesn’t see very well”, it adapts its behaviour so as to resolve its “doubts”.

Then, we are also working on using the event-based sensors I told you about previously in order to perceive faster.

Additionally, our drones can now learn: in a recent project we taught drones to recognise forest trails in search of missing people! All these advances are crucial to enable one day the use of drones in search and rescue applications.

Working with algorithms, is the development process an iterative process of testing and tweaking?

Yes. When you work in robotics, you have to make a model of the sensors and a model of the motion of the robot, but you don’t have a perfect model of your environment. You’ve watched the movie The Matrix, right? That is a full-world simulation where everything is simulated and completely predictable. We don’t have such a perfect model of the environment. The environment is uncertain and dynamic: we cannot model when a wind gust will occur, when it will rain, when a person decides to cross a street, and so on. We can only make assumptions and predictions. It’s the same for the wind: most algorithms are developed in the lab, where there is no wind, and when you go outside you assume they will work within a certain range of wind. So we have to go through several test-and-tweak iterations as we learn through field tests.
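
A sketch of what that means in practice (editor’s toy example with invented numbers, not the group’s simulator): a motion model can only predict the drone’s next position up to unmodelled disturbances such as wind, so the assumed disturbance range is a parameter that field tests either confirm or force you to tweak.

```python
import random

# Toy 1D motion model with an unmodelled wind disturbance. The model assumes
# the wind stays within +/- 1 m/s; field tests tell us whether that holds.
position, velocity = 0.0, 2.0   # metres, metres/second (illustrative)
dt = 0.1                        # seconds per step

for _ in range(20):
    wind = random.uniform(-1.0, 1.0)    # disturbance we cannot predict exactly
    position += (velocity + wind) * dt  # true motion = model + disturbance

predicted = velocity * 20 * dt          # what the wind-free model expects
print(f"model predicts {predicted:.1f} m, actual {position:.2f} m")
```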

Professor Scaramuzza, thanks very much for your time.

You’re welcome.

===========================
About Professor Davide Scaramuzza

Professor Scaramuzza, born in Italy in 1980, is Assistant Professor of Robotics at the University of Zurich. He is also the founder and director of the university’s Robotics and Perception Group, where he develops cutting-edge research on low-latency vision and visually-guided micro aerial vehicles.

He received his PhD in Robotics and Computer Vision at ETH Zurich. He completed his postdoc at both ETH Zurich and the University of Pennsylvania. He led the European project SFLY between 2009 and 2012, which introduced the world’s first autonomous navigation of micro quadrotors in GPS-denied environments using vision as the main sensor modality. For his research contributions, he was awarded an SNSF-ERC Starting Grant, the IEEE Robotics and Automation Early Career Award, and a Google Research Award. He also co-authored the book, Introduction to Autonomous Mobile Robots (MIT Press, 2011) and he is co-founder of the Zurich-Eye startup, dedicated to the commercialisation of visual-inertial navigation solutions.
===========================
