Robot Vision
Methods for Digital Image Processing
Every picture tells a story
Vision
The goal of computer vision is to write computer programs that can interpret images.
Human Vision
Can do amazing things like:
- Recognize people and objects
- Navigate through obstacles
- Understand the mood in a scene
- Imagine stories
But it is still not perfect:
- Suffers from illusions
- Ignores many details
- Gives an ambiguous description of the world
- Doesn't care about the accuracy of the world
Computer Vision What we see What a computer sees
Image: a two-dimensional array of pixels. The indices [i, j] of a pixel are integer values that specify its row and column in the array.
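As a minimal sketch (assuming NumPy), an image and its [i, j] indexing look like this:

```python
import numpy as np

# A grayscale image: a two-dimensional array of pixels.
# The indices [i, j] select row i and column j.
image = np.zeros((480, 640), dtype=np.uint8)  # 480 rows x 640 columns

image[100, 200] = 255      # set the pixel at row 100, column 200 to white
value = image[100, 200]    # read it back
print(image.shape, value)  # (480, 640) 255
```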
Gray level image vs. binary image
Components of a Computer Vision System: scene, lighting, camera, computer, scene interpretation.
Microsoft Kinect IR LED Emitter IR Camera RGB Camera
Face detection
Face detection
Many digital/mobile cameras detect faces (Canon, Sony, Fuji, …). Why would this be useful? The main reason is focus; it also enables "smart" cropping.
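As a sketch of how such in-camera detection can be reproduced in software (using OpenCV's bundled Haar cascade; the input file name is an assumption):

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")  # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Each detection is an (x, y, w, h) box; a camera would focus on these
# regions, or crop around them.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)
```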
Smile detection? Sony Cyber-shot® T70 Digital Still Camera
Face Recognition: Principal Component Analysis (PCA)
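A minimal eigenfaces-style sketch of PCA for face recognition (assuming NumPy; the array shapes and component count are illustrative assumptions):

```python
import numpy as np

def fit_pca(faces, n_components=20):
    # `faces` is an (n_images, n_pixels) array of flattened face images.
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Rows of Vt are the principal components ("eigenfaces").
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return mean, Vt[:n_components]

def project(face, mean, components):
    # A face is represented by its coordinates in eigenface space;
    # recognition compares these coordinates (e.g. nearest neighbour).
    return components @ (face - mean)
```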
Vision-based biometrics: "How the Afghan Girl was Identified by Her Iris Patterns". Read the story on Wikipedia.
Definition of robot vision: Robot vision may be defined as the process of extracting, characterizing, and interpreting information from images of a three-dimensional world.
Common reasons for failure of vision systems: small changes in the environment can result in significant variations in image data, e.g. changes in contrast or unexpected occlusion of features.
What Skills Do Robots Need?
- Identification: What/who is that? Object detection, recognition
- Movement: How do I move safely? Obstacle avoidance, homing
- Manipulation: How do I change that? Interacting with objects/environment
- Navigation: Where am I? Mapping, localization
Visual Skills: Identification
- Recognizing face/body/structure: Who/what do I see? Use shape, color, pattern, and other static attributes to distinguish the target from the background and from other hypotheses.
- Gesture/activity: What is it doing? From low-level motion detection & tracking to categorizing high-level temporal patterns.
- There is feedback between the static and dynamic analyses.
Visual Skills: Movement
Steering, foot placement, or landing spot for the entire vehicle.
(Figures: MAKRO sewer shape pattern; Demeter region boundary detection)
Visual Skills: Manipulation
Moving other things: grasping (e.g. the KTH door opener), pushing, digging, cranes.
(Figures: KTH robot & typical handle; Clodbusters push a box cooperatively)
Visual Skills: Navigation
Building a map; localization/place recognition: Where are you in the map?
(Figures: laser-based wall map (CMU); Minerva's ceiling map)
Binary Image Creation: popularly used in industrial robotics.
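A minimal thresholding sketch (assuming NumPy) that turns a grey-level image into a binary one; the threshold value is an assumption:

```python
import numpy as np

def binarize(gray, threshold=128):
    # Pixels brighter than the threshold become 1, the rest 0.
    return (gray > threshold).astype(np.uint8)

gray = np.random.randint(0, 256, (480, 640), dtype=np.uint8)  # stand-in image
binary = binarize(gray)
print(binary.min(), binary.max())  # 0 1
```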
Bits per Pixel
Color models
Color models for images: RGB, CMY. Color models for video: YIQ, YUV (YCbCr). Relationship between color models: e.g. luminance Y = 0.299 R + 0.587 G + 0.114 B, and C = 1 - R, M = 1 - G, Y = 1 - B for CMY.
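A small sketch of two of these relationships (standard BT.601 luminance weights; values normalized to [0, 1]):

```python
import numpy as np

def rgb_to_y(rgb):
    # Luminance Y from RGB (ITU-R BT.601 weighting).
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def rgb_to_cmy(rgb):
    # CMY is the complement of RGB.
    return 1.0 - rgb

rgb = np.random.rand(4, 4, 3)   # stand-in image, values in [0, 1]
print(rgb_to_y(rgb).shape)      # (4, 4)
print(rgb_to_cmy(rgb).shape)    # (4, 4, 3)
```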
Simplified diagram of camera to CPU interface
Interfacing Digital Cameras to CPU
Digital camera sensors are very complex units; in many respects they are themselves similar to an embedded controller chip. Some sensors buffer camera data and allow slow reading via handshake (ideal for slow microprocessors). Most sensors send the full image as a stream after a start signal (the CPU must be fast enough to read it, or use a hardware buffer).
Idea: use a FIFO as the image data buffer. A FIFO is similar to dual-ported RAM; it is required since there is no synchronization between camera and CPU. An interrupt service routine then reads the FIFO until it is empty.
Vision Sensors Single Perspective Camera
Vision Sensors Multiple Perspective Cameras (e.g. Stereo Camera Pair)
There are several good approaches to detecting objects:
1) Model-based vision. We can store models of line drawings of objects (from many possible angles, and at many different possible scales!) and then compare those with all possible combinations of edges in the image. Notice that this is a very computationally intensive and expensive process.
2) Motion vision. We can take advantage of motion. If we look at an image at two consecutive time-steps, and we move the camera in between, each continuous solid object (which obeys physical laws) will move as one. This gives us a hint for finding objects: subtract the two images from each other. But notice that this depends on knowing how we moved the camera relative to the scene (direction, distance), and that nothing else was moving in the scene at the time.
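A minimal image-subtraction sketch (assuming NumPy; the change threshold is an assumption):

```python
import numpy as np

def motion_mask(frame_t0, frame_t1, threshold=30):
    # Subtract two consecutive grey-level frames: pixels that changed
    # significantly likely belong to a moving object. This assumes the
    # camera motion between frames is known/compensated and nothing
    # else moved in the scene.
    diff = np.abs(frame_t1.astype(np.int16) - frame_t0.astype(np.int16))
    return (diff > threshold).astype(np.uint8)
```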
Clever special tricks that work: to do object recognition, it is possible to simplify the vision problem in various ways:
1) Use color: look for specifically and uniquely colored objects, and recognize them that way (such as stop signs, for example); see the sketch below.
2) Use a small image plane: instead of a full 512 x 512 pixel array, we can reduce our view to much less. Of course there is much less information in the image, but if we are clever, and know what to expect, we can process what we see quickly and usefully.
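A sketch of trick 1 (using OpenCV; the HSV bounds for "red" and the input file name are illustrative assumptions):

```python
import cv2
import numpy as np

img = cv2.imread("scene.jpg")                 # hypothetical input image
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Keep only pixels in an assumed "stop-sign red" hue range.
mask = cv2.inRange(hsv, np.array([0, 120, 70]), np.array([10, 255, 255]))

# Connected regions of the mask are candidate objects.
count, labels = cv2.connectedComponents(mask)
print(count - 1, "red blob(s) found")
```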
Smart tricks continued:
3) Use other, simpler and faster sensors, and combine them with vision. IR cameras isolate people by body temperature; grippers allow us to touch and move objects, after which we can be sure they exist.
4) Use information about the environment: if you know you will be driving on a road that has white lines, look specifically for those lines at the right places in the image. This is how the first, and still the fastest, road and highway robotic driving was done.
SLAM: Simultaneous Localization and Mapping
A robot is exploring an unknown, static environment.
Given: the robot's controls and observations of nearby features; both the controls and the observations are noisy.
Estimate: the location of the robot (localization: Where am I?) and a detailed map of the environment (mapping: What does the world look like?).
Objective: determination of the pose (= position + orientation) of a mobile robot in a known environment in order to successfully perform a given task.
The SLAM Problem SLAM is a chicken-or-egg problem: → A map is needed for localizing a robot → A pose estimate is needed to build a map Thus, SLAM is (regarded as) a hard problem in robotics
SLAM Applications Indoors Undersea Space Underground
SLAM consists of multiple parts: landmark extraction, data association, state estimation, state update, and landmark update. There are many ways to solve each of these smaller parts.
Hardware: a mobile robot and a range measurement device.
- Laser scanner: cannot be used underwater
- Sonar: not accurate
- Vision: cannot be used in a room with no light
Mobile Robot Mapping: What does the world look like? The robot is initially unaware of its environment; it must explore the world and determine its structure. Most often, this is combined with localization: the robot must update its own location with respect to the landmarks. This is known in the literature as Simultaneous Localization and Mapping (SLAM), or Concurrent Localization and Mapping (CLM). Example: AIBOs are placed in an unknown environment and must learn the locations of the landmarks. (An interesting project idea?)
Localization as an estimation problem: Notation
- Robot pose: x_t = (x, y, θ)
- Robot poses from time 0 to time t: x_{0:t}
- Robot exteroceptive measurements from time 1 to time t: z_{1:t}
- Motion commands (or proprioceptive measurements) from time 0 to time t: u_{0:t}
The robot motion model is the pdf of the robot pose at time t+1 given the robot pose and the motion action at time t; it takes into account the noise characterizing the proprioceptive sensors:
p(x_{t+1} | x_t, u_t)
The measurement model describes the probability of observing at time t a given measurement z_t when the robot pose is x_{r,t}; it takes into account the noise characterizing the exteroceptive sensors:
p(z_t | x_{r,t})
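A minimal sketch of both models for a planar robot with pose (x, y, θ); the Gaussian noise levels and the range-only landmark measurement are assumptions:

```python
import numpy as np

def sample_motion_model(pose, u, rng):
    # Sample from p(x_{t+1} | x_t, u_t): move by (distance, rotation)
    # plus proprioceptive (odometry) noise.
    x, y, theta = pose
    dist, rot = u
    theta_new = theta + rot + rng.normal(0.0, 0.02)
    noisy_dist = dist + rng.normal(0.0, 0.05)
    return np.array([x + noisy_dist * np.cos(theta_new),
                     y + noisy_dist * np.sin(theta_new),
                     theta_new])

def measurement_likelihood(z, pose, landmark, sigma=0.1):
    # p(z_t | x_{r,t}) for a measured range z to a known landmark,
    # with Gaussian exteroceptive sensor noise.
    expected = np.linalg.norm(landmark - pose[:2])
    return np.exp(-0.5 * ((z - expected) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
pose = sample_motion_model(np.zeros(3), (1.0, 0.1), rng)
print(measurement_likelihood(1.0, pose, np.array([2.0, 0.0])))
```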
SLAM: Simultaneous Localization and Mapping
Full SLAM estimates the entire path and the map:
p(x_{1:t}, m | z_{1:t}, u_{1:t})
Online SLAM estimates only the most recent pose and the map:
p(x_t, m | z_{1:t}, u_{1:t}) = ∫ … ∫ p(x_{1:t}, m | z_{1:t}, u_{1:t}) dx_1 dx_2 … dx_{t-1}
The integrations (marginalization) are typically done one at a time.
Localization Basics
- Several cameras pointing straight down, fitted with ultra wide angle lenses
- An instance of Mezzanine (USC) per camera "finds" the fiducial pairs atop each robot
- Removes barrel distortion ("dewarps")
- Reported positions are aggregated into tracks
- But the fiducials are identical, so robots are identified via their commanded motion pattern
Localization: Better Dewarping
- Mezzanine's supplied dewarp algorithm was unstable (10-20 cm error)
- Model barrel distortion using a cosine function: loc_world = loc_image / cos(α · w), where α is the angle between the optical axis and the fiducial
- Added interpolative error correction
- Result: ~1 cm maximum location error; no need to account for more complex distortion, even for very cheap lenses
An interesting problem is that, when dewarping, we are converting from d_image to d_world, but to get the angle α exactly we would need to already know the dewarped world coordinates. Our algorithm therefore iterates, using each d_world approximation (and the simple geometry of cameras pointing down) to calculate a new angle α and hence a new d_world, converging in fewer than eight iterations.
The approximation function is a poor fit for large amounts of distortion: a grid generated with it would tilt/bend globally if we added enough control points to fit our level of barrel distortion.
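A sketch of the iteration (the camera height and the flat-floor geometry used to compute α are assumptions):

```python
import math

def dewarp(loc_image, w, camera_height, iterations=8):
    # Solve loc_world = loc_image / cos(alpha * w) iteratively: alpha
    # depends on the world position we are solving for, so each d_world
    # estimate gives a new alpha and hence a new d_world.
    loc_world = loc_image                 # initial guess: no distortion
    for _ in range(iterations):
        alpha = math.atan2(loc_world, camera_height)  # downward-camera geometry
        loc_world = loc_image / math.cos(alpha * w)
    return loc_world

print(dewarp(1.0, w=0.5, camera_height=2.5))
```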
To set this room as the goal, we associate a reward value with each door (i.e., each link between nodes). Doors that lead immediately to the goal have an instant reward of 100; other doors, not directly connected to the target room, have zero reward.
The -1's in the table represent null values (i.e., where there isn't a link between nodes). For example, state 0 cannot go to state 1.
Q Matrix: Think of the matrix "Q" as the brain of our agent, representing the memory of what it has learned through experience. The rows of Q represent the current state of the agent, and the columns represent the possible actions leading to the next state (the links between the nodes). The agent starts out knowing nothing, so Q is initialized to zero. If we didn't know how many states were involved, Q could start out with only one element; it is a simple task to add more columns and rows to Q as new states are found.
Learning rule:
Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
where Gamma is the learning parameter and Max[Q(next state, all actions)] is the maximum Q value over all possible actions in the next state.
Algorithm:
1. Initialize matrix Q to zero.
2. Select a random initial state.
3. Do while the goal state hasn't been reached:
   a. Select one among all possible actions for the current state.
   b. Using this possible action, consider going to the next state.
   c. Get the maximum Q value for this next state, based on all possible actions.
   d. Compute Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)].
   e. Set the next state as the current state.
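A runnable sketch of this loop. The 6-state R matrix below is an assumption laid out to match the text (-1 = no link, 0 = door, 100 = door leading straight to the goal state 5), and episodes here run a fixed number of steps rather than stopping exactly at the goal, so that the goal row of Q is also updated and Q values grow to the maximum of 500 mentioned later:

```python
import numpy as np

# Assumed reward matrix consistent with the text: -1 = no link,
# 0 = door, 100 = door leading directly to the goal (state 5).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
GAMMA = 0.8                              # the learning parameter Gamma
Q = np.zeros_like(R, dtype=float)        # the agent starts out knowing nothing

rng = np.random.default_rng(0)
for _ in range(500):                     # episodes
    state = rng.integers(0, 6)           # random initial state
    for _ in range(20):                  # steps per episode (assumption)
        actions = np.flatnonzero(R[state] >= 0)   # possible actions
        action = rng.choice(actions)               # select one at random
        # Q(state, action) = R(state, action) + Gamma * Max[Q(next state, ·)]
        Q[state, action] = R[state, action] + GAMMA * Q[action].max()
        state = action                             # next state becomes current

print((Q / Q.max() * 100).round())       # normalized to percentages
```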
Look at the second row (state 1) of matrix R. There are two possible actions for the current state 1: go to state 3, or go to state 5. By random selection, we select going to state 5 as our action. Since matrix Q is still initialized to zero, Q(5, 1), Q(5, 4), and Q(5, 5) are all zero. The result of this computation for Q(1, 5) is therefore 100, because of the instant reward R(1, 5) = 100.
For the next episode, we start with a randomly chosen initial state; this time, we have state 3 as our initial state. We use the updated matrix Q from the last episode: Q(1, 3) = 0 and Q(1, 5) = 100. The result of the computation is Q(3, 1) = R(3, 1) + 0.8 * Max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * 100 = 80, since the instant reward R(3, 1) is zero. The matrix Q becomes:
This matrix Q can then be normalized (i.e., converted to percentages) by dividing all non-zero entries by the highest number (500 in this case):
For example, from initial state 2, the agent can use the matrix Q as a guide: From state 2, the maximum Q value suggests the action of going to state 3. From state 3, the maximum Q values suggest two alternatives: go to state 1 or state 4. Suppose we arbitrarily choose to go to state 1. From state 1, the maximum Q value suggests the action of going to state 5. Thus the sequence is 2 - 3 - 1 - 5.
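Continuing the sketch above (Q is the matrix trained earlier), extracting this greedy path takes a few lines; note that on ties np.argmax picks the lowest index, which is why 2 - 3 - 1 - 5 is found rather than 2 - 3 - 4 - 5:

```python
import numpy as np

def best_path(Q, start, goal):
    # Follow the action with the highest Q value until the goal is reached.
    path = [start]
    state = start
    while state != goal:
        state = int(np.argmax(Q[state]))
        path.append(state)
    return path

print(best_path(Q, 2, 5))  # [2, 3, 1, 5]
```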