Deep Learning and Newtonian Physics


Deep Learning and Newtonian Physics
Ohioma Eboreime

Overview
- Introduction
- Methods
  - Method 1: Newtonian Neural Networks (N³): objectives, architecture, results
  - Method 2: Phys-Net
  - Method 3: Visual Interaction Networks (VIN)

Introduction
Human perceptual ability enables us to predict object dynamics and what happens next from a single static scene. This ability lets us move beyond object detection and tracking to inferring how an object will behave in a given environment.

Method 1: Newtonian Neural Networks
Objectives
- Predict object dynamics and trajectories in static images.
- Understand objects in terms of the forces acting on them and the long-term motion that results from these forces.
Figure 1: The object labelled "query object" is the object of interest. The components of force acting on the query object, as well as its expected motion, are to be inferred strictly from a static image.

Method 1: Newtonian Neural Networks
Architecture
- Maps a single image to intermediate physical abstractions called "Newtonian scenarios".
- Newtonian scenarios describe a number of simple, standard real-world motions.
Figure 2: Newtonian Scenarios. The Newtonian scenarios are rendered from 8 different azimuth angles using a game engine. Scenarios 6, 7, and 11 are symmetric across azimuth angles and are instead rendered from 3 different camera elevations. Scenarios 2 and 12 look the same across viewpoints with a 180° azimuth difference, so only 4 azimuths are needed for each. For scenario 5 only one viewpoint is considered (no motion). In total this yields 6 x 8 + 3 x 3 + 2 x 4 + 1 = 66 videos for the 12 Newtonian scenarios.

Method 1: Newtonian Neural Networks
Considerations
- Motion prediction: this method focuses on the physics of the motion, not an inference of the object's most likely path.
- Scene understanding: this work focuses on the dynamics of objects that are already stable or moving.
- Action recognition: this work focuses on estimating long-term motions, as opposed to predicting an action class.
- Tracking: this approach differs from tracking, since tracking methods cannot reason from a single image.
Approach
- Figure out which Newtonian scenario explains the dynamics of the static image most accurately.
- Find the state within that scenario that matches the state of the object in motion.

Method 1: Newtonian Neural Networks
This network uses two parallel CNNs:
- An encoder for the visual cues (image row): uses 2D CNNs to represent the static image.
- An encoder for the Newtonian motions (motion row): uses 3D CNNs to represent the game-engine videos of the Newtonian scenarios.
Figure 3: Newtonian Neural Network

Method 1: Newtonian Neural Networks
Image Row
Figure 4: Image Row
The image row consists of:
- Five 2D convolutional layers
- Two fully connected layers
The input has four channels (RGBM). The channel M is a mask channel that specifies the location of the query object by a bounding-box mask smoothed with a Gaussian kernel, as sketched below.
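A minimal sketch of how such an RGBM input could be assembled, assuming a NumPy image and a known query bounding box (the Gaussian width sigma is an assumed value, not taken from the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_rgbm(image, box, sigma=5.0):
    """Stack a Gaussian-smoothed bounding-box mask onto an RGB image.

    image: H x W x 3 float array
    box:   (x1, y1, x2, y2) bounding box of the query object
    sigma: smoothing width (assumed value)
    """
    h, w, _ = image.shape
    mask = np.zeros((h, w), dtype=np.float32)
    x1, y1, x2, y2 = box
    mask[y1:y2, x1:x2] = 1.0                 # binary bounding-box mask
    mask = gaussian_filter(mask, sigma)      # smooth with a Gaussian kernel
    return np.dstack([image, mask])          # H x W x 4 (RGBM)
```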

Method 1: Newtonian Neural Networks
Motion Row
Figure 5: Motion Row
The motion row is a volumetric convolutional neural network consisting of:
- Six 3D convolutional layers
- One fully connected layer
Each frame has 10 channels (RGB, optical flow, depth, and surface normals); a sketch follows.
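The slide does not give the exact layer configuration, so the following is only an illustrative PyTorch sketch of a volumetric CNN with six 3D convolutions and one fully connected layer, producing one descriptor per frame; channel counts and kernel sizes are assumptions:

```python
import torch.nn as nn

class MotionRow(nn.Module):
    """Sketch of the motion row: six 3D convolutions, one FC layer."""
    def __init__(self, in_channels=10, descriptor_dim=4096):
        super().__init__()
        chans = [in_channels, 64, 128, 256, 256, 512, 512]  # assumed widths
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv3d(cin, cout, kernel_size=3, padding=1), nn.ReLU()]
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool3d((10, 1, 1))  # keep 10 temporal steps
        self.fc = nn.Linear(512, descriptor_dim)      # one descriptor per frame

    def forward(self, video):                         # video: (B, 10, T, H, W)
        x = self.pool(self.features(video))           # (B, 512, 10, 1, 1)
        x = x.squeeze(-1).squeeze(-1).transpose(1, 2) # (B, 10, 512)
        return self.fc(x)                             # (B, 10, 4096)
```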

Method 1: Newtonian Neural Networks
Matching Layer
- Takes as input the output of the image row and the output of the motion row.
- Computes the cosine similarity S(x, y) = (x · y) / (||x|| ||y|| + ε) between the image descriptor and the descriptors of all 10 frames of each of the 66 videos in the batch.
- The output of the matching layer is 66 vectors, each with 10 dimensions (one per frame). The dimension with the maximum similarity value indicates the state of the dynamics for each Newtonian scenario (see the sketch below).
Figure 6: Matching Layer
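A sketch of that matching computation under the shape assumptions on the slide (one image descriptor against 66 videos of 10 frame descriptors each):

```python
import torch

def matching_layer(img_desc, video_descs, eps=1e-8):
    """Cosine similarity of one image descriptor against every frame
    descriptor (a sketch, not the paper's implementation).

    img_desc:    (4096,)         image-row output x
    video_descs: (66, 10, 4096)  motion-row outputs v_h^i
    returns:     (66, 10)        S(x, v_h^i) for every scenario h, frame i
    """
    x = img_desc / (img_desc.norm() + eps)
    v = video_descs / (video_descs.norm(dim=-1, keepdim=True) + eps)
    return (v * x).sum(dim=-1)   # dot products of unit vectors

# For scenario h, the best-matching state is sim[h].argmax().
```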

Method 1: Newtonian Neural Networks
Training
- The loss is the negative log likelihood:
  E = -(1/n) sum_{i=1..n} [ P_i log P̂_i + (1 - P_i) log(1 - P̂_i) ]
  where P_i is the ground truth for the input image and P̂_i is the predicted probability.
- A random batch of images is fed into the network, together with the same fixed batch of 66 videos across all iterations. This enables N³ to penalize the error over all of the Newtonian scenarios at each iteration.
Testing
- A single RGBM image is fed as input to obtain the underlying Newtonian scenario h and the matching state s_h.
- The predicted scenario h is the scenario with maximum confidence in the output.
- The matching state is s_h = argmax_i Sim(x, v_h^i), where x is the 4096x1 image descriptor and v_h^i is the i-th column of the 4096x10 video descriptor for Newtonian scenario h.
- A long-term 3D motion path can then be drawn for the query object using the game-engine parameters (e.g. direction of velocity and force, 3D motion, and camera viewpoint) of state s_h of Newtonian scenario h.
A sketch of the training objective follows.
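A sketch of how that objective could look in PyTorch; reducing the 10 frame similarities to a per-scenario probability via a max and a sigmoid is an assumption made here for illustration:

```python
import torch
import torch.nn.functional as F

def n3_loss(sim, target):
    """Negative-log-likelihood training loss from the slide.

    sim:    (B, 66, 10) matching-layer similarities
    target: (B, 66)     ground-truth scenario labels P_i (0.0 or 1.0)
    """
    # Per-scenario confidence: best-matching frame, squashed to (0, 1).
    probs = torch.sigmoid(sim.max(dim=-1).values)
    return F.binary_cross_entropy(probs, target)
```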

Method 1: Newtonian Neural Networks
Results
Figure 7: The expected motion of the object

Method 1: Newtonian Neural Networks
Results
Figure 8: Visualization of the direction of net force and object velocity

Method 2: Phys-Net
Objectives
- Explore the ability of a deep feed-forward model to learn the intuitive physics of objects.
Figure 9: Block tower examples from the synthetic (left) and real (right) datasets

Method 2: Phys-Net
Data Collection
Synthetic
- Vertical stacks of 2, 3, and 4 colored blocks were generated in random configurations.
- Features such as block position, background texture, and lighting were randomized at each trial to improve the generalization of the learned features.
- Each simulation records the outcome and captures screenshots as well as segmentation masks at 8 frames/sec.
Real
- Four wooden cubes were fabricated and spray-painted red, green, blue, and yellow respectively.
- The blocks were manually stacked in configurations of 2, 3, and 4, and a camera filmed the blocks falling at 60 frames/sec.

Method 2: Phys-Net
Architecture
Figure 10: Architecture of the Phys-Net network
An integrated approach is used with the CNN architectures here: the lower layers perceive the arrangement of blocks, and the upper layers implicitly capture their inherent physics.

Method 2: Phys-Net
Architecture
- ResNet-34 and GoogLeNet networks were trained on the fall-prediction task.
- The models were pre-trained on the ImageNet dataset.
- The final linear layer was replaced with a single logistic output, as in the sketch below.
Figure 11: Fall Prediction Branch of the Phys-Net network
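A minimal sketch of that setup with torchvision (the training details beyond the slide are not shown):

```python
import torch.nn as nn
from torchvision import models

# ResNet-34 pre-trained on ImageNet, with the final linear layer replaced
# by a single logistic output: the probability that the tower falls.
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.fc = nn.Sequential(nn.Linear(model.fc.in_features, 1), nn.Sigmoid())
```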

Method 2: Phys-Net
Architecture
- DeepMask networks are used to predict the segmentation trajectory of falling blocks at multiple future times (0s, 1s, 2s, 4s) from an input image.
- Each mask pixel is a multi-class classification across a background class and four foreground (block color) classes.
- A binary mask head with a multi-class softmax is replicated N times for mask prediction at multiple points in time.
Figure 12: Mask Prediction Branch of the Phys-Net network
The training loss for the mask networks is the sum of a binary cross-entropy loss for fall prediction and a pixel-wise multi-class cross-entropy loss for each mask, as sketched below.
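A sketch of that combined loss, assuming N predicted masks with 5 classes per pixel (background plus four block colors):

```python
import torch.nn.functional as F

def physnet_loss(fall_logit, fall_label, mask_logits, mask_labels):
    """Fall-prediction BCE plus pixel-wise multi-class CE for each mask.

    fall_logit:  (B, 1)           fall-prediction output
    fall_label:  (B, 1)           1.0 if the tower falls, else 0.0
    mask_logits: (B, N, 5, H, W)  N future time steps, 5 classes per pixel
    mask_labels: (B, N, H, W)     integer class index per pixel
    """
    loss = F.binary_cross_entropy_with_logits(fall_logit, fall_label)
    b, n, c, h, w = mask_logits.shape
    loss += F.cross_entropy(mask_logits.reshape(b * n, c, h, w),
                            mask_labels.reshape(b * n, h, w))
    return loss
```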

Method 2: Phys-Net
Results
Phys-Net correctly predicts the fall direction for most synthetic examples, while on real examples it overestimates stability.
Figure 13: Phys-Net mask predictions for synthetic (left) and real (right) towers of 2, 3, and 4 blocks.

Method 2: Phys-Net
Results
- The models gain knowledge about block-tower dynamics rather than simply memorizing a mapping.
- Performance degrades only slightly on a tower size that is not shown during training.
Figure 14: Plots comparing Phys-Net accuracy to human performance on real (top) and synthetic (bottom) test examples.

Method 3: Visual Interaction Networks (VIN)
Objective
- Learn the dynamics of a physical system from raw visual observations.
Visual Interaction Networks (VIN)
- The VIN model can be trained from supervised data sequences consisting of input image frames and target object state values.
- The VIN model comprises two main components:
  - A visual encoder based on convolutional neural networks (CNNs)
  - A recurrent neural network (RNN) built around an interaction network (IN)

Method 3: Visual Interaction Networks (VIN)
Visual Encoder
- The visual encoder is a CNN that produces a state code from a sequence of 3 images.
- It takes each pair of consecutive frames from the 3-frame sequence and outputs a candidate state code; the two resulting candidate state codes are aggregated (see the sketch below).
- N = number of objects in the scene; L = length of each state-code slot (features).
Figure 15: Frame Pair Encoder. The frame pair encoder is a CNN that transforms two consecutive input frames into a state code.
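An illustrative sketch of the frame-pair encoding and aggregation; the layer sizes, the averaging as the aggregation, and all names are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FramePairEncoder(nn.Module):
    """Maps two stacked RGB frames to an (N objects, L features) state code."""
    def __init__(self, n_objects=3, code_len=64):
        super().__init__()
        self.n, self.l = n_objects, code_len
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, n_objects * code_len)

    def forward(self, f1, f2):                        # each frame: (B, 3, H, W)
        x = self.cnn(torch.cat([f1, f2], dim=1)).flatten(1)
        return self.fc(x).view(-1, self.n, self.l)    # (B, N, L)

def encode_triple(enc, f1, f2, f3):
    # Aggregate the two candidate state codes from (f1, f2) and (f2, f3).
    return 0.5 * (enc(f1, f2) + enc(f2, f3))
```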

Method 3: Visual Interaction Networks (VIN)
Dynamics Predictor
- The dynamics predictor is an RNN that produces a candidate state code for a frame from a sequence of previous state codes.
- The main difference from a vanilla IN is that it aggregates over multiple temporal offsets; a minimal IN step is sketched below.
Figure 16: Dynamics Predictor, an interaction network (IN)
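A minimal sketch of one interaction-network step (layer sizes assumed): a relation MLP processes every ordered pair of object slots, the effects on each receiving object are summed, and an object MLP updates each slot:

```python
import torch
import torch.nn as nn

class InteractionNet(nn.Module):
    """One IN step over a state code of shape (B, N objects, L features)."""
    def __init__(self, code_len=64, hidden=128):
        super().__init__()
        self.rel = nn.Sequential(nn.Linear(2 * code_len, hidden), nn.ReLU(),
                                 nn.Linear(hidden, code_len))
        self.obj = nn.Sequential(nn.Linear(2 * code_len, hidden), nn.ReLU(),
                                 nn.Linear(hidden, code_len))

    def forward(self, state):                          # state: (B, N, L)
        b, n, l = state.shape
        src = state.unsqueeze(2).expand(b, n, n, l)    # sender slots
        dst = state.unsqueeze(1).expand(b, n, n, l)    # receiver slots
        # Sum relation effects over all senders for each receiver.
        effects = self.rel(torch.cat([src, dst], -1)).sum(dim=1)  # (B, N, L)
        return self.obj(torch.cat([state, effects], -1))          # (B, N, L)
```

The VIN applies such steps to state codes at several temporal offsets and aggregates the resulting candidates, per the slide above.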

Method 3: Visual Interaction Networks (VIN)
Architecture
Figure 17: Visual Interaction Network (VIN)
State Decoder
- The state decoder is simply a linear layer with input size L (features) and output size 4 (a position/velocity vector).
- This linear layer is applied independently to each slot of the state code, as in the one-liner below.
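In code this is a single layer (L = 64 is an assumed slot length):

```python
import torch.nn as nn

L = 64                     # assumed state-code slot length
decoder = nn.Linear(L, 4)  # per-slot (position, velocity) readout
# State codes of shape (B, N, L) decode to (B, N, 4): one 4-vector per object.
```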

Method 3: Visual Interaction Networks (VIN)
Training and Testing
In each system the force law is applied pair-wise to all objects, and all objects have the same mass and density:
- Spring: each pair of objects is connected by an invisible spring.
- Gravity: objects are massive and obey Newton's law of gravity.
- Billiards: no long-distance forces are present, but the billiards bounce off each other and off the boundaries of the field of vision.
- Magnetic billiards: all billiards are positively charged, so instead of bouncing they repel each other according to Coulomb's law. They still bounce off the boundaries.
- Drift: no forces of any kind are present; objects drift with their initial velocities.
A minimal sketch of two of these pairwise force laws follows.
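A NumPy sketch of the spring and gravity laws applied pair-wise (constants and conventions are assumptions, not the paper's simulation parameters):

```python
import numpy as np

def spring_forces(pos, k=1.0, rest=1.0):
    """pos: (N, 2) positions; every pair is joined by an invisible spring."""
    diff = pos[None, :, :] - pos[:, None, :]                 # (N, N, 2)
    dist = np.linalg.norm(diff, axis=-1) + np.eye(len(pos))  # avoid /0 on diag
    f = k * (dist - rest)[..., None] * diff / dist[..., None]
    return f.sum(axis=1)                                     # net force per object

def gravity_forces(pos, mass, g=1.0):
    """Newtonian gravity: F_ij = g * m_i * m_j * r_ij / |r_ij|^3."""
    diff = pos[None, :, :] - pos[:, None, :]
    dist = np.linalg.norm(diff, axis=-1) + np.eye(len(pos))
    f = g * (mass[:, None] * mass[None, :] / dist**3)[..., None] * diff
    return f.sum(axis=1)
```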

Method 3: Visual Interaction Networks (VIN)
Results
Figure 18: Rollout Trajectories

Method 3: Visual Interaction Networks (VIN)
Results
Figure 19: Performance

Questions and Discussion