Deep Learning and Newtonian Physics
Ohioma Eboreime, Deep Learning
Overview
Introduction
Methods
Method 1: Newtonian Neural Networks (N³): Objectives, Architecture, Results
Method 2: Phys-Net
Method 3: Visual Interaction Networks (VIN)
Introduction
Human perceptual ability enables the prediction of object dynamics and of what happens next from a static scene. This ability allows us to move beyond object detection and tracking to an inference of object behavior in a given environment.
Method 1: Newtonian Neural Networks
Objectives
Object dynamics and trajectory prediction in static images.
Understanding objects in terms of the forces acting on them and the long-term motion that results from these forces.
Figure 1: The object labelled "query object" represents the object of interest. The components of force acting on the query, as well as its expected motion, are to be inferred strictly from a static image.
Method 1: Newtonian Neural Networks
Architecture
Maps a single image to intermediate physical abstractions called "Newtonian scenarios".
"Newtonian scenarios" describe a number of simple, standard real-world motions.
Figure 2: Newtonian Scenarios. The Newtonian scenarios are rendered from 8 different azimuth angles using a game engine. Scenarios 6, 7, and 11 are symmetric across azimuth angles and are instead rendered from 3 different camera elevations. Scenarios 2 and 12 are the same across viewpoints with a 180° azimuth difference, so only 4 azimuths are needed for each. For scenario 5, only 1 viewpoint is considered (no motion). In total, 66 videos are obtained for the 12 Newtonian scenarios (6 × 8 + 2 × 4 + 3 × 3 + 1 = 66).
Method 1: Newtonian Neural Networks
Considerations
Motion prediction: This method focuses on the physics of the motion, not an inference of the object's most likely path.
Scene understanding: This work focuses on the dynamics of objects that are already stable as well as objects in motion.
Action recognition: This work focuses on estimating long-term motions, as opposed to predicting an action class.
Tracking: This approach is different from tracking, since tracking methods are not used for single-image reasoning.
Approach
Figuring out which Newtonian scenario explains the dynamics of the static image most accurately.
Finding the scenario that best matches the current state of the object in motion.
Method 1: Newtonian Neural Networks
This network uses two parallel CNNs:
Encoder for the visual cues (Image Row): Uses 2D CNNs to represent the static image.
Encoder for the Newtonian motions (Motion Row): Uses 3D CNNs to represent game-engine videos of Newtonian scenarios.
Figure 3: Newtonian Neural Network
Method 1: Newtonian Neural Networks
Image Row
Figure 4: Image Row
The image row consists of the following:
Five 2D convolutional layers
Two fully connected layers
The input has four channels (RGBM). The channel M is the mask channel that specifies the location of the query object by a bounding-box mask smoothed with a Gaussian kernel.
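As a concrete reference, here is a minimal PyTorch sketch of an image-row-style encoder. The layer widths, kernel sizes, and the 4096-dimensional descriptor are illustrative assumptions; the slide only specifies five 2D convolutional layers, two fully connected layers, and a 4-channel RGBM input.

```python
import torch
import torch.nn as nn

class ImageRow(nn.Module):
    """2D CNN over a 4-channel RGBM image (RGB + Gaussian-smoothed query mask).
    Layer widths are illustrative; the slide only fixes 5 conv + 2 FC layers."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 96, kernel_size=11, stride=4), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(feat_dim), nn.ReLU(),   # first FC layer
            nn.Linear(feat_dim, feat_dim),        # second FC layer -> image descriptor
        )

    def forward(self, rgbm):                      # rgbm: (B, 4, H, W)
        return self.fc(self.conv(rgbm))           # (B, 4096) image descriptor
```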
Method 1: Newtonian Neural Networks
Motion Row
Figure 5: Motion Row
A volumetric convolutional neural network that consists of:
Six 3D convolutional layers
One fully connected layer
Each frame has 10 channels (RGB, flow, depth, and surface normal).
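A matching hedged sketch of a motion-row-style 3D CNN. The channel counts and layer widths are assumptions; the slide only specifies six 3D convolutional layers, one fully connected layer, and 10 input channels per frame. Pooling is kept spatial-only here so that one descriptor per frame remains available for the matching layer.

```python
import torch
import torch.nn as nn

class MotionRow(nn.Module):
    """Volumetric (3D) CNN over a 10-frame game-engine video, with 10 channels
    per frame (RGB, flow, depth, surface normal). Widths are illustrative."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(10, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                       # pool space, keep time
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.LazyLinear(feat_dim)                  # the single FC layer

    def forward(self, video):              # video: (B, 10 channels, 10 frames, H, W)
        h = self.conv(video)               # (B, 256, 10, h', w')
        h = h.mean(dim=(3, 4))             # spatial average pooling, keep the time axis
        return self.fc(h.transpose(1, 2))  # (B, 10 frames, 4096): one descriptor per frame
```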
Method 1: Newtonian Neural Networks
Matching Layer
Takes as input the output of the image row and the output of the motion row.
The cosine similarity $c(x, y) = \frac{x \cdot y}{\|x\|\,\|y\| + \epsilon}$ is computed between all of the image descriptors and all of the 10 frames' descriptors of each of the 66 videos in the batch.
The output of the matching layer is 66 vectors, each with 10 dimensions (one per frame). The dimension with the maximum similarity value indicates the state of dynamics for each Newtonian scenario.
Figure 5: Matching Layer
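The matching layer can be sketched as follows. The shapes follow the slide (a 4096-d descriptor per image and a 4096-d descriptor for each of the 10 frames of the 66 scenario videos); the function name and epsilon value are assumptions.

```python
import torch
import torch.nn.functional as F

def matching_layer(img_desc, video_descs, eps=1e-8):
    """Cosine similarity between each image descriptor and every frame descriptor
    of every Newtonian-scenario video.

    img_desc:    (B, 4096)        image-row output
    video_descs: (66, 10, 4096)   motion-row output for the fixed batch of 66 videos
    returns:     (B, 66, 10)      similarity of each image to each frame of each video
    """
    x = F.normalize(img_desc, dim=-1, eps=eps)        # unit-norm image descriptors
    v = F.normalize(video_descs, dim=-1, eps=eps)     # unit-norm frame descriptors
    return torch.einsum('bd,vfd->bvf', x, v)          # cosine similarities

# The frame with the maximum similarity in each video indicates the state of
# dynamics for that Newtonian scenario:
# sims = matching_layer(x, v); state = sims.argmax(dim=-1)   # (B, 66)
```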
Method 1: Newtonian Neural Networks
Training
The loss is computed using the negative log likelihood:
$E = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]$
where $y_i$ is the ground truth of the input image and $p_i$ is the predicted probability.
A random batch of images is fed into the network, but with only the fixed batch of 66 videos across all iterations. This enables N³ to penalize the error over all of the Newtonian scenarios at each iteration.
Testing
To test, a single RGBM image is fed as input to obtain:
The underlying Newtonian scenario ($h$)
The matching state ($s^*$)
The predicted scenario ($h$) is the scenario with maximum confidence in the output. The matching state ($s^*$) is obtained by
$s^* = \arg\max_{s} \cos(\mathbf{x}, \mathbf{v}^{h}_{s})$
where $\mathbf{x}$ is the 4096×1 image descriptor and $\mathbf{v}^{h}$ is the 4096×10 video descriptor for Newtonian scenario $h$.
A long-term 3D motion path can be drawn for the query object by using the game-engine parameters (e.g. direction of velocity and force, 3D motion, and camera viewpoint) from the state ($s^*$) of Newtonian scenario ($h$). A code sketch of this matching step follows below.
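A hedged sketch of the training loss and the test-time selection described above. Squashing the per-scenario similarity through a sigmoid is an assumption made to turn similarities into probabilities; the slide only states that a negative log likelihood is used and that the scenario with maximum confidence and its best-matching frame are selected.

```python
import torch
import torch.nn.functional as F

def n3_loss(similarities, targets):
    """Per-scenario negative log likelihood, following the slide's formulation.

    similarities: (B, 66, 10)  matching-layer output
    targets:      (B, 66)      1 for the ground-truth scenario/viewpoint, else 0
    """
    # Confidence per video = best-matching frame, squashed to a probability (assumption).
    p = torch.sigmoid(similarities.max(dim=-1).values)        # (B, 66)
    return F.binary_cross_entropy(p, targets.float())

def n3_predict(similarities):
    """Test time: pick the scenario with maximum confidence and its matching state."""
    per_video, state = similarities.max(dim=-1)                # (B, 66) values, (B, 66) frame indices
    scenario = per_video.argmax(dim=-1)                        # predicted scenario h
    s_star = state.gather(1, scenario.unsqueeze(1)).squeeze(1) # matching state s* for that scenario
    return scenario, s_star
```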
Method 1: Newtonian Neural Networks
Results Figure 6: The expected motion of the object
Method 1: Newtonian Neural Networks
Results Figure 7: Visualization of the direction of net force and object velocity
Method 2: Phys-Net Objectives
Explores the ability of a deep feed-forward model to learn intuitive physics of objects. Figure 8: Block tower examples from the synthetic (left) and real (right) datasets
Method 2: Phys-Net Data Collection Synthetic
Generated vertical stacks of 2, 3, and 4 colored blocks in random configurations. Features such as block position, background texture, and lighting were randomized at each trial to improve the generalizability of the learned features. Each simulation records the outcome and captures screenshots as well as segmentation masks at 8 frames/sec.
Real Data
Four wooden cubes were fabricated and spray-painted red, green, blue, and yellow respectively. The blocks were manually stacked in configurations of 2, 3, and 4, and a camera was used to film the blocks falling at 60 frames/sec.
Method 2: Phys-Net Architecture
Figure 9: Architecture of the Phys-Net network
The CNN architectures here take an integrated approach: the lower layers perceive the arrangement of blocks, while the upper layers implicitly capture their inherent physics.
Method 2: Phys-Net Architecture
The ResNet-34 and GoogLeNet networks were trained on the fall-prediction task. The models were pre-trained on the ImageNet dataset, and the final linear layer was replaced with a single logistic output.
Figure 10: Fall Prediction Branch of the Phys-Net network
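A minimal sketch of that fall-prediction setup using torchvision (the `weights` argument requires torchvision ≥ 0.13): an ImageNet-pretrained ResNet-34 whose final linear layer is swapped for a single logistic output.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-34; replace the final linear layer with a single
# logistic output (probability that the block tower falls).
resnet = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
resnet.fc = nn.Sequential(
    nn.Linear(resnet.fc.in_features, 1),  # single output unit
    nn.Sigmoid(),                         # logistic probability of falling
)

resnet.eval()
image = torch.randn(1, 3, 224, 224)       # dummy input image
with torch.no_grad():
    p_fall = resnet(image)                # (1, 1) fall probability
```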
Method 2: Phys-Net Architecture
DeepMask networks are used to predict the segmentation trajectory of the falling blocks at multiple future times (0s, 1s, 2s, 4s) from an input image.
Each mask pixel is a multi-class classification across a background class and four foreground (block color) classes.
The binary mask head is replaced with a multi-class softmax and replicated N times for mask prediction at multiple points in time.
Figure 11: Mask Prediction Branch of the Phys-Net network
The training loss for the mask networks is the sum of a binary cross-entropy loss for fall prediction and a pixel-wise multi-class cross-entropy loss for each mask (see the sketch below).
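A sketch of the combined training loss described above. The tensor shapes (N mask heads, 5 classes per pixel) follow the slide; the function name and the use of raw logits are assumptions.

```python
import torch
import torch.nn.functional as F

def physnet_loss(fall_logit, fall_label, mask_logits, mask_labels):
    """Sum of a binary cross-entropy fall-prediction loss and a pixel-wise
    multi-class cross-entropy loss for each predicted mask.

    fall_logit:  (B, 1)            raw fall-prediction score
    fall_label:  (B, 1)            1.0 if the tower falls, else 0.0
    mask_logits: (B, N, 5, H, W)   N time steps, 5 classes (background + 4 block colors)
    mask_labels: (B, N, H, W)      integer class labels per pixel
    """
    loss = F.binary_cross_entropy_with_logits(fall_logit, fall_label)
    n_steps = mask_logits.shape[1]
    for t in range(n_steps):       # one mask head per predicted time step
        loss = loss + F.cross_entropy(mask_logits[:, t], mask_labels[:, t])
    return loss
```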
Method 2: Phys-Net Results
Phys-Net correctly predicts the fall direction for most synthetic examples, while on real examples it overestimates stability.
Figure 12: Phys-Net mask predictions for synthetic (left) and real (right) towers of 2, 3, and 4 blocks.
Method 2: Phys-Net Results
Here the models gain knowledge about block-tower dynamics rather than simply memorizing a mapping: there is only a very small degradation in performance on a tower size that is not shown during training.
Figure 13: Plots comparing Phys-Net accuracy to human performance on real (top) and synthetic (bottom) test examples.
Method 3: Visual Interaction Networks (VIN)
Objective
Learning the dynamics of a physical system from raw visual observations.
Visual Interaction Networks (VIN)
The VIN model can be trained from supervised data sequences consisting of input image frames and target object state values.
The VIN model is composed of two main components:
A visual encoder based on convolutional neural networks (CNNs)
A recurrent neural network (RNN) with an interaction network (IN)
Method 3: Visual Interaction Networks (VIN)
Visual Encoder
The visual encoder is a CNN that produces a state code from a sequence of 3 images. It takes each pair of consecutive frames from the sequence and outputs a candidate state code; the two resulting candidate state codes are aggregated.
N = number of objects in the scene
L = length of each state code slot (features)
Figure 14: Frame Pair Encoder. The frame pair encoder is a CNN that transforms two consecutive input frames into a state code. A sketch follows below.
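A rough PyTorch sketch of a frame-pair encoder and its aggregation into an (N, L) state code. The layer sizes, the channel-stacking of the two frames, and the linear aggregation are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class FramePairEncoder(nn.Module):
    """CNN that turns two consecutive RGB frames into a candidate state code of
    shape (N objects, L features). Architecture details are assumptions."""
    def __init__(self, n_objects, slot_len):
        super().__init__()
        self.n_objects, self.slot_len = n_objects, slot_len
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),   # two RGB frames stacked
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_slots = nn.Linear(64, n_objects * slot_len)

    def forward(self, frame_a, frame_b):                  # each (B, 3, H, W)
        h = self.cnn(torch.cat([frame_a, frame_b], dim=1))
        return self.to_slots(h).view(-1, self.n_objects, self.slot_len)

class VisualEncoder(nn.Module):
    """Encodes a sequence of 3 frames: the two candidate state codes from the
    consecutive pairs are aggregated (here with a shared linear layer)."""
    def __init__(self, n_objects, slot_len):
        super().__init__()
        self.pair_encoder = FramePairEncoder(n_objects, slot_len)
        self.aggregate = nn.Linear(2 * slot_len, slot_len)

    def forward(self, f1, f2, f3):
        c1 = self.pair_encoder(f1, f2)                    # (B, N, L)
        c2 = self.pair_encoder(f2, f3)                    # (B, N, L)
        return self.aggregate(torch.cat([c1, c2], dim=-1))   # (B, N, L) state code
```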
Method 3: Visual Interaction Networks (VIN)
Dynamics Encoder
The dynamics encoder is an RNN that produces a candidate state code for a frame from a previous sequence of state codes.
The main difference between this and a vanilla IN is that it aggregates over multiple temporal offsets.
Figure 14: Dynamics Encoder; an Interaction Net (IN)
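For reference, a minimal interaction-network core of the kind the dynamics encoder builds on: a relation MLP applied to every ordered pair of object slots, summed per receiver, followed by an object MLP. The MLP sizes are assumptions, and the multi-temporal-offset aggregation mentioned above is omitted for brevity.

```python
import torch
import torch.nn as nn

class InteractionNet(nn.Module):
    """Minimal IN core: relation MLP over ordered slot pairs, effects summed
    per receiver, then an object MLP updates each slot (self-pairs included
    for simplicity)."""
    def __init__(self, slot_len, hidden=64):
        super().__init__()
        self.relation = nn.Sequential(
            nn.Linear(2 * slot_len, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.object = nn.Sequential(
            nn.Linear(slot_len + hidden, hidden), nn.ReLU(), nn.Linear(hidden, slot_len))

    def forward(self, slots):                               # slots: (B, N, L)
        B, N, L = slots.shape
        senders = slots.unsqueeze(1).expand(B, N, N, L)     # slot j, for each receiver i
        receivers = slots.unsqueeze(2).expand(B, N, N, L)   # slot i, repeated over j
        effects = self.relation(torch.cat([receivers, senders], dim=-1)).sum(dim=2)
        return self.object(torch.cat([slots, effects], dim=-1))   # updated (B, N, L)
```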
Method 3: Visual Interaction Networks (VIN)
Architecture
Figure 15: Visual Interaction Network (VIN)
State Decoder
The state decoder is simply a linear layer with input size L (features) and output size 4 (for a position/velocity vector). This linear layer is applied independently to each slot of the state code.
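The state decoder is small enough to show directly; the slot length L = 64 below is an assumed value.

```python
import torch
import torch.nn as nn

L = 64                                   # assumed slot length (features per object)
state_decoder = nn.Linear(L, 4)          # 4 outputs: position (x, y) and velocity (vx, vy)

state_code = torch.randn(8, 3, L)        # (batch, N objects, L features)
states = state_decoder(state_code)       # (8, 3, 4): nn.Linear acts on the last dimension,
                                         # so each slot is decoded independently
```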
Method 3: Visual Interaction Networks (VIN)
Training and Testing
In each system the force law is applied pair-wise to all objects, and all objects have the same mass and density (a sketch of one such pairwise law follows below).
Spring: Each pair of objects has an invisible spring connection.
Gravity: Objects are massive and obey Newton's law of gravity.
Billiards: No long-distance forces are present, but the billiards bounce off each other and off the boundaries of the field of vision.
Magnetic Billiards: All billiards are positively charged, so instead of bouncing they repel each other according to Coulomb's law. They still bounce off the boundaries.
Drift: No forces of any kind are present; objects drift with their initial velocities.
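As an illustration of a pairwise force law like the "Spring" system above, a minimal NumPy sketch. The spring constant, rest length, and the absence of damping are assumptions; the paper's simulation parameters are not given here.

```python
import numpy as np

def spring_forces(pos, k=1.0, rest_len=1.0):
    """Pairwise spring force law: every pair of objects is connected by an
    (invisible) spring of stiffness k and rest length rest_len.

    pos: (N, 2) object positions
    returns: (N, 2) net force on each object
    """
    diff = pos[None, :, :] - pos[:, None, :]        # (N, N, 2) vectors from i toward j
    dist = np.linalg.norm(diff, axis=-1) + 1e-9     # avoid division by zero on the diagonal
    magnitude = k * (dist - rest_len)               # Hooke's law per pair
    np.fill_diagonal(magnitude, 0.0)                # an object exerts no force on itself
    return (magnitude[..., None] * diff / dist[..., None]).sum(axis=1)
```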
Method 3: Visual Interaction Networks (VIN)
Results Figure 16: Rollout Trajectories
Method 3: Visual Interaction Networks (VIN)
Results Figure 17: Performance
Questions and Discussion