Deep Learning and Newtonian Physics
Ohioma Eboreime, Deep Learning
Overview
Introduction
Methods
Method 1: Newtonian Neural Networks (N³): Objectives, Architecture, Results
Method 2: Phys-Net
Method 3: Visual Interaction Networks (VIN)
Introduction
Human perceptual ability enables the prediction of object dynamics and of what happens next from a static scene. This ability allows us to move beyond object detection and tracking to an inference of object behavior in a given environment.
Method 1: Newtonian Neural Networks
Objectives
Object dynamics and trajectory prediction in static images.
Understanding objects in terms of the forces acting on them and the long-term motion that results from these forces.
Figure 1: The object labelled "query object" represents the object of interest. The components of force acting on the query, as well as its expected motion, are to be inferred strictly from a static image.
Method 1: Newtonian Neural Networks
Architecture
Maps a single image to intermediate physical abstractions called "Newtonian scenarios".
"Newtonian scenarios" describe a number of simple, standard real-world motions.
Figure 2: Newtonian Scenarios. The Newtonian scenarios are rendered from 8 different azimuth angles using a game engine. Scenarios 6, 7, and 11 are symmetric across azimuth angles and are instead rendered from 3 different camera elevations. Scenarios 2 and 12 are the same across viewpoints with a 180° azimuth difference, so only 4 azimuths are needed for each. For scenario 5, only 1 viewpoint is considered (no motion). In total, 66 videos are obtained for the 12 Newtonian scenarios (6 × 8 + 2 × 4 + 3 × 3 + 1 = 66).
Method 1: Newtonian Neural Networks
Considerations
Motion prediction: This method focuses on the physics of the motion, not an inference of the object's most likely path.
Scene understanding: This work focuses on the dynamics of objects that are already stable as well as objects in motion.
Action recognition: This work focuses on estimating long-term motions, as opposed to predicting an action class.
Tracking: This approach is different from tracking, since tracking methods are not used for single-image reasoning.
Approach
Figuring out which Newtonian scenario explains the dynamics of the static image most accurately.
Finding the scenario that best matches the current state of the object in motion.
Method 1: Newtonian Neural Networks
This network uses two parallel CNNs:
Encoder for the visual cues (Image Row): Uses 2D CNNs to represent the static image.
Encoder for the Newtonian motions (Motion Row): Uses 3D CNNs to represent game-engine videos of Newtonian scenarios.
Figure 3: Newtonian Neural Network
Method 1: Newtonian Neural Networks
Image Row
Figure 4: Image Row
The image row consists of the following:
Five 2D convolutional layers
Two fully connected layers
The input has four channels (RGBM). The channel M is the mask channel that specifies the location of the query object by a bounding-box mask smoothed with a Gaussian kernel.
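As a concrete reference, here is a minimal PyTorch sketch of an image-row-style encoder. The layer widths, kernel sizes, and the 4096-dimensional descriptor are illustrative assumptions; the slide only specifies five 2D convolutional layers, two fully connected layers, and a 4-channel RGBM input.

```python
import torch
import torch.nn as nn

class ImageRow(nn.Module):
    """2D CNN over a 4-channel RGBM image (RGB + Gaussian-smoothed query mask).
    Layer widths are illustrative; the slide only fixes 5 conv + 2 FC layers."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 96, kernel_size=11, stride=4), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(feat_dim), nn.ReLU(),   # first FC layer
            nn.Linear(feat_dim, feat_dim),        # second FC layer -> image descriptor
        )

    def forward(self, rgbm):                      # rgbm: (B, 4, H, W)
        return self.fc(self.conv(rgbm))           # (B, 4096) image descriptor
```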
Method 1: Newtonian Neural Networks
Motion Row
Figure 5: Motion Row
A volumetric convolutional neural network that consists of:
Six 3D convolutional layers
One fully connected layer
Each frame has 10 channels (RGB, flow, depth, and surface normal).
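A matching hedged sketch of a motion-row-style 3D CNN. The channel counts and layer widths are assumptions; the slide only specifies six 3D convolutional layers, one fully connected layer, and 10 input channels per frame. Pooling is kept spatial-only here so that one descriptor per frame remains available for the matching layer.

```python
import torch
import torch.nn as nn

class MotionRow(nn.Module):
    """Volumetric (3D) CNN over a 10-frame game-engine video, with 10 channels
    per frame (RGB, flow, depth, surface normal). Widths are illustrative."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(10, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                       # pool space, keep time
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(128, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.LazyLinear(feat_dim)                  # the single FC layer

    def forward(self, video):              # video: (B, 10 channels, 10 frames, H, W)
        h = self.conv(video)               # (B, 256, 10, h', w')
        h = h.mean(dim=(3, 4))             # spatial average pooling, keep the time axis
        return self.fc(h.transpose(1, 2))  # (B, 10 frames, 4096): one descriptor per frame
```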
Method 1: Newtonian Neural Networks
Matching Layer
Takes as input the output of the image row and the output of the motion row.
The cosine similarity $c(x, y) = \frac{x \cdot y}{\|x\|\,\|y\| + \epsilon}$ is computed between all of the image descriptors and all of the 10 frames' descriptors of each of the 66 videos in the batch.
The output of the matching layer is 66 vectors, each with 10 dimensions (one per frame). The dimension with the maximum similarity value indicates the state of dynamics for each Newtonian scenario.
Figure 5: Matching Layer
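The matching layer can be sketched as follows. The shapes follow the slide (a 4096-d descriptor per image and a 4096-d descriptor for each of the 10 frames of the 66 scenario videos); the function name and epsilon value are assumptions.

```python
import torch
import torch.nn.functional as F

def matching_layer(img_desc, video_descs, eps=1e-8):
    """Cosine similarity between each image descriptor and every frame descriptor
    of every Newtonian-scenario video.

    img_desc:    (B, 4096)        image-row output
    video_descs: (66, 10, 4096)   motion-row output for the fixed batch of 66 videos
    returns:     (B, 66, 10)      similarity of each image to each frame of each video
    """
    x = F.normalize(img_desc, dim=-1, eps=eps)        # unit-norm image descriptors
    v = F.normalize(video_descs, dim=-1, eps=eps)     # unit-norm frame descriptors
    return torch.einsum('bd,vfd->bvf', x, v)          # cosine similarities

# The frame with the maximum similarity in each video indicates the state of
# dynamics for that Newtonian scenario:
# sims = matching_layer(x, v); state = sims.argmax(dim=-1)   # (B, 66)
```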
Method 1: Newtonian Neural Networks
Training
The loss is computed using the negative log likelihood:
$E = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]$
where $y_i$ is the ground truth of the input image and $p_i$ is the predicted probability.
A random batch of images is fed into the network, but with only the fixed batch of 66 videos across all iterations. This enables N³ to penalize the error over all of the Newtonian scenarios at each iteration.
Testing
To test, a single RGBM image is fed as input to obtain:
The underlying Newtonian scenario ($h$)
The matching state ($s^*$)
The predicted scenario ($h$) is the scenario with maximum confidence in the output. The matching state ($s^*$) is obtained by
$s^* = \arg\max_{s} \cos(\mathbf{x}, \mathbf{v}^{h}_{s})$
where $\mathbf{x}$ is the 4096×1 image descriptor and $\mathbf{v}^{h}$ is the 4096×10 video descriptor for Newtonian scenario $h$.
A long-term 3D motion path can be drawn for the query object by using the game-engine parameters (e.g. direction of velocity and force, 3D motion, and camera viewpoint) from the state ($s^*$) of Newtonian scenario ($h$). A code sketch of this matching step follows below.
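A hedged sketch of the training loss and the test-time selection described above. Squashing the per-scenario similarity through a sigmoid is an assumption made to turn similarities into probabilities; the slide only states that a negative log likelihood is used and that the scenario with maximum confidence and its best-matching frame are selected.

```python
import torch
import torch.nn.functional as F

def n3_loss(similarities, targets):
    """Per-scenario negative log likelihood, following the slide's formulation.

    similarities: (B, 66, 10)  matching-layer output
    targets:      (B, 66)      1 for the ground-truth scenario/viewpoint, else 0
    """
    # Confidence per video = best-matching frame, squashed to a probability (assumption).
    p = torch.sigmoid(similarities.max(dim=-1).values)        # (B, 66)
    return F.binary_cross_entropy(p, targets.float())

def n3_predict(similarities):
    """Test time: pick the scenario with maximum confidence and its matching state."""
    per_video, state = similarities.max(dim=-1)                # (B, 66) values, (B, 66) frame indices
    scenario = per_video.argmax(dim=-1)                        # predicted scenario h
    s_star = state.gather(1, scenario.unsqueeze(1)).squeeze(1) # matching state s* for that scenario
    return scenario, s_star
```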
Method 1: Newtonian Neural Networks
Results Figure 6: The expected motion of the object
Method 1: Newtonian Neural Networks
Results Figure 7: Visualization of the direction of net force and object velocity
Method 2: Phys-Net Objectives
Explores the ability of a deep feed-forward model to learn intuitive physics of objects. Figure 8: Block tower examples from the synthetic (left) and real (right) datasets
Method 2: Phys-Net Data Collection Synthetic
Generated vertical stacks of 2, 3, and 4 colored blocks in random configurations. Features such as block position, background texture, and lighting were randomized at each trial to improve the generalizability of the learned features. Each simulation records the outcome and captures screenshots as well as segmentation masks at 8 frames/sec.
Real Data
Four wooden cubes were fabricated and spray-painted red, green, blue, and yellow respectively. The blocks were manually stacked in configurations of 2, 3, and 4, and a camera was used to film the blocks falling at 60 frames/sec.
Method 2: Phys-Net Architecture
Figure 9: Architecture of the Phys-Net network
The CNN architectures here take an integrated approach: the lower layers perceive the arrangement of blocks, while the upper layers implicitly capture their inherent physics.
Method 2: Phys-Net Architecture
The ResNet-34 and GoogLeNet networks were trained on the fall-prediction task. The models were pre-trained on the ImageNet dataset, and the final linear layer was replaced with a single logistic output.
Figure 10: Fall Prediction Branch of the Phys-Net network
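A minimal sketch of that fall-prediction setup using torchvision (the `weights` argument requires torchvision ≥ 0.13): an ImageNet-pretrained ResNet-34 whose final linear layer is swapped for a single logistic output.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-34; replace the final linear layer with a single
# logistic output (probability that the block tower falls).
resnet = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
resnet.fc = nn.Sequential(
    nn.Linear(resnet.fc.in_features, 1),  # single output unit
    nn.Sigmoid(),                         # logistic probability of falling
)

resnet.eval()
image = torch.randn(1, 3, 224, 224)       # dummy input image
with torch.no_grad():
    p_fall = resnet(image)                # (1, 1) fall probability
```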
Method 2: Phys-Net Architecture
DeepMask networks are used to predict the segmentation trajectory of the falling blocks at multiple future times (0s, 1s, 2s, 4s) from an input image.
Each mask pixel is a multi-class classification across a background class and four foreground (block color) classes.
The binary mask head is replaced with a multi-class softmax and replicated N times for mask prediction at multiple points in time.
Figure 11: Mask Prediction Branch of the Phys-Net network
The training loss for the mask networks is the sum of a binary cross-entropy loss for fall prediction and a pixel-wise multi-class cross-entropy loss for each mask (see the sketch below).
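A sketch of the combined training loss described above. The tensor shapes (N mask heads, 5 classes per pixel) follow the slide; the function name and the use of raw logits are assumptions.

```python
import torch
import torch.nn.functional as F

def physnet_loss(fall_logit, fall_label, mask_logits, mask_labels):
    """Sum of a binary cross-entropy fall-prediction loss and a pixel-wise
    multi-class cross-entropy loss for each predicted mask.

    fall_logit:  (B, 1)            raw fall-prediction score
    fall_label:  (B, 1)            1.0 if the tower falls, else 0.0
    mask_logits: (B, N, 5, H, W)   N time steps, 5 classes (background + 4 block colors)
    mask_labels: (B, N, H, W)      integer class labels per pixel
    """
    loss = F.binary_cross_entropy_with_logits(fall_logit, fall_label)
    n_steps = mask_logits.shape[1]
    for t in range(n_steps):       # one mask head per predicted time step
        loss = loss + F.cross_entropy(mask_logits[:, t], mask_labels[:, t])
    return loss
```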
Method 2: Phys-Net Results
Phys-Net correctly predicts the fall direction for most synthetic examples, while on real examples it overestimates stability.
Figure 12: Phys-Net mask predictions for synthetic (left) and real (right) towers of 2, 3, and 4 blocks.
Method 2: Phys-Net Results
Here the models gain knowledge about block-tower dynamics rather than simply memorizing a mapping: there is only a very small degradation in performance on a tower size that is not shown during training.
Figure 13: Plots comparing Phys-Net accuracy to human performance on real (top) and synthetic (bottom) test examples.
Method 3: Visual Interaction Networks (VIN)
Objective
Learning the dynamics of a physical system from raw visual observations.
Visual Interaction Networks (VIN)
The VIN model can be trained from supervised data sequences consisting of input image frames and target object state values.
The VIN model is composed of two main components:
A visual encoder based on convolutional neural networks (CNNs)
A recurrent neural network (RNN) with an interaction network (IN)
Method 3: Visual Interaction Networks (VIN)
Visual Encoder
The visual encoder is a CNN that produces a state code from a sequence of 3 images. It takes each pair of consecutive frames from the sequence and outputs a candidate state code; the two resulting candidate state codes are aggregated.
N = number of objects in the scene
L = length of each state code slot (features)
Figure 14: Frame Pair Encoder. The frame pair encoder is a CNN that transforms two consecutive input frames into a state code. A sketch follows below.
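A rough PyTorch sketch of a frame-pair encoder and its aggregation into an (N, L) state code. The layer sizes, the channel-stacking of the two frames, and the linear aggregation are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class FramePairEncoder(nn.Module):
    """CNN that turns two consecutive RGB frames into a candidate state code of
    shape (N objects, L features). Architecture details are assumptions."""
    def __init__(self, n_objects, slot_len):
        super().__init__()
        self.n_objects, self.slot_len = n_objects, slot_len
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),   # two RGB frames stacked
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_slots = nn.Linear(64, n_objects * slot_len)

    def forward(self, frame_a, frame_b):                  # each (B, 3, H, W)
        h = self.cnn(torch.cat([frame_a, frame_b], dim=1))
        return self.to_slots(h).view(-1, self.n_objects, self.slot_len)

class VisualEncoder(nn.Module):
    """Encodes a sequence of 3 frames: the two candidate state codes from the
    consecutive pairs are aggregated (here with a shared linear layer)."""
    def __init__(self, n_objects, slot_len):
        super().__init__()
        self.pair_encoder = FramePairEncoder(n_objects, slot_len)
        self.aggregate = nn.Linear(2 * slot_len, slot_len)

    def forward(self, f1, f2, f3):
        c1 = self.pair_encoder(f1, f2)                    # (B, N, L)
        c2 = self.pair_encoder(f2, f3)                    # (B, N, L)
        return self.aggregate(torch.cat([c1, c2], dim=-1))   # (B, N, L) state code
```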
Method 3: Visual Interaction Networks (VIN)
Dynamics Encoder
The dynamics encoder is an RNN that produces a candidate state code for a frame from a previous sequence of state codes.
The main difference between this and a vanilla IN is that it aggregates over multiple temporal offsets.
Figure 14: Dynamics Encoder; an Interaction Net (IN)
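For reference, a minimal interaction-network core of the kind the dynamics encoder builds on: a relation MLP applied to every ordered pair of object slots, summed per receiver, followed by an object MLP. The MLP sizes are assumptions, and the multi-temporal-offset aggregation mentioned above is omitted for brevity.

```python
import torch
import torch.nn as nn

class InteractionNet(nn.Module):
    """Minimal IN core: relation MLP over ordered slot pairs, effects summed
    per receiver, then an object MLP updates each slot (self-pairs included
    for simplicity)."""
    def __init__(self, slot_len, hidden=64):
        super().__init__()
        self.relation = nn.Sequential(
            nn.Linear(2 * slot_len, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.object = nn.Sequential(
            nn.Linear(slot_len + hidden, hidden), nn.ReLU(), nn.Linear(hidden, slot_len))

    def forward(self, slots):                               # slots: (B, N, L)
        B, N, L = slots.shape
        senders = slots.unsqueeze(1).expand(B, N, N, L)     # slot j, for each receiver i
        receivers = slots.unsqueeze(2).expand(B, N, N, L)   # slot i, repeated over j
        effects = self.relation(torch.cat([receivers, senders], dim=-1)).sum(dim=2)
        return self.object(torch.cat([slots, effects], dim=-1))   # updated (B, N, L)
```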
Method 3: Visual Interaction Networks (VIN)
Architecture
Figure 15: Visual Interaction Network (VIN)
State Decoder
The state decoder is simply a linear layer with input size L (features) and output size 4 (for a position/velocity vector). This linear layer is applied independently to each slot of the state code.
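The state decoder is small enough to show directly; the slot length L = 64 below is an assumed value.

```python
import torch
import torch.nn as nn

L = 64                                   # assumed slot length (features per object)
state_decoder = nn.Linear(L, 4)          # 4 outputs: position (x, y) and velocity (vx, vy)

state_code = torch.randn(8, 3, L)        # (batch, N objects, L features)
states = state_decoder(state_code)       # (8, 3, 4): nn.Linear acts on the last dimension,
                                         # so each slot is decoded independently
```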
Method 3: Visual Interaction Networks (VIN)
Training and Testing
In each system the force law is applied pair-wise to all objects, and all objects have the same mass and density (a sketch of one such pairwise law follows below).
Spring: Each pair of objects has an invisible spring connection.
Gravity: Objects are massive and obey Newton's law of gravity.
Billiards: No long-distance forces are present, but the billiards bounce off each other and off the boundaries of the field of vision.
Magnetic Billiards: All billiards are positively charged, so instead of bouncing they repel each other according to Coulomb's law. They still bounce off the boundaries.
Drift: No forces of any kind are present; objects drift with their initial velocities.
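As an illustration of a pairwise force law like the "Spring" system above, a minimal NumPy sketch. The spring constant, rest length, and the absence of damping are assumptions; the paper's simulation parameters are not given here.

```python
import numpy as np

def spring_forces(pos, k=1.0, rest_len=1.0):
    """Pairwise spring force law: every pair of objects is connected by an
    (invisible) spring of stiffness k and rest length rest_len.

    pos: (N, 2) object positions
    returns: (N, 2) net force on each object
    """
    diff = pos[None, :, :] - pos[:, None, :]        # (N, N, 2) vectors from i toward j
    dist = np.linalg.norm(diff, axis=-1) + 1e-9     # avoid division by zero on the diagonal
    magnitude = k * (dist - rest_len)               # Hooke's law per pair
    np.fill_diagonal(magnitude, 0.0)                # an object exerts no force on itself
    return (magnitude[..., None] * diff / dist[..., None]).sum(axis=1)
```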
Method 3: Visual Interaction Networks (VIN)
Results Figure 16: Rollout Trajectories
Method 3: Visual Interaction Networks (VIN)
Results Figure 17: Performance
Questions and Discussion