Human-object interaction

Slides:

Advertisements

Similar presentations

Recognizing Human Actions by Attributes CVPR2011 Jingen Liu, Benjamin Kuipers, Silvio Savarese Dept. of Electrical Engineering and Computer Science University.

Advertisements

INTRODUCTION Heesoo Myeong, Ju Yong Chang, and Kyoung Mu Lee Department of EECS, ASRI, Seoul National University, Seoul, Korea Learning.

Intelligent Systems Lab. Recognizing Human actions from Still Images with Latent Poses Authors: Weilong Yang, Yang Wang, and Greg Mori Simon Fraser University,

Large-Scale Object Recognition with Weak Supervision

On the Relationship between Visual Attributes and Convolutional Networks Paper ID - 52.

Spatial Pyramid Pooling in Deep Convolutional

From R-CNN to Fast R-CNN

Richard Socher Cliff Chiung-Yu Lin Andrew Y. Ng Christopher D. Manning

End-to-End Text Recognition with Convolutional Neural Networks

Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford, CA 94305, USA ImprovingWord.

Reading Between The Lines: Object Localization Using Implicit Cues from Image Tags Sung Ju Hwang and Kristen Grauman University of Texas at Austin Jingnan.

Describing Images using Inferred Visual Dependency Representations Authors : Desmond Elliot & Arjen P. de Vries Presentation of Paper by : Jantre Sanket.

Fully Convolutional Networks for Semantic Segmentation

Recognition Using Visual Phrases

Learning video saliency from human gaze using candidate selection CVPR2013 Poster.

Cascade Region Regression for Robust Object Detection

Deep Residual Learning for Image Recognition

Parsing Natural Scenes and Natural Language with Recursive Neural Networks INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML 2011) RICHARD SOCHER CLIFF.

When deep learning meets object detection: Introduction to two technologies: SSD and YOLO Wenchi Ma.

Recent developments in object detection

Big data classification using neural network

Deep Learning for Dual-Energy X-Ray

Guillaume-Alexandre Bilodeau

Object Detection based on Segment Masks

Sentence Modeling Representation of sentences is the heart of Natural Language Processing A sentence model is a representation and analysis of semantic.

Krishna Kumar Singh, Yong Jae Lee University of California, Davis

Saliency-guided Video Classification via Adaptively weighted learning

A Pool of Deep Models for Event Recognition

Recognizing Deformable Shapes

Rotational Rectification Network for Robust Pedestrian Detection

Compositional Human Pose Regression

Nonparametric Semantic Segmentation

Training Techniques for Deep Neural Networks

Dynamic Routing Using Inter Capsule Routing Protocol Between Capsules

Adversarially Tuned Scene Generation

Image Question Answering

A Convolutional Neural Network Cascade For Face Detection

Context-Aware Modeling and Recognition of Activities in Video

Computer Vision James Hays

Image Classification.

A Comparative Study of Convolutional Neural Network Models with Rosenblatt’s Brain Model Abu Kamruzzaman, Atik Khatri , Milind Ikke, Damiano Mastrandrea,

Two-Stream Convolutional Networks for Action Recognition in Videos

Towards Understanding the Invertibility of Convolutional Neural Networks Anna C. Gilbert1, Yi Zhang1, Kibok Lee1, Yuting Zhang1, Honglak Lee1,2 1University.

Deep Learning Hierarchical Representations for Image Steganalysis

Object Detection + Deep Learning

Introduction of MATRIX CAPSULES WITH EM ROUTING

Word Embedding Word2Vec.

CornerNet: Detecting Objects as Paired Keypoints

Outline Background Motivation Proposed Model Experimental Results

RCNN, Fast-RCNN, Faster-RCNN

Heterogeneous convolutional neural networks for visual recognition

Department of Computer Science Ben-Gurion University of the Negev

Deep Object Co-Segmentation

Motivation It can effectively mine multi-modal knowledge with structured textural and visual relationships from web automatically. We propose BC-DNN method.

Motivation Semantic Transformation Module Most of the existing works neglect the semantic relationship between the visual feature and linguistic knowledge,

Motivation State-of-the-art two-stage instance segmentation methods depend heavily on feature localization to produce masks.

Object Detection Implementations

Week 3 Presentation Ngoc Ta Aidean Sharghi.

Weak-supervision based Multi-Object Tracking

End-to-End Facial Alignment and Recognition

Volodymyr Bobyr Supervised by Aayushjungbahadur Rana

Week 7 Presentation Ngoc Ta Aidean Sharghi

Motivation The subjects/objects are correlated to each other under semantic relationships.

Learning to Detect Human-Object Interactions with Knowledge

Visual Grounding.

CVPR 2019 Poster.

Shengcong Chen, Changxing Ding, Minfeng Liu 2018

Presentation transcript:

Human-object interaction 2019.3.15

HOI问题定义 HOI—Human-Object Interaction

HOI-Det问题定义 HOI—Human-Object Interaction 主语->Human 宾语->Object 谓语-> Action 检测出 Human和Object 预测Human和Object交互产生的动作

HOI的发展传统方法起源：Observing human-object interactions using spatial and functional compatibility for recognition. TPAMI 2009. Pose + hoi的先行者：Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses. TPAMI 2012 深度学习时代数据库开启新时代：Learning to Detect Human-Object Interactions. WACV 2018. 根据动作定位相关物体：Detecting and Recognizing Human-Object Interactions. CVPR 2018. 精细化到Part和物体的交互： Attention: Pairwise Body-Part Attention for Recognizing Human-Object Interactions .ECCV 2018. :No-Frills Human-Object Interaction Detection: Factorization, Appearance and Layout Encodings, and Training Techniques. Arxiv 2018. 图卷积 Zero-shot: Compositional learning for human object interaction. ECCV 2018. 起源：Learning Human-Object Interactions by Graph Parsing Neural Networks. ECCV 2018. Two Stage: Transferable Interactiveness Prior for Human-Object Interaction Detection. CVPR 2019.

HOI的常用 Pose特征源于Action 位置信息外部语言知识

Learning to Detect Human-Object Interactions

Contributions Propose HICO-DET dataset: the first large benchmark for HOI detection. Propose HO-RCNN: Human-Object Region-based Convolutional Neural Networks.

HICO-Det Dataset 统计信息 600 HOI classes of interest

Method HO-RCNN

HO-RCNN Human-Object Proposals First detect bounding boxes for humans and the object categories of Interest. Then Figure2.

HO-RCNN Human and Object Stream Given a human-object proposal, the human stream extracts local features from the human bounding box, and generates confidence scores for each HOI class. Object stream as same.

HO-RCNN Pairwise Stream

Detecting and Recognizing Human-Object Interactions

Motivation 人的动作可以一定程度上确定和人产生交互物体的位置如<人，打，球>那么球在人手周围的概率会很大，如果是<人，踢，球> 那么球更大概率会出现在脚的旁边。

Method Model Architecture Model Components Object Detection :Image->Faster-Rcnn->human and object box and associated score. Human-centric Branch: input: Human Conv5 Feature action output: action score (sigmoid) target output: Gaussian Map Interaction Brach: input: Human and Object Conv5 Feature output: HOI score.

Method We then write our target localization term as: Decompose the triplet score into four terms

Transferable Interactiveness Prior for Human-Object Interaction Detection

Motivation Implicitly predict whether human-object is interactive or not. How to utilize interactiveness and improve HOI detction learning

Contribution Propose a general and transferable Interactiveness Prior learning method Interactiveness prior can be learned across many datasets and applied to any specific dataset Outperforms state-of-the-art HOI detection results by a great margin.

Method Framework

Method Representation and Classification Networks Human and Object Detection: Detectron with ResNet-50-FPN. Representation Network: Faster R-CNN with ResNet-50 based R here. HOI Classification Network: multi-stream architecture and late fusion strategy.

Method Interactiveness Network Human and Object stream ROI pooling features from representation network R. Spatial-Pose Stream

Method Confidence Function

Method Interactiveness Prior Transfer Training

Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities

Difficulties HOI: the relevant object tends to be small or only partially visible. Pose: the human body parts are often self-occluded

Contributions Propose a new random field model to encode the mutual context of objects and human poses in human-object interaction activities. Significantly outperforms state-of-the art in detecting very difficult objects and human poses.

Modeling mutual context of object and pose Goal: To estimate the human pose and to detect the object that the human interacts with. The model

Model The overall model can be computed as Co-occurrence context

Model Spatial Context

Model Modeling objects

Model Modeling human pose. Modeling activities

Properties of the model Co-occurrence context for the activity class, object, and human pose Multiple types of human poses for each activity Spatial context between object and body parts. Relations with the other models.

Pairwise Body-Part Attention for Recognizing Human-Object Interactions

Motivation Human interacts with an object by using some parts of the body . Different body parts should be paid with different attention in HOI recognition. The correlations between different body parts should be further considered

Contributions Propose a new pairwise body-part attention model which can learn to focus on crucial parts, and their correlations for HOI recognition. A novel attention based feature selection method and a feature representation scheme that can capture pairwise correlations between body parts . Our proposed approach achieved 10% relative over the SOTA results in HOI recognition on the HICO dataset.

Method Framework

Method Global Appearance Features Scene and Human Features ROI pooling layer extracts ROI features for each person and the scene given their bounding boxes. Concatenate Human Features and Scene Features. Incorporating Object Features Set ROI as a union box of detected human and object. Sample multiple union boxes of different objects and the person

Method Local Pairwise Body-part Features Given a pair of body parts, to extract their joint feature maps while preserving their relative spatial relationships.

Compositional Learning for Human Object Interaction

Motivation

Contribution Propose a novel method using external knowledge graph and graph convolutional networks which learns how to compose classifiers for verb-noun pairs. Provide benchmarks on several dataset for zero-shot learning including both image and video.

Method Framework

Method A Graphical Representation of Knowledge Graph Construction Nodes: Verb and Noun , and Actions Node Feature: word embeddings , （zero Init）. Edges: A verb node can only connect to a noun node via a valid action node. Adjacency matrix normalization->