Motivation The subjects/objects are correlated to each other under semantic relationships
Contributions We propose a context-dependent diffusion network (CDDN) framework to deal with visual relationship detection Semantic graphs to encapsule semantic priors of subjects/objects Visual scene graphs to capture the surrounding context Graph diffusion network to learn latent representations of objects
Architecture
Object Association Global Semantic Graph: 𝒢 1 = 𝒱 1 , ℰ 1 (Whole training set) 𝓋: object ℯ: Spatial Scene Graph: 𝒢 2 = 𝒱 2 , ℰ 2 (single picture) 𝓋: bounding box 𝑏 𝑖 , 𝑏 𝑗
Diffusion Layer 𝒢= 𝒱,𝐴,𝑋 , 𝒱= 𝑣 1 , 𝑣 2 ,… 𝑣 𝑁 , A∈ ℝ 𝑁×𝑁 , 𝑋= 𝑥 1 , 𝑥 2 ,… 𝑥 𝑁 ∈ ℝ 𝑁×𝑑 𝐴 ∈ ℝ 𝑁×𝐻×𝑁 : H power series of A 𝑊∈ ℝ 𝑁×𝐻×𝑁
Ranking Loss Triplet 𝑟= 𝑠, 𝑝, 𝑜
Experiment