SHREC’19 Track: Extended 2D Scene Image-Based 3D Scene Retrieval Hameed Abdul-Rashid, Juefei Yuan, Bo Li, Yijuan Lu, Tobias Schreck, Ngoc-Minh Bui, Trong-Le Do, Mike Holenderski, Dmitri Jarnikov, Khiem T. Le, Vlado Menkovski, Khac-Tuan Nguyen, Thanh-An Nguyen, Vinh-Tiep Nguyen, Tu V. Ninh, Perez Rey, Minh-Triet Tran, Tianyang Wang Now I will present the second track we organized: Extended 2D Scene Image-Based 3D Scene Retrieval. It is joint work by the following authors from these institutes. The first 5 are organizers, while the other 13 are participants from 3 groups. 1
Outline Introduction Benchmark Methods Results Conclusions and Future Work 2
Introduction 2D Scene Image-Based 3D Scene Retrieval Focuses on retrieving relevant 3D scene models Using scene images as input Motivation Vast applications: autonomous vehicles (Fig. 1), multi-view 3D scene reconstruction, VR/AR scene content generation, and consumer electronics apps Challenges Lacks substantial research due to the challenges involved Lack of related retrieval benchmarks 2D Scene Image-Based 3D Scene Retrieval (SceneIBR2019) focuses on retrieving relevant 3D scene models using scene image(s) as input. The motivation for SceneIBR2019 is that it has many important applications, including highly capable autonomous vehicles such as the Renault SYMBIOZ shown in Fig. 1, multi-view 3D scene reconstruction, VR/AR scene content generation, and consumer electronics apps, among others. However, this task is far from trivial and lacks substantial research due to the challenges involved, as well as a lack of related retrieval benchmarks. Fig. 1 Renault SYMBIOZ concept 3
Introduction (Cont.) 2D Scene Image-Based 3D Scene Retrieval Brand new research topic in image-based 3D object retrieval: A query image contains several objects Objects may overlap with each other Relative context configurations exist among the objects Our previous work SHREC’18 track: 2D Scene Image-Based 3D Scene Retrieval Built the SceneIBR2018 [1] benchmark: 10 scene classes, each with 1,000 images and 100 3D models Good performance called for a more comprehensive dataset We built the SceneIBR2019 benchmark To further promote this challenging research direction Most comprehensive and largest 2D scene image-based 3D scene retrieval benchmark 2D Scene Image-Based 3D Scene Retrieval is a brand new research topic in the field of image-based 3D object retrieval. It has several new features: a query image contains several objects; objects may overlap with each other; and relative context configurations exist among the objects in a scene image/model. In previous work, we organized a 2D Scene Image-Based 3D Scene Retrieval track in SHREC’18, resulting in the SceneIBR2018 benchmark, which contains 10 scene classes with 1,000 images and 100 3D models per class. During the SHREC’18 track, we found that the benchmark was not challenging or comprehensive enough, since it covers only 10 categories, each clearly distinct from the others. Considering this, we decided to further increase the comprehensiveness of the benchmark by building a significantly larger one. We built the most comprehensive and largest 2D scene image-based 3D scene retrieval benchmark, SceneIBR2019. [1] H. Abdul-Rashid et al. SHREC’18 track: 2D scene image-based 3D scene retrieval. In 3DOR, pages 1–8, 2018. 4
Outline Introduction Benchmark Methods Results Conclusions and Future Work Let’s continue with the details of the SceneIBR2019 benchmark. 5
SceneIBR2019 Benchmark Overview We have substantially extended SceneIBR2018 with 20 additional classes Building process Scene labels chosen from Places88 [2] Selected 30 of the 88 available category labels in Places88 Voting method among three individuals 2D/3D scene data collected from Flickr Google Images 3D Warehouse Overview: We built a 3D scene retrieval benchmark by substantially extending SceneIBR2018, identifying and consolidating the same number of images/models for an additional 20 classes from the most popular 2D/3D data resources. Building process: We selected the 30 most popular scene classes (including the initial 10 classes of SceneIBR2018) from the 88 available category labels in Places88, via a voting mechanism based on the collaborative judgement of three people. Then, to collect the images and models for the additional 20 classes, we gathered images from Flickr and Google Images, and downloaded 3D scene models from 3D Warehouse. [2] B. Zhou et al. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1452–1464, 2018
SceneIBR2019 Benchmark 2D Scene Image Query Dataset 30,000 2D scene images 30 classes, each with 1,000 images 3D Scene Model Target Dataset 3,000 3D scene models 30 classes, each with 100 models To evaluate learning-based 3D scene retrieval The 2D Scene Image Query Dataset comprises 30,000 2D scene images categorized into 30 classes, each with 1,000 images. The 3D Scene Model Target Dataset contains 3,000 3D scene models, categorized into the same 30 classes, each having 100 models. To help evaluate learning-based 3D scene retrieval algorithms, we randomly select 700 images and 70 models from each class for training and use the remaining 300 images and 30 models for testing, as indicated in Table 1 (a minimal split sketch follows below). Table 1 Training and testing dataset information of our SceneIBR2019 benchmark. Per class: 700 training / 300 testing images, 70 training / 30 testing models. Total: 21,000 training / 9,000 testing images, 2,100 training / 900 testing models.
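The per-class split described above is straightforward to reproduce. The following is a minimal sketch of such a random per-class split, not the organizers' actual script; the dictionary layout and function name are illustrative assumptions.

```python
# Minimal sketch of the per-class train/test split described above:
# 700 images / 70 models per class for training, the remaining
# 300 images / 30 models per class for testing.
import random

def split_per_class(items_by_class, n_train):
    """Randomly split the items of each class into train/test subsets."""
    train, test = {}, {}
    for label, items in items_by_class.items():
        shuffled = items[:]        # copy so the input list is not mutated
        random.shuffle(shuffled)
        train[label] = shuffled[:n_train]
        test[label] = shuffled[n_train:]
    return train, test

# e.g., images_by_class maps each of the 30 scene labels to its 1,000 images:
# train_imgs, test_imgs = split_per_class(images_by_class, n_train=700)
# train_models, test_models = split_per_class(models_by_class, n_train=70)
```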
2D Scene Image Query Dataset One example per class for the 2D Scene Image Query Dataset is demonstrated in Fig. 2. Fig. 2 Example 2D scene query images (1 per class)
3D Scene Model Target Dataset Similarly, one example per class for the 3D Scene Model Target Dataset is shown in Fig. 3. Fig. 3 Example 3D target scene models (1 per class)
Evaluation Seven commonly adopted performance metrics in 3D model retrieval [3]: Precision-Recall plot (PR) Nearest Neighbor (NN) First Tier (FT) Second Tier (ST) E-Measure (E) Discounted Cumulated Gain (DCG) Average Precision (AP) We have also developed code to compute them: http://orca.st.usm.edu/~bli/SceneIBR2019/data.html We utilize the following seven commonly adopted performance metrics in 3D model retrieval: Precision-Recall, Nearest Neighbor, First Tier, Second Tier, E-Measure, Discounted Cumulated Gain, and Average Precision. We have also developed code to compute them, which can be downloaded via the link above (an illustrative sketch of several of these metrics follows below). [3] B. Li et al. A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries. Computer Vision and Image Understanding, 131:1–27, 2015. 10
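For concreteness, here is an illustrative sketch, not the official evaluation code linked above, of how several of these metrics can be computed from one query's ranked results, assuming relevance means sharing the query's class and that every target is ranked.

```python
# Illustrative metric computation for one query.
# ranked_relevance: list of 0/1 flags over the full ranked target list.
# num_relevant: number of targets in the query's class (100 in SceneIBR2019).
import math

def retrieval_metrics(ranked_relevance, num_relevant):
    R = num_relevant
    nn = ranked_relevance[0]                      # Nearest Neighbor
    ft = sum(ranked_relevance[:R]) / R            # First Tier
    st = sum(ranked_relevance[:2 * R]) / R        # Second Tier
    # E-Measure: harmonic mean of precision and recall over the top 32
    # results (the cutoff commonly used in shape retrieval benchmarks).
    k = 32
    p, r = sum(ranked_relevance[:k]) / k, sum(ranked_relevance[:k]) / R
    e = 2 * p * r / (p + r) if p + r > 0 else 0.0
    # Average Precision: mean of the precision at each relevant hit.
    hits, prec_sum = 0, 0.0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            prec_sum += hits / i
    ap = prec_sum / R
    # Discounted Cumulated Gain, normalized by the ideal ranking.
    dcg = ranked_relevance[0]
    for i, rel in enumerate(ranked_relevance[1:], start=2):
        dcg += rel / math.log2(i)
    ideal = 1 + sum(1 / math.log2(i) for i in range(2, R + 1))
    return nn, ft, st, e, ap, dcg / ideal

# Per-query values are averaged over all test queries to obtain
# benchmark-level numbers like those reported later in Table 2.
```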
Outline Introduction Benchmark Methods Results Conclusions and Future Work Let’s continue with the participating methods. 11
Methods ResNet50-Based Image Recognition and Adapting Place Classification for 3D Models Using Adversarial Training (RNIRAP) Conditional Variational Autoencoders for Image Based Scene Retrieval (CVAE) View and Majority Vote Based 3D Scene Retrieval Algorithm (VMV) Here we list the three participating methods: 1. ResNet50-Based Image Recognition and Adapting Place Classification for 3D Models Using Adversarial Training (RNIRAP). 2. Conditional Variational Autoencoders for Image Based Scene Retrieval (CVAE). 3. View and Majority Vote Based 3D Scene Retrieval Algorithm (VMV-VGG). The first and third methods are almost the same as those in the SBR track, so we will skip their introduction, which can be found in the hidden slides, and explain more about the new method, CVAE (a brief sketch of the view-and-majority-vote idea follows below). 12
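Although the VMV-VGG details are left to the hidden slides, its core idea of classifying rendered views and letting them vote can be summarized in a short, hedged sketch; the function names here are illustrative assumptions, not the authors' code.

```python
# Sketch of the view-and-majority-vote idea: classify each rendered view
# of a 3D scene with a CNN (VGG in VMV-VGG), then let the views vote.
from collections import Counter

def scene_label_by_majority_vote(view_images, cnn_classifier):
    """cnn_classifier(image) -> predicted scene label for one view."""
    votes = Counter(cnn_classifier(img) for img in view_images)
    return votes.most_common(1)[0][0]   # label with the most votes

# Retrieval then ranks the target scenes whose voted label matches the
# label predicted for the query image ahead of all other scenes.
```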
CVAE: Conditional Variational Autoencoders for Image Based Scene Retrieval Luis Armando Pérez Rey, Mike Holenderski and Dmitri Jarnikov Eindhoven University of Technology, The Netherlands The second method is the CVAE approach, which is based on image-to-image comparison between the query images and the renderings obtained from a 3D scene, contributed by a group from Eindhoven University of Technology, The Netherlands.
CVAE Overview Step 1: Render images from 3D scenes and preprocess the images Step 2: Encode the images as probability distributions over classes and latent space with a Conditional Variational Autoencoder (CVAE) Step 3: Calculate similarity between renderings and query image The method uses Conditional Variational Autoencoders (CVAE) to represent the images in terms of a probability distribution over the category labels and latent variables. The similarity between a query image and a 3D scene is calculated with respect to the estimated probability distributions of the renderings and the query images. The shape retrieval process can then be described by the following steps: Step 1: Render images for each of the 3D scenes from different angles, and perform some preprocessing on the images. Step 2: Encode the renderings and the query image as probability distributions over the class labels and the chosen latent space with a trained Conditional Variational Autoencoder (CVAE). Step 3: Calculate the similarity between the renderings and the query image by comparing their probability distributions obtained with the encoder of the CVAE. Shape retrieval is then performed by ranking the similarity measurements (a minimal sketch of this step follows below). More detail can be found in the next 4 hidden slides. Fig. 5 Three steps of the Conditional Variational Autoencoders for Image Based Scene Retrieval method
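As a concrete illustration of Step 3, the sketch below ranks scenes by comparing class-probability distributions produced by the trained encoder; the encoder interface, the KL-divergence choice, and the min-over-renderings aggregation are assumptions for illustration, not necessarily the authors' exact design.

```python
# Sketch of Step 3: rank 3D scenes by divergence between the encoder's
# class-probability distribution for the query and for each rendering
# (the latent-space part of the CVAE representation is omitted here).
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete distributions over class labels."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def rank_scenes(query_image, scene_renderings, encode_class_probs):
    """scene_renderings: {scene_id: [renderings from different angles]}."""
    q = encode_class_probs(query_image)
    scores = {
        scene_id: min(kl_divergence(q, encode_class_probs(r)) for r in renders)
        for scene_id, renders in scene_renderings.items()
    }
    return sorted(scores, key=scores.get)   # most similar scenes first
```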
Outline Introduction Benchmark Methods Results Conclusions and Future Work Here we will show the evaluation results of the eight runs of the three methods based on the seven performance metrics. 27
This figure shows the Precision-Recall diagram performance comparison on the testing dataset of our SceneIBR2019 benchmark for the three learning-based participating methods. Bui’s RNIRAP algorithm (run 2) performs the best, followed by the baseline method VMV-VGG and the CVAE method (CVAE2). A sketch of how such a curve is computed follows below. Fig. 11 Precision-Recall diagram performance comparisons on the testing dataset of our SceneIBR2019 benchmark for three learning-based participating methods 28
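For reference, a Precision-Recall curve like the one in Fig. 11 can be obtained per query from its ranked relevance flags and then averaged over all test queries; the following is a minimal sketch, not the track's official evaluation code.

```python
# Compute one query's (recall, precision) points at each relevant hit.
def precision_recall_points(ranked_relevance, num_relevant):
    points, hits = [], 0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / num_relevant, hits / i))
    return points
```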
Results: Performance Metrics Table 2. Performance metrics comparison on our SceneIBR2019 benchmark for the three learning-based participating methods This table compares their performance based on the other six performance metrics. More details about the retrieval performance of each individual query of every participating method are available on the SceneIBR2019 track homepage [5]. [5] SceneIBR2019 track homepage: http://orca.st.usm.edu/~bli/SceneIBR2019/results.html 29
Discussions All three methods are CNN deep learning-based methods Most promising and popular approach for tackling this direction Finer classification RNIRAP and VMV-VGG: CNN + classification-based approach CVAE: VAE only RNIRAP: utilized object-level semantic information for data augmentation and refining retrieval results Significant performance drop compared with SceneIBR2018 Distinct 10 scene categories in SceneIBR2018 Introduction of many correlated categories in SceneIBR2019 Better overall performance on the SceneIBR2019 track compared with the SceneSBR2019 track Same reason: a larger and information-rich query dataset Firstly, all three submitted approaches utilize CNN models, which contribute a lot to their achieved performance. Therefore, according to these two years’ SHREC tracks (SHREC’18 and SHREC’19) on this topic, deep learning-based techniques are still the most promising and popular approach for tackling this new and challenging research direction. Secondly, we can further classify the submitted approaches at a finer granularity. Both RNIRAP and VMV-VGG utilize CNN models and a classification-based approach, which contribute a lot to their better accuracies, while the CVAE-based method only uses a conditional VAE generative model. To further improve the retrieval performance, RNIRAP used scene object semantic information during the stages of data augmentation and retrieval-result refinement. Thirdly, there is a significant drop in retrieval performance compared with the performance achieved on the SceneIBR2018 track. This is to be expected, since the 10 scene categories in the SceneIBR2018 benchmark are distinct and have few correlations, while we introduced many correlated scene categories in SceneIBR2019. Finally, compared with the SBR track, we again achieved better performance on the IBR track. This is because we have a much larger 2D image query dataset containing more details and color information, which makes the semantic gap much smaller. 30
Outline Introduction Benchmark Methods Results Conclusions and Future Work 31
Conclusions and Future Work Objective: To foster this challenging and interesting research direction: Scene Image-Based 3D Scene Retrieval Dataset: Built the current largest 2D scene image-based 3D scene retrieval benchmark Participation: Though challenging, 3 groups successfully participated in the track and contributed 8 runs of 3 methods Evaluation: Performed a comparative evaluation of the accuracy Future work Large-scale benchmarks supporting multiple modalities 2D queries: images, sketches 3D target models: meshes, RGB-D, LIDAR, range scans Semantics-driven retrieval approaches Classification-based retrieval Our conclusions include: Objective: to foster this challenging and interesting research direction, Scene Image-Based 3D Scene Retrieval. Dataset: we built the current largest 2D scene image-based 3D scene retrieval benchmark. Participation: 8 runs of 3 methods were provided by three groups. Evaluation: we performed a comparative evaluation of the accuracy. Future work: Firstly, to build a large-scale benchmark which supports multiple modalities of 2D queries (i.e., images and sketches) and/or 3D target models (i.e., meshes, RGB-D, LIDAR, and range scans). Secondly, since a lot of semantic information exists in both the 2D query images and the 3D target scenes of our current SceneIBR2019 benchmark, it is promising to develop a semantics-driven retrieval approach to further advance the retrieval performance in both accuracy and scalability. Finally, more research in classification/recognition-based retrieval approaches, due to the better performance (i.e., Bui’s RNIRAP and Yuan’s VMV-VGG) achieved on the 2018 and 2019 IBR tracks.
References [1] H. Abdul-Rashid et al. SHREC’18 track: 2D scene image-based 3D scene retrieval. In 3DOR, pages 1–8, 2018. [2] B. Zhou et al. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1452–1464, 2018. [3] B. Li et al. A comparison of 3D shape retrieval methods based on a large-scale benchmark supporting multimodal queries. Computer Vision and Image Understanding, 131:1–27, 2015. [4] N. Liu et al. DHSNet: Deep hierarchical saliency network for salient object detection. In CVPR, pages 678–686, 2016. [5] SceneIBR2019 track homepage: http://orca.st.usm.edu/~bli/SceneIBR2019/results.html 33
Thank you! Q&A?