Understanding Text Corpora with Multiple Facets Lei Shi, Furu Wei, Shixia Liu, Xiaoxiao Lian, Li Tan and Michelle X. Zhou IBM Research.



Emergency Room Records

Hotel Reviews

Intelligence Reports

Documents

Financial News/Blogs/Message Boards

Outline
• Problem & Related Work
• Multi-Facet Text Data Model and Text Processing
  – Data model
  – Text pre-processing
  – Content summarization
• Visualization
  – Metaphor
  – Creation algorithm
  – Interactions
• Video Demo

Problem & Related Work
• Building a visual analytics tool that explains multi-faceted text corpora is challenging:
  – How to combine the raw text data with rich text analytics results for visualization?
  – Which visual metaphors effectively illustrate text content, evolution, and facet correlations?
  – How to customize interactions to assist users in data navigation and other visual analytics tasks?
• Related work
  – Text trend visualization: ThemeRiver, NameVoyager, etc.
  – Text content visualization: tag cloud, Wordle, PhraseNet, etc.
  – Text entity pattern visualization: TileBars, Jigsaw, FeatureLens, Takmi, etc.
  – Text visualization in specific domains

Multi-Facet Data Model and Text Pre-Processing
• Multi-facet data model for text corpora
  – Time facet
      Explicit field or extracted from raw text
  – Category facet
      Topic modeling by Latent Dirichlet Allocation (LDA, Blei et al. 2003)
      Category labels from document classification/clustering
      Other nominal structured information (hotel names, countries, etc.)
  – Unstructured (content) facets
      Inherent multiple text fields
      Multiple facets from named-entity (NE) extraction (people, location, organization) or part-of-speech (POS) parsing (nouns, verbs, adjectives)
  – Structured facets
      Categorical, numerical, or nominal data fields
      Other calculated categorical values (sentiment orientations, average ratings)
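As a rough illustration of the data model above, one document can be held as a record with one slot per facet type. This is a minimal sketch, not the authors' implementation: the class name `FacetedDocument`, its field names, and the helper `group_by_category_and_time` are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Union


@dataclass
class FacetedDocument:
    """One document described by the four facet types (hypothetical layout)."""
    time: date                                    # time facet
    category: str                                 # category facet, e.g. a topic label
    # unstructured facets: facet name -> extracted keywords
    content: Dict[str, List[str]] = field(default_factory=dict)
    # structured facets: field name -> categorical or numerical value
    structured: Dict[str, Union[str, float]] = field(default_factory=dict)


def group_by_category_and_time(docs):
    """Count documents per (category, time) cell."""
    counts = {}
    for d in docs:
        key = (d.category, d.time)
        counts[key] = counts.get(key, 0) + 1
    return counts


docs = [
    FacetedDocument(date(2010, 1, 1), "service", {"nouns": ["staff", "desk"]}, {"sentiment": "positive"}),
    FacetedDocument(date(2010, 1, 1), "service", {"nouns": ["staff"]}, {"sentiment": "negative"}),
    FacetedDocument(date(2010, 2, 1), "location", {"nouns": ["beach"]}, {"rating": 4.5}),
]
counts = group_by_category_and_time(docs)
```

Grouping by category and time gives the per-cell document counts that a stacked trend chart would plot.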

Content Facet Summarization
• A set of topics $\{T_1, \dots, T_i, \dots, T_N\}$
• A set of keywords $\{W_1, \dots, W_j, \dots, W_M\}$
• A set of topic probabilities $\{\dots, P(T_i \mid D_k), \dots\}$, where $D_k$ is the kth document in the collection
• A set of word probabilities $\{\dots, P(W_j \mid T_i), \dots\}$
• Rank the topics to present the most valuable ones first
• Select a keyword subset for each time segment as the content summary: $\{\dots\}_{t-1}, \{\dots, W_j, \dots\}_t, \{\dots\}_{t+1}$

Content Facet Summarization
• Topic/category re-ranking by topic coverage and variance: find the most active topics with significant variety
  – Topic coverage: (formula not captured in transcript)
  – Topic variance: (formula not captured in transcript)
  – Balancing two metrics: (formula not captured in transcript)
  – Formula annotations: doc-topic distribution, doc length, doc number
• Keyword re-ranking
  – Topic keyword re-ranking: (formula not captured in transcript)
  – Time-sensitive keyword re-ranking: preserve completeness and distinctiveness
      Completeness: cover the original keywords of a topic
      Distinctiveness: distinguish one time segment from another
  – Formula annotations: topic-keyword distribution, topic number
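The re-ranking formulas on this slide did not survive, so the following is only a sketch under stated assumptions: coverage is taken to be the length-weighted mean of P(T_i | D_k), variance the length-weighted spread around that mean, and the two are balanced as a product of powers. The function name `rank_topics` and the weights `w_cov` and `w_var` are hypothetical.

```python
import math


def rank_topics(doc_topic, doc_lengths, w_cov=1.0, w_var=1.0):
    """Return topic indices, highest score first.

    doc_topic[k][i] is P(T_i | D_k); doc_lengths[k] is the length of D_k.
    The scoring formula is an assumption, not taken from the slide.
    """
    total = sum(doc_lengths)
    n_topics = len(doc_topic[0])
    scores = []
    for i in range(n_topics):
        # Assumed coverage: length-weighted mean of P(T_i | D_k).
        cov = sum(n * p[i] for n, p in zip(doc_lengths, doc_topic)) / total
        # Assumed variance: length-weighted squared deviation from the coverage.
        var = sum(n * (p[i] - cov) ** 2 for n, p in zip(doc_lengths, doc_topic)) / total
        # Assumed balance: product of powers of coverage and standard deviation.
        scores.append((cov ** w_cov) * (math.sqrt(var) ** w_var))
    return sorted(range(n_topics), key=lambda i: scores[i], reverse=True)


doc_topic = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.50, 0.45, 0.05],
]
order = rank_topics(doc_topic, doc_lengths=[100, 100, 100])
```

With this scoring, a topic that is equally weak in every document (the third column here) has no variance and falls to the bottom, which matches the stated goal of surfacing active topics with significant variety.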

System Architecture
[Diagram. Components: Text Preprocessing, Text Summarization, Visualization, User Interaction. Data labels: text collection, text content + meta data, summarization results.]

Visualization Metaphors
• Multi-stack trend visualization + time-sensitive tag clouds
  – Vis-data mappings: time facet → x (time) axis; category facet → stack; unstructured facets → tag clouds; structured facet → keyword style (color/font)
  – Other mappings: document count → y axis; re-ranked occurrence count → keyword size
[Figure labels: Category Facet, Time, Unstructured Facets, Structured Facets]

Keywords Layout
• Keyword layout with the sweep-line greedy algorithm
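The slide names the algorithm but gives no details, so the sketch below is an assumption-heavy simplification rather than the authors' method. It sweeps a vertical line left to right across a band whose free height varies per column and, at each position, greedily places the first pending keyword that fits. The function `layout_keywords` and its parameters are hypothetical.

```python
def layout_keywords(band_heights, keywords, gap=1):
    """Greedy one-row placement along a left-to-right sweep (illustrative only).

    band_heights[x] is the free vertical space at column x.
    keywords is a list of (word, width, height), most important first.
    Returns {word: left_column} for the words that were placed.
    """
    placed = {}
    pending = list(keywords)
    n = len(band_heights)
    x = 0
    while x < n and pending:
        for idx, (word, w, h) in enumerate(pending):
            # A word fits if it stays inside the band and every covered column is tall enough.
            if x + w <= n and min(band_heights[x:x + w]) >= h:
                placed[word] = x
                x += w + gap          # advance the sweep line past the placed word
                del pending[idx]
                break
        else:
            x += 1                    # nothing fits here; move the sweep line one column
    return placed


band = [2, 2, 5, 5, 5, 5, 3, 3, 3, 1]
words = [("fever", 4, 4), ("pain", 3, 3), ("cough", 2, 2)]
placed = layout_keywords(band, words)
```

Because the choice at each sweep position is greedy, a large keyword can be crowded out by smaller ones placed earlier ("fever" is left unplaced in this example), which is the usual trade-off of a greedy layout against an exhaustive search.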

Interactions
• Temporal zooming for time facet navigation
• Topic editing for category facet navigation
• Unstructured facet navigation panel
• Structured facet mapping
• Other customized interactions: topic focus-in-context view

Focus-In-Context View Calculation
• Constraints for detailed trend view
  – Contour-preserving
  – Flexible space control
  – All topic trends as undistorted as possible
• 1D fisheye distortion
  – Height calculation for expanded trend
  – Order-preserving height adjustment
  – Apply fisheye distortion from the center line of selected topic
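One way to picture the 1D fisheye step is the sketch below. The slide does not say which distortion function is used; the code assumes the common Sarkar-Brown form g(x) = (d + 1)x / (dx + 1), applied outward from the center line of the selected topic. The function name `fisheye_1d` and the factor `d` are illustrative choices.

```python
def fisheye_1d(y, focus, lo, hi, d=3.0):
    """Warp coordinate y in [lo, hi] around `focus`, keeping lo and hi fixed.

    Uses the Sarkar-Brown distortion g(x) = (d + 1) * x / (d * x + 1),
    which is an assumption; the slide does not name the function.
    """
    if y == focus:
        return y
    bound = hi if y > focus else lo
    x = (y - focus) / (bound - focus)      # normalized distance from the focus, in (0, 1]
    g = (d + 1) * x / (d * x + 1)          # monotone on [0, 1], so order is preserved
    return focus + g * (bound - focus)


# Boundaries of five equally tall topic layers; the selected topic spans 40..60.
boundaries = [0, 20, 40, 60, 80, 100]
warped = [fisheye_1d(y, focus=50, lo=0, hi=100) for y in boundaries]
```

Since g is monotone and maps 0 to 0 and 1 to 1, the layer order and the outer contour stay fixed while the selected layer grows (from 20 to about 50 units here) and the other layers are compressed toward the edges.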

Video Demo
Visual Analytics for Emergency Room Records

Thank You, Merci, Grazie, Gracias, Obrigado, Danke