A Researcher’s Workbench in 2020: Intelligent Information Systems for Knowledge Synthesis and Discovery. ChengXiang (“Cheng”) Zhai, Department of Computer Science.

Presentation transcript:

A Researcher’s Workbench in 2020: Intelligent Information Systems for Knowledge Synthesis and Discovery
ChengXiang (“Cheng”) Zhai
Department of Computer Science, Institute for Genomic Biology, Graduate School of Library and Information Science, Department of Statistics
University of Illinois at Urbana-Champaign
http://www.cs.uiuc.edu/homes/czhai
DOE Institute for Computing in Science (ICiS) Workshop on “Integrating, Representing, and Reasoning over Human Knowledge: A Computational Grand Challenge for the 21st Century”, Snowbird, Utah (August 2010)

What are the key computational challenges?
Assuming data sharing isn’t a problem, what kinds of systems are needed to effectively support representing, integrating, and reasoning over human knowledge?

Computer-Aided Research (CAR) in 2020
[Figure labels: Public data/info/knowledge; Network; Personal data/info/knowledge]
1. Multi-level integration of data/info/knowledge
2. Multimode info access
3. Research task support
4. Personalized CAR
5. Collaborative research

1. We need multiple levels of integration

Five Levels of Integration
Level 1: “Syntactic” integration of multiple sources
-- Scalable, robust, but minimum support for discovery
Level 2: Semantic integration (ontology)
-- Scalable, less robust, better support for discovery
Level 3: Synthesis of knowledge (entities, relations)
-- Less scalable, not robust, support for interactive discovery
Level 4: Synthesis of knowledge + inference rules
-- Only applicable to a limited domain, but can potentially support automatic discovery
Level 5: Specialized discovery model
-- Automatic hypothesis testing, but limited to a special discovery/prediction task

Multi-level support is needed because…
-- Knowledge extraction is far from 100% accurate (NLP is difficult)
-- Interpretation of knowledge is inherently context-sensitive, and low-level support is needed for context and provenance
-- The automation-scalability tradeoff will not disappear (soon)
-- …

Automation-Scalability Tradeoff
[Figure: systems plotted along two axes, automation of discovery and scalability/generality, with a “Goal” marker; systems in the order shown:]
-- Specialized statistical prediction models
-- Logic-based inference systems
-- ER graph analysis engine
-- Ontology-based semantic integration
-- Federated search engines
Region labels: “Beyond ontology” integration; “Ontology-free” integration

Interactive ER Graph Analysis
The extracted entities and relations form a weighted graph
Need to develop techniques to mine the graph for knowledge:
-- Store graphs
-- Index graphs
-- Mining algorithms (neighbor finding, path finding, entity comparison, outlier detection, frequent subgraphs, …)
-- Mining language

Example of Interactive Graph Mining
[Figure: ER graph over Behavior nodes B1, B2, B3, B4 and Gene nodes A1, A1’, A2, A3, A4, A4’, A5; edge labels: isa, Co-occur-bee, Co-occur-fly, Co-occur-mos, Orth-mos, orth, Reg]
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}
X = PathBetween({A4, A4’}, B4, {co-occur, reg, isa})
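The two operators in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the edge list is a hypothetical toy graph chosen so that the four steps reproduce the sets shown on the slide, edges are assumed to be undirected, and `neighbor_of` and `path_between` are illustrative names for one-hop neighbor finding and shortest-path search.

```python
from collections import defaultdict, deque

# Hypothetical toy ER graph: (node, relation, node) triples, treated as undirected.
EDGES = [
    ("B4", "isa", "B1"), ("B4", "isa", "B2"), ("B4", "co-occur", "B3"),
    ("B1", "co-occur", "A1"), ("B1", "co-occur", "A1'"),
    ("B2", "co-occur", "A2"), ("B3", "co-occur", "A3"),
    ("A1", "orth", "A1'"), ("A4", "reg", "A1"), ("A4'", "reg", "A5"),
]
# Assumption for this toy data: names starting with "B" are behaviors, the rest genes.
TYPES = {node: ("Behavior" if node.startswith("B") else "Gene")
         for src, _, dst in EDGES for node in (src, dst)}


def build_adjacency(edges):
    """Index the edge list as node -> [(relation, neighbor), ...]."""
    adj = defaultdict(list)
    for src, rel, dst in edges:
        adj[src].append((rel, dst))
        adj[dst].append((rel, src))
    return adj


def neighbor_of(adj, types, sources, node_type, relations):
    """One-hop neighbors of `sources` with the given type, via the allowed relations."""
    found = set()
    for node in sources:
        for rel, other in adj.get(node, []):
            if rel in relations and types.get(other) == node_type:
                found.add(other)
    return found - set(sources)


def path_between(adj, sources, target, relations):
    """Breadth-first search for a shortest path from any source to `target`."""
    queue = deque([s] for s in sorted(sources))
    seen = set(sources)
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for rel, other in adj.get(path[-1], []):
            if rel in relations and other not in seen:
                seen.add(other)
                queue.append(path + [other])
    return None


ADJ = build_adjacency(EDGES)
X = neighbor_of(ADJ, TYPES, {"B4"}, "Behavior", {"co-occur", "isa"})
Y = neighbor_of(ADJ, TYPES, X, "Gene", {"co-occur", "orth"})
Y = Y | {"A5", "A6"}  # step 3: the user adds genes from their own knowledge
Z = neighbor_of(ADJ, TYPES, Y, "Gene", {"reg"})
PATH = path_between(ADJ, {"A4", "A4'"}, "B4", {"co-occur", "reg", "isa"})
```

Step 3 is where the interaction happens: the user edits the intermediate set before the next operator runs, so each operator takes and returns plain sets.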

Inference-Based Discovery
Encode all kinds of knowledge in the same knowledge representation language
Perform logic inferences
Example:
Regulate(GeneA, GeneB, ContextC). [Text mining]
SeqSimilar(GeneA, GeneA’). [Sequence mining]
Regulate(X,Y,C) :- Regulate(Z,Y,C) & SeqSimilar(X,Z). [Human knowledge]
⇒ Regulate(GeneA’, GeneB, ContextC)
ADD: InPathway(GeneB, P1)
InPathway(X,P) :- Regulate(X,Y,C) & InPathway(Y,P). [Human knowledge]
⇒ InvolvedInPathway(GeneA’, P1)
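The inference example above can be reproduced with a small forward-chaining loop. This is a sketch under stated assumptions, not a real inference engine: facts are plain tuples, variables are strings starting with `?`, and `forward_chain` is an illustrative name. Two choices are assumptions rather than something the slide states: a symmetry rule for SeqSimilar is added so that the fact SeqSimilar(GeneA, GeneA’) can match the rule body, and the head of the pathway rule is named InvolvedInPathway to match the slide's stated conclusion.

```python
def unify(pattern, fact, bindings):
    """Match one body atom against one fact; return extended bindings or None."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    new = dict(bindings)
    for term, value in zip(pattern[1:], fact[1:]):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None
    return new


def match_body(body, facts, bindings):
    """Yield every variable binding that satisfies all atoms in the rule body."""
    if not body:
        yield bindings
        return
    for fact in facts:
        extended = unify(body[0], fact, bindings)
        if extended is not None:
            yield from match_body(body[1:], facts, extended)


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new fact can be derived."""
    facts = set(facts)
    while True:
        new = set()
        for head, body in rules:
            for bindings in match_body(body, facts, {}):
                derived = tuple(bindings.get(term, term) for term in head)
                if derived not in facts:
                    new.add(derived)
        if not new:
            return facts
        facts |= new


FACTS = {
    ("Regulate", "GeneA", "GeneB", "ContextC"),   # from text mining
    ("SeqSimilar", "GeneA", "GeneA'"),            # from sequence mining
    ("InPathway", "GeneB", "P1"),                 # the added fact
}
RULES = [
    # Assumption: sequence similarity is symmetric.
    (("SeqSimilar", "?x", "?z"), [("SeqSimilar", "?z", "?x")]),
    (("Regulate", "?x", "?y", "?c"),
     [("Regulate", "?z", "?y", "?c"), ("SeqSimilar", "?x", "?z")]),
    # Assumption: head named after the slide's conclusion.
    (("InvolvedInPathway", "?x", "?p"),
     [("Regulate", "?x", "?y", "?c"), ("InPathway", "?y", "?p")]),
]
CLOSURE = forward_chain(FACTS, RULES)
```

Because text mining, sequence mining, and expert rules all end up as tuples in one fact set, a single loop can combine them, which is the point of using one representation language.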

Integration of Expert Knowledge
How can we combine expert knowledge with knowledge extracted from literature?
Possible strategies:
-- Interactive mining (human knowledge is used to guide the next step of mining)
-- Inference-based integration
-- Trainable programs (focused miners, each targeting a certain kind of knowledge)

2. We need multiple-mode information access
[Figure labels: Querying/Browsing; Researcher; Recommendation]
How can we connect the right information with the right user at the right time?

Collaborative Surfing [Wang et al. 09]
-- Browsing and querying are tightly integrated
-- Search log organized as a topic map
-- A sustained way of collaborative surfing

News Recommender for Facebook [Gupta et al. 09] Recommendation of research papers?

3. We need to go beyond information access to support tasks
Research topic identification
-- “Hot topic” retrieval, interdisciplinary topic retrieval, topic recommendation
Literature review
-- Automatic survey generation
Collaborator recommendation
-- To work on an emerging interdisciplinary topic
-- To work on a joint grant proposal
Hypothesis generation & testing (question answering)

Topical Trends in KDD [Mei & Zhai 05]
[Figure: top words and probabilities for three themes]
-- gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
-- marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
-- rules 0.0142, association 0.0064, support 0.0053, …

Theme Evolution Graph [Mei & Zhai 05]
[Figure: themes placed along a time axis T from 1999 to 2004; top words and probabilities for each theme]
-- web 0.009, classification 0.007, features 0.006, topic 0.005, …
-- SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
-- mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
-- topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
-- decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
-- classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
-- information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
-- …
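One way to decide which themes in adjacent time spans should be linked in such a graph is to compare their word distributions. The sketch below assumes KL divergence as the distance and links a later theme to an earlier one when the divergence falls below a threshold. The smoothing constant, the threshold, the toy distributions, and the function names are all hypothetical choices for illustration, not values or an API taken from the slide.

```python
import math


def kl_divergence(p, q, epsilon=0.01):
    """KL(p || q) over the union vocabulary, with additive smoothing."""
    vocab = set(p) | set(q)

    def smooth(dist):
        total = sum(dist.get(w, 0.0) + epsilon for w in vocab)
        return {w: (dist.get(w, 0.0) + epsilon) / total for w in vocab}

    ps, qs = smooth(p), smooth(q)
    return sum(ps[w] * math.log(ps[w] / qs[w]) for w in vocab)


def evolution_edges(earlier, later, threshold):
    """Link earlier -> later themes whose divergence is under the threshold."""
    edges = []
    for name_a, theme_a in earlier.items():
        for name_b, theme_b in later.items():
            distance = kl_divergence(theme_b, theme_a)
            if distance < threshold:
                edges.append((name_a, name_b, distance))
    return sorted(edges, key=lambda edge: edge[2])


# Hypothetical toy themes (word -> probability) for two adjacent time spans.
EARLIER = {
    "mixture_clustering": {"mixture": 0.4, "cluster": 0.3, "clustering": 0.3},
    "decision_tree": {"decision": 0.5, "tree": 0.5},
}
LATER = {"topic_models": {"topic": 0.4, "mixture": 0.4, "cluster": 0.2}}

EDGES_FOUND = evolution_edges(EARLIER, LATER, threshold=2.0)
```

With these toy numbers only the clustering theme is linked to the later topic-model theme, because the two share most of their probability mass; the decision-tree theme shares no words with it and falls well above the threshold.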

Imagine we can compare literature in two related areas…
Comparing News Articles [Zhai et al. 04]
Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that “United Nations” is involved in both wars
Collection-specific themes indicate different roles of “United Nations” in the two wars
Cluster 1
-- Common theme: united 0.042, nations 0.04, …
-- Iraq: n 0.03, weapons 0.024, inspections 0.023
-- Afghan: Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02
Cluster 2
-- Common theme: killed 0.035, month 0.032, deaths 0.023
-- Iraq: troops 0.016, hoon 0.015, sanches 0.012
-- Afghan: taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011
Cluster 3
-- …

Task support + ER Question answering
BeeSpace System [He et al. 10]

4. Personalization & Workflow Management
Different users have different tasks → personalization
-- Tracking a user’s history and learning a user’s preferences
-- Exploiting the preferences to customize/optimize the support
-- Allowing a user to define/build special function modules
Workflow management

UCAIR: User-Centered Adaptive IR [Shen et al. 05] When a user clicks on the “back” button after viewing a document, UCAIR reranks unseen results to pull up documents similar to the one the user has viewed
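The reranking behavior described above can be illustrated with a simple similarity-based sketch. It assumes bag-of-words term counts and cosine similarity, which are stand-ins chosen for brevity; the slide does not say which user model or similarity measure UCAIR uses. The result snippets and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a snippet (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rerank_unseen(results, viewed_id, seen_ids):
    """Order unseen results by similarity to the document the user just viewed.

    `results` maps document id -> snippet, in the original ranked order.
    Ties keep their original order because Python's sort is stable.
    """
    viewed = term_vector(results[viewed_id])
    unseen = [doc_id for doc_id in results if doc_id not in seen_ids]
    return sorted(unseen,
                  key=lambda doc_id: cosine(viewed, term_vector(results[doc_id])),
                  reverse=True)


# Hypothetical result list for the ambiguous query "jaguar".
RESULTS = {
    "d1": "jaguar car dealer prices",
    "d2": "jaguar big cat habitat",
    "d3": "jaguar animal rainforest cat",
    "d4": "used car prices and dealer reviews",
}
# The user viewed d1 and pressed "back": pull up the unseen results closest to d1.
RERANKED = rerank_unseen(RESULTS, viewed_id="d1", seen_ids={"d1"})
```

Here the car-related result d4 moves ahead of the two animal-related results, since the viewed document suggests the user means the car.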

5. Collaborative Research
Information/Knowledge/Workflow Sharing
Different users may perform similar tasks → information/knowledge/workflow sharing
-- Capturing user intentions
-- Recommending information/knowledge/workflow
-- How do we solve the problem of privacy?
Massive collaborations?
-- Each user contributes a small amount of knowledge
-- All the knowledge can be combined to infer new knowledge
-- An ESP-like online game for discovery?

Knowledge Synthesis & Discovery Game (inspired by the ESP game)
Task types: Hypothesis Selection, Ontology Mapping, …
Example questions:
-- Which of the following genes is likely associated with foraging behavior?
-- Which of the following concepts can also describe “car”? …
Scoring:
-- Immediate scoring based on consensus
-- Bonus score based on validation in publication
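The two scoring ideas on this slide, an immediate score from consensus and a later bonus from validation, can be sketched as follows. The formula is an assumption made for illustration: a player's immediate score is the fraction of the other players who picked the same answer, and a fixed bonus is added once an answer is validated. The player names, answers, and function names are hypothetical.

```python
from collections import Counter


def consensus_scores(answers):
    """Immediate score: fraction of the other players who gave the same answer."""
    counts = Counter(answers.values())
    others = len(answers) - 1
    return {player: ((counts[choice] - 1) / others if others else 0.0)
            for player, choice in answers.items()}


def add_validation_bonus(scores, answers, validated, bonus=1.0):
    """Add a bonus to every player whose answer was later validated."""
    return {player: score + (bonus if answers[player] in validated else 0.0)
            for player, score in scores.items()}


# Hypothetical round: "Which gene is likely associated with foraging behavior?"
ANSWERS = {"p1": "GeneA", "p2": "GeneA", "p3": "GeneB", "p4": "GeneA"}
IMMEDIATE = consensus_scores(ANSWERS)
# Suppose a later publication supports GeneB: the minority answer earns the bonus.
FINAL = add_validation_bonus(IMMEDIATE, ANSWERS, validated={"GeneB"})
```

Separating the two scores keeps the game responsive while still rewarding a correct minority answer once evidence appears.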

Big Challenges
1. What’s the right system architecture (= sharing model)? Centralized vs. distributed, client vs. server
2. How can we sustain sharing and massive collaboration? Open system, “plug and play”, KSD game, …
3. How can we seamlessly support multiple-level integration?
4. Specific computational challenges:
-- Large-scale NLP, particularly information extraction (large-scale machine learning and knowledge base?)
-- Large-scale semantic mapping (ontology)
-- Interactive fuzzy ER graph mining
-- Scalable inference engines (probabilistic datalog)
-- …

A Possible System Architecture
[Figure: component diagram. Labels: User; User Interface/Workflow Manager; Inference Engine; User Modeling & Personalization; Special Search; Analysis Engine; Hypothesis; Knowledge Base; Search & Navigation; NLP; Machine Learning; Expert Knowledge; ER Graph Mining; …; Information Retrieval; NCBI Genome Databases; Information Extraction; Entities; Relations; Data/Info + Ontology]

References
[1] Xuanhui Wang, Bin Tan, Azadeh Shakery, ChengXiang Zhai. Beyond Hyperlinks: Organizing Information Footprints in Search Logs to Support Effective Browsing. Proceedings of the 18th ACM International Conference on Information and Knowledge Management (CIKM'09), pages 1237-1246, 2009. http://doi.acm.org/10.1145/1645953.1646110
[2] Manish Agrawal, Maryam Karimzadehgan, ChengXiang Zhai. An Online News Recommender System for Social Networks. Proceedings of the ACM SIGIR 2009 Workshop on Search in Social Media, 2009. http://times.cs.uiuc.edu/czhai/pub/sigir09ssm-facebook.pdf
[3] Qiaozhu Mei, ChengXiang Zhai. Discovering Evolutionary Theme Patterns from Text -- An Exploration of Temporal Text Mining. Proceedings of the 2005 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), pages 198-207, 2005. http://doi.acm.org/10.1145/1081870.1081895
[4] ChengXiang Zhai, Atulya Velivelli, Bei Yu. A Cross-Collection Mixture Model for Comparative Text Mining. Proceedings of ACM KDD 2004 (KDD'04), pages 743-748, 2004. http://doi.acm.org/10.1145/1014052.1014150
[5] Xin He, Yanen Li, Radhika Khetani, Barry Sanders, Yue Lu, Xu Ling, ChengXiang Zhai, Bruce Schatz. BSQA: Integrated Text Mining Using Entity Relation Semantics Extracted from Biological Literature of Insects. Nucleic Acids Research, 38(Web Server issue):W175-W181, 2010. http://nar.oxfordjournals.org/cgi/content/full/38/suppl_2/W175
[6] Xuehua Shen, Bin Tan, ChengXiang Zhai. Implicit User Modeling for Personalized Search. Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM'05), pages 824-831, 2005. http://doi.acm.org/10.1145/1099554.1099747
[7] Qiaozhu Mei, ChengXiang Zhai. Generating Impact-Based Summaries for Scientific Literature. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT), pages 816-824, 2008. http://www.aclweb.org/anthology/P/P08/P08-1093.pdf