A Researcher’s Workbench in 2020: Intelligent Information Systems for Knowledge Synthesis and Discovery ChengXiang (“Cheng”) Zhai Department of Computer.


1 A Researcher’s Workbench in 2020: Intelligent Information Systems for Knowledge Synthesis and Discovery
ChengXiang (“Cheng”) Zhai
Department of Computer Science, Institute for Genomic Biology, Graduate School of Library and Information Science, Department of Statistics
University of Illinois at Urbana-Champaign
http://www.cs.uiuc.edu/homes/czhai

2 Assuming data sharing isn’t a problem, what kind of systems are needed to effectively support representing, integrating, and reasoning over human knowledge? What are the key computational challenges?

3 Computer-Aided Research (CAR) in 2020
[Diagram: public data/info/knowledge sources and personal data/info/knowledge sources connected over a network]
1. Multi-level integration of data/info/knowledge
2. Multimode info access
3. Research task support
4. Personalized CAR
5. Collaborative research

4 1. We need multiple levels of integration

5 Five Levels of Integration
Level 1: “Syntactic” integration of multiple sources
–Scalable, robust, but minimum support for discovery
Level 2: Semantic integration (ontology)
–Scalable, less robust, better support for discovery
Level 3: Synthesis of knowledge (entities, relations)
–Less scalable, not robust, support for interactive discovery
Level 4: Synthesis of knowledge + Inference rules
–Only applicable to a limited domain, but potentially support automatic discovery
Level 5: Specialized discovery model
–Automatic hypothesis testing, but limited to a special discovery/prediction task

6 Multi-level support is needed because…
–Knowledge extraction is far from 100% accurate (NLP is difficult)
–Interpretation of knowledge is inherently context-sensitive, and low-level support is needed for context and provenance
–Automation-scalability tradeoff will not disappear (soon)
–…

7 Automation-Scalability Tradeoff
[Figure: systems plotted by automation of discovery against scalability/generality. Labeled systems: specialized statistical prediction models; logic-based inference systems; ER graph analysis engine; ontology-based semantic integration; federated search engines. Other labels: Goal; “Ontology-Free” integration; “Beyond ontology” integration]

8 Interactive ER Graph Analysis
The extracted entities and relations form a weighted graph
Need to develop techniques to mine the graph for knowledge
–Store graphs
–Index graphs
–Mining algorithms (neighbor finding, path finding, entity comparison, outlier detection, frequent subgraphs, …)
–Mining language

9 Example of Interactive Graph Mining
[Figure: ER graph linking genes A1, A1’, A2, A3, A4, A4’, A5 and behaviors B1, B2, B3, B4 through edges labeled isa, Co-occur-fly, Co-occur-mos, Co-occur-bee, Orth-mos, orth, and Reg]
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}
X = PathBetween({A4, A4’}, B4, {co-occur, reg, isa})
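
The query steps on slide 9 could be sketched in Python roughly as follows. This is only an illustration: the edge list is a hypothetical toy graph (the figure's actual edges are not recoverable from the transcript), `neighbor_of` is assumed to mean single-hop neighbors, and `path_between` is assumed to mean a shortest path found by breadth-first search. None of these names or semantics come from the original system.

```python
from collections import defaultdict, deque

# Hypothetical toy edges (node, node, relation); NOT the edges of the slide's figure.
EDGES = [
    ("B4", "B1", "isa"), ("B4", "B2", "co-occur"), ("B4", "B3", "co-occur"),
    ("A1", "B1", "co-occur"), ("A2", "B2", "co-occur"), ("A3", "B3", "co-occur"),
    ("A1", "A1'", "orth"),
    ("A1", "A4", "reg"), ("A1'", "A4'", "reg"), ("A5", "A4", "reg"),
]

# Undirected adjacency list: node -> set of (neighbor, relation).
ADJ = defaultdict(set)
for u, v, rel in EDGES:
    ADJ[u].add((v, rel))
    ADJ[v].add((u, rel))


def node_type(node):
    """Assumed naming convention: 'A…' nodes are genes, 'B…' nodes are behaviors."""
    return "Gene" if node.startswith("A") else "Behavior"


def neighbor_of(sources, wanted_type, relations):
    """Single-hop neighbors of `sources` that have `wanted_type`,
    reached through an edge whose relation is in `relations`."""
    if isinstance(sources, str):
        sources = {sources}
    found = set()
    for src in sources:
        for nbr, rel in ADJ[src]:
            if rel in relations and node_type(nbr) == wanted_type:
                found.add(nbr)
    return found - set(sources)


def path_between(sources, target, relations):
    """Shortest path (BFS) from any source to `target` using only `relations`."""
    queue = deque([s] for s in sources)
    seen = set(sources)
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nbr, rel in ADJ[path[-1]]:
            if rel in relations and nbr not in seen:
                seen.add(nbr)
                queue.append(path + [nbr])
    return None


x = neighbor_of("B4", "Behavior", {"co-occur", "isa"})
y = neighbor_of(x, "Gene", {"co-occur", "orth"})
y |= neighbor_of(y, "Gene", {"orth"})   # follow orthologs one more hop
y |= {"A5", "A6"}                       # the user adds genes by hand (step 3)
z = neighbor_of(y, "Gene", {"reg"})
path = path_between(z, "B4", {"co-occur", "reg", "isa"})
print(sorted(x), sorted(y), sorted(z), path)
```

Because each step returns a plain set, the user can inspect, extend, or prune an intermediate result before issuing the next query, which is the interactive style the slide illustrates.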

10 Inference-Based Discovery
Encode all kinds of knowledge in the same knowledge representation language
Perform logic inferences
Example
Regulate(GeneA, GeneB, ContextC) [Text mining]
SeqSimilar(GeneA, GeneA’) [Sequence mining]
Regulate(X,Y,C) ← Regulate(Z,Y,C) & SeqSimilar(X,Z) [Human knowledge]
⇒ Regulate(GeneA’, GeneB, ContextC)
ADD: InPathway(GeneB, P1)
InPathway(X,P) ← Regulate(X,Y,C) & InPathway(Y,P) [Human knowledge]
⇒ InvolvedInPathway(GeneA’, P1)
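
As a rough illustration of the inference example on slide 10, the two rules can be applied by naive forward chaining. The sketch below is not the talk's inference engine. It assumes facts are stored as tuples, that SeqSimilar is symmetric, and that the pathway conclusion uses the rule head InPathway as written in the rule. The function name `forward_chain` is made up for this example.

```python
def forward_chain(facts):
    """Apply the two slide-10 rules repeatedly until no new fact is derived."""
    known = set(facts)
    while True:
        regulate = [f[1:] for f in known if f[0] == "Regulate"]
        similar = {f[1:] for f in known if f[0] == "SeqSimilar"}
        similar |= {(b, a) for a, b in similar}  # assumption: similarity is symmetric
        pathway = [f[1:] for f in known if f[0] == "InPathway"]

        derived = set()
        # Rule 1: Regulate(X,Y,C) <- Regulate(Z,Y,C) & SeqSimilar(X,Z)
        for z, y, c in regulate:
            for x, z2 in similar:
                if z2 == z:
                    derived.add(("Regulate", x, y, c))
        # Rule 2: InPathway(X,P) <- Regulate(X,Y,C) & InPathway(Y,P)
        for x, y, c in regulate:
            for y2, p in pathway:
                if y2 == y:
                    derived.add(("InPathway", x, p))

        if derived <= known:      # fixpoint reached: nothing new was derived
            return known
        known |= derived


facts = {
    ("Regulate", "GeneA", "GeneB", "ContextC"),   # from text mining
    ("SeqSimilar", "GeneA", "GeneA'"),            # from sequence mining
    ("InPathway", "GeneB", "P1"),                 # the added fact
}
closure = forward_chain(facts)
for fact in sorted(closure):
    print(fact)
```

Naive forward chaining re-scans all facts on every pass, which is fine for a handful of facts but is not how a scalable engine would work.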

11 Integration of Expert Knowledge
How can we combine expert knowledge with knowledge extracted from literature?
Possible strategies:
–Interactive mining (human knowledge is used to guide the next step of mining)
–Inference-based integration
–Trainable programs (focused miners, each targeting a certain kind of knowledge)

12 2. We need multiple-mode information access
[Diagram: Researcher; Querying/Browsing; Recommendation]
How can we connect the right information with the right user at the right time?

13 Collaborative Surfing [Wang et al. 09]
–Search log organized as a topic map
–A sustained way of collaborative surfing
–Browsing and querying are tightly integrated

14 News Recommender for Facebook [Agrawal et al. 09]
Recommendation of research papers?

15 3. We need to go beyond information access to support tasks
Research topic identification
–“hot topic” retrieval, interdisciplinary topic retrieval, topic recommendation
Literature review
–automatic survey generation
Collaborator recommendation
–To work on an emerging interdisciplinary topic
–To work on a joint grant proposal
Hypothesis generation & testing (question answering)

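16 Topical Trends in KDD [Mei & Zhai 05]
[Figure: theme trends over time, each theme labeled with its top words and probabilities]
Theme: gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
Theme: marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
Theme: rules 0.0142, association 0.0064, support 0.0053, …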

17 Theme Evolution Graph [Mei & Zhai 05]
[Figure: themes as nodes along a time axis T (1999, 2000, 2001, 2002, 2003, 2004), connected by evolution edges; each theme is labeled with its top words and probabilities]
Theme: SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
Theme: decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
Theme: classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
Theme: information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
Theme: web 0.009, classification 0.007, features 0.006, topic 0.005, …
Theme: mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
Theme: topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …

18 Comparing News Articles [Zhai et al. 04]
Iraq War (30 articles) vs. Afghan War (26 articles)
Common Theme, Cluster 1: united 0.042, nations 0.04, …
Common Theme, Cluster 2: killed 0.035, month 0.032, deaths 0.023, …
Common Theme, Cluster 3: …
Iraq Theme, Cluster 1: n 0.03, Weapons 0.024, Inspections 0.023, …
Iraq Theme, Cluster 2: troops 0.016, hoon 0.015, sanches 0.012, …
Iraq Theme, Cluster 3: …
Afghan Theme, Cluster 1: Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, …
Afghan Theme, Cluster 2: taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, …
Afghan Theme, Cluster 3: …
The common theme indicates that “United Nations” is involved in both wars
Collection-specific themes indicate different roles of “United Nations” in the two wars
Imagine we can compare literature in two related areas…

19 BeeSpace System [He et al. 10] Task support + ER Question answering

20 4. Personalization & Workflow Management
Different users have different tasks → personalization
–Tracking a user’s history and learning a user’s preferences
–Exploiting the preferences to customize/optimize the support
–Allowing a user to define/build special function modules
Workflow management

21 UCAIR: User-Centered Adaptive IR [Shen et al. 05]
When a user clicks on the “back” button after viewing a document, UCAIR reranks unseen results to pull up documents similar to the one the user has viewed
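
The reranking behavior described on slide 21 can be illustrated with a very small sketch. This is not UCAIR's actual model, which the slide does not specify. The example simply assumes that similarity is the cosine between bag-of-words term counts and that the unseen results are re-sorted by it. The names `cosine` and `rerank_unseen` and the sample texts are made up.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between whitespace-tokenized term-count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def rerank_unseen(viewed_doc, unseen_results):
    """Move unseen results that resemble the viewed document to the top.
    sorted() is stable, so ties keep their original ranking order."""
    return sorted(unseen_results,
                  key=lambda doc: cosine(viewed_doc, doc),
                  reverse=True)


viewed = "jaguar car dealer prices"
unseen = [
    "jaguar habitat rainforest",
    "jaguar car reviews and prices",
    "python tutorial",
]
reranked = rerank_unseen(viewed, unseen)
print(reranked)
```

A real system would more likely blend this implicit feedback with the original retrieval score rather than replace the ranking outright.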

22 5. Collaborative Research: Information/Knowledge/Workflow Sharing
Different users may perform similar tasks → information/knowledge/workflow sharing
–Capturing user intentions
–Recommending information/knowledge/workflow
–How do we solve the problem of privacy?
Massive collaborations?
–Each user contributes a small amount of knowledge
–All the knowledge can be combined to infer new knowledge
–An ESP-like online game for discovery?

23 Knowledge Synthesis & Discovery Game (inspired by the ESP game)
[Diagram: task panels labeled Hypothesis Selection, Ontology Mapping, … (shown twice)]
–Immediate scoring based on consensus
–Bonus score based on validation in publication
Example questions:
–Which of the following genes is likely associated with foraging behavior?
–Which of the following concepts can also describe “car”?
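
One way the two scoring ideas on slide 23 might fit together is sketched below. The slide gives no formula, so everything here is an assumption made for illustration: consensus points equal the number of other players who picked the same answer, and a fixed bonus is added if that answer is later validated. The function name `score_round`, the bonus value, and the sample answers are all hypothetical.

```python
from collections import Counter


def score_round(answers, validated=None, bonus=5):
    """answers: dict mapping player -> chosen option.
    validated: the option later confirmed in a publication, if any.
    Returns a dict mapping player -> points for this round."""
    counts = Counter(answers.values())
    scores = {}
    for player, choice in answers.items():
        # Immediate consensus score: how many OTHER players agree (assumed rule).
        points = counts[choice] - 1
        # Delayed bonus once the answer is validated (assumed fixed bonus).
        if validated is not None and choice == validated:
            points += bonus
        scores[player] = points
    return scores


answers = {"p1": "geneX", "p2": "geneX", "p3": "geneY"}
immediate = score_round(answers)
after_validation = score_round(answers, validated="geneY")
print(immediate, after_validation)
```

In this sketch a player who disagrees with the crowd earns nothing at first but can still be rewarded later, which keeps minority hypotheses worth proposing.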

24 Big Challenges
[Diagram repeated from slide 3: public and personal data/info/knowledge sources connected over a network, annotated with 1. Multi-level integration of data/info/knowledge 2. Multimode info access 3. Research task support 4. Personalized CAR 5. Collaborative research]
1. What’s the right system architecture (= sharing model?)? Centralized vs. distributed, client vs. server
2. How can we sustain sharing and massive collaboration? Open system, “plug and play”, KSD game, …
3. How can we seamlessly support multiple-level integration?
4. Specific computational challenges:
–Large-scale NLP, particularly information extraction (→ large-scale machine learning and knowledge base?)
–Large-scale semantic mapping (ontology)
–Interactive fuzzy ER graph mining
–Scalable inference engines (probabilistic datalog)
–…

25 A Possible System Architecture
[Diagram components: Data/Info + Ontology (NCBI, Genome Databases, …); Information Retrieval; Search & Navigation; Special Search; NLP; Machine Learning; Information Extraction; Entities; Relations; ER Graph Mining; Analysis Engine; Expert Knowledge; Hypothesis; Knowledge Base; Inference Engine; User Modeling & Personalization; User Interface/Workflow Manager; User]

26 References
[1] Xuanhui Wang, Bin Tan, Azadeh Shakery, ChengXiang Zhai, Beyond Hyperlinks: Organizing Information Footprints in Search Logs to Support Effective Browsing, Proceedings of the 18th ACM International Conference on Information and Knowledge Management (CIKM'09), pages 1237-1246, 2009. http://doi.acm.org/10.1145/1645953.1646110
[2] Manish Agrawal, Maryam Karimzadehgan, ChengXiang Zhai, An Online News Recommender System for Social Networks, Proceedings of ACM SIGIR 2009 workshop on Search in Social Media, 2009. http://times.cs.uiuc.edu/czhai/pub/sigir09ssm-facebook.pdf
[3] Qiaozhu Mei, ChengXiang Zhai, Discovering Evolutionary Theme Patterns from Text -- An Exploration of Temporal Text Mining, Proceedings of the 2005 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), pages 198-207, 2005. http://doi.acm.org/10.1145/1081870.1081895
[4] ChengXiang Zhai, Atulya Velivelli, Bei Yu, A cross-collection mixture model for comparative text mining, Proceedings of ACM KDD 2004 (KDD'04), pages 743-748, 2004. http://doi.acm.org/10.1145/1014052.1014150
[5] Xin He, Yanen Li, Radhika Khetani, Barry Sanders, Yue Lu, Xu Ling, ChengXiang Zhai, Bruce Schatz, BSQA: integrated text mining using entity relation semantics extracted from biological literature of insects, Nucleic Acids Research, 38(Web Server issue):W175-W181, 2010. http://nar.oxfordjournals.org/cgi/content/full/38/suppl_2/W175
[6] Xuehua Shen, Bin Tan, ChengXiang Zhai, Implicit User Modeling for Personalized Search, Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM'05), pages 824-831, 2005. http://doi.acm.org/10.1145/1099554.1099747
[7] Qiaozhu Mei, ChengXiang Zhai, Generating Impact-Based Summaries for Scientific Literature, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08:HLT), pages 816-824, 2008. http://www.aclweb.org/anthology/P/P08/P08-1093.pdf

