Challenge Problem: Link Mining
Lise Getoor
University of Maryland, College Park



Link Mining
Data
– Structured Input: Mining graphs and networks
– Structured Output: Extracting entities and relationships from unstructured data
Making use of Links
– For ranking nodes
– For collective classification of nodes
Discovering Links
– Predicting missing links
– Discovering new kinds of links and relationships

Link Mining Tasks
Node Centric
– Labeling/ranking nodes (aka Collective Classification/PageRank)
– Consolidating nodes (aka Entity Resolution)
– Discovering hidden nodes (aka Group Discovery)
Edge Centric
– Labeling/ranking edges
– Predicting the existence of edges
– Predicting the number of edges
– Discovering new relations/paths
Graph/Subgraph Centric
– Discovering frequent subpatterns
– Generative models
– Metadata discovery, extraction, and reformulation
Reference: SIGKDD Explorations Special Issue on Link Mining, December 2005.
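As an illustration of the node-ranking task listed above, here is a minimal PageRank sketch using plain power iteration. It is not taken from the slides: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the four-node toy graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in links.items():
            if targets:
                # Each out-link carries an equal share of the source's rank.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: C is linked to by A, B and D.
toy_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(toy_links)
top = max(ranks, key=ranks.get)
print(top, round(ranks[top], 3))
```

On a real link graph one would normally stop on convergence rather than after a fixed number of iterations, and use a sparse-matrix representation; the dictionary form is kept here only for readability.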

The Link Mining Challenge
Current research mostly focuses on a single task, e.g., node ranking or link prediction
In real data analysis scenarios, we need a mix of all of these capabilities
Many potential domains:
– Bioinformatics
– Social network analysis
– Citation analysis
– Fraud detection
– ….

Challenge Problem Requirements
1. Relevant to data mining and based on analysis of large volumes of data (including web, text, images, links, etc.), preferably publicly available data.
2. Important and difficult, so that its solution will advance the field and benefit society.
3. Interesting and exciting, to attract researchers, public and press attention, and funding. This requires a simple and concise problem statement.
4. The required domain knowledge should be relatively accessible.
5. Other groups are not already actively working on this problem.

Domain
Evangelists: “Goal to distribute free encyclopedia to every single person on the planet in their own language”
– Jimmy Wales, Wikipedia founder
Detractors: “Wikipedia has gone from a nearly perfect anarchy to an anarchy with gang rule.”
– Larry Sanger, Wikipedia co-founder
Know It All: Can Wikipedia Conquer Expertise? Stacy Schiff, New Yorker, July 31, 2006
Collaboratively edited, user-contributed encyclopedia
Largest example of participatory journalism to date.
Mantra: maintain a neutral point of view (NPOV)
“Disaster is not too strong a word for wikipedia… the site is infested with moonbats”
– Eric Raymond, open-source movement figure

Task #1: Descriptive Modeling
Modeling Growth of Wikipedia

Task #2: User Classification
Wiki Gnome: a user who keeps a low profile, fixing typos, poor grammar and broken links
Wiki Troll: a disruptive user who persistently violates the site’s guidelines
Gnome vs. Troll

Task #3: Text Classification
Three Wikipedia Content Guidelines:
1. NPOV: represent views fairly and without bias
2. Verifiability
3. No original research

Task #4: Link Prediction/Completion
Identify where links should exist
As Wikipedia grows, it becomes harder for any given author to know which other relevant articles they can or should link to from a given article. A method that helps with this (link suggestion, auto-linking, etc.) would potentially be very useful.
Evaluation: Generate a dataset by taking a given set of Wikipedia pages and removing some of the existing links, then see whether a system can identify those places and suggest appropriate links.
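The evaluation protocol described above (remove some existing links, then check whether a system recovers them) can be sketched as follows. This is only an illustration under stated assumptions: links are treated as undirected for simplicity, although Wikipedia links are directed; the common-neighbour score is one simple baseline, not a method proposed in the slides; and the helper names and the toy page graph are hypothetical.

```python
import random
from itertools import combinations


def hold_out_links(edges, fraction, seed=0):
    """Split edges into (kept, removed); `removed` plays the role of deleted links."""
    rng = random.Random(seed)
    shuffled = sorted(edges)
    rng.shuffle(shuffled)
    k = max(1, int(len(shuffled) * fraction))
    return set(shuffled[k:]), set(shuffled[:k])


def common_neighbour_scores(kept_edges):
    """Score each unlinked pair of pages by the number of neighbours they share."""
    neighbours = {}
    for a, b in kept_edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    scores = {}
    for a, b in combinations(sorted(neighbours), 2):
        if b not in neighbours[a]:
            shared = len(neighbours[a] & neighbours[b])
            if shared:
                scores[(a, b)] = shared
    return scores


def recall_at_k(scores, removed, k):
    """Fraction of removed links that appear among the k best-scored suggestions."""
    ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))[:k]
    return len(set(ranked) & removed) / len(removed)


# Hypothetical toy link graph; each edge is stored as a sorted pair of page titles.
toy_edges = {tuple(sorted(edge)) for edge in [
    ("Data mining", "Machine learning"),
    ("Data mining", "Link analysis"),
    ("Link analysis", "PageRank"),
    ("Link analysis", "Graph theory"),
    ("Graph theory", "PageRank"),
    ("Machine learning", "Link analysis"),
    ("Data mining", "Graph theory"),
    ("Machine learning", "PageRank"),
]}
kept, removed = hold_out_links(toy_edges, fraction=0.25)
scores = common_neighbour_scores(kept)
recall = recall_at_k(scores, removed, k=3)
print(f"removed {len(removed)} links, recall@3 = {recall:.2f}")
```

A fuller evaluation would also use the article text to decide where in a page a suggested link belongs, which this graph-only sketch ignores.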

Other Link Mining Tasks
Trust/Reputation analysis
– “Gives no privilege to those who know what they are talking about”, William Connolley, climate modeler and Wikipedia admin
Social network analysis
– Identification of communities
Accuracy
– Nature comparison with Britannica (4-3 error ratio)
Misuse
– Vandalism and self-promotion
Coverage
– Which areas aren’t covered, or are poorly covered/linked?

But none of these are grand challenges…
According to Wikipedia

The Wikipedia Grand Challenge
The Wikipedia Test: Given a collection of entries constructed via participatory journalism (PJ) vs. link mining (LM),
– Can you distinguish between PJ and LM?
– Which is better?
Evaluation:
– Via a panel of human experts
– Via PageRank
Solution will require a variety of integrated link mining capabilities

$$ Already Available…
Hutter prize: ,000 € ≈ $64,