Challenges with XML Challenges with Semi-Structured collections Ludovic Denoyer University of Paris 6 Bridging the gap between research communities.

Presentation transcript:


Outline
• Motivations
• XML Mining Challenge
• Graph Labelling/WebSpam Challenge
• Conclusion and future work

General Idea
• The two challenges were proposed to attract researchers from different domains:
  ◦ Mainly Machine Learning and Information Retrieval
• Show IR researchers that ML methods can solve some of their problems
• Show ML researchers that IR tasks provide an interesting context for developing new general Machine Learning algorithms

General Idea
• Find generic tasks that correspond to:
  ◦ New real applications in IR
  ◦ New generic problems in ML
• To work together…
• To pool efforts…
• To solve these tasks faster…
• To compare the approaches…

Open questions in ML
• Structure+content classification
• Classification of interdependent variables
• Structured output classification

Open questions in IR
• ML side: Structure+content classification; Classification of interdependent variables; Structured output classification
• IR side: Semi-structured documents (XML); Interconnected documents; Heterogeneous collections

Motivations
• ML side: Structured input classification; Classification of interdependent variables; Structured output classification
• IR side: Semi-structured documents (XML); Hyperlinked documents; Heterogeneous collections
• Challenge: XML Mining Challenge

Motivations
• ML side: Structured input classification; Classification of interdependent variables; Structured output classification
• IR side: Semi-structured documents (XML); Hyperlinked documents; Heterogeneous collections
• Challenges: WebSpam Challenge; XML Mining Challenge

Motivations [diagram: Information Retrieval, Machine Learning, Data Mining, Web; Proposed Challenges]

Challenges
• XML Mining Challenge
  ◦ « Bridging the gap between Machine Learning and Information Retrieval »
• Graph Labelling Challenge
  ◦ Application to WebSpam detection

Outline
• Motivations
• XML Mining Challenge
• WebSpam Challenge
• Conclusion and future work

XML Mining Challenge
• Launched in 2005
  ◦ PASCAL (Network of Excellence in ML)
  ◦ DELOS (Network of Excellence in Digital Libraries)
• Organized as an INEX track
  ◦ INEX: Initiative for the Evaluation of XML IR
    ▪ More than 50 different institutes involved
• One event each year at INEX (December)
• Biggest INEX track (after ad-hoc retrieval)
• We are currently launching the 4th XML Mining track

XML Mining Challenge
• ML goal
  ◦ Classification of large collections of structures
• IR goal
  ◦ Classification of semi-structured collections
    ▪ Using both structure and content

Underlying idea
• Using structure and content information
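The idea of using structure and content together can be illustrated with a small feature extractor. This is only a sketch under stated assumptions: the function name `structure_content_features`, the feature naming scheme, and the sample document are hypothetical and do not come from any track submission.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_content_features(xml_text):
    """Count features that mix tag paths (structure) and words (content).

    Hypothetical illustration: only element text is used, tail text is ignored.
    """
    root = ET.fromstring(xml_text)
    features = Counter()

    def walk(node, path):
        current = f"{path}/{node.tag}"
        features[f"path={current}"] += 1           # structure-only feature
        for word in (node.text or "").lower().split():
            features[f"word={word}"] += 1           # content-only feature
            features[f"{current}:{word}"] += 1      # word tied to its location
        for child in node:
            walk(child, current)

    walk(root, "")
    return features


# Made-up sample document, for illustration only.
doc = (
    "<article><title>XML mining</title>"
    "<body><p>mining structured documents</p></body></article>"
)
feats = structure_content_features(doc)
```

A bag like `feats` could then be fed to any standard classifier; whether path-word features help is an empirical question that depends on the collection.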

Collections
• Different collections have been used:
  ◦ 2005
    ▪ Artificial collection
    ▪ Movie collection
  ◦ 2006
    ▪ Scientific articles
    ▪ Wikipedia XML-based collection
  ◦ 2007
    ▪ Wikipedia XML-based collection
    ▪ 96,000 documents in XML
    ▪ 21 categories

Submitted papers

Large variety of models
• Different existing ML methods have been applied:
  ◦ Self-Organizing Map
  ◦ SVM
  ◦ (Graph) Neural Network
  ◦ CRF
  ◦ Incremental models
  ◦ …
• Some new models have been developed

Short Typology
• See the report on the XML Mining track – SIGIR Forum

Results: Classification
Authors                   Method                                  Micro recall   Macro recall
Zhang et al.              Kernel + SVM
L. M. de Campos et al.    Graphical Models – Bayesian networks
Meenakshi et al.          Negative Category Document Frequency
…
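The two column headings, micro recall and macro recall, can be made concrete with a short sketch. It assumes single-label classification (one category per document); the function name and the toy labels are hypothetical and are not track results.

```python
from collections import Counter


def micro_macro_recall(y_true, y_pred):
    """Return (micro recall, macro recall) for single-label predictions."""
    support = Counter(y_true)                                    # documents per true category
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct documents per category
    micro = sum(hits.values()) / len(y_true)                     # pooled over all documents
    macro = sum(hits[c] / support[c] for c in support) / len(support)  # mean of per-category recall
    return micro, macro


# Toy data: category "a" is frequent, category "b" is rare.
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "b", "a"]
micro, macro = micro_macro_recall(y_true, y_pred)
```

Here micro recall is 5/6 while macro recall is 0.75, because the single error falls on the rare category and macro averaging weights every category equally.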

XML Structure Mapping task
• Proposed in 2006
• ML task: structured output classification
  ◦ Learning to transform trees
• IR application: dealing with heterogeneous collections
  ◦ Learning to transform heterogeneous documents to a mediated schema

XML Structure Mapping
• A generic ML model able to solve this task has many potential applications:
  ◦ Conversion between file formats
  ◦ Automatic translation
  ◦ Natural language processing
  ◦ …

Conclusion
• Existing structured input models (kernels, …) have been tested on this task
• New specific models have been developed
• Difficult to know which model is the best
  ◦ Need to wait one more year
• The challenge has attracted researchers from different communities
  ◦ Each year, ML researchers come to INEX and:
    ▪ Discover a new domain
    ▪ Present advanced ML models to other researchers
• The collections are freely available and have been downloaded a hundred times
  ◦ …some articles are starting to appear in different conferences…

WebSpam Challenge
• PASCAL « Graph Labelling Challenge »
• Organized by:
  ◦ Ricardo BAEZA-YATES (Yahoo! Research Barcelona)
  ◦ Carlos CASTILLO (Yahoo! Research Barcelona)
  ◦ Brian DAVISON (Lehigh University, USA)
  ◦ Ludovic DENOYER (University Paris 6, France)
  ◦ Patrick GALLINARI (University Paris 6, France)
• The Web Spam Challenge 2007 was supported by PASCAL
• The Web Spam Challenge 2007 was also supported by the DELIS EU-FET research project

WebSpam Challenge
• Three events:
  ◦ AirWeb workshop 2007 (WWW’07)
    ▪ May 2007
    ▪ Web-oriented part
  ◦ GraphLab workshop 2007 – ECML/PKDD
    ▪ September 2007
    ▪ ML-oriented part
  ◦ AirWeb workshop 2008 (WWW’08 ?)

WebSpam Challenge
• IR (Web) task:
  ◦ Detection of web spam
    ▪ Spam = any attempt to get “an unjustifiably favorable relevance or importance score for some Web pages, considering the page’s true value”

Example of spam

WebSpam Challenge
• ML task:
  ◦ Graph labelling
  ◦ Classification of interdependent variables
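To make "classification of interdependent variables" concrete, here is a minimal sketch of one generic idea: start from per-node spam scores (for example from a content-only classifier) and repeatedly pull each score toward the average of its neighbours. This is an illustrative assumption, not the method of any participant; `smooth_scores`, `alpha`, and the toy graph are all hypothetical.

```python
def smooth_scores(scores, edges, alpha=0.5, iterations=10):
    """Blend each node's own score with the mean score of its graph neighbours.

    scores: dict node -> initial spam score in [0, 1]
    edges:  iterable of (u, v) pairs, treated as undirected links
    alpha:  weight kept on the node's own initial score
    """
    neighbours = {node: set() for node in scores}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    current = dict(scores)
    for _ in range(iterations):
        updated = {}
        for node, own in scores.items():
            if neighbours[node]:
                mean = sum(current[n] for n in neighbours[node]) / len(neighbours[node])
                updated[node] = alpha * own + (1 - alpha) * mean
            else:
                updated[node] = own  # isolated node: nothing to borrow from
        current = updated
    return current


# Toy graph: h3 looks borderline on its own but links to two likely-spam hosts.
initial = {"h1": 0.9, "h2": 0.8, "h3": 0.4, "h4": 0.1}
links = [("h1", "h2"), ("h1", "h3"), ("h2", "h3")]
smoothed = smooth_scores(initial, links)
```

In this toy example the score of `h3` rises above 0.5 because of its neighbours, while the isolated `h4` keeps its initial score.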

Collection
• A collection of interconnected Web pages
  ◦ 77 million pages
  ◦ About 11,000 hosts
  ◦ Manually labeled as spam or normal (host level)
• Blinded evaluation of models

Participants

Participants
• Why such an increase in ML participants at GraphLab?

GraphLab workshop at ECML/PKDD 2007
• The collection has been fully preprocessed by the organizers
  ▪ Each node corresponds to a vector (in SVMLight format) based on the word distribution in each host/page
  ▪ The contingency matrix has been built
• One small collection with 9,000 nodes
• One large collection with 400,000 nodes
• 10% for training / 20% for validation / 70% for test
• You can easily apply your « relational » models to this corpus without knowing anything about text processing
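As a rough sketch of what working with such a preprocessed corpus might look like, the code below parses one line in the SVMLight sparse format (`<label> <index>:<value> ... # comment`) and cuts node indices into 10% / 20% / 70% parts. The sample line, the random shuffling, and the function names are assumptions; the slides do not say how the official split was drawn or how labels were encoded.

```python
import random


def parse_svmlight_line(line):
    """Parse '<label> <index>:<value> ... # comment' into (label, {index: value}).

    Sketch only: 'qid:' tokens used for ranking data are not handled.
    """
    line = line.split("#", 1)[0].strip()   # drop the optional trailing comment
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


def split_indices(n, seed=0):
    """Shuffle range(n) and cut it into 10% train, 20% validation, 70% test."""
    order = list(range(n))
    random.Random(seed).shuffle(order)     # assumption: a random split
    n_train = n // 10
    n_valid = n // 5
    train = order[:n_train]
    valid = order[n_train:n_train + n_valid]
    test = order[n_train + n_valid:]
    return train, valid, test


# Made-up line, not taken from the challenge data.
label, features = parse_svmlight_line("+1 3:0.5 17:2 # host 42")
train, valid, test = split_indices(9000)
```

With 9,000 nodes this gives 900 training, 1,800 validation, and 6,300 test nodes.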

Results: Small collection (9,000 nodes)
Participants        Methods                     AUC
Abernethy et al.    Semi-supervised learning    95.2
Tang et al.         SVM                         95.1
Filoche et al.      Stacked learning            92.7
Csalogany et al.    C
Tian et al.         Semi-supervised             86.3
…

Results: Large collection (400,000 nodes)
Participants        Methods                     AUC
Weiss et al.        Semi-supervised learning    99.8
Filoche et al.      Stacked learning            99.1
Tang et al.         SVM                         98.9
…
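The AUC column can be read as the probability that a randomly chosen spam node receives a higher score than a randomly chosen normal node, with ties counted as one half. The sketch below computes it by direct pair counting; the function name and the toy scores are hypothetical, and the assumption that the tables report this value multiplied by 100 is mine, not the slides'.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    labels: 1 for the positive class (spam), 0 for the negative class (normal)
    scores: higher means more likely positive
    Quadratic in the number of examples, so suitable for small inputs only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one spam node is ranked below one normal node.
value = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Three of the four spam/normal pairs are ordered correctly here, so `value` is 0.75.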

Conclusion on WebSpam
• Different pure ML methods were used « as is »
  ◦ Semi-supervised methods
  ◦ Stacked learning
  ◦ …
• Very good performance of ML models (equivalent to « hand-made » Web models)

Conclusion on WebSpam
• Development of an ML benchmark for graph labelling
• WebSpam also raises interesting ML problems that could be integrated into the challenge
  ◦ Learning with few examples
  ◦ Large-scale problems
  ◦ Adversarial Machine Learning
  ◦ …

Conclusion
• The two challenges have proposed benchmarks for IR/Web applications and also for generic ML problems
• It is possible to mix researchers from different communities
• ML researchers dislike cleaning real collections
  ◦ You have to preprocess the collections
• ML researchers dislike large collections
  ◦ But this is changing…

Future work
• XML Mining will continue this year
  ◦ See
  ◦ The corpus will be preprocessed?
• The WebSpam challenge will also continue
  ◦ See
  ◦ We will see after WWW’08 whether we propose another GraphLab workshop (see
  ◦ Note that a new, larger corpus has been developed in 2008

Thank you for your attention (Thank you to the participants of the different challenges that are in the room)