Measuring Complexity of Web Pages Using GATE


Prepared by: The Who

Subject 1: Can more meaningful indicators be extracted from the resources (web pages), e.g. complexity, diversity, or others such as sentiment?

Complexity Definition: How to learn the features associated with how difficult a resource is to understand.

Our Vision
To employ entities linked to diverse contexts as a basis for determining the complexity of a web page by:
Gathering sets of web pages from different domains
Annotating the complexity of the pages (crowdsourcing)
Obtaining the set of named entities on each page (GATE)
Determining a complexity score for each entity based on the pages in which it appears (centrality / text ranking / entity authority metrics: how many times it appears in the page vs. how many entities are in that page, and what the page complexity score is)
Employing the set of weighted entities to predict a score for new pages
Correlating the outputs with the commonly employed sentence metrics
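One possible reading of the entity weighting described above can be sketched in Python. This is only an illustration under assumptions: the slides do not give a formula, so the weight used here (an entity's share of all entity mentions on a page) and the names `weighted_entity_scores` and `pages` are hypothetical.

```python
from collections import Counter, defaultdict


def weighted_entity_scores(pages):
    """Score each entity as a weighted average of page complexity scores.

    pages: list of (entity_mentions, page_complexity) tuples, where
    entity_mentions is a list of entity strings, repeats included.
    Assumed weight: mentions of the entity / all entity mentions on the page.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for mentions, complexity in pages:
        if not mentions:
            continue  # a page without entities contributes nothing
        counts = Counter(mentions)
        total = len(mentions)
        for entity, count in counts.items():
            share = count / total
            weighted_sum[entity] += share * complexity
            weight_total[entity] += share
    return {e: weighted_sum[e] / weight_total[e] for e in weighted_sum}


# Toy input, invented for illustration only.
pages = [
    (["Oscar Wilde", "Oscar Wilde", "Dublin"], 0.4),
    (["Dublin", "Scala"], 0.8),
]
scores = weighted_entity_scores(pages)
```

With this weighting, an entity that dominates a page inherits more of that page's annotated complexity than an entity mentioned only in passing.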

Proposed approach

Run entity and term recognition on a sample from the data set.
1. Create a datastore for the sample.

2. Populate the corpus with the sample and save it to the datastore.

3. Run TermRaider (it already contains the ANNIE Gazetteer for entity recognition).

4. Search for a specific annotation type.

5. Export the Terms and annotation set

Scoring
Score the complexity of the entities. This score is based on the average complexity score of the documents in which the entity appears.
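The entity scoring rule can be illustrated with a short Python sketch. The function name `entity_scores`, the input layout, and the toy values are assumptions made for illustration; only the averaging rule comes from the slide.

```python
from collections import defaultdict
from statistics import mean


def entity_scores(doc_entities, doc_complexity):
    """Average the complexity of every document an entity appears in.

    doc_entities: dict mapping document id -> list of entity strings.
    doc_complexity: dict mapping document id -> annotated complexity score.
    """
    per_entity = defaultdict(list)
    for doc_id, entities in doc_entities.items():
        # set() so that an entity counts once per document
        for entity in set(entities):
            per_entity[entity].append(doc_complexity[doc_id])
    return {entity: mean(values) for entity, values in per_entity.items()}


# Toy data, invented for illustration only.
doc_entities = {"a": ["Malawi", "Africa"], "b": ["Africa", "Goalball"]}
doc_complexity = {"a": 0.5, "b": 0.7}
example_entity_scores = entity_scores(doc_entities, doc_complexity)
```

Here "Africa" appears in both documents, so its score is the mean of the two document scores, while "Malawi" and "Goalball" each take the score of the single document they appear in.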

Calculate the page score based on the scores of the entities that appear in it.
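A minimal Python sketch of the page-level step follows. The slide says only that the page score is based on the scores of its entities; taking the unweighted mean and skipping entities without a known score are assumptions, and the name `page_score` is hypothetical.

```python
from statistics import mean


def page_score(page_entities, entity_scores):
    """Predict a page's complexity from the scores of its entities.

    Assumption: the page score is the plain mean of the scores of the
    distinct entities on the page; entities without a score are skipped.
    Returns None when no entity on the page has a known score.
    """
    known = [entity_scores[e] for e in set(page_entities) if e in entity_scores]
    return mean(known) if known else None


# Toy values, invented for illustration only.
known_scores = {"Africa": 0.6, "Goalball": 0.7}
predicted = page_score(["Africa", "Goalball", "Unseen Entity"], known_scores)
```

Returning None for a page with no scored entities keeps "no prediction" separate from a genuine score of 0.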

Compare scores by the two methods

Site | Vanilla Score | Proposed Score
https://www.ijcai-18.org/cfp/ | 0.6 | .475
https://research.fb.com/programs/research-awards/proposals/computational-social-science-methodology-request-for-proposals/ | 0.796 | -
https://en.wikipedia.org/wiki/Cosine_similarity | 0.568 | .75
https://en.wikipedia.org/wiki/Dorian_Gray_(character) | 0.536 | .45
https://en.wikipedia.org/wiki/Goalball | 0.52 | .55
iswc2013_demo_36.html | - | .375
https://en.wikipedia.org/wiki/Malawi | 0.504 | .775
https://en.wikipedia.org/wiki/Oscar_Wilde | 0.464 | .6
https://en.wikipedia.org/wiki/Scala_(programming_language) | 0.48 | .725
https://en.wikipedia.org/wiki/The_Picture_of_Dorian_Gray | 0.528 | -
https://en.wikipedia.org/wiki/Underwater_rugby | - | .5
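One way to quantify the comparison is sketched below in Python, using only the seven sites for which both scores are listed. The choice of Pearson correlation and mean absolute difference is an assumption; the slides do not say how the two methods were compared.

```python
from math import sqrt

# (vanilla, proposed) pairs for the sites that list both scores.
pairs = [
    (0.6, 0.475),    # ijcai-18.org/cfp
    (0.568, 0.75),   # Cosine_similarity
    (0.536, 0.45),   # Dorian_Gray_(character)
    (0.52, 0.55),    # Goalball
    (0.504, 0.775),  # Malawi
    (0.464, 0.6),    # Oscar_Wilde
    (0.48, 0.725),   # Scala_(programming_language)
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


vanilla = [v for v, _ in pairs]
proposed = [p for _, p in pairs]
r = pearson(vanilla, proposed)
mean_abs_diff = sum(abs(v - p) for v, p in pairs) / len(pairs)
```

With only seven complete pairs, any correlation value is indicative at best.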

Thank You! Gracias! Ευχαριστώ!