Zvika Marx, Extending Ontology ‘s.

Presentation transcript:


Trend detection identifies, in real time, topics and trends that are currently hot in the social network chatter.
Example trends:
• ‘42’ wins weekend Box Office
• Adam Scott wins The Masters
• Last day for Tax Returns
• Dish Network bids for Sprint
[Figure: occurrence volume over hours]

Purpose of trend detection
• Real-time advertising in social media:
  o Use hot, preferably new (and not yet expensive) search terms or terms reflecting interest
• Around social media campaigns there are several interesting projects not reviewed in this talk:
  o Automated bidding
  o Automated campaign management
  o …

A trend is characterized by a particular term, or a group of terms, that has an occurrence frequency peak within some time window.
• Terms = phrases (n-grams) given in a pre-defined table.
• Additional purposes of terms:
  o Network of related terms to add to the ones directly identified (co-occurrence based)
  o Segmentation: different populations are interested in different terms
[Figure: Taykey term-system. Example terms: Tax Returns, Adam Scott; related-term network around Star Wars: Battle Droid, The Empire Strikes Back, Naboo, AT-AT walker, George, Darth Vader, Coruscant]
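The peak criterion above can be sketched as follows. This is a minimal illustration, assuming hourly occurrence counts per term and a simple rule that compares the latest hour with the mean of the preceding window. The function name `is_trending` and the parameters `window`, `ratio`, and `min_count` are hypothetical; the talk does not specify how peaks are detected.

```python
from statistics import mean


def is_trending(hourly_counts, window=24, ratio=3.0, min_count=10):
    """Return True if the latest hour peaks relative to the preceding window.

    hourly_counts: occurrence counts for one term, oldest first.
    window, ratio, min_count: hypothetical tuning parameters (assumptions).
    """
    if len(hourly_counts) < window + 1:
        return False  # not enough history to define a baseline
    latest = hourly_counts[-1]
    baseline = mean(hourly_counts[-(window + 1):-1])
    # Require an absolute floor so that rare terms do not trend on noise.
    return latest >= min_count and latest > ratio * max(baseline, 1.0)
```

A term whose count jumps well above its recent baseline is flagged; a term with too little history, or with a small absolute count, is not.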

New term discovery
• Main idea: similarity to existing terms
  o Accept (or flag as a candidate) if the average of the 10 highest similarities passes some threshold
• Plus rules that exclude patterns such as:
  o [ ] “1.2 ghz”
  o [in ] “in london”
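The acceptance rule and the exclusion rules can be sketched as below. This is an illustration under stated assumptions: the threshold value is made up, and the two regular expressions are guesses reconstructed only from the examples “1.2 ghz” and “in london”, since the pattern definitions themselves are not preserved in the transcript.

```python
import re


def accept_candidate(similarities, top_k=10, threshold=0.3):
    """Average the top_k highest similarities to existing terms.

    threshold=0.3 is a hypothetical value; the talk only says "some threshold".
    If fewer than top_k similarities exist, the available ones are averaged.
    """
    if not similarities:
        return False
    top = sorted(similarities, reverse=True)[:top_k]
    return sum(top) / len(top) >= threshold


# Hypothetical exclusion rules, inferred from the two examples only.
EXCLUDE_PATTERNS = [
    re.compile(r"^\d+(\.\d+)?\s+[a-z]+$"),  # number + unit-like word, e.g. "1.2 ghz"
    re.compile(r"^in\s+\w+$"),              # "in" + one word, e.g. "in london"
]


def is_excluded(candidate):
    """Return True if the candidate matches any exclusion pattern."""
    text = candidate.lower().strip()
    return any(p.match(text) for p in EXCLUDE_PATTERNS)
```

Averaging only the highest similarities means a candidate needs a handful of close neighbours among existing terms, not similarity to the whole table.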

Features: context phrases
Term candidate:
From the statistics of repeating patterns in text (the number of occurrences should pass several thresholds...), extract a feature vector representing the candidate.

Features: processing
• Features (as shown) are two-word and three-word combinations that occur often enough, with enough existing (= training) terms
• “Semi-stopword” list:
  o exclude features made solely of the list’s words
• Cleaning:
  o most non-alphanumeric characters deleted
  o numbers/digits replaced with a symbol
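The cleaning and filtering steps above can be sketched as follows. The semi-stopword list, the `#` placeholder for digits, and the lowercasing step are assumptions for illustration; the sketch also deletes every non-alphanumeric character except the placeholder, whereas the slide says only “most”. Occurrence thresholds are omitted.

```python
import re

SEMI_STOPWORDS = {"the", "of", "a", "to", "in", "is"}  # hypothetical list
DIGIT_SYMBOL = "#"  # hypothetical placeholder for numbers


def clean(text):
    """Lowercase, replace digit runs with a symbol, drop other punctuation."""
    text = text.lower()
    text = re.sub(r"\d+", DIGIT_SYMBOL, text)
    text = re.sub(r"[^\w\s#]", "", text)  # keep letters, whitespace, placeholder
    return re.sub(r"\s+", " ", text).strip()


def context_features(tokens, sizes=(2, 3)):
    """Build two-word and three-word features, skipping all-stopword ones."""
    feats = []
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if all(t in SEMI_STOPWORDS for t in gram):
                continue  # made solely of semi-stopwords
            feats.append(" ".join(gram))
    return feats
```

For example, `clean("I'm watching: 02 x 10!")` yields `"im watching # x #"`, so different season and episode numbers collapse into one pattern.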

New-term discovery setting
• Feature vectors are extracted for
  o every training term (sample of a few months)
  o every n-gram in recent (few days) data
• TF-IDF weights
  o where “document” = a training term’s feature vector
• Cosine similarity (reflects a high proportion of shared features)
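The weighting and similarity steps above can be sketched in plain Python. This assumes one common TF-IDF variant (raw term frequency times log(N/df)), with each term's feature list treated as one “document”; the exact variant used in the talk is not stated, and the function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(term_features):
    """term_features: dict mapping term -> list of feature strings.

    Each term's feature list plays the role of one "document".
    """
    n_docs = len(term_features)
    df = Counter()
    for feats in term_features.values():
        df.update(set(feats))  # document frequency of each feature
    vectors = {}
    for term, feats in term_features.items():
        tf = Counter(feats)
        vectors[term] = {f: c * math.log(n_docs / df[f]) for f, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this variant, a feature shared by every training term gets weight zero, so similarity is driven by features that are distinctive for a subset of terms.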

New-term discovery example
[Figure: a new-term candidate, its similar terms, and their common features. Recoverable content: “I’m watching Witches of East End”, “02 x 10 the fall of the house of Beauchamp”; a feature annotated as = “marked as seen” in Portuguese; a feature annotated as season number X episode number]

Taykey term classification
• *classes* ~ entity type: 'event', 'city', 'other location', 'art-piece', …
• *subjects* ~ domain (multi-tag possible): 'fashion', 'consumer electronics', 'science', …
Classification helps in population segmentation and campaign matching.

Term auto-classification
• Multi-class classification to “classes” and to “subjects”
• Maximum likelihood
  o feature vectors as in term discovery
• Applies to:
  o discovered new term candidates
  o terms not categorized previously
  o existing-vs-calculated conflicts, for example in terms that change their meaning over time (these require manual examination)
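One way to read “maximum likelihood” here is sketched below: each class gets a multinomial model over features, and a term is assigned to the class under which its features are most likely. This interpretation, the add-one smoothing, and the toy features are all assumptions; the talk does not describe the model in more detail.

```python
import math
from collections import Counter, defaultdict


def train(labeled):
    """labeled: list of (feature_list, label) pairs from classified terms."""
    counts = defaultdict(Counter)
    vocab = set()
    for feats, label in labeled:
        counts[label].update(feats)
        vocab.update(feats)
    return counts, vocab


def classify(feats, counts, vocab, alpha=1.0):
    """Pick the label maximizing the smoothed log-likelihood of the features.

    alpha is a hypothetical smoothing constant; unseen features are ignored.
    """
    best_label, best_ll = None, -math.inf
    for label, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        ll = sum(math.log((c[f] + alpha) / total) for f in feats if f in vocab)
        if ll > best_ll:
            best_label, best_ll = label, ll
    return best_label
```

Because no class prior is used, the decision depends only on how well each class explains the observed features, which is what distinguishes a maximum-likelihood choice from a maximum-a-posteriori one.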

Term auto-classification results
• Classes: ~80%; Subjects: ~85% accuracy (vs. ~92% if tested on already classified terms)
• It looks as if some of those terms were left unclassified for a reason… One example:
  o A sasaeng fan is an excessively obsessed fan of the Hallyu wave (= South Korean pop culture, which has become increasingly popular since the late 90s). It was misclassified as ‘sports’ (rather than ‘lifestyle’). The classifier is misled by features that include the sport-related word ‘fan’.

Fine-grain classification
• We started exploring sets of terms that characterize more specific population segments
  o instead of “sports”: “football”, “baseball”, …
• Ongoing experiment: seed Taykey sports terms with “basketball” vs. “other sports” tags
  o Seed taken from freebase.com
• Information gain feature selection
[Figure: top “basketball” features]
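Information gain for a binary feature against a two-way tag (“basketball” vs. “other sports”) can be sketched as follows. The data layout and the toy features in the usage note are made up for illustration; they are not the features from the slide.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos positive and neg negative items."""
    total = pos + neg
    h = 0.0
    for k in (pos, neg):
        if k:
            p = k / total
            h -= p * math.log2(p)
    return h


def information_gain(docs, feature):
    """docs: list of (feature_set, is_basketball) pairs, one per seed term."""
    n = len(docs)
    pos = sum(1 for _, y in docs if y)
    with_f = [(f, y) for f, y in docs if feature in f]
    without_f = [(f, y) for f, y in docs if feature not in f]

    def subset_entropy(subset):
        p = sum(1 for _, y in subset if y)
        return entropy(p, len(subset) - p)

    # Expected entropy after splitting on presence/absence of the feature.
    cond = sum(len(s) / n * subset_entropy(s) for s in (with_f, without_f) if s)
    return entropy(pos, n - pos) - cond
```

Ranking all features by `information_gain` and keeping the top ones gives a feature-selection step: a feature that appears only with basketball terms scores high, while one spread evenly across both tags scores near zero.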

Thank You, Zvika Marx,