Information Extraction Yahoo! Labs Bangalore Rajeev Rastogi Yahoo! Labs Bangalore.

Information Extraction Research @ Yahoo! Labs Bangalore Rajeev Rastogi Yahoo! Labs Bangalore

The most visited site on the internet 600 million+ users per month Super popular properties – News, finance, sports – Answers, flickr, del.icio.us – Mail, messaging – Search

Unparalleled scale 25 terabytes of data collected each day – Over 4 billion clicks every day – Over 4 billion emails per day – Over 6 billion instant messages per day Over 20 billion web documents indexed Over 4 billion images searchable No other company on the planet processes as much data as we do!

Yahoo! Labs Bangalore Focus is on basic and applied research – Search – Advertising – Cloud computing University relations – Faculty research grants – Summer internships – Sharing data/computing infrastructure – Conference sponsorships – PhD co-op program

What does search look like today?

Search results of the future: Structured abstracts (example sites: yelp.com, babycenter, epicurious, answers.com, LinkedIn, webmd, New York Times, Gawker)

Search results of the future: Intelligent ranking (e.g. rank by price)

A key technology for enabling search transformation: Information extraction (IE)

Information extraction (IE) Goal: Extract structured records from Web pages Example attributes: Name, Address, Category, Phone, Price, Map, Reviews

Multiple verticals Business, social networking, video, ….

One schema per vertical
– Business: Category, Address, Phone, Price
– Social networking: Name, Title, Education, Connections
– Video: Posted by, Title, Date, Rating, Views

IE on the Web is a hard problem – Web pages are noisy – Pages belonging to different Web sites have different layouts

Web page types Template-based Hand-crafted

Template-based pages Pages within a Web site generated using scripts, have very similar structure – Can be leveraged for extraction ~30% of crawled Web pages Information rich, frequently appear in the top results of search queries E.g. search query: “Chinese Mirch New York” – 9 template-based pages in the top 10 results

Wrapper Induction Enables extraction from template-based pages
– Learn: Website pages → Sample pages → Annotate pages → Annotations → Learn wrappers → XPath rules
– Extract: Website pages → Apply wrappers → Records

Example XPath: /html/body/div/div/div/div/div/div/span Generalize: /html/body//div//span
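To illustrate why the generalized expression is more robust, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. Its limited XPath support is relative to the root element, so the leading `/html` is dropped. The two page snippets and the helper name `extract_text` are hypothetical, and real Web pages would first need an HTML parser because they are rarely well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: the same field nested six <div> levels deep vs. two.
PAGE_A = ("<html><body>" + "<div>" * 6 + "<span>Chinese Mirch</span>"
          + "</div>" * 6 + "</body></html>")
PAGE_B = ("<html><body>" + "<div>" * 2 + "<span>Jewel of India</span>"
          + "</div>" * 2 + "</body></html>")

# ElementTree paths are relative to the root <html> element.
SPECIFIC = "body/div/div/div/div/div/div/span"
GENERAL = "body//div//span"


def extract_text(page, path):
    """Return the text of every distinct element matched by `path`."""
    root = ET.fromstring(page)
    seen, texts = set(), []
    for element in root.findall(path):
        # With nested <div>s the descendant axis can reach the same
        # <span> several times, so deduplicate by object identity.
        if id(element) not in seen:
            seen.add(id(element))
            texts.append(element.text)
    return texts


print(extract_text(PAGE_A, SPECIFIC))  # matches the six-level page
print(extract_text(PAGE_B, SPECIFIC))  # misses the shallower page
print(extract_text(PAGE_B, GENERAL))   # generalized path still matches
```

The generalized path trades precision for coverage: it tolerates layout variation but can match more candidates, which is where filters become useful.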

Filters Apply filters to prune the multiple candidates that match an XPath expression XPath: /html/body//div//span Regex Filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
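A phone filter of this kind can be sketched with Python's `re` module. The parentheses are escaped so that they match literally, and the candidate strings stand in for whatever the XPath expression returned; they are hypothetical values, as is the helper name `apply_filter`.

```python
import re

# Phone pattern from the slide, written in Python regex syntax.
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def apply_filter(candidates, pattern=PHONE):
    """Keep only the candidates that the pattern matches in full."""
    return [c for c in candidates if pattern.fullmatch(c.strip())]


# Hypothetical candidates matched by /html/body//div//span.
candidates = ["Chinese Mirch", "120 Lexington Avenue", "(212) 532-3663"]
print(apply_filter(candidates))
```

Using `fullmatch` rather than `search` keeps longer text that merely contains a phone number from passing the filter.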

Limitations of wrappers Won’t work across Web sites due to different page layouts Scaling to thousands of sites can be a challenge – Need to learn a separate wrapper for each site – Annotating example pages from thousands of sites can be time-consuming & expensive

Research challenge Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site Only annotate pages from a few sites initially as training data

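Conditional Random Fields (CRFs)
Models the conditional probability distribution of a label sequence y = y_1,…,y_n given an input sequence x = x_1,…,x_n (standard linear-chain form):
$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t=1}^{n} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t)\Big)$$
– f_k: features, \lambda_k: weights, Z(x): normalization constant
Choose \lambda_k to maximize the log-likelihood of the training data
Use the Viterbi algorithm to compute the label sequence y with the highest probability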

CRF-based IE Web pages can be viewed as labeled sequences (labels: Name, Category, Address, Phone, Noise) Train a CRF using pages from a few Web sites Then use the trained CRF to extract from the remaining sites

Drawbacks of CRFs Require too many training examples Have been used previously to segment short strings with similar structure However, may not work too well across Web sites that – contain long pages with lots of noise – have very different structure

An alternate approach that exploits site knowledge Build attribute classifiers for each attribute – Use pages from a few initial Web sites For each page from a new Web site – Segment page into sequence of fields (using static repeating text) – Use attribute classifiers to assign attribute labels to fields Use constraints to disambiguate labels – Uniqueness: an attribute occurs at most once in a page – Proximity: attribute values appear close together in a page – Structural: relative positions of attributes are identical across pages of a Web site
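The segmentation step can be sketched as follows, assuming that "static repeating text" means text lines that occur on every page of a site. That reading, the page contents (including labels such as "Cuisine:"), and the function names are assumptions for illustration only.

```python
def static_lines(pages):
    """Lines that occur on every page are treated as template text."""
    common = set(pages[0])
    for page in pages[1:]:
        common &= set(page)
    return common


def segment(page, static):
    """Split a page into fields: runs of non-static lines between static ones."""
    fields, current = [], []
    for line in page:
        if line in static:
            if current:
                fields.append(" ".join(current))
                current = []
        else:
            current.append(line)
    if current:
        fields.append(" ".join(current))
    return fields


# Hypothetical pages from one site, each given as a list of text lines.
pages = [
    ["Home | Restaurants", "Chinese Mirch", "Cuisine:", "Chinese, Indian",
     "Address:", "120 Lexington Avenue", "New York, NY 10016",
     "Phone:", "(212) 532 3663"],
    ["Home | Restaurants", "Jewel of India", "Cuisine:", "Indian",
     "Address:", "15 W 44th St", "New York, NY 10016",
     "Phone:", "(212) 869 5544"],
    ["Home | Restaurants", "21 Club", "Cuisine:", "American",
     "Address:", "21 W 52nd St", "New York, NY 10019",
     "Phone:", "(212) 582 7200"],
]

static = static_lines(pages)
print(segment(pages[0], static))
```

With too few sample pages a real value can look static (two of the pages above share the same city line), so the intersection is only reliable over a reasonably varied sample.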

Attribute classifiers + constraints example
Page 1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue, New York, NY 10016 | (212) 532 3663
Page 2: Jewel of India | Indian | 15 W 44th St, New York, NY 10016 | (212) 869 5544
Page 3: 21 Club | American | 21 W 52nd St, New York, NY 10019 | (212) 582 7200
Candidate labels from the attribute classifiers (figure labels, field mapping not recoverable): Phone; Address; Category; Name; Category; Category, Name; Name; Name, Noise; Address; Phone
Uniqueness constraint: Name
Precedence constraint: Name < Category
Final labels for Page 3: 21 Club = Name, American = Category, 21 W 52nd St, New York, NY 10019 = Address, (212) 582 7200 = Phone
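One way to picture the disambiguation is a brute-force search over the candidate labels of each field, keeping only the assignments that satisfy the uniqueness and precedence constraints. The classifier scores below are invented for illustration, and this exhaustive search is a sketch, not necessarily the method used in the work described.

```python
from itertools import product


def best_labeling(candidates, precedence=(("Name", "Category"),)):
    """candidates: one dict per field mapping label -> classifier score.

    Returns the highest-scoring labeling that satisfies the constraints,
    or None if no labeling does.
    """
    best, best_score = None, float("-inf")
    for labels in product(*(c.keys() for c in candidates)):
        real = [label for label in labels if label != "Noise"]
        if len(real) != len(set(real)):  # uniqueness: at most once per page
            continue
        if any(a in labels and b in labels
               and labels.index(a) > labels.index(b)
               for a, b in precedence):  # precedence: a must come before b
            continue
        score = sum(c[label] for c, label in zip(candidates, labels))
        if score > best_score:
            best, best_score = list(labels), score
    return best


# Hypothetical scores for the four fields of Page 3.
page3 = [
    {"Name": 0.6, "Category": 0.7},   # "21 Club"
    {"Category": 0.8, "Name": 0.5},   # "American"
    {"Address": 0.9},                 # "21 W 52nd St New York, NY 10019"
    {"Phone": 0.95},                  # "(212) 582 7200"
]
print(best_labeling(page3))
```

Taking the top-scoring label per field independently would label both of the first two fields as Category; the constraints rule that out and recover Name followed by Category.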

Performance evaluation: Datasets
100 pages from 5 restaurant Web sites with very different structure
– www.citysearch.com
– www.fromers.com
– www.nymag.com
– www.superpages.com
– www.yelp.com
Extract attributes: Name, Address, Phone num, Hours of operation, Description

Methods considered CRFs, attribute classifiers + constraints Features – Lexicon: Words in the training Web pages – Regex: isAlpha, isAllCaps, isNum, is5DigitNum, isDay,… – Attribute-level: Num of words, Overlap with title,…
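The regex and attribute-level features might look like the sketch below. The feature names follow the slide, but their exact definitions are not given in the source, so each definition here (and the `DAYS` list) is an assumption.

```python
import re

# Assumed day vocabulary for the isDay feature.
DAYS = {"mon", "tue", "wed", "thu", "fri", "sat", "sun",
        "monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}


def regex_features(token):
    """Token-level boolean features (assumed definitions)."""
    return {
        "isAlpha": token.isalpha(),
        "isAllCaps": token.isalpha() and token.isupper(),
        "isNum": re.fullmatch(r"[0-9]+", token) is not None,
        "is5DigitNum": re.fullmatch(r"[0-9]{5}", token) is not None,
        "isDay": token.lower().rstrip(".") in DAYS,
    }


def attribute_features(field, title):
    """Field-level features: word count and word overlap with the page title."""
    words = field.split()
    title_words = set(title.lower().split())
    overlap = sum(word.lower() in title_words for word in words)
    return {"numWords": len(words), "titleOverlap": overlap}


print(regex_features("10016"))
print(attribute_features("Chinese Mirch", "Chinese Mirch - New York Restaurant"))
```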

Evaluation methodology Metrics – Precision, recall, F1 for attributes Test on one site, use pages from remaining 4 sites as training data Average measures over all 5 sites
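The metrics and the leave-one-site-out protocol can be sketched as follows. Exact string equality is assumed as the correctness criterion, which the source does not specify, and the page ids and values in the example are made up.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 for one attribute.

    predicted, gold: dicts mapping page id -> extracted / true value.
    """
    correct = sum(1 for page, value in predicted.items()
                  if gold.get(page) == value)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def leave_one_site_out(sites):
    """Yield (training sites, test site) pairs, one per held-out site."""
    for held_out in sites:
        yield [site for site in sites if site != held_out], held_out


# Hypothetical extraction output for the Phone attribute on three pages.
gold = {"p1": "(212) 532 3663", "p2": "(212) 869 5544", "p3": "(212) 582 7200"}
predicted = {"p1": "(212) 532 3663", "p2": "(212) 000 0000"}
print(prf1(predicted, gold))

for train, test in leave_one_site_out(["citysearch", "nymag", "yelp"]):
    print(train, "->", test)
```

Averaging the per-site results then gives the figures reported for each attribute.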

Experimental results
            Precision            Recall
            CRF    Constraint    CRF    Constraint
Name        .39    1             .34    1
Phone       .02    1             .2     .99
Address     .01    .81           .16    .83
Hours       .22    1             .36    1
Desc        .13    .25           0      .15
Overall     .15    .81           .21    .76

Other IE scenarios: Browse page extraction Similar-structured records

IE big picture/taxonomy Things to extract from – Template-based, browse, hand-crafted pages, text Things to extract – Records, tables, lists, named entities Techniques used – Structure-based (HTML tags, DOM tree paths) – e.g. Wrappers – Content-based (attribute values/models) – e.g. dictionaries – Structure + Content (sequential/hierarchical relationships among attribute values) – e.g. hierarchical CRFs Level of automation – Manual, supervised, unsupervised
