Published by Clarissa Allison. Modified over 9 years ago.
1
Information Extraction Research @ Yahoo! Labs Bangalore Rajeev Rastogi Yahoo! Labs Bangalore
2
The most visited site on the internet 600 million+ users per month Super popular properties – News, finance, sports – Answers, flickr, del.icio.us – Mail, messaging – Search
3
Unparalleled scale 25 terabytes of data collected each day – Over 4 billion clicks every day – Over 4 billion emails per day – Over 6 billion instant messages per day Over 20 billion web documents indexed Over 4 billion images searchable No other company on the planet processes as much data as we do!
4
Yahoo! Labs Bangalore Focus is on basic and applied research – Search – Advertising – Cloud computing University relations – Faculty research grants – Summer internships – Sharing data/computing infrastructure – Conference sponsorships – PhD co-op program
5
What does search look like today?
6
Search results of the future: Structured abstracts yelp.com babycenter epicurious answers.com LinkedIn webmd New York Times Gawker
7
Search results of the future: Intelligent ranking – Rank by price
8
A key technology for enabling search transformation Information extraction (IE)
9
Information extraction (IE) Goal: Extract structured records from Web pages Example attributes: Name, Address, Category, Phone, Price, Map, Reviews
10
Multiple verticals Business, social networking, video, ….
11
One schema per vertical
– Category, Address, Phone, Price
– Name, Title, Education, Connections
– Posted by, Title, Date, Rating, Views
12
IE on the Web is a hard problem
– Web pages are noisy
– Pages belonging to different Web sites have different layouts
13
Web page types Template-based Hand-crafted
14
Template-based pages
Pages within a Web site are generated by scripts and have very similar structure
– This structure can be leveraged for extraction
~30% of crawled Web pages
Information rich, frequently appear in the top results of search queries
E.g. search query: “Chinese Mirch New York”
– 9 template-based pages in the top 10 results
15
Wrapper Induction
Enables extraction from template-based pages
Learn: Website pages → Sample pages → Annotate pages → Learn wrappers → XPath rules
Extract: Website pages → Apply wrappers → Records
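The "apply wrappers" step can be illustrated with a minimal sketch. The page markup, the `class` attributes, and the wrapper dictionary below are all hypothetical, and Python's `xml.etree.ElementTree` is used only because it is in the standard library: it supports a small XPath subset and requires well-formed markup, which real Web pages often are not.

```python
import xml.etree.ElementTree as ET

# Hypothetical template-based page (made up for illustration).
PAGE = """<html><body><div>
<span class="name">Chinese Mirch</span>
<span class="phone">(212) 532 3663</span>
</div></body></html>"""

# Hypothetical learned wrapper: attribute name -> XPath rule.
WRAPPER = {
    "Name": "./body/div/span[@class='name']",
    "Phone": "./body/div/span[@class='phone']",
}

def apply_wrapper(page_source, wrapper):
    """Evaluate each XPath rule on the page and collect one record."""
    root = ET.fromstring(page_source)
    record = {}
    for attribute, xpath in wrapper.items():
        node = root.find(xpath)  # first matching element, or None
        has_text = node is not None and node.text
        record[attribute] = node.text.strip() if has_text else None
    return record

record = apply_wrapper(PAGE, WRAPPER)
print(record)
```

Because all pages of a template-based site share the same structure, the same wrapper dictionary could in principle be reused on every page of that site.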
16
Example
XPath: /html/body/div/div/div/div/div/div/span
Generalize to: /html/body//div//span
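One simple way to obtain the generalized expression on this slide is to collapse each run of repeated steps into a descendant (`//`) step. The rule below is an assumption made for illustration, not necessarily the generalization procedure used in the talk, and `generalize_xpath` is a hypothetical helper name.

```python
from itertools import groupby

def generalize_xpath(xpath):
    """Collapse runs of identical steps, e.g. div/div/div, into //div//."""
    steps = [step for step in xpath.split("/") if step]
    result = ""
    separator = "/"
    for tag, run in groupby(steps):
        if len(list(run)) > 1:
            # A repeated step becomes "any depth" before and after the tag.
            result += "//" + tag
            separator = "//"
        else:
            result += separator + tag
            separator = "/"
    return result

general = generalize_xpath("/html/body/div/div/div/div/div/div/span")
print(general)
```

The generalized path tolerates pages where the target span sits at a slightly different nesting depth, at the cost of matching more candidate nodes.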
17
Filters
Apply filters to prune the multiple candidates that match an XPath expression
XPath: /html/body//div//span
Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
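A minimal sketch of such a filter, assuming the candidates are the text contents of all nodes matched by the generalized XPath. The candidate strings are made up, and `apply_filter` is a hypothetical helper name.

```python
import re

# Phone pattern from the slide: "(ddd) ddd-dddd" with literal parentheses.
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")

def apply_filter(candidates, pattern):
    """Keep only the candidates whose whole text matches the pattern."""
    return [c for c in candidates if pattern.fullmatch(c.strip())]

# Hypothetical text of the nodes matched by /html/body//div//span.
candidates = ["Chinese Mirch", "(212) 532-3663", "120 Lexington Avenue"]
phones = apply_filter(candidates, PHONE)
print(phones)
```

Using `fullmatch` rather than `search` keeps the filter strict: a node that merely contains a phone number inside longer text is rejected.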
18
Limitations of wrappers Won’t work across Web sites due to different page layouts Scaling to thousands of sites can be a challenge – Need to learn a separate wrapper for each site – Annotating example pages from thousands of sites can be time-consuming & expensive
19
Research challenge Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site Only annotate pages from a few sites initially as training data
20
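Conditional Random Fields (CRFs)
Models the conditional probability distribution of a label sequence $y = y_1, \dots, y_n$ given an input sequence $x = x_1, \dots, x_n$:
$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right)$
– $f_k$: features, $\lambda_k$: weights, $Z(x)$: normalization constant
Choose $\lambda_k$ to maximize the log-likelihood of the training data
Use the Viterbi algorithm to compute the label sequence y with the highest probability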
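The decoding step can be sketched as follows. This is a generic Viterbi implementation over precomputed scores, not code from the talk: `emit[t][j]` and `trans[i][j]` stand in for the weighted feature sums of a trained CRF, and the numbers in the example are made up.

```python
def viterbi(emit, trans, labels):
    """Return the highest-scoring label sequence.

    emit[t][j]: score of label j at position t.
    trans[i][j]: score of moving from label i to label j.
    """
    n_labels = len(labels)
    best = [list(emit[0])]   # best[t][j]: best score of a path ending in j at t
    back = []                # back[t-1][j]: previous label on that best path
    for t in range(1, len(emit)):
        row, pointers = [], []
        for j in range(n_labels):
            prev = max(range(n_labels), key=lambda i: best[-1][i] + trans[i][j])
            row.append(best[-1][prev] + trans[prev][j] + emit[t][j])
            pointers.append(prev)
        best.append(row)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    j = max(range(n_labels), key=lambda k: best[-1][k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return [labels[k] for k in reversed(path)]

labels = ["Name", "Noise"]
emit = [[2, 0], [0, 1], [0, 3]]      # made-up per-position scores
trans = [[0, -1], [-1, 0]]           # made-up transition scores
decoded = viterbi(emit, trans, labels)
print(decoded)
```

Scores are added rather than multiplied because they live in log space; the normalization constant $Z(x)$ does not affect which sequence is best, so it is not needed for decoding.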
21
CRFs-based IE
Web pages can be viewed as labeled sequences (labels: Name, Category, Address, Phone, Noise)
Train a CRF using pages from a few Web sites
Then use the trained CRF to extract from the remaining sites
22
Drawbacks of CRFs Require too many training examples Have been used previously to segment short strings with similar structure However, may not work too well across Web sites that – contain long pages with lots of noise – have very different structure
23
An alternate approach that exploits site knowledge Build attribute classifiers for each attribute – Use pages from a few initial Web sites For each page from a new Web site – Segment page into sequence of fields (using static repeating text) – Use attribute classifiers to assign attribute labels to fields Use constraints to disambiguate labels – Uniqueness: an attribute occurs at most once in a page – Proximity: attribute values appear close together in a page – Structural: relative positions of attributes are identical across pages of a Web site
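The segmentation step can be sketched under one simplifying assumption: "static repeating text" is taken to be any text node that appears on every sampled page of the site. The template strings ("Home", "Cuisine:", "Address:") and both helper names are hypothetical.

```python
def static_text(pages):
    """Text nodes that occur on every page are treated as template text."""
    return set.intersection(*(set(page) for page in pages))

def segment(page, static):
    """Split a page into fields: runs of text between static nodes."""
    fields, current = [], []
    for node in page:
        if node in static:
            if current:
                fields.append(" ".join(current))
                current = []
        else:
            current.append(node)
    if current:
        fields.append(" ".join(current))
    return fields

# Hypothetical text nodes of two pages from the same site.
page_a = ["Home", "Chinese Mirch", "Cuisine:", "Chinese, Indian",
          "Address:", "120 Lexington Avenue", "New York, NY 10016"]
page_b = ["Home", "21 Club", "Cuisine:", "American",
          "Address:", "21 W 52nd St", "New York, NY 10019"]

static = static_text([page_a, page_b])
fields = segment(page_a, static)
print(fields)
```

With only a few sample pages, a value shared by all of them (for example the same city line) would wrongly be treated as static, so a real system would need more pages or a softer criterion.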
24
Attribute classifiers + constraints example
Page1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue New York, NY 10016 | (212) 532 3663
Page2: Jewel of India | Indian | 15 W 44th St New York, NY 10016 | (212) 869 5544
Page3: 21 Club | American | 21 W 52nd St New York, NY 10019 | (212) 582 7200
Candidate labels from the attribute classifiers: Name, Category, Address, Phone, Noise (a field can receive several candidates, e.g. "Category, Name" or "Name, Noise")
Uniqueness constraint: Name
Precedence constraint: Name < Category
Labels for Page3 after applying the constraints: Name, Category, Address, Phone
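A brute-force sketch of how constraints can disambiguate classifier output for Page3. The classifier scores are invented for illustration, and exhaustive enumeration is a stand-in for whatever inference procedure the original system used.

```python
from itertools import product

def best_labeling(candidates, unique, precedence):
    """Pick the highest-scoring labeling that satisfies the constraints.

    candidates: one dict per field, mapping label -> classifier score.
    unique: labels that may occur at most once per page.
    precedence: (a, b) pairs meaning a must appear before b.
    """
    best, best_score = None, float("-inf")
    for combo in product(*(field.items() for field in candidates)):
        labels = [label for label, _ in combo]
        if any(labels.count(a) > 1 for a in unique):
            continue  # violates uniqueness
        if any(a in labels and b in labels and labels.index(a) > labels.index(b)
               for a, b in precedence):
            continue  # violates precedence
        score = sum(s for _, s in combo)
        if score > best_score:
            best, best_score = labels, score
    return best

# Hypothetical classifier scores for the four fields of Page3.
page3 = [
    {"Name": 0.60, "Noise": 0.40},      # 21 Club
    {"Name": 0.55, "Category": 0.45},   # American
    {"Address": 0.90},                  # 21 W 52nd St New York, NY 10019
    {"Phone": 0.95},                    # (212) 582 7200
]
labeling = best_labeling(
    page3,
    unique=("Name", "Category", "Address", "Phone"),
    precedence=(("Name", "Category"),),
)
print(labeling)
```

Without the uniqueness constraint the two highest individual scores would label both "21 Club" and "American" as Name; the constraint forces the second field to fall back to Category.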
25
Performance evaluation: Datasets
100 pages from 5 restaurant Web sites with very different structure
– www.citysearch.com
– www.fromers.com
– www.nymag.com
– www.superpages.com
– www.yelp.com
Extract attributes: Name, Address, Phone number, Hours of operation, Description
26
Methods considered CRFs, attribute classifiers + constraints Features – Lexicon: Words in the training Web pages – Regex: isAlpha, isAllCaps, isNum, is5DigitNum, isDay,… – Attribute-level: Num of words, Overlap with title,…
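The regex features named on this slide could be computed per token roughly as follows. The exact definitions are assumptions (for example, that isDay accepts full and three-letter English day names), and `regex_features` is a hypothetical helper name.

```python
import re

# Assumed vocabulary for isDay: full names and three-letter abbreviations.
DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
        "saturday", "sunday", "mon", "tue", "wed", "thu", "fri", "sat", "sun"}

def regex_features(token):
    """Boolean surface features for a single token."""
    return {
        "isAlpha": token.isalpha(),
        "isAllCaps": token.isalpha() and token.isupper(),
        "isNum": token.isdigit(),
        "is5DigitNum": re.fullmatch(r"[0-9]{5}", token) is not None,
        "isDay": token.lower().rstrip(".") in DAYS,
    }

zip_features = regex_features("10016")
state_features = regex_features("NY")
day_features = regex_features("Mon.")
print(zip_features)
```

Features like is5DigitNum are site-independent, which is what lets a classifier trained on a few sites carry over to a new one.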
27
Evaluation methodology Metrics – Precision, recall, F1 for attributes Test on one site, use pages from remaining 4 sites as training data Average measures over all 5 sites
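The leave-one-site-out protocol and the metrics can be sketched as below. The site list comes from the dataset slide; the counts passed to `precision_recall_f1` are made up, and the helper names are hypothetical.

```python
SITES = ["www.citysearch.com", "www.fromers.com", "www.nymag.com",
         "www.superpages.com", "www.yelp.com"]

def leave_one_site_out(sites):
    """Yield (training sites, test site) pairs, one per site."""
    for held_out in sites:
        yield [s for s in sites if s != held_out], held_out

def precision_recall_f1(tp, fp, fn):
    """Metrics from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def average(values):
    """Mean of a per-site measure over all test sites."""
    return sum(values) / len(values)

splits = list(leave_one_site_out(SITES))
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)  # made-up counts
print(len(splits), precision, recall, round(f1, 3))
```

Holding out an entire site, rather than random pages, is what makes the evaluation measure transfer to unseen layouts.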
28
Experimental results
           Precision            Recall
           CRF    Constraint    CRF    Constraint
Name       .39    1             .34    1
Phone      .02    1             .2     .99
Address    .01    .81           .16    .83
Hours      .22    1             .36    1
Desc       .13    .25           0      .15
Overall    .15    .81           .21    .76
29
Other IE scenarios: Browse page extraction Similar-structured records
30
IE big picture/taxonomy Things to extract from – Template-based, browse, hand-crafted pages, text Things to extract – Records, tables, lists, named entities Techniques used – Structure-based (HTML tags, DOM tree paths) – e.g. Wrappers – Content-based (attribute values/models) – e.g. dictionaries – Structure + Content (sequential/hierarchical relationships among attribute values) – e.g. hierarchical CRFs Level of automation – Manual, supervised, unsupervised