Information Extraction Rajeev Rastogi Yahoo! Labs Bangalore
The most visited site on the internet 600 million+ users per month Super popular properties – News, finance, sports – Answers, flickr, del.icio.us – Mail, messaging – Search
Unparalleled scale 25 terabytes of data collected each day – Over 4 billion clicks every day – Over 4 billion s per day – Over 6 billion instant messages per day Over 20 billion web documents indexed Over 4 billion images searchable No other company on the planet processes as much data as we do!
Yahoo! Labs Bangalore Focus is on basic and applied research – Search – Advertising – Cloud computing University relations – Faculty research grants – Summer internships – Sharing data/computing infrastructure – Conference sponsorships – PhD co-op program
What does search look like today?
Search results of the future: Structured abstracts yelp.com babycenter epicurious answers.com LinkedIn webmd New York Times Gawker
Search results of the future: Intelligent ranking (e.g. rank by price)
A key technology for enabling search transformation Information extraction (IE)
Information extraction (IE) Goal: Extract structured records from Web pages (attributes shown: Name, Address, Category, Phone, Price, Map, Reviews)
Multiple verticals Business, social networking, video, ….
One schema per vertical (attribute labels from the figure): Category, Address, Phone, Price; Name, Title, Education, Connections; Posted by, Title, Date, Rating, Views
IE on the Web is a hard problem Web pages are noisy Pages belonging to different Web sites have different layouts
Web page types Template-based Hand-crafted
Template-based pages Pages within a Web site generated using scripts, have very similar structure – Can be leveraged for extraction ~30% of crawled Web pages Information rich, frequently appear in the top results of search queries E.g. search query: “Chinese Mirch New York” – 9 template-based pages in the top 10 results
Wrapper Induction Enables extraction from template-based pages Learn: Website pages -> Sample pages -> Annotate pages -> Learn wrappers (XPath rules) Extract: Website pages -> Apply wrappers -> Records
Example XPath: /html/body/div/div/div/div/div/div/span Generalize to: /html/body//div//span
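To illustrate why the generalized expression matches more pages than the original one, the minimal sketch below represents each DOM node by its absolute tag path and translates a simplified XPath (only / and // steps, no predicates) into a regular expression. The helper name xpath_to_regex and the sample paths are assumptions made for illustration; a real wrapper would evaluate XPath against a parsed DOM.

```python
import re


def xpath_to_regex(xpath):
    """Translate a simplified XPath (only '/' and '//' steps) into a regex
    over absolute tag paths such as '/html/body/div/span'."""
    # '//' means "any number of intermediate elements", so it becomes an
    # optional run of '/tag' steps followed by a single '/'.
    parts = xpath.split("//")
    pattern = "(?:/[^/]+)*/".join(re.escape(part) for part in parts)
    return re.compile("^" + pattern + "$")


specific = xpath_to_regex("/html/body/div/div/div/div/div/div/span")
general = xpath_to_regex("/html/body//div//span")

# Hypothetical tag paths of the target node on pages with different nesting.
deep_page = "/html/body/div/div/div/div/div/div/span"
shallow_page = "/html/body/div/div/span"

print(bool(specific.match(deep_page)), bool(specific.match(shallow_page)))
print(bool(general.match(deep_page)), bool(general.match(shallow_page)))
```

The generalized rule tolerates changes in nesting depth, which is also why it can match several candidate nodes on one page and needs the filters discussed next.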
Filters Apply filters to prune the multiple candidates that match an XPath expression XPath: /html/body//div//span Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
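As a minimal sketch of the filtering step, the code below keeps only those candidate strings that fully match a phone-number pattern of the form (ddd) ddd-dddd. The candidate list, the function name apply_filter, and the phone number itself are made up for illustration.

```python
import re

# Phone pattern: three digits in parentheses, a space, three digits,
# a hyphen, four digits.
PHONE_FILTER = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")


def apply_filter(candidates, pattern=PHONE_FILTER):
    """Keep only the candidate strings that fully match the pattern."""
    return [text for text in candidates if pattern.fullmatch(text.strip())]


# Hypothetical text of the nodes matched by /html/body//div//span.
candidates = [
    "Chinese Mirch",
    "120 Lexington Avenue",
    "(212) 555-0100",  # made-up number
    "Open daily",
]
phones = apply_filter(candidates)
print(phones)
```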
Limitations of wrappers Won’t work across Web sites due to different page layouts Scaling to thousands of sites can be a challenge – Need to learn a separate wrapper for each site – Annotating example pages from thousands of sites can be time-consuming & expensive
Research challenge Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site Only annotate pages from a few sites initially as training data
Conditional Random Fields (CRFs) Model the conditional probability distribution of a label sequence y = y_1,…,y_n given an input sequence x = x_1,…,x_n – f_k: features, λ_k: weights Choose λ_k to maximize the log-likelihood of the training data Use the Viterbi algorithm to compute the label sequence y with the highest probability
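For reference, a standard linear-chain CRF has the form p(y | x) = (1/Z(x)) exp(Σ_t Σ_k λ_k f_k(y_{t-1}, y_t, x, t)), where Z(x) normalizes over all label sequences. The sketch below shows Viterbi decoding under the assumption that the weighted features have already been summed into two score tables: emission[t][j] (score of label j at position t) and transition[i][j] (score of label i followed by label j). The table names and the toy scores are illustrative and not taken from the talk.

```python
def viterbi(emission, transition):
    """Return the highest-scoring label sequence (as label indices).

    emission[t][j]: score of label j at position t.
    transition[i][j]: score of label i being followed by label j.
    """
    num_labels = len(emission[0])
    best = [list(emission[0])]  # best[t][j]: best score of a path ending in j
    back = []                   # back[t-1][j]: best predecessor of j at t
    for t in range(1, len(emission)):
        row, pointers = [], []
        for j in range(num_labels):
            prev = max(range(num_labels),
                       key=lambda i: best[-1][i] + transition[i][j])
            row.append(best[-1][prev] + transition[prev][j] + emission[t][j])
            pointers.append(prev)
        best.append(row)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(num_labels), key=lambda j: best[-1][j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy scores for two labels (0 and 1) over three positions.
emission = [[2, 0], [0, 1], [0, 3]]
transition = [[0, -5], [-5, 0]]  # switching labels is penalized
decoded = viterbi(emission, transition)
print(decoded)
```

In this toy case the per-position best labels would be 0, 1, 1, but the switching penalty makes the sequence 1, 1, 1 score higher overall, which is the kind of sequence-level trade-off Viterbi resolves.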
CRF-based IE Web pages can be viewed as labeled sequences (labels: Name, Category, Address, Phone, Noise) Train a CRF using pages from a few Web sites Then use the trained CRF to extract from the remaining sites
Drawbacks of CRFs Require too many training examples Have been used previously to segment short strings with similar structure However, may not work too well across Web sites that – contain long pages with lots of noise – have very different structure
An alternate approach that exploits site knowledge Build attribute classifiers for each attribute – Use pages from a few initial Web sites For each page from a new Web site – Segment page into sequence of fields (using static repeating text) – Use attribute classifiers to assign attribute labels to fields Use constraints to disambiguate labels – Uniqueness: an attribute occurs at most once in a page – Proximity: attribute values appear close together in a page – Structural: relative positions of attributes are identical across pages of a Web site
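The segmentation step can be sketched as follows, assuming each page is given as a list of text nodes and that "static repeating text" means text nodes that occur on (nearly) every page of the site. The function name segment_pages, the min_fraction threshold, and the label strings such as "Cuisine:" are hypothetical; the talk does not specify how static text is detected.

```python
from collections import Counter


def segment_pages(pages, min_fraction=1.0):
    """Split each page (a list of text nodes) into fields, using text that
    repeats across the site's pages as field separators."""
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each text once per page
    static = {text for text, count in counts.items()
              if count >= min_fraction * len(pages)}

    segmented = []
    for page in pages:
        fields, current = [], []
        for text in page:
            if text in static:
                if current:  # a static node closes the current field
                    fields.append(" ".join(current))
                    current = []
            else:
                current.append(text)
        if current:
            fields.append(" ".join(current))
        segmented.append(fields)
    return segmented


# Hypothetical text nodes from two pages of the same site.
pages = [
    ["Home", "Chinese Mirch", "Cuisine:", "Chinese, Indian",
     "Address:", "120 Lexington Avenue"],
    ["Home", "Jewel of India", "Cuisine:", "Indian",
     "Address:", "15 W 44th St"],
]
fields_per_page = segment_pages(pages)
print(fields_per_page)
```

One caveat of this simple heuristic: a value shared by all sample pages (for example the same city on every page) would be mistaken for static text, so a practical version would need more pages or a more careful test.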
Attribute classifiers + constraints example
Page1: Chinese Mirch | Chinese, Indian | 120 Lexington Avenue New York, NY | (212)
Page2: Jewel of India | Indian | 15 W 44th St New York, NY | (212)
Page3: 21 Club | American | 21 W 52nd St New York, NY | (212)
Classifier labels shown on the pages (some fields are ambiguous): Phone; Address; Category; Name; Category; Category, Name; Name; Name, Noise; Address; Phone
Uniqueness constraint: Name
Precedence constraint: Name < Category
Result for Page3: 21 Club = Name; American = Category; 21 W 52nd St New York, NY = Address; (212) = Phone
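A minimal sketch of how constraints can disambiguate classifier output, assuming each field comes with a score per candidate label. It enumerates all labelings of a page, discards those that violate uniqueness or precedence, and keeps the highest-scoring survivor. The scores are invented for illustration, and brute-force enumeration is a simplification chosen for clarity, not necessarily the method used in the talk.

```python
from itertools import product


def best_labeling(field_scores, precedence=(("Name", "Category"),)):
    """Pick the highest-scoring labeling that satisfies the constraints.

    field_scores: one dict per field, mapping candidate label -> score.
    precedence: pairs (a, b) meaning label a must appear before label b.
    """
    best, best_score = None, float("-inf")
    for labels in product(*(scores.keys() for scores in field_scores)):
        # Uniqueness: every attribute other than Noise occurs at most once.
        attributes = [label for label in labels if label != "Noise"]
        if len(attributes) != len(set(attributes)):
            continue
        # Precedence: a must come before b whenever both are present.
        if any(a in labels and b in labels
               and labels.index(a) > labels.index(b)
               for a, b in precedence):
            continue
        score = sum(scores[label]
                    for scores, label in zip(field_scores, labels))
        if score > best_score:
            best, best_score = labels, score
    return best


# Hypothetical classifier scores for the four fields of Page3.
page3 = [
    {"Name": 0.6, "Noise": 0.4},     # "21 Club"
    {"Name": 0.7, "Category": 0.6},  # "American"
    {"Address": 0.9},                # street and city
    {"Phone": 0.9},                  # phone field
]
labeling = best_labeling(page3)
print(labeling)
```

Taken field by field, the classifiers would label both of the first two fields Name; the uniqueness constraint rules that out, and the best consistent labeling assigns Name to the first field and Category to the second.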
Performance evaluation: Datasets 100 pages from 5 restaurant Web sites with very different structure Extract attributes: Name, Address, Phone number, Hours of operation, Description
Methods considered CRFs, attribute classifiers + constraints Features – Lexicon: Words in the training Web pages – Regex: isAlpha, isAllCaps, isNum, is5DigitNum, isDay,… – Attribute-level: Num of words, Overlap with title,…
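The regex and attribute-level features listed above might be computed along these lines. This is a sketch: the exact feature definitions used in the experiments are not given, so the day list, the treatment of abbreviations, and the word-overlap measure are assumptions.

```python
import re

# Assumed list of day names and common abbreviations.
DAYS = {"mon", "tue", "wed", "thu", "fri", "sat", "sun",
        "monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}


def regex_features(token):
    """Token-level boolean features named after those on the slide."""
    return {
        "isAlpha": token.isalpha(),
        "isAllCaps": token.isalpha() and token.isupper(),
        "isNum": token.isdigit(),
        "is5DigitNum": re.fullmatch(r"[0-9]{5}", token) is not None,
        "isDay": token.lower().rstrip(".") in DAYS,
    }


def attribute_features(field, title):
    """Field-level features: word count and word overlap with the title."""
    field_words = field.lower().split()
    title_words = set(title.lower().split())
    overlap = sum(1 for word in field_words if word in title_words)
    return {"numWords": len(field_words), "titleOverlap": overlap}


zip_features = regex_features("10016")
name_features = attribute_features("Chinese Mirch",
                                   "Chinese Mirch - New York Restaurant")
print(zip_features, name_features)
```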
Evaluation methodology Metrics – Precision, recall, F1 for attributes Test on one site, use pages from remaining 4 sites as training data Average measures over all 5 sites
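The evaluation protocol can be sketched as below, assuming predictions and gold annotations are represented as sets of extracted items (for example (page, attribute, value) triples). The function names and the site identifiers are placeholders.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and F1 of a set of predictions against gold."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def leave_one_site_out(sites):
    """Yield (training sites, test site) with each site held out once."""
    for held_out in sites:
        yield [site for site in sites if site != held_out], held_out


def average(rows):
    """Average each measure over the per-site result tuples."""
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# Placeholder site identifiers; the actual site names are not given.
splits = list(leave_one_site_out(["A", "B", "C", "D", "E"]))

# Toy item sets: 3 of 4 predictions are correct, 3 of 5 gold items found.
precision, recall, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(precision, recall, round(f1, 3))
```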
Experimental results Table (values not recovered): precision and recall of the CRF and Constraint methods for Name, Phone, Address, Hours, Desc, and Overall
Other IE scenarios: Browse page extraction Similar-structured records
IE big picture/taxonomy Things to extract from – Template-based, browse, hand-crafted pages, text Things to extract – Records, tables, lists, named entities Techniques used – Structure-based (HTML tags, DOM tree paths) – e.g. Wrappers – Content-based (attribute values/models) – e.g. dictionaries – Structure + Content (sequential/hierarchical relationships among attribute values) – e.g. hierarchical CRFs Level of automation – Manual, supervised, unsupervised