
Information Extraction

Two Types of Extraction

Extracting from template-based data
– An example of how this data is generated:
  – Querying Amazon by filling in a form interface (e.g., searching for "Jignesh Patel")
  – The query goes to a database in the backend
  – The database result is plugged into template-based pages
– Extraction from such template-based pages is done by wrappers

Extracting entities and relationships from textual data

Wrappers

IE from Text

Example: a Walmart product record next to the corresponding vendor product record
– Product Name: CHAMP Bluetooth Survival Solar Multi-Function Skybox with Emergency AM/FM NOAA Weather Radio (RCEP600WR)
– Product Short Description: BLTH SURVIVAL SKYBOX W WR
– Product Long Description: BLTH SURVIVAL SKYBOX W WR
– Product Segment: Electronics
– Product Type: CB Radios & Scanners (Walmart) / Portable Radios (vendor)
– Color: Black
– Actual Color: Black
– UPC: unique product identifier (aka the key in the e-commerce industry)

IE from Text

Example: a Walmart product record next to the corresponding vendor product record
– Product Name: GreatShield 6FT Apple MFi Licensed Lightning Sync Charge Cable for Apple iPhone 6 6 Plus 5S 5C 5 iPad 4 Air Mini - Black (Walmart); the vendor record has the same title ending in "- White"
– Product Short Description: GreatShield 6FT Apple MFi Licensed Lightning Sync Charge Cable for Apple iPhone 6 6 Plus 5S 5C 5 iPad 4 Air Mini - Black
– Product Long Description: GreatShield Apple MFi Licensed Lightning Charge & Sync Cable This USB 2.0 cable connects your iPhone, iPad, or iPod with Lightning … (same text in both records)
– Product Segment: Electronics
– Product Type: Cable Connectors
– Brand: GreatShield
– Manufacturer Part Number: GS09055

IE from Text

Input text:
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted PEOPLE table:
Name | Title | Organization
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | Founder | Free Soft..

Query: Select Name From PEOPLE Where Organization = 'Microsoft'
Result: Bill Gates, Bill Veghte

(from Cohen's IE tutorial, 2003)

Two Main Solution Approaches

Hand-crafted rules
– e.g. regexes
– dictionary-based
Learning-based approaches

Example: Regexes

Extract attribute values from products:

title = X-Mark Pair of 45 lb. Rubber Hex Dumbbells
material = Rubber
finer categorizations = Dumbbells__Weight Sets
type = Hand Weights
…

title = Zalman ZM-T2 ATX Mini Tower Case - Black
brand = Zalman
finer categorizations = Computer Cases
…

Example

Discuss how to extract weights such as 45 lbs
– Something to recognize the number
– Something to recognize all variations of weight units
– The resulting regex can be very complicated
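To make the point concrete, here is a small illustrative regex in Python (not from the original slides); the unit list and the handling of spacing are our own assumptions, and a production-quality pattern would have to cover many more unit spellings and number formats.

import re

# A sketch of a weight extractor: a number (optionally with a decimal part)
# followed by one of several weight-unit spellings.
WEIGHT_RE = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>lbs?\.?|pounds?|kgs?\.?|kilograms?|oz\.?|ounces?)\b",
    re.IGNORECASE,
)

def extract_weights(text):
    """Return (value, unit) pairs found in a product title."""
    return [(m.group("value"), m.group("unit")) for m in WEIGHT_RE.finditer(text)]

print(extract_weights("X-Mark Pair of 45 lb. Rubber Hex Dumbbells"))
# [('45', 'lb')]  -- even this toy pattern already needs care with '45lb', '45-lb', 'lb.', etc.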

Example: Dictionary-Based

Goal: build a simple person-name extractor
– input: a set of Web pages W, a list of names
– output: all mentions of the names in W

Simplified person-name extraction:
– for each name, e.g. David Smith, generate variants (V): "David Smith", "D. Smith", "Smith, D.", etc.
– find occurrences of these variants in W
– clean the occurrences

Compiled Dictionary

Dictionary entries: D. Miller, R. Smith, K. Richard, D. Li, …
Names: David Miller, Rob Smith, Renee Miller
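A small sketch (our own, in Python) of the generate-variants-then-match idea from the previous slides; the variant rules are just the ones listed above, and the final cleaning step is omitted.

import re

def variants(full_name):
    """Generate simple variants of a person name, e.g. 'David Smith'."""
    first, last = full_name.split()[0], full_name.split()[-1]
    return {
        full_name,               # David Smith
        f"{first[0]}. {last}",   # D. Smith
        f"{last}, {first[0]}.",  # Smith, D.
        f"{last}, {first}",      # Smith, David
    }

def find_mentions(pages, names):
    """Return (page_id, name, variant) for every occurrence of a variant."""
    mentions = []
    for name in names:
        for v in variants(name):
            pattern = re.compile(re.escape(v))
            for page_id, text in pages.items():
                if pattern.search(text):
                    mentions.append((page_id, name, v))
    return mentions

pages = {"p1": "Talk by D. Smith and Renee Miller on data integration."}
print(find_mentions(pages, ["David Smith", "Renee Miller"]))
# [('p1', 'David Smith', 'D. Smith'), ('p1', 'Renee Miller', 'Renee Miller')]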

Hand-coded rules can be arbitrarily complex

Find a conference name in raw text:

#############################################################################
# Regular expressions to construct the pattern to extract conference names
#############################################################################

# These are subordinate patterns
my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
my $confTypes="(?:Conference|Workshop|Symposium)";
my $words="(?:[A-Z]\\w+\\s*)";   # A word starting with a capital letter and ending with 0 or more spaces
my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)";   # e.g. "International Conference ..." or the conference name for workshops (e.g. "VLDB Workshop ...")
my $connectors="(?:on|of)";
my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))";   # Conference abbreviations like "(SIGMOD'06)"

# The actual pattern we search for. A typical conference name this pattern will find is
# "3rd International Conference on Blah Blah Blah (ICBBB-05)"
my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|,)";

##############################################################
# In the given message, look for the conference pattern
##############################################################
lookForPattern($dbworldMessage, $fullNamePattern);

#########################################################
# In a given file, look for occurrences of a pattern
# (the pattern is a regular expression)
#########################################################
sub lookForPattern {
  my ($file,$pattern) = @_;   # ... (body continued on the next slide)

Example Code of Hand-Coded Extractor

# Only look for conference names in the top 20 lines of the file
my $maxLines=20;
my $topOfFile=getTopOfFile($file,$maxLines);

# Look for the match in the top 20 lines - case insensitive, allow matches spanning multiple lines
if($topOfFile=~/(.*?)$pattern/is) {
  my ($prefix,$name)=($1,$2);

  # If it matches, do a sanity check and clean up the match
  # Get the first letter; verify that it is a capital letter or a number
  if(!($name=~/^\W*?[A-Z0-9]/)) { return (); }

  # If there is an abbreviation, cut off whatever comes after it
  if($name=~/^(.*?$abbreviations)/s) { $name=$1; }

  # If the name is too long, it probably isn't a conference
  if(scalar($name=~/[^\s]/g) > 100) { return (); }

  # Get the first letter of the last word (need to do this after chopping off parts due to the abbreviation)
  my ($letter,$nonLetter)=("[A-Za-z]","[^A-Za-z]");
  # Need a space before $name to handle the first $nonLetter in the pattern if there is only one word in the name
  " $name"=~/$nonLetter($letter)$letter*$nonLetter*$/;
  my $lastLetter=$1;

  # Verify that the first letter of the last word is a capital letter
  if(!($lastLetter=~/[A-Z]/)) { return (); }

  # Passed the tests; return a new crutch
  return newCrutch(length($prefix),length($prefix)+length($name),$name,
                   "Matched pattern in top $maxLines lines","conference name",getYear($name));
}
return ();
}

Two Main Solution Approaches

Hand-crafted rules
– e.g. regexes
– dictionary-based
Learning-based approaches

IE from Text

(The same example as before: from the Microsoft/open-source news text, extract the PEOPLE table — Name, Title, Organization — and answer queries such as Select Name From PEOPLE Where Organization = 'Microsoft'. From Cohen's IE tutorial, 2003.)

A Quick Intro to Classification

Also known as supervised learning
– Given training examples, train a classifier
– Apply the classifier to a new example to classify it
Training examples: feature vectors + labels
A new example: a feature vector (no label)
Example: predict if a guy will be a good husband
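As a concrete illustration of these ideas, here is a minimal scikit-learn sketch; the features and labels are made up for the toy "good husband" example and are not from the slides.

from sklearn.linear_model import LogisticRegression

# Each training example is a feature vector plus a label.
# Toy features (all invented): [age, years_employed, owns_home (0/1)]
X_train = [[35, 10, 1], [22, 0, 0], [41, 15, 1], [27, 2, 0]]
y_train = [1, 0, 1, 0]           # 1 = "good husband", 0 = not

clf = LogisticRegression()
clf.fit(X_train, y_train)        # train the classifier on the labeled examples

x_new = [[30, 5, 1]]             # a new example: just a feature vector, no label
print(clf.predict(x_new))        # predicted label for the new example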

Learning to Extract Person Names

(The same running example: learn to extract the person names Bill Gates, Bill Veghte, and Richard Stallman, with their titles and organizations, from the news text. From Cohen's IE tutorial, 2003.)

The Entire End-to-End Process

– Take some pages
– Manually mark up all person names
– Create a set of features
– Convert each marked-up name into a feature vector with a positive label => a positive example
– Create negative examples
– Train a classifier on training data
– Now use it to extract names from the rest of the pages
  – Must generate candidate names
– Compute accuracy
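As an illustration of the "generate candidate names, then featurize them" steps, here is a hypothetical sketch (not the method from the slides); the candidate rule and the three features are invented for the example.

import re

# Hypothetical featurization for person-name candidates: every pair of adjacent
# capitalized tokens is a candidate, and each candidate becomes a feature vector.
CANDIDATE_RE = re.compile(r"\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b")

TITLES = {"ceo", "vp", "founder", "president"}

def candidates_with_features(text):
    examples = []
    for m in CANDIDATE_RE.finditer(text):
        span = m.group(0)
        after = text[m.end():m.end() + 40].lower()
        features = [
            1,                                             # both tokens capitalized (true by construction)
            1 if any(t in after for t in TITLES) else 0,   # followed by a job title?
            len(span),                                     # length of the candidate string
        ]
        examples.append((span, features))
    return examples

text = "Microsoft Corporation CEO Bill Gates was against open source."
print(candidates_with_features(text))
# [('Microsoft Corporation', [1, 1, 21]), ('Bill Gates', [1, 0, 10])]

Note that "Microsoft Corporation" is a false candidate here, which is exactly why the negative examples and the trained classifier are needed on top of candidate generation.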

Computing Accuracy, or How to Evaluate IE Solutions?

– Precision
– Recall
– Precision/recall curve
– Often need to know the accuracy target of the end application
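For reference, a tiny sketch of how precision and recall are typically computed for an extraction task, comparing the extracted mentions against a hand-labeled gold set (the example data below is invented).

def precision_recall(extracted, gold):
    """Precision = correct / extracted, Recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

extracted = {"Bill Gates", "Bill Veghte", "Microsoft Corporation"}  # system output
gold = {"Bill Gates", "Bill Veghte", "Richard Stallman"}            # hand-labeled truth
print(precision_recall(extracted, gold))   # (0.666..., 0.666...)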

In Practice the Whole Process Is More Complex

Development stage
– Develop the best extractor you can; fine-tune it as much as possible
Production stage
– Apply it to (often a lot of) data

Hand-Coded Methods

Easy to construct in many cases
– e.g. to recognize prices, phone numbers, zip codes, conference names, etc.
Easier to debug & maintain
– especially if written in a "high-level" language (as is usually the case)
– e.g. "this is a zip code because it is five digits and is preceded by two capitalized characters (a state abbreviation)"
Easier to incorporate / reuse domain knowledge
Can be quite labor-intensive to write

Learning-Based Methods

Can work well when training data is easy to construct and is plentiful
Can capture complex patterns that are hard to encode with hand-crafted rules
– e.g. determine whether a review is positive or negative
– e.g. extract long, complex gene names: "The human T cell leukemia lymphotropic virus type 1 Tax protein represses MyoD-dependent transcription by inhibiting MyoD-binding to the KIX domain of p300." [from AliBaba]
Can be labor-intensive to construct training data
– not always clear how much training data is sufficient
Can be hard to understand and debug
Complementary to hand-coded methods

A New Solution Method: Crowdsourcing

(The next few slides are taken from a KAIST tutorial.)

Mechanical Turk

Begin with a project
– Define the goals and key components of your project. For example, your goal might be to clean your business listing database so that you have accurate information for consumers.
Break it into tasks and design your HITs
– Break the project into individual tasks; e.g., if you have 1,000 listings to verify, each listing would be an individual task.
– Next, design your Human Intelligence Tasks (HITs) by writing crisp and clear instructions, identifying the specific outputs/inputs desired and how much you will pay to have the work completed.
Publish HITs to the marketplace
– You can load millions of HITs into the marketplace. Each HIT can have multiple assignments so that different Workers can provide answers to the same set of questions and you can compare the results to form an agreed-upon answer.

Mechanical Turk

Workers accept assignments
– If Workers need special skills to complete your tasks, you can require that they pass a Qualification test before they are allowed to work on your HITs.
– You can also require other Qualifications such as the location of a Worker or that they have completed a minimum number of HITs.
Workers submit assignments for review
– When a Worker completes your HIT, he or she submits an assignment for you to review.
Approve or reject assignments
– When your work items have been completed, you can review the results and approve or reject them. You pay only for approved work.
Complete your project
– Congratulations! Your project has been completed and your Workers have been paid.
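To make the workflow above concrete, here is a small sketch of publishing a HIT programmatically with the AWS boto3 MTurk client. This is our own illustration, not part of the tutorial; the question HTML, reward, and timing values are placeholder assumptions, and the sandbox endpoint is used so no real money is spent.

import boto3

# Assumption: AWS credentials are already configured; the sandbox endpoint avoids real payments.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A minimal HTMLQuestion asking a Worker to verify one business listing (placeholder content).
question_xml = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
      <p>Is this listing correct? Acme Coffee, 123 Main St.</p>
      <crowd-form><input type="text" name="answer"/></crowd-form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>400</FrameHeight>
</HTMLQuestion>
"""

hit = mturk.create_hit(
    Title="Verify a business listing",
    Description="Check whether the listing information is accurate.",
    Reward="0.05",                      # paid per approved assignment
    MaxAssignments=3,                   # 3 Workers answer the same question
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=question_xml,
)
print(hit["HIT"]["HITId"])

MaxAssignments=3 corresponds to the multiple-assignment idea above: several Workers answer the same question so their answers can be compared to form an agreed-upon answer.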

Screenshot


Types of Tasks in M-Turk

How Could We Use Crowdsourcing for IE?

A Real-Life Case Study

IE from Text

(The CHAMP Bluetooth survival solar skybox product record shown earlier, repeated here to introduce the case study.)

IE from Text

(The GreatShield Lightning cable product record shown earlier, repeated here.)

Attribute Extraction from Text

Our focus: brand name extraction
– Problem definition: extract a product's brand name from the product title (a short textual product description), e.g. extract "Hitachi" from "Hitachi TV 32" in black HD 368X-42"
Knowing brand names is important for
– Trend analysis
– Sales prediction
– Inventory management
– …

Challenges

1. Hard to achieve high accuracy
– Require precision above 0.95 and recall improving over time
– Hard to achieve high precision
  – Ambiguous brand names, e.g. "Apple iPad Mini 16GB – Black" vs. "Apple Juice by Minute Maid, 1 Gallon"
  – Variations and typos
– Hard to achieve high recall
  – A lot of brand names have only a few products, e.g. "Orginnovations Inc" with only 15 product items in our dataset
2. Limited human resources
– 1 or 2 analysts/developers

Key Ideas of Our Solution

1. Use dictionary-based IE
– Construct, monitor and maintain a brand name dictionary for each product department
– Use the dictionaries to perform IE
– Achieving high precision
  – Monitor precision via the crowd
  – When precision drops below 0.95, ask the analyst/developer to modify the dictionary to improve precision
– Achieving high recall
  – Crowdsource the extraction of brand names for products whose brand names are not in the dictionary

Key Ideas of Our Solution (Cont.)

2. Don't involve the developer/analyst as long as the accuracy requirements are satisfied
– Use crowdsourcing whenever possible to
  – Evaluate and monitor precision and recall
  – Improve recall

Architecture of Our Solution

(Pipeline diagram.) Dictionary Construction builds the Brand Name Dictionaries from Web Crawls, In-house Databases, and Online Listings. Brand Name Extraction applies the dictionaries to incoming Product Items, producing Extraction Results. Precision is evaluated (by the crowd) on a Result Sample: if precision is not above 0.95, the analyst/developer tunes for precision and extraction is rerun; otherwise recall is evaluated. If recall is not above 0.9, the crowd tunes for recall and extraction is rerun; otherwise the pipeline is done and the results populate the product database.

Architecture of Our Solution

(The pipeline diagram above, shown again.)

Dictionary Construction: Initialization

Create a brand name dictionary for each product department using:
– In-house data
– Product pages crawled from other retailers' web sites
– Online brand name lists (e.g. names.html)

Dictionary Construction: Clean-Up

For each entry in the brand name dictionaries, discard it if:
– The number of product items in our in-house data with this brand name is too small (e.g. < 10), or
– It is a very common word in our in-house product item descriptions (e.g. more than 2000 item descriptions contain this entry)
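A small sketch of this clean-up rule (our own illustration; the thresholds 10 and 2000 are the ones quoted above, and the two count dictionaries are assumed to be precomputed over the in-house catalog).

MIN_ITEMS_WITH_BRAND = 10            # discard brands attached to too few products
MAX_DESCRIPTIONS_CONTAINING = 2000   # discard entries that are just common words

def clean_dictionary(brand_dict, items_with_brand, descriptions_containing):
    """brand_dict: set of candidate brand names.
    items_with_brand[b]: number of in-house items whose brand field equals b (assumed precomputed).
    descriptions_containing[b]: number of in-house item descriptions containing b (assumed precomputed)."""
    kept = set()
    for brand in brand_dict:
        if items_with_brand.get(brand, 0) < MIN_ITEMS_WITH_BRAND:
            continue                 # too rare: likely noise
        if descriptions_containing.get(brand, 0) > MAX_DESCRIPTIONS_CONTAINING:
            continue                 # too common: likely an ordinary word, not a brand
        kept.add(brand)
    return kept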

Dictionary Construction: Adding Variations

Add brand name variations using the following rules:
– If a brand name contains " and ", add the variation with " & ", and vice versa
– If a brand name contains any of the following phrases, add variations with the others substituted: " co", " corp", " corporation", " ltd", " limited", " inc", " incorporated"
– If a brand name contains dot character(s), add variations with an arbitrary number of dots removed
– e.g. for "S. Lichtenberg & Co." add "S Lichtenberg & Co", "S. Lichtenberg and Co.", etc.
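A simplified sketch of those variation rules (our own code, not the production system); it applies each rule once rather than generating every combination, and it only swaps a trailing corporate suffix.

import re

CORP_SUFFIXES = ["co", "corp", "corporation", "ltd", "limited", "inc", "incorporated"]

def add_variations(brand):
    """Return the brand plus simple variations (one rule applied at a time)."""
    variants = {brand}

    # Rule 1: " and " <-> " & "
    if " and " in brand:
        variants.add(brand.replace(" and ", " & "))
    if " & " in brand:
        variants.add(brand.replace(" & ", " and "))

    # Rule 2: swap a trailing corporate suffix ("Co", "Corp", "Inc", ...) for the others
    m = re.search(r"\b(%s)\.?$" % "|".join(CORP_SUFFIXES), brand, re.IGNORECASE)
    if m:
        head = brand[: m.start()]
        for s in CORP_SUFFIXES:
            variants.add(head + s.capitalize())

    # Rule 3: drop the dots
    if "." in brand:
        variants.add(brand.replace(".", ""))

    return variants

print(sorted(add_variations("S. Lichtenberg & Co.")))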

Architecture of Our Solution

(The pipeline diagram above, shown again.)

Brand Name Extraction

For each newly arrived product item:
1. Detect the product's department, e.g. using the Chimera product classification system [DOAN'14]
2. Load the corresponding brand name dictionary as a prefix tree
3. Use the prefix tree to look up brand names occurring in the product title in predefined patterns
– e.g. a brand name appearing at the beginning of the title, as in "Nuvo Lighting 60/332 Two Light Reversible Lighting", etc.

Brand Name Extraction (Cont.)

4. Add all the dictionary entries found in the title to the candidate brand set
5. For each pair of entries in the candidate brand set: if one is a substring of the other, discard the shorter one
– Example: discard "Tommee" if "Tommee Tippee" is also in the result set
6. Report an extracted brand name for the current product item only if:
a) There is only one candidate brand name in the candidate brand set, and
b) This candidate brand name is not in the current department's brand name blacklist (created by the analyst(s))
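A compact sketch of steps 4–6 (our own simplification: it uses plain substring search over a dictionary set instead of the prefix tree, and the dictionary and blacklist below are toy examples).

def extract_brand(title, brand_dict, blacklist):
    """Return the extracted brand name for a product title, or None."""
    # Step 4: collect every dictionary entry that occurs in the title
    candidates = {b for b in brand_dict if b.lower() in title.lower()}

    # Step 5: if one candidate is a substring of another, keep only the longer one
    candidates = {
        b for b in candidates
        if not any(b != other and b.lower() in other.lower() for other in candidates)
    }

    # Step 6: report a brand only if exactly one candidate survives and it is not blacklisted
    if len(candidates) == 1:
        (brand,) = candidates
        if brand not in blacklist:
            return brand
    return None

brand_dict = {"Tommee", "Tommee Tippee", "Nuvo Lighting"}
print(extract_brand("Tommee Tippee Closer to Nature Bottle", brand_dict, blacklist=set()))
# Tommee Tippee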

Architecture of Our Solution

(The pipeline diagram above, shown again.)

Evaluate Extraction Precision

Tune for Precision

– Take a sample of the product items for which we have extracted a brand name (e.g. 100 product items)
– Ask the analyst to go through them and add non-brands or ambiguous brand names to the blacklist of the corresponding product department
– Go back to the brand extraction step

Architecture of Our Solution

(The pipeline diagram above, shown again.)

Estimate Extraction Recall

Tune for Recall

– Take a sample of the product items whose brand names do not appear in the brand dictionary (e.g. sample size = 1000)
– Send the sample to the crowd for manual brand extraction
  – Send each item to 2 workers
  – If the extracted brands are the same, add the brand to the brand name dictionary
  – Otherwise, send the item to a 3rd worker
    – If 2 out of 3 agree on a brand name, add it to the brand name dictionary
    – Otherwise, ignore the item
– Go back to the brand extraction step
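A sketch of the 2-worker / 3rd-worker agreement rule described above (our own illustration; ask_crowd(item) is a hypothetical stand-in for whatever call actually collects one worker's answer).

from collections import Counter

def aggregate_brand(item, ask_crowd):
    """Return an agreed-upon brand name for the item, or None if the workers disagree."""
    answers = [ask_crowd(item), ask_crowd(item)]   # two independent workers
    if answers[0] == answers[1]:
        return answers[0]
    answers.append(ask_crowd(item))                # tie-breaking third worker
    brand, votes = Counter(answers).most_common(1)[0]
    return brand if votes >= 2 else None           # require 2-out-of-3 agreement

# Toy usage with canned answers standing in for real crowd responses.
canned = iter(["GreatShield", "Great Shield", "GreatShield"])
print(aggregate_brand("GreatShield 6FT Lightning Cable", lambda item: next(canned)))
# GreatShield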

Experiments

Home products department
– 142K product items for which a brand name had not been extracted before
Constructing the brand name dictionary
– ~37K brand names
Tuning the system
– Performed 7 rounds of precision evaluation (crowd) and tuning (developer)
– Performed 1 round of recall evaluation and tuning (crowd)

Results

Accuracy
– Precision = 0.95 (27917 / 29276)
– Recall = 0.93 (27917 / 30000)
Precision evaluation (Samasource)
– Cost = ~$2500 (~12K items, $210 per 1000 items)
– Duration = ~34 hours (2 hr 50 min per 1000 items)
Recall tuning (Amazon Mechanical Turk)
– Cost = $154 (for 1000 items)
– Duration = 1 hour 35 minutes (for 1000 items)

Conclusion

Our proposed solution can extract brand names from product titles with high accuracy and relatively low cost.

Using this solution is effective for domains that:
– Have a relatively small number of ambiguous values
  – e.g. appearance in an English-language dictionary is one indication of ambiguity; only ~2000 brand names in the home department dictionary appear in an English-language dictionary
– Don't grow too fast
  – The rate at which new values are added to the domain is comparable to the rate at which our solution can find new brand names within budget limits
  – e.g. ~250 brand names (found via crowdsourcing) in ~2 hours, spending $154