Information Extraction on Real Estate Rental Classifieds
Eddy Hartanto, Ryohei Takahashi
Overview
We want to extract 10 fields:
- Security deposit
- Square footage
- Number of bathrooms
- Contact person's name
- Contact phone number
- Nearby landmarks
- Cost of parking
- Date available
- Building style / architecture
- Number of units in building
These fields cannot easily be served by keyword search.
Approach
- Hand-labeled test set serves as the basis for precision and recall computation
- Pattern-matching approach with Rapier
- Statistical approach using HMMs with different structures
Demo …
Hidden Markov Models
- We consider three different HMM structures
- We train one HMM per field
- Words in postings are the output symbols of the HMM
- Hexagons represent target states, which emit the words relevant to that field
Training Data
- We use a randomly selected set of 110 postings as training data
- We manually label which words in each posting are relevant to each of the 10 fields
HMM Structure #1
- A single prefix state and a single suffix state
- Prefixes and suffixes can be of arbitrary length
HMM Structure #2
- Varying numbers of prefix, suffix, and target states
HMM Structure #3
- Varying numbers of prefix, suffix, and target states
- Prefixes and suffixes are fixed in length
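The three structures differ only in their transition topology. A minimal sketch of how structures #1 and #3 could be encoded as sets of allowed transitions; all function and state names here are hypothetical, not from the project code. (Structure #2 would look like #3 but with self-loops on the prefix and suffix states, since its context lengths are not fixed.)

```python
def structure1():
    """Structure #1: one prefix, one target, one suffix state.
    Self-loops on prefix and suffix allow arbitrary-length context."""
    return {
        ("prefix", "prefix"), ("prefix", "target"),
        ("target", "target"), ("target", "suffix"),
        ("suffix", "suffix"),
    }

def structure3(n_prefix, n_target, n_suffix):
    """Structure #3: fixed-length prefix/suffix chains (no self-loops on
    the context states) and a variable number of target states."""
    edges = set()
    prefixes = [f"pre{i}" for i in range(n_prefix)]
    targets = [f"tgt{i}" for i in range(n_target)]
    suffixes = [f"suf{i}" for i in range(n_suffix)]
    # Prefix chain feeds the first target state.
    for a, b in zip(prefixes, prefixes[1:] + targets[:1]):
        edges.add((a, b))
    # Target states may self-loop or advance toward the suffix chain.
    for i, t in enumerate(targets):
        edges.add((t, t))
        nxt = targets[i + 1] if i + 1 < n_target else suffixes[0]
        edges.add((t, nxt))
    # Suffix chain: strictly forward, so suffix length is fixed.
    for a, b in zip(suffixes, suffixes[1:]):
        edges.add((a, b))
    return edges
```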
Cross-Validation
- We use cross-validation to find the optimal numbers of prefix, suffix, and target states
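A generic sketch of this search, assuming k-fold cross-validation over candidate state counts; `train_fn` and `score_fn` are stand-ins for the project's HMM training and F-measure evaluation, and all names are illustrative.

```python
import itertools
import random

def cross_validate(postings, configs, train_fn, score_fn, k=5):
    """Pick the (n_prefix, n_target, n_suffix) triple with the best
    average held-out score across k folds."""
    random.Random(0).shuffle(postings)
    folds = [postings[i::k] for i in range(k)]
    best_cfg, best_score = None, float("-inf")
    for cfg in configs:
        scores = []
        for i in range(k):
            held_out = folds[i]
            train = [p for j, f in enumerate(folds) if j != i for p in f]
            model = train_fn(cfg, train)
            scores.append(score_fn(model, held_out))
        avg = sum(scores) / k
        if avg > best_score:
            best_cfg, best_score = cfg, avg
    return best_cfg

# Candidate state counts, e.g. 1-3 of each kind of state.
configs = list(itertools.product([1, 2, 3], repeat=3))
```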
Preventing Underflow
- Postings are hundreds of words long
- Forward and backward probabilities become extremely small => underflow
- To avoid underflow, we normalize the forward probabilities at each position: instead of α_t(i), we use α̂_t(i) = α_t(i) / Σ_j α_t(j)
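This per-step rescaling can be sketched as follows; the log-likelihood is recovered from the scaling factors, so nothing underflows. This is a minimal illustration, not the project's actual code.

```python
import math

def scaled_forward(obs, states, start, trans, emit):
    """Forward algorithm with per-step normalization.
    At each position t the forward probabilities are rescaled to sum
    to 1; log P(obs) is accumulated from the scaling factors c_t."""
    alpha = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    log_likelihood = 0.0
    for t, word in enumerate(obs):
        if t > 0:
            alpha = {
                s2: sum(alpha[s1] * trans[s1].get(s2, 0.0) for s1 in states)
                    * emit[s2].get(word, 0.0)
                for s2 in states
            }
        c = sum(alpha.values())            # scaling factor c_t
        alpha = {s: a / c for s, a in alpha.items()}
        log_likelihood += math.log(c)      # log P(obs) = sum_t log c_t
    return alpha, log_likelihood
```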
Smoothing
- We perform add-one smoothing on the emission probabilities:
  P(w | s) = (count(s, w) + 1) / (count(s) + |V|), where |V| is the vocabulary size
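A minimal sketch of add-one smoothing over emission counts (function name hypothetical):

```python
from collections import Counter

def add_one_emissions(emission_counts, vocab):
    """Add-one (Laplace) smoothing of emission probabilities:
    P(w | s) = (count(s, w) + 1) / (count(s) + |V|),
    so unseen words get a small nonzero probability."""
    probs = {}
    for state, counts in emission_counts.items():
        total = sum(counts.values())
        probs[state] = {
            w: (counts.get(w, 0) + 1) / (total + len(vocab))
            for w in vocab
        }
    return probs
```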
Rapier
- Rapier automatically learns rules to extract fields from training examples
- We use the same 110 training postings as for the HMMs
Data Preparation
- Sentence Splitter (Cognitive Computation Group at UIUC) puts one sentence on each line
- Stanford Tagger (Stanford NLP Group) tags each word with its part of speech
- We then manually create a template file for each file, with the information for the 10 fields filled in
Test Data
- We use a randomly selected set of 100 postings as test data
- We manually label these 100 postings with the fields
Rapier Results
- We use Rapier's "test2" program to evaluate performance on the labeled postings
Training set:  Precision:   Recall:   F-measure:
Test set:  Precision:   Recall:   F-measure:
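The precision, recall, and F-measure figures in these result slides follow directly from the per-field counts; a minimal sketch of the computation (function name hypothetical):

```python
def prf(correct, retrieved, correct_and_retrieved):
    """Precision, recall, and balanced F-measure from the counts used in
    the result tables: `correct` = labeled instances of the field,
    `retrieved` = instances the extractor returned,
    `correct_and_retrieved` = their overlap."""
    precision = correct_and_retrieved / retrieved if retrieved else 0.0
    recall = correct_and_retrieved / correct if correct else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```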
Another Run at Rapier
Overall:  Precision   Recall   F-measure

Field               Correct   Retrieved   Correct & Retrieved   Precision   Recall   F-measure
security_deposit
square_footage
no_bathrooms
contact_person
contact_phone
nearby_landmarks
parking_cost
date_available
building_style
no_units
HMM Structure #1
Field               Correct   Retrieved   Correct & Retrieved   Precision   Recall   F-measure
security_deposit
square_footage
no_bathrooms
contact_person
contact_phone
nearby_landmarks
parking_cost
date_available
building_style
no_units

Overall:  Precision   Recall   F-measure
HMM Structure #2
Field               Correct   Retrieved   Correct & Retrieved   Precision   Recall   F-measure
security_deposit
square_footage
no_bathrooms
contact_person
contact_phone
nearby_landmarks
parking_cost
date_available
building_style
no_units

Overall:  Precision   Recall   F-measure
HMM Structure #3
Field               Correct   Retrieved   Correct & Retrieved   Precision   Recall   F-measure
security_deposit
square_footage
no_bathrooms
contact_person
contact_phone
nearby_landmarks
parking_cost
date_available
building_style
no_units

Overall:  Precision   Recall   F-measure
Insights
- Relatively good performance with Rapier
- Weaker performance with the HMMs, likely due to lack of training data: only 0.67% of the postings (100 sampled randomly), while the test data is 10% (1,500 postings)
- Automatic spelling correction is limited, even when augmented with California town, city, and county names and first names
- An advanced ontology would help; WordNet is somewhat limited for recognizing entities such as SJSU, Albertson, and street names
Question & Answer