1
Information Extraction on Real Estate Rental Classifieds
Eddy Hartanto, Ryohei Takahashi
2
Overview
We want to extract 10 fields:
  Security deposit
  Square footage
  Number of bathrooms
  Contact person’s name
  Contact phone number
  Nearby landmarks
  Cost of parking
  Date available
  Building style / architecture
  Number of units in building
These fields can’t easily be served by keyword search
3
Approach
Hand-labeled test set as the basis for computing precision and recall
Pattern-matching approach with Rapier
Statistical approach using HMMs with different structures
4
Demo …
5
Hidden Markov Models
We consider three different HMM structures
We train one HMM per field
Words in postings are the output symbols of the HMM
Hexagons represent target states, which emit the relevant words for that field
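To make the role of the target states concrete, here is a minimal sketch of how a trained per-field HMM could be used for extraction: decode the most likely state sequence and keep the words assigned to the target state. The slides do not say which decoding method was used, so Viterbi decoding is an assumption here, and the three-state toy model, its probabilities, and the names `viterbi` and `extract` are purely illustrative.

```python
import math

def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-6):
    """Most likely state sequence for `words`, computed in log space.

    `unk` is an assumed floor probability for words a state never emitted.
    """
    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: lp(start_p.get(s, 0)) + lp(emit_p[s].get(words[0], unk))
             for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + lp(trans_p[r].get(s, 0))) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lp(emit_p[s].get(words[t], unk))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def extract(words, path, target="target"):
    """Keep the words that were decoded into the target state."""
    return [w for w, s in zip(words, path) if s == target]

# Toy model for a hypothetical "security deposit" field (made-up numbers).
states = ["prefix", "target", "suffix"]
start_p = {"prefix": 1.0}
trans_p = {
    "prefix": {"prefix": 0.5, "target": 0.5},
    "target": {"target": 0.3, "suffix": 0.7},
    "suffix": {"suffix": 1.0},
}
emit_p = {
    "prefix": {"security": 0.4, "deposit": 0.4, "is": 0.2},
    "target": {"$500": 0.9, "$1000": 0.1},
    "suffix": {"required": 0.5, "call": 0.5},
}
words = ["security", "deposit", "$500", "required"]
path = viterbi(words, states, start_p, trans_p, emit_p)
extracted = extract(words, path)
```

With this toy model the decoded path is prefix, prefix, target, suffix, so only `$500` is extracted.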
6
Training Data
We use a randomly selected set of 110 postings as the training data
We manually label which words in each posting are relevant to each of the 10 fields
7
HMM Structure #1
A single prefix state and a single suffix state
Prefixes and suffixes can be of arbitrary length
8
HMM Structure #2
Varying numbers of prefix, suffix, and target states
9
HMM Structure #3
Varying numbers of prefix, suffix, and target states
Prefixes and suffixes are fixed in length
10
Cross-Validation
We use cross-validation to find the optimal number of prefix, suffix, and target states
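A model-selection loop of this kind might look like the sketch below. The slides do not give the number of folds, the grid of state counts, or the scoring metric, so `k = 10`, the 1 to 4 range, and the `train_and_score` callback are all assumptions; `dummy_score` is a hypothetical stand-in for "train an HMM on the training folds and return its F-measure on the held-out fold".

```python
from itertools import product

def k_fold_splits(n_items, k):
    """Yield (train_indices, held_out_indices) for k roughly equal folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def select_structure(n_items, k, grid, train_and_score):
    """Return the (n_prefix, n_target, n_suffix) with the best mean held-out score."""
    best_cfg, best_score = None, float("-inf")
    for cfg in grid:
        scores = [train_and_score(cfg, train, held_out)
                  for train, held_out in k_fold_splits(n_items, k)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_cfg, best_score = cfg, mean
    return best_cfg, best_score

def dummy_score(cfg, train_idx, held_out_idx):
    """Hypothetical scorer: pretends (2, 3, 1) is the ideal structure."""
    n_prefix, n_target, n_suffix = cfg
    return -abs(n_prefix - 2) - abs(n_target - 3) - abs(n_suffix - 1)

# Every combination of 1..4 prefix, target, and suffix states (assumed grid).
grid = list(product(range(1, 5), repeat=3))
best_cfg, best_score = select_structure(110, 10, grid, dummy_score)
```

In a real run, `train_and_score` would be called once per field, since one HMM is trained per field.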
11
Preventing Underflow
Postings are hundreds of words long
Forward and backward probabilities become incredibly small => underflow
To avoid underflow, we normalize the forward probabilities: instead of
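The formula from this slide did not survive in the text. A common way to normalize, assumed here and not confirmed as the authors' exact scheme, is to divide the forward probabilities at each step by their sum and to accumulate the logarithms of those scaling factors, which recovers the log-likelihood without ever forming a tiny product. The two-state toy model below is made up for illustration.

```python
import math

def forward_scaled(obs, states, start_p, trans_p, emit_p):
    """Scaled forward pass.

    Returns (alpha_hat, log_likelihood), where alpha_hat[t][s] is the forward
    probability of state s at time t, renormalized so each step sums to 1.
    Assumes every observation has nonzero probability under the model
    (smoothed emissions would guarantee that).
    """
    alpha_hat, log_lik, prev = [], 0.0, None
    for t, o in enumerate(obs):
        if t == 0:
            raw = {s: start_p[s] * emit_p[s][o] for s in states}
        else:
            raw = {s: emit_p[s][o] * sum(prev[r] * trans_p[r][s] for r in states)
                   for s in states}
        c = sum(raw.values())                    # scaling factor for step t
        prev = {s: v / c for s, v in raw.items()}
        alpha_hat.append(prev)
        log_lik += math.log(c)                   # log P(obs) = sum of log c_t
    return alpha_hat, log_lik

# Toy model (made-up numbers): a background state and a target state.
states = ["bg", "target"]
start_p = {"bg": 0.9, "target": 0.1}
trans_p = {"bg": {"bg": 0.8, "target": 0.2},
           "target": {"bg": 0.5, "target": 0.5}}
emit_p = {"bg": {"a": 0.7, "b": 0.3},
          "target": {"a": 0.1, "b": 0.9}}

long_obs = ["a", "b"] * 2500                     # 5000 symbols
alpha_hat, log_lik = forward_scaled(long_obs, states, start_p, trans_p, emit_p)
```

Because each step is renormalized, the stored values stay in a comfortable numeric range no matter how long the posting is; only the running sum of logs grows in magnitude.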
12
Smoothing
We perform add-one smoothing for the emission probabilities:
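The slide's formula is missing from the text. In its standard form, add-one (Laplace) smoothing estimates the probability that state s emits word w as (count(s, w) + 1) / (count(s) + |V|), where |V| is the vocabulary size. The sketch below applies that standard form; the function name, the labeled pairs, and the vocabulary are illustrative assumptions.

```python
from collections import Counter

def add_one_emissions(state_word_pairs, vocab):
    """Add-one smoothed emission probabilities P(word | state).

    `state_word_pairs` is a list of (state, word) labels from training data.
    `vocab` is assumed to contain every word that appears in those pairs.
    """
    counts = {}
    for state, word in state_word_pairs:
        counts.setdefault(state, Counter())[word] += 1
    v = len(vocab)
    probs = {}
    for state, c in counts.items():
        total = sum(c.values())
        # Every vocabulary word gets one pseudo-count, so no probability is zero.
        probs[state] = {w: (c[w] + 1) / (total + v) for w in vocab}
    return probs

# Made-up labeled data: three target emissions and one prefix emission.
pairs = [("target", "$500"), ("target", "$500"),
         ("target", "$750"), ("prefix", "deposit")]
vocab = ["$500", "$750", "call", "deposit"]
probs = add_one_emissions(pairs, vocab)
```

Here the target state has 3 observed emissions and the vocabulary has 4 words, so `$500` gets (2 + 1) / (3 + 4) = 3/7, while the unseen word `call` still gets 1/7.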
13
Rapier
Rapier automatically learns rules to extract fields from training examples
We use the same 110 training postings as for the HMMs
14
Data Preparation
Sentence Splitter (Cognitive Computation Group at UIUC, http://l2r.cs.uiuc.edu/~cogcomp/tools.php): puts one sentence on each line
Stanford Tagger (Stanford NLP Group, http://nlp.stanford.edu/software/tagger.shtml): tags each word with its part of speech
We then manually create a template file for each of the files, with the information for the 10 fields filled in
15
Test Data
We use a randomly selected set of 100 postings as the test data
We manually label these 100 postings with the fields
16
Rapier Results
We use Rapier’s “test2” program to evaluate performance on the labeled postings
Training Set: Precision 0.990099, Recall 0.408998, F-measure 0.578871
Test Set: Precision 0.747126, Recall 0.151869, F-measure 0.252427
17
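Another Run at Rapier
Overall: Precision 0.847, Recall 0.201, F-measure 0.324

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 10 | 10 | 1 | 0.417 | 0.588 |
| no_bathrooms | 58 | 28 | 25 | 0.893 | 0.431 | 0.581 |
| contact_person | 40 | 28 | 24 | 0.857 | 0.6 | 0.706 |
| contact_phone | 93 | 2 | 1 | 0.5 | 0.011 | 0.021 |
| nearby_landmarks | 76 | 8 | 5 | 0.625 | 0.066 | 0.119 |
| parking_cost | 4 | 0 | 0 | 0 | 0 | 0 |
| date_available | 21 | 1 | 0 | 0 | 0 | 0 |
| building_style | 6 | 4 | 4 | 1 | 0.667 | 0.8 |
| no_units | 14 | 4 | 3 | 0.75 | 0.214 | 0.333 |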
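The per-field and overall figures in this Rapier run follow from the three counts per field. The sketch below recomputes them, assuming the overall row is a micro-average (counts summed over all fields before dividing), which reproduces the reported numbers. The function name `prf` is illustrative, and the square_footage counts are inferred from its precision of 1 and recall of 0.417.

```python
def prf(correct, retrieved, both):
    """Precision, recall, and F-measure from raw counts.

    correct:   number of gold-standard items for the field
    retrieved: number of items the system extracted
    both:      number of extracted items that are also correct
    """
    precision = both / retrieved if retrieved else 0.0
    recall = both / correct if correct else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# (correct, retrieved, correct & retrieved) per field, from the Rapier run.
rapier_counts = {
    "security_deposit": (23, 0, 0),
    "square_footage": (24, 10, 10),   # inferred from precision 1, recall 0.417
    "no_bathrooms": (58, 28, 25),
    "contact_person": (40, 28, 24),
    "contact_phone": (93, 2, 1),
    "nearby_landmarks": (76, 8, 5),
    "parking_cost": (4, 0, 0),
    "date_available": (21, 1, 0),
    "building_style": (6, 4, 4),
    "no_units": (14, 4, 3),
}

per_field = {name: prf(*c) for name, c in rapier_counts.items()}

# Micro-average: sum each count column across fields, then compute the metrics.
totals = [sum(col) for col in zip(*rapier_counts.values())]
overall = prf(*totals)   # rounds to 0.847 / 0.201 / 0.324
```

The summed counts are 359 correct, 85 retrieved, and 72 in both, giving 72/85 for precision and 72/359 for recall.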
18
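HMM Structure #1

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 0 | 0 | 0 | 0 | 0 |
| no_bathrooms | 58 | 100 | 17 | 0.17 | 0.293 | 0.215 |
| contact_person | 40 | 100 | 0 | 0 | 0 | 0 |
| contact_phone | 93 | 41 | 26 | 0.634 | 0.28 | 0.388 |
| nearby_landmarks | 76 | 0 | 0 | 0 | 0 | 0 |
| parking_cost | 4 | 59 | 0 | 0 | 0 | 0 |
| date_available | 21 | 100 | 0 | 0 | 0 | 0 |
| building_style | 6 | 100 | 2 | 0.02 | 0.333 | 0.038 |
| no_units | 14 | 0 | 0 | 0 | 0 | 0 |

Overall: Precision 0.09, Recall 0.125, F-measure 0.105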
19
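HMM Structure #2

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 9 | 8 | 0.889 | 0.333 | 0.485 |
| no_bathrooms | 58 | 0 | 0 | 0 | 0 | 0 |
| contact_person | 40 | 0 | 0 | 0 | 0 | 0 |
| contact_phone | 93 | 100 | 7 | 0.07 | 0.075 | 0.073 |
| nearby_landmarks | 76 | 0 | 0 | 0 | 0 | 0 |
| parking_cost | 4 | 0 | 0 | 0 | 0 | 0 |
| date_available | 21 | 0 | 0 | 0 | 0 | 0 |
| building_style | 6 | 3 | 0 | 0 | 0 | 0 |
| no_units | 14 | 0 | 0 | 0 | 0 | 0 |

Overall: Precision 0.134, Recall 0.042, F-measure 0.064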
20
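HMM Structure #3

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 9 | 8 | 0.889 | 0.333 | 0.485 |
| no_bathrooms | 58 | 100 | 37 | 0.37 | 0.638 | 0.468 |
| contact_person | 40 | 100 | 4 | 0.04 | 0.1 | 0.057 |
| contact_phone | 93 | 100 | 6 | 0.06 | 0.065 | 0.062 |
| nearby_landmarks | 76 | 100 | 7 | 0.07 | 0.092 | 0.08 |
| parking_cost | 4 | 0 | 0 | 0 | 0 | 0 |
| date_available | 21 | 4 | 0 | 0 | 0 | 0 |
| building_style | 6 | 31 | 1 | 0.032 | 0.167 | 0.054 |
| no_units | 14 | 0 | 0 | 0 | 0 | 0 |

Overall: Precision 0.142, Recall 0.175, F-measure 0.157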
21
Insights
Rapier performs relatively well
The HMMs perform poorly, due to a lack of training data (only 0.67%, or 100 postings sampled randomly from 15000), while the test data is 10%, or 1500 postings sampled from the same 15000
Automatic spelling correction is limited, even when enhanced with California town, city, and county names and with people's first names
A more advanced ontology would help, as WordNet is somewhat limited: it should recognize entities such as SJSU, Albertson, and street names
22
Question & Answer