Information Extraction on Real Estate Rental Classifieds Eddy Hartanto Ryohei Takahashi.

Overview
We want to extract 10 fields: security deposit, square footage, number of bathrooms, contact person’s name, contact phone number, nearby landmarks, cost of parking, date available, building style / architecture, and number of units in building.
These fields can’t easily be served by keyword search.

Approach
Hand-labeled test set as the basis for computing precision and recall.
Pattern-matching approach with Rapier.
Statistical approach using HMMs with different structures.

Demo …

Hidden Markov Models
We consider three different HMM structures.
We train one HMM per field.
Words in postings are the output symbols of the HMM.
Hexagons represent target states, which emit the relevant words for that field.

Training Data
We use a randomly selected set of 110 postings as the training data.
We manually label which words in each posting are relevant to each of the 10 fields.

HMM Structure #1
A single prefix state and a single suffix state.
Prefixes and suffixes can be of arbitrary length.
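To make the idea of an arbitrary-length prefix and suffix concrete, here is a minimal sketch of one possible topology for structure #1. The state names, the self-loop on the target state, and the uniform initial probabilities are assumptions made for illustration; the slides do not specify them.

```python
# Hypothetical topology for a structure-#1-style HMM (names are assumptions).
# ALLOWED[s] lists the states reachable from s in one step.
ALLOWED = {
    "prefix": ["prefix", "target"],  # self-loop: prefix of arbitrary length
    "target": ["target", "suffix"],  # self-loop assumed so multi-word fields fit
    "suffix": ["suffix"],            # self-loop: suffix of arbitrary length
}


def uniform_transitions(allowed):
    """Spread probability mass evenly over each state's allowed successors."""
    return {
        src: {dst: 1.0 / len(dsts) for dst in dsts}
        for src, dsts in allowed.items()
    }


def path_is_valid(path, allowed):
    """Check that a state sequence uses only allowed transitions."""
    return all(b in allowed[a] for a, b in zip(path, path[1:]))


# Starting point for training; real values would be re-estimated from data.
TRANSITIONS = uniform_transitions(ALLOWED)
```

The self-loops are what allow any number of prefix or suffix words around the target; removing them would force fixed-length contexts.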

HMM Structure #2
Varying numbers of prefix, suffix, and target states.

HMM Structure #3
Varying numbers of prefix, suffix, and target states.
Prefixes and suffixes are fixed in length.

Cross-Validation
We use cross-validation to find the optimal number of prefix, suffix, and target states.
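The selection procedure could look roughly like the sketch below. The number of folds, the candidate grid, and the train_and_score callback are all hypothetical; the slides do not say how the folds were formed or which score was optimized.

```python
from itertools import product


def k_fold_splits(items, k):
    """Yield (train, held_out) pairs; fold i holds out every k-th item from i."""
    for i in range(k):
        held_out = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out


def select_state_counts(postings, candidates, train_and_score, k=5):
    """Return the (n_prefix, n_target, n_suffix) tuple with the best mean score.

    train_and_score(train, held_out, config) is a caller-supplied, hypothetical
    function that trains an HMM with that configuration and returns a score
    such as F-measure on the held-out postings.
    """
    best_config, best_score = None, float("-inf")
    for config in candidates:
        scores = [
            train_and_score(train, held_out, config)
            for train, held_out in k_fold_splits(postings, k)
        ]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_config, best_score = config, mean
    return best_config, best_score


# Candidate grid: 1 to 3 states of each kind (range chosen only for illustration).
CANDIDATES = list(product(range(1, 4), repeat=3))
```

Because one HMM is trained per field, a search like this would be run separately for each of the 10 fields.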

Preventing Underflow
Postings are hundreds of words long.
Forward and backward probabilities become extremely small, which causes underflow.
To avoid underflow, we normalize the forward probabilities at each step instead of using them unnormalized.
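The formula from the original slide is not preserved here, so the following is a sketch of the standard scaling technique rather than the authors' exact method: at each position the forward vector is divided by its sum, and the logs of those sums are accumulated to recover the log-likelihood. The dictionary-based model representation is an assumption made for readability.

```python
import math


def scaled_forward(obs, states, start, trans, emit):
    """Forward pass with per-step normalization.

    start[s], trans[r][s], and emit[s][word] are plain probability dicts.
    Returns the log-likelihood of obs and the normalized forward vectors.
    Assumes every word has nonzero emission probability in some state
    (for example, after smoothing); otherwise the scale would be zero.
    """
    log_likelihood = 0.0
    alphas = []
    prev = None
    for t, word in enumerate(obs):
        if t == 0:
            raw = {s: start[s] * emit[s][word] for s in states}
        else:
            raw = {
                s: emit[s][word] * sum(prev[r] * trans[r][s] for r in states)
                for s in states
            }
        scale = sum(raw.values())          # total mass before normalization
        prev = {s: raw[s] / scale for s in states}
        alphas.append(prev)
        log_likelihood += math.log(scale)  # log P(obs) is the sum of log scales
    return log_likelihood, alphas
```

With this scheme each stored vector sums to 1, so its entries stay in a representable range no matter how long the posting is.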

Smoothing
We perform add-one smoothing for the emission probabilities.
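The slide's formula did not survive, so this sketch assumes the usual add-one (Laplace) form: P(word | state) = (count(state, word) + 1) / (count(state) + V), where V is the vocabulary size. The function name and input layout are illustrative only.

```python
from collections import Counter


def add_one_emissions(state_word_pairs, vocabulary):
    """Estimate P(word | state) with add-one (Laplace) smoothing.

    state_word_pairs: iterable of (state, word) pairs from labeled postings.
    vocabulary: set of all words that may be emitted, including unseen ones.
    """
    pair_counts = Counter(state_word_pairs)
    state_totals = Counter()
    for (state, _word), n in pair_counts.items():
        state_totals[state] += n
    v = len(vocabulary)
    return {
        state: {
            word: (pair_counts[(state, word)] + 1) / (total + v)
            for word in vocabulary
        }
        for state, total in state_totals.items()
    }
```

Every word in the vocabulary receives a nonzero probability in every observed state, so a posting containing a word never seen in training does not drive the whole sequence probability to zero.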

Rapier
Rapier automatically learns rules to extract fields from training examples.
We use the same 110 training postings as for the HMMs.

Data Preparation
Sentence Splitter (Cognitive Computation Group at UIUC, http://l2r.cs.uiuc.edu/~cogcomp/tools.php): puts one sentence on each line.
Stanford Tagger (Stanford NLP Group, http://nlp.stanford.edu/software/tagger.shtml): tags each word with its part of speech.
We then manually create a template file for each of the files, with the information for the 10 fields filled in.

Test Data
We use a randomly selected set of 100 postings as the test data.
We manually label these 100 postings with the fields.

Rapier Results
We use Rapier’s “test2” program to evaluate performance on the labeled postings.

              Precision  Recall    F-measure
Training Set  0.990099   0.408998  0.578871
Test Set      0.747126   0.151869  0.252427

Another Run at Rapier
Overall: precision 0.847, recall 0.201, F-measure 0.324

Field             Correct  Retrieved  Correct & Retrieved  Precision  Recall  F-measure
security_deposit  23       0          0                    0          0       0
square_footage    24       10         10                   1          0.417   0.588
no_bathrooms      58       28         25                   0.893      0.431   0.581
contact_person    40       28         24                   0.857      0.6     0.706
contact_phone     93       2          1                    0.5        0.011   0.021
nearby_landmarks  76       8          5                    0.625      0.066   0.119
parking_cost      4        0          0                    0          0       0
date_available    21       1          0                    0          0       0
building_style    6        4          4                    1          0.667   0.8
no_units          14       4          3                    0.75       0.214   0.333
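As a check on how the columns relate, the sketch below recomputes precision, recall, and F-measure from the per-field counts and sums the counts to obtain the overall figures. Treating "overall" as this kind of pooled (micro-averaged) score is an assumption, although it does reproduce the reported numbers. The retrieved counts for square_footage are inferred from its precision of 1 and recall of 0.417, since the transcript garbles that row.

```python
def prf(correct, retrieved, correct_and_retrieved):
    """Return (precision, recall, F-measure); ratios with a zero denominator are 0."""
    precision = correct_and_retrieved / retrieved if retrieved else 0.0
    recall = correct_and_retrieved / correct if correct else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# field -> (correct, retrieved, correct & retrieved), from the Rapier table.
RAPIER_ROWS = {
    "security_deposit": (23, 0, 0),
    "square_footage": (24, 10, 10),  # retrieved counts inferred, see lead-in
    "no_bathrooms": (58, 28, 25),
    "contact_person": (40, 28, 24),
    "contact_phone": (93, 2, 1),
    "nearby_landmarks": (76, 8, 5),
    "parking_cost": (4, 0, 0),
    "date_available": (21, 1, 0),
    "building_style": (6, 4, 4),
    "no_units": (14, 4, 3),
}

# Pool the counts over all fields, then score once.
TOTALS = [sum(column) for column in zip(*RAPIER_ROWS.values())]
OVERALL = prf(*TOTALS)
```

Rounded to three places, OVERALL matches the reported 0.847, 0.201, and 0.324, which suggests the overall scores were pooled over fields rather than averaged per field.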

HMM Structure #1

Field             Correct  Retrieved  Correct & Retrieved  Precision  Recall  F-measure
security_deposit  23       0          0                    0          0       0
square_footage    24       0          0                    0          0       0
no_bathrooms      58       100        17                   0.17       0.293   0.215
contact_person    40       100        0                    0          0       0
contact_phone     93       41         26                   0.634      0.28    0.388
nearby_landmarks  76       0          0                    0          0       0
parking_cost      4        59         0                    0          0       0
date_available    21       100        0                    0          0       0
building_style    6        100        2                    0.02       0.333   0.038
no_units          14       0          0                    0          0       0

Overall: precision 0.09, recall 0.125, F-measure 0.105

HMM Structure #2

Field             Correct  Retrieved  Correct & Retrieved  Precision  Recall  F-measure
security_deposit  23       0          0                    0          0       0
square_footage    24       9          8                    0.889      0.333   0.485
no_bathrooms      58       0          0                    0          0       0
contact_person    40       0          0                    0          0       0
contact_phone     93       100        7                    0.07       0.075   0.073
nearby_landmarks  76       0          0                    0          0       0
parking_cost      4        0          0                    0          0       0
date_available    21       0          0                    0          0       0
building_style    6        3          0                    0          0       0
no_units          14       0          0                    0          0       0

Overall: precision 0.134, recall 0.042, F-measure 0.064

HMM Structure #3

Field             Correct  Retrieved  Correct & Retrieved  Precision  Recall  F-measure
security_deposit  23       0          0                    0          0       0
square_footage    24       9          8                    0.889      0.333   0.485
no_bathrooms      58       100        37                   0.37       0.638   0.468
contact_person    40       100        4                    0.04       0.1     0.057
contact_phone     93       100        6                    0.06       0.065   0.062
nearby_landmarks  76       100        7                    0.07       0.092   0.08
parking_cost      4        0          0                    0          0       0
date_available    21       4          0                    0          0       0
building_style    6        31         1                    0.032      0.167   0.054
no_units          14       0          0                    0          0       0

Overall: precision 0.142, recall 0.175, F-measure 0.157

Insights
Rapier performs relatively well.
The HMMs perform poorly, due to a lack of training data (only 0.67%, or 100 postings sampled randomly from 15000 postings), while the test data is 10%, or 1500 postings sampled from 15000 postings.
Automatic spelling correction remains limited, even when enhanced with California town, city, and county names and personal first names.
A more advanced ontology would help, as WordNet is somewhat limited: ideally it would recognize entities such as SJSU, Albertson, and street names.

Question & Answer
