Sunita Sarawagi IIT Bombay Team: Rahul Gupta (PhD)


1 Statistical learning models for information extraction and entity resolution
Sunita Sarawagi, IIT Bombay
Team: Rahul Gupta (PhD), Abhishek Agarkar (MTech), Upendra (MTech), Pranav Kashyap (BTech)

2 Information extraction
Formulate task as a statistical learning model
  Exploiting entity-level features
    Semi-Markov conditional random fields for information extraction (NIPS 2004)
  Collective inference
    New method of modeling the inference problem → interesting inference algorithms
  Inference algorithms
    Efficient inference on sequence segmentation models (ICML 2006)
    Efficient inference with cardinality-based clique potentials (ICML 2007)
  Training algorithms for structured models
    Better generalizability and more tractable inference (ICML 2008)
  Domain adaptation: adapt models trained in one domain to new domains

3 Semi-Markov models
Example input x: R. Fagin and J. Halpern Belief Awareness Reasoning (positions t = 1..8)
Labels: Author, Other, Title

Sequence model (x, y)
  One label per word: y1 y2 y3 y4 y5 y6 y7 y8
  Features describe the single word "Fagin"

Segmentation model (x, y, l, u)
  One label per segment, where segment j spans positions l_j to u_j
  Features describe the full entity, e.g. similarity to the author's column in the database
  Segments of the example:
    l_1 = 1, u_1 = 2   R. Fagin                      Author
    l_2 = u_2 = 3      and                           Other
    l_3 = 4, u_3 = 5   J. Halpern                    Author
    l_4 = 6, u_4 = 8   Belief Awareness Reasoning    Title

Notes: list more entity-level features. Say that the begin/continue/end encoding is an approximation.
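The segmentation model above can be illustrated with a small decoding sketch. This is not the authors' implementation: `best_segmentation`, `toy_score`, the hand-set scores, and the tiny `AUTHORS` set standing in for a database column are all assumptions made for illustration. It shows the usual semi-Markov dynamic program, where each candidate segment (l, u) is scored as a whole, so an entity-level feature such as "the full segment matches an author name" can be used.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[int, int, str]  # (l, u, label), positions are 1-based and inclusive
ScoreFn = Callable[[Sequence[str], int, int, str, Optional[str]], float]


def best_segmentation(tokens: Sequence[str], labels: Sequence[str],
                      score: ScoreFn, max_len: int) -> Tuple[float, List[Segment]]:
    """Semi-Markov Viterbi: best labeled segmentation with segments of at most max_len tokens."""
    n = len(tokens)
    # best[u][y] = score of the best segmentation of tokens 1..u whose last segment has label y
    best = [dict() for _ in range(n + 1)]
    back = [dict() for _ in range(n + 1)]
    best[0][None] = 0.0  # empty prefix, no previous label
    for u in range(1, n + 1):
        for y in labels:
            for l in range(max(1, u - max_len + 1), u + 1):
                for y_prev, prev_score in best[l - 1].items():
                    cand = prev_score + score(tokens, l, u, y, y_prev)
                    if y not in best[u] or cand > best[u][y]:
                        best[u][y] = cand
                        back[u][y] = (l, y_prev)
    # Trace back from the best final label.
    y = max(best[n], key=best[n].get)
    total = best[n][y]
    segments: List[Segment] = []
    u = n
    while u > 0:
        l, y_prev = back[u][y]
        segments.append((l, u, y))
        u, y = l - 1, y_prev
    segments.reverse()
    return total, segments


# Hypothetical stand-in for the "author's column in database".
AUTHORS = {"R. Fagin", "J. Halpern"}
TOKENS = ["R.", "Fagin", "and", "J.", "Halpern", "Belief", "Awareness", "Reasoning"]
LABELS = ["Author", "Other", "Title"]


def toy_score(tokens, l, u, y, y_prev):
    """Hand-set segment scores (assumed, not learned); y_prev is unused in this toy."""
    words = tokens[l - 1:u]
    text = " ".join(words)
    if y == "Author":  # entity-level feature: the whole segment matches a known author
        return 3.0 if text in AUTHORS else -2.0 * len(words)
    if y == "Other":
        return 1.0 if text.lower() == "and" else -2.0 * len(words)
    # Title: reward long alphabetic words, charge a fixed cost per segment
    if all(w.isalpha() and len(w) > 3 for w in words):
        return 1.0 * len(words) - 1.0
    return -2.0 * len(words)


if __name__ == "__main__":
    print(best_segmentation(TOKENS, LABELS, toy_score, max_len=4))
```

With these assumed scores the decoder recovers the four segments listed on the slide: (1, 2, Author), (3, 3, Other), (4, 5, Author), (6, 8, Title). In a trained model the score would be a weighted sum of features rather than hand-set constants.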

4 Inference in segmentation models
Example input: R. Fagin and J. Halpern, Belief, awareness, reasoning, In AI
Features
  Surface features (cheap)
  Database lookup features (expensive!)
Database: many large tables
  Authors table, Name column: S Chakrabarti, Jay Shan, Jackie Chan, Bill Gates, Thorsten J Kleinberg, J. Gherke, Claire Cardie, Jeffrey Ullman, Ron Fagin, J. Ullman, M Y Vardi
  Inverted index
Efficient search for top-k most similar entities
  Large DBs: think of the DBLP and Citeseer databases
  First challenge: top-k tf-idf searches are highly optimized to exploit the tf scores of each word in the query segment
  Second challenge: combine two top-k searches
  Batch up to do better than individual top-k searches?
  Find the top segmentation without top-k matches for all segments?
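To make the database lookup feature concrete, here is a minimal sketch of a top-k tf-idf search over an inverted index, the kind of lookup that would be issued for each candidate segment. The class `NameIndex`, the smoothed idf formula, and the tokenizer are assumptions chosen for illustration, and the names are a subset of the slide's example table. It does not include the batching or pruning optimizations that the slide raises as open questions.

```python
import heapq
import math
import re
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NameIndex:
    """Inverted index over short strings with tf-idf cosine scoring (illustrative only)."""

    def __init__(self, names: List[str]) -> None:
        self.names = list(names)
        docs = [Counter(tokenize(name)) for name in self.names]
        df = Counter(tok for doc in docs for tok in doc)
        n = len(docs)
        # Smoothed idf (an assumed variant) so a token present in every name keeps nonzero weight.
        self.idf: Dict[str, float] = {tok: math.log(1.0 + n / c) for tok, c in df.items()}
        # postings[token] = list of (doc_id, length-normalized tf-idf weight)
        self.postings: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
        for doc_id, doc in enumerate(docs):
            weights = {tok: tf * self.idf[tok] for tok, tf in doc.items()}
            norm = math.sqrt(sum(w * w for w in weights.values()))
            for tok, w in weights.items():
                self.postings[tok].append((doc_id, w / norm))

    def top_k(self, query: str, k: int) -> List[Tuple[str, float]]:
        """Return up to k (name, cosine similarity) pairs, most similar first."""
        q = {tok: tf * self.idf[tok]
             for tok, tf in Counter(tokenize(query)).items() if tok in self.idf}
        q_norm = math.sqrt(sum(w * w for w in q.values()))
        if q_norm == 0.0:
            return []  # no query token occurs in the index
        scores: Dict[int, float] = defaultdict(float)
        for tok, q_w in q.items():
            # Only names that share a token with the query are touched.
            for doc_id, d_w in self.postings[tok]:
                scores[doc_id] += (q_w / q_norm) * d_w
        top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
        return [(self.names[doc_id], s) for doc_id, s in top]


NAMES = ["S Chakrabarti", "Jackie Chan", "Bill Gates", "Claire Cardie",
         "Jeffrey Ullman", "Ron Fagin", "J. Ullman", "M Y Vardi"]

if __name__ == "__main__":
    index = NameIndex(NAMES)
    for name, sim in index.top_k("J Ullman", k=2):
        print(f"{name}: {sim:.3f}")
```

For the query segment "J Ullman", the exact match "J. Ullman" ranks first and "Jeffrey Ullman" second, since only those two names share a token with the query. The cost concern on the slide arises because a segmentation model issues one such lookup per candidate segment, not one per citation.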

5 Collective information extraction
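Example sentences:
  Y has character.
  Mr. X lives in Y.
  X buys Y Times daily.
[Figure: one label variable per token of each sentence: y11 y21 y31 y41 for the first sentence, y12 y22 y32 y42 y52 for the second, y13 y23 y33 y43 y53 for the third]
Associative scores: w·f_e(i,i) > w·f_e(i,j)
Todo: add pointers/colors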
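The associative inequality above can be illustrated on a toy clique. The sketch below assumes three occurrences of the word "Y", two hypothetical labels, made-up local scores, and a pairwise potential that pays `w_same` when two occurrences agree and `w_diff` otherwise, with `w_same > w_diff`. It finds the best joint labeling by brute force, which is only feasible for tiny cliques; the ICML 2007 paper listed in the outline concerns efficient inference for such cliques, which this sketch does not attempt to reproduce.

```python
from itertools import combinations, product
from typing import Dict, List, Sequence, Tuple


def joint_score(labels: Sequence[str], local: List[Dict[str, float]],
                w_same: float, w_diff: float) -> float:
    """Sum of per-occurrence scores plus a pairwise score for every pair of occurrences."""
    node = sum(local[k][y] for k, y in enumerate(labels))
    edge = sum(w_same if a == b else w_diff for a, b in combinations(labels, 2))
    return node + edge


def map_labels(local: List[Dict[str, float]], w_same: float, w_diff: float) -> Tuple[str, ...]:
    """Exhaustive search over all joint labelings (exponential; for illustration only)."""
    label_set = sorted(local[0])
    return max(product(label_set, repeat=len(local)),
               key=lambda ys: joint_score(ys, local, w_same, w_diff))


# Hypothetical local scores for the three occurrences of "Y" in the example sentences.
LOCAL = [
    {"Location": 0.4, "Organization": 0.6},  # "Y has character."
    {"Location": 2.0, "Organization": 0.0},  # "Mr. X lives in Y."
    {"Location": 0.5, "Organization": 0.7},  # "X buys Y Times daily."
]

if __name__ == "__main__":
    print(map_labels(LOCAL, w_same=0.0, w_diff=0.0))  # independent decisions
    print(map_labels(LOCAL, w_same=0.5, w_diff=0.0))  # associative coupling
```

Without coupling each occurrence takes its locally best label, so the three occurrences disagree. With `w_same = 0.5` the confident middle occurrence pulls the other two to the same label, which is the effect an associative potential is meant to have.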

6 Information extraction
Formulate task as a statistical learning model
  Exploiting entity-level features
    Semi-Markov conditional random fields for information extraction (NIPS 2004)
  Collective inference
    New method of modeling the inference problem → interesting inference algorithms
  Inference algorithms
    Efficient inference on sequence segmentation models (ICML 2006)
    Efficient inference with cardinality-based clique potentials (ICML 2007)
  Training algorithms for structured models
    Better generalizability and more tractable inference (ICML 2008)
  Domain adaptation: adapt models trained in one domain to new domains

7 Managing imprecision
Representing the imprecision of extraction as simple row and column uncertainty models for easy querying (VLDB 06)
Aggregate queries over data with uncertain duplicates (EDBT 08)
  Given a large set of entities with noisy duplicates, where finding duplicate groups is expensive, find:
    the K largest groups
    groups with count >= threshold
All papers available at:

