Sunita Sarawagi IIT Bombay Team: Rahul Gupta (PhD)

Statistical learning models for information extraction and entity resolution
Sunita Sarawagi, IIT Bombay
Team: Rahul Gupta (PhD), Abhishek Agarkar (MTech), Upendra (MTech), Pranav Kashyap (BTech)

Information extraction
Formulate task as a statistical learning model
  Exploiting entity-level features
    Semi-markov conditional random fields for information extraction. In NIPS, 2004
  Collective inference
    New method of modeling the inference problem, leading to interesting inference algorithms
Inference algorithms
  Efficient inference on sequence segmentation models, ICML 2006
  Efficient inference with cardinality-based clique potentials. In ICML, 2007
Training algorithms for structured models
  Better generalizability and more tractable inference (ICML 08)
Domain adaptation: adapt models trained in one domain to new domains

Semi-Markov models
Sequence model: one label y_t per word x_t
  Features describe a single word, e.g. "Fagin"
  t    1       2       3      4       5        6       7          8
  x    R.      Fagin   and    J.      Halpern  Belief  Awareness  Reasoning
  y    Author  Author  Other  Author  Author   Title   Title      Title
Segmentation model: one label y per segment (l, u) of x
  Features describe the full entity, e.g. similarity to the author's column in the database
  l1=1, u1=2   R. Fagin                     Author
  l2=u2=3      and                          Other
  l3=4, u3=5   J. Halpern                   Author
  l4=6, u4=8   Belief Awareness Reasoning   Title
Notes: list more entity-level features. The Begin/Continue/End encoding of a token-level model is only an approximation of segment-level features.
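To make the segmentation model concrete, here is a minimal sketch of segment-level Viterbi decoding on the citation from this slide. The scoring function `toy_score`, the tiny author list, and all weights are assumptions made up for illustration. They are not the trained features or parameters of the semi-Markov CRF from the NIPS 2004 paper.

```python
# Sketch of segment-level (semi-Markov) Viterbi decoding.
# ASSUMPTION: toy_score, AUTHOR_DB and every weight below are hypothetical.

AUTHOR_DB = {"R. Fagin", "J. Halpern"}  # stand-in for the author column

def toy_score(tokens, l, u, y, prev_y):
    """Score labelling tokens l..u (1-based, inclusive) as one segment y."""
    text = " ".join(tokens[l - 1:u])
    length = u - l + 1
    if y == "Author":
        # Entity-level feature: the whole segment matches a database entry.
        return 3.0 if text in AUTHOR_DB else -1.0 * length
    if y == "Other":
        return 1.0 if text.lower() == "and" else -1.0 * length
    # y == "Title": small per-word reward, penalty for splitting a title.
    return 0.5 * length - (1.0 if prev_y == "Title" else 0.0)

def segment_viterbi(tokens, labels, score, max_len):
    """Return the highest-scoring list of (l, u, label) segments."""
    n = len(tokens)
    neg_inf = float("-inf")
    # best[u][y]: best score of segmenting tokens 1..u, last segment labelled y
    best = [dict.fromkeys(labels, neg_inf) for _ in range(n + 1)]
    back = [dict.fromkeys(labels) for _ in range(n + 1)]
    for u in range(1, n + 1):
        for y in labels:
            for l in range(max(1, u - max_len + 1), u + 1):
                # A segment starting at position 1 has no predecessor.
                prevs = [None] if l == 1 else labels
                for yp in prevs:
                    base = 0.0 if yp is None else best[l - 1][yp]
                    cand = base + score(tokens, l, u, y, yp)
                    if cand > best[u][y]:
                        best[u][y] = cand
                        back[u][y] = (l, yp)
    # Follow back-pointers from the best final label.
    y = max(labels, key=lambda lab: best[n][lab])
    u, segments = n, []
    while u > 0:
        l, yp = back[u][y]
        segments.append((l, u, y))
        u, y = l - 1, yp
    return segments[::-1]

tokens = ["R.", "Fagin", "and", "J.", "Halpern",
          "Belief", "Awareness", "Reasoning"]
segments = segment_viterbi(tokens, ["Author", "Other", "Title"],
                           toy_score, max_len=3)
print(segments)
```

With these made-up weights the decoder recovers the four segments listed on the slide. The point of the sketch is that `toy_score` sees the whole span l..u at once, so a database match on "R. Fagin" can be scored directly instead of being approximated word by word.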

Inference in segmentation models
Example input: R. Fagin and J. Halpern, Belief, awareness, reasoning, In AI
Two kinds of features
  Surface features (cheap)
  Database lookup features (expensive!)
Many large tables (Name, Authors), with entries such as:
  S Chakrabarti, Jay Shan, Jackie Chan, Bill Gates, Thorsten, J Kleinberg, J. Gherke, Claire Cardie, Jeffrey Ullman, Ron Fagin, J. Ullman, M Y Vardi
Inverted index
  Efficient search for top-k most similar entities
Large DBs: think of the DBLP and Citeseer databases
Challenges
  First: top-k tf-idf searches are highly optimized to exploit the tf scores of each word in the query segment
  Second: combine two top-k searches
Open questions
  Batch up lookups to do better than individual top-k searches?
  Find the top segmentation without top-k matches for all segments?
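As a rough illustration of the database lookup feature, the sketch below runs a top-k TF-IDF cosine search over a small inverted index. The helper names (`tokenize`, `build_index`, `top_k`), the smoothed IDF formula, and the six-name table are assumptions chosen for the example. A real system over DBLP-scale tables would rely on the optimized top-k search the slide refers to rather than scoring every candidate.

```python
import heapq
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, drop surrounding periods and commas."""
    return [t for t in (w.strip(".,").lower() for w in text.split()) if t]

def build_index(entities):
    """Inverted index: token -> set of ids of the entities containing it."""
    index = defaultdict(set)
    for eid, name in enumerate(entities):
        for tok in tokenize(name):
            index[tok].add(eid)
    return dict(index)

def tfidf(tokens, index, n):
    """TF-IDF vector over indexed tokens (ASSUMPTION: idf = log(1 + n/df))."""
    counts = Counter(t for t in tokens if t in index)
    return {t: c * math.log(1 + n / len(index[t])) for t, c in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def top_k(query, entities, index, k):
    """Return up to k (similarity, entity) pairs, best first."""
    n = len(entities)
    q = tfidf(tokenize(query), index, n)
    # Only entities sharing at least one token with the query can score > 0.
    candidates = set().union(*(index[t] for t in q))
    scored = ((cosine(q, tfidf(tokenize(entities[e]), index, n)), entities[e])
              for e in candidates)
    return heapq.nlargest(k, scored)

# Toy table built from a few of the names on the slide.
AUTHORS = ["Bill Gates", "Claire Cardie", "Jeffrey Ullman",
           "Ron Fagin", "J. Ullman", "M Y Vardi"]
INDEX = build_index(AUTHORS)
print(top_k("R. Fagin", AUTHORS, INDEX, k=2))
print(top_k("J. Ullman", AUTHORS, INDEX, k=2))
```

In a segmentation model this lookup would be issued for every candidate segment, which is why the slide asks whether lookups can be batched or skipped for segments that cannot appear in the top segmentation.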

Collective information extraction
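Example sentences
  Y has character.
  Mr. X lives in Y.
  X buys Y Times daily.
[Figure: label variables y11..y41, y12..y52 and y13..y53, one per token position in each of the three sentences]
Associative scores: wfe(i,i) > wfe(i,j), i.e. an edge scores higher when both of its ends take the same label i than when they take different labels i and j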
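The effect of associative scores can be sketched on the three mentions of the repeated word "Y" from the example sentences. The node scores, the edge weights `w_same` and `w_diff`, and the brute-force search are all assumptions for illustration. The talk's own inference algorithms (cardinality-based clique potentials, ICML 2007) are not reproduced here.

```python
from itertools import combinations, product

def collective_score(labels, node_scores, edges, w_same, w_diff):
    """Sum of per-mention scores plus one associative score per edge."""
    total = sum(node_scores[i][y] for i, y in enumerate(labels))
    for i, j in edges:
        total += w_same if labels[i] == labels[j] else w_diff
    return total

def best_labeling(label_set, node_scores, edges, w_same, w_diff):
    """Exhaustive search; only viable for a handful of mentions."""
    n = len(node_scores)
    return max(product(label_set, repeat=n),
               key=lambda ls: collective_score(ls, node_scores, edges,
                                               w_same, w_diff))

# ASSUMPTION: made-up local scores for the three mentions of "Y".
node_scores = [
    {"Entity": 0.4, "Other": 0.6},  # "Y has character."      (ambiguous)
    {"Entity": 2.0, "Other": 0.0},  # "Mr. X lives in Y."     (clear)
    {"Entity": 0.5, "Other": 0.7},  # "X buys Y Times daily." (ambiguous)
]
edges = list(combinations(range(3), 2))  # couple every pair of mentions
labels = ["Entity", "Other"]

independent = best_labeling(labels, node_scores, edges, w_same=0.0, w_diff=0.0)
collective = best_labeling(labels, node_scores, edges, w_same=0.5, w_diff=0.0)
print(independent)
print(collective)
```

With the coupling switched off, each mention keeps its locally best label. With `w_same > w_diff`, the one confident mention pulls the two ambiguous ones to the same label, which is the behaviour the associative condition on the slide is meant to encourage.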


Managing imprecision
Representing the imprecision of extraction as simple row and column uncertainty models for easy querying (VLDB 06)
Aggregate queries over data with uncertain duplicates (EDBT 08)
  Given a large set of entities with noisy duplicates, where finding duplicate groups is expensive, find:
    the K largest groups
    groups with count >= threshold
All papers available at:
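The two aggregate queries can be stated precisely with a naive baseline: compare every pair of records, merge duplicates with union-find, then rank the groups by size. The predicate `same_person` and the record list are hypothetical. This all-pairs approach is exactly the expensive computation the slide wants to avoid, so it shows what the queries return, not the EDBT 08 method.

```python
from itertools import combinations

def duplicate_groups(records, is_duplicate):
    """Group records by transitive closure of is_duplicate (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Naive O(n^2) pairwise comparison: the costly step.
    for i, j in combinations(range(len(records)), 2):
        if is_duplicate(records[i], records[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())

def k_largest(groups, k):
    """The K largest duplicate groups, biggest first."""
    return sorted(groups, key=len, reverse=True)[:k]

def at_least(groups, threshold):
    """All groups whose count is >= threshold."""
    return [g for g in groups if len(g) >= threshold]

def same_person(a, b):
    """ASSUMPTION: toy rule, same last name and same first initial."""
    ta, tb = a.lower().split(), b.lower().split()
    return ta[-1] == tb[-1] and ta[0][0] == tb[0][0]

records = ["J. Ullman", "Jeffrey Ullman", "J Ullman",
           "Ron Fagin", "R. Fagin", "M Y Vardi"]
groups = duplicate_groups(records, same_person)
print(k_largest(groups, 1))
print(at_least(groups, 2))
```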
