1 Accessing, Managing, and Mining Unstructured Data
Eugene Agichtein
2 The Web
20B+ pages of machine-readable text (some of it useful)
(Mostly) human-generated for human consumption
Both an "artificial" and a "natural" phenomenon
Still growing?
Local and global structure (links)
Headaches: dynamic vs. static content; people figured out how to make money
Positives: everything (almost) is on the web; people (eventually) can find info; people (on average) are not evil
3 Wait, there is more
Blogs, Wikipedia
Hidden web: 25+ million databases accessible via keyword search interfaces (e.g., MedLine, CancerLit, USPTO, …); 100x more data than the surface web
(Transcribed) speech
Genetic sequence annotations
Biological & medical literature
Medical records, reports, alerts, 911 calls
Classified
4 Outline
Unstructured data (text, web, …) is important (really!) and not so unstructured
Main tasks/requirements and challenges
Example problem: query optimization for text-centric tasks
Fundamental research problems/directions
5 Unstructured data = natural language text (for this talk)
An incredibly powerful and flexible means of communicating knowledge: papers, news, web pages, lecture notes, patient records, shopping lists…
Local structure: syntax (English syntax, HTML layout)
Semantics: implicit, ambiguous, subjective ("I saw a man with a chainsaw")
Need an incredibly powerful and flexible decoder
6 Some more structure
Explicit link structure: web, blogs, Wikipedia, citations
Implicit link structure: co-occurrence of entities within the same document/context implies a link between the entities; occurrence of the same entity in multiple documents implies a link between the documents
Physical location: a page primarily "about" Atlanta; a user somewhere around N. Decatur Rd; an e-mail sender two floors down
More on this later
7 Global Problem Space
Crawling (accessing) the data
Storing (multiple versions of) the data
"Understanding" the data → information
Indexing information
Integration from multiple sources
User-driven information retrieval
Exploiting unstructured data in applications
System-driven knowledge discovery
Building a nuclear/hydro/wind/… power plant
8 To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
Information extraction applications extract structured relations from unstructured text.
Example input: "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
An information extraction system (e.g., NYU's Proteus) produces the "Disease Outbreaks in The New York Times" table:
Date      | Disease Name    | Location
Jan. 1995 | Malaria         | Ethiopia
July 1995 | Mad Cow Disease | U.K.
Feb. 1995 | Pneumonia       | U.S.
May 1995  | Ebola           | Zaire
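To make "structured relations from unstructured text" concrete, here is a toy Python sketch: a single hand-written regular expression applied to the example sentence above. The pattern and the resulting tuple are illustrative only; a real extraction system such as Proteus or Snowball combines many learned or hand-crafted patterns with named-entity recognition.

```python
import re

# Toy illustration of information extraction: one hand-written pattern applied
# to the example sentence from the slide above.
text = ("May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, "
        "which is in the front line of the world's response to the deadly Ebola "
        "epidemic in Zaire, is finding itself hard pressed to cope with the crisis")

pattern = re.compile(
    r"^(?P<date>\w+ \d{1,2} \d{4}).*?"                          # dateline, e.g. "May 19 1995"
    r"deadly (?P<disease>\w+) epidemic in (?P<location>\w+)")    # "... deadly Ebola epidemic in Zaire"

m = pattern.search(text)
if m:
    print((m.group("date"), m.group("disease"), m.group("location")))
    # -> ('May 19 1995', 'Ebola', 'Zaire')
```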
9 An Abstract View of Text-Centric Tasks
1. Retrieve documents from the text database
2. Process the documents with the extraction system
3. Extract output tokens
Task                   | Token
Information Extraction | Relation tuple
Database Selection     | Word (+ frequency)
Focused Crawling       | Web page about a topic
(This abstraction is used for the rest of the talk.)
10 Executing a Text-Centric Task
Similar to the relational world, there are two major execution paradigms (both are sketched in code below):
Scan-based: retrieve and process documents sequentially
Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
Unlike the relational world:
Indexes are only "approximate": the index is on keywords, not on the tokens of interest
The choice of execution plan affects output completeness (not only speed) → the underlying data distribution dictates what is best
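A minimal sketch of the two paradigms over a toy in-memory corpus. The extractor and the index are trivial stand-ins, and all function names here are illustrative rather than the paper's API; in practice recall is estimated from a cost model rather than measured against a known token set as done here.

```python
# Toy sketch of the two execution paradigms for a text-centric task.

def extract_tokens(doc):
    """Placeholder extraction system: pretend capitalized words are the tokens of interest."""
    return {w.strip(".,") for w in doc.split() if w.istitle()}

def scan_plan(docs, known_tokens, target_recall):
    """Scan-based: retrieve and process documents sequentially until the target recall is reached."""
    found = set()
    for doc in docs:
        found |= extract_tokens(doc) & known_tokens
        if len(found) / len(known_tokens) >= target_recall:
            break
    return found

def index_plan(docs, inverted_index, queries):
    """Index-based: issue keyword queries, then retrieve and process only the results.
    The index is on keywords, not on tokens of interest, so the results are approximate."""
    found = set()
    for q in queries:
        for doc_id in inverted_index.get(q, set()):
            found |= extract_tokens(docs[doc_id])
    return found
```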
11 Execution Plan Characteristics
Execution plans have two main characteristics:
Execution time
Recall (fraction of tokens retrieved)
Question: how do we choose the fastest execution plan for reaching a target recall?
"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"
12 Outline
Description and analysis of crawl- and query-based plans:
Crawl-based: Scan, Filtered Scan
Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
13 Scan
Scan retrieves and processes documents sequentially (until reaching the target recall).
Execution time = |Retrieved Docs| · (R + P), where R is the time for retrieving a document and P is the time for processing a document.
Question: how many documents does Scan retrieve to reach the target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in the paper).
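The cost expressions translate directly into code. The Scan formula is the one on the slide; the Filtered Scan variant below is an assumed form (a per-document classification cost plus processing only the fraction judged useful), with the exact model in the paper.

```python
def scan_time(num_docs, R, P):
    """Scan: every retrieved document is also processed.
    R = time to retrieve a document, P = time to process a document."""
    return num_docs * (R + P)

def filtered_scan_time(num_docs, R, P, C, useful_fraction):
    """Filtered Scan (assumed form): each retrieved document is classified at
    cost C, and only the fraction judged promising is fully processed."""
    return num_docs * (R + C + useful_fraction * P)
```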
14 Estimating Recall of Scan
Modeling Scan for token t: what is the probability of seeing t (which occurs in g(t) documents) after retrieving S documents?
This is a "sampling without replacement" process: after retrieving S documents, the frequency of token t follows a hypergeometric distribution.
Recall for token t is the probability that the frequency of t in the S retrieved documents is greater than 0.
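This probability has a simple closed form: after drawing S of the D documents without replacement, a token that occurs in g(t) of them is missed with probability C(D − g(t), S) / C(D, S). A small sketch using only the standard library; the example numbers are made up.

```python
from math import comb

def token_recall(D, g_t, S):
    """P(token t is seen at least once after retrieving S of the D documents),
    where t occurs in g_t distinct documents (hypergeometric, k = 0 case).
    math.comb returns 0 when S > D - g_t, i.e. the token is then seen for sure."""
    return 1.0 - comb(D - g_t, S) / comb(D, S)

# Example (made-up numbers): a token appearing in 10 of 100,000 documents
# is seen with probability ~0.89 after scanning 20,000 of them.
print(token_recall(100_000, 10, 20_000))
```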
15 Estimating Recall of Scan
Modeling Scan overall: multiple "sampling without replacement" processes, one for each token.
Overall recall is the average recall across tokens → we can compute the number of documents required to reach the target recall, and hence Execution time = |Retrieved Docs| · (R + P).
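Continuing the sketch above (same closed form, standard library only): average the per-token recall over all token frequencies, then search for the smallest number of documents S that reaches the target. Recall is monotone in S, so a binary search is valid. The function names are mine, not the paper's.

```python
from math import comb

def scan_recall(token_freqs, D, S):
    """Expected recall of Scan after retrieving S of D documents: the average,
    over all tokens, of the probability of seeing each token at least once."""
    return sum(1.0 - comb(D - g, S) / comb(D, S) for g in token_freqs) / len(token_freqs)

def docs_for_target_recall(token_freqs, D, target):
    """Smallest S whose expected recall reaches the target (recall grows with S)."""
    lo, hi = 0, D
    while lo < hi:
        mid = (lo + hi) // 2
        if scan_recall(token_freqs, D, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Scan's estimated execution time is then docs_for_target_recall(...) * (R + P).
```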
16 Outline
Description and analysis of crawl- and query-based plans:
Crawl-based: Scan, Filtered Scan
Query-based: Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
17 Iterative Set Expansion
1. Query the database with seed tokens (e.g., [Ebola AND Zaire])
2. Process the retrieved documents
3. Extract tokens from the documents
4. Augment the seed tokens with the new tokens, and repeat
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where R is the time for retrieving a document, P the time for processing a document, and Q the time for answering a query.
Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
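A compact sketch of the Iterative Set Expansion loop, with `search(token)` and `extract(doc)` as placeholders for the keyword index and the extraction system; these names are illustrative, not the paper's interfaces.

```python
from collections import deque

def iterative_set_expansion(seed_tokens, search, extract, target_count):
    """Sketch of Iterative Set Expansion: query with known tokens, process the
    returned documents, and use newly extracted tokens as further queries."""
    found = set(seed_tokens)
    queue = deque(seed_tokens)
    seen_docs = set()
    while queue and len(found) < target_count:
        token = queue.popleft()
        for doc_id, doc in search(token):           # 1. query the database
            if doc_id in seen_docs:
                continue
            seen_docs.add(doc_id)
            for new_token in extract(doc):           # 2./3. process and extract
                if new_token not in found:
                    found.add(new_token)
                    queue.append(new_token)          # 4. augment the seed tokens
    return found
```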
18 Querying Graph
The querying graph is a bipartite graph containing tokens and documents.
Each token (transformed into a keyword query) retrieves documents; documents contain tokens.
(Figure: bipartite graph with tokens t1..t5 on one side and documents d1..d5 on the other.)
19 Using the Querying Graph for Analysis
We need to compute:
The number of documents retrieved after sending Q tokens as queries (estimates time)
The number of tokens that appear in the retrieved documents (estimates recall)
To estimate these we need to compute:
The degree distribution of the tokens discovered by retrieving documents
The degree distribution of the documents retrieved by the tokens
(These are not the same as the degree distribution of a randomly chosen token or document: it is easier to discover documents and tokens with high degrees.)
An elegant analysis framework based on generating functions handles this (details in the paper).
20 Recall Limit: Reachability Graph
The reachability graph has an edge t1 → t2 whenever token t1 retrieves a document d1 that contains token t2.
Upper recall limit: determined by the size of the biggest connected component.
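As a sketch of why this limit exists, the reachability graph can be built from the querying graph and the ceiling computed directly. This assumes networkx and simple dict representations of the data; the slide states the bound via the biggest connected component, while the directed version below reports the fraction of tokens reachable from the seed tokens, which is what actually caps Iterative Set Expansion and is bounded by the component(s) the seeds fall into.

```python
import networkx as nx

def reachability_graph(doc_tokens, index):
    """Token-level reachability graph: an edge t1 -> t2 means that querying with
    t1 retrieves some document that contains t2.
    doc_tokens: dict doc_id -> set of tokens in that document
    index:      dict token  -> set of doc_ids retrieved by that token as a query"""
    g = nx.DiGraph()
    g.add_nodes_from({t for toks in doc_tokens.values() for t in toks})
    for t1, doc_ids in index.items():
        for d in doc_ids:
            for t2 in doc_tokens.get(d, ()):
                if t1 != t2:
                    g.add_edge(t1, t2)
    return g

def recall_ceiling(g, seeds):
    """Fraction of tokens reachable from the seed tokens: an upper bound on the
    recall of Iterative Set Expansion, no matter how long it runs."""
    reachable = set(seeds)
    for s in seeds:
        if s in g:
            reachable |= nx.descendants(g, s)
    return len(reachable) / g.number_of_nodes()
```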
21 Automatic Query Generation
Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation.
Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents containing tokens.
(Details in the papers.)
22 Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
23 Summary of Cost Analysis
Our analysis so far takes as input a target recall and gives as output the time for each plan to reach that recall (time = infinity if the plan cannot reach the target recall).
Time and recall depend on task-specific properties of the database: the token degree distribution and the document degree distribution.
Next, we show how to estimate the degree distributions on the fly.
24 Estimating Cost Model Parameters
Token and document degree distributions belong to known distribution families, so each distribution can be characterized with only a few parameters!
Information Extraction: power-law
Content Summary Construction: lognormal (document degrees), power-law / Zipf (token degrees)
Focused Resource Discovery: uniform
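For example, a power-law (Zipfian) degree distribution is characterized by a single exponent, which can be estimated directly from observed degrees. A rough sketch using the standard continuous approximation for discrete data (Clauset, Shalizi & Newman 2009); the choice of x_min and the estimator are illustrative, not the paper's procedure.

```python
import math

def powerlaw_exponent(degrees, x_min=1):
    """Approximate MLE for the exponent alpha of p(x) ~ x^(-alpha), x >= x_min,
    using the continuous approximation for discrete data."""
    xs = [d for d in degrees if d >= x_min]
    return 1.0 + len(xs) / sum(math.log(x / (x_min - 0.5)) for x in xs)

# Example with made-up token degrees:
print(powerlaw_exponent([1, 1, 1, 2, 1, 3, 1, 2, 5, 1, 1, 9, 2, 1, 1]))
```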
25 Parameter Estimation
Naive solution: start with a separate "parameter-estimation" phase, perform random sampling on the database, and stop when cross-validation indicates high confidence.
We can do better than this! There is no need for a separate sampling phase, since sampling is equivalent to executing the task:
→ Piggyback parameter estimation onto execution
26 On-the-fly Parameter Estimation
Pick the most promising execution plan for the target recall, assuming "default" parameter values
Start executing the task
Update the parameter estimates during execution
Switch plans if the updated statistics indicate so (see the sketch below)
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper).
(Figure: the correct but unknown distribution vs. the initial default estimate and the updated estimate.)
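A high-level sketch of the optimizer loop described above. The plan objects, `estimate_cost`, and the update step are placeholders for the paper's analytic models and estimators, not a real API.

```python
def run_with_optimizer(plans, estimate_cost, default_params, target_recall, budget_step):
    """Sketch of the on-the-fly optimizer: pick the plan that looks fastest under
    the current parameter estimates, execute it for a while, refine the estimates
    from the documents seen so far, and switch plans if another now looks faster."""
    params = dict(default_params)
    current = min(plans, key=lambda p: estimate_cost(p, params, target_recall))
    recall = 0.0
    while recall < target_recall:
        recall, observations = current.execute(budget_step)        # run a chunk of the task
        params = current.update_estimates(params, observations)    # refine degree-distribution estimates
        best = min(plans, key=lambda p: estimate_cost(p, params, target_recall))
        if best is not current:
            current = best                                          # switch execution plan
    return recall
```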
27 Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
28 Correctness of Theoretical Analysis
Solid lines: actual time; dotted lines: predicted time with correct parameters.
Task: Disease Outbreaks, using the Snowball IE system over 182,531 documents from The New York Times, yielding 16,921 tokens.
29 Experimental Results (Information Extraction)
Solid lines: actual time; green line: time with the optimizer (results are similar in other experiments; see paper).
30 Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting the execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
An optimization framework that picks, on the fly, the fastest plan for the target recall
31 Global Problem Space
Crawling (accessing) the data
"Understanding" the data → information
Indexing information
Integration from multiple sources
User-driven information retrieval
Exploiting unstructured data in applications
System-driven knowledge discovery
32 Some Research Directions
Modeling explicit and implicit network structures: evolution of explicit structure on the web, blogspace, and Wikipedia; implicit link structures in text, collections, and the web; exploiting implicit and explicit social networks (e.g., for epidemiology)
Knowledge discovery from biological and medical data: automatic sequence annotation for bioinformatics and genetics; actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing: integrating information in structured and unstructured sources; robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources: detecting reliable signals from (noisy) text data (e.g., medical surveillance); accuracy (≠ authority) of online sources
Information diffusion/propagation in online sources: information propagation on the web; in collaborative sources (Wikipedia, MedLine)
33 Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005] “popular pages tend to get even more popular, while unpopular pages get ignored by an average user”
34 Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]
35 Modeling Social Networks for Epidemiology, Security, …
Email exchange mapped onto cubicle locations.
36 Some Research Directions
Modeling explicit and implicit network structures: evolution of explicit structure on the web, blogspace, and Wikipedia; implicit link structures in text, collections, and the web; exploiting implicit and explicit social networks (e.g., for epidemiology)
Knowledge discovery from biological and medical data: automatic sequence annotation for bioinformatics and genetics; actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing: integrating information in structured and unstructured sources; query processing over unstructured text; robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources: detecting reliable signals from (noisy) text data (e.g., medical surveillance)
Information diffusion/propagation in online sources: information propagation on the web; in collaborative sources (Wikipedia, MedLine)
37 Applying Text Mining for Bioinformatics
100,000+ gene and protein synonyms extracted from 50,000+ journal articles
Approximately 40% of confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT) [ISMB 2003]
Examples: "APO-1, also known as DR6…", "MEK4, also called SEK1…"
38 Examples of Entity-Relationship Extraction
"We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex."
(Figure: extracted relation graph over CBF-A, CBF-B, CBF-C, and the CBF-A-CBF-C complex, with "interact", "complex", and "associates" edges.)
39 Another Example
Z-100 is an arabinomannan extracted from Mycobacterium tuberculosis that has various immunomodulatory activities, such as the induction of interleukin 12, interferon gamma (IFN-gamma) and beta-chemokines. The effects of Z-100 on human immunodeficiency virus type 1 (HIV-1) replication in human monocyte-derived macrophages (MDMs) are investigated in this paper. In MDMs, Z-100 markedly suppressed the replication of not only macrophage-tropic (M-tropic) HIV-1 strain (HIV-1JR-CSF), but also HIV-1 pseudotypes that possessed amphotropic Moloney murine leukemia virus or vesicular stomatitis virus G envelopes. Z-100 was found to inhibit HIV-1 expression, even when added 24 h after infection. In addition, it substantially inhibited the expression of the pNL43lucDeltaenv vector (in which the env gene is defective and the nef gene is replaced with the firefly luciferase gene) when this vector was transfected directly into MDMs. These findings suggest that Z-100 inhibits virus replication, mainly at HIV-1 transcription. However, Z-100 also downregulated expression of the cell surface receptors CD4 and CCR5 in MDMs, suggesting some inhibitory effect on HIV-1 entry. Further experiments revealed that Z-100 induced IFN-beta production in these cells, resulting in induction of the 16-kDa CCAAT/enhancer binding protein (C/EBP) beta transcription factor that represses HIV-1 long terminal repeat transcription. These effects were alleviated by SB 203580, a specific inhibitor of p38 mitogen-activated protein kinases (MAPK), indicating that the p38 MAPK signalling pathway was involved in Z-100-induced repression of HIV-1 replication in MDMs. These findings suggest that Z-100 might be a useful immunomodulator for control of HIV-1 infection.
40 AliBaba (Ulf Leser), http://wbi.informatik.hu-berlin.de:8080/
(Screenshot: a PubMed query visualized, with extracted information and links to databases.)
41 Mining Text and Sequence Data [Agichtein & Eskin, PSB 2004]
(Figure: ROC-50 scores for each class and method.)
42 Some Research Directions
Modeling explicit and implicit network structures: evolution of explicit structure on the web, blogspace, and Wikipedia; implicit link structures in text, collections, and the web; exploiting implicit and explicit social networks (e.g., for epidemiology)
Knowledge discovery from biological and medical data: automatic sequence annotation for bioinformatics and genetics; actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing: integrating information in structured and unstructured sources; robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources: detecting reliable signals from (noisy) text data (e.g., medical surveillance); accuracy (≠ authority) of online sources
Information diffusion/propagation in online sources: information propagation on the web; in collaborative sources (Wikipedia, MedLine)
43 Structure and Evolution of Blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004; KDD 2006]
(Figure: fraction of nodes in components of various sizes within the Flickr and Yahoo! 360 timegraph, by week.)
44 Connected Components Visualization: Disease Outbreaks, The New York Times, 1995
Structure of implicit entity-entity networks in text [Agichtein & Gravano, ICDE 2003]
45 Some Research Directions
Modeling explicit and implicit network structures: evolution of explicit structure on the web, blogspace, and Wikipedia; implicit link structures in text, collections, and the web; exploiting implicit and explicit social networks (e.g., for epidemiology)
Knowledge discovery from biological and medical data: automatic sequence annotation for bioinformatics and genetics; actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing: integrating information in structured and unstructured sources; robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources: detecting reliable signals from (noisy) text data (e.g., medical surveillance); accuracy (≠ authority) of online sources
Information diffusion/propagation in online sources: information propagation on the web and in news; in collaborative sources (Wikipedia, MedLine)
46 Thank You
Details: http://www.mathcs.emory.edu/~eugene/