Published by Coleen Webb. Modified over 9 years ago.
1
Intelligent Access to Text: Integrating Information Extraction Technology into Text Browsers
Robert Gaizauskas (1), Patrick Herring (1), Michael Oakes (1), Micheline Beaulieu (2), Peter Willett (2), Helene Fowkes (2), and Anna Jonsson (2)
(1) Department of Computer Science, (2) Department of Information Studies, University of Sheffield
2
March, 2001 HLT01, San Diego
Outline of Talk
  Is Information Extraction Technology Useful?
  Barriers to Deployment
  Information Seeking in Large Enterprises
  The TRESTLE System
    System Overview
    NEAT: Named Entity Access to Text
    SCAT: Scenario Access to Text
  Preliminary User Evaluation
    Evaluation Methodology
    Access Strategies
    User Perceptions
  Conclusions and Discussion
3
Is Information Extraction Technology Useful?
  Information Extraction (IE) technology has led to impressive new abilities to extract structured information from texts:
    Named entity recognition
    Template Element/Relation filling
    Scenario Template filling
  IE complements traditional Information Retrieval (IR) capabilities
  However, unlike IR, IE has not found its way into widely used end-user systems, such as:
    Web search engines
    Document indexing systems
  Why not?
4
Barriers to Deployment
  Porting cost
    Moving to new domains requires considerable time + expertise
      to create/modify domain-specific resources + rule bases
      to annotate texts for supervised machine learning approaches
  Sensitivity to inaccuracies in extracted data
    MUC-7 results: F-measure scores 50-92% depending on task
    Thus, IE is only appropriate for applications where some error is tolerable/readily detectable by end users
    Note: formal IR evaluation results are comparable, but application contexts make error less significant
  Complexity of integration into end-user systems
    IE systems' outputs must be incorporated into larger application systems if end users are to benefit from them
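For readers unfamiliar with the F-measure quoted above, the following is a minimal sketch of how such a score is computed from precision and recall. It uses the standard weighted harmonic mean; the function name and the example scores are illustrative assumptions and are not taken from the MUC-7 results.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta = 1.0 weights precision and recall equally.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was extracted correctly
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


# Hypothetical scores, chosen only to illustrate the arithmetic.
example = f_measure(precision=0.90, recall=0.80)
print(f"F = {example:.3f}")
```

Because the score is a harmonic mean, a system with high precision but low recall (or the reverse) is pulled towards the lower of the two values.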
5
IE and Information Seeking in Large Enterprises
  To investigate the utility of IE in a real setting, we have developed an advanced text access facility to support information workers at GlaxoSmithKline
  TRESTLE: Text Retrieval Extraction and Summarisation Technology for Large Enterprises
  Aim: increase effectiveness of employees in the "industry watch" function, i.e. current awareness/tracking of
    People
    Companies
    Products, particularly progress of new drugs through the clinical trial/regulatory approval process
  Approach: provide enhanced access to Scrip, the largest-circulation pharmaceutical industry newsletter
6
IE and Information Seeking in Large Enterprises
  User requirements study at GSK (questionnaire, observation, interviews) revealed 2 key types of information seeking:
  1. Current awareness
    general updating (what's happened in the industry today/this week)
    entity or event-based tracking (e.g. what's happened concerning a specific drug, or what regulatory decisions have been made)
  2. Retrospective search
    historical tracking of entities or events of interest (e.g. where has a specific person been reported before, what is the clinical trial history of a particular drug)
    search for a specific event or a remembered context in which a specific entity played a role
  Note: both activities require identification of entities/events in the news, which is what IE systems do
7
TRESTLE System Overview
  The system consists of two components
  Off-line component
    LaSIE IE system
      Input: Scrip texts delivered daily via the Internet
      Output: IE results
        Named entities: MUC-7 categories + drugs + diseases
        Scenario templates: Person Tracking; Clinical Trials; Regulatory Announcements
    Summary Writer
      Input: Scenario templates
      Output: Single-sentence NL summaries of the templates
    Entity/Scenario Indexer
      Input: NE annotated texts; Scenario templates
      Output: Indices keyed by NE + date with pointers to source texts
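The Entity/Scenario Indexer's output (indices keyed by NE + date with pointers to source texts) can be pictured with a small sketch. This is an assumed data layout for illustration, not the TRESTLE implementation; the `Mention` record, the function names, the document ids and the company name are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Mention:
    """One named-entity occurrence found in a source text (hypothetical record)."""
    entity: str      # surface string, e.g. a person or drug name
    category: str    # e.g. "person", "company", "drug"
    doc_id: str      # pointer back to the source text
    published: date  # publication date of the source text


def build_index(mentions):
    """Map (lower-cased entity name, date) to the set of source-text ids."""
    index = defaultdict(set)
    for m in mentions:
        index[(m.entity.lower(), m.published)].add(m.doc_id)
    return index


def lookup(index, entity, start, end):
    """Return sorted ids of texts mentioning `entity` from start to end inclusive."""
    key = entity.lower()
    hits = set()
    for (name, day), doc_ids in index.items():
        if name == key and start <= day <= end:
            hits |= doc_ids
    return sorted(hits)


# Invented sample data, for illustration only.
mentions = [
    Mention("Garcia", "person", "doc-001", date(2001, 3, 1)),
    Mention("Garcia", "person", "doc-007", date(2001, 3, 15)),
    Mention("Acme Pharma", "company", "doc-007", date(2001, 3, 15)),
]
index = build_index(mentions)
recent = lookup(index, "garcia", date(2001, 3, 10), date(2001, 3, 31))
print(recent)
```

Keying on both entity and date lets one structure serve both a narrow date range (current awareness) and the full archive (retrospective search). The linear scan in `lookup` keeps the sketch short; a real index would more likely group dates under each entity.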
8
TRESTLE System Overview (cont)
  On-line component
    Browser scripts
      Input: User requests for information
      Output: Results to requests returned from annotated Scrip DB
    Entity/Scenario Index Search + Dynamic Page Generator
      Input: User information requests forwarded from Web server + entity/scenario indices + NE annotated texts/summaries
      Output: Relevant HTML pages with dynamically generated link information
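As a rough illustration of dynamically generated link information, the sketch below wraps each annotated entity span in an HTML hyperlink pointing back to an index lookup. The URL scheme (`/neat?cat=...&name=...`), the CSS class names, the span format and the company name are assumptions made for the example; the slides do not specify how TRESTLE builds its pages.

```python
from html import escape
from urllib.parse import quote


def link_entities(text, spans, base_url="/neat"):
    """Return HTML for `text` with each (start, end, category) span hyperlinked.

    Spans are assumed to be non-overlapping character offsets into `text`.
    """
    parts = []
    cursor = 0
    for start, end, category in sorted(spans):
        parts.append(escape(text[cursor:start]))  # plain text before the entity
        name = text[start:end]
        href = f"{base_url}?cat={quote(category)}&name={quote(name)}"
        parts.append(
            f'<a class="ne-{escape(category)}" href="{escape(href)}">{escape(name)}</a>'
        )
        cursor = end
    parts.append(escape(text[cursor:]))  # trailing plain text
    return "".join(parts)


# Invented sentence and annotations, for illustration only.
sentence = "Mr Garcia joins Acme Pharma."
person_start = sentence.index("Garcia")
company_start = sentence.index("Acme Pharma")
spans = [
    (person_start, person_start + len("Garcia"), "person"),
    (company_start, company_start + len("Acme Pharma"), "company"),
]
page_fragment = link_entities(sentence, spans)
print(page_fragment)
```

A per-category class name would be one way to drive the colour coding of named entities mentioned later in the talk.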
9
TRESTLE System Architecture
  [Figure: architecture diagram. Labelled components: User, Web Browser, Internet, Web Server, Index Search + Dynamic Page Creator, Scrip, LaSIE System, Summary Writer, Indexer (Off-Line System). Labelled data: NE Tagged Texts, Scenario Templates, Scenario Summaries, Entity/Scenario Indices. Labelled activity: Info Seeking.]
10
TRESTLE Interface Overview
  TRESTLE browser-based interface allows 4 routes to access texts:
    by headline
    by named entity (NEAT: Named Entity Access to Text)
    by scenario summary (SCAT: Scenario Access to Text)
    by free text search
  For the first 3 routes, the date range of accessed articles may be set to:
    current day
    previous day
    last week
    last four weeks
    full archive
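The date range options listed above could be turned into concrete date intervals along the following lines. This is a sketch under stated assumptions: the slides do not say whether "last week" means a calendar week or the trailing seven days, so the trailing-window reading used here, like the function name, is a guess.

```python
from datetime import date, timedelta


def date_window(choice: str, today: date):
    """Translate a date range option into an inclusive (start, end) pair."""
    if choice == "current day":
        return today, today
    if choice == "previous day":
        yesterday = today - timedelta(days=1)
        return yesterday, yesterday
    if choice == "last week":
        # Assumption: the trailing 7 days, including today.
        return today - timedelta(days=6), today
    if choice == "last four weeks":
        # Assumption: the trailing 28 days, including today.
        return today - timedelta(days=27), today
    if choice == "full archive":
        return date.min, today
    raise ValueError(f"unknown date range option: {choice!r}")


# Example date chosen arbitrarily for illustration.
start, end = date_window("last four weeks", date(2001, 3, 20))
print(start, end)
```

The resulting pair can be passed directly to any index lookup that filters by an inclusive date interval.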
11
TRESTLE Interface: Underlying Design
  Head Frame
    User state
    Date range selection
  Access Frame
    Choose access mode: NE/Scenario/free text search
  Index Frame
    Headline list, or NE + headline list, or summary list
  Text Frame
    Full text of source text
    Embedded NE hyperlinks
12
NEAT: Named Entity Access to Text
  RUN
13
SCAT: Scenario Access To Text
  RUN
14
Preliminary User Evaluation: Methodology
  Prelude to full end-user study: preliminary study with 8 Information Studies postgraduate students
  Aim: to gain insight into
    ease of use and learnability of the system
    preferred strategies for accessing text
    problems in interpreting the interface
  Instruments: usability questionnaire, verbal protocols, observational notes
  Procedure:
    brief verbal introduction to evaluation and system
    undirected exploration of system, asking questions/providing comments
    simulated tasks of a real end user, e.g.
      "You've heard that one of your colleagues, Mr Garcia, has recently accepted an appointment at another pharmaceutical company. You want to find out which company he will be moving to and what post he has taken up."
15
Preliminary User Evaluation: Access Strategies
  NEAT: access to named entities was made available in three ways:
    1. by clicking directly on a list of NE categories in the access frame
    2. through the NE index look-up query box in the access frame
    3. through highlighted entries in a full article displayed in the text frame
  Observation: users preferred 2 over 1 or 3, regardless of task
    perhaps because users knew what they were looking for
    perhaps because a query box is more familiar than browsing NEs
    perhaps because of the prominence of the NE look-up box in the interface
  SCAT:
  Observation: for tasks where SCAT was appropriate, users opted for NE index look-up
    perhaps because of the novelty of scenario tracking
    perhaps because SCAT functionality was not clear from the interface
16
Preliminary User Evaluation: User Perceptions
  Colour coding + hyper-linking of NEs
    Highly noticeable; some objections to colour choice
    Disagreement about utility: distracting when reading full texts, but highly useful in leading to related previous Scrip
    Integration of current awareness + retrospective searching via NEs highly appreciated
  NE index look-up
    Found very useful by all but one participant
    Some confusion over scope: differences wrt free-text search/only 5 searchable NE categories
    Exact string matching limiting (limitation now removed)
  Scenario Tracking
    Function misunderstood from labelling in access frame
    Confusion between SCAT summaries and headlines
    Flag icons for summaries in headline lists not well understood
17
Conclusions (I)
  To date, IE has largely been a "technology push" activity
  For IE technology to become usable and influenced by end user requirements ("user pull"), end user prototypes must be built which:
    exploit the significant achievements of the technology to date
    acknowledge its limitations
  TRESTLE attempts to do this by exploiting NE and scenario template IE technology to offer users novel ways to access textual information via a familiar text browsing interface
18
Conclusions (II)
  Preliminary user evaluation has revealed:
    search options initially selected from the access frame were not always optimal for set tasks
    on the whole, colour-coded textual/iconic cues in headline index + full text enabled users to exploit the different functions seamlessly
    interface supported interaction at the procedural level, but there was some misunderstanding at the conceptual level, especially of scenario access
      other studies report similar issues in introducing more complex interactive search functions
      further investigation + modifications (e.g. to labelling) underway
  Full evaluation in a real end user environment is now being organised
    To answer the question: can professional information workers use IE-based searching and awareness approaches effectively?
19
The End