A Portal for Access to Complex Distributed Information about Energy Jose Luis Ambite, Yigal Arens, Eduard H. Hovy, Andrew Philpot DGRC Information Sciences.

Presentation transcript:

A Portal for Access to Complex Distributed Information about Energy
Jose Luis Ambite, Yigal Arens, Eduard H. Hovy, Andrew Philpot
DGRC, Information Sciences Institute, University of Southern California
Walter Bourne, Peter T. Davis, Steven Feiner, Judith L. Klavans, Samuel Popper, Ken Ross, Ju-Ling Shih, Peter Sommer, Surabhan Temiyabutr, Laura Zadoff
DGRC, Columbia University

The Vision: Ask the Government...
We’re thinking of moving to Denver...
What are the schools like there?
How have property values in the area changed over the past decade?
How many people had breast cancer in the area over the past 30 years?
Is there an orchestra? An art gallery? How far are the nightclubs?
Diagram labels: Census, Labor Stats, Query results

The Energy Data Collection project
EDC research team: Information Sciences Institute, USC; Dept of CS, Columbia University
Government partners: Energy Information Admin. (EIA); Bureau of Labor Statistics (BLS); Census Bureau
Research challenge: Make accessible in a standardized way the contents of thousands of data sets, represented in many different ways (webpages, PDF, MS Access, Excel, text…)

Data Integration
Heterogeneous Data Sources: Trade, EPA, Census, EIA, Labor
Components: User Interface; Information Access; Data Access and Query Processing; Metadata and Terminology Management; User Evaluation; Interface Design and Task-based Evaluation; Concept Ontology; Terminology Sources

Data access using SIMS (Ambite et al., ISI)
‘Hide’ from user details of data sources:
1. ‘Wrap’ each source in software that handles access to its data
2. Record the types of info in each source in a ‘Source Model’
3. Arrange all source models together in the same space: the Domain Model
SIMS data access planner transforms user’s request into individual access queries
SIMS extracts the right data from the appropriate sources
Current databases and models:
Sources:
– Databases: 58,000+ series (EIA OGIRS and others)
– Webpages: 60+ (BLS, CEC tables)
Models:
– SENSUS ontology: 90,000 nodes (from ISI’s NLP technology)
– Domain model: 500 nodes (manual; for database access planner)
– LKB: 6000 nodes (NL term/info extraction from glossaries)
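The wrap / source model / planner idea on this slide can be illustrated with a minimal Python sketch. It is not SIMS code: the `Source` class, the `plan` and `answer` functions, the concept names, and the placeholder values are all hypothetical, chosen only to show how a request phrased in domain-model terms could be routed to wrapped sources.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass
class Source:
    """A wrapped data source plus a tiny 'source model' (hypothetical structure)."""
    name: str
    concepts: Set[str]                   # domain-model concepts this source covers
    fetch: Callable[[str], List[Row]]    # the wrapper: hides how data is accessed


def plan(request_concepts, sources):
    """Turn a request into individual access steps, one (concept, source) pair each."""
    steps = []
    for concept in request_concepts:
        for src in sources:
            if concept in src.concepts:
                steps.append((concept, src))
                break
        else:
            raise LookupError(f"no source covers {concept!r}")
    return steps


def answer(request_concepts, sources):
    """Execute the plan and return unified rows tagged with their origin."""
    rows = []
    for concept, src in plan(request_concepts, sources):
        for row in src.fetch(concept):
            rows.append({"concept": concept, "source": src.name, **row})
    return rows


# Toy wrappers returning placeholder values (not real EIA or BLS figures).
eia = Source("EIA", {"gasoline_price"}, lambda c: [{"year": 2000, "value": 1.0}])
bls = Source("BLS", {"cpi"}, lambda c: [{"year": 2000, "value": 2.0}])

result = answer(["cpi", "gasoline_price"], [eia, bls])
```

The user only names concepts; which source answers each one is decided by the planner, which is the sense in which source details are hidden.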

Data access using in-memory query processing (Ross et al., Columbia)
How can you provide fast access to millions of data values?
Cache data that doesn’t change much in data warehouse
Create rich multidimensional index structures; keep in memory
Adapt index depending on user’s patterns of use
Technical details: Same engine for many data sets; Client/server parallel; Branch Misprediction; SIMD; Asynchronous work
Use: Real-time interactive data exploration: ‘fly’ over the data
Diagram labels: Mediator, Data Request, Unified Results, User, Web..., Graphical User Interface, Dynamic Query, Data Files (e.g., PUMS), Dynamic Query Engine
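As a rough illustration of keeping pre-aggregated data in memory for interactive queries, the Python sketch below stores one running total per combination of dimension values and answers filtered sums from that table. The `InMemoryCube` class, the dimension names, and the records are invented for this example. It scans the cells linearly, so it only stands in for the richer multidimensional index structures the slide mentions.

```python
from collections import defaultdict


class InMemoryCube:
    """Pre-aggregated totals keyed by a tuple of dimension values (toy example)."""

    def __init__(self, dimensions):
        self.dimensions = tuple(dimensions)
        self.cells = defaultdict(float)

    def load(self, records, measure):
        """Aggregate raw records once, so later queries never touch them."""
        for rec in records:
            key = tuple(rec[d] for d in self.dimensions)
            self.cells[key] += rec[measure]

    def total(self, **filters):
        """Sum the measure over all cells matching the given dimension values."""
        positions = {self.dimensions.index(d): v for d, v in filters.items()}
        return sum(
            value
            for key, value in self.cells.items()
            if all(key[i] == v for i, v in positions.items())
        )


# Placeholder records, not real survey or energy data.
records = [
    {"state": "CA", "year": 1999, "fuel": "gas", "amount": 10.0},
    {"state": "CA", "year": 2000, "fuel": "gas", "amount": 12.0},
    {"state": "NY", "year": 2000, "fuel": "oil", "amount": 7.0},
]
cube = InMemoryCube(["state", "year", "fuel"])
cube.load(records, "amount")
```

A call such as `cube.total(state="CA")` touches only the aggregated cells, which is what makes repeated slider-style queries cheap once the data is loaded.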

The Heart of EDC (Hovy et al., ISI)
Diagram labels: Large ontology (SENSUS); Domain-specific ontologies (SIMS models); Data sources; Concepts from glossaries (by GlossIT); Logical mapping; Linguistic Mapping (semi-automated)

SENSUS and DINO browser (Knight et al., ISI)
Taxonomy, multiple superclass links
Approx. 90,000 concepts
Top level: Penman Upper Model (ISI)
Body: WordNet 1.6 (Princeton), rearranged
New information added by text mining
Used at ISI for machine translation, text summarization, database access
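A taxonomy with multiple superclass links is a directed acyclic graph rather than a tree. The Python sketch below shows one simple way to store such links and collect all ancestors of a concept. The `Taxonomy` class and the concept names are hypothetical and are not taken from SENSUS.

```python
class Taxonomy:
    """Concepts with any number of superclass links (toy structure)."""

    def __init__(self):
        self.parents = {}

    def add(self, concept, *superclasses):
        """Register a concept and link it to zero or more superclasses."""
        self.parents.setdefault(concept, set()).update(superclasses)
        for s in superclasses:
            self.parents.setdefault(s, set())

    def ancestors(self, concept):
        """Return every concept reachable by following superclass links upward."""
        seen = set()
        stack = list(self.parents.get(concept, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.parents[node])
        return seen


tax = Taxonomy()
tax.add("fuel", "substance")
tax.add("commodity", "object")
tax.add("gasoline", "fuel", "commodity")   # two superclass links
```

Here `tax.ancestors("gasoline")` follows both branches, which a single-parent tree could not represent.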

Extracting term info from online sources (Klavans et al., Columbia)
GetGloss: given a URL, find all the glossary files
– Glossary identification rules consider format tags, etc.
– F-score: 0.68 (2nd after SVM at 0.92)
ParseGloss: given a set of NL glossary definitions, extract and format the important information
– Identify term, def, head noun, etc.
– Evaluation underway
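To make the ParseGloss task concrete, here is a small Python sketch that splits a glossary line into term, definition, and a guessed head noun. It is not the ParseGloss implementation. The "Term: definition" input format, the `parse_entry` function, the cut-word list, and the sample sentence are assumptions for illustration, and the head-noun rule is a naive heuristic with no part-of-speech tagging.

```python
import re

# Hypothetical heuristic: the head noun is the last word of the opening
# noun phrase, which we assume ends at the first of these cut words.
CUT_WORDS = {"of", "that", "which", "who", "for", "in", "from", "with", "used"}
ARTICLES = {"a", "an", "the"}


def parse_entry(line):
    """Split 'Term: definition' and guess the definition's head noun."""
    term, sep, definition = line.partition(":")
    if not sep:
        return None  # not in the assumed 'Term: definition' format
    term, definition = term.strip(), definition.strip()
    words = re.findall(r"[A-Za-z-]+", definition.lower())
    if words and words[0] in ARTICLES:
        words = words[1:]
    chunk = []
    for w in words:
        if w in CUT_WORDS:
            break
        chunk.append(w)
    head = chunk[-1] if chunk else None
    return {"term": term, "definition": definition, "head_noun": head}


# Illustrative sentence, not quoted from any agency glossary.
entry = parse_entry("Kerosene: A light petroleum distillate used in space heaters.")
```

For this sentence the opening phrase is "light petroleum distillate", so the guessed head noun is "distillate". Definitions that do not start with a noun phrase would defeat the heuristic.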

Term-to-ontology alignment (Hovy et al., ISI)
How to link new concepts into the Ontology (or Domain Model) in the right places?
Manual approach expensive: NxM steps
Approach: try to automatically propose links, then hand-check only the best proposals
– Created and tested various match heuristics (NAME, DEF, TAXONOMY, DISPERSAL)
– Tried various clustering methods: CLINK, SLINK, Ward’s Method…, new version of k-Means (Euclidean and spherical distance measures)
– Tested numerous parameter combinations (stemming, etc.) in EDC and NHANES domains
Results not great
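The propose-then-hand-check approach can be sketched in Python as follows. The slide does not define the NAME heuristic, so this example substitutes a plain Jaccard overlap between name tokens. The `tokens`, `name_score`, and `propose_links` functions and the concept names are hypothetical.

```python
def tokens(name):
    """Lowercased word set of a concept name."""
    return set(name.lower().replace("-", " ").split())


def name_score(a, b):
    """Jaccard overlap of name tokens (a stand-in for a NAME-style heuristic)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def propose_links(new_terms, ontology_concepts, top_k=2):
    """For each new term, keep only the top_k best-scoring candidate concepts."""
    proposals = {}
    for term in new_terms:
        ranked = sorted(
            ((name_score(term, c), c) for c in ontology_concepts),
            key=lambda pair: (-pair[0], pair[1]),
        )
        proposals[term] = [(c, s) for s, c in ranked[:top_k] if s > 0]
    return proposals


concepts = ["natural gas", "gas station", "crude oil"]
proposals = propose_links(["natural gas liquids"], concepts)
```

A reviewer then checks at most `top_k` candidates per term instead of all N x M pairs, which is the cost saving the slide is after.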

User interface testbed (Feiner et al., Columbia)
Ontology entry shown in beam for selected item
– Located as near as possible
– Color coding shows parental and semantic relationships
Fisheye magnification of region of interest
– Magnified group laid out to avoid internal overlap
Menu presented as grid of alternating rows and columns
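The slide does not say which fisheye function the testbed uses. As an assumption, the Python sketch below applies a common graphical-fisheye transform, usually attributed to Sarkar and Brown, along one axis: a normalized distance x from the focus maps to (d + 1)x / (dx + 1), where d is the distortion factor. The `fisheye` function and its parameter names are hypothetical.

```python
def fisheye(pos, focus, distortion=3.0, lo=0.0, hi=1.0):
    """Move `pos` away from `focus` so the region near the focus is magnified.

    Positions are assumed to lie within [lo, hi]; the endpoints stay fixed.
    """
    if pos == focus:
        return pos
    boundary = hi if pos > focus else lo
    span = boundary - focus            # signed distance from focus to boundary
    if span == 0:
        return pos                     # focus sits on the boundary; nothing to stretch
    x = (pos - focus) / span           # normalized distance from the focus, 0..1
    g = (distortion + 1) * x / (distortion * x + 1)
    return focus + g * span


# Five evenly spaced menu columns, with the focus on the middle one.
columns = [0.0, 0.25, 0.5, 0.75, 1.0]
warped = [fisheye(c, focus=0.5) for c in columns]
```

With the focus at 0.5, the neighbors at 0.25 and 0.75 are pushed outward, so the space around the focused column grows while the outer columns are compressed toward the edges.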

AskCal: User requests in English (Philpot et al., ISI)
ATN:
– 341 nodes
– 14 question types
Automated paraphrase to confirm
Dialogue continues via menus for detailed selection

Interface/usage evaluation (Sommer et al., Columbia)
Evaluation study, started late 2001
What to evaluate?
Variables
– Category display
– Magnifying columns
– Fisheye proximity & magnification
– Searchlight
– Synonyms
Methods
– Observe cognitive styles
– Examples in other domains
Research on content
– Energy vs. Census domains
Task evaluation
Process
– Task scenario
– Interview
– Observation
Goal
– User behaviors
– User intuitiveness for different groups of users
– Strengths and weaknesses of the design
Participants
– Content experts
– Government agency workers
– Faculty and students

Thank you! Please come see our demos this afternoon!