Extracting value from grey literature
Processes and technologies for aggregating and analysing the hidden Big Data treasure of organisations


Big data: the 3 "V"s
Data sets satisfying one or more of the following requisites:
- Volume: huge amounts of data, which cannot be effectively managed with the usual technologies (e.g. relational database management systems)
- Velocity: high-throughput data collection and provisioning
- Variety: heterogeneity of sources, types and formats

Big data: definition issues 1/2
- Need for a less ambiguous operational definition, with explicit reference to usage contexts, technological environments and the actors involved
  - e.g. Volume and Velocity thresholds cannot be defined as absolute values, because they are strictly tied to current technological constraints
- The current definition does not take all challenges into account. What about:
  - Veracity: how can I ensure data reliability in such a heterogeneous and dynamic scenario?
  - Validity: how can I ascertain relevance for the intended use?
  - Volatility: how long is data valid, and how long should it be stored?

Big data: definition issues 2/2
- Volume, velocity, variety → quantitative and measurable aspects, more easily definable
- Veracity, volatility, validity → cannot be assessed with a simple, direct measure

Big data and grey literature
In grey literature, big data challenges include the management and processing of:
- Digital contents
- Metadata
- Contexts and relations
Grey literature products:
- are by definition characterized by Variety, in terms of heterogeneity of content types, formats and internal structures
- present issues of Veracity, Volatility and Validity

Grey literature: text and data mining on a large scale
Big data technologies + text and data mining analysis tools
Potential benefits:
- Early discovery of research trends
- Detection of hidden relations
- Metadata enrichment

Stakeholders
- Political decision-makers: support for long-term research planning
- Industry: useful indicators to be taken into account for future investments
- Researchers: detection of upcoming or implicit research trends, or of interesting connections between research groups and fields

Issues and solutions
Main issues: Veracity, Validity, Volatility
- Veracity, proposed solutions:
  - quality level agreements
  - data cleansing procedures
  - cross-check with external sources
- Main issues to be addressed: Validity, Volatility
  - highly dependent on context
  - process-bound issues
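As an illustration of the "cross-check with external sources" idea, here is a minimal sketch under stated assumptions: metadata records are plain dictionaries, and the external source returns a record with the same field names. The function name `cross_check`, the field names and the sample values are all hypothetical and do not come from the presentation.

```python
def cross_check(record, reference, fields):
    """Return the fields whose normalized values differ between the two records."""
    def norm(value):
        # Collapse whitespace and ignore case so cosmetic differences do not count.
        return " ".join(str(value).split()).casefold()

    return [f for f in fields if norm(record.get(f, "")) != norm(reference.get(f, ""))]


# Hypothetical sample data: one ingested record and its external counterpart.
record = {"title": "Annual  Technical Report", "publisher": "Institute X"}
reference = {"title": "annual technical report", "publisher": "Institute Y"}

mismatches = cross_check(record, reference, ["title", "publisher"])
print(mismatches)
```

Fields reported as mismatches would then be candidates for manual review or for a cleansing step; how disagreements are resolved is left open here.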

Analysis types
- Classification → e.g. for metadata enrichment and linking
- Clustering → e.g. detection of non-trivial connections between research fields or organizations/research groups
- Association rules → pattern detection (e.g. if a research group is specialized in field A, they have or will develop connections with groups specialized in field B with probability x)
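The "probability x" in the association rule example corresponds to what is usually called the confidence of a rule A → B. The sketch below shows how support and confidence could be computed, assuming each research group is described by the set of fields it is connected to. The function name `rule_metrics` and the sample data are hypothetical, and this is not the mining algorithm used by the authors.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(transactions)
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    # Support: share of all groups that contain both fields.
    support = len(with_both) / total if total else 0.0
    # Confidence: share of groups containing the antecedent that also contain the consequent.
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical data: each set lists the fields one research group is linked to.
groups = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B", "C"},
]

support, confidence = rule_metrics(groups, "A", "B")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

In this toy data set, two of the three groups linked to field A are also linked to field B, so the rule A → B holds with confidence 2/3.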

Proposed Architecture
[Diagram: layered architecture]
- Access layer: Ingestion, Retrieval
- Presentation layer: Navigation and browsing, Reports and analysis
- Service layer: BI, Data Mining, Text Mining, Information Retrieval
- Data layer: Metadata (NoSQL DB), Digital contents (Distributed FS)
- External actors: User community A, User community B, User community C, Content providers, LOD Cloud (analysis results)

Proposed Workflow
[Diagram: workflow]
- Inputs: SIP (Metadata, Contents), External sources
- Processing steps: Virus check, Metadata extraction/validation, Text extraction, Deduplication, Indexing, Cleansing, Enrichment
- Inference algorithms: Clustering, Classification, Association Rules
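The slide names a Deduplication step without saying how it works. One common approach, shown here purely as an assumption, is to compare cryptographic digests of the content bytes. The function name `deduplicate` and the file names are hypothetical, and exact-digest matching only catches byte-identical copies, not near-duplicates.

```python
import hashlib


def deduplicate(documents):
    """Keep the first document seen for each distinct SHA-256 content digest."""
    seen = set()
    unique = []
    for name, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique


# Hypothetical incoming batch of (file name, raw bytes) pairs.
incoming = [
    ("report_v1.pdf", b"annual report"),
    ("copy_of_report.pdf", b"annual report"),
    ("thesis.pdf", b"a thesis"),
]

unique_names = deduplicate(incoming)
print(unique_names)
```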

Conclusions and future work
- Main issues are mostly related to organisational aspects
- Benefits from the integration of metadata, contents and analysis results in the LOD cloud → the system's added value
- Importance of feedback from user communities for the continuous improvement of analysis result quality
- Formal definitions for Veracity, Validity, Volatility and related evaluation criteria