Big Data at BITEM Research Group


(Text|Web) Mining Research Group
- Research projects: Digital Libraries, Web, Personalized medicine, Patent analytics, Consumer Analytics, Pharmacovigilance, Clinical trials…
- Specialised in (semi|un)structured data
  - We like text, text and more text
  - Especially on the noisy/dirty Web
- Technological expertise: CouchDB replication, SolrCloud (distributed indexing and search), indexing/searching in SSD/HDFS/Hadoop, SPARQL endpoints…

Pharmacovigilance on Big Social Media Data
Dynamic and Real Time Data Analysis
- Diagram components: Drugbank, Twitter API, RSS, Forum, CouchDB, Cleaning, Normalisation, Trends Analysis, Correlation Analysis, Novelty Detection
- 26’000 per day
- 19’000 drug names checked every 10 min
- 7 M docs in 9 months
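The cleaning, normalisation, trend and novelty steps named on this slide can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-name lexicon stands in for the 19’000 drug names, the `normalise`, `daily_mentions` and `novel_drugs` helpers are hypothetical, and the "count exceeds a multiple of the historical mean" rule is a simple placeholder, not the group's actual novelty detection method.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon; the slide mentions 19'000 drug names.
DRUG_NAMES = {"aspirin", "ibuprofen", "warfarin"}

def normalise(text):
    """Lowercase, drop URLs and @mentions, then keep alphabetic tokens only."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    return re.findall(r"[a-z]+", text)

def daily_mentions(posts):
    """posts: iterable of (day, text). Returns {day: Counter(drug -> number of posts)}."""
    counts = {}
    for day, text in posts:
        found = DRUG_NAMES.intersection(normalise(text))
        counts.setdefault(day, Counter()).update(found)
    return counts

def novel_drugs(counts, day, previous_days, factor=3.0):
    """Flag drugs whose count on `day` exceeds `factor` times their mean on previous days."""
    flagged = []
    for drug, n in counts.get(day, Counter()).items():
        history = [counts.get(d, Counter())[drug] for d in previous_days]
        baseline = sum(history) / len(history) if history else 0.0
        # Floor the baseline at 1 so a single mention of a rare drug is not flagged.
        if n > factor * max(baseline, 1.0):
            flagged.append(drug)
    return sorted(flagged)

posts = [
    ("d1", "aspirin ok"),
    ("d2", "Aspirin! bad headache http://x.co/a"),
    ("d2", "ASPIRIN and warfarin @bob"),
    ("d2", "#aspirin"),
    ("d2", "aspirin again"),
    ("d2", "aspirin"),
]
counts = daily_mentions(posts)
print(novel_drugs(counts, "d2", ["d1"]))
```

In a setup like the one on the slide, the posts would come from the Twitter API, RSS and forum feeds and the counts would be stored in CouchDB; here they are kept in memory to keep the sketch self-contained.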

Managing the data deluge for protein annotation
- Protein annotation based on literature, by curators: articles, annotated articles, GOA
- Manual annotation planned for 2045! (Baumgartner et al)
- 40’000 concepts [big-scale multiclass multilabel classifier]: lazy learning!
- Machine learning based on information retrieval methods
- Assisting curators
- Macro reading of literature
- Profiling any textual content
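Lazy learning built on information retrieval can be sketched as a k-nearest-neighbour annotator: retrieve the annotated articles most similar to the new text, then let their labels vote. The sketch below is a generic illustration under stated assumptions, not the group's system: the TF-IDF weighting, the cosine similarity, the `knn_annotate` helper and the toy documents and labels are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_annotate(query, docs, labels, k=2, top=2):
    """Rank annotated docs by similarity to the query; sum similarity per label of the k best."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    sims = [cosine(q, v) for v in vectors]
    nearest = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)[:k]
    scores = defaultdict(float)
    for i in nearest:
        for label in labels[i]:
            scores[label] += sims[i]
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, score in ranked if score > 0][:top]

# Toy annotated collection; texts and concept labels are invented for illustration.
docs = [
    "kinase phosphorylates protein substrate",
    "membrane transport of ions",
    "protein kinase signaling cascade",
]
labels = [
    {"kinase activity"},
    {"ion transport"},
    {"kinase activity", "signal transduction"},
]
print(knn_annotate("kinase signaling protein", docs, labels))
```

No model is trained ahead of time: all work happens at query time against the index of annotated articles, which is why this style of classifier copes with tens of thousands of possible labels.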

Patent retrieval

The real situation (0.5-1 TB)
Database: 13 million patents
Stage          Duration   Output                   Size
Extraction     33 days    XML patents              Tb
Normalization  33 days    XML patents + metadata   Tb
Indexing       5 days     Index                    0.1 Tb

Experiments
Database: a sample of 1 million patents
Stage          Duration   Output                   Size
Extraction     2.5 days   XML patents              17 Gb
Normalization  2.5 days   XML patents + metadata   18 Gb
Indexing       10 hours   Index                    3 Gb
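The relation between the 1 million patent sample and the 13 million patent collection can be checked with a simple linear extrapolation. This is a sketch under the assumption that time and size grow proportionally with the number of patents and that 1 Tb = 1000 Gb; the `extrapolate` helper is hypothetical and only the sample figures come from the slide.

```python
SAMPLE_PATENTS = 1_000_000
FULL_PATENTS = 13_000_000

# Sample measurements from the slide: stage -> (duration in days, output size in Gb).
SAMPLE = {
    "extraction": (2.5, 17),
    "normalization": (2.5, 18),
    "indexing": (10 / 24, 3),  # 10 hours expressed in days
}

def extrapolate(sample, factor):
    """Scale durations and sizes linearly by `factor` (assumes proportional cost)."""
    return {stage: (days * factor, size * factor) for stage, (days, size) in sample.items()}

estimate = extrapolate(SAMPLE, FULL_PATENTS / SAMPLE_PATENTS)
for stage, (days, size_gb) in estimate.items():
    print(f"{stage}: {days:.1f} days, {size_gb / 1000:.2f} Tb")
```

Under these assumptions the estimated durations (32.5, 32.5 and about 5.4 days) agree with the 33, 33 and 5 days shown for the full collection. The linearly scaled index size comes out near 0.04 Tb, below the 0.1 Tb on the slide, so that figure may be rounded or may not scale linearly.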