From Web Archiving services to Web scale data processing platform Internet Memory Research GA IIPC, Paris, May 19th 2014.

Presentation transcript:

From Web Archiving services to Web scale data processing platform Internet Memory Research GA IIPC, Paris, May 19th 2014

Overview
o Internet Memory Research: Company, Vision, Technologies
o Services: Archive the Net, Mignify, Newstretto
o Use-Cases: Improve your Selection Process, Search in your Web archive, Extract valuable information

Spin-off of the Internet Memory Foundation
o French start-up, founded in engineers
o Actively engaged in the Web Information Mining field:
    EU Projects: DOPA, Annomarket, TrendMiner, Rethink Big, ASAP
    Clusters: Cap Digital & Systematic
    Alliance Big Data
    Conferences: Search, iexpo, Crawl the Web...

Vision
The Web is full of valuable data:
o Variety
o Quantity
This data is not easy to collect, access and process at large scale.
Making Web data available will create many new business opportunities for the data ecosystem.
Internet Memory Research, 23/04/2015

Technologies
o Large-scale, high-performance crawler
o Scalable platform based on:
    a distributed architecture
    big data components (Hadoop, HBase, HDFS, ...)
o Set of proprietary and open source analytic agents providing:
    text mining & data mining
    semantic operations
    statistical operations
o Infrastructure:
    170+ servers
    innovative infrastructure with low consumption

References

From
✓ SaaS, automated software service with a friendly user interface
✓ Qualified team to provide quality
✓ Combining new technology and user needs
For whom? Any institution whose aim is to collect and preserve web material for historical, cultural or heritage purposes
o Archives / Research: selective crawls with a high level of Quality Assurance
o National Libraries: large-scale crawl for the German National Library
o A.V. Archives: advanced module for web video and social media content

To Web data processing platform
o Marketplace for technological bricks
o Crawl on demand
o Source packages
o Set of extracted data (price, posts, micro-formats)

Through
Innovative app that fights the information deluge and brings you tailored information.
You give keywords, and it brings back selected hot and relevant news from the Web and social media, without all the noise.
Today 8+M URLs are sent to the platform, and around ¼ of the URLs match users' favorite topics.

Improve your Selection Process
o Manual selection vs. Newstretto
o Automated refreshment rate for active sources (RSS, forums, ...)
o Smart discovery crawl for large crawls (topic, language, TLD, ...)
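
The slides do not say how the automated refreshment rate is computed. As a minimal sketch, assuming a simple halve-or-double policy (an illustration, not Newstretto's actual method), a crawler could revisit a source sooner after it has changed and later after it has not. The function name, the default bounds and the policy itself are hypothetical.

```python
def next_interval(current_minutes, changed, min_minutes=15, max_minutes=1440):
    """Return the next revisit interval, in minutes, for one source.

    Hypothetical policy: halve the interval when the last visit found new
    content, double it otherwise, and keep the result within fixed bounds.
    """
    if changed:
        proposed = current_minutes / 2
    else:
        proposed = current_minutes * 2
    # Clamp so that a source is never polled too often or forgotten.
    return max(min_minutes, min(max_minutes, proposed))


# Usage: an RSS feed polled hourly that just published a new item.
interval = next_interval(60, changed=True)
```

With this kind of policy, active sources such as busy RSS feeds drift toward the lower bound, while quiet sources are visited less and less often.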

Example of RSS Refreshment Rate (sample)

Search in your Large Corpus
o Full-text index with Elasticsearch
o Automated categorization (news, forums, blogs, ...)
o Semantic expansion
o Topic matching

Example of Semantic Expansion
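
The transcript does not describe how semantic expansion is implemented. One possible reading, sketched here under assumptions, is to expand a user term with related terms and search for any of them in the full-text index. The expansion table, the field name `text` and both function names are hypothetical; the returned dictionary follows the Elasticsearch Query DSL (`bool` query with `should` clauses) and is only built here, not sent to a server.

```python
# Hypothetical expansion table; a real system would derive related terms
# from a thesaurus or from the corpus itself.
EXPANSIONS = {
    "car": ["automobile", "vehicle"],
}


def expand(term, table=EXPANSIONS):
    """Return the term followed by its related terms, if any are known."""
    return [term] + table.get(term.lower(), [])


def build_query(term, field="text"):
    """Build a query body matching documents that contain any expanded term."""
    clauses = [{"match": {field: t}} for t in expand(term)]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


# Usage: the resulting body could be passed to a search request.
query = build_query("car")
```

Keeping the expansion step separate from the query builder makes it easy to swap in a different source of related terms.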

Extract valuable information from your large corpus for Users / Researchers
o Cleaned text
o Keywords to build a keyword cloud
o Outlinks to analyze graphs
o Structure unstructured data (forums, ...)
o Named entities (partner's brick)
o Summarization (partner's brick)
o More are coming soon...
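
As an illustration of the first and third items (cleaned text and outlinks), the following minimal sketch uses only the Python standard library. It is not the platform's extractor; the class and function names are hypothetical, and real-world cleaning would also need boilerplate removal and URL normalization.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect visible text and outlinks from one HTML page."""

    SKIP = {"script", "style"}  # content of these tags is not visible text

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.outlinks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.outlinks.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is outside script and style elements.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (cleaned_text, outlinks) for an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.outlinks
```

The list of outlinks can then feed a link graph, and the cleaned text can feed keyword counting or the partner bricks for named entities and summarization.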

Example of Extracted Data: URL, thread, dates, user names, content

What if you could integrate those tools on top of your current corpus?

Chloé Martin, Co-founder & Sales Manager, Internet Memory Research
With the support of the European Commission