Published by Kadin Lynde. Modified over 9 years ago.
1
From Web archiving services to a Web-scale data processing platform
Internet Memory Research
IIPC GA, Paris, May 19th 2014
2
Overview
o Internet Memory Research
  o Company
  o Vision
  o Technologies
o Services
  o Archive the Net
  o Mignify
  o Newstretto
o Use-Cases
  o Improve your Selection Process
  o Search in your Web archive
  o Extract valuable information
3
o Spin-off of the Internet Memory Foundation
o French start-up, founded in 2011
o 20+ engineers
o Actively engaged in the Web Information Mining field:
  o EU Projects: DOPA, Annomarket, TrendMiner, Rethink Big, ASAP
  o Clusters: Cap Digital & Systematic
  o Alliance Big Data
  o Conferences: Search, iexpo, Crawl the Web...
4
Vision
o The Web is full of valuable data:
  o Variety
  o Quantity
o This data is not so easy to collect, access and process at large scale
o Making Web data available will create many new business opportunities for the data ecosystem
5
Technologies
o Large-scale, high-performance crawler
o Scalable platform based on:
  o A distributed architecture
  o Big data components (Hadoop, HBase, HDFS, ...)
o Set of proprietary and open source analytic agents providing:
  o Text mining & data mining
  o Semantic operations
  o Statistical operations
o Infrastructure:
  o 170+ servers
  o Innovative infrastructure with low power consumption
6
References
7
From
✓ SaaS, automated software service with a friendly user interface
✓ Qualified team to provide quality
✓ Combining new technology and user needs
For whom?
Any institution whose aim is to collect and preserve web material for historical, cultural or heritage purposes
o Archives / Research: selective crawls with a high level of Quality Assurance
o National Libraries: large-scale crawl for the German National Library
o A.V. Archives: advanced module for web video and social media content
8
To Web data processing platform
o Marketplace for technological bricks
o Crawl on demand
o Source packages
o Sets of extracted data (prices, posts, micro-formats)
9
Through
o Innovative app fighting the information deluge and bringing you tailor-made information
o You give keywords, and it brings back, from the Web and social media, selected hot and relevant news, without all the noise
o Today 8+ million URLs are sent to the platform, and around ¼ of them match users' favorite topics
10
Improve your Selection Process
o Manual selection vs. Newstretto
o Automated refresh rate for active sources (RSS, forums, ...)
o Smart discovery crawl for large crawls (topic, language, TLD, ...)
11
Example of RSS Refresh Rate (sample)
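A minimal sketch of how a refresh rate for an active source could be automated. The slides do not describe the actual method, so the heuristic here (median gap between recent item timestamps, clamped to fixed bounds) and the name `next_refresh_interval` are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from statistics import median


def next_refresh_interval(
    published: list[datetime],
    min_interval: timedelta = timedelta(minutes=5),
    max_interval: timedelta = timedelta(days=1),
) -> timedelta:
    """Estimate how long to wait before re-fetching a feed.

    Hypothetical heuristic: take the median gap between consecutive
    items and clamp it to [min_interval, max_interval].
    """
    if len(published) < 2:
        # Not enough history yet: fall back to the slowest rate.
        return max_interval
    times = sorted(published)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    interval = timedelta(seconds=median(gaps))
    return max(min_interval, min(interval, max_interval))


# Usage: a feed that published an item every 30 minutes.
start = datetime(2014, 5, 19, 9, 0)
half_hourly_feed = [start + timedelta(minutes=30 * i) for i in range(6)]
print(next_refresh_interval(half_hourly_feed))
```

Clamping keeps very busy sources from being polled too aggressively and keeps quiet sources from being dropped altogether.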
12
Search in your Large Corpus
o Full-text index with Elasticsearch
o Automated categorization (news, forums, blogs, ...)
o Semantic expansion
o Topic matching
13
Example of Semantic Expansion
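One way semantic expansion could be expressed against a full-text index is sketched below, as an Elasticsearch `bool` query whose `should` clauses cover the original term and its related terms. The expansion table `RELATED_TERMS`, the field name `content`, and the function `expanded_query` are hypothetical; the slides do not say how related terms are obtained.

```python
import json

# Hypothetical hand-written expansion table; a real system would more
# likely derive related terms from a thesaurus or a learned model.
RELATED_TERMS = {
    "web archive": ["web archiving", "internet archive", "web preservation"],
}


def expanded_query(term: str, field: str = "content") -> dict:
    """Build a query body matching the term or any of its related terms."""
    clauses = [{"match_phrase": {field: term}}]
    for related in RELATED_TERMS.get(term.lower(), []):
        clauses.append({"match_phrase": {field: related}})
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


# Usage: print the request body that would be sent to the search endpoint.
print(json.dumps(expanded_query("web archive"), indent=2))
```

The function only builds the request body, so it can be inspected or tested without a running cluster.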
14
Extract valuable information from your large corpus for Users / Researchers
o Cleaned text
o Keywords, to add a keyword cloud
o Outlinks, to analyze graphs
o Structuring of unstructured data (forums, ...)
o Named entities (partner's brick)
o Summarization (partner's brick)
o More coming soon...
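The first three outputs listed (cleaned text, keywords, outlinks) can be illustrated with a small standard-library sketch. This is not the platform's implementation: the class `PageExtractor`, the function `extract`, the tiny `STOPWORDS` set and the frequency-based keyword choice are all assumptions made for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"a", "and", "in", "is", "of", "the", "to"}


class PageExtractor(HTMLParser):
    """Collect visible text and outlinks from an HTML page."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.text_parts: list[str] = []
        self.outlinks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.outlinks.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only fragments.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html: str, top_n: int = 3):
    """Return (cleaned text, most frequent keywords, outlinks)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    keywords = [w for w, _ in Counter(words).most_common(top_n)]
    return text, keywords, parser.outlinks


# Usage on a small sample page.
sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Archive the web. Web data is valuable.</p>"
    '<a href="http://archivethe.net">Archive</a></body></html>'
)
text, keywords, outlinks = extract(sample, top_n=2)
print(text, keywords, outlinks, sep="\n")
```

The keyword counts could feed a cloud view and the outlinks a link graph; named entities and summarization would need dedicated components, as the slide notes.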
15
Example of Extracted Data
o URL
o Thread
o Dates
o User names
o Content
16
What if you could integrate those tools on top of your current corpus?
17
Chloé Martin
Co-founder & Sales Manager
chloe@internetmemory.net

http://archivethe.net  contact@archivethe.net  @archivethenet
http://newstretto.com  contact@newstretto.com  @newstretto
http://mignify.com  contact@mignify.com  @mignify

With the support of the European Commission