From Web Archiving Services to a Web-Scale Data Processing Platform
Internet Memory Research
GA IIPC, Paris, May 19th 2014
Overview
o Internet Memory Research: Company, Vision, Technologies
o Services: Archive the Net, Mignify, Newstretto
o Use-Cases: Improve your Selection Process, Search in your Web archive, Extract valuable information
o Spin-off of the Internet Memory Foundation
o French start-up, founded in engineers
o Actively engaged in the Web Information Mining field:
  EU Projects: DOPA, Annomarket, TrendMiner, Rethink Big, ASAP
  Clusters: Cap Digital & Systematic
  Alliance Big Data
  Conferences: Search, iexpo, Crawl the Web...
Vision
o The Web is full of valuable data, in both variety and quantity
o This data is not easy to collect, access and process at large scale
o Making Web data available will create many new business opportunities for the data ecosystem
23/04/2015 Internet Memory Research
Technologies
o High-performance, large-scale crawler
o Scalable platform based on:
  A distributed architecture
  Big data components (Hadoop, HBase, HDFS, ...)
o Set of proprietary and open source analytic agents providing:
  Text Mining & Data Mining
  Semantic operations
  Statistical operations
o Infrastructure:
  170+ servers
  Innovative infrastructure with low power consumption
References
From
✓ SaaS, automated software service with a friendly user interface
✓ Qualified team to provide quality
✓ Combining new technology and user needs
For whom? Any institution whose aim is to collect and preserve web material for historical, cultural or heritage purposes
o Archives / Research: selective crawls with a high level of Quality Assurance
o National Libraries: large-scale crawl for the German National Library
o A.V. Archives: advanced module for web video and social media content
To Web data processing platform
o Marketplace for technology building blocks
o Crawl on demand
o Sources Packages
o Set of extracted data (price, posts, micro-formats)
Through
o Innovative app that fights the information deluge and brings you tailored information
o You give keywords, and it brings back, from the Web and social media, selected hot and relevant news, without all the noise
o Today, 8M+ URLs are sent to the platform, and around ¼ of them match users' favorite topics
Improve your Selection Process
o Manual selection vs. Newstretto
o Automated refresh rate for active sources (RSS, forums, ...)
o Smart discovery crawl for large crawls (topic, language, TLD, ...)
Example of RSS Refresh Rate (sample)
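One way an automated refresh rate for an active source could be derived is sketched below. This is only an illustration under stated assumptions, not the method used by the platform: the function name `refresh_interval`, the median-gap heuristic and the clamping bounds are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def refresh_interval(pub_dates,
                     min_interval=timedelta(minutes=15),
                     max_interval=timedelta(days=7)):
    """Estimate how often to re-fetch a feed from its item publication dates.

    Assumption (not from the slides): the typical gap between two
    consecutive items is a reasonable re-crawl interval, clamped so that
    very busy feeds are not hammered and quiet feeds are not forgotten.
    """
    dates = sorted(pub_dates)
    if len(dates) < 2:
        # Too little history: fall back to the slowest allowed rate.
        return max_interval
    # Gaps between consecutive publications, in seconds.
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(dates, dates[1:])]
    typical_gap = timedelta(seconds=median(gaps))
    return max(min_interval, min(typical_gap, max_interval))


if __name__ == "__main__":
    # Hypothetical feed that publishes one item per hour.
    start = datetime(2014, 5, 19, 10, 0)
    hourly_feed = [start + timedelta(hours=i) for i in range(4)]
    print(refresh_interval(hourly_feed))
```

The median is used instead of the mean so that a single long pause (a weekend, for instance) does not slow down the refresh of an otherwise active source.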
Search in your Large Corpus
o Full-text index with Elasticsearch
o Automated categorization (news, forums, blogs, ...)
o Semantic expansion
o TopicMatching
Example of Semantic Expansion
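As a rough illustration of how semantic expansion could be combined with a full-text index, the sketch below builds an Elasticsearch query body in which the user's keyword and its related terms are `should` clauses of a `bool` query. The expansion table, the `content` field name and the boost value are assumptions made for the example; the slides do not describe how the platform actually expands queries.

```python
import json

# Hypothetical expansion table; in practice this could come from a
# thesaurus or a semantic resource. It is not taken from the slides.
EXPANSIONS = {
    "car": ["automobile", "vehicle"],
}


def build_expanded_query(keyword, expansions, field="content"):
    """Build an Elasticsearch query body matching a keyword or related terms.

    The original keyword is boosted so that exact matches rank above
    documents that only mention an expanded term.
    """
    related = expansions.get(keyword.lower(), [])
    should = [{"match": {field: {"query": keyword, "boost": 2.0}}}]
    should += [{"match": {field: {"query": term}}} for term in related]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


if __name__ == "__main__":
    # The resulting body can be sent to the index's _search endpoint.
    print(json.dumps(build_expanded_query("car", EXPANSIONS), indent=2))
```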
Extract valuable information from your large corpus for Users / Researchers
o Cleaned text
o Keywords to build a cloud
o Outlinks to analyze graphs
o Structuring of unstructured data (forums, ...)
o Named entities (partner's building block)
o Summarization (partner's building block)
o More coming soon...
Example of Extracted Data
Fields: URL, Thread, Dates, User names, Content
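To make the extraction use case more concrete, here is a minimal sketch that derives cleaned text, frequent keywords and outlinks from a single HTML page using only the Python standard library. It is an assumption-laden illustration, not the platform's analytic agents: the class `PageExtractor`, the function `extract`, the stopword list and the word-frequency heuristic are all hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"this", "that", "with", "from", "have", "your"}


class PageExtractor(HTMLParser):
    """Collect visible text and link targets while parsing a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.outlinks.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url, top_n=5):
    """Return cleaned text, the most frequent keywords and the outlinks."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if len(w) > 3 and w not in STOPWORDS]
    keywords = [word for word, _ in Counter(words).most_common(top_n)]
    return {"text": text, "keywords": keywords, "outlinks": parser.outlinks}


if __name__ == "__main__":
    sample = ('<html><body><p>Web archive, web archive tools</p>'
              '<a href="/about">About</a></body></html>')
    print(extract(sample, "http://example.com/"))
```

The keywords could feed a keyword cloud and the outlinks a link graph, in the spirit of the list above; structuring forum threads or recognizing named entities would need dedicated components.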
What if you could integrate those tools on top of your current corpus?
Chloé Martin, Co-founder & Sales Manager
Internet Memory Research
With the support of the European Commission