

SNATZ Technology as a news analysis tool

Main terms used in the presentation
- Term: a phrase that the system uses for training its NLP algorithms.
- Summary: a phrase that the system automatically detects while analyzing news content.
- Trend: a unique chain that contains one or more Summaries. These chains are created as the result of cluster analysis.
- Tag: a term created by a moderator for detecting a user’s interest category.
- User interests: a cloud of Tags that the system recognizes from the user’s social accounts and OPML files.
- Segment: a group of ‘similar’ Trends whose search results intersect by more than 30%.
- Semantic network: a network that represents semantic relations between keywords.
- Data warehouse: a database used for reporting and data analysis.

The main goal of SNATZ
SNATZ is a data mining instrument. It recognizes the semantics of news content using an NLP algorithm.
On the basis of the acquired Summaries, SNATZ can derive new knowledge:
- detecting new Summary sets
- gathering Trend statistics
- building Segments
- using new Summaries as Terms for training the NLP algorithm
- recommending news from different Segments
Our solution makes it possible to change the paradigm of ‘Collaborative Filtering’.

SNATZ platform architecture
The SNATZ platform consists of:
- SNATZ Recommender System: personal recommendations based on the users’ interests.
- SNATZ Data Mining Tool: a semantic network of Trends, created by sending recognized metadata to analysis processing.

SNATZ Recommender
- Crawler. Interacts with Web sites by receiving RSS feeds and tweets. The content of RSS feeds and tweets is the main resource.
- Blogosphere. All resources detected by the web crawler are saved in the data warehouse, which forms the internal SNATZ “blogosphere”.
- Data Processing. Exports resources to the SNATZ DM Tool. Together with the news articles, it sends sets of labels, terms and summaries.
- Users’ Tags/Posts:
  - Posts. The system recognizes users’ posts from Facebook (Fb), Twitter (Tw) and OPML files.
  - Tags. Using the NLP algorithm, the system defines the user’s interests.
- Recommendations. Contains the rules for forming news recommendations.
- News archive. News items that were recommended to users.

SNATZ Data Mining Tool
- Documenter. Imports resources from the Data Processing component and sends them to the NLP component.
- NLP. Semantic analysis:
  - POS tagging
  - Defining article attributes: labels, terms and summary
- Meta-Docs. Data warehouse of articles with semantic analysis.
- Analysis:
  - Multi-clusterization
  - Trend definition
- Semantic Network of Trends:
  - Segments
- Reporter:
  - Trend statistics

Data workflow
[Diagram: the Recommender System (Web, Crawler, Blogs, Data Processing, Users, Tags/Posts, Recommendation, News Archive) exports to and imports from the Data Mining Tool (Docs, NLP, Meta-Docs, Analysis, Semantic Network of Trends, Reporter).]

Building segments
[Diagram: Documents become Meta-Docs (Labels, Terms, Summary), which form a Tree of Trends, which is grouped into Segments.]

Building segments
- The Documenter imports resources and sets of labels, terms and summaries from the Data Processing component and sends them to the NLP component.
- The NLP component recognizes attributes in the resources: labels, terms and summary. These resources become Meta-Docs and are saved in the data warehouse.
- Meta-Docs are sent to Analysis, and the system forms the current Trend Tree.
- Trends identify related Summaries, i.e. the main direction of their topics and sub-topics.
- Through such relations between Trends, SNATZ finds similar or related topics and groups them into Segments: if Trends intersect by more than 30%, they create a new Segment.
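The 30% rule can be illustrated with a minimal sketch. The slides do not say how the intersection is measured, so the Jaccard overlap of the Summaries in two Trends is used here as an assumption, and Segments are taken to be the connected groups of Trends that overlap above the threshold. The names `overlap` and `build_segments`, and the third sample Trend, are hypothetical.

```python
from itertools import combinations


def overlap(a: frozenset, b: frozenset) -> float:
    """Jaccard overlap of two Trends (an assumed measure, not from the slides)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_segments(trends, threshold=0.30):
    """Group Trends whose overlap exceeds the threshold (union-find)."""
    trends = list(trends)
    parent = list(range(len(trends)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(trends)), 2):
        if overlap(trends[i], trends[j]) > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, trend in enumerate(trends):
        groups.setdefault(find(i), []).append(trend)
    return list(groups.values())


trends = [
    frozenset({"Network", "Tumblr"}),
    frozenset({"Network", "Tumblr", "Instagram"}),
    frozenset({"Election", "Debate"}),  # hypothetical unrelated Trend
]
segments = build_segments(trends)
for segment in segments:
    print([sorted(trend) for trend in segment])
```

With this measure, the first two Trends share two of three distinct Summaries and land in one Segment, while the unrelated Trend stays in a Segment of its own.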

Recommendations
[Diagram: Users, Posts, Interests, Tags, Meta-Docs, Related Tags, Daily Review.]

Recommendations
- The system parses users’ posts from Facebook, Twitter and uploaded OPML subscription files. Interests are updated every 4 hours.
- The NLP component recognizes interests from Posts, resulting in a set of Tags. If the number of Tags is less than 12, the system tries to find related Tags.
- The system takes the Trends obtained from Meta-Docs and determines which Trends are related to the user’s interests. If a Trend contains a user’s interest, it becomes connected with that user.
- Summaries in a related Trend become the Related Tags.
- The system takes Trends from the user’s Trend Tree and builds the Daily Review.
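A rough sketch of the Related Tags step described above, under stated assumptions: Tags and Summaries are plain strings, a Trend is an ordered list of Summaries, and a Trend is "connected" to the user when it contains at least one of the user's Tags. The function name `related_tags` and the sample data are hypothetical; the slides do not say whether the result is capped, so no cap is applied.

```python
MIN_TAGS = 12  # threshold named in the slides


def related_tags(user_tags, trends, min_tags=MIN_TAGS):
    """Return Summaries from connected Trends that are not already user Tags."""
    user_tags = set(user_tags)
    if len(user_tags) >= min_tags:
        return []  # enough Tags already, no lookup needed

    related = []
    for trend in trends:
        if user_tags & set(trend):  # the Trend contains a user interest
            for summary in trend:
                if summary not in user_tags and summary not in related:
                    related.append(summary)
    return related


trends = [
    ["Network", "Tumblr", "Instagram"],
    ["Election", "Debate"],
]
print(related_tags({"Tumblr"}, trends))
```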

Personal Recommendations
[Diagram: User Interests; Get Trends; User’s Trends; Check Trends against Segments; User’s Tree of Trends; ‘Diversity’ Filtering; List of 12 News.]

Personal Recommendations
- The system takes the ‘Last Trends’ that contain the user’s interests and forms the User’s Trends.
- The User’s Trends are checked against Segments to form the User’s Trend Tree.
- ‘Diversity’ filtering:
  - no more than 2 interests from one category
  - no more than one news article per Trend
  - only news with new keywords (compared with previous sets of news)
  - only 1 news item from the same Segment
  - only 2 news items from one category
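The per-item ‘Diversity’ rules can be sketched as a single pass over candidate news. This is an illustration only: the `News` record, its field names, and the reading of "new keywords" as "at least one keyword not seen in previous sets" are assumptions, and the rule about interests per category is left out because it applies before news items are chosen. The limit of 12 comes from the "List of 12 News" in the slides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class News:  # hypothetical record; field names are assumptions
    title: str
    trend: str
    segment: str
    category: str
    keywords: frozenset


def diversity_filter(candidates, seen_keywords, limit=12):
    """Pick up to `limit` items: 1 per Trend, 1 per Segment, 2 per category."""
    picked = []
    used_trends, used_segments = set(), set()
    per_category = Counter()
    seen = set(seen_keywords)

    for item in candidates:
        if len(picked) == limit:
            break
        if item.trend in used_trends or item.segment in used_segments:
            continue
        if per_category[item.category] >= 2:
            continue
        if not (item.keywords - seen):  # nothing new compared with earlier sets
            continue
        picked.append(item)
        used_trends.add(item.trend)
        used_segments.add(item.segment)
        per_category[item.category] += 1
    return picked


candidates = [
    News("a", "t1", "s1", "tech", frozenset({"tumblr"})),
    News("b", "t1", "s2", "tech", frozenset({"instagram"})),  # same Trend as "a"
    News("c", "t2", "s1", "tech", frozenset({"network"})),    # same Segment as "a"
    News("d", "t3", "s3", "tech", frozenset({"old"})),        # no new keywords
    News("e", "t4", "s4", "tech", frozenset({"apps"})),
    News("f", "t5", "s5", "tech", frozenset({"phones"})),     # third "tech" item
    News("g", "t6", "s6", "politics", frozenset({"debate"})),
]
result = diversity_filter(candidates, seen_keywords={"old"})
print([item.title for item in result])
```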

SNATZ server architecture

High-availability cluster
The high-availability cluster provides the following services:
1. Virtual IP for the cluster.
2. DRBD storage for the cluster.
3. ext4 file system on top of DRBD.
4. OpenVZ containers on ext4 over DRBD.
Each cluster is assembled from two nodes. Corosync is used for cluster management, and Pacemaker is the resource manager. The system consists of five two-node clusters.

SNATZ server architecture
Redundant services run in OpenVZ containers and start together with the container. Redundant services in different containers interact over the local network, which is connected through a separate switch to the second network interface of each node. For each two-node cluster, a start sequence for the redundant services is defined:
1. Switch the active/passive DRBD role.
2. Mount the ext4 file system at the mount point of the active node.
3. Start the OpenVZ containers that are placed on DRBD.

SNATZ + Elasticsearch engine
Elasticsearch is a search server that provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.
Advantages of Elasticsearch for SNATZ:
- Elasticsearch is a stable, working project
- AWS Cloud Plugin (allows use of the Amazon EC2 API)
- Real-time data search and analysis
- Index versioning support
- Search capabilities: fuzzy queries, etc.
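As an illustration of the fuzzy search capability, the sketch below builds the JSON body of an Elasticsearch `match` query with the `fuzziness` parameter. It only constructs and prints the request body; no client library or HTTP call is used. The field name `title` and the search text are hypothetical, since the slides do not describe the SNATZ index layout.

```python
import json


def fuzzy_query(field, text, fuzziness="AUTO"):
    """Build a match query that tolerates small spelling differences."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,
                }
            }
        }
    }


# "title" is an assumed field name, not taken from the slides.
body = fuzzy_query("title", "tumbler")
print(json.dumps(body, indent=2))
```

A body like this would be sent to the `_search` endpoint of an index through the RESTful interface mentioned above.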

Elasticsearch + Amazon EC2

Features:
- ability to maintain a high-performance cluster designed for I/O-intensive operations
- new instances are started and stopped when required
- no need to pay for long-term servers and their administration
- pricing is per instance-hour consumed for each instance
- ability to create images from a working machine (configured and set up) and start other instances from these images

SNATZ + NLP
Features:
- Part-of-speech tagging
- Summary extraction
- User-defined Terms and Labels
- Synonym handling
- Supervised text classification using user-defined datasets for training and performance evaluation
Language support:
- English
- Japanese (using third-party tools such as MeCab)

Challenges of SNATZ
- Filter Bubble (user’s interests)
- Diversity and ‘Long Tail’
- Data sparsity (‘the cold start problem’)
- Scalability
- Segmentation (‘related topics’)

How does SNATZ solve these problems?
Using Trends.

What is the Filter Bubble?
A user sees only popular news selected by the TOP-Tags from his interest categories, and does not see related Tags outside the Filter Bubble.

What is a Trend?
All Summaries and Terms of articles are closely connected. The task of SNATZ is to identify the significant connections.

How are Trends detected?
[Diagram: News with Terms; Clustering by Labels; Clustering by Terms; Clustering by Summary; Trends detected by the system.]

Abstraction of the algorithm
The multilevel clustering algorithm has 3 levels of abstraction:
- Labels
- Terms
- Summary
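The three levels can be pictured with a deliberately simplified sketch. The slides do not describe the clustering method itself, so exact-match grouping stands in for it here: articles are grouped by label, then by their set of Terms, then by their set of Summaries, and any leaf group with at least two articles is reported as a Trend. The article fields, the `min_articles` cutoff and both function names are assumptions made for illustration.

```python
from collections import defaultdict


def nested_groups(articles):
    """Group article ids by label, then by Terms, then by Summaries."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for art in articles:
        terms = frozenset(art["terms"])
        summary = frozenset(art["summary"])
        tree[art["label"]][terms][summary].append(art["id"])
    return tree


def detect_trends(articles, min_articles=2):
    """Report leaf groups with enough articles as Trends (assumed cutoff)."""
    trends = []
    for label, by_terms in nested_groups(articles).items():
        for by_summary in by_terms.values():
            for summary, ids in by_summary.items():
                if len(ids) >= min_articles:
                    trends.append(
                        {"label": label, "summary": sorted(summary), "articles": ids}
                    )
    return trends


articles = [  # hypothetical sample data
    {"id": 1, "label": "tech", "terms": ["social"], "summary": ["Network", "Tumblr"]},
    {"id": 2, "label": "tech", "terms": ["social"], "summary": ["Tumblr", "Network"]},
    {"id": 3, "label": "tech", "terms": ["social"], "summary": ["Instagram"]},
    {"id": 4, "label": "politics", "terms": ["vote"], "summary": ["Election"]},
]
print(detect_trends(articles))
```

A real implementation would replace the exact-match keys with a similarity-based clustering step at each level.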

SNATZ outside the Filter Bubble
SNATZ tries to show news beyond the user’s filter bubble in order to cover more Trends. Trends identify related Summaries, i.e. the main direction of their topics and sub-topics.

Long Tail problem
SNATZ solves this problem as follows:
- For user recommendations, SNATZ selects Trends only from different Segments.
- In order to provide users with *new* content, SNATZ does NOT make recommendations based on Summaries that were already picked for previous recommendations. This way the user sees news based on the latest Trends.
- SNATZ does NOT use TOP-Tags from the user’s interest categories. Users usually do not see most news because its Popularity Rank is too low.

Collaborative Filtering
The first and most common way to determine the significance of an article is its social rating. This is determined through an advanced technique called Collaborative Filtering, which collects taste preferences or personal information (such as language, country, etc.) from many users and uses that data to make automatic predictions. SNATZ, by contrast, recommends news solely on the basis of user interests. Every recommendation step is unique and depends on the previous step. Recommendations are made only on the basis of the individual user's experience.

Effective Content Personalization
One approach to effective content personalization is called ‘the classification of trends’. It is based on identifying the most significant relationships between Summaries and creating a unique chain of Summaries called a Trend. A Trend contains one or more Summaries from Web content and determines specific subtopics. The main characteristic of a Trend is the dynamics of its chain of Summaries, which can be positive (growing) or negative (fading) over a specific period of time.
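One simple way to label the dynamics of a Trend, shown here only as a sketch: take the number of articles mentioning the Trend on each day of the period and look at the sign of the least-squares slope. The slides do not state how SNATZ measures dynamics, so both the input (daily counts) and the slope test are assumptions, as is the function name `trend_dynamics`.

```python
def trend_dynamics(daily_counts):
    """Classify a series of daily counts as growing, fading or flat."""
    n = len(daily_counts)
    if n < 2:
        return "flat"  # not enough points to estimate a direction

    mean_x = (n - 1) / 2
    mean_y = sum(daily_counts) / n
    # Least-squares slope of count against day index.
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_counts))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den

    if slope > 0:
        return "growing"
    if slope < 0:
        return "fading"
    return "flat"


print(trend_dynamics([1, 2, 4, 7]))
print(trend_dynamics([9, 5, 2, 1]))
```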

Automatic segmentation of the blogosphere
Through such relations between chains, SNATZ finds similar or related topics and groups them into Segments. For example, ‘Network+Tumblr’ intersects with ‘Network+Tumblr+Instagram’ by more than 30%, so these chains create a new Segment. Trends are determined by analyzing content in the current news state of the daily Blogosphere in its most basic form: relevant daily news topics. If a recommendation engine calculates the thematic proximity of Trends, it can auto-classify them into Trend Segments, so that similar sub-topics are put in the same Segment. This auto-classification splits Web content into various major topics.

Automatic segmentation
A recommendation engine that applies this classification process to Trends (and not Tags) solves two major personalization problems:
- It removes the Long Tail, making news recommendations from different Segments possible.
- It solves the problem of thematic proximity, making sure that similar or duplicate news is filtered out.

Data mining results. Infographics
The system can detect:
- more current Trends for any given topic
- ‘Related’ Tags for any given topic
- the dynamics of Trends
- the sentiment of news

Findings
As information becomes increasingly dense, consumers deserve to get the news that they want to read, not the news an algorithm thinks they want. SNATZ provides a personalization algorithm that can solve the challenges of the filter bubble and the long tail.

Thanks for your attention! SNATZ Team