Link Analysis: Current State of the Art Ronen Feldman Computer Science Department Bar-Ilan University, ISRAEL


Introduction to Text Mining

Find documents matching the query
Display information relevant to the query
Extract information from within the documents
Actual information buried inside documents
Long lists of documents
Aggregate over entire collection

Let Text Mining Do the Legwork for You
Diagram labels: Read; Consolidate; Absorb / Act; Understand; Find Material; Text Mining

What Is Unique in Text Mining? Feature extraction. A very large number of features represents each document. The need for background knowledge. Even patterns supported by a small number of documents may be significant. A huge number of patterns, hence the need for visualization and interactive exploration.

Document Types Structured documents –Output from CGI Semi-structured documents –Seminar announcements –Job listings –Ads Free format documents –News –Scientific papers

Text Representations Character Trigrams Words Linguistic Phrases Non-consecutive phrases Frames Scripts Role annotation Parse trees

The 100,000 foot Picture

Intelligent Auto-Tagging
(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson
Finsbury Park Mosque Abu Hamza al-Masri chief cleric Finsbury Park Mosque England Abu Hamza al-Masri London 1999 his alleged involvement in a Yemen bomb plot England France United States Belgium Abu Hamza al-Masri London
……. The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States. "The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited." ……

Intelligence Article

Google’s Article

Merger

Leveraging Content Investment Any type of content: unstructured textual content (current focus); structured data, audio, video (future). From any source: WWW, file systems, news feeds, etc.; single source or combined sources. In any format: documents, PDFs, e-mails, articles, etc.; “raw” or categorized; formal, informal, or a combination.

Information Extraction

Relevant IE Definitions Entity: an object of interest such as a person or organization. Attribute: a property of an entity such as its name, alias, descriptor, or type. Fact: a relationship held between two or more entities such as Position of a Person in a Company. Event: an activity involving several entities such as a terrorist act, airline crash, management change, new product introduction.
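The four definitions above can be pictured as simple record types. The following is only an illustrative sketch: the class names, field names, and the relation and event labels are hypothetical, not part of any IE system described in these slides. The sample values come from the Chicago Tribune excerpt shown earlier.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    # An object of interest, e.g. a person or an organization.
    name: str
    entity_type: str
    # Attributes are properties of the entity (alias, descriptor, ...).
    attributes: dict = field(default_factory=dict)

@dataclass
class Fact:
    # A relationship held between two or more entities.
    relation: str
    arguments: tuple

@dataclass
class Event:
    # An activity involving several entities.
    event_type: str
    participants: list
    details: dict = field(default_factory=dict)

cleric = Entity("Abu Hamza al-Masri", "Person", {"descriptor": "chief cleric"})
mosque = Entity("Finsbury Park Mosque", "Organization")
position = Fact("PositionOfPersonInOrganization", (cleric, mosque))
arrest = Event("Arrest", [cleric], {"location": "London", "year": 1999})
```

Keeping attributes on the entity and relationships in separate records mirrors the distinction the slide draws between attributes, facts, and events.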

IE Accuracy by Information Type
Information Type    Accuracy
Entities            90-98%
Attributes          80%
Facts               60-70%
Events              50-60%

MUC Conferences
Conference    Year    Topic
MUC                   Naval Operations
MUC                   Naval Operations
MUC                   Terrorist Activity
MUC                   Terrorist Activity
MUC                   Joint Venture and Micro Electronics
MUC                   Management Changes
MUC                   Space Vehicles and Missile Launches

Applications of Information Extraction Routing of Information Infrastructure for IR and for Categorization (higher level features) Event Based Summarization. Automatic Creation of Databases and Knowledge Bases.

Where would IE be useful? Semi-Structured Text Generic documents like News articles. Most of the information in the document is centered around a set of easily identifiable entities.

Approaches for Building IE Systems Knowledge Engineering Approach –Rules are crafted by linguists in cooperation with domain experts. –Most of the work is done by inspecting a set of relevant documents. –Can take a lot of time to fine tune the rule set. –Best results were achieved with KB based IE systems. –Skilled/gifted developers are needed. –A strong development environment is a MUST!

Approaches for Building IE Systems Automatically Trainable Systems –The techniques are based on pure statistics and almost no linguistic knowledge. –They are language independent. –The main input is an annotated corpus. –Building the rules takes relatively little effort, but creating the annotated corpus is extremely laborious. –A huge number of training examples is needed in order to achieve reasonable accuracy. –Hybrid approaches can utilize the user input in the development loop.

Components of IE System

Why is IE Difficult? Different Languages –Morphology is very easy in English, much harder in German and Hebrew. –Identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese and Japanese. –Some languages use orthography (like English) while others (like Hebrew, Arabic, etc.) do not have it. Different types of style –Scientific papers –Newspapers –Memos –E-mails –Speech transcripts Type of Document –Tables –Graphics –Small messages vs. Books

Link Analysis on Large Textual Networks Social Network Analysis

The Kevin Bacon Game The game works as follows: given any actor, find a path between the actor and Kevin Bacon that has fewer than 6 edges. For instance, Kevin Costner links to Kevin Bacon by one direct link: both were in JFK. Julia Louis-Dreyfus of TV's Seinfeld, however, needs two links to make a path: Julia Louis-Dreyfus was in Christmas Vacation (1989) with Keith MacKechnie. Keith MacKechnie was in We Married Margo (2000) with Kevin Bacon. You can play the game by using the following URL
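Finding such a path is a shortest-path search on an unweighted graph in which two actors are linked if they appeared in the same film. A minimal sketch using breadth-first search follows; the function name `shortest_path` and the tiny `costars` dictionary are illustrative assumptions, with the links taken from the two examples in the text.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the vertices on one shortest path, or None."""
    if start == goal:
        return [start]
    parent = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in parent:
                parent[neighbor] = current
                if neighbor == goal:
                    # Walk the parent links back to the start.
                    path = [goal]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None

# Toy co-star graph (hypothetical data structure, links from the text).
costars = {
    "Julia Louis-Dreyfus": ["Keith MacKechnie"],
    "Keith MacKechnie": ["Julia Louis-Dreyfus", "Kevin Bacon"],
    "Kevin Costner": ["Kevin Bacon"],
    "Kevin Bacon": ["Keith MacKechnie", "Kevin Costner"],
}
path = shortest_path(costars, "Julia Louis-Dreyfus", "Kevin Bacon")
links = len(path) - 1  # number of edges on the path
```

Because breadth-first search explores vertices in order of distance, the first time the goal is reached the path found has the minimum number of links.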

The Erdös Number A similar idea is also used in the mathematical community and is called the Erdös number of a researcher. Paul Erdös (1913–1996) wrote hundreds of mathematical research papers in many different areas, many in collaboration with others. There is a link between any two mathematicians if they co-authored a paper. Paul Erdös is the root of the mathematical research network and his Erdös number is 0. Erdös’s co-authors have Erdös number 1. People other than Erdös who have written a joint paper with someone with Erdös number 1 but not with Erdös have Erdös number 2, and so on.
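The definition above amounts to assigning every researcher their breadth-first distance from the root of the co-authorship graph. A small sketch under that reading follows; the function name `collaboration_numbers` and the author labels in `papers` are made up for illustration and are not real publication data.

```python
from collections import deque
from itertools import combinations

def collaboration_numbers(papers, root):
    """Map each author reachable from root to their distance from root."""
    # Link every pair of authors who share at least one paper.
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set())
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    # Breadth-first search assigns numbers level by level.
    numbers = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for coauthor in graph.get(current, ()):
            if coauthor not in numbers:
                numbers[coauthor] = numbers[current] + 1
                queue.append(coauthor)
    return numbers

# Hypothetical author lists, one list per paper.
papers = [["Erdos", "A"], ["A", "B"], ["Erdos", "C", "B"], ["B", "D"], ["E", "F"]]
numbers = collaboration_numbers(papers, "Erdos")
```

Authors with no chain of co-authorships leading to the root (here `E` and `F`) simply do not appear in the result.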

Running Example

Hijackers by Flight
Flight 77 : Pentagon    Flight 11 : WTC 1     Flight 175 : WTC 2    Flight 93 : PA
Khalid Al-Midhar        Satam Al Suqami       Marwan Al-Shehhi      Saeed Alghamdi
Majed Moqed             Waleed M. Alshehri    Fayez Ahmed           Ahmed Alhaznawi
Nawaq Alhamzi           Wail Alshehri         Ahmed Alghamdi        Ahmed Alnami
Salem Alhamzi           Mohamed Atta          Hamza Alghamdi        Ziad Jarrahi
Hani Hanjour            Abdulaziz Alomari     Mohald Alshehri

Automatic layout of networks Pretty Graph Drawing

Motivation I In order to display large networks on the screen we need to use automatic layout algorithms. These algorithms display the graphs in an aesthetic way without any user intervention. The most commonly used aesthetic criteria are to expose symmetries and to make the drawing as compact as possible or, alternatively, to fill the space available for the drawing.

Motivation II Many of the “higher-level” aesthetic criteria are implicit consequences of: –minimized number of edge crossings –evenly distributed edge length –evenly distributed vertex positions on the graph area –sufficiently large vertex-edge distances –sufficiently large angular resolution between edges.

Disadvantages of the Spring-based Methods They are computationally expensive, so minimizing the energy function for large graphs is computationally prohibitive. Since all methods rely on heuristics, there is no guarantee that the “best” layout will be found. The methods behave as black boxes and hence it is almost impossible to integrate additional constraints on the layout (such as fixing the positions of certain vertices, or specifying the relative ordering of the vertices). Even when the graphs are planar it is quite possible that we will get edge crossings. The methods try to optimize just the placement of vertices and edges while ignoring the exact shape of the vertices or the fact that the vertices may have labels.

Kamada and Kawai’s (KK) Method

Fruchterman Reingold (FR) Method

Classic Graph Operations

Finding the shortest Path (from Atta)

A better Visualization

Centrality

Degree If the graph is undirected then the degree of a vertex v ∈ V is the number of other vertices that are directly connected to it.
–degree(v) = |{(v1, v2) ∈ E | v1 = v or v2 = v}|
If the graph is directed then we can talk about in-degree or out-degree. An edge (v1, v2) ∈ E in the directed graph leads from vertex v1 to v2.
–In-degree(v) = |{(v1, v) ∈ E}|
–Out-degree(v) = |{(v, v2) ∈ E}|
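The three set definitions translate almost literally into code when the edge set E is stored as a set of pairs. This is a minimal sketch; the function names and the four-vertex example graph are illustrative assumptions.

```python
def degree(edges, v):
    """Undirected degree: edges that have v as either endpoint."""
    return sum(1 for v1, v2 in edges if v1 == v or v2 == v)

def in_degree(edges, v):
    """Directed in-degree: edges (v1, v) leading into v."""
    return sum(1 for _, v2 in edges if v2 == v)

def out_degree(edges, v):
    """Directed out-degree: edges (v, v2) leading out of v."""
    return sum(1 for v1, _ in edges if v1 == v)

# Hypothetical edge set; read as undirected for degree(), directed otherwise.
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
c_degree = degree(edges, "c")
c_in = in_degree(edges, "c")
c_out = out_degree(edges, "c")
```

For a graph without self-loops, the undirected degree of a vertex equals the sum of its in-degree and out-degree over the same edge set.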

Degree of the Hijackers

Closeness Centrality - Motivation Degree centrality measures might be criticized because they only take into account the direct connections that an entity has, rather than indirect connections to all other entities. One entity might be directly connected to a large number of entities that might be pretty isolated from the network. Such an entity is central only in a local neighborhood of the network.

Closeness Centrality This measure is based on the calculation of the geodesic distance between the entity and all other entities in the network. We can either use directed or undirected geodesic distances between the entities. The sum of these geodesic distances for each entity is the "farness" of the entity from all other entities. We can convert this into a measure of closeness centrality by taking the reciprocal. In addition, we can normalize the closeness measure by dividing it by the closeness measure of the most central entity.

Closeness: Formally Let d(v1, v2) be the minimal distance between v1 and v2, i.e., the minimal number of edges that we need to traverse on the way from v1 to v2.
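The slide's formula did not survive in this transcript, so the following sketch follows the verbal description given earlier: farness is the sum of geodesic distances from a vertex to all others, closeness is the reciprocal of farness, and the optional normalization divides by the closeness of the most central vertex. The function names are hypothetical, and the sketch assumes an undirected, connected graph given as an adjacency dictionary.

```python
from collections import deque

def distances_from(graph, source):
    """Geodesic (fewest-edges) distance from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist

def closeness(graph, normalize=False):
    """Reciprocal of farness; assumes the graph is connected."""
    scores = {}
    for v in graph:
        farness = sum(distances_from(graph, v).values())
        scores[v] = 1.0 / farness if farness > 0 else 0.0
    if normalize:
        top = max(scores.values())
        if top > 0:
            scores = {v: s / top for v, s in scores.items()}
    return scores

# Path graph a - b - c: the middle vertex is closest to everyone.
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
raw = closeness(path_graph)
relative = closeness(path_graph, normalize=True)
```

Other conventions exist, for example multiplying the reciprocal by the number of other vertices, so absolute values may differ from the table for the hijackers network while the ranking stays the same.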

Closeness of the Hijackers
Name                  Closeness
Abdulaziz Alomari     0.6
Ahmed Alghamdi
Ziad Jarrahi
Fayez Ahmed
Mohamed Atta
Majed Moqed
Salem Alhamzi
Hani Hanjour          0.5
Marwan Al Shehhi
Satam Al Suqami
Waleed M. Alshehri
Wail Alshehri
Hamza Alghamdi        0.45
Khalid Al Midhar
Mohald Alshehri
Nawaq Alhamzi
Saeed Alghamdi
Ahmed Alnami
Ahmed Alhaznawi

Betweenness Centrality Betweenness centrality measures how effectively a vertex connects the various parts of the network. The main idea behind betweenness centrality is that entities that are mediators have more power. Entities that are on many geodesic paths between other pairs of entities are more powerful since they control the flow of information between the pairs.

Betweenness - Formally
Highest Possible Betweenness
$g_{jk}$ = the number of geodesic paths that connect $v_j$ with $v_k$
$g_{jk}(v_i)$ = the number of geodesic paths that connect $v_j$ with $v_k$ and pass via $v_i$.
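The slide's formula is missing from this transcript. A common definition built from the two quantities above, assumed here, sums the ratio of geodesic paths through v_i to all geodesic paths over every pair j < k of other vertices. The sketch below computes it with Brandes' accumulation technique, which is a choice made for this example and not necessarily the method used in the slides; it expects an undirected graph as an adjacency dictionary in which every neighbor also appears as a key.

```python
from collections import deque

def betweenness(graph):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                           # vertices in order of discovery
        preds = {v: [] for v in graph}       # predecessors on geodesic paths
        sigma = {v: 0 for v in graph}        # number of geodesic paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions from the farthest vertices back to s.
        delta = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair (j, k) was counted from both endpoints.
    return {v: b / 2.0 for v, b in score.items()}

star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
star_scores = betweenness(star)
```

In an undirected network with n vertices, the center of a star attains the largest possible value, (n - 1)(n - 2) / 2, which is 3 for the four-vertex star above.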

Betweenness of the Hijackers

Eigenvector Centrality The main idea behind eigenvector centrality is that entities receiving many communications from other well-connected entities will be better and more valuable sources of information, and hence be considered central. The eigenvector centrality scores correspond to the values of the principal eigenvector of the adjacency matrix M. Formally, the vector v satisfies the equation $M v = \lambda v$, where $\lambda$ is the corresponding eigenvalue and M is the adjacency matrix.
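One simple way to approximate the principal eigenvector is power iteration, sketched below in plain Python. This is an illustrative assumption about how the scores could be computed, not the method used to produce the slides' numbers. The iteration multiplies by M + I instead of M, which keeps the same eigenvectors while avoiding oscillation on bipartite graphs; the sketch assumes a connected, undirected graph.

```python
import math

def eigenvector_centrality(matrix, iterations=1000, tol=1e-10):
    """Approximate the principal eigenvector of a symmetric 0/1 matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # One multiplication by (M + I): w_i = v_i + sum_j M_ij * v_j
        w = [v[i] + sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]            # rescale to unit length
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v

# Hypothetical graph: triangle 0-1-2 plus a pendant vertex 3 attached to 0.
M = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
scores = eigenvector_centrality(M)
```

Vertex 0 scores highest, and vertices 1 and 2 score higher than vertex 3 even though all three are neighbors of vertex 0, because each of them is also connected to another well-connected vertex.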

Eigenvector Centralities of the Hijackers
Name                  E1
Mohamed Atta          0.518
Marwan Al-Shehhi      0.489
Abdulaziz Alomari     0.296
Ziad Jarrahi          0.246
Fayez Ahmed           0.246
Satam Al Suqami       0.241
Waleed M. Alshehri    0.241
Wail Alshehri         0.241
Salem Alhamzi         0.179
Majed Moqed           0.165
Hani Hanjour          0.151
Khalid Al-Midhar      0.114
Ahmed Alghamdi        0.085
Nawaq Alhamzi         0.064
Mohald Alshehri       0.054
Hamza Alghamdi        0.015
Saeed Alghamdi        0.002
Ahmed Alnami          0
Ahmed Alhaznawi       0

Power Centrality Given an adjacency matrix M, the power centrality of vertex i (denoted $c_i$) is given by $c_i(\alpha, \beta) = \sum_j (\alpha + \beta c_j) M_{ij}$. $\alpha$ is used to normalize the score; the normalization parameter is automatically selected so that the sum of squares of the vertices’ centralities is equal to the number of vertices in the network. $\beta$ is an attenuation factor that controls the effect that the power centralities of the neighboring vertices should have on the power centrality of the vertex.
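The sketch below assumes the standard Bonacich form of this measure, c_i = sum over j of (alpha + beta * c_j) * M_ij, where beta is the attenuation factor and alpha the normalization parameter; the slide's own formula is not legible in this transcript, so that form is an assumption. Rearranged, c = alpha * x where x solves (I - beta * M) x = M * 1, and alpha is then chosen so that the squared scores sum to the number of vertices. The helper names `solve` and `power_centrality` are hypothetical.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    rows = [list(a[i]) + [b[i]] for i in range(n)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < 1e-12:
            raise ValueError("singular system; choose a different beta")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for k in range(col, n + 1):
                rows[r][k] -= factor * rows[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(rows[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (rows[i][n] - tail) / rows[i][i]
    return x

def power_centrality(matrix, beta):
    """Power centrality under the assumed Bonacich formula."""
    n = len(matrix)
    degrees = [float(sum(row)) for row in matrix]     # M * 1
    system = [[(1.0 if i == j else 0.0) - beta * matrix[i][j]
               for j in range(n)] for i in range(n)]
    raw = solve(system, degrees)
    total = sum(x * x for x in raw)
    if total == 0:
        return raw                                    # graph without edges
    alpha = math.sqrt(n / total)                      # sum of squares becomes n
    return [alpha * x for x in raw]

# Star with vertex 0 at the center (hypothetical example).
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
by_degree = power_centrality(star, 0.0)
attenuated = power_centrality(star, 0.2)
```

With beta set to 0 the scores are simply degrees rescaled by alpha; positive values reward connections to powerful neighbors and negative values penalize them.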

Power - Motivation In a similar way to the eigenvector centrality, the power centrality of each vertex is determined by the centrality of the vertices it is connected to. By specifying positive or negative values for $\beta$, the user can control whether being connected to powerful vertices has a positive or a negative effect on a vertex's score. The rationale for specifying a positive $\beta$ is that if you are connected to powerful colleagues it makes you more powerful. On the other hand, the rationale for a negative $\beta$ is that powerful colleagues have many connections and hence are not controlled by you, while isolated colleagues have no other sources of information and hence are pretty much controlled by you.

Power of the Hijackers
Name                  Power : $\beta$ = 0.99    Power : $\beta$ =
Mohamed Atta
Marwan Al-Shehhi
Abdulaziz Alomari
Ziad Jarrahi
Fayez Ahmed
Satam Al Suqami
Waleed M. Alshehri
Wail Alshehri
Salem Alhamzi
Majed Moqed
Hani Hanjour
Khalid Al-Midhar
Ahmed Alghamdi
Nawaq Alhamzi
Mohald Alshehri
Hamza Alghamdi
Saeed Alghamdi
Ahmed Alnami
Ahmed Alhaznawi

Network Centralization In addition to the individual vertex centralization measures, we can assign a number between 0 and 1 that signals the level of centralization of the whole network. The network centralization measures are computed from the centralization values of the network's vertices, and hence each type of individual centralization measure has an associated network centralization measure. A network that is structured like a circle will have a network centralization value of 0 (since all vertices have the same centralization value), while a network that is structured like a star will have a network centralization value of 1. We will now provide some of the formulas for the different network centralization measures.
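The formulas themselves did not survive in this transcript. As one concrete instance, the sketch below assumes Freeman's degree centralization for undirected graphs: the sum of differences between the largest degree and every vertex's degree, divided by (n - 1)(n - 2), the value that sum takes on a star. The function name and the two example graphs are illustrative.

```python
def degree_centralization(graph):
    """Freeman degree centralization of an undirected graph (assumed formula)."""
    n = len(graph)
    if n < 3:
        raise ValueError("need at least three vertices")
    degrees = [len(neighbors) for neighbors in graph.values()]
    top = max(degrees)
    # Total gap to the most central vertex, scaled by the star's gap.
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

# Circle on five vertices: every vertex has degree 2.
circle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
# Star on five vertices: vertex 0 is linked to everyone else.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
circle_value = degree_centralization(circle)
star_value = degree_centralization(star)
```

The two examples reproduce the extremes described above: the circle yields 0 and the star yields 1.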

Degree: for the hijackers network, Net Degree = 0.31

Betweenness: for the hijackers network, Net Bet = 0.24

Summary Diagram