Pixel Visualization of keyword search results in large databases. Jay Koven Fall 2013
Research Overview ● The problem: Both criminal and civil investigations are being overwhelmed with information in the cyber age. ● New techniques are needed to handle the overload ● Visualization of data can provide solutions
The Investigative Problem ● Datasets are rapidly growing in size for all types of investigation ○ National Security ○ Criminal ○ Civil ● The Datasets ○ Most investigations focus on communications ○ Emails are the largest portion of these communications ○ Chats, IM, phone logs and other social communication channels are also becoming important.
Related Research ● Jigsaw ○ Open Source Investigative tool kit being developed at Georgia Tech. ○ Focus on entity relationships and time relationships ○ Views are traditional
Related Research continued ● Daniel Keim ○ Pixel oriented display visualization ○ Large amounts of data can be viewed at once ○ Alternative display methodologies ○ Personal mailbox analysis
Related Research continued ● Other Visual Analysis Techniques ○ EmailTime - SFU Vancouver ■ Plots relationships over time by sender or by thread ● Run on Enron dataset ● Not sure why ○ Thread Arcs - IBM ■ Traces a single thread using arcs to show trends ● Interactive, highlights individuals, can highlight attributes ● Used to analyze trends ○ Graphs and maps ■ Show relationships but not very useful for ultra-large datasets
Related Research continued ● Chris North - Use of Large Displays ○ Not specific to email, but useful thoughts ● W. Bradford Paley - TextArc ○ Relationships of words in a concordance ○ Images behind my proposal
My proposed research ●Pixel Visualization of Large Datasets ○ Search by Keywords ○ Multiple displays of returned sets ■ Entity - Entity ■ Entity - Keyword ■ Keyword - Time ■ Entity - Time ○ Interaction to Refine Search ■ Add / Remove Keywords ■ Add / Remove Entities ■ Limit time frame ○ Interaction to Drill Down to actual messages ■ By Subject ■ By Message Content
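The search-and-refine loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Message` record, the `search` and `entity_keyword_counts` helpers, and the use of plain substring matching are all hypothetical and are not taken from the proposal. Entities are reduced to message senders to keep the sketch short.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Message:
    # Hypothetical minimal record; a real dataset would carry more fields.
    sender: str
    sent: date
    subject: str
    body: str


def search(messages, keywords, entities=None, start=None, end=None):
    """Return messages containing any keyword (case-insensitive substring
    match), optionally limited to a set of senders and a date range."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for m in messages:
        text = f"{m.subject} {m.body}".lower()
        if not any(k in text for k in wanted):
            continue
        if entities is not None and m.sender not in entities:
            continue
        if start is not None and m.sent < start:
            continue
        if end is not None and m.sent > end:
            continue
        hits.append(m)
    return hits


def entity_keyword_counts(hits, keywords):
    """Count (sender, keyword) co-occurrences: one value per cell of an
    Entity - Keyword display."""
    counts = Counter()
    for m in hits:
        text = f"{m.subject} {m.body}".lower()
        for k in keywords:
            if k.lower() in text:
                counts[(m.sender, k)] += 1
    return counts
```

Adding or removing a keyword or entity, or limiting the time frame, then amounts to calling `search` again with changed arguments and redrawing the displays from the new hit list.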
Key issues to be solved for investigative visualization of emails ● Relative weights of emails must be calculated against some standard ● Visualizations should minimize the distance between points representing related emails to show important clusters around entities, keywords and time.
My proposal - “Document Galaxy” ● Basic idea is to treat documents as stars in a circular galaxy ○ Place relevant data points, such as entities, around the outside with associated weights. ○ Place documents inside the galaxy based on relative “attraction” to the outside points. ● Possible to have multiple outside rings to add additional attributes to calculations ● User interacts with outside rings to add / remove / move attraction points. ● User can explore contents of inner points and clusters to derive information about document content. ● Colors of documents can be used to show additional attributes
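One simple way to realize the "attraction" placement is a weighted centroid, similar in spirit to the RadViz projection: each document sits at the weighted average of the ring points that attract it. The sketch below is an assumption about how the layout could work, not the proposal's actual method; the function names `ring_positions` and `place_document` and the form of the weights are hypothetical.

```python
import math


def ring_positions(anchors, radius=1.0):
    """Space the attraction points (e.g. entity names) evenly around a
    circle of the given radius, starting at angle 0."""
    n = len(anchors)
    return {
        name: (radius * math.cos(2 * math.pi * i / n),
               radius * math.sin(2 * math.pi * i / n))
        for i, name in enumerate(anchors)
    }


def place_document(attraction, positions):
    """Place one document at the weighted average of the anchor positions.

    `attraction` maps anchor name -> non-negative weight (assumed to be
    something like keyword hit counts). Anchors not on the ring are
    ignored. Returns None when the document is attracted to no anchor.
    """
    pulls = [(w, positions[a]) for a, w in attraction.items() if a in positions]
    total = sum(w for w, _ in pulls)
    if total <= 0:
        return None
    x = sum(w * p[0] for w, p in pulls) / total
    y = sum(w * p[1] for w, p in pulls) / total
    return (x, y)
```

With this scheme a document attracted to a single point lands on the ring at that point, and one pulled equally by two opposite points lands at the center. That ambiguity is one reason letting the user move attraction points around the ring could matter.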
Might look something like this
What use is this? ● Might make a good lead-in tool for Jigsaw, reducing the size of the document set to be explored ● Separate tool for exploring e-discovery datasets