Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection.

Slides:



Advertisements
Similar presentations
Abstract Shortest distance query is a fundamental operation in large-scale networks. Many existing methods in the literature take a landmark embedding.
Advertisements

Optimizing Cloud Resources for Delivering IPTV Services Through Virtualization.
Abstract Cloud data center management is a key problem due to the numerous and heterogeneous strategies that can be applied, ranging from the VM placement.
Energy-Optimum Throughput and Carrier Sensing Rate in CSMA-Based Wireless Networks.
Annotating Search Results from Web Databases. Abstract An increasing number of databases have become web accessible through HTML form-based search interfaces.
Abstract Load balancing in the cloud computing environment has an important impact on the performance. Good load balancing makes cloud computing more.
A Secure Protocol for Spontaneous Wireless Ad Hoc Networks Creation.
Personalized QoS-Aware Web Service Recommendation and Visualization.
Discovering Emerging Topics in Social Streams via Link Anomaly Detection.
Crowdsourcing Predictors of Behavioral Outcomes. Abstract Generating models from large data sets—and deter¬mining which subsets of data to mine—is becoming.
Secure Encounter-based Mobile Social Networks: Requirements, Designs, and Tradeoffs.
Cross-Domain Privacy-Preserving Cooperative Firewall Optimization.
A Survey of Mobile Cloud Computing Application Models
NICE :Network Intrusion Detection and Countermeasure Selection in Virtual Network Systems.
Dynamic Resource Allocation Using Virtual Machines for Cloud Computing Environment.
Fast Nearest Neighbor Search with Keywords. Abstract Conventional spatial queries, such as range search and nearest neighbor retrieval, involve only conditions.
Understanding the External Links of Video Sharing Sites: Measurement and Analysis.
Security Evaluation of Pattern Classifiers under Attack.
A Framework for Mining Signatures from Event Sequences and Its Applications in Healthcare Data.
Vampire Attacks: Draining Life from Wireless Ad Hoc Sensor Networks.
BestPeer++: A Peer-to-Peer Based Large-Scale Data Processing Platform.
Improving Network I/O Virtualization for Cloud Computing.
m-Privacy for Collaborative Data Publishing
PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
Tweet Analysis for Real-Time Event Detection and Earthquake Reporting System Development.
A Fast Clustering-Based Feature Subset Selection Algorithm for High- Dimensional Data.
Optimal Client-Server Assignment for Internet Distributed Systems.
Protecting Sensitive Labels in Social Network Data Anonymization.
Identity-Based Secure Distributed Data Storage Schemes.
Incentive Compatible Privacy-Preserving Data Analysis.
Enabling Dynamic Data and Indirect Mutual Trust for Cloud Computing Storage Systems.
Hiding in the Mobile Crowd: Location Privacy through Collaboration.
LARS*: An Efficient and Scalable Location-Aware Recommender System.
Cooperative Caching for Efficient Data Access in Disruption Tolerant Networks.
Anonymization of Centralized and Distributed Social Networks by Sequential Clustering.
Accuracy-Constrained Privacy-Preserving Access Control Mechanism for Relational Data.
Abstract Link error and malicious packet dropping are two sources for packet losses in multi-hop wireless ad hoc network. In this paper, while observing.
A System for Denial-of- Service Attack Detection Based on Multivariate Correlation Analysis.
Modeling the Pairwise Key Predistribution Scheme in the Presence of Unreliable Links.
Anomaly Detection via Online Over-Sampling Principal Component Analysis.
A Method for Mining Infrequent Causal Associations and Its Application in Finding Adverse Drug Reaction Signal Pairs.
A Generalized Flow-Based Method for Analysis of Implicit Relationships on Wikipedia.
Keyword Query Routing.
Towards Online Shortest Path Computation. Abstract The online shortest path problem aims at computing the shortest path based on live traffic circumstances.
Facilitating Document Annotation using Content and Querying Value.
Privacy Preserving Back- Propagation Neural Network Learning Made Practical with Cloud Computing.
Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm.
Participatory Privacy: Enabling Privacy in Participatory Sensing
Preventing Private Information Inference Attacks on Social Networks.
Video Dissemination over Hybrid Cellular and Ad Hoc Networks.
Scalable Keyword Search on Large RDF Data. Abstract Keyword search is a useful tool for exploring large RDF datasets. Existing techniques either rely.
DCIM: Distributed Cache Invalidation Method for Maintaining Cache Consistency in Wireless Mobile Networks.
Supporting Privacy Protection in Personalized Web Search.
Twitsper: Tweeting Privately. Abstract Although online social networks provide some form of privacy controls to protect a user's shared content from other.
m-Privacy for Collaborative Data Publishing
A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud.
A New Algorithm for Inferring User Search Goals with Feedback Sessions.
Data Mining with Big Data. Abstract Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development.
Harnessing the Cloud for Securely Outsourcing Large- Scale Systems of Linear Equations.
Privacy-Enhanced Web Service Composition. Abstract Data as a Service (DaaS) builds on service-oriented technologies to enable fast access to data resources.
Privacy-Preserving and Content-Protecting Location Based Queries.
Mona: Secure Multi-Owner Data Sharing for Dynamic Groups in the Cloud.
Whole Test Suite Generation. Abstract Not all bugs lead to program crashes, and not always is there a formal specification to check the correctness of.
Distributed Processing of Probabilistic Top-k Queries in Wireless Sensor Networks.
Load Rebalancing for Distributed File Systems in Clouds.
Facilitating Document Annotation Using Content and Querying Value.
Fast Transmission to Remote Cooperative Groups: A New Key Management Paradigm.
Dynamic Query Forms for Database Queries. Abstract Modern scientific databases and web databases maintain large and heterogeneous data. These real-world.
Spatial Approximate String Search. Abstract This work deals with the approximate string search in large spatial databases. Specifically, we investigate.
Presentation transcript:

Document Clustering for Forensic Analysis: An Approach for Improving Computer Inspection

Abstract In computer forensic analysis, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, whose analysis by computer examiners is dif¬ficult to be performed. In this context, automated methods of anal¬ysis are of great interest. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowl¬edge from the documents under analysis. We present an approach that applies document clustering algorithms to forensic analysis of computers seized in police investigations. We illustrate the pro¬posed approach by carrying out extensive experimentation with six well- known clustering algorithms (K-means, K-medoids, Single Link, Complete Link, Average Link, and CSPA) applied to five real-world datasets obtained from computers seized in real-world investigations.

Abstract con… Experiments have been performed with different combinations of parameters, resulting in 16 different instantiations of algorithms. In addition, two relative validity indexes were used to automatically estimate the number of clusters. Related studies in the literature are significantly more limited than our study. Our experiments show that the Average Link and Complete Link algo¬rithms provide the best results for our application domain. If suit-ably initialized, partitional algorithms (K-means and K-medoids) can also yield to very good results. Finally, we also present and dis¬cuss several practical results that can be useful for researchers and practitioners of forensic computing.

Existing system IT IS estimated that the volume of data in the digital world increased from 161 hexabytes in 2006 to 988 hexabytes in 2010 [1]—about 18 times the amount of information present in all the books ever written—and it continues to grow exponen¬tially. This large amount of data has a direct impact in Computer Forensics, which can be broadly defined as the discipline that combines elements of law and computer science to collect and analyze data from computer systems in a way that is admissible as evidence in a court of law. In our particular application do¬main, it usually involves examining hundreds of thousands of files per computer. This activity exceeds the expert’s ability of analysis and interpretation of data. Therefore, methods for auto¬mated data analysis, like those widely used for machine learning and data mining, are of paramount importance. In particular, al¬gorithms for pattern recognition from the information present in text documents are promising, as it will hopefully become evi¬dent later in the paper.

Architecture Diagram

System specification HARDWARE REQUIREMENTS Processor : intel Pentium IV Ram : 512 MB Hard Disk : 80 GB HDD SOFTWARE REQUIREMENTS Operating System : windows XP / Windows 7 FrontEnd : Java BackEnd : MySQL 5

CONCLUSION We presented an approach that applies document clustering methods to forensic analysis of computers seized in police in¬vestigations. Also, we reported and discussed several practical results that can be very useful for researchers and practitioners of forensic computing. More specifically, in our experiments the hierarchical algorithms known as Average Link and Com¬plete Link presented the best results. Despite their usually high computational costs, we have shown that they are particularly suitable for the studied application domain because the dendro¬grams that they provide offer summarized views of the docu¬ments being inspected, thus being helpful tools for forensic ex¬aminers that analyze textual documents from seized computers. As already observed in other application domains, dendrograms provide very informative descriptions and visualization capabil¬ities of data clustering structures [5].

THANK YOU