Academic Advisor: Dr. Yuval Elovici
Technical Advisor: Dr. Lidror Troyansky
ADD Presentation
Develop a system which will:
–be able to configure the search parameters.
–scan the P2P networks.
–download files suspected of being confidential.
–analyze the material using Machine Learning.
–generate reports.
–produce statistics.
Scanning the P2P network (Gnutella) for suspicious target information (e.g. confidential documents).
Downloading the suspicious target information from the P2P network (Gnutella).
Analyzing the scanned results (determining the value of the documents).
–The system will use Machine Learning, based on the filtering algorithm, to classify the documents.
Statistics Gathering:
–The number of users who currently hold the target information.
–The geographic location of the leaked information, found using IP geolocation.
–The history of files searched for, downloaded, and analyzed.
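The first statistic above (how many users currently hold a target file) can be sketched as a count of distinct hosts per file. This is only an illustration: the `SearchHit` and `LeakStatistics` classes and their fields are hypothetical names, not part of the project's design, and it assumes each search hit carries the host IP and file name.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical record of one search hit: which host offers which file.
final class SearchHit {
    final String hostIp;
    final String fileName;

    SearchHit(String hostIp, String fileName) {
        this.hostIp = hostIp;
        this.fileName = fileName;
    }
}

// Hypothetical helper that derives "number of holders" from raw search hits.
final class LeakStatistics {
    // Counts the distinct host IPs offering each file; repeated hits
    // from the same host for the same file are counted once.
    static Map<String, Integer> holdersPerFile(List<SearchHit> hits) {
        Map<String, Set<String>> hostsByFile = new TreeMap<>();
        for (SearchHit hit : hits) {
            hostsByFile.computeIfAbsent(hit.fileName, k -> new HashSet<>())
                       .add(hit.hostIp);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : hostsByFile.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }
}
```

The same host IPs could then be passed to an IP geolocation lookup to obtain the geographic statistic; that step is left out because it depends on the chosen geolocation source.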
Performance constraints:
–The system should return a search result for a suspicious target within 15 minutes.
–The system should not impose a fixed limit on the download time of a target. (Remark: the limit should be configurable, and a time-out should always be set by default.)
–The system should keep history results and statistics for no more than one year.
Safety and Security:
–The system will not be used for any purpose other than finding information leaks in P2P networks (for example, it will not be used to find MP3 shares).
–The system will not expose the confidential documents it downloads or the documents used in the Machine Learning algorithm.
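The performance constraints above could be captured in a small settings object. The sketch below is an assumption about how that might look: the class name `InspectorSettings` and the 30-minute default download time-out are hypothetical (the slides only say that a default time-out must exist), while the 15-minute search limit and the one-year retention limit come from the constraints.

```java
import java.time.Duration;

// Hypothetical settings holder for the constraints listed above.
final class InspectorSettings {
    // Taken from the constraints: search results within 15 minutes,
    // history kept for no more than one year.
    static final Duration MAX_SEARCH_TIME = Duration.ofMinutes(15);
    static final Duration MAX_RETENTION = Duration.ofDays(365);
    // Assumed value: the source only requires that some default exists.
    static final Duration DEFAULT_DOWNLOAD_TIMEOUT = Duration.ofMinutes(30);

    private Duration downloadTimeout = DEFAULT_DOWNLOAD_TIMEOUT;
    private Duration retention = MAX_RETENTION;

    Duration getDownloadTimeout() {
        return downloadTimeout;
    }

    Duration getRetention() {
        return retention;
    }

    // The download time-out is configurable but must always be a positive value.
    void setDownloadTimeout(Duration timeout) {
        if (timeout == null || timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("download time-out must be positive");
        }
        this.downloadTimeout = timeout;
    }

    // The retention period may be shortened but not extended beyond one year.
    void setRetention(Duration period) {
        if (period == null || period.isNegative() || period.compareTo(MAX_RETENTION) > 0) {
            throw new IllegalArgumentException("retention must be between 0 and 365 days");
        }
        this.retention = period;
    }
}
```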
The system is built from several components, which are written in different languages and communicate with each other in several ways. All software modules reside on the same computer.
–IGTellaHandler- The primary responsibility of this component is downloading documents from the Gnutella network. The IGTellaHandler is written in Java and communicates with the main component (P2PInspectorGadget) via RMI (to increase the decoupling between the components).
–IGConfClassifier- The primary responsibility of this component is classifying documents using different classification rules. The output of this process will be saved in the database and will be available for further use.
–IGDBHandler- The primary responsibility of this component is connecting to an external database and serving as the interface between the system's modules and the database. IGDBHandler will be written in Java and will communicate with the main component via RMI.
–P2PInspectorGadget- This is the system's main component. It has two primary responsibilities: the first is interacting with the user via the Graphical User Interface, and the second is controlling the flow of the system. P2PInspectorGadget will be written in Java, will connect to the other components through the connections mentioned above, and will not communicate with any external system.
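The RMI link between the main component and a handler such as IGTellaHandler could look roughly like the sketch below. The interface `DownloadService`, its `search` method, the canned implementation, and the registry name are all hypothetical; only the standard `java.rmi` calls are real. The actual remote interface of the project is not given in these slides.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;
import java.util.ArrayList;
import java.util.List;

// Hypothetical remote interface that the main component would call.
interface DownloadService extends Remote {
    List<String> search(String keyword) throws RemoteException;
}

// Placeholder implementation: returns a canned result instead of
// querying the Gnutella network.
final class CannedDownloadService implements DownloadService {
    @Override
    public List<String> search(String keyword) {
        List<String> hits = new ArrayList<>();
        hits.add(keyword + "-report.doc");
        return hits;
    }
}

final class RmiWiring {
    // Exports the service on an anonymous port and binds its stub in a
    // registry created on the given port of the local machine.
    static Registry publish(DownloadService service, int registryPort)
            throws RemoteException {
        Remote stub = UnicastRemoteObject.exportObject(service, 0);
        Registry registry = LocateRegistry.createRegistry(registryPort);
        registry.rebind("IGTellaHandler", stub);
        return registry;
    }
}
```

On the caller's side, the main component would obtain the stub with `LocateRegistry.getRegistry(port).lookup("IGTellaHandler")` and cast it to `DownloadService`. Because callers depend only on the interface, the handler can be replaced without changing the main component, which is the decoupling the design aims for.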
Searching files – seq. diagram
Unit testing
All the units will be tested for every use case. For each use case, all of the possible paths will be tested. Unit testing is part of the project's design: automated tests run continuously while we develop the system. Here are some of the tests in the test plan:
–[Start system] Start the system with a firewall blocking the ports needed for P2P, and verify that the system doesn't crash and outputs the right error message.
–[Scan network] Verify that this process concludes after a pre-defined time-out.
–[Analyze downloaded files] Verify that the system converts the different text formats (DOC, PPT and PDF) correctly into "raw" text.
–[Analyze downloaded files] Verify the accuracy of the algorithm (achieving the false-positive and true-positive rates defined in the project's targets).
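The [Scan network] check above could be automated with a small harness that runs the scan under a deadline. This is a sketch under assumptions: `TimedScan` and `finishesWithin` are hypothetical names, and the scan is modelled as a plain `Runnable` rather than the project's real scanning code.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical test helper: runs a task and reports whether it finished in time.
final class TimedScan {
    // Returns true if the task completes within the time-out; otherwise
    // interrupts the task and returns false.
    static boolean finishesWithin(Runnable scan, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(scan);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the still-running scan
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("scan task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A unit test would pass the real scan as the `Runnable` together with the configured time-out. The helper only stops tasks that respond to interruption, so the scanning code would need to check for interrupts or use interruptible I/O.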
Acceptance Tests:
As part of the acceptance tests, all of the use cases will be fully checked from beginning to end. In addition, all of the non-functional requirements will be tested to make sure they meet their targets:
–System's History: To verify that the system keeps all the information for the period the user has defined (the default is 1 year), we shall manually change the system's clock and check that the data that needs to be kept is kept and the data that should have been deleted is deleted.
–System legitimacy (non-pirate uses): The system will be blocked from uploading data. This will be checked by planting a unique media file (for example MP3 or MPEG) that we composed, with a unique name, and trying to download it with a different client; the download is expected to fail.
–Content Safety: To ensure content safety (classified documents used for the learning part of the algorithm will not be exposed to the P2P network), the two sub-applications run as separate processes with separate memory spaces. The test will attempt, from another client, to download the classified documents or the list of documents from the process that connects to the P2P network.
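As a complement to the manual clock change described in the System's History test, the retention rule could also be checked automatically by injecting a clock into the history component. The sketch below is an assumption, not the project's design: `HistoryStore` and its methods are hypothetical, and entries are reduced to a file name and a timestamp.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical history store whose notion of "now" comes from an
// injected Clock, so tests can simulate the passage of time.
final class HistoryStore {
    private final Clock clock;
    private final Duration retention;
    private final Map<String, Instant> entries = new LinkedHashMap<>();

    HistoryStore(Clock clock, Duration retention) {
        this.clock = clock;
        this.retention = retention;
    }

    void add(String fileName, Instant recordedAt) {
        entries.put(fileName, recordedAt);
    }

    // Deletes every entry recorded before (now - retention period).
    void purgeExpired() {
        Instant cutoff = clock.instant().minus(retention);
        entries.values().removeIf(recordedAt -> recordedAt.isBefore(cutoff));
    }

    boolean contains(String fileName) {
        return entries.containsKey(fileName);
    }

    int size() {
        return entries.size();
    }
}
```

In a test, `Clock.fixed(...)` supplies a chosen "current" time, so an entry older than the retention period and a newer one can be added and the purge result checked without touching the machine's clock. In production the store would receive `Clock.systemUTC()`.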
Create and integrate the GUI.
Find a list of working Gnutella1 servers.
Inspect and learn the classification algorithm.
Integrate the Python-written algorithm into Java.
Finish the PDF 2 DOC converter.
Finish the Gnutella driver (able to search for and download documents).