Supervisor: Mr. Phan Trường Lâm
Team information
Agenda Introduction Project plan System Requirement Specifications System Analysis and Design Testing Deployment and User Guide Summary Demo and Q&A
Introduction Initial Idea Literature Review of Existing System Proposal & Product
Initial Idea
We decided to develop a new system that integrates: Collect documents Organize these documents Extract keywords Ranking Searching
Literature Review of Existing System Methods that these websites use to build their systems: Big database Search Ranking and highlight return results Compare documents to detect plagiarism
Literature Review Achievements of the existing systems Attractive Easy to use Speed & Reliability Quality Results Ensuring Security Awareness Limitations of the existing systems Costs Privacy
Proposal Collect and manage Capstone projects Support looking up Capstone projects Avoid repeating and copying ideas Ranking results Refer to other materials Friendly interface like Google Cheaper to build Free to use Public for everyone Inside and outside University
Product (in future) Mobile application Web application
Project Plan Development environment Process Project organization Project schedule Risk management
Development Environment Hardware: Core 2 Duo 2.0 GHz, 2 GB of RAM, 100 GB of hard disk Software
Process Follow Waterfall model
Project organization
Project organization Controlling and Monitoring: Meeting, Assign task, Track task, Resolve issue, Review task, Report
Project organization Communication control Online activity: Chat, Phone Offline activity: Kick-off project, Team building
Project Schedule Overall plan
Risk Management People risk Estimation risk Technology risk Requirement risk Schedule risk
System Requirement Specifications User Requirements System Requirements Non-functional requirements
User Requirements Lecturers and Students: Search project documents. Download documents. Librarians: Edit profile. Search documents. Add/Edit/Delete document. Add/Edit/Delete category. Administrator Edit profile. Add/Edit/Delete account.
User Requirements Other requirement Searched results will be ranked. Document has following information: Name Author Supervisor Category Description
User Requirements Input files: Keyword file Abstract file Full document file Other materials
System Requirements Communicate with client computers over HTTP, using standard protocols for service-based interactions. Configuration Server: Windows Server 2008 operating system, .NET Framework 3.5, SQL Server 2008, IIS 7 Client: Web browser
Non-functional Requirements Usability Availability Security Reliability Performance Maintainability
System Analysis and Design Architectural design Detail design Database design Coding convention Extract Keyword algorithm Ranking
Architectural design Overall architecture MVC architecture design pattern
Detail design CProDMS Component Diagram
Database design Entity diagram
Coding convention Follow: Microsoft .NET Library Standards FxCop rules and Code Analysis for Managed Code Warnings
Extract Keyword Algorithm Introduction Study Algorithm Evaluation Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information (YUTAKA MATSUO and MITSURU ISHIZUKA) (Dec. 10, 2003)
Algorithm – What is the keyword? Position Meaning Frequency
Algorithm – Step by step Preprocessing: Discard stop words, Stem, Extract frequency Processing: Select frequent terms, Expected probability, Calculate χ'² value Output
Algorithm – Studying Example
Original Text: Information is the most powerful weapon in the modern society. Every day we are overflowed with a huge amount of data in form of electronic newspaper articles, s, web pages and search results. Often, information we receive is incomplete, such that further search activities are required to enable correct interpretation and usage of this information.
Step 1, Discarded Stop Words: Information powerful weapon modern society day overflowed huge amount data electronic newspaper articles s web pages search results Often information receive incomplete such further search activities required enable correct interpretation usage information
Step 2, Stemmed Words (using Porter Stemming Algorithm): Informat power weapon modern societi day overflow huge amoun data electronic newspaper articl web page search result Often informat receive incomplet such further search activ requir enable correct interpret usag informat
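The stop-word step of the example can be sketched in a few lines of Python. This is only an illustration: the stop-word list below is a tiny hypothetical one chosen to cover the first sentence, the function name is made up, and the Porter stemming step is not reproduced here.

```python
import re

# Hypothetical, deliberately small stop-word list (an assumption, not the project's list).
STOP_WORDS = {"is", "the", "most", "in", "every", "we", "are",
              "with", "a", "of", "and", "that", "to", "this", "form"}

def discard_stop_words(text, stop_words=STOP_WORDS):
    """Split text into alphabetic tokens and drop the ones found in stop_words."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in stop_words]

sentence = "Information is the most powerful weapon in the modern society."
remaining = discard_stop_words(sentence)
print(remaining)  # ['Information', 'powerful', 'weapon', 'modern', 'society']
```

The remaining tokens would then be passed to a stemmer before frequencies are counted.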
Algorithm – Studying Select frequent terms The top ten frequent terms (denoted as G) and their probability of occurrence, normalized so that the sum is 1. According to our study, the number of keywords is about 10% of the number of terms in the document and no more than 30 terms.
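A minimal sketch of selecting the frequent-term set G and normalizing the occurrence probabilities so they sum to 1. The function name and the `top_n` parameter are illustrative assumptions; the input is assumed to be the list of stemmed terms after stop-word removal.

```python
from collections import Counter

def select_frequent_terms(terms, top_n=10):
    """Return {term: probability} for the top_n most frequent terms.

    Probabilities are normalized over the selected terms only,
    so that they sum to 1.
    """
    top = Counter(terms).most_common(top_n)
    total = sum(count for _, count in top)
    return {term: count / total for term, count in top}

terms = ["informat", "search", "informat", "data", "informat", "search"]
G = select_frequent_terms(terms, top_n=2)
print(G)  # {'informat': 0.6, 'search': 0.4}
```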
Algorithm – Studying Co-occurrence and Importance Two terms in a sentence are considered to co-occur once. Example: The imitation game could then be played with the machine in question and the mimicking digital computer and the interrogator would be unable to distinguish them. “imitation” and “digital computer” have one co-occurrence
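The counting rule (two terms in the same sentence co-occur once) could be sketched as follows. Each sentence is assumed to be already reduced to a list of terms; the function name is illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count, for each unordered pair of terms, the number of sentences containing both."""
    counts = Counter()
    for sentence in sentences:
        # A pair is counted at most once per sentence, so duplicates are removed first.
        unique_terms = sorted(set(sentence))
        for a, b in combinations(unique_terms, 2):
            counts[(a, b)] += 1
    return counts

sentences = [
    ["imitation", "game", "machine", "digital computer", "interrogator"],
    ["imitation", "game"],
]
counts = cooccurrence_counts(sentences)
print(counts[("digital computer", "imitation")])  # 1
print(counts[("game", "imitation")])              # 2
```

Pairs are stored with the two terms in sorted order, so a lookup has to use the same ordering.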
Algorithm – Studying Co-occurrence and Importance
Algorithm – Studying Co-occurrence and Importance The degree of bias of co-occurrence can be used as an indicator of term importance
Algorithm – Studying The statistical value of χ² is defined as
$\chi^2(w) = \sum_{g \in G} \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g}$
where $p_g$ is the unconditional probability of a frequent term $g \in G$ (the expected probability), $n_w$ is the total number of co-occurrences of term $w$ with the frequent terms $G$, and $\mathrm{freq}(w, g)$ is the frequency of co-occurrence of term $w$ and term $g$.
Algorithm – Studying We consider the length of each sentence and revise our definitions: $p_g$ is (the sum of the total number of terms in sentences where $g$ appears) divided by (the total number of terms in the document); $n_w$ is the total number of terms in the sentences where $w$ appears, including $w$.
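As a rough sketch, the χ² value of a term w can be computed from the three quantities named on this slide, following the definition in the cited Matsuo and Ishizuka paper (sum over g of (freq(w, g) − n_w·p_g)² / (n_w·p_g)). The function and argument names are illustrative assumptions.

```python
def chi_square(cooc_w, n_w, p):
    """Chi-square value of a term w.

    cooc_w: dict mapping frequent term g -> freq(w, g)
    n_w:    total number of co-occurrences of w with the frequent terms
    p:      dict mapping frequent term g -> expected probability p_g
    """
    total = 0.0
    for g, p_g in p.items():
        expected = n_w * p_g
        if expected == 0:
            continue  # skip terms with no expected co-occurrence
        observed = cooc_w.get(g, 0)
        total += (observed - expected) ** 2 / expected
    return total

p = {"a": 0.5, "b": 0.5}
biased = chi_square({"a": 8, "b": 2}, n_w=10, p=p)
unbiased = chi_square({"a": 5, "b": 5}, n_w=10, p=p)
print(biased, unbiased)
```

A term whose co-occurrence matches the expected distribution scores 0, while a term biased toward particular frequent terms scores higher.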
Algorithm – Studying
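We use the following function to measure the robustness of the bias values; it subtracts the maximal term from the χ² value:
$\chi'^2(w) = \chi^2(w) - \max_{g \in G} \left\{ \frac{(\mathrm{freq}(w, g) - n_w p_g)^2}{n_w p_g} \right\}$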
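A sketch of this robust variant, under the same assumptions as the plain χ² definition in the cited paper: compute each per-term contribution, then drop the largest one. Names are illustrative.

```python
def chi_square_robust(cooc_w, n_w, p):
    """Chi-square value of w minus its single largest contribution.

    cooc_w: dict g -> freq(w, g); n_w: co-occurrence total of w;
    p: dict g -> expected probability p_g.
    """
    contributions = []
    for g, p_g in p.items():
        expected = n_w * p_g
        if expected == 0:
            continue  # no expected co-occurrence, nothing to compare
        observed = cooc_w.get(g, 0)
        contributions.append((observed - expected) ** 2 / expected)
    if not contributions:
        return 0.0
    return sum(contributions) - max(contributions)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
# w co-occurs only with "a": the dominant contribution is removed.
value = chi_square_robust({"a": 8}, n_w=8, p=p)
print(value)  # 4.0
```

With this adjustment, a term that co-occurs heavily with just one frequent term no longer gets a high score from that single pairing alone.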
Algorithm – Studying
Algorithm – Studying To improve the extracted keywords, we cluster terms. Two major approaches (Hofmann & Puzicha 1998) are: Similarity-based clustering If terms w1 and w2 have a similar distribution of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster. Pairwise clustering If terms w1 and w2 co-occur frequently, w1 and w2 are considered to be in the same cluster. E.g.: Monday is a day in week. Tuesday is a day in week. Wednesday is a day in week.
Algorithm – Studying Similarity-based clustering centers upon red circles. Pairwise clustering focuses on green circles.
Algorithm – Studying Similarity-based clustering Cluster a pair of terms based on their Jensen-Shannon divergence (formula and threshold shown as images on the slide)
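For reference, the standard Jensen-Shannon divergence between two co-occurrence distributions can be sketched as below. This is the textbook definition, not necessarily the exact form on the slide; the clustering threshold is not recoverable from the slide text, so it is left out. Distributions are assumed to be dicts mapping a term to a probability.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions.

    Returns 0 for identical distributions and log(2) for distributions
    with no term in common.
    """
    total = 0.0
    for key in set(p) | set(q):
        pk = p.get(key, 0.0)
        qk = q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            total += 0.5 * pk * math.log(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log(qk / mk)
    return total

monday = {"day": 0.5, "week": 0.5}
tuesday = {"day": 0.5, "week": 0.5}
other = {"search": 1.0}
print(js_divergence(monday, tuesday))  # 0.0
print(js_divergence(monday, other))    # log(2), about 0.693
```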
Algorithm – Studying Pairwise clustering Cluster a pair of terms based on their mutual information (formula and threshold shown as images on the slide)
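A sketch of a pairwise score in the usual pointwise mutual information form, log(P(w1, w2) / (P(w1)·P(w2))), with probabilities estimated from counts. Treating the slide's "mutual information" as this pointwise form is an assumption, and the threshold is not recoverable from the slide text; argument names are illustrative.

```python
import math

def mutual_information(freq_pair, freq_w1, freq_w2, n_total):
    """Pointwise mutual information of two terms, estimated from counts.

    freq_pair: number of co-occurrences of w1 and w2
    freq_w1, freq_w2: individual frequencies of w1 and w2
    n_total: total count used to turn frequencies into probabilities
    """
    if freq_pair == 0:
        return float("-inf")  # the terms never co-occur
    p_pair = freq_pair / n_total
    p1 = freq_w1 / n_total
    p2 = freq_w2 / n_total
    return math.log(p_pair / (p1 * p2))

always_together = mutual_information(4, 4, 4, 16)
independent = mutual_information(1, 4, 4, 16)
print(always_together, independent)
```

Terms that always appear together get a positive score, while terms that co-occur only as often as chance predicts score 0.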
Algorithm – Evaluation Precision: ratio of correct keywords to the number of extracted keywords. Coverage: ratio of indispensable keywords in the list to all the indispensable terms. Frequency index: average frequency of the keywords in the list.
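The three evaluation measures could be computed roughly as follows. The function and argument names are illustrative, the sample data is made up for the example, and the extracted and indispensable lists are assumed to be non-empty.

```python
def evaluate(extracted, correct, indispensable, term_freq):
    """Return (precision, coverage, frequency_index) for a list of extracted keywords.

    extracted:     keywords produced by the algorithm
    correct:       keywords judged to be right
    indispensable: terms that must appear in a good keyword list
    term_freq:     dict term -> frequency in the document
    """
    extracted_set = set(extracted)
    precision = len(extracted_set & set(correct)) / len(extracted_set)
    coverage = len(extracted_set & set(indispensable)) / len(set(indispensable))
    frequency_index = sum(term_freq.get(k, 0) for k in extracted_set) / len(extracted_set)
    return precision, coverage, frequency_index

# Made-up sample data, for illustration only.
result = evaluate(
    extracted=["search", "rank", "day", "week"],
    correct=["search", "rank", "day"],
    indispensable=["search", "keyword"],
    term_freq={"search": 4, "rank": 2, "day": 1, "week": 1},
)
print(result)  # (0.75, 0.5, 2.0)
```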
Ranking – Why? Ranking Result
Ranking
Ranking Rank formula for a term in a collection of documents (source: Automatic Keyword Extraction for Database Search; First examiner: Prof. Dr. techn. Dipl.-Ing. Wolfgang Nejdl; Second examiner: Prof. Dr. Heribert Vollmer; Supervisor: MSc. Dipl.-Inf. Elena Demidova):
$R(t) = F_d(t) \cdot \log(1 + N / N(t))$ (1)
where $R(t)$ is the rank of term $t$ in the whole collection, $N$ is the total number of documents in the collection, $F_d(t)$ is the frequency of term $t$ in the given document, and $N(t)$ is the total number of documents that contain term $t$.
Ranking formula:
$\mathrm{Rank} = d \cdot R_d(t) / R(t)$ (2)
$\Rightarrow \mathrm{Rank} = d \cdot R_d(t) / (F_d(t) \cdot \log(1 + N / N(t)))$ (3)
where $d$ is the reliability coefficient and $R_d(t)$ is the rank of term $t$ in the document, as extracted by the Extract Service.
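Formulas (1) to (3) translate directly into code. The sketch below assumes the natural logarithm, since the slide does not state the base, and uses illustrative function and parameter names.

```python
import math

def collection_rank(freq_in_doc, n_docs, n_docs_with_term):
    """Formula (1): R(t) = Fd(t) * log(1 + N / N(t)); natural log is an assumption."""
    return freq_in_doc * math.log(1 + n_docs / n_docs_with_term)

def final_rank(d, rank_in_doc, freq_in_doc, n_docs, n_docs_with_term):
    """Formulas (2)/(3): Rank = d * Rd(t) / R(t).

    d is the reliability coefficient and rank_in_doc is Rd(t),
    the rank of the term produced by the keyword extraction step.
    """
    return d * rank_in_doc / collection_rank(freq_in_doc, n_docs, n_docs_with_term)

r_t = collection_rank(freq_in_doc=2, n_docs=9, n_docs_with_term=3)
rank = final_rank(d=0.5, rank_in_doc=r_t, freq_in_doc=2, n_docs=9, n_docs_with_term=3)
print(r_t)   # 2 * log(4)
print(rank)  # 0.5
```

In the usage lines, Rd(t) is set equal to R(t) purely so the result reduces to the coefficient d; real values would come from the extraction step.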
Searching
Testing V - model
Testing
Testing Test result No Tester Module code Pass Fail Untested N/A Number of test cases 1 AnhNT Master Page AnhNT Home Page AnhNT Search Result AnhNT User Account AnhNT Error Page NamH Category NamH Document NamH Authenticated NamH User Document Detail Sub total Test coverage % Test successful coverage %
Deployment Package Source Code Client side Server side
User guide
Summary Strong point Enthusiasm Creative Cope with change Weak point Lack of technical skill Lack of management skills Lessons learned Improve technical & management skills Release on-time product with the restriction of time and resource Improve communication skills & problem solving
Demo & Q&A