Advanced Science and Technology Letters Vol.31 (MulGraB 2013), pp.284-289


An Efficient and Privacy-Preserving Semantic Multi-Keyword Ranked Search over Encrypted Cloud Data

Zhihua Xia, Li Chen, Xingming Sun, and Jin Wang

Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing, China
School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing, China

Abstract. Because of the many advantages of cloud computing, more and more data owners centralize their sensitive data in the cloud. In this paper, we propose a semantic multi-keyword ranked search scheme over encrypted cloud data that simultaneously meets a set of strict privacy requirements. First, we use Latent Semantic Analysis (LSA) to reveal the relationship between terms and documents; the relationship between terms is captured automatically. Second, our scheme employs the secure k-nearest neighbor (k-NN) technique to achieve secure search functionality. The proposed scheme returns not only the exactly matching files but also files containing terms that are latently semantically associated with the query keywords. Finally, the experimental results show that our method performs better than the original MRSE scheme.

1 Introduction

Due to the rapid expansion of data, data owners tend to store their data in the cloud to relieve the burden of data storage and maintenance [1]. However, because cloud customers and the cloud server are not in the same trusted domain, outsourced data may be exposed to risk. Sensitive data therefore needs to be encrypted before it is sent to the cloud, to protect data privacy and combat unsolicited access.

Fuzzy keyword search schemes [2-4] have been developed for searching encrypted data. Chuah et al. [2] propose a privacy-aware bed-tree method to support fuzzy multi-keyword search. This approach uses edit distance to build fuzzy keyword sets, and Bloom filters are constructed for every keyword.
It then constructs an index tree for all files, in which each leaf node holds a hash value of a keyword. Li et al. [3] exploit edit distance to quantify keyword similarity and construct storage-efficient fuzzy keyword sets; in particular, a wildcard-based fuzzy set construction is designed to reduce storage overhead. Wang et al. [4] employ wildcard-based fuzzy sets to build a private trie-traverse search index. These fuzzy search methods tolerate minor typos and format inconsistencies, but they do not support semantic fuzzy search. Considering the existence of polysemy and synonymy [5], a model that supports both multi-keyword ranked search and semantic search is more reasonable.

In this paper, we address the problem of multi-keyword latent semantic ranked search over encrypted cloud data, retrieving the most relevant files. We define a new scheme, Latent Semantic Analysis (LSA)-based multi-keyword ranked search, which supports multi-keyword latent semantic ranked search. By using LSA, the proposed scheme returns not only the exactly matching files but also files containing terms that are latently semantically associated with the query keywords.

The remainder of this paper is organized as follows. Section 2 describes the system model, privacy requirements, and notations. Section 3 provides a detailed description of the proposed mechanism. Section 4 presents the experiments and security analysis. Section 5 concludes the paper.

2 Problem Formulation

2.1 System Model

The system model involves three entities, as depicted in Fig. 1: the data owner, the data user, and the cloud server. The data owner has a collection of data documents $D = \{d_1, d_2, \ldots, d_m\}$. A set of distinct keywords $W = \{w_1, w_2, \ldots, w_n\}$ is extracted from the data collection $D$. The data owner first constructs an encrypted searchable index $I$ from $D$ and then uploads both the encrypted index $I$ and the encrypted data collection $C$ to the cloud server. The data user submits $t$ keywords to the cloud server, and the cloud server sends back only the top-$l$ files that are most relevant to the search query.

Fig. 1. Architecture of ranked search over encrypted cloud data. (Diagram labels: data owner, user, cloud server; outsourced index and encrypted files; search request; top-l ranked files.)

2.2 Threat Model and Design Goals

The cloud server follows the designated protocol specification, but at the same time it analyzes the data in its storage and the message flows received during the protocol in order to learn additional information. The design goals of our system are as follows:

• Latent Semantic Search: We use statistical techniques to estimate the latent semantic structure and get rid of the obscuring "noise" [5].
• Multi-keyword Ranked Search: The scheme supports both multi-keyword queries and result ranking.
• Privacy-Preserving: The scheme is designed to meet the privacy requirements and to prevent the cloud server from learning additional information from the index and the trapdoor.
  1) Index Confidentiality. The TF values of keywords are stored in the index, so the index stored on the cloud server needs to be encrypted.
  2) Trapdoor Unlinkability. The cloud server should not be able to deduce the relationship between trapdoors.
  3) Keyword Privacy. The cloud server should not be able to discern the keywords in a query or in the index by analyzing statistical information such as term frequency.

2.3 Notations and Preliminaries

• $D$: the plaintext document collection, a set of $m$ data documents $D = \{d_1, d_2, \ldots, d_m\}$.
• $C$: the encrypted document collection stored on the cloud server, $C = \{c_1, c_2, \ldots, c_m\}$.
• $W$: the dictionary, i.e., the keyword set consisting of $n$ keywords, $W = \{w_1, w_2, \ldots, w_n\}$.
• $I$: the searchable index associated with $C$, denoted as $(I_1, I_2, \ldots, I_m)$.
• $tf_{i,j}$: the term frequency, i.e., the number of times the $i$-th term appears in the $j$-th document.
• $A'[j]$: the data vector for document $d_j$, where the element $A'[i][j]$ is the term frequency $tf_{i,j}$ of keyword $w_i \in W$ in document $d_j$.
• $Q$: the query vector indicating the keywords of interest, where each bit $Q[j] \in \{0, 1\}$ represents the existence of the corresponding keyword in the query $\widetilde{W}$.

Latent Semantic Analysis: In information retrieval, latent semantic analysis is a technique for discovering latent semantic relationships. It uses singular value decomposition (SVD) to find the semantic structure between terms and documents. In this paper, the term-document matrix $A'$ has $n$ rows, one per keyword, and $m$ columns, each of which is the data vector of one file, as shown in Eq. 1:

$$A' = \big(A'[1] \;\cdots\; A'[j] \;\cdots\; A'[m]\big) \qquad (1)$$

Then, we take a large term-document matrix and decompose it into a set of $k$ orthogonal factors from which the original matrix can be approximated by linear combination [5]. For example, a term-document matrix $A'$ can be decomposed into the product of three other matrices:

$$A'_{n \times m} = U'_{n \times t} \cdot S'_{t \times t} \cdot V'_{t \times m} \qquad (2)$$

such that $U'$ and $V'$ have orthonormal columns and $S'$ is diagonal. We keep the first $k$ columns of $S'$ and delete the remaining ones, together with the corresponding columns of $U'$ and $V'$. The result is a reduced model:

$$A_{n \times m} = U'_{n \times k} \cdot S'_{k \times k} \cdot V'_{k \times m} \approx A' \qquad (3)$$

which is the rank-$k$ model with the best possible least-squares fit to $A'$ [5].

Secure k-NN: To compute the inner product in a privacy-preserving way, we adapt the secure k-nearest neighbor scheme [6]. Its splitting technique is secure against known-plaintext attacks and is roughly equal in security to a $d$-bit symmetric key. Details can be found in [6, 7].

3 Proposed Scheme

3.1 Our Scheme

According to the above definition of LSA, the data owner builds a term-document matrix $A'$. We reduce the dimensions of $A'$ to obtain a new matrix $A$, which is the best "reduced-dimension" approximation to the original term-document matrix [5]. Specifically, $A[j]$ denotes the $j$-th column of the matrix $A$.

• Setup: The data owner generates an $(n+2)$-bit vector $X$ and two $(n+2) \times (n+2)$ invertible matrices $\{M_1, M_2\}$. The secret key $SK$ is the 3-tuple $\{X, M_1, M_2\}$.
• BuildIndex($A'$, $SK$): The data owner extracts a term-document matrix $A'$ and multiplies the three reduced matrices of Eq. 3 to obtain the matrix $A$. For privacy, $A$ must be encrypted before outsourcing. By dimension extending, each $A[j]$ is extended from $n$ to $n+2$ dimensions: the $(n+1)$-th entry is set to a random number $e_j$, and the $(n+2)$-th entry is set to 1. The extended $A[j]$ can thus be represented as $((A[j])^T, e_j, 1)^T$. It is split with $X$ into two vectors, written $A'[j]$ and $A''[j]$ (these denote the split vectors, not columns of the term-document matrix $A'$), and the subindex $I_j = \{M_1^T A'[j], M_2^T A''[j]\}$ is built.
• Trapdoor($\widetilde{W}$): With the $t$ keywords of interest in $\widetilde{W}$ as input, one binary vector $Q$ is generated. The $(n+1)$-th entry of $Q$ is set to 1, the first $n+1$ entries are scaled by a random number $r \neq 0$, and the $(n+2)$-th entry is set to another random number $t$, so the extended $Q$ can be represented as $(rQ, r, t)$. It is split into $Q'$ and $Q''$, and the trapdoor $T_{\widetilde{W}}$ is generated as $\{M_1^{-1} Q', M_2^{-1} Q''\}$.
• Query($T_{\widetilde{W}}$, $l$, $I$): The cloud server calculates the inner product of each $I_j$ and $T_{\widetilde{W}}$. After sorting all scores, the cloud server returns the top-$l$ ranked id list to the data user.

The final similarity scores would be:

$$I_j \cdot T_{\widetilde{W}} = \{M_1^T A'[j],\, M_2^T A''[j]\} \cdot \{M_1^{-1} Q',\, M_2^{-1} Q''\} = (A'[j])^T Q' + (A''[j])^T Q'' = \big((A[j])^T, e_j, 1\big) \cdot (rQ, r, t)^T = r\,(A[j] \cdot Q + e_j) + t \qquad (4)$$

4 Performance Analysis

In this section, we present an experimental evaluation of the proposed technique on a real dataset, the MED dataset.

4.1 Performance Analysis

The F-measure, which combines precision and recall, is the harmonic mean of precision and recall [8]:

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (5)$$

We adopt the F-measure to weigh the results of our experiments. In the comparison, our proposed scheme attains a higher F-measure than the original MRSE. Because the original scheme employs exact matching, it misses words that are similar to the query keywords. Our scheme makes up for this disadvantage and retrieves the most relevant files. Fig. 2 shows the F-measure of both schemes as the number of documents in the dataset varies.

Fig. 2. Comparison of two schemes. (Plot of F-measure against the number of documents in the dataset for LSA-MRSE and the original MRSE.)

5 Conclusion

In this paper, we propose a multi-keyword ranked search scheme over encrypted cloud data that also supports latent semantic search. We use vectors of TF values as indexes to the documents. These vectors constitute a matrix, from which we analyze the latent semantic association between terms and documents by LSA. Taking security and privacy into consideration, we employ a secure splitting k-NN technique to encrypt the index and the query vector, so that we can obtain accurately ranked results while protecting the confidentiality of the data. In the experiments, the scheme attains a higher F-measure than the original MRSE scheme.

Acknowledgments. This work is supported by the NSFC, GYHY, 2013DFG12860, BC and PAPD fund.

References

1. Armbrust, M., et al., A view of cloud computing. Communications of the ACM, (4).
2. Chuah, M. and W. Hu. Privacy-aware bedtree based solution for fuzzy multi-keyword search over encrypted data. in Distributed Computing Systems Workshops (ICDCSW), International Conference on. IEEE.
3. Deshpande, S., et al., Fuzzy keyword search over encrypted data in cloud computing. World Journal of Science and Technology, (10).
4. Wang, C., et al. Secure ranked keyword search over encrypted cloud data. in Distributed Computing Systems (ICDCS), 2010 IEEE 30th International Conference on. IEEE.
5. Deerwester, S.C., et al., Indexing by latent semantic analysis. JASIS, (6).
6. Wong, W.K., et al. Secure kNN computation on encrypted databases. in Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM.
7. Yang, C., et al. A Fast Privacy-Preserving Multi-keyword Search Scheme on Cloud Data. in Cloud and Service Computing (CSC), 2012 International Conference on. IEEE.
8. Powers, D.M. The problem with kappa. in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics.
Copyright © 2013 SERSC