Introduction to Text Mining
Zou Quan (邹权), PhD, Assistant Professor

Outline
- Introduction
- TF-IDF
- Similarity

Introduction
- Why? Text mining ≈ Web mining
- How? Classification or Clustering; Retrieval

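General process of text classification
- Preprocessing: represent the document collection in a form that a computer can process easily
- Feature representation, selection, and dimensionality reduction: express the importance of each term in a document using a suitable weighting method
- Learning and modeling: build the classifier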

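Preprocessing for text classification
- Remove punctuation, extra whitespace, and (optionally) digits
- Normalize letter case
- Remove stop words: words that carry no real meaning, such as and, you, have
- Reduce words to a common stem (PorterStemmer)
- Word segmentation: English? Chinese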
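
The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for PorterStemmer, not the Porter algorithm itself. It also assumes whitespace-delimited (English-style) text, so it does not address Chinese word segmentation.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"and", "you", "have", "the", "a", "to", "of", "is"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for PorterStemmer, not the real algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, drop_digits=True):
    """Apply the slide's steps in order and return a list of normalized terms."""
    text = text.lower()                    # normalize letter case
    if drop_digits:
        text = re.sub(r"\d+", " ", text)   # optional: remove digits
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    tokens = text.split()                  # split on whitespace, dropping extra spaces
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [naive_stem(t) for t in tokens]               # reduce to a rough stem


print(preprocess("You have 3 new messages, and the   servers are running!"))
```

The rough stems it produces (for example "messag" and "runn") are not dictionary words; that is typical of stemming, whose goal is only to map related word forms to the same string.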

Feature representation
- Vector space model
- Terms serve as features, forming a high-dimensional feature vector
- TF/IDF provides the weights
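
As a rough illustration of the vector space model, the sketch below maps each distinct term to one dimension and represents a document by its raw term counts. The helper names `build_vocabulary` and `to_count_vector` and the two toy documents are assumptions made for this example; the weights here are plain counts, not yet TF/IDF values.

```python
def build_vocabulary(docs):
    """Map each distinct term to a vector dimension (sorted for reproducibility)."""
    terms = sorted({term for doc in docs for term in doc})
    return {term: index for index, term in enumerate(terms)}


def to_count_vector(doc, vocab):
    """Represent one tokenized document as a vector of term counts."""
    vector = [0] * len(vocab)
    for term in doc:
        if term in vocab:              # ignore terms outside the vocabulary
            vector[vocab[term]] += 1
    return vector


# Two toy, already-tokenized documents (hypothetical data).
docs = [["text", "mining", "text"], ["web", "mining"]]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

With a realistic corpus the vocabulary has many thousands of terms and most entries of each vector are zero, which is why the slide calls the feature vector high-dimensional.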

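TF-IDF
- TF (Term Frequency): how often a term occurs in a document
- IDF (Inverse Document Frequency): higher for terms that occur in fewer documents
- TF*IDF value: the product of the two, used as the term's weight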
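
The slides do not give a formula, so the following sketch assumes one common variant: TF is the term count divided by the document length, and IDF is the natural logarithm of (number of documents / number of documents containing the term). Other variants (smoothed IDF, log-scaled TF) are equally plausible. The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))      # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical): "mining" occurs in every document.
docs = [["text", "mining", "text"], ["web", "mining"], ["mining", "retrieval"]]
weights = tf_idf(docs)
print(weights[0])
```

Under this variant a term that appears in every document, such as "mining" here, gets IDF ln(1) = 0 and therefore a weight of zero, while "text", which is frequent in the first document and absent elsewhere, gets the largest weight.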

Similarity Applications
- Many Web-mining problems can be expressed as finding “similar” sets:
  - Plagiarism, mirror pages, articles from the same source, duplicate removal
- Collaborative filtering as a similar-sets problem: recommend to users items that were liked by other users who have exhibited similar tastes
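
To make the similar-sets view of collaborative filtering concrete, here is a minimal sketch. It assumes each user is represented by the set of items they liked and that "similar tastes" is measured by Jaccard similarity between those sets; the slides do not prescribe this, and the user names, item names, and the `recommend` helper are invented for the example.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recommend(target, likes):
    """Suggest items liked by the most similar other user that the target has not liked."""
    others = [user for user in likes if user != target]
    best = max(others, key=lambda user: jaccard_similarity(likes[target], likes[user]))
    return sorted(likes[best] - likes[target])


# Hypothetical "liked items" sets.
likes = {
    "alice": {"item1", "item2", "item3"},
    "bob": {"item2", "item3", "item4"},
    "carol": {"item5"},
}
print(recommend("alice", likes))
```

Here bob shares two of four distinct items with alice, carol shares none, so the sketch recommends bob's remaining item to alice.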

Measurement
- Edit distance: short text, words; for personal text
- Jaccard distance: long text, ignoring word similarity; for government text
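
Both measures can be sketched briefly. The edit distance below is the standard Levenshtein dynamic program (insertions, deletions, and substitutions each cost 1), and the Jaccard distance is one minus the ratio of shared to total distinct elements. The function names are choices made for this example; the test strings are name and title pairs that appear in these slides.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))          # distances from "" to each prefix of t
    for i, char_s in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            curr.append(min(prev[j] + 1,          # delete char_s
                            curr[j - 1] + 1,      # insert char_t
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the distinct elements of a and b."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


print(edit_distance("Kenneth De Jong", "Kenneth Dejong"))
print(edit_distance("relaxed", "related"))
print(jaccard_distance("yes as soon as possible".split(),
                       "as soon as possible please".split()))
```

Edit distance works character by character, which suits short strings such as names. Jaccard distance treats each document as a set of whole words, so two different spellings of the same word count as entirely different elements.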

Real-world Data is Rather Dirty!
Microsoft Academic Search
- http://academic.research.microsoft.com/Author/2037349.aspx
- http://academic.research.microsoft.com/Author/3054641.aspx
- Kenneth De Jong vs. Kenneth Dejong

Real-world Data is Rather Dirty!
DBLP Complete Search
- Typo in “author”: Argyrios Zymnis vs. Argyris Zymnis
- Typo in “title”: relaxed vs. related
2016-5-27, Trie-Join @ VLDB2010

Similarity Joins
- The similarity join is an essential operation for data integration and cleaning
- Perform a similarity join on the Name attribute (find all record pairs whose Name attributes are similar)
- Output: (2037349, 3054641), …

Table R
Id        Name              Univ.
2037349   Kenneth De Jong   George …
…         …                 …
3054641   Kenneth Dejong    George …
…         …                 …
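
A naive version of this join can be written as an all-pairs comparison. The sketch below assumes that "similar" means edit distance at most a threshold `tau`; the threshold value, the function names, and the third record are assumptions for illustration. It is a quadratic baseline, not the Trie-Join or Pass-Join algorithm, both of which exist precisely to avoid comparing every pair.

```python
from itertools import combinations


def edit_distance(s, t):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        curr = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_join(records, tau):
    """Return all (id, id) pairs whose names are within edit distance tau."""
    return [
        (id_a, id_b)
        for (id_a, name_a), (id_b, name_b) in combinations(records, 2)
        if edit_distance(name_a, name_b) <= tau
    ]


records = [
    (2037349, "Kenneth De Jong"),
    (3054641, "Kenneth Dejong"),
    (1, "Jane Roe"),                 # hypothetical extra record, not from the slides
]
print(similarity_join(records, tau=2))
```

With `tau=2` the only pair returned is the one shown on the slide; the space and the capital J account for the two edits.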

Near Duplicate Data
On one end, a winded Pete Sampras tried to summon enough energy to give the New York fans another memorable win to talk about it on the subway ride home. On the other side, Roger Federer wore a sly grin like he knew age was about to catch up to the former world No. 1 - the man who owns the record of 14 Grand Slams he wants.
03/11/2008 | 11:28 AM
By JAY COHEN, AP Sports Writer, Mar 11, 4:23 am EDT

Similarity Join
- Tokenize: each record is a set of tokens from a finite universe.
- Suppose each record is a single text document:
  x = “yes as soon as possible”
  y = “as soon as possible please”
- Word-to-token mapping (the second occurrence of “as” is written as1 and gets its own token):
  word:  yes  as  soon  as1  possible  please
  token: A    B   C     D    E         F
- Resulting token sets:
  x = {A, B, C, D, E}
  y = {B, C, D, E, F}
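
The tokenization above can be reproduced with a short sketch. It assumes that a repeated word is made distinct by pairing it with its occurrence index, which matches the slide's treatment of the second "as"; the `tokenize` name and the (word, index) encoding are choices made for this example rather than the letters A to F used on the slide. The Jaccard similarity at the end is one minus the Jaccard distance listed under Measurement.

```python
from collections import Counter


def tokenize(text):
    """Turn a document into a set of tokens; each repeat of a word gets its own token."""
    seen = Counter()
    tokens = set()
    for word in text.lower().split():
        tokens.add((word, seen[word]))   # (word, 0) first time, (word, 1) second, ...
        seen[word] += 1
    return tokens


x = tokenize("yes as soon as possible")
y = tokenize("as soon as possible please")

overlap = len(x & y)                 # tokens shared by both records
jaccard = overlap / len(x | y)       # shared tokens over all distinct tokens
print(sorted(x))
print(overlap, jaccard)
```

Both records have five tokens, four of which are shared (B, C, D, E on the slide), so the Jaccard similarity is 4/6, about 0.67.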

References
- Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
- Guoliang Li, Dong Deng, Jiannan Wang, Jianhua Feng. Pass-Join: A Partition based Method for Similarity Joins. VLDB 2012.
