Published by Oswald Peters. Modified over 8 years ago.
1
SYSTEM FOR INTELLIGENT SEARCH AND ANALYSIS OF LARGE-SCALE TEXT COLLECTIONS. Ilya Tikhomirov, PhD. Institute for Systems Analysis, Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences. 117312, Moscow, pr. 60-letiya Oktyabrya, 9. Tel.: +7 (499) 135-04-63. www.isa.ru
2
About. Russian Academy of Sciences: the national academy of Russia; provides methodological guidance to more than 400 research centers. Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences: multidisciplinary research (mathematics, IT, economics, etc.); 1200+ employees, 300+ PhDs.
3
www.textapp.ru TextAppliance. TextAppliance is a system for intelligent search and analysis of large-scale text collections. Different from … etc.: uses deep natural language processing; based on advanced Exactus technology; result of state-of-the-art research in computer science.
4
Functions. TextAppliance consists of a hardware cluster and intelligent software services for search and analysis of large-scale text collections: semantic and explorative search; search for similar documents; semantic plagiarism detection; formation, comparison, and topic analysis of user collections; automatic keyword extraction; automatic document summarization; topic analysis of document collections.
5
Features of TextAppliance. Processes documents in Russian and English; extensible to Persian. Can be easily integrated into an organization's infrastructure. Provides a wide set of search and analytical functions. High-quality text processing. Supports common document formats, including PDF without a text layer (performs OCR).
6
Architecture. Scalability; resiliency; easy integration (JSON / XML-RPC); support for Big Data; full-text indexing; extraction and indexing of metadata; support for common document formats.
7
Implementation. Runs on a computational cluster under Debian Linux. Distributed computing provides scalability and stability under high load.
8
Technologies behind TextAppliance
9
Semantic search method. (1) Perform deep natural language processing of the user query: POS tagging, syntactic parsing, semantic role labeling, semantic relation extraction, named entity recognition. (2) Compare the linguistic structure of the query with the structures of documents in a large indexed text collection.
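The comparison step can be illustrated with a minimal sketch. It assumes, purely for illustration, that analysing a query or a document yields a set of (predicate, role, argument) triples. The role names, the `structure_score` function, and the scoring rule (the share of query relations found in the document) are hypothetical and are not taken from TextAppliance or Exactus.

```python
def structure_score(query_rels, doc_rels):
    """Share of the query's relation triples that also occur in the document."""
    if not query_rels:
        return 0.0
    return len(query_rels & doc_rels) / len(query_rels)


# Hypothetical analysis result for the query "mother brings son".
query = {("bring", "agent", "mother"), ("bring", "object", "son")}

# Hypothetical analysis results for three indexed documents.
docs = {
    "d1": {("bring", "agent", "mother"), ("bring", "object", "son"),
           ("bring", "destination", "school")},
    "d2": {("bring", "agent", "son"), ("bring", "object", "mother")},
    "d3": {("bring", "agent", "mother")},
}

# Rank documents by how much of the query structure they reproduce.
ranked = sorted(docs, key=lambda d: structure_score(query, docs[d]), reverse=True)
print(ranked)
```

Document d2 contains the same words as the query but with the roles swapped, so it matches none of the query relations; a plain keyword match would not make that distinction.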
10
Semantic search scheme (figure)
11
Tokenization and sentence splitting. Extract tokens from raw text; extract sentences from raw text. Example: “The mother brings her son to school.” → the | mother | brings | her | son | to | school | .
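A minimal sketch of both steps using regular expressions, applied to the slide's example sentence. The function names are illustrative, and the rules are deliberately naive (abbreviations such as "Dr." would be split incorrectly); this is not the tokenizer used by TextAppliance.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return lowercased word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


text = "The mother brings her son to school. He likes it."
sentences = split_sentences(text)
tokens = tokenize(sentences[0])
print(sentences)
print(tokens)
```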
12
POS-tagging and morphological analysis. Determine part-of-speech (POS) tags of words; determine morphological features of words (for morphologically rich languages). Example: the/det mother/noun brings/verb her/pronoun son/noun to/prep school/noun .
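The tagging of the example sentence can be reproduced with a toy dictionary lookup. This is only a sketch: the `LEXICON` table is built from the slide's example, the `punct` and `unknown` labels are assumptions added here, and a real tagger resolves ambiguity from context rather than from a fixed table.

```python
# Toy lexicon taken from the slide's example sentence.
LEXICON = {
    "the": "det",
    "mother": "noun",
    "brings": "verb",
    "her": "pronoun",
    "son": "noun",
    "to": "prep",
    "school": "noun",
}


def tag(tokens):
    """Attach a POS tag to every token using the lexicon."""
    tagged = []
    for tok in tokens:
        if not any(ch.isalnum() for ch in tok):
            tagged.append((tok, "punct"))      # assumed label for punctuation
        else:
            tagged.append((tok, LEXICON.get(tok, "unknown")))
    return tagged


tagged = tag(["the", "mother", "brings", "her", "son", "to", "school", "."])
print(tagged)
```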
13
Syntax parsing. Build a syntax tree; extract the grammatical structure of a sentence.
14
Semantic analysis. Creates an abstract representation of the text that does not depend on a particular language; extracts semantic roles and semantic relations.
15
Relational-Situational model of text (1). The model captures: syntax relations; semantic roles and values of syntaxemes; semantic relations between syntaxemes; coreference relations; other information extracted from texts (names of persons, names of companies, geographical objects, etc.). Example: “Oxygen arrives at tissues from lungs through blood. There it is spent on oxidation of various substances.”
16
Relational-Situational model of text (2). The model $M$ is built from the following components:
$S = \{s_1, s_2, \dots, s_n\}$: the set of syntaxemes; $s_i$ is a syntaxeme.
$R$: a family of relationships on the set of syntaxemes, $R \subseteq S \times S$.
$T_s = \{\text{'p'}, \text{'n'}\}$: syntaxeme types ('p' is a predicate word; 'n' is a nominal syntaxeme).
$I_s : S \to T_s$: the mapping from syntaxemes to syntaxeme types.
$W$: word.
$P$: syntaxeme features, including categorial semantic class, prepositions, and other morphological properties.
$R = \{(s_1, s_2)\}$ is a family of binary relationships consisting of three subfamilies:
$R_p$: relationships between predicate words and nominal syntaxemes (or syntaxeme meanings);
$R_n$: relationships between nominal syntaxemes;
$R_c$: relationships that express anaphora and co-reference.
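One possible way to hold such a model in memory is sketched below. The class and field names are hypothetical, and treating a syntaxeme as a record of word, type, and features is an assumption made for this sketch; the slide does not specify a data layout. The example is populated from the first sentence of the oxygen example, with the pronoun "it" linked to "oxygen" as a co-reference.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Syntaxeme:
    word: str             # W: the word
    kind: str             # type from T_s: 'p' (predicate) or 'n' (nominal)
    features: tuple = ()  # P: e.g. a preposition or a semantic class

    def __post_init__(self):
        if self.kind not in ("p", "n"):
            raise ValueError("kind must be 'p' or 'n'")


@dataclass
class TextModel:
    syntaxemes: set = field(default_factory=set)  # S
    r_p: set = field(default_factory=set)         # predicate-nominal pairs
    r_n: set = field(default_factory=set)         # nominal-nominal pairs
    r_c: set = field(default_factory=set)         # anaphora / co-reference pairs

    def add_predicate_relation(self, pred, nominal):
        if pred.kind != "p" or nominal.kind != "n":
            raise ValueError("R_p links a predicate word to a nominal syntaxeme")
        self.syntaxemes.update((pred, nominal))
        self.r_p.add((pred, nominal))

    def add_coreference(self, a, b):
        self.syntaxemes.update((a, b))
        self.r_c.add((a, b))

    def relations(self):
        """R as the union of its three subfamilies."""
        return self.r_p | self.r_n | self.r_c


arrives = Syntaxeme("arrives", "p")
oxygen = Syntaxeme("oxygen", "n")
tissues = Syntaxeme("tissues", "n", ("at",))
lungs = Syntaxeme("lungs", "n", ("from",))
blood = Syntaxeme("blood", "n", ("through",))
it = Syntaxeme("it", "n")

model = TextModel()
for nominal in (oxygen, tissues, lungs, blood):
    model.add_predicate_relation(arrives, nominal)
model.add_coreference(it, oxygen)
print(len(model.syntaxemes), len(model.relations()))
```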
17
Semantic search example (figure)
18
Indexing technology. Fast indexing: sublinear dependency between the number of indexed documents and indexing speed. Efficient search enhanced by linguistic information, including semantic structures and syntax trees. Stores rich linguistic structures of texts efficiently, with minimum overhead for keeping semantic information.
19
Evaluation of semantic search (ROMIP’08). Chart labels: recall; precision (1st place).
20
Evaluation of question answering search (ROMIP’10). Best results for all metrics.
21
Evaluation of plagiarism detection (CLEF’2014). The developed method ranks 2nd on F-measure and 1st on the ratio of F-measure to the number of checked fragments. Charts: F-measure; ratio of F-measure to number of queries.
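For reference, the two quantities named on this slide can be computed as sketched below. The code uses the standard definition of the F-measure as the weighted harmonic mean of precision and recall; the exact variant used in the CLEF’2014 evaluation is not given on the slide, and the function names and the sample numbers are illustrative only.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def f_per_query(precision, recall, n_queries):
    """F-measure divided by the number of queries (or checked fragments)."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    return f_measure(precision, recall) / n_queries


# Illustrative numbers, not results from the evaluation.
print(f_measure(0.5, 0.5))
print(f_per_query(0.5, 0.5, 10))
```

The ratio rewards methods that reach a given F-measure while checking fewer fragments, which is the second criterion mentioned on the slide.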
22
TextAppliance applications
23
Academic applications. Search and analytics on scientific publications, fields, and research groups (created for the Ministry of Education and Science of the Russian Federation); plagiarism search in scientific publications; intelligent patent search and patent analytics; TextAppliance helps the Russian Foundation for Basic Research with expert review of grant applications.
24
Academic applications (1). Analysis of publication activity on the topic of "expert systems" (figure).
25
Academic applications (2)
26
Academic applications (3). Example: analytics on “electronic book” patents. Charts: patent holders; patents by country; number of patents.
27
Academic applications (4). The Russian Foundation for Basic Research (RFBR) is the biggest scientific fund in Russia. RFBR uses TextAppliance to improve the expert review of applications for scientific grants. TextAppliance helps to: structure large collections of applications and reports; find plagiarism in applications and reports; find topically similar projects; identify emerging scientific fields that need additional support; assign experts to projects.
28
Business applications. Leading Russian publishers: Infra-M (product Znanium.com) and Rucont.ru. Integrated into their products: a plagiarism detection service; an intelligent thematic search service; a service for analyzing the structure of scientific documents (evaluation of publication quality).
29
Business application (1): Znanium.com
30
Business application (2): Rucont.ru. Example: 3D representation of clustering results.
31
Customers and partners
32
TextAppliance team (1): Olga Vybornova (PhD); Gennady Osipov (Dr.Sc., prof.); Ilya Tikhomirov (PhD); Ivan Smirnov (PhD); Dmitry Devyatkin (researcher); Alexander Shvets (PhD); Artem Shelmanov (PhD); Ilya Sochenkov (PhD).
33
TextAppliance team (2): Roman Suvorov (PhD student); Denis Zubarev (PhD student); Sergey Krylov (Dr.Sc., prof.); Margarita Ananyeva (PhD student); Margarita Kamenskaya (PhD student); Ivan Khramoin (PhD student); Vasiliy Iadrencev (student); Vadim Isakov (student).
34
www.textapp.ru, demo.textapp.ru. Institute for Systems Analysis, Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences. 117312, Moscow, pr. 60-letiya Oktyabrya, 9. Tel/fax: +7 499 135 0463. tih@isa.ru