Presented by: Prof. Ali Jaoua
Feature Extraction Based on Isolated Labels: Application for Automatic News Categorization
Agenda
Introduction
Financial Watch Project
Related Work
  Problems: reducing the dimensionality of the word space; exploring pertinent features
  χ² statistic, Document Frequency Thresholding, Information Gain
Proposed Approach
  Basic aspects and preprocessing phase (empty words, segmentation, stemming)
  Conceptual approach and mathematical background
Conclusion and future work
Financial watch project objectives:
Develop a financial news text mining platform (sources: web pages, blogs, forum postings, messages, etc.).
Collect, analyze and extract relevant concepts and events from Arabic and English financial news.
Trigger alerts to potential users of the system to assist them in making decisions.
Identify and alert decision-makers.
Identify the best tools to be integrated into the FinancialWatch platform.
Why should news be processed?
Applications of information extraction (IE):
Health care delivery IE systems
Intelligence gathering IE systems
IE for disaster response
Financial Watch System
Financial Watch Project
Automatic Text Categorization
Definition and History
Text categorization: classify a set of documents into one or more predefined categories or classes.
Definition of a set of alerting systems
Approach: supervised text categorization
Definition of a financial news ontology
Semantics characterizing the concept labels
Use of a Support Vector Machine (SVM) for evaluation
Proposed Approach
Semantic filtering
Exploring pertinent features
Segmentation
Stop word removal
Synonym and stemming process
  Synonyms in English text: WordNet 3 is loaded into a database to process synonyms
  Stemming: Arabic stemmer, English stemmer
Exploring pertinent features
Conceptual approach and mathematical background
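The preprocessing chain listed on this slide (segmentation, stop word removal, stemming) can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list and the suffix-stripping stemmer are hypothetical stand-ins, not the Arabic and English stemmers used in the project, and the WordNet synonym step is left out.

```python
import re

# Hypothetical, tiny stop-word list; a real system would load a full list per language.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "by", "with", "for"}

# Hypothetical suffix list for a crude English stemmer (this is not Porter's algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def segment(text):
    """Split raw text into sentences, then each sentence into lowercase word tokens."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [re.findall(r"[a-z]+", s.lower()) for s in sentences]


def stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Segmentation, then stop-word removal, then stemming."""
    return [
        [stem(w) for w in sentence if w not in STOP_WORDS]
        for sentence in segment(text)
    ]


if __name__ == "__main__":
    print(preprocess("Profits increased in the second quarter. The bank reported gains."))
```

Each inner list is one segmented sentence, which is the unit later mapped to an object of the binary relation.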
Algorithm of Text Categorization System
Training Phase
Diagram labels: Training Set, Elimination of stop words, Stemming, Global Feature List, TF.IDF Weighting, Document Representation, Training Vectors, SVM Classifier (Classifier 1 … Classifier k(k-1)/2)
Step 2: Obtain training vectors. Perform stop word elimination and stemming, apply TF.IDF weighting, and represent each document as a vector.
Step 3: Apply the SVM classifier. Train one SVM classifier for each pair of classes of training vectors, which yields k(k-1)/2 classifiers.
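Steps 2 and 3 can be illustrated with a small sketch. It assumes the plain tf × log(N/df) variant of TF.IDF (the slides do not say which variant was used) and only enumerates the class pairs for which a binary SVM would be trained; the SVM training itself is not shown. The class names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return the sorted global feature list and one TF.IDF vector per document.

    Each document is a list of already preprocessed terms.
    """
    features = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in features])
    return features, vectors


def class_pairs(classes):
    """One binary classifier per unordered pair of classes: k(k-1)/2 pairs."""
    return list(combinations(sorted(classes), 2))


if __name__ == "__main__":
    docs = [["profit", "rise"], ["profit", "merger"], ["ceo", "resign"]]
    features, vectors = tfidf_vectors(docs)
    print(features)
    print(vectors[0])
    print(class_pairs(["c1", "c2", "c3", "c4"]))
```

With k = 4 classes the pair list has 4·3/2 = 6 entries, one per classifier.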
Formal Context Coverage based on Isolated Labels
Algorithm
1. Map the textual document into a binary relation R.
2. Compute the fringe relation Rd = R ∩ complement(R ∘ complement(R)⁻¹ ∘ R) associated with the binary context R. For each pair (x, y) in Rd, generate the concept that contains it; only distinct concepts are added to the coverage C.
3. If R is still not covered by C, extend R with additional composed properties (labels).
4. Repeat steps 2 and 3 until the binary relation R is completely covered.
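The first step can be sketched as below, assuming that each sentence of the document is an object and each word an attribute. The function name and the x0, x1, … identifiers are illustrative choices, and the input is assumed to be already preprocessed with sentences separated by periods.

```python
def document_to_relation(text):
    """Map a document to a binary relation R of (sentence id, word) pairs."""
    sentences = [s.split() for s in text.split(".") if s.strip()]
    return {
        (f"x{i}", word)
        for i, words in enumerate(sentences)
        for word in words
    }


if __name__ == "__main__":
    relation = document_to_relation("a b e. b d. d f g.")
    for pair in sorted(relation):
        print(pair)
```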
FCA Background Formal Information Representation(1/2)
Formal Context I: (A, B, R)

R    B1  B2  B3  B4
A1   1   1   0   0
A2   1   1   0   0
A3   0   1   1   0
A4   0   1   1   1
A5   0   0   1   1

Lattice of Context I (intent, extent):
{} {A1, A2, A3, A4, A5}
{B2} {A1, A2, A3, A4}
{B3} {A3, A4, A5}
{B1, B2} {A1, A2}
{B2, B3} {A3, A4}
{B3, B4} {A4, A5}
{B2, B3, B4} {A4}
{B1, B2, B3, B4} {}

A binary relation R between two finite sets D and T is a subset of the Cartesian product D × T.
A rectangle of R is a pair of sets (A, B) such that A ⊆ D, B ⊆ T and A × B ⊆ R.
A rectangle (A, B) is maximal (a concept) if A × B ⊆ A′ × B′ ⊆ R implies A = A′ and B = B′.
Concepts c1 and c2 are connected iff c1 ≤ c2 and there is no other concept c3 such that c1 ≤ c3 ≤ c2.
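The maximal rectangles of Context I can be enumerated by brute force, as in the sketch below. The incidence table in the code is an assumption: it is the one implied by the eight listed lattice nodes (for example, B1 holds exactly for A1 and A2). Closing every attribute subset is exponential and is only reasonable for a context this small.

```python
from itertools import combinations

# Assumed incidence table, inferred from the listed lattice of Context I.
CONTEXT_I = {
    "A1": {"B1", "B2"},
    "A2": {"B1", "B2"},
    "A3": {"B2", "B3"},
    "A4": {"B2", "B3", "B4"},
    "A5": {"B3", "B4"},
}
ATTRIBUTES = {"B1", "B2", "B3", "B4"}


def extent(attrs):
    """Objects that have every attribute in attrs."""
    return frozenset(x for x, row in CONTEXT_I.items() if attrs <= row)


def intent(objs):
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    common = set(ATTRIBUTES)
    for x in objs:
        common &= CONTEXT_I[x]
    return frozenset(common)


def all_concepts():
    """Close every attribute subset and collect the distinct (intent, extent) pairs."""
    concepts = set()
    for size in range(len(ATTRIBUTES) + 1):
        for subset in combinations(sorted(ATTRIBUTES), size):
            objs = extent(set(subset))
            concepts.add((intent(objs), objs))
    return concepts


if __name__ == "__main__":
    for attrs, objs in sorted(all_concepts(), key=lambda c: (len(c[0]), sorted(c[0]))):
        print(sorted(attrs), sorted(objs))
```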
Background Formal Information Representation(2/2)
Formal Context I: (A, B, R), with the same relation R and lattice as in part 1/2.
Minimal coverage: {C4, C5, C6}.
Pseudo-concept of (A4, B3): the union of all concepts containing (A4, B3), PS = I(B3.R⁻¹) ∘ R ∘ I(A4.R), where I(X) is the identity relation on the set X.
The set of all concepts of a formal context, together with its partial ordering, can be represented graphically as a concept lattice. A concept lattice consists of nodes that represent the concepts and edges connecting these nodes. The nodes for concepts c1 and c2 are connected if and only if c1 ≤ c2 and there is no other concept c3 such that c1 ≤ c3 ≤ c2.
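Read set-theoretically, the pseudo-concept formula restricts R to the objects that have B3 and to the attributes of A4. The sketch below computes it for (A4, B3); the incidence table is again an assumption inferred from the listed lattice of Context I.

```python
# Assumed incidence table, inferred from the listed lattice of Context I.
CONTEXT_I = {
    "A1": {"B1", "B2"},
    "A2": {"B1", "B2"},
    "A3": {"B2", "B3"},
    "A4": {"B2", "B3", "B4"},
    "A5": {"B3", "B4"},
}


def pseudo_concept(context, obj, attr):
    """Pairs of R restricted to (objects having attr) x (attributes of obj)."""
    objs = {x for x, row in context.items() if attr in row}  # attr.R^-1
    attrs = context[obj]                                      # obj.R
    return {(x, y) for x in objs for y in context[x] & attrs}


if __name__ == "__main__":
    for pair in sorted(pseudo_concept(CONTEXT_I, "A4", "B3")):
        print(pair)
```

The result is exactly the union of the four listed concepts whose extent contains A4 and whose intent contains B3.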
Formal Context Coverage based on Isolated Labels
Example
D = {a b c d e. a b e. a b e. a b c d e. b d. a b c d e. a b c d e. d f g. b f g.}
First iteration
Isolated points: {(x1, a), (x1, e), (x2, a), (x2, e), (x3, c)}
Concepts:
C0 = {x0, x1, x2, x5, x6} × {a, b, e}. Labels are a and e.
C1 = {x0, x3, x5, x6} × {b, c}. Label is c.
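The isolated points of the first iteration can be recomputed with the sketch below. A pair (x, y) of R is in the fringe relation exactly when every object that has y also has all attributes of x, so that the pair belongs to a single concept. The context is hard-coded, with x3 taken as {b, c} because that is what the listed concepts C0 and C1 require; treat that as an assumption.

```python
# Assumed context: x3 is taken as {b, c}, as the listed concepts C0 and C1 require.
CONTEXT = {
    "x0": set("abcde"),
    "x1": set("abe"),
    "x2": set("abe"),
    "x3": set("bc"),
    "x4": set("bd"),
    "x5": set("abcde"),
    "x6": set("abcde"),
    "x7": set("dfg"),
    "x8": set("bfg"),
}


def extent(context, attrs):
    """Objects that have every attribute in attrs."""
    return {x for x, row in context.items() if attrs <= row}


def intent(context, objs):
    """Attributes shared by every object in objs."""
    rows = [context[x] for x in objs]
    return set.intersection(*rows) if rows else set()


def isolated_points(context):
    """Pairs (x, y) of R such that every object having y has all attributes of x."""
    return {
        (x, y)
        for x, row in context.items()
        for y in row
        if all(row <= context[other] for other in extent(context, {y}))
    }


def concept_of(context, point):
    """The unique concept (extent, intent) containing an isolated point (x, y)."""
    _, y = point
    objs = extent(context, {y})
    return objs, intent(context, objs)


if __name__ == "__main__":
    for point in sorted(isolated_points(CONTEXT)):
        objs, attrs = concept_of(CONTEXT, point)
        print(point, sorted(objs), sorted(attrs))
```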
Formal Context Coverage based on Isolated Labels
Next iteration (coverage is not yet reached)
Expand R with composed labels and repeat the same steps until R is fully covered.
Concepts:
C0 = {x0, x1, x2, x5, x6} × {a, b, e}. Labels are a and e.
C1 = {x0, x3, x5, x6} × {b, c}. Label is c.
C2 = {x0, x4, x5, x6} × {b, d}. Label is b.d.
C3 = {x7} × {d, f, g}. Label is d.f.
C4 = {x8} × {b, f, g}. Label is b.f.
Features: {a, e, c, b.d, d.f, b.f}
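As a check on the final result, the sketch below rebuilds the concept generated by each label (simple or composed) and tests whether the concepts together cover R. It does not show how the composed labels are chosen, which the slides leave unspecified. The context is hard-coded with x3 taken as {b, c}, an assumption made because the listed concepts C0 to C2 require it.

```python
# Assumed context: x3 is taken as {b, c}, as the listed concepts C0 to C2 require.
CONTEXT = {
    "x0": set("abcde"),
    "x1": set("abe"),
    "x2": set("abe"),
    "x3": set("bc"),
    "x4": set("bd"),
    "x5": set("abcde"),
    "x6": set("abcde"),
    "x7": set("dfg"),
    "x8": set("bfg"),
}

# Features from the slide; a composed label such as b.d is written as a set.
LABELS = [{"a"}, {"e"}, {"c"}, {"b", "d"}, {"d", "f"}, {"b", "f"}]


def concept_from_label(context, label):
    """Concept generated by a label: objects having it, then their shared attributes."""
    objs = {x for x, row in context.items() if label <= row}
    attrs = set.intersection(*(context[x] for x in objs)) if objs else set()
    return objs, attrs


def covered_pairs(context, labels):
    """Union of the pairs of R covered by the concepts of all labels."""
    pairs = set()
    for label in labels:
        objs, attrs = concept_from_label(context, label)
        pairs |= {(x, y) for x in objs for y in attrs}
    return pairs


if __name__ == "__main__":
    relation = {(x, y) for x, row in CONTEXT.items() for y in row}
    print(covered_pairs(CONTEXT, LABELS) == relation)
```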
Formal Context Coverage based on Isolated Labels
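Example
Original text:
إرتفعت الأرباح الصافية لمصرف الراجحي السعودي بنسبة 10% خلال الربع الثاني من العام الحالي مقارنة مع الفترة ذاتها من العام الماضي.
وبلغ صافي ربح الشركة خلال الربع الثالث 24 مليون ريال مقابل صافي ربح بمقدار 21.8 مليون ريال للربع المماثل من العام السابق.
كما بلغ إجمالي الربح خلال الربع الثالث 27 مليون ريال مقابل 24 مليون ريال عن نفس الفترة من العام السابق أي بارتفاع قدره 12.5%.
وبلغ إجمالي الربح التشغيلي خلال الربع الثالث 24 مليون ريال مقابل 21 مليون ريال للربع المماثل عن نفس الفترة من العام السابق.
English translation:
The net profits of Saudi Arabia's Al Rajhi Bank rose by 10% during the second quarter of the current year compared with the same period last year.
The company's net profit during the third quarter reached 24 million riyals, against a net profit of 21.8 million riyals for the corresponding quarter of the previous year.
Gross profit during the third quarter reached 27 million riyals against 24 million riyals for the same period of the previous year, an increase of 12.5%.
Total operating profit during the third quarter reached 24 million riyals against 21 million riyals for the corresponding quarter of the previous year.
After stop word elimination and stemming:
رفع ربح صفا صرف رجح سعد نسب ربع ثني عام حال قرن فتر عام مضي.
بلغ صفا ربح شرك ربع ثلث مليون ريال قبل صفا ربح قدر مليون ريال ربع مثل عام سبق.
بلغ جمل ربح ربع ثلث مليون ريال مليون ريال فتر عام سبق رفع.
بلغ جمل ربح شغل ربع ثلث مليون ريال قبل مليون ريال ربع مثل فتر عام سبق.
Labels (features): { رفع.ربح، صرف، رجح، سعد، نسب، قرن، مضي، ثني، شرك، قدر }
Approximate English glosses of the stems: rise.profit, bank, Rajhi, Saudi, percentage, compare, past, second, company, amount.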
Illustrative example: original text
Illustrative example: (cont.)
Textual document after preprocessing Features Extracted
Illustrative example: (cont.)
Stemmed Words
Methodology Algorithms
Find a minimal coverage.
Use composed properties to reach a coverage.
Find pertinent information with good quality and reasonable efficiency.
Categories: Management Change, Transaction, Performance, Others
Evaluation – English Texts
Evaluation – Arabic Texts
Conclusion & Future work
New methods were proposed to reduce the dimensionality of the word space.
FCA is used for its simplicity and effectiveness as an analysis tool.
Domain information is used to reduce structuring noise and improve the labeling process.
Applications such as corpus organization, text summarization and feature extraction were tested in the context of the FWATCH project.
Evaluation results show the significance of the new methods for incrementally updating a stable information store.
Thank You Q & A