Presentation on theme: "Identrics – Team Vistula" — Presentation transcript:

1 Identrics – Team Vistula

2 Agenda:
Business Understanding
Data Understanding
Data Processing
Modeling
Evaluation
Deployment

3 1. Business Understanding:
Main aims:
Use a coreference resolution algorithm to extract more data about each entity.
Create meaningful keywords for entities.

4 2. Data Understanding
Data sets:
Documents.csv – a list of news articles from which information is extracted.
Entities.csv – a list of entities for which keywords should be extracted.
The coreference algorithm and how it works – it mainly resolves personal pronouns.
Example (see the sketch below):
Input: Ana has a dog. She loves him.
Output: Ana has a dog. Ana loves the dog.
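A minimal sketch of how this example could be reproduced, assuming the standalone neuralcoref library's Coref class and its one_shot_coref / get_resolved_utterances methods (the calls named in the data-processing step); the exact API differs between neuralcoref versions:

    from neuralcoref import Coref  # assumes the standalone (pre-spaCy-extension) neuralcoref package

    coref = Coref()

    # Resolve coreferences in the example text in a single pass.
    coref.one_shot_coref(utterances=u"Ana has a dog. She loves him.")

    # Pronouns are replaced with the mentions they refer to,
    # e.g. something like ["Ana has a dog. Ana loves a dog."]
    print(coref.get_resolved_utterances())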

5 3. Data Processing:
Step 1: Set up an EC2 instance on AWS using the Tornado Python framework. On the server, a script was run to clone the coreference repository, and the one_shot_coref() and get_resolved_utterances() functions were run against the data. As a result, context-dependent words such as “her”, “his” and “it” in each document were replaced with the names of the entities they refer to.
Step 2: Gather all the sentences in which an entity appears, concatenate them into a single string, and collect these strings into a list of entity-related sentence blocks (see the sketch below).
Step 3: Transform the CSV files into a format that is more machine-readable and friendlier to loop over.
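A minimal sketch of Step 2, assuming hypothetical column names ("text" in Documents.csv, "name" in Entities.csv) and a naive sentence split; this is a sketch of the idea, not the team's actual code:

    import pandas as pd

    # Hypothetical schema: Documents.csv has a "text" column,
    # Entities.csv has a "name" column.
    documents = pd.read_csv("Documents.csv")
    entities = pd.read_csv("Entities.csv")

    entity_sentences = []
    for name in entities["name"]:
        matching = []
        for text in documents["text"]:
            # Naive sentence split; in the real pipeline the text is already
            # coreference-resolved, so pronouns point to the entity by name.
            for sentence in text.split("."):
                if name.lower() in sentence.lower():
                    matching.append(sentence.strip())
        # One concatenated string of sentences per entity.
        entity_sentences.append(". ".join(matching))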

6 4. Modeling: We chose an algorithm implemented in the Python library ‘summa’. The summa library uses the TextRank algorithm to select keywords from the given sentences. It is described in detail in the following GitHub repository:
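For illustration, a minimal sketch of keyword extraction with summa, run on a concatenated per-entity string like the ones produced in Step 2; the sample text and the words=5 limit are assumptions for the example, not the team's actual input or settings:

    from summa import keywords

    # Illustrative per-entity text only; the real input is the concatenated
    # set of sentences gathered for an entity in Step 2.
    entity_text = (
        "Lukoil and Petrom signed a crude transport agreement. "
        "The crude transport deal covers deliveries to the Petrom refinery."
    )

    # TextRank-based keyword extraction; returns newline-separated keywords.
    print(keywords.keywords(entity_text, words=5))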

7 5. Evaluation: Among the models considered were the Stanford CoreNLP algorithm and a simpler approach using NLTK tokenization. The Stanford CoreNLP algorithm is widely used and described in many formal articles, so both its speed and reliability are predictable. The model tested against these alternatives was the neural coreference algorithm built on top of spaCy. It produced results that could satisfy the business problem presented by Identrics.

8 6. Deployment:
Time requirements: The whole process of data preparation and computation should not take longer than 20 minutes.
Hardware requirements: The algorithm was run on an EC2 machine with 4 cores and 16 GB of RAM. The bottleneck might be the Python code or the neural net, which should be tested; one solution could be increasing the size of the EC2 machine. EDIT: We should use a multiprocessing pool (see the sketch below).
Output examples:
Lukoil,petrom,signed,crude transport
TGN,gas,bse,company
Gazprom,gas,transgaz,old,romania
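A minimal sketch of the suggested multiprocessing pool, parallelising per-document coreference resolution across the 4 EC2 cores; resolve_document and init_worker are hypothetical helpers, not functions from the actual codebase:

    from multiprocessing import Pool

    _coref = None

    def init_worker():
        # Load the coreference model once per worker process (hypothetical setup).
        global _coref
        from neuralcoref import Coref
        _coref = Coref()

    def resolve_document(text):
        # Hypothetical wrapper: resolve pronouns in a single document.
        _coref.one_shot_coref(utterances=text)
        return " ".join(_coref.get_resolved_utterances())

    if __name__ == "__main__":
        documents = ["Ana has a dog. She loves him."]  # in practice, read from Documents.csv
        # One worker per core on the 4-core EC2 machine.
        with Pool(processes=4, initializer=init_worker) as pool:
            resolved = pool.map(resolve_document, documents)
        print(resolved)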
