Meandre Workbench, Installation and NLP overview


1 Meandre Workbench, Installation and NLP overview
University of Illinois at Urbana-Champaign

2 Outline
Overview of Workbench
Overview of Repositories
Designing and Constructing Flows
Installation
NLP Overview & Examples
Attendee Project Plan

3 Workbench

4 Meandre: Data Driven Execution
Execution Paradigms
Conventional programs perform computational tasks by executing a sequence of instructions.
Data-driven execution revolves around applying transformation operations to a flow, or stream, of data as it becomes available.
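The contrast can be sketched in a few lines of Python (a sketch only, not Meandre code): the sequential version runs a fixed instruction sequence over data already in hand, while the data-driven version wires up transformations that only run as items arrive on the stream.

```python
# Conventional: a fixed sequence of instructions over data already in hand.
def sequential(values):
    total = 0
    for v in values:
        total += v * 2          # transform, then accumulate
    return total

# Data-driven: a transformation applied lazily to a stream of data.
def doubler(stream):
    for v in stream:            # runs only when a value becomes available
        yield v * 2

def summer(stream):
    total = 0
    for v in stream:
        total += v
    return total

# The pipeline is wired once; execution is driven by the arrival of data.
result = summer(doubler(iter([1, 2, 3])))
```

Both compute the same answer; the difference is that the generator pipeline does no work until data flows through it.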

5 Meandre: The Dataflow Component
Data dictates component execution semantics. A component has inputs and outputs, an RDF descriptor of its behavior, and an implementation.

6 Meandre: Dataflow Example
(Diagram: two inputs, Value1 and Value2, feed a logical operation that produces one output, Sum.)

7 Meandre: Dataflow Example
Dataflow Addition Example
The logical operation '+' requires two inputs and produces one output.
When two inputs are available:
The logical operation can be performed
The sum is output
When the output is produced:
Internal values are reset
The component waits for two new input values to become available
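The firing behavior described above can be sketched as a small Python class (an illustrative sketch, not Meandre's component API): the component fires only when both input queues hold a value, emits the sum, and by consuming the inputs is reset for the next pair.

```python
from queue import Queue

class AdditionComponent:
    """Sketch of the '+' dataflow component: fires only when both
    inputs are available, emits the sum, then waits for new values."""

    def __init__(self):
        self.value1 = Queue()   # input port 1
        self.value2 = Queue()   # input port 2
        self.output = Queue()   # output port

    def try_fire(self):
        # Fire only when a value is queued on *both* input ports.
        if not self.value1.empty() and not self.value2.empty():
            total = self.value1.get() + self.value2.get()  # inputs consumed: reset
            self.output.put(total)
            return True
        return False

comp = AdditionComponent()
comp.value1.put(3)
fired_early = comp.try_fire()   # False: only one input is available
comp.value2.put(4)
fired = comp.try_fire()         # True: both inputs available, Sum = 7
```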

8 Meandre: Component Metadata
Describes a component
Separates:
Component semantics (black box)
Component implementation
Provides a unified framework for:
Basic building blocks or units (components)
Complex tasks (flows)
Standardized metadata

9 Meandre: Component Types
Components are the basic building blocks of any computational task. There are two kinds of Meandre components:
Executable components
Perform computational tasks that require no human interaction during runtime
Processes are initialized during flow startup and are fired in accordance with the policies defined for them
Control components
Used to pause dataflow during user interaction cycles
The WebUI may be an HTML form, an applet, or another user interface

10 Meandre: Flow Connectivity
Defined by connecting outputs from one component to the inputs of another
Cyclical connections are supported
Components may have:
Zero to many inputs
Zero to many outputs
Properties that control runtime behavior
Described using RDF
Enables storage, reuse, and sharing, just as for components

11 Meandre: Flow
A flow is a collection of connected components
(Diagram: dataflow execution through connected components: Read, Get, Merge, Do, Show.)

12 A Little more on the Nuts & Bolts!
Programming Paradigm
What does the Meandre Execution Engine do?
What are the possible component scenarios?
Data-Driven Flow Creation (Workbench/ZigZag)

13 Meandre Server Prepares a Flow
The Meandre Server prepares a data-intensive flow by reading the RDF component descriptors.
Executable components and the connections between them are prepared using a queue mechanism that stores data as it becomes available on the ports.
Meandre provides each component an executing thread for processing
Meandre manages the logical queues for component connections in a flow
Meandre activates components for initialization, data events, and termination
Meandre provides components with access to runtime resources
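The wiring described above (one logical queue per connection, one executing thread per component) can be sketched in Python; this is an illustration of the idea, not the actual Meandre implementation, and the producer/doubler components and the None sentinel are assumptions for the sketch.

```python
import threading
from queue import Queue

def producer(out_q):
    # A component with no inputs: pushes data onto its output port.
    for v in [1, 2, 3]:
        out_q.put(v)
    out_q.put(None)              # sentinel: signal termination downstream

def doubler(in_q, out_q):
    # A component that fires on each data event arriving on its input queue.
    while (v := in_q.get()) is not None:   # blocks until data is available
        out_q.put(v * 2)
    out_q.put(None)

q_ab = Queue()                   # logical queue for connection A -> B
q_out = Queue()                  # logical queue for B's output
threads = [threading.Thread(target=producer, args=(q_ab,)),
           threading.Thread(target=doubler, args=(q_ab, q_out))]
for t in threads:
    t.start()
for t in threads:
    t.join()

results = []
while (v := q_out.get()) is not None:
    results.append(v)
```

Each component runs in its own thread and is driven purely by data arriving on its queues, mirroring the server's role of managing queues and activating components.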

14 Meandre Server Relationship to Component
The Meandre Server infrastructure defines:
The firing policy: ALL or ANY
Input and output data ports, which require a logical queue managed by the server
The component RDF descriptor defines:
Inputs, which the component pulls
Outputs, which the Meandre Server pushes
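The two firing policies named above can be sketched as a readiness check (a sketch under the assumption that ALL means every input port has data queued and ANY means at least one does; the function and port names are illustrative, not Meandre's API):

```python
def ready_to_fire(input_ports, policy):
    """Decide whether a component may fire, given its input-port queues."""
    if policy == "ALL":
        # Fire only when every input port has data available.
        return all(len(q) > 0 for q in input_ports.values())
    if policy == "ANY":
        # Fire as soon as any single input port has data available.
        return any(len(q) > 0 for q in input_ports.values())
    raise ValueError(f"unknown firing policy: {policy}")

ports = {"value1": [3], "value2": []}
any_ready = ready_to_fire(ports, "ANY")   # True: one port has data
all_ready = ready_to_fire(ports, "ALL")   # False: value2 is empty
```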

15 Meandre Server Flows & Connectors
Flows are made up of one or more components, with zero to many connectors, described to the Meandre Server for management
Flows may contain connectors that are cyclical over one or more components
Flows must contain at minimum one component with no inputs to cause an Execute call to be made (outputs are always optional)
Flows can have any number of components with:
Zero to many input data ports
Zero to many output data ports
Flow components may have multiple connectors assigned to any input data port

16 Meandre: Programming Paradigm
The programming paradigm creates complex tasks by linking together a set of specialized components. Meandre's publishing mechanism allows components developed by third parties to be assembled into a new flow.
There are two ways to develop flows:
Meandre's Workbench visual programming tool
Meandre's ZigZag scripting language

17 Workbench
Web-based UI
Components and flows are retrieved from the server
Additional locations of components and flows can be added to the server
Create flows using a graphical drag-and-drop interface
Change property values
Execute the flow

18 What is it?
Visual programming environment
Thankfully, no code-writing skills are required.*
Provides a mechanism to create and execute flows
Built on top of GWT (Google Web Toolkit) – accessible from all major browsers

19 Getting Started
Fire up your favorite browser and connect.
If you installed the Workbench on your local machine, use to access it; otherwise replace "localhost" with the address of the computer where the Workbench is running.
Log in.
You will notice that when the Workbench loads, the browser address changes, appending either "Workbench.html" or "WorkbenchIE.html". If you want to create a bookmark for quick access to the Workbench, use the address ending in "1712" (without the automatically-added context) for maximum flexibility.

20 The Workbench

21 The Workbench

22 The Workbench
The Workspace (used as the main staging area for building/editing flows)
The Details Panel
The Repository
The Output Panel

23 The Workspace Components can be dragged into this region from the “Components” panel and interconnected to create flows.

24 The Flow Toolbar
Provides access to frequently used functions:
Save flows (Save and Save As buttons)
Export as ZigZag or MAU
Copy and paste between flows in this session window
Remove components
Flow execution (Run Flow and Stop Flow buttons)

25 Saving a Flow
Required metadata:
Name
Base URL
Separate tags with commas

26 Copy and Paste Copies components and any connections between the selected components Select component(s) to copy Click the Copy button Click the Paste button in the flow where you want the copies The copied components are still highlighted and can be moved together as a unit. If in the same flow, then placement is on top of original components, but can be dragged and moved.

27 Removing Components Two ways:
Select the component(s) and click “Remove” on the toolbar Right-click the component you want to remove and select “Remove”

28 Controlling Flow Execution
Run Flow
Executes the current flow loaded in the Workspace. Any output from the flow will be displayed in the Output panel. If the flow contains interactive components, they will be displayed automatically.
Important: Please be sure to set your browser to allow pop-ups from the Workbench; otherwise the interactive web components will not display!
Stop Flow
Sends a request to the Meandre server to abort the currently executing flow. This may take a while; the server waits for components to finish their current operation.

29 The Repository Panel Three sections: Components Flows Locations
Searching is supported Display is Customizable: Column selection Sorting Grouping Selecting to view a particular section is done by clicking on the section name, or on the [+] button to the right of the section title. Once a section is expanded, it can be collapsed again by clicking on the [-] button. The Repository panel can be collapsed as well to maximize the screen real estate. This can be accomplished by clicking on the [<<] button. Once collapsed, the panel can be expanded temporarily by clicking on the left side bar; it can be expanded “permanently” by clicking on the [>>] button on the side bar. A “Refresh” button, located in the top right corner of the Repository panel, tells the Workbench to retrieve a new copy of the components and flows from the Meandre server. Click on the Components tab in the Repository panel to view all the components in your Workbench. The components listed are available from the Meandre server specified during login. Components are listed with their Name, Creator, and Date shown by default, but the selection of columns to be displayed can be configured by the user by clicking on the small downward-pointing arrow that appears when hovering the column title (see below) and placing a check mark next to each column to be displayed in the Columns submenu. The icon in the first column identifies the component type as Java, Python or Lisp.

30 Components
Software units designed to accomplish a particular task
May have inputs, outputs, and properties
Components with properties can be identified by a symbol appearing in the lower left-hand side of the component icon
Their power is unleashed when multiple components are logically connected to form a flow (application). In order for components to be connectable, they must define, at a minimum, an input or an output port. These ports represent the communication points to other components. A component can define more than one input and/or output port. The input ports are always on the left, and outputs on the right.

31 Component Categories Components are organized into categories, enabling users to easily identify the functionality of any given component in an application.

32 Flows A Flow is essentially an application — a group of components connected together to perform a set of tasks Click on the Flows tab in the Repository panel to view the flows in your Workbench. Double click on a flow to load that flow into the Workspace.

33 Locations Adding a repository location causes all the components and flows hosted at that location to be imported in the user’s private repository on the server Removing a location also removes the associated components and flows from the server. You can find a list of available repository locations at The Meandre server has the ability to access components and flows that have been uploaded directly to the server (via Meandre Server Interface or the Meandre Development Eclipse Plugin or ant scripts). It also has the ability to load RDF repositories from other Meandre servers or from an RDF file that may exist on a web server. We have created several component and flow repositories that the user can load – their location addresses are available on the seasr.org website. Adding a repository location causes all the components and flows hosted at that location to be imported in the user’s private repository on the server. A repository location can be removed by selecting it and clicking the “Remove location” button. This also removes the components and flows in this repository location from the server.

34 The Details Panel
Shows the properties and description of a selected component or flow
For components, the Description displays information about the component's function.
For flows, the Description displays information about the flow, the components it contains, and their property values.

35 The Output Panel
Displays output and error messages generated by the Workbench
In the figure above, the Output panel shows the result of running the simple flow presented. The resulting "HELLO SEASR" string is the uppercase version of "Hello SEASR", which was set as the value of the "message" property of the "Push String" component.

36 Using the Workspace
Placing Components
The first step in building a flow is to choose components from the Repository panel and place them into the Workspace. To place a component, click on the Components section in the Repository panel and drag the desired component into the Workspace area.
Note: A flow must have at least one component with no inputs to be executable by the Meandre server.
Selecting Components
Components can be selected by single-clicking on them in the Workspace. When a component is selected, other selected items are deselected. While selected, a component can be moved about the Workspace or deleted. A selected component (or flow) can be unselected by using CTRL+click on that component (or flow).

37 Using the Workspace
Labeling Components
Editing the component label only changes the name of the component in the given flow. The label must remain unique among the other component labels in the flow. The label can be edited by single-clicking on it and entering the desired text. Pressing ESC while editing a label cancels the operation and restores the original label.
Connecting and Disconnecting Components
To make a connection, click on the output port of the desired source component (the port you clicked will be colored red), then click on the input port to which you wish to connect. You should now have a line connecting the output and input ports. If, after selecting a port, you wish to cancel the operation, simply click the same port again to unselect it.
The ports of two components should only be connected if their data types are compatible with one another. Any errors resulting from data incompatibilities will occur at runtime.
To remove a connection, right-click one of the ports and select "Disconnect" from the context menu. Alternatively, you can disconnect groups of ports by right-clicking the component and selecting the appropriate menu option.

38 Using the Workspace
Connecting and Disconnecting Components
A component's output port may only be connected to one input port. However, a component's input port may be connected to several different output ports. This can be useful when you are retrieving the same data format from multiple components.
A connection line is highlighted when the user hovers over an input or output port; this is useful for verifying connections in a complex flow. When hovering over a component port, the description of that port is also briefly displayed.

39 Installation

40 Using SEASR-Powered Services
SEASR provides some demo services (a browser is required). You can access them from:
Community Hub, to execute a flow
Zotero, to analyze your collections with existing flows
Meandre Server Client, to execute a flow, or to tune properties and execute a flow. Hosted at
Meandre Workbench, to execute a flow; to tune properties and execute a flow; or to create a flow. Hosted at

41 I Need To Run SEASR on My Laptop
I want to run on my laptop (server) because:
I have copyrighted information
I have a collection for analysis that is too big to be moved
I just want to test it and have fun with it
This requires:
Getting the Meandre server up and running
Getting the Meandre workbench up and running

42 Prerequisites
Oracle Java 1.6+
Scala 2.7.7
Direct link: lang.org/sites/default/files/linuxsoft_archives/downloads/distrib/files/scala final.zip
cd [SCALA DIR]/bin and run this command: chmod +x *
Note: newer versions of Scala will NOT work for now; we are looking into upgrading our code to work with the new versions soon
MongoDB

43 Installation of Meandre 2.0 Server
Create a directory, say /Users/[ME]/Meandre/mongo-data
Launch MongoDB for repository storage, using the directory that you just created: mongod --dbpath=/Users/[ME]/Meandre/mongo-data
Download the Meandre 2.0 Server into a new directory
Unzip the downloaded file
cd [MEANDRE DIR]
Edit run.sh and add the path to Scala
Execute Meandre: ./run.sh
Access your new installation at

44 Installation of Meandre Workbench
Download the jar file 11/Meandre-Workbench jar
Execute it:
Double-click on the jar file
Or use this command: java -Xmx1g -jar Meandre-Workbench jar
Alternatively:
Download the war file
Install your favorite application server
Deploy the war file to the application server
Access your new installation at

45 Adding Locations
Latest Components: Components/repository_components.nt
Demo Flows: lows/demo-all/repository_flows.nt
Example Flow: lows/examples-all/repository_flows.nt
Custom Flows: lows/custom-all/repository_flows.nt

46 NLP Overview

47 SEASR Text Analytics Goals
Address scholarly text analytics needs by:
Efficiently managing distributed literary and historical textual assets
Structuring extracted information to facilitate knowledge discovery
Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich and efficient for analysis
Devising a representation for the extracted information
Devising algorithms for question answering and inference
Developing UIs for effective visual knowledge discovery and data exploration, with query logic separated from application logic
Leveraging existing machine learning approaches for text
Enabling text analytics through SEASR components

48 Text Analytics Definition
Many definitions exist in the literature:
The process of deriving high-quality information from text
The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data
The exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge

49 Sense-Making Stuart Card, PARC, InfoVis 2004 Conference

50 Text Analytics: General Application Areas
Information Retrieval
Indexing and retrieval of textual documents; finding a set of (ranked) documents that are relevant to the query
Information Extraction
Extraction of partial knowledge in the text
Sentiment Analysis
Identifying sentiment from a corpus
Web Mining
Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Classification
Predicting a class for each text document
Clustering
Generating collections of similar text documents
Question Answering, Automatic Summarization, Machine Translation, and many more…

51 Text Analytics Process

52 Text Analytics Process
Text Preprocessing
Syntactic Text Analysis
Semantic Text Analysis
Features Generation
Bag of Words
Ngrams
Feature Selection
Simple Counting
Statistics
Selection based on POS
Text/Data Analytics
Classification: Supervised Learning
Clustering: Unsupervised Learning
Information Extraction
Analyzing Results
Visual Exploration, Discovery and Knowledge Extraction
Query-based question answering

53 Text Representation
Many machine learning algorithms need numerical data, so text must be transformed.
Determining this representation can be challenging.
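A common such transformation is the bag-of-words representation (discussed later under "Features Generation"); a minimal sketch, with function names of my own choosing, treats each word as a dimension and counts occurrences per document:

```python
def bag_of_words(documents):
    """Turn raw text documents into count vectors over a shared vocabulary."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in documents:
        vec = [0] * len(vocab)
        for w in doc.lower().split():
            vec[index[w]] += 1   # each word is one dimension of the vector
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words(["the lord of the rings", "lord of war"])
# vocab is ['lord', 'of', 'rings', 'the', 'war']; "the" occurs twice in doc 1
```

Note how quickly the dimensionality grows with the vocabulary, which is exactly the high-dimensionality challenge described on the next slides.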

54 Text Characteristics (1)
Large textual databases
Enormous wealth of textual information on the Web
Publications are electronic
High dimensionality
Consider each word/phrase as a dimension
Noisy data
Spelling mistakes
Abbreviations
Acronyms
Text messages are very dynamic
Web pages are constantly being generated (and removed)
Web pages are generated from database queries
Not well-structured text
Chat rooms/Twitter/Blogs
"r u available ?"
"Hey whazzzzzz up"

55 Text Characteristics (2)
Dependency
Relevant information is a complex conjunction of words/phrases
Order of words in the query matters:
"hot dog stand in the amusement park" vs. "hot amusement stand in the dog park"
Ambiguity
Word ambiguity
Pronouns (he, she, …)
Synonyms (buy, purchase)
Multiple meanings (bat: related to baseball, or a mammal)
Semantic ambiguity
"The king saw the monkey with his glasses." (multiple meanings)
Authority of the source
IBM is more likely to be an authoritative source than my second cousin

56 Feature Selection
Reduce dimensionality
Learners have difficulty addressing tasks with high dimensionality
Irrelevant features
Not all features help!
Remove features that occur in only a few documents
Reduce features that occur in too many documents
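The document-frequency heuristic above can be sketched in a few lines (the thresholds `min_df` and `max_df_ratio` are illustrative choices, not values prescribed by SEASR):

```python
def select_features(docs_tokens, min_df=2, max_df_ratio=0.9):
    """Keep terms that appear in at least min_df documents but in at
    most max_df_ratio of all documents (rare and ubiquitous terms are
    unlikely to help a learner)."""
    n = len(docs_tokens)
    df = {}
    for tokens in docs_tokens:
        for t in set(tokens):            # count each term once per document
            df[t] = df.get(t, 0) + 1
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
kept = select_features(docs)
# "the" occurs in every document (too common); "dog" and "ran" occur
# in only one (too rare); "cat" and "sat" survive
```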

57 Syntactic Analysis
Tokenization
A text document is represented by the words it contains (and their occurrences), e.g., "Lord of the rings" → {"the", "Lord", "rings", "of"}
Highly efficient
Makes learning far simpler and easier
Order of words is not that important for certain applications
Lemmatization/Stemming
Involves the reduction of corpus words to their respective headwords (i.e., lemmas)
Removes suffixes, prefixes, and infixes to reduce a word to its root
Reduces dimensionality
Identifies a word by its root, e.g., flying, flew → fly
Bigrams and trigrams
Retain semantic content

Shallow
Some methods of QA use keyword-based techniques to locate interesting passages and sentences from the retrieved documents and then filter based on the presence of the desired answer type within that candidate text. Ranking is then done based on syntactic features such as word order or location and similarity to the query. When using massive collections with good data redundancy, some systems use templates to find the final answer, in the hope that the answer is just a reformulation of the question. If you posed the question "What is a dog?", the system would detect the substring "What is a X" and look for documents which start with "X is a Y". This often works well on simple "factoid" questions seeking factual tidbits of information such as names, dates, locations, and quantities.
Deep
However, in cases where simple question reformulation or keyword techniques will not suffice, more sophisticated syntactic, semantic, and contextual processing must be performed to extract or construct the answer. These techniques might include named-entity recognition, relation detection, coreference resolution, syntactic alternations, word sense disambiguation, logic form transformation, logical inference (abduction) and commonsense reasoning, temporal or spatial reasoning, and so on. These systems will also very often utilize world knowledge found in ontologies such as WordNet or the Suggested Upper Merged Ontology (SUMO) to augment the available reasoning resources through semantic connections and definitions. More difficult queries, such as Why or How questions, hypothetical postulations, spatially or temporally constrained questions, dialog queries, and badly-worded or ambiguous questions, all need this deeper understanding of the question. Complex or ambiguous document passages likewise need more NLP techniques applied to understand the text. Statistical QA, which introduces statistical question processing and answer extraction modules, is also growing in popularity in the research community. Many of the lower-level NLP tools used, such as part-of-speech tagging, parsing, named-entity detection, sentence boundary detection, and document retrieval, are already available as probabilistic applications.
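The tokenization, stemming, and n-gram steps above can be sketched as follows. The suffix stripper here is a deliberately crude stand-in for a real stemmer (such as Porter's), included only to show the idea of reducing a word toward its root:

```python
def tokenize(text):
    # Split on whitespace after lowercasing; real tokenizers also
    # handle punctuation, contractions, etc.
    return text.lower().split()

def crude_stem(word):
    # Toy suffix removal only; real stemmers are far more careful.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bigrams(tokens):
    # Adjacent token pairs: retain some word-order information.
    return list(zip(tokens, tokens[1:]))

tokens = tokenize("Lord of the rings")
pairs = bigrams(tokens)
```

Note that stemming alone cannot map "flew" to "fly"; that irregular case needs lemmatization, which looks words up by their headword rather than trimming suffixes.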

58 Syntactic Analysis
Stop words
Identifies the most common words, which are unlikely to help with text analytics, e.g., "the", "a", "an", "you"
Identifies context-dependent words to be removed, e.g., "computer" from a collection of computer science documents
Scaling words
Important words should be scaled upwards, and vice versa
TF-IDF stands for the product of Term Frequency and Inverse Document Frequency
Parsing / Part-of-Speech (POS) tagging
Generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph
Finds the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun)
Shallow parsing: analysis of a sentence which identifies the constituents (noun groups, verbs, ...) but does not specify their internal structure, nor their role in the main sentence
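The TF-IDF weighting named above can be sketched directly from its definition (this uses the plain `tf * log(N/df)` form; practical implementations often add smoothing):

```python
import math

def tf_idf(docs_tokens):
    """Weight each term by term frequency times inverse document frequency:
    words frequent in a document but rare across the collection score high."""
    n = len(docs_tokens)
    df = {}
    for tokens in docs_tokens:
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    weights = []
    for tokens in docs_tokens:
        tf = {t: tokens.count(t) / len(tokens) for t in set(tokens)}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
w = tf_idf(docs)
# "the" appears in every document, so log(2/2) = 0 and its weight vanishes,
# which is why stop words tend to score near zero under TF-IDF
```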

59 Demonstration
We will demonstrate the use of the Workbench for creating flows, using the "Tag Cloud Viewer" flow as an example and explaining how it was created.

60 Learning Exercises – Installation
Explore the functionality of the Meandre Workbench
Open the Meandre Workbench (WB) by navigating to
Examine the "demo" flows, either in the Community Hub or the Meandre Workbench
If in the Workbench, add the repository for Demo Flows: emo-all/repository_flows.nt

61 Learning Exercises – Demo Flows
Demo Tokens
Demo Token Counts
Demo Token Counts Filter Top 200
Demo Token Counts Filter Stop Words
Demo 2gram Token Counts
Demo 2gram Token Counts Filter Stop Words
Demo 3gram Token Counts Filter Stop Words

62 Learning Exercises – Demo Flows
Demo Stemming
Demo POS
Demo POS Nouns
Demo POS Verbs
Demo POS Adjectives Counts

63 Learning Exercise – Build Tag Cloud
Use existing components to create a data-driven flow for a basic Tag Cloud Viewer, so you can become familiar with the mechanics of drag-and-drop, creating connections, setting properties, saving, and executing
Create a new tab in the WB by clicking on the first tab (with the yellow star)

64 Learning Exercise – Data Loading
Retrieve text from a URL
Expand the Components section of the WB (click on the + sign)
Find the component named "Push Text" (scroll down or use the search box) and drag it onto the workspace
Find the component named "Universal Text Extractor" and add it to the flow, as before
Connect the output port "text" of "Push Text" to the input port "location" of "Universal Text Extractor" (click on each port to make a connection)
Change the property named "message" of "Push Text" to contain the URL you've selected

65 Learning Exercise – Data Analysis
Count the words
Find the components "OpenNLP Tokenizer" and "Token Counter" and add them to the flow, as before
Connect the output port "text" of "Universal Text Extractor" to the input port "text" of "OpenNLP Tokenizer"
Connect the output port "tokens" of "OpenNLP Tokenizer" to the input port "tokens" of "Token Counter"

66 Learning Exercise – Data Visualization
Visualize with the Tag Cloud Viewer
Find the components "Tag Cloud" and "HTML Viewer" and add them to the flow
Connect the output port "token_counts" of "Token Counter" to the input port of "Tag Cloud"
Connect the output port "html" of "Tag Cloud" to the input port of "HTML Viewer"

67 Learning Exercises – Data Normalization
Improve the Tag Cloud flow that you created by "cleaning" it up a bit: convert all words to lower case
Find the component "To Lowercase" and add it to the flow, connecting it between "Universal Text Extractor" and "OpenNLP Tokenizer"
Click the output port "text" of "Universal Text Extractor" and then click the input port "text" of "To Lowercase" (this will remove the existing connection between "Universal Text Extractor" and "OpenNLP Tokenizer")
Connect the output port of "To Lowercase" to the appropriate port of "OpenNLP Tokenizer"

68 Learning Exercise – Data Cleaning
Remove stop words
Add another "Push Text", "Universal Text Extractor" and "OpenNLP Tokenizer" to the flow, and connect them
Set the "message" property of this second "Push Text" to read " (no quotes)
Find and add the component "Token Filter" between "Token Counter" and "Tag Cloud"
Connect the output port "token_counts" of "Token Counter" to the input port "token_counts" of "Token Filter"
Connect the output port "token_counts" of "Token Filter" to the input port of "Tag Cloud"
Connect the output port "tokens" of the second "OpenNLP Tokenizer" to the input port "tokens_blacklist" of "Token Filter"

69 Learning Exercise – Data Filtering
Filter to a specific number of words
Find and add the component "Top N Filter" between "Token Filter" and "Tag Cloud"
Connect the output port "token_counts" to the input port of "Top N Filter"
Connect the output port of "Top N Filter" to the input port of "Tag Cloud"
Set the property "n_top_tokens" of "Top N Filter" to the number of top tokens to be displayed (ranked by token count)

70 Discussion Questions
What challenges (if any) would scholars have installing the SEASR software?
Do you see your institution's IT department running the SEASR environment, or would it be your research group?
What are three advantages of using a component-driven environment for text analytics?
What are the possible obstacles for humanities scholars in using an environment like the Meandre Workbench to assemble and create flows for accomplishing their research needs?
Are there parts of the workbench that are unclear or need extra explanation?
Do you have any feature requests?
Are there any tools that you would like to see componentized so that you can work with them in the Meandre Workbench?

