Tagging documents made easy, using machine learning


1 Tagging documents made easy, using machine learning
Brendan Clarke

2 PART ONE – APPROACHES FOR BUILDING TAXONOMIES

3 TOP DOWN - APPROACH
Defines top-level containers and works downwards
Usually broad (3-10 wide) and shallow (3-4 deep)
Simple, high-level classification (functional)

A top-down approach defines containers for terms, usually starting with some global taxonomies such as locations, departments or products (used throughout the business). It produces lots of level 1 and 2 term sets that define the function of the document, for example Departments -> HR. Level 3 may begin to define the content itself, for example Departments -> HR -> Policy Documents. This works well to classify content into the right areas. This is functional classification.
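
As a rough illustration of the Departments -> HR -> Policy Documents hierarchy described above, a top-down taxonomy can be modelled as a nested mapping of terms to child terms. This is only a sketch: the term names other than that one path, and the helper functions `term_paths` and `depth`, are made up for the example and are not taken from any SharePoint or TermSet API.

```python
# Hypothetical top-down taxonomy: each term maps to its child terms.
# Only "Departments -> HR -> Policy Documents" comes from the slides;
# the other terms are illustrative assumptions.
TAXONOMY = {
    "Departments": {
        "HR": {"Policy Documents": {}},
        "Finance": {},
    },
    "Locations": {"London": {}, "Dublin": {}},
}


def term_paths(tree, prefix=()):
    """Yield every term as a tuple path from the top-level container down."""
    for term, children in tree.items():
        path = prefix + (term,)
        yield path
        yield from term_paths(children, path)


def depth(tree):
    """Return the number of levels in the taxonomy (0 for an empty tree)."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())


paths = [" -> ".join(p) for p in term_paths(TAXONOMY)]
for line in paths:
    print(line)
print("levels:", depth(TAXONOMY))
```

Checking `depth` against the "3-4 deep" guideline is one simple way to notice when a committee-built taxonomy is growing deeper than intended.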

4 TOP DOWN – TERMS
Manually defined or replicated from existing structures
Imported from other systems
Industry standards / purchased taxonomies

Often terms are defined by committees who involve specialist groups to define terms. Line of business systems or databases may contain data that can be imported. SKOS is interesting for advanced taxonomies. WAND is off the shelf.

5 TOP DOWN – SUMMARY
People / committee driven approach
Some guesswork of what terms should be
Simple, high-level classification (functional) – way better than folders!

The challenge with deciding terms without looking at your documents is that it will be guesswork to know what would be effective. That said, a simple top-down taxonomy is 10x better than a folder structure. There is no duplication, as documents can be tagged within multiple areas.

6 BOTTOM UP - APPROACH
Terms driven by the words and phrases within your content
More complex taxonomies
Detailed, accurate terms that are subject or facet level

Bottom up means looking at the information you have in your content (usually documents and emails) and building taxonomies that are based on how you actually describe information. Bottom up results in a taxonomy that can describe the subject or facet of the document.

7 BOTTOM UP - TERMS
Manual analysis of the documents
Statistical analysis of terms and phrases
Natural language processing

How long does it take for people to read and process documents? Getting a working team of people to actually read documents is time consuming and expensive, but if the information is valuable it may sometimes be worth it. There are tools that can analyse the frequency of words or phrases in your documents. They can be highly effective but need a lot of consultancy to make sense of the results. NLP is the future of text analysis (more later).
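
To make the "statistical analysis of terms and phrases" idea concrete, here is a minimal sketch that counts single words and two-word phrases across a handful of documents. It is not how any particular product works: the stopword list, the sample documents and the `candidate_terms` helper are all assumptions made up for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real tools use larger ones).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokens(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def candidate_terms(documents, top_n=5):
    """Count single words and two-word phrases across all documents.

    Phrases are built after stopwords are removed, so a pair may span a
    dropped word. That is acceptable for a rough frequency sketch.
    """
    counts = Counter()
    for text in documents:
        words = tokens(text)
        counts.update(words)
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts.most_common(top_n)


# Made-up sample documents.
docs = [
    "The annual leave policy applies to all staff.",
    "Requests for annual leave go to your manager.",
    "The expenses policy covers travel.",
]

for term, count in candidate_terms(docs):
    print(count, term)
```

Even on three sentences, "annual leave" and "policy" surface as repeated candidates. Deciding which candidates deserve to become taxonomy terms is the part that still needs human judgement.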

8 BOTTOM UP - SUMMARY
Technology driven approach (or a very tough people process)
Produces detailed taxonomies that reflect the actual content
Extra granularity of tagging

A bottom-up approach can be used to describe the contents of the documents (not just the area).

9 AND THE WINNER IS…
Combining top down and bottom up is the best approach
Top down classifies the type of documents
Bottom up classifies the subject of the document
New technology allows bottom up to be realistic

10 TermSet adds accurate, consistent metadata without placing any burden on end users or your IT team.
Builds taxonomies (bottom up) using NLP
Applies tags
Metadata as a service TM

TermSet has a different approach. It manages every step of adding metadata to your SharePoint content. Projects can be completed in days or weeks instead of months or years. The application uses machine learning that can build over 400 taxonomies that relate to your data. You can also easily train it to apply tags that are important to you. A full list of features is available at

11 WHAT EXACTLY IS NLP?
Natural language processing is at the core of TermSet. We have an engine trained to recognise entities within documents. (First click) This is a BBC news article; when our engine reads the text it identifies entities such as people, locations and organisations. (Second click) In fact, we identify a vast array of information inside the documents, including concepts, sentiment and relationships.

12 DEMO – CREATING TERMS FROM YOUR DOCUMENTS USING NLP

13 PART TWO – APPLYING YOUR TAGS

14 MANUAL TAGGING
Adoption problem
Asbestos problem / GIGO
Challenging to do retrospectively (migration tools can help)

Every time you add a field that needs to be completed in order to save a document, you are impeding adoption of a new DMS. If you do mandate fields, many users will pick the first item on the list, or just randomly pick anything, in order to save the document. What do you do with the 1 million documents that came from a file share (or any other source without metadata)?

15 MANUAL TAGGING
Infer as many terms as possible from: document types, location, function
Mandate as few tags as possible
Stay shallow or flat with hierarchies

Manually tagging new content can work well. Always use default values to answer as many questions as possible before the user is involved (infer the metadata wherever possible). Keeping it simple is a good plan. Single lookup columns may be better than deep hierarchies.
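
One way to picture "infer the metadata wherever possible" is a small function that fills in default tags from where a file lives and what kind of file it is, before the user is asked anything. This is a sketch under stated assumptions: the field names, the extension mapping and the `infer_defaults` helper are hypothetical and do not correspond to SharePoint column defaults or any TermSet feature.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a document type tag.
DOC_TYPES = {
    ".docx": "Document",
    ".xlsx": "Spreadsheet",
    ".pptx": "Presentation",
}


def infer_defaults(path):
    """Infer default metadata from a library-relative path.

    Assumes the first folder names the business function (for example "HR")
    and the file extension indicates the document type.
    """
    p = PurePosixPath(path)
    defaults = {}
    if len(p.parts) > 1:
        # The top-level folder stands in for the function.
        defaults["Function"] = p.parts[0]
    doc_type = DOC_TYPES.get(p.suffix.lower())
    if doc_type:
        defaults["Document type"] = doc_type
    return defaults


print(infer_defaults("HR/Policies/leave.docx"))
print(infer_defaults("notes.txt"))
```

Anything the function cannot infer is simply left out, so the user is only asked about the fields that remain.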

16 MACHINE TAGGING
Simple machine tagging can use search to match taxonomy terms to the content of documents
More advanced taggers allow rules or weights to be assigned to each tag (tags not context aware)
New technologies (NLP) provide a new approach to creating taxonomies

There are a number of taggers for SharePoint that will look at your documents and apply tags from a taxonomy that you have defined. Some taggers ask for rules to be defined for each term (this can work well, but takes forever to get right).
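
The rule-and-weight style of tagger described above can be sketched in a few lines. This is an illustration only, not the behaviour of any specific SharePoint tagger: the `RULES` table, the weights, the threshold and the function names are all assumptions invented for the example.

```python
import re

# Hypothetical rules: each tag has phrases with hand-assigned weights.
RULES = {
    "Annual Leave": {"annual leave": 2.0, "holiday": 1.0},
    "Expenses": {"expenses": 2.0, "receipt": 1.0},
}


def score_tags(text, rules=RULES):
    """Sum weight * occurrences of each rule phrase found in the text."""
    lowered = text.lower()
    scores = {}
    for tag, phrases in rules.items():
        total = 0.0
        for phrase, weight in phrases.items():
            pattern = r"\b" + re.escape(phrase) + r"\b"
            total += len(re.findall(pattern, lowered)) * weight
        if total:
            scores[tag] = total
    return scores


def apply_tags(text, threshold=2.0):
    """Return the tags whose score reaches the threshold, sorted by name."""
    return sorted(tag for tag, s in score_tags(text).items() if s >= threshold)


print(apply_tags("Submit your annual leave request before the holiday."))
print(apply_tags("Keep the receipt."))
# Not context aware: a document saying nothing was claimed still matches.
print(apply_tags("No expenses were claimed this month."))
```

The last call shows the limitation noted on the slide: phrase matching has no sense of context, which is why tuning rules and weights per term can take a long time to get right.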

17 TERMSET TAGGING
TermSet recommends the right taxonomies for each library (context aware tagging)
TermSet automates building the underlying IA in SharePoint
Extra cool NLP tags can be added (summaries, sentiment and language)
Monitors for new documents and terms arriving into your world

18 DEMO – TAGGING DOCUMENTS

19 WRAP UP
TermSet automates a bottom-up approach to create and use taxonomies for SharePoint
Visit or for a free licence
If you need assistance with top-down taxonomies or you use a different DMS, please email me to join the beta program for

