Apache Tika End-to-End: An introduction to Apache Tika, and integrating it into your application.

Nick Burch, Software Engineer, Alfresco

Apache Tika
- Project which started in 2006
- Grew out of the Lucene community, now widely used
- Provides detection of files, e.g. this binary blob is really a Word file, that one is UTF-8 plain text
- Provides plain text, HTML and XHTML versions of a wide range of different file formats
- Provides consistent metadata from different files
- Hides the complexity of the different formats and their libraries, and instead presents a simple, powerful API
- Easy to use and extend
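
To make the detection bullet concrete, here is a minimal sketch of the general idea of sniffing a file's type from its leading bytes. It is an illustration only and not Tika's implementation: the class name `MagicSniffer`, the method names and the returned type strings are all assumptions made for this example, and it uses nothing beyond the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified magic-byte sniffer; not Tika's detector. */
class MagicSniffer {
    // Leading bytes of an OLE2 compound document (legacy .doc, .xls, .ppt)
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header of a Zip archive (also the wrapper for OOXML, ODF, Epub)
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    // Start of a PDF header
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};

    /** True if data begins with exactly the bytes in magic. */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** True if the whole array decodes as UTF-8 without errors. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns a best-guess type label for the given bytes. */
    static String sniff(byte[] data) {
        if (data.length == 0) {
            return "application/octet-stream";
        }
        if (startsWith(data, OLE2)) {
            // Generic label: telling Word from Excel needs a look inside the container
            return "application/x-ole-storage";
        }
        if (startsWith(data, ZIP)) {
            return "application/zip";
        }
        if (startsWith(data, PDF)) {
            return "application/pdf";
        }
        if (isValidUtf8(data)) {
            return "text/plain; charset=UTF-8";
        }
        return "application/octet-stream";
    }
}
```

Calling `MagicSniffer.sniff(bytes)` on the first few kilobytes of a file gives a rough answer. Container formats such as OLE2 and Zip only reveal their real type once the entries inside are inspected, which is the job a container-aware detector does.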

What's new?
- Lots of new parsers: text, office formats, publishing formats, images, audio, CAD, fonts etc
- Long-standing parsers improved: better HTML from Word, for example
- Embedded resources and containers
- Usage is expanding: used by many Solr users, Alfresco, and lots of people crunching masses of data on Hadoop

Supported Formats Page 1
- Audio: WAV, RIFF, MIDI
- DWG (CAD)
- Epub
- RSS and ATOM Feeds
- True Type Fonts
- HTML
- Images: JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
- iWork (Keynote, Pages etc)
- RFC822 mbox Mail

Supported Formats Page 2
- Microsoft Outlook .msg
- Microsoft Office (Binary): Word, PowerPoint, Excel, Visio, Publisher, Works
- Microsoft Office (OOXML): Word, PowerPoint, Excel
- MP3 (id3 v1 and v2)
- CDF (Scientific Data)
- Open Document Format (Open Office)
- Old-style Open Office (.sxw etc)
- PDF

Supported Formats Page 3
- Zip and Tar archives
- RDF
- Plain Text
- FLV Video
- XML
- Java class files
- And I probably forgot one...!

Metadata
- Tika provides consistent metadata across the range of parsers
- No need to know if it's “Last Author”, “Last Editor” or “Previous Author” in a file format, they all come back with the same metadata key
- Keys and values are strings, but strongly typed metadata entries provide converters to dates, ints etc
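
The two ideas on this slide (one key for many format-specific names, and string storage with typed readers) can be modelled in a few lines. The sketch below is a toy model and not Tika's `Metadata` class: `MetadataSketch`, the `last-author` key, the alias table and the method names are all hypothetical, chosen only to illustrate the behaviour described above.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy model of consistent, string-backed metadata. */
class MetadataSketch {
    // Assumed common key; the real key names are defined by the library
    static final String LAST_AUTHOR = "last-author";

    // Assumed alias table: format-specific names folded into one key
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("Last Author", LAST_AUTHOR);
        ALIASES.put("Last Editor", LAST_AUTHOR);
        ALIASES.put("Previous Author", LAST_AUTHOR);
    }

    // Everything is stored as a string, as on the slide
    private final Map<String, String> values = new HashMap<>();

    /** Stores a value under the common key for a format-specific name. */
    void setRaw(String formatSpecificKey, String value) {
        values.put(ALIASES.getOrDefault(formatSpecificKey, formatSpecificKey), value);
    }

    /** Returns the raw string value, or null if absent. */
    String get(String key) {
        return values.get(key);
    }

    /** Converts the stored string to an Integer, or null if absent or not a number. */
    Integer getInt(String key) {
        String value = values.get(key);
        if (value == null) {
            return null;
        }
        try {
            return Integer.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Converts a stored ISO-8601 instant to an Instant, or null if absent or unparseable. */
    Instant getDate(String key) {
        String value = values.get(key);
        if (value == null) {
            return null;
        }
        try {
            return Instant.parse(value.trim());
        } catch (DateTimeParseException e) {
            return null;
        }
    }
}
```

A parser for one format might call `setRaw("Last Editor", name)` while another calls `setRaw("Previous Author", name)`. Client code reads `get(MetadataSketch.LAST_AUTHOR)` in both cases and never sees the difference.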

Text Content
- Tika generates HTML-like SAX events as it parses
- Uses the Java SAX API
- Events can be captured or transformed
- BodyContentHandler is used for plain text
- HTML and XHTML output are also available
- Can customise with your own handler, with XSLT or with E4X from JavaScript, e.g. HTML Table → CSV
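
As a sketch of the "own handler" option, the class below turns table events into CSV. To stay self-contained it uses only the JDK: the SAX events come from the JDK's XML parser reading a hand-written XHTML string, which stands in for the events a parser like Tika would emit. The class name `TableToCsvHandler` and the `convert` helper are assumptions made for this example, and nested tables are not handled.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler that collects table rows as CSV lines. */
class TableToCsvHandler extends DefaultHandler {
    private final StringBuilder csv = new StringBuilder();
    private final List<String> row = new ArrayList<>();
    private StringBuilder cell; // non-null only while inside a <td> or <th>

    // Falls back to the qualified name when the parser is not namespace aware
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = name(localName, qName);
        if (name.equals("tr")) {
            row.clear();
        } else if (name.equals("td") || name.equals("th")) {
            cell = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside table cells is ignored
        if (cell != null) {
            cell.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = name(localName, qName);
        if ((name.equals("td") || name.equals("th")) && cell != null) {
            row.add(quote(cell.toString().trim()));
            cell = null;
        } else if (name.equals("tr")) {
            csv.append(String.join(",", row)).append('\n');
        }
    }

    // Quotes a field only when it contains a comma, a quote or a line break
    private static String quote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    @Override
    public String toString() {
        return csv.toString();
    }

    /** Parses a well-formed XHTML string and returns its table cells as CSV. */
    static String convert(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            TableToCsvHandler handler = new TableToCsvHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

Because the class extends `DefaultHandler`, it is also a `ContentHandler`, so the same object could in principle be handed to any API that accepts a SAX `ContentHandler` in place of the plain text handler.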

Calling Tika

    // Get a content detector, and an auto-selecting Parser
    TikaConfig config = TikaConfig.getDefaultConfig();
    ContainerAwareDetector detector = new ContainerAwareDetector(
            config.getMimeRepository());
    Parser parser = new AutoDetectParser(detector);

    // We'll only want the plain text contents
    ContentHandler handler = new BodyContentHandler();

    // Tell the parser what we have
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

    // Have it processed
    parser.parse(input, handler, metadata, new ParseContext());

    // Plain text only content handler
    ContentHandler handler = new BodyContentHandler();
    // ... call parser.parse(input, handler, metadata, context), then:
    String text = handler.toString();

    // XHTML content handler (an alternative to the handler above)
    SAXTransformerFactory factory =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
    handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
    StringWriter sw = new StringWriter();
    handler.setResult(new StreamResult(sw));
    // ... call parser.parse(input, handler, metadata, context), then:
    String text = sw.toString();

Tika Parsers

Parser Interface
Two key methods: which mime types are supported, and do the parsing

    public interface Parser {
        Set<MediaType> getSupportedTypes(ParseContext context);

        void parse(InputStream stream, ContentHandler handler,
                   Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException;
    }

    public class HelloWorldParser implements Parser {
        // Advertise the single (made-up) mime type this parser handles
        public Set<MediaType> getSupportedTypes(ParseContext context) {
            Set<MediaType> types = new HashSet<MediaType>();
            types.add(MediaType.parse("hello/world"));
            return types;
        }

        // Emit a fixed XHTML document and two metadata entries
        public void parse(InputStream stream, ContentHandler handler,
                          Metadata metadata, ParseContext context)
                throws SAXException {
            XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
            xhtml.startDocument();
            xhtml.startElement("h1");
            xhtml.characters("Hello, World!");
            xhtml.endElement("h1");
            xhtml.endDocument();

            metadata.set("hello", "world");
            metadata.set("title", "Hello World!");
        }
    }

Demo: Tika-App

Demo: Geo-Tagged Images in Alfresco Share via Tika

Any Questions?