Johannes Peter, MediaMarktSaturn Retail Group

Presentation transcript:

Meeting complex data load and data preparation challenges for search applications with Apache NiFi
Johannes Peter, MediaMarktSaturn Retail Group, Product Owner / Architect for Search. Previously an external consultant in different companies for search and NLP projects in the context of e-commerce and enterprise search.

Challenges for search applications
People: hiring, coordinating, educating, motivating, ...
E-commerce and enterprise search: scaling, access rights, campaigns, performance, relevancy
Data: extraction, transformation, preprocessing, enrichment
Several challenges have to be addressed before the search engine ever sees the data. In all search projects I was involved in that aimed to develop a search solution, the discussions at the beginning were not really about search, but rather about what has to be done before the data can be pushed into the search engine.

Data: expected vs. actual
In many projects, data is expected to be easily retrievable, well-structured, reliable, valid, complete, ... In the end, however, a lot of time is required to tame it, and a lot of time is spent before the data is ready to be pushed into the search engine.

Typical data preprocessing issues
Different source systems
Enrichments
Transformations, standardizations
Bottlenecks, scaling
...
Different source systems come with different characteristics and different security mechanisms. Enrichments are often lookups, e.g. product data that shall be enriched with price data. Standardizations include things like release dates (available / deliverable / ordered in advance). Bottlenecks arise from expensive operations, and scaling is needed for very frequent updates and big data volumes. For search, these issues are particularly challenging because tools specialized in such operations are frequently only capable of processing structured data, whereas search solutions usually have to handle semi-structured or unstructured data. NiFi is well suited for search: it provides solutions for all these issues and can in principle handle all kinds of data.

Essentials of NiFi
FlowFile = metadata + content
The principle: pull data, transform it, and push it. Data in NiFi is contained in FlowFiles, which consist of metadata (key-value pairs) and content (smaller or bigger pieces of data). Both the metadata and the content can be transformed in processors. Each processor is intended to fulfil one designated task: some extract data from a source system such as a relational database, Hadoop, or a NoSQL store; others transform or enrich data, e.g. by changing the format or structure, or by performing lookups based on FlowFile content; and finally, various processors push data into target applications (including Solr and Elasticsearch). Processors are combined via queues into dataflows: FlowFiles are created or changed by a processor and then put into a queue, where they wait until they are processed by the next processor.
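NiFi processors are implemented in Java and dataflows are assembled in the UI, so the following is only a language-neutral sketch of the FlowFile concept described above, not NiFi code: a FlowFile pairs key-value metadata with an opaque content payload, and a processor emits a new or changed FlowFile.

```python
from dataclasses import dataclass, field

# Illustrative model only -- not the NiFi API. A FlowFile pairs
# key-value metadata (attributes) with an opaque content payload.
@dataclass
class FlowFile:
    attributes: dict = field(default_factory=dict)
    content: bytes = b""

def update_attribute(flowfile: FlowFile, key: str, value: str) -> FlowFile:
    """Mimics an attribute-manipulating processor: emit a FlowFile
    with changed metadata while leaving the content untouched."""
    new_attrs = {**flowfile.attributes, key: value}
    return FlowFile(attributes=new_attrs, content=flowfile.content)

ff = FlowFile(attributes={"filename": "products.json"}, content=b'{"sku": 1}')
ff2 = update_attribute(ff, "target.index", "products")
```

A real dataflow chains many such steps through queues; the point here is only the separation of cheap-to-inspect metadata from the payload.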

Records
Processor = Reader + Writer
xml / json / csv / text / ... → serialize (Avro) → write
Because NiFi has various processors to interact with other systems, you frequently have to deal with different formats, e.g. XML, CSV, JSON or plain text, and when data is exchanged between systems, format transformations are frequently required. Without an abstraction, a separate processor would be needed for every possible transformation, e.g. JSON to XML, CSV to Avro, or JSON to differently structured JSON. All these formats have in common that they usually contain several individual entries. To avoid this, NiFi provides an abstraction: record readers and record writers. The reader reads, e.g., XML or JSON data and serializes it into records, whereas the record writer writes the serialized records out in the desired target format.
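The reader/writer split above can be illustrated outside NiFi (this is not NiFi code): a reader parses a source format into neutral records, and a writer serializes the same records into any target format, so N readers and M writers replace N×M format-to-format converters.

```python
import csv
import io
import json

def json_reader(payload: str) -> list[dict]:
    """Reader: parse a JSON array into a neutral list of records."""
    return json.loads(payload)

def csv_writer(records: list[dict]) -> str:
    """Writer: serialize records as CSV with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()

# Any reader can be combined with any writer: JSON in, CSV out.
rows = json_reader('[{"sku": "1", "price": "9.99"}]')
print(csv_writer(rows))
```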

Schema (xml) FlowFile
All the user has to do to achieve this is to define an Avro schema.

Schema (json) FlowFile
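Record readers and writers are configured with an Avro schema describing the fields of each record; a minimal example (the field names here are illustrative, not from the talk) might look like:

```json
{
  "type": "record",
  "name": "Product",
  "fields": [
    { "name": "sku",   "type": "string" },
    { "name": "title", "type": "string" },
    { "name": "price", "type": ["null", "double"], "default": null }
  ]
}
```

The same schema then serves both the XML and the JSON FlowFiles shown on these slides, since it describes the records rather than their serialization.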

Operations on serialized records
Transformations
Lookups in source systems
SQL
Validations
Merge, partition, split
A lot of out-of-the-box functionality, plus frameworks for customizations.
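As one example of the SQL bullet: NiFi's QueryRecord processor runs SQL directly against the records of a FlowFile, which are exposed as a table named FLOWFILE. A query along these lines (field names are illustrative) could filter and project records before indexing:

```sql
-- Keep only in-stock products and project the fields the index needs
SELECT sku, title, price
FROM FLOWFILE
WHERE price IS NOT NULL AND in_stock = true
```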

Show-case

Conclusion
Dataflows can be created quickly: they are configured via the UI.
Customizable: NiFi includes well-defined APIs for integrating self-developed processors.
Scalable: bottlenecks can be removed by allowing NiFi to run certain processors with multiple threads.
Data provenance: FlowFiles are tracked, so there are very good opportunities to reconstruct the life of a FlowFile.
Well suited for search: interaction with Solr and Elasticsearch is supported by various processors.

Thank you! peterj@mediamarktsaturn.com