
Big Data Quality Panel. Norman Paton, University of Manchester.


1 Big Data Quality Panel. Norman Paton, University of Manchester.

2 Outline
– Existing approaches to quality and big data.
– The questions.

3 Existing Approaches. How well do they do? Not just one V at a time.

4 Extract, Transform and Load
Volume:
– Compile down for data scale.
– Expensive for numerous sources (e.g. web data extraction).
Velocity:
– Manual, so not so good for rapid source change.
Variety:
– Manual, so not good for huge diversity.
Veracity:
– Provides a framework and tools, but lots of hand crafting.
Talend: www.talend.com
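
To make the hand-crafting point concrete, here is a minimal sketch of the kind of per-source ETL step such tools generate or wrap. This is not Talend's API; the file name, column names and cleaning rules are hypothetical, chosen only to show where the manual effort goes.

```python
# Hand-crafted ETL step: extract rows from one CSV source, apply simple
# transformations, and load them into a SQLite target. All names and
# rules are illustrative assumptions, not Talend's API.
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from one specific source file.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(row):
    # Transform: each rule is a per-source decision that has to be hand
    # written and then maintained whenever the source changes.
    row["name"] = row["name"].strip().title()
    row["price"] = float(row["price"].replace(",", "."))
    return row

def load(rows, db="warehouse.db"):
    # Load: append the cleaned rows to the target table.
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS products(name TEXT, price REAL)")
    con.executemany("INSERT INTO products VALUES (:name, :price)", rows)
    con.commit()
    con.close()

load(transform(r) for r in extract("source.csv"))
```

Every new or changed source needs its own version of transform(), which is how volume in sources and velocity in source change translate into human cost.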

5 Interactive Cleaning
Volume:
– Thousands of sources? Huge sources?
Velocity:
– Rapidly changing sources?
Variety:
– Numerous different representations?
Veracity:
– Good for some types of data cleaning.
Trifacta: www.trifacta.com
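
For contrast with the ETL sketch above, here is the kind of single wrangling step an interactive cleaning tool lets an analyst build up incrementally; the pandas version below is only an illustration with made-up column names, not Trifacta's recipe language.

```python
# One wrangling "recipe step" written out by hand: split a combined field,
# normalise case, and flag missing values. Column names are hypothetical.
import pandas as pd

df = pd.DataFrame({"contact": ["Smith, John", "DOE,  jane", None]})

names = df["contact"].str.split(",", n=1, expand=True)
df["surname"] = names[0].str.strip().str.title()
df["forename"] = names[1].str.strip().str.title()
df["contact_missing"] = df["contact"].isna()

print(df)
```

Each such recipe is still tied to one representation, which is why the variety and velocity questions above remain open.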

6 Constraint-Based Approaches
Volume:
– Known complexity challenges.
Velocity:
– Rapidly changing sources?
Variety:
– Numerous different representations?
Veracity:
– Limited support for domain-specific quality.
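
As an illustration of what a constraint-based approach checks, and of why volume raises complexity challenges, here is a deliberately naive sketch that detects violations of one functional dependency (zip determines city) by comparing all pairs of rows; the data and column names are made up.

```python
# Naive constraint-based check: report pairs of rows that violate the
# functional dependency zip -> city. The pairwise loop is O(n^2), which
# illustrates why volume becomes a complexity challenge for such methods.
from itertools import combinations

rows = [
    {"zip": "M13 9PL", "city": "Manchester"},
    {"zip": "M13 9PL", "city": "Mancester"},   # typo: violates zip -> city
    {"zip": "EH8 9AB", "city": "Edinburgh"},
]

violations = [
    (r1, r2)
    for r1, r2 in combinations(rows, 2)
    if r1["zip"] == r2["zip"] and r1["city"] != r2["city"]
]

for r1, r2 in violations:
    print("FD violation:", r1, r2)
```

Detection can be made cheaper than this pairwise loop (for example by grouping on zip), but choosing good repairs for the detected violations is generally much harder than detecting them.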

7 What are the lessons / consequences?
Lessons:
– Existing approaches tend to be quite labour intensive.
– For the 4 V's, there may be as many problems with human costs as with computational costs.
Consequences:
– We need to focus effort really well: perfect solutions are going to be too costly.
– We need to automate or pay-as-you-go: manual solutions will not cope with numerous or rapidly changing sources.

8 Outline
– Existing approaches to quality and big data.
– The questions.

9 Question 1 and Comments
Question: Big Data is often taken as a guarantee that data science techniques will produce accurate results. What, however, are the risks for the value extraction chain
– if the underlying data is wrong, dirty, incomplete, inconsistent, obsolete...?
– if the underlying data analysis algorithms are unreliable, or biased in their processing or their assumptions?
Comments: A guarantee of result accuracy? Really? What are the risks? The level of risk can only be known if you know what matters to you and try to measure it. What are the risks for value extraction? That depends on what conclusions are to be drawn.

10 Question 2 and Comments
Question: What are the main quality factors of Big Data? Do these quality factors have the same importance in various data spaces: corporate transactional data, research data, government data, personal data, social media data...?
Comments: What are the main quality factors of Big Data? That depends on what you are trying to achieve. Do they have the same importance in different domains? No.

11 Question 3 and Comments
Question: Data quality is an old issue in databases, data warehousing, statistics and decision-making systems. To what extent are traditional approaches for diagnosis, prevention and curation challenged by the Volume, Variety and Velocity characteristics of Big Data?
Comments:
– Volume (size): some techniques scale better than others.
– Volume (sources): some techniques need significant manual effort.
– Variety: see Volume (sources)!
– Velocity: see Volume (size)!
So it is important to target cleaning and analysis effort on what really matters.

12 Question 4 and Comments
Question: Quality assessment and improvement may come at a high cost, which has to be balanced against the risk of taking decisions on the basis of inaccurate data. How difficult is it to evaluate the threshold below which data quality issues can be ignored?
Comments: How hard is it to tell how good your quality is, so that you can set a threshold? That depends on the measure. Human effort may be needed to define metrics, write rules or provide feedback.
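
A minimal sketch of what setting such a threshold might involve in practice: compute a few simple metrics (field completeness and one hand-written domain rule) and compare the result with a task-specific cut-off. The records, rule and threshold value below are illustrative assumptions, not a general recipe.

```python
# Toy quality assessment: measure completeness and rule violations, then
# decide whether the data is fit for a given task. All numbers, column
# names and the threshold are illustrative assumptions.
records = [
    {"customer": "A", "age": 34, "country": "UK"},
    {"customer": "B", "age": None, "country": "UK"},
    {"customer": "C", "age": 212, "country": None},   # violates the age rule
]

def completeness(rows, field):
    return sum(r[field] is not None for r in rows) / len(rows)

def rule_violation_rate(rows):
    # Domain rule (hand written, hence the human effort): 0 <= age <= 120.
    checked = [r for r in rows if r["age"] is not None]
    return sum(not (0 <= r["age"] <= 120) for r in checked) / len(checked)

quality = min(completeness(records, "age"),
              completeness(records, "country"),
              1 - rule_violation_rate(records))

THRESHOLD = 0.9   # depends entirely on the decision being taken
print(f"quality score {quality:.2f}:",
      "fit for purpose" if quality >= THRESHOLD else "needs attention")
```

The hard part, as the comment above says, is not the arithmetic but deciding which metrics and which threshold actually reflect the decision being taken.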

