Dedupe, Merge and Purge & The Art of Normalization
Tyler Bell


Two Problems:
1. An over-abundance of data
2. This same over-abundant data is:
   - Partial
   - Erroneous
   - Heterogeneous
   - Duplicated
   - Untrustworthy
   - Poorly typed

The Big Data Metaphor

Metaphorically: If our source data were a person, it would be a curiously-dressed, absentminded, oracular but at-times-unintelligible sociopathic hermaphrodite who excels at practical jokes.

The Bullhorn

Why This is a Bad Thing

SEM doesn't help
- Goal of SEO is to (politely of course) ensnare eyeballs
- SEM is based on broadcast and content multiplicity

“With a single click you can recommend that raincoat, news article or favorite sci-fi movie to friends, contacts and the rest of the world”

“With a single click you can recommend that Webpage to friends, contacts and the rest of the world”

Webpage URLs are Entity URIs
- Identifiers for people, places, things
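If webpage URLs are to serve as entity identifiers, trivially different spellings of the same URL need to collapse to one form. The following is a minimal sketch of that idea using only the Python standard library; the function name `canonical_url` and the specific normalization rules (lowercase scheme and host, drop default ports, fragments, and trailing slashes) are assumptions chosen for illustration, not rules stated in the talk.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing what the URL points to.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url: str) -> str:
    """Return a normalized form of `url` for use as an entity identifier.

    Illustrative rules only: lowercase scheme and host, drop default port,
    drop fragment, strip trailing slashes from the path. Any user:password
    part of the URL is discarded.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


if __name__ == "__main__":
    a = canonical_url("HTTP://Example.com:80/Cafe/#menu")
    b = canonical_url("http://example.com/Cafe")
    print(a, a == b)
```

Paths are left case-sensitive here because many servers treat them that way; a real pipeline would need per-site rules.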

The Crucible

Canonical Data

factual_id: the Factual ID
name: Business/POI name
po_box: PO Box. As they do not represent the physical location of a brick-and-mortar store, PO Boxes are often excluded from mobile use cases. We've isolated these for only a limited number of countries, but more will follow
address: Street address
address_extended: Additional address incl. suite numbers
locality: City, town or equivalent
region: State, province, territory, or equivalent
admin_region: Additional sub-division, usually but not always a country sub-division
post_town: Town employed in postal addressing
postcode: Postcode or equivalent (zipcode in US)
country: The ISO alpha-2 country code
tel: Telephone number with local formatting
fax: Fax number formatted as above
website: Authority page (official website)
latitude: Latitude in decimal degrees (WGS84 datum). Value will not exceed 6 decimal places (0.111m)
longitude: as above, but sideways
category: String name of category tree and category branch
status: Boolean representing business as going concern: closed (0) or open (1). We are aware that this will prove confusing to electrical engineers
Contact address of organization
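The attribute list above can be sketched as a typed record. This is a minimal illustration, assuming a hypothetical class name `CanonicalPlace`, a subset of the listed fields, and validation rules inferred from the descriptions (two-letter country code, coordinates capped at 6 decimal places); it is not the actual Factual schema implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CanonicalPlace:
    """Hypothetical canonical record using a subset of the listed attributes."""

    factual_id: str
    name: str
    country: str  # ISO alpha-2 country code
    address: Optional[str] = None
    locality: Optional[str] = None
    region: Optional[str] = None
    postcode: Optional[str] = None
    tel: Optional[str] = None
    website: Optional[str] = None
    latitude: Optional[float] = None   # decimal degrees, WGS84
    longitude: Optional[float] = None  # decimal degrees, WGS84
    status: Optional[bool] = None      # False = closed (0), True = open (1)

    def __post_init__(self) -> None:
        # Normalize and check the country code shape (two letters only;
        # this does not check membership in the ISO list).
        self.country = self.country.strip().upper()
        if len(self.country) != 2 or not self.country.isalpha():
            raise ValueError(f"country must be a two-letter code: {self.country!r}")
        # Range-check coordinates and cap them at 6 decimal places.
        if self.latitude is not None:
            if not -90.0 <= self.latitude <= 90.0:
                raise ValueError(f"latitude out of range: {self.latitude}")
            self.latitude = round(self.latitude, 6)
        if self.longitude is not None:
            if not -180.0 <= self.longitude <= 180.0:
                raise ValueError(f"longitude out of range: {self.longitude}")
            self.longitude = round(self.longitude, 6)


if __name__ == "__main__":
    place = CanonicalPlace(
        factual_id="hypothetical-id-1",  # made-up identifier
        name="Example Cafe",
        country="us",
        latitude=34.0522349999,
        longitude=-118.2436849,
        status=True,
    )
    print(place.country, place.latitude, place.longitude)
```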

It's All About Typing, These Days
- 15 attributes x 44 countries = 660 attribute types
- Often domain-specific
- Required for extraction, verification
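One way to picture "attribute types" is as validators keyed by (attribute, country), so the same attribute is checked differently in each country. The sketch below assumes a hypothetical `VALIDATORS` table and deliberately simplified regular expressions; real postal formats have more rules than these patterns capture.

```python
import re
from typing import Optional

# Hypothetical table: one entry per (attribute, country) attribute type.
# The patterns are simplified illustrations, not authoritative formats.
VALIDATORS = {
    ("postcode", "US"): re.compile(r"\d{5}(-\d{4})?"),
    ("postcode", "CA"): re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),
    ("region", "US"): re.compile(r"[A-Z]{2}"),
}


def is_valid(attribute: str, country: str, value: str) -> Optional[bool]:
    """Check `value` against the type for (attribute, country).

    Returns None when no type is defined for that pair, which is different
    from the value being invalid.
    """
    pattern = VALIDATORS.get((attribute, country))
    if pattern is None:
        return None
    return pattern.fullmatch(value.strip().upper()) is not None


if __name__ == "__main__":
    print(is_valid("postcode", "US", "90210"))
    print(is_valid("postcode", "US", "K1A 0B1"))
    print(is_valid("postcode", "CA", "k1a 0b1"))
    print(is_valid("postcode", "FR", "75001"))
    print(15 * 44)  # attribute types for 15 attributes across 44 countries
```

Returning None for an unknown pair keeps "no type defined yet" distinct from "failed verification", which matters when coverage is still growing country by country.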

Entropy
- State code: Low entropy
  - Two entities with the same code: tells us very little
  - Two entities with different codes: tells us very much
- Zip code: as above, but artifacts of postal code formatting in some countries can convey elements of proximity.
- Phone number: High entropy but surprisingly uninformative.
Things fall apart, the center cannot hold…
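The entropy comparison can be made concrete with a small calculation. This is a sketch under stated assumptions: the helper names `shannon_entropy` and `agreement_probability` and the sample values are invented for illustration. It shows why agreement on a low-entropy attribute is weak evidence: two unrelated records are likely to share the value by chance. It does not model why phone numbers turn out to be uninformative in practice, which is a data-quality issue rather than an entropy one.

```python
import math
from collections import Counter
from typing import Sequence


def shannon_entropy(values: Sequence[str]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def agreement_probability(values: Sequence[str]) -> float:
    """Chance that two independent draws from `values` have the same value."""
    total = len(values)
    return sum((c / total) ** 2 for c in Counter(values).values())


if __name__ == "__main__":
    # Made-up sample columns for four records.
    states = ["CA", "CA", "NY", "NY"]
    phones = ["555-0100", "555-0101", "555-0102", "555-0103"]

    print(shannon_entropy(states), agreement_probability(states))
    print(shannon_entropy(phones), agreement_probability(phones))
```

With these samples the state column has lower entropy and a higher chance agreement than the phone column, so a shared state says little while differing states rule a match out.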

15 attributes x 44 countries (so far) = 660 attribute types

The Ultimate Union of Man and Machine

US Local Dataset
17.5m entities pointing to over…
1.5b references found across…
4.7m domains

Peter Mika, Jan 2011

enable publishers to give us hints about what things they are describing on their sites…
markup [will] amplify the value [webmasters] receive in return
improve how their sites appear in major search engines… powering richer search results and new kinds of applications.
improve the search experience… alignment between search and our Web of Objects program

Datawire
TL;DR:
- Search: human disambiguation is expected
- Few inputs lead to 'pull', not 'push'
- Plurality of content is a real bugger
- Content markup will do more than improve the look of search results
- Increased recognition of machine-to-machine APIs
- The socially networked world demands understanding across caissons
The Good News:

Tyler