MIS2502: Data Analytics Semi-structured Data Analytics

Slides:



Advertisements
Similar presentations
Alternative FILE formats
Advertisements

XML: text format Dr Andy Evans. Text-based data formats As data space has become cheaper, people have moved away from binary data formats. Text easier.
IS 373—Web Standards Todd Will
Sistemi basati su conoscenza XML Prof. M.T. PAZIENZA a.a
23-Jun-15 HTML. 2 Web pages are HTML HTML stands for HyperText Markup Language Web pages are plain text files, written in HTML Browsers display web pages.
Tutorial 11 Creating XML Document
Introduce of XML Xiaoling Song CS157A. What is XML? XML stands for EXtensible Markup Language XML stands for EXtensible Markup Language XML is a markup.
Pemrograman Berbasis WEB XML part 2 -Aurelio Rahmadian- Sumber: w3cschools.com.
XML introduction to Ahmed I. Deeb Dr. Anwar Mousa  presenter  instructor University Of Palestine-2009.
Chapter 16 The World Wide Web Chapter Goals Compare and contrast the Internet and the World Wide Web Describe general Web processing Describe several.
Chapter 16 The World Wide Web. 2 The Web An infrastructure of information combined and the network software used to access it Web page A document that.
1Computer Sciences Department Princess Nourah bint Abdulrahman University.
ACOT Intro/Copyright Succeeding in Business with Microsoft Excel
XML Extensible Markup Language. Markup Languages u What does this number (100) mean? –Actually, it’s just a string of characters! –A markup language can.
Demystifying the eXtensible Markup Language Nick Roberts & Jim Few
 XML is designed to describe data and to focus on what data is. HTML is designed to display data and to focus on how data looks.  XML is created to structure,
Tutorial 1: XML Creating an XML Document. 2 Introducing XML XML stands for Extensible Markup Language. A markup language specifies the structure and content.
XML eXtensible Markup Language. Topics  What is XML  An XML example  Why is XML important  XML introduction  XML applications  XML support CSEB.
Softsmith Infotech XML. Softsmith Infotech XML EXtensible Markup Language XML is a markup language much like HTML Designed to carry data, not to display.
Copyright © Osmosys O S M O S Y SO S M O S Y S D e p l o y i n g E x p e r i e n c e & E x p e r t i s e™ HTML Training.
WEB APPLICATION DEVELOPMENT For More visit:
Introduction to XML This presentation covers introductory features of XML. What XML is and what it is not? What does it do? Put different related technologies.
1 Introduction to XML XML stands for Extensible Markup Language. Because it is extensible, XML has been used to create a wide variety of different markup.
What it is and how it works
XML Introduction. Markup Language A markup language must specify What markup is allowed What markup is required How markup is to be distinguished from.
XML Basics A brief introduction to XML in general 1XML Basics.
Representing data with XML SE-2030 Dr. Mark L. Hornick 1.
AJAX. Ajax  $.get  $.post  $.getJSON  $.ajax  json and xml  Looping over data results, success and error callbacks.
Dave Salinas. What is XML? XML stands for eXtensible Markup Language Markup language, like HTML HTML was designed to display data, whereas XML was designed.
The idea of adding markup instructions to documents is not new. Before computers, authors would make annotations by hand in their written or typed documents.
VCE IT Theory Slideshows by Mark Kelly study design By Mark Kelly, vceit.com, Begin.
Google maps engine and language presentation Ibrahim Motala.
JSON. JSON as an XML Alternative JSON is a light-weight alternative to XML for data- interchange JSON = JavaScript Object Notation It’s really language.
XML Introduction to XML Extensible Markup Language.
Connecting to External Data. Financial data can be obtained from a number of different data sources.
XML & JSON. Background XML and JSON are to standard, textual data formats for representing arbitrary data – XML stands for “eXtensible Markup Language”
XML Databases Presented By: Pardeep MT15042 Anurag Goel MT15006.
Extensible Markup Language (XML) Pat Morin COMP 2405.
XML intro. What is XML? XML stands for EXtensible Markup Language XML is a markup language much like HTML XML was designed to carry data, not to display.
XML BASICS and more…. What is XML? In common:  XML is a standard, simple, self-describing way of encoding both text and data so that content can be processed.
Web Development. Agenda Web History Network Architecture Types of Server The languages of the web Protocols API 2.
Introduction to Mongo DB(NO SQL data Base)
HTML CS 4640 Programming Languages for Web Applications
Unit 4 Representing Web Data: XML
Unit M Programming Web Pages with
The Client-Server Model
Web Languages What Is a Web Page?
3.00cs HTML Overview 3.00cs Develop webpages.
mysql and mysql workbench
Microsoft Office Illustrated
Intro to PHP & Variables
Tutorial 8 Objectives Continue presenting methods to import data into Access, export data from Access, link applications with data stored in Access, and.
Web Languages What Is a Web Page?
Chapter 7 Representing Web Data: XML
Database Systems – SQL XML XML stands for Extensible Markup Language
Intro to NoSQL Databases
MIS2502: Data Analytics MySQL and SQL Workbench
Intro to NoSQL Databases
What is XML?.
MIS JavaScript and API Workshop (Part 3)
Introduction to AJAX and JSON
Chapter 16 The World Wide Web.
relational thoughts on NoSql
CS 240 – Advanced Programming Concepts
MIS2502: Data Analytics MySQL and MySQL Workbench
Ajax and JSON Jeremy Shafer Department of MIS Fox School of Business
Ajax and JSON Jeremy Shafer Department of MIS Fox School of Business
Intro to NoSQL Databases
Extensible Markup Language (XML)
Presentation transcript:

MIS2502: Data Analytics Semi-structured Data Analytics Zhe (Joe) Deng deng@temple.edu http://community.mis.temple.edu/zdeng

Relational databases are highly structured Tables have the same number of fields for every record Each field has a specified data type Data types have a specified length and precision

Not all data is stored like that

Comma-separated value (CSV) file. Each value is separated by a comma. Other than that it is plain text. No specified field lengths. The first row is often the field/column names. Recall the UCI Machine Learning Repository: http://mlr.cs.umass.edu/ml/machine-learning- databases/adult/adult.data

Role of quotation marks The quotes don’t imply a data type. Notice that the ID is in quotes but the height and mass are not. The quotes just allow commas to be considered part of the value, not a separator. The CSV file format is not fully standardized. The only standardized rule is the basic idea of separating fields with a comma. This is also a valid CSV file…

Semi-Structured Data Structured data Organized according to a formal data model (i.e., relational schema) Semi-structured data No formal data model, but contains symbols to separate and label data elements Unstructured data No data model and no pre-defined organization

Data Examples Structured Relational databases Semi-Structured CSV XML & JSON Unstructured Text documents Images Would you consider an Excel spreadsheet structured, semi-structured, or unstructured?

Why care about semi-structured and unstructured data? Common way to transfer data between software applications Because plain-text is universal, datasets are often posted using semi-structured formats Semi-structured data It’s everywhere Up to 70% to 80% of an organization’s data may be in unstructured forms (Wikipedia) Unstructured data

The CSV format is still quite structured You can’t skip values in a row You have to be careful when using commas as part of your data …but there’s no way to create data hierarchies Can’t make “first” and “last” part of “name” This means the year for Watkins is 3.2… …and she doesn’t have a GPA

Alternatives to CSV for semi-structured data XML Extensible Markup Language JSON JavaScript Object Notation

Extensible Markup Language Plain text file Uses text for values between tags for labels <opening tag>data</closing tag> <height>172</height> Values can be of any length Commas and quotes are valid Fields can be skipped… Remove <mass>75</mass> from C-3PO and skin color is still gold Starts and ends with a tag (often <root> or <document>)

And id, name, height, mass, etc., are all nested under Character Hierarchies in XML <Character> <id>1</id> <name> <first>Luke</first> <last>Skywalker</last> </name> <height>172</height> <mass>77</mass> <hair_color>blond</hair_color> <skin_color>fair</skin_color> <eye_color>blue</eye_color> <birth_year>19</birth_year> <gender>male</gender> <homeworld>Tatooine</homeworld> </Character> We know we can break up name into first and last But we are also nesting it under name So first and last are now attributes of name Easier to find what you’re looking for and organize your data And id, name, height, mass, etc., are all nested under Character

Bottom line for XML XML is better than CSVs for semi-structured data Allow for hierarchies More flexible Easier to read But XML takes up a lot more space with all of those tags Starwars.csv  6,251 bytes Starwars.xml  28,521 bytes

JavaScript Object Notation Plain text file Organized as objects within braces { } Uses key-value pairs key: value or “name”: “C-3PO” keys are field names; strings in quotes values are the data; strings, numbers, Boolean (quotes around strings required) a comma separates the key-value pairs Values can be any length Fields can be skipped Remove “mass”: “75” from C-3PO and skin color is still gold JSON object JSON object

Hierarchies in JSON { "Character": { "id": "1", "name": { "first": "Luke", "last": "Skywalker" }, "height": "172", "mass": "77", "hair_color": "blond", "skin_color": "fair", "eye_color": "blue", "birth_year": "19", "gender": "male", "homeworld": "Tatooine" } We can have first and last nested as attributes of name, just like XML And all the fields (id, name, height) are attributes of Character

Bottom line for JSON Best aspects of XML and CSV More lightweight than XML Starwars.csv  6,251 bytes Starwars.xml  28,521 bytes Starwars.json  21,074 bytes Supports hierarchies like XML JSON becoming the standard for transferring data across the web

Same data, four different ways… XML file Relational database table JSON file first last year GPA Bob Smith Sophomore 3.4 Judy Jones Senior 3.9 Barbara Watkins Junior 3.2 <root> <Person> <first>Bob</first> <last>Smith</last> <year>Sophomore</year> <GPA>3.4</GPA> </Person> <first>Judy</first> <last>Jones</last> <year>Senior</year> <GPA>3.9</GPA> <first>Barbara</first> <last>Watkins</last> <year>Junior</year> <GPA>3.2</GPA> </root> [ { "first": "Bob", "last": "Smith", "year": "Sophomore", "GPA": 3.4 }, "first": "Judy", "last": "Jones", "year": "Senior", "GPA": 3.9 "first": "Barbara", "last": "Watkins", "year": "Junior", "GPA": 3.2 } ] CSV file first,last,year,GPA Bob,Smith,Sophomore,3.4 Judy,Jones,Senior,3.9 Barbara,Watkins,Junior,3.2

JSON and Web APIs Web Application Program Interface (API) Software that exposes functionality through a web interface Use the language of web software to send and receive messages (and exchange data) Some examples

Requesting a web page Google’s Web Server which is really… RESPONSE: which is really… This is HTML and JavaScript and CSS. All you need to know is that the web server is sending back a lot of text that tells your web browser what to do.

JSON and web APIs For us, Web APIs are just a way of getting data JSON is a popular format to package the data MySQL Database Server SELECT actor.first_name, actor.last_name FROM moviedb.actor; PENELOPE GUINESS, NICK WAHLBERG, ED CHASE, JENNIFER DAVIS…. Database Server with Web API https://swapi.co/api/people/1/?format=json {"name":"Luke Skywalker","height":"172","mass":"77", "hair_color":"blond","skin_color":"fair“… This works…try it!

Amazon.com database server Note there is no web browser involved here! Applications use APIs to communicate with each other by exchanging data http://api.paypal.com/payments/... APPROVED! Your web browser on your phone, laptop, desktop computer Amazon.com web server serves the familiar web interface you know Amazon.com application server processes orders, maintains your cart, and makes recommendations Amazon.com database server stores customer, product, and order data

JSON and data analytics JSON is just another data format JSON files can be read by analytics software, including R So can CSV files And XML files And Excel files