Climate Group 2 Jiajun LI, Serena DONG, Charis DENG.

1 Climate Group 2 Jiajun LI, Serena DONG, Charis DENG

2 Outline
Data Collection
Entity Resolution
Data Fusion
Demo

3 Data Collection

4 Data Collection – Original Data
Sources / Actors     City      Weather Info
Weather-Forecast     1267      24454
Yahoo!               95293     572948
Open Weather Map     147565    979626

5 Data Collection – Integrated Schema (Weather Info)
Yahoo!: City Name, Country, Date, Temp, Min Temp, Max Temp, Description
Weather-Forecast: City Name, Country, Date, Temp, Min Temp, Max Temp, Description, Humidity, Wind Speed, Rain, Snow
Open Weather Map: City Name, Country, Date, Temp, Min Temp, Max Temp, Description, Humidity, Wind Speed, Rain, Snow, Pressure, Ground Level, Sea Level, Wind Degree

6 Data Collection – Integrated Schema (City)
Yahoo!: City Name, Country, Longitude, Latitude
Weather-Forecast: City Name, Country, Longitude, Latitude
Open Weather Map: City Name, Country, Longitude, Latitude

7 Data Collection – Data Cleansing
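Sources / Attributes: Country, Date, Temperature, Wind Speed, Snow
Weather-Forecast: Country: Full Name; Date: 21; Temperature: ºC; Wind Speed: km/h; Snow: cm / 8 hours
Yahoo!: Date: 21 Apr 2016; Temperature: ºF; Wind Speed: Null
Open Weather Map: Country: Abbreviation; Temperature: Kelvin; Wind Speed: m/s; Snow: mm / 3 hours
Final Result: Snow: mm / day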
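The unit differences listed on this slide could be reconciled with a few small converters. The following is a minimal sketch, not the group's actual cleansing code: the slide only states the final unit for snow (mm / day), so the choice of ºC and m/s as target units, all function names, and the idea that a daily total is the sum of the interval readings are assumptions made for illustration.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a Yahoo!-style temperature in ºF to ºC (assumed target unit)."""
    return (temp_f - 32.0) * 5.0 / 9.0


def kelvin_to_celsius(temp_k: float) -> float:
    """Convert an Open Weather Map-style temperature in Kelvin to ºC."""
    return temp_k - 273.15


def kmh_to_ms(speed_kmh: float) -> float:
    """Convert a wind speed in km/h to m/s (assumed target unit)."""
    return speed_kmh / 3.6


def snow_mm_per_day(interval_readings, unit: str) -> float:
    """Sum the interval readings of one day and express the total in mm.

    Assumption: the daily total is the sum of the per-interval values,
    e.g. three 8-hour readings in cm or eight 3-hour readings in mm.
    """
    factor = {"mm": 1.0, "cm": 10.0}[unit]
    return sum(interval_readings) * factor
```

For example, `snow_mm_per_day([0.5, 0.0, 1.0], "cm")` sums three 8-hour readings given in cm and reports the day's total in mm.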

8 Entity Resolution
Three tables
Main task: in table cities, find duplicate records by cityname and country
Error data check? Weather data always fluctuate in value
Method
1. MySQL (after data cleaning)
2. R language
3. dedupe (Python library)
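Finding duplicates by cityname and country amounts to grouping rows on a normalized key and keeping the groups with more than one member, much like a SQL GROUP BY with HAVING COUNT(*) > 1. The sketch below assumes each row is a dict with hypothetical keys `id`, `cityname`, and `country`; the normalization rule (lowercase, collapsed whitespace) is also an assumption.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.strip().lower().split())


def find_duplicates(cities):
    """Return {(cityname, country): [ids]} for keys that occur more than once."""
    groups = defaultdict(list)
    for row in cities:
        key = (normalize(row["cityname"]), normalize(row["country"]))
        groups[key].append(row["id"])
    # Keep only the keys shared by at least two rows.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Exact matching on a normalized key only catches trivial variants; spelling differences such as "Hongkong" and "Hong Kong" need a similarity-based method.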

9 Dedupe
Dedupe is a library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.
Remove duplicate entries by cityname and country between two tables
1. CSV file for a small dataset
2. MySQL data structure for a big dataset, which makes result processing easy

10 Dedupe
Data Preprocessing
Supervised Learning: Training, Labeling, Blocking, Clustering

11 Dedupe
Blocking, Clustering
Result Processing: handle pairs with a low score, pairs with the same id (a record matched with itself), and cities with the same cityname in the same country (checked by longitude and latitude)
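The blocking and result-processing ideas can be illustrated without the dedupe library itself. The sketch below is a standard-library stand-in, not dedupe's API: it blocks on country, scores city names with `difflib.SequenceMatcher`, and then applies the three result filters named on the slide. The record keys, the thresholds `min_score` and `max_degrees`, and the assumption that ids are unique across both tables are all hypothetical.

```python
from difflib import SequenceMatcher


def link_tables(table_a, table_b, min_score=0.85, max_degrees=0.5):
    """Return (id_a, id_b, score) for records that likely describe the same city."""
    # Blocking: only compare records that share a country.
    by_country = {}
    for rec in table_b:
        by_country.setdefault(rec["country"].lower(), []).append(rec)

    matches = []
    for a in table_a:
        for b in by_country.get(a["country"].lower(), []):
            if a["id"] == b["id"]:
                continue  # same id: the record matched with itself
            score = SequenceMatcher(
                None, a["cityname"].lower(), b["cityname"].lower()
            ).ratio()
            if score < min_score:
                continue  # low score: not similar enough
            # Same name in the same country: keep only if the coordinates agree.
            if (abs(a["lat"] - b["lat"]) > max_degrees
                    or abs(a["lon"] - b["lon"]) > max_degrees):
                continue
            matches.append((a["id"], b["id"], round(score, 3)))
    return matches
```

The coordinate check is what separates two different cities that happen to share a name within one country.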

12 Deduplication Result
Two tables    R language    MySQL    Dedupe
a and b       762           820      953
b and c       17767         18096    18502
a and c       74            78       133
cities: not deduplicated; the duplicates are only identified for data fusion

13 DATA FUSION
Data processing flow
Rank the most similar cities to a given city
Data fusion main methods
Measure the quality

14 Data Processing Flow
Six days of source datasets for weather -> clean data
numeric: two datasets with numerical value; all three datasets with numerical value
text: two datasets with text value; all three datasets with text value

15 The Three Source Datasets
Calculate the distance pairwise
Take the pair with the smallest distance and average the data

16 Clean Data

17 All three datasets with numerical value
Two datasets with numerical value
Calculate the distance pairwise
Take the pair with the smallest distance and average the data
Compare attribute values pairwise across the three datasets, rank the similarity, and assign the value with the smallest distance accordingly
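The numeric rule described on this slide (compare the values pairwise, take the closest pair, average it) can be sketched as follows. The function name `fuse_numeric` and the handling of missing values as `None` are assumptions for illustration; with only two values present there is a single pair, so the result is simply their average.

```python
from itertools import combinations


def fuse_numeric(values):
    """Fuse one attribute's values from several sources into a single number."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    if len(present) == 1:
        return present[0]
    # Pairwise distances: pick the two values that agree most closely.
    a, b = min(combinations(present, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2.0
```

For instance, `fuse_numeric([20.0, 21.0, 35.0])` ignores the outlying 35.0 and averages the two values that agree.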

18 Two datasets with text value
All three datasets with text value
Calculate the similarity pairwise with stringsim
Choose the closest pair
Get exactly one value by similarity rank
Choose the value directly by similarity value
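For text values, one possible reading of the rule above is to score every candidate by its total similarity to the other candidates and keep the best-ranked one. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in for stringsim, so the scores will not match the original tool; the function names and the tie-breaking rule (the first source wins) are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in for stringsim)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuse_text(values):
    """Pick the value that is most similar, in total, to the other values."""
    present = [v for v in values if v]
    if not present:
        return None

    def total_similarity(i):
        return sum(
            similarity(present[i], present[j])
            for j in range(len(present))
            if j != i
        )

    # max() returns the first best index, so earlier sources win ties.
    best = max(range(len(present)), key=total_similarity)
    return present[best]
```

With `["light rain", "Light Rain", "heavy snow"]`, the two agreeing descriptions outrank the third, and the first of them is returned.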

19 Data Fusion Main Methods
Compare the distances pairwise
Compute the rank of the weights
Jaccard and stringsim are used
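As a reminder of what the Jaccard measure computes, here is a minimal sketch over word sets: the size of the intersection divided by the size of the union. The slides do not say how strings were tokenized, so splitting on whitespace and treating two empty strings as identical are assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two strings, in [0, 1]."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # assumption: two empty descriptions count as identical
    return len(set_a & set_b) / len(set_a | set_b)
```

For weather descriptions, `jaccard("light rain", "light rain showers")` shares two of three distinct words.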

20 The Final Table
Calculate the distance pairwise
Take the pair with the smallest distance and average the data

21 Rank the Most Similar Cities to One City
User-defined functions: mycity, cityrank
mycity: computes the weather data of one specific city, used later to compare with the rest of the cities
cityrank: gives the top 20 most similar cities, ranked using the weight distribution below
temp: 0.2, mintemp: 0.15, maxtemp: 0.15, pressure: 0.1, humidity: 0.1, clouds: 0.05, windspeed: 0.1, rain: 0.1, snow: 0.05
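The two functions could look roughly like the following Python sketch. Only the names `mycity` and `cityrank`, the top-20 cutoff, and the weights come from the slide; the data layout (a dict mapping each city to one value per attribute) and the use of a weighted sum of absolute differences as the distance are assumptions, and no normalization across attributes is applied here.

```python
WEIGHTS = {
    "temp": 0.2, "mintemp": 0.15, "maxtemp": 0.15,
    "pressure": 0.1, "humidity": 0.1, "clouds": 0.05,
    "windspeed": 0.1, "rain": 0.1, "snow": 0.05,
}


def mycity(weather, city):
    """Return the weather record of one city (assumed: one value per attribute)."""
    return weather[city]


def cityrank(weather, city, top_n=20):
    """Rank the other cities by weighted distance to `city`, closest first."""
    target = mycity(weather, city)
    scored = []
    for other, values in weather.items():
        if other == city:
            continue
        # Assumed distance: weighted sum of absolute attribute differences.
        distance = sum(
            weight * abs(values[attr] - target[attr])
            for attr, weight in WEIGHTS.items()
        )
        scored.append((distance, other))
    scored.sort()
    return [name for _, name in scored[:top_n]]
```

Because pressure is numerically much larger than the other attributes, a real implementation would probably scale each attribute before weighting.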

22 The Rank Result

23 Measure the Quality
Because temperature has a special geographic characteristic, we can use a trick: compare the latitudes of the cities
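One way to turn the latitude comparison into a number is to average the latitude gap between the chosen city and the cities ranked as similar to it; a small gap would suggest a plausible ranking. This is only a sketch of that idea: the function name `mean_latitude_gap` and the input layout are hypothetical, and the slides do not say how the comparison was actually done.

```python
def mean_latitude_gap(latitudes, city, ranked_cities):
    """Average absolute latitude difference between `city` and the ranked cities."""
    if not ranked_cities:
        return None
    gaps = [abs(latitudes[other] - latitudes[city]) for other in ranked_cities]
    return sum(gaps) / len(gaps)
```

A ranking that mixes cities from very different latitudes produces a much larger gap than one that stays in the same band.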

24 DEMO 2019/2/4

25 Thank You
Q & A
