1
Data Science for Random Forests Meetup
Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community
Data Science
Data Science for Random Forests
November 2, 2015
2
Random Forests Defined
Random forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random forests correct for decision trees' habit of overfitting to their training set.[1]:587–588 The algorithm for inducing a random forest was developed by Leo Breiman[2] and Adele Cutler,[3] and "Random Forests" is their trademark. The method combines Breiman's "bagging" idea and the random selection of features, introduced independently by Ho[4][5] and Amit and Geman[6] in order to construct a collection of decision trees with controlled variance. The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement classification proposed by Eugene Kleinberg.[7]
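The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not Breiman and Cutler's algorithm: each "tree" is a one-split stump, each stump sees a bootstrap sample (bagging) and a single randomly chosen feature, and the forest outputs the mode of the stump votes. The names fit_stump, fit_forest, predict_class, and aggregate_regression are hypothetical helpers invented for this sketch.

```python
import random
from collections import Counter
from statistics import mean


def fit_stump(rows, labels, feature):
    """Fit a one-split tree on one feature, choosing the threshold with the fewest training errors."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Assumption: stumps stand in for full decision trees to keep the sketch short."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: draw a bootstrap sample (same size, with replacement).
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random selection of features: here a subset of size one per stump.
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict_class(forest, row):
    """Classification: output the mode of the individual tree predictions."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Regression: output the mean of the individual tree predictions."""
    return mean(tree_predictions)


# Made-up toy data: two features, two well-separated classes.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = fit_forest(rows, labels)
```

Calling predict_class(forest, (1, 1)) returns the majority vote of the 25 stumps. A real implementation grows deeper trees and draws a fresh random feature subset at every split.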
3
Machine learning and data mining
Problems: Classification. Clustering. Regression. Anomaly detection. Association rules. Reinforcement learning. Structured prediction. Feature learning. Online learning. Semi-supervised learning. Unsupervised learning. Learning to rank. Grammar induction.
Supervised learning (classification • regression): Decision trees. Ensembles (Bagging, Boosting, Random forest). k-NN. Linear regression. Naive Bayes. Neural networks. Logistic regression. Perceptron. Support vector machine (SVM). Relevance vector machine (RVM).
Clustering: BIRCH. Hierarchical. k-means. Expectation-maximization (EM). DBSCAN. OPTICS. Mean-shift.
4
Introduction to Random Forests for Beginners – free ebook
Random Forests is one of the most powerful and successful machine learning techniques. This free ebook will help beginners to leverage the power of Random Forests. Random Forests is one of the top 2 methods used by Kaggle competition winners. It is an ensemble learning method for classification and regression that builds many decision trees at training time and combines their output for the final prediction. This ebook will help beginners leverage the power of multiple alternative analyses, randomization strategies, and ensemble learning with Random Forests. The 70-page ebook includes graphs, examples, and illustrations.
Chapters include: What is Random Forests?; Segment and cluster; Suited for wide data; Advantages of Random Forests; Case Study example.
5
Real World Example
The Future of Alaska Project: Forecasting Alaska’s Ecosystem in the 22nd Century
Analytics On a Grand Scale: Alaska over the next 100 years
To assist long-term planning related to Alaska’s biological natural resources, researchers at the University of Alaska, led by Professor Falk Huettmann, have built models predicting the influence of climate change on many of Alaska’s plants and animals. An Associate Professor of Wildlife Ecology, Dr. Huettmann runs the EWHALE (Ecological Wildlife Habitat Data Analysis for the Land and Seascape) Lab with the Institute of Arctic Biology, Biology and Wildlife Department at the University of Alaska-Fairbanks (UAF).
6
Connecting Alaska Landscapes Into the Future
MODELING CLIMATE CHANGE ENVELOPES: RANDOM FORESTS™
We employed the Random Forests™ modeling algorithm to identify probable relationships between historic temperature and precipitation data and known distributions for species and biomes across Alaska. These relationships were then used to predict future species and biome distribution based on projected temperature and precipitation. This approach, known as ensemble modeling, takes the average of the outputs of multiple individual models, thus generally providing more robust predictions (Breiman 1998, 2001).
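The fit-on-historic, predict-on-projected workflow and the averaging step can be illustrated with a small sketch. It is not the report's method: bagged nearest-neighbor models stand in for Random Forests™, and the records, units, and helper names (fit_nearest_neighbor, fit_ensemble, ensemble_predict) are all made up for illustration.

```python
import random
from statistics import mean

# Hypothetical historic records: (mean temperature, annual precipitation, species present 0/1).
historic = [
    (-10, 300, 1), (-8, 320, 1), (-9, 310, 1), (-7, 330, 1),
    (2, 500, 0), (3, 520, 0), (4, 510, 0), (5, 530, 0),
]


def fit_nearest_neighbor(records):
    """Return a model that predicts the presence value of the closest training record."""
    def model(temp, precip):
        # Assumption: raw squared distance, with no rescaling of the two variables.
        nearest = min(records, key=lambda r: (r[0] - temp) ** 2 + (r[1] - precip) ** 2)
        return nearest[2]
    return model


def fit_ensemble(records, n_models=50, seed=0):
    """Fit each individual model on its own bootstrap resample of the historic records."""
    rng = random.Random(seed)
    return [
        fit_nearest_neighbor([rng.choice(records) for _ in records])
        for _ in range(n_models)
    ]


def ensemble_predict(models, temp, precip):
    """Ensemble output: the average of the individual model outputs."""
    return mean(model(temp, precip) for model in models)


models = fit_ensemble(historic)
# Hypothetical projected climate for two sites.
cold_site = ensemble_predict(models, -9, 310)
warm_site = ensemble_predict(models, 4, 510)
```

The averaged output is a score between 0 and 1 that can be read as the share of models predicting presence at a site under the projected climate.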
7
Commentary
Dr. Falk Huettmann is very confident in the RandomForest software and results for Alaska. However, the inventor, Professor Leo Breiman, says his philosophy is: "RF is an example of a tool that is useful in doing analyses of scientific data. But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem. Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem."
We need to do an audit and see who is closer to the truth.
8
Alaska.csv Wide Data: 317 Rows and 468 Columns
9
Web Player
14
Statistics Essentials For Dummies and Statistics II For Dummies
15
Conclusions and Recommendations
I got a request from a new member of the Federal Big Data Working Group Meetup to "look over my shoulder" while I did my data science, to help them enter a Kaggle competition.
He tried the Kaggle Titanic example with data set and R script, which I had done with Spotfire in my Data Science for Statistics.com tutorial.
Then we found the Introduction to Random Forests for Beginners – free ebook and the Alaska data set.
So far we have been unable to understand and reproduce Dr. Falk Huettmann’s Random Forest results in the Connecting Alaska Landscapes Into the Future (2010) report.
Next we want to extend the reach of R to the enterprise and hold another visualization bakeoff.
16
Agenda
6:30 p.m. Welcome and Introduction (New Tutorial and Mentoring)
Start with Video: Learning Path: Data Science with R, then see Kaggle Competition: How Much Did It Rain? II using Spotfire instead of R!
See Slides and Slides for Data Science for Random Forests
Recent Addition: Data Science for Six World Series-Time Series Analysis and Forecasting
Also see new Data Science Data Publication: Homelessness in Metropolitan Washington for Data Science for Homeless Data Bakeoff Part II on November 4th (to be rescheduled)
7:15 p.m. Brief Member Introductions
7:30 p.m. Invited Presentation: Ujval Kamath (TIBCO Data Science Team Member for Louis Bajuk-Yorgan, TIBCO Spotfire Senior Director, Project Management) Slides
8:30 p.m. Open Discussion
8:45 p.m. Networking
9:00 p.m. Depart