Presentation by Jennifer Marsman, Principal Engineer at Microsoft (blog.msdn.microsoft.com/jennifer)
Supervised Learning and Unsupervised Learning
Supervised Learning - Requires labeled data
Unsupervised Learning - Does not require labeled data
Supervised Learning: using historic data to make new inferences.
Unsupervised learning: make sense of data without labels, cluster similar things together, look for relationships, form groups, explore the data.
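Reinforcement learning: a third category, loosely comparable to semi-supervised learning, in which an agent learns from rewards rather than labels. See our Markov Decision Process post. AlphaGo is an example.
For this Kaggle competition, supervised learning will be helpful.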
Regression vs Classification
Regression: predicts a continuous value, such as the cost of a home or a sales number (for example, with the Zillow dataset), by developing a mathematical model.
Classification: predicts a discrete category, such as YES/NO or team A, B, or C.
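To make the contrast concrete, here is a minimal sketch in plain Python. The house prices are made up for illustration (they are not from the Zillow dataset), `fit_line` and `is_expensive` are hypothetical helper names, and a hand-written threshold rule stands in for a trained classifier.

```python
# Toy illustration only: the numbers below are invented, not real housing data.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Regression: the output is a number on a continuous scale (a price).
bathrooms = [1, 2, 3, 4]
prices = [150_000, 200_000, 250_000, 300_000]
slope, intercept = fit_line(bathrooms, prices)
predicted_price = slope * 5 + intercept  # price estimate for 5 bathrooms

# Classification: the output is one of a fixed set of labels (YES/NO).
def is_expensive(price, threshold=225_000):
    """Hypothetical rule standing in for a trained classifier."""
    return "YES" if price >= threshold else "NO"

predicted_label = is_expensive(predicted_price)
```

The regression output can be any number, while the classification output can only ever be one of the predefined labels.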
Machine Learning Terminology
Features: the input variables, for example number of bathrooms, zip code.
Training vs inference:
Data split: don't use up all the data for training. Withhold some data so you can test how the model performs on data it has never seen. 15 years ago, a 70%/30% data split was the gold standard.
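A 70/30 split might look roughly like this in plain Python. This is a sketch, not the presenter's code: `train_test_split` here is a hypothetical helper written for this note (not the scikit-learn function of the same name), and the integers stand in for real labeled examples.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 labeled examples
train, test = train_test_split(rows)
```

The model is fit on `train` only, and `test` is touched once at the end to estimate performance on unseen data.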
K-fold cross-validation: divide the training data into k chunks (for example 10), train on 9 and validate on the remaining 1, then repeat with a different chunk held out each time, until every chunk has been used for validation once.
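The rotation of the held-out chunk can be sketched as follows. `k_fold_indices` is a hypothetical helper for illustration; it works on index positions and does not shuffle, although shuffling once up front is a common addition.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=50, k=10))
```

Each example lands in the validation chunk exactly once, so the k validation scores can be averaged into a single estimate.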
Precision, Recall, Accuracy
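As a reminder of how these three metrics are usually defined for a binary classifier, here is a small sketch. The labels are made up, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Accuracy: share of all predictions that were right.
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
precision, recall, accuracy = classification_metrics(y_true, y_pred)
```

In this made-up example the three numbers differ (0.5, about 0.67, and 0.625), which is why accuracy alone can hide how a model behaves on the positive class.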
Azure for ML
- Data Science Virtual Machine: includes all the common data tools pre-installed, such as Anaconda, R, Python, TensorFlow, and other common deep learning frameworks and tools.
- Batch AI: train models at scale.
- Azure credits.
Deep neural networks require a lot of data. Transfer learning can help when that is not available, for example if we only have 2000 images.
In the earlier layers, the neural network recognizes edges and shapes. Only in much later layers does it recognize the actual objects in an image.
Use a known architecture, already trained on a known dataset, that has figured out edges and shapes, and fine-tune only the final layers.
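The idea of freezing the early layers and training only the final one can be shown with a toy in plain Python. Everything here is an assumption for illustration: `FROZEN_WEIGHTS` stands in for layers pretrained elsewhere, the dataset is invented, and a real project would load a pretrained network in a deep learning framework instead.

```python
import math

# Hypothetical "pretrained" early layers: fixed weights that are never updated.
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]

def frozen_features(x):
    """Frozen early layers: a fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def predict(head, x):
    """Trainable final layer: logistic regression on the frozen features."""
    feats = frozen_features(x)
    z = head["bias"] + sum(w * f for w, f in zip(head["weights"], feats))
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune(head, data, lr=0.5, epochs=200):
    """Gradient descent on the head only; FROZEN_WEIGHTS is left untouched."""
    for _ in range(epochs):
        for x, y in data:
            error = predict(head, x) - y  # gradient of log loss w.r.t. z
            feats = frozen_features(x)
            head["weights"] = [w - lr * error * f for w, f in zip(head["weights"], feats)]
            head["bias"] -= lr * error
    return head

# Tiny made-up dataset: the label is 1 when the first input exceeds the second.
data = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 2.0], 0), ([1.0, 3.0], 0)]
head = fine_tune({"weights": [0.0, 0.0], "bias": 0.0}, data)
predictions = [round(predict(head, x)) for x, _ in data]
```

Only the three numbers in `head` are learned, which is why a small dataset can be enough when the feature-extracting layers are reused.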
Learning Resources on Microsoft
Microsoft School of AI
Microsoft tutorials for analytics, Microsoft AI learning via GitHub, Microsoft for AI, Microsoft Program for Data Science, Microsoft Program for Data Analysis. https://azure.github.io/learnAnalytics-public/
Tutorial Presentation Getting Started on Kaggle
Vani Mandava, Director of Data Science, Microsoft Research (@vanimt)
Kaggle Competition Overview and Tutorial
Oil palm plantations, which produce palm oil, are expanding in Africa and South and Central America through deforestation. High-resolution satellite images and computer vision help us track this environmental issue. The dataset was created by the West Big Data Hub and the WiDS Datathon Committee. The task: develop a model to detect whether an oil palm plantation is present in an image. This competition has ended.
Supervised learning problem in computer vision. 15K images in the training set, 4K images in the test set, 2K images in the holdout set.