Traditional Supervised Machine Learning
k-Nearest Neighbors, Linear Regression, Logistic Regression, Support Vector Machines (SVM), Decision Trees and Random Forests. Finally, Neural Networks (NN). Neural networks have become so popular that they are often treated as a category of their own; so many innovations have happened in the past 5 years that they are no longer considered a traditional method.
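A minimal sketch of the first method on the list, k-Nearest Neighbors, in plain NumPy (the toy data and the choice of k here are assumptions for illustration):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Labels of the k closest training points
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote among the neighbors
    return np.bincount(nearest).argmax()

# Toy 2D data: class 0 clustered near the origin, class 1 near (5, 5)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([0.5, 0.5])))  # -> 0
print(knn_predict(X, y, np.array([5.5, 5.5])))  # -> 1
```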
Traditional Unsupervised Learning
- Clustering: k-means, Hierarchical Cluster Analysis (HCA), Expectation Maximization
- Visualization and dimensionality reduction: Principal Component Analysis (PCA), Kernel PCA, Locally-Linear Embedding (LLE), t-distributed Stochastic Neighbor Embedding (t-SNE)
- Association rule learning: Apriori, Eclat
We can use unsupervised learning for feature extraction, to simplify data and reduce computation cost. It is also used for anomaly detection and outlier detection.
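As a sketch of dimensionality reduction with PCA, we can center the data and use the SVD to project onto the top principal directions (the toy data below is an assumption; its third feature is redundant, so two components capture everything):

```python
import numpy as np

def pca(X, n_components):
    # Center the data, then use SVD to find the principal directions
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Project onto the top n_components directions
    return X_centered @ Vt[:n_components].T

# Toy data: 3 features, but the third is just the sum of the first two
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 5.0],
              [3.0, 5.0, 8.0],
              [4.0, 7.0, 11.0]])

X_reduced = pca(X, 2)
print(X_reduced.shape)  # (4, 2)
```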
Import Data
## Read csv into Pandas DataFrame
import numpy as np
import pandas as pd
# TODO import data from source
%matplotlib inline
file = 'my_file.csv'
data = pd.read_csv(file)
data.head()
type(data) #->DataFrame
Data Preprocessing - Target Column
## store target column as a separate vector
## remove target column from training data
target = data['Target_Column']
data = data.drop('Target_Column', axis = 1)
data.head()
Normalization
Input features generally need to be normalized to make them comparable with each other. Usually we want to scale the values so that they are between 0 and 1. We can then use a sigmoid activation function to get outputs between 0 and 1.
Example:
- To normalize an RGB channel value, divide by 255 (the maximum channel value)
- To normalize a Yelp review, divide by 5 (five-star scale)
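The two examples above, plus a general min-max rescaling for an arbitrary feature, look like this (the sample values are assumptions):

```python
import numpy as np

# RGB channel values are in [0, 255]; divide by 255 to map them into [0, 1]
pixels = np.array([0, 128, 255], dtype=float)
print(pixels / 255)

# Star ratings are in [1, 5]; divide by 5 to map them into (0, 1]
stars = np.array([1, 3, 5], dtype=float)
print(stars / 5)

# General min-max normalization to [0, 1] for any feature column
def min_max_normalize(x):
    return (x - x.min()) / (x.max() - x.min())

print(min_max_normalize(np.array([10.0, 20.0, 30.0])))  # [0.0, 0.5, 1.0]
```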
Activation Functions
We can use activation functions to squash output numbers into certain ranges, for example between 0 and 1.
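The sigmoid is the classic example: it squashes any real number into the open interval (0, 1). A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1); sigmoid(0) is exactly 0.5
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```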
Vector Spaces
The decision boundary can be a line (for 2D data), a plane (for 3D data), or more generally a hyperplane when the data lives in a higher-dimensional feature space.
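A linear decision boundary has the same form w·x + b = 0 in any dimension; only the length of w changes. A sketch with assumed weights:

```python
import numpy as np

def decision(x, w, b):
    # The sign of w.x + b tells which side of the hyperplane x falls on
    return 1 if np.dot(w, x) + b >= 0 else 0

# 2D: the boundary w.x + b = 0 is the line x1 + x2 = 1
w2, b2 = np.array([1.0, 1.0]), -1.0
print(decision(np.array([2.0, 2.0]), w2, b2))  # 1
print(decision(np.array([0.0, 0.0]), w2, b2))  # 0

# 4D: the same formula defines a hyperplane in 4D space
w4, b4 = np.array([1.0, -1.0, 0.5, 2.0]), 0.0
print(decision(np.array([1.0, 1.0, 1.0, 1.0]), w4, b4))  # 1
```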
Metric - Calculate Accuracy Scores
def accuracy_score(y, pred):
    if len(y) != len(pred):
        raise ValueError("Lengths of inputs don't match.")
    # Fraction of matching labels, expressed as a percentage
    return "Accuracy score is {:.2f}.".format((np.asarray(y) == np.asarray(pred)).mean() * 100)

accuracy_score([1, 0, 1, 1], [1, 0, 0, 1])  # -> 'Accuracy score is 75.00.'
Machine Learning in the Real World
Are you willing to let classifiers make business decisions for you?
ROC Curve
The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings (source: Wikipedia). It measures how well the classifier's decision frontier separates the classes as the threshold varies.
Read more about the ROC curve here.
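The (FPR, TPR) pairs behind a ROC curve can be computed directly by sweeping a threshold over the classifier's scores (the toy labels, scores, and thresholds below are assumptions):

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    points = []
    for t in thresholds:
        pred = (scores >= t).astype(int)  # classify as positive above threshold t
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        tn = np.sum((pred == 0) & (y_true == 0))
        tpr = tp / (tp + fn)  # true positive rate
        fpr = fp / (fp + tn)  # false positive rate
        points.append((fpr, tpr))
    return points

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
pts = roc_points(y_true, scores, [0.0, 0.5, 1.1])
print(pts)  # from (1.0, 1.0) at threshold 0 down to (0.0, 0.0) at threshold 1.1
```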
Broadcasting
"The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is "broadcast" across the larger array so that they have compatible shapes." (source: NumPy documentation).
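Two small examples of broadcasting: an array against a scalar, and a column vector against a row, which stretch to a compatible shape:

```python
import numpy as np

# A (3,) array broadcasts against a scalar: the scalar is stretched to shape (3,)
a = np.array([1.0, 2.0, 3.0])
print(a * 2)  # [2. 4. 6.]

# A (3, 1) column broadcasts against a (3,) row to give a (3, 3) result
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3])
print(col + row)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]
```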