- Section 1 Machine Learning Foundations
- 1.4 Training Models (Luis Serrano)
- 1.4.1 Intro: two questions, 1. how well is the model doing, 2. how to improve the model based on metrics
- 1.4.2 Outline: a) problem, b) tools aka algorithms, c) measurement tools aka metrics (focus of this section)
- 1.4.3 Statistics refresher: stats concepts such as mean, median, variance; see also the free descriptive and inferential statistics courses by Udacity
- 1.4.4 Loading data into pandas: `import pandas`, then `data = pandas.read_csv('file_name.csv')`. See also the pandas cheatsheet.
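A minimal sketch of the loading step. The CSV content and the column names `x1`, `x2`, `y` are made up for illustration; an in-memory string stands in for `file_name.csv` so the snippet runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'file_name.csv'.
csv_text = "x1,x2,y\n0.1,0.5,0\n0.9,0.3,1\n0.4,0.8,0\n"

# read_csv accepts a file path or a file-like object and returns a DataFrame.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)          # (rows, columns)
print(list(data.columns))  # column names taken from the header row
```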
- 1.4.5 Numpy Arrays: pandas `read_csv` stores the data as a DataFrame. Query one column with `df['name_of_col']`, and more than one column with `df[['col_1','col_2']]`. A common operation is to split the DataFrame into features and target, then convert each to a numpy array for efficient calculation: `numpy.array(df)`
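The feature/target split described above might look like the following sketch. The DataFrame and its column names (`x1`, `x2`, `y`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: two feature columns and one target column.
df = pd.DataFrame({
    "x1": [0.1, 0.9, 0.4],
    "x2": [0.5, 0.3, 0.8],
    "y": [0, 1, 0],
})

# Selecting several columns (double brackets) gives a DataFrame;
# selecting one column (single brackets) gives a Series.
X = np.array(df[["x1", "x2"]])  # features, 2-D array
y = np.array(df["y"])           # target, 1-D array

print(X.shape, y.shape)
```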
- 1.4.6 Training models in sklearn: this course will cover important classification algorithms including Logistic Regression, Neural Networks, Decision Trees, and Support Vector Machines. Modeling data is easy in sklearn: `classifier.fit(X,y)`. Important exercise: playing with decision boundaries. It seems a decision tree fits "boxy" data really well, because it can draw vertical and horizontal boundaries. But be careful: the data may be "circular", in which case NN and SVM can fit better.
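As a rough illustration of how uniform the `fit` interface is, here is a sketch that trains three of the classifiers named above on a tiny made-up dataset (the points and labels are invented for this example, not taken from the course).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy data: two well-separated groups of points.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [3, 3], [3, 4], [4, 3], [4, 4]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

classifiers = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(),
    "svm": SVC(),
}

predictions = {}
for name, classifier in classifiers.items():
    classifier.fit(X, y)                    # same call for every model
    predictions[name] = classifier.predict(X)
    print(name, classifier.score(X, y))     # accuracy on the training points
```

Only the constructor changes between models; `fit`, `predict`, and `score` are called the same way each time.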
- 1.4.7 Tuning parameters manually: `classifier = SVC(kernel=None, degree=None, gamma=None, C=None)`, where each `None` is a placeholder for a value you choose. `kernel` (string): 'linear', 'poly', or 'rbf'. `degree` (integer): the degree of the polynomial kernel (only used with the poly kernel). `gamma` (float): the gamma parameter (goes with the rbf kernel). `C` (float): the C parameter. RBF can fit some strange "bacteria"-shaped data, while poly can fit some strange abstract-art-like data.
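A small sketch of setting these parameters by hand. The dataset comes from scikit-learn's `make_circles` helper and the specific values (`degree=2`, `gamma=1.0`, `C=1.0`) are arbitrary choices for illustration, not values from the course.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic "circular" data: one class forms a ring around the other.
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# Polynomial kernel: degree applies only to this kernel.
poly_clf = SVC(kernel="poly", degree=2, C=1.0)
poly_clf.fit(X, y)

# RBF kernel: gamma controls how tightly the boundary hugs the points.
rbf_clf = SVC(kernel="rbf", gamma=1.0, C=1.0)
rbf_clf.fit(X, y)

print("poly training accuracy:", poly_clf.score(X, y))
print("rbf training accuracy:", rbf_clf.score(X, y))
```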
- 1.4.8 Tuning parameters automatically: useful when data gets big
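These notes do not record which tool the course uses for automatic tuning. One common option in scikit-learn is `GridSearchCV`, which tries every combination in a parameter grid and scores each with cross-validation. The sketch below assumes that approach; the grid values and the synthetic dataset are illustrative only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data for illustration only.
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# Assumed search space: each dict pairs a kernel with the parameter it uses.
param_grid = [
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.1, 1, 10], "C": [0.1, 1, 10]},
]

# Fit one SVC per parameter combination, scored with 3-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```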
- 1.5 Testing Models:
- 1.5.1 A regression model returns a numeric value; a classification model returns a state. Testing reveals how well the model is doing. It's possible to make a model with a frontier or fitted line so curvy that it fits the data perfectly but doesn't generalize well. A good testing evaluation function must therefore show whether the model can generalize. Split the data into train_data and test_data, train the model with train_data, and test it with test_data: `from sklearn.model_selection import train_test_split`, then `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)`
- If you try the code above on an older scikit-learn install, you will get the error `ImportError: No module named model_selection`. Previously `train_test_split` lived in a different module (`sklearn.cross_validation` in releases before 0.18).
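A runnable sketch of the train/test split, assuming a scikit-learn version that provides `sklearn.model_selection`. The data, the `random_state` value, and the choice of a decision tree are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up data: 8 samples with 2 features each, and a binary label.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Hold out 25% of the samples for testing; random_state makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train on the training split only, then evaluate on the held-out split.
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
test_accuracy = classifier.score(X_test, y_test)

print(len(X_train), len(X_test))
print("test accuracy:", test_accuracy)
```

Scoring on `X_test` rather than `X_train` is what checks generalization: the model never saw those points during training.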
- 1.5.2 Cool visualization of train_test_split: plotting X_train, X_test, y_train, y_test with different markers.
- Regression vs. classification: regression predicts a continuous quantity; classification predicts a discrete class or category.