Friday, December 22, 2017

Prepping the data: data preprocessing in machine learning

Post construction in progress. This is a draft.
Things to consider when prepping data:

Use exploratory data analysis to understand the distribution of your data. It helps you spot where normalization, regularization, feature selection, or a particular choice of model is likely to pay off.

If you have hundreds of features, feature selection can be very effective. Dimensionality reduction can simplify the data and lead to better and faster results. It also makes sense to retain only the features that actually help predict the labels.
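As a rough illustration of keeping only features that relate to the labels, here is a minimal sketch of a univariate filter: it ranks each feature by the absolute Pearson correlation with the label and keeps the top k. This is only one of many selection techniques, and the function names and the toy data are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no information about the label.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def top_k_features(columns, labels, k):
    """Return the k feature names most correlated (in absolute value) with labels."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: one useful feature, one constant, one noisy.
columns = {
    "informative": [1, 2, 3, 4, 5, 6],
    "constant": [7, 7, 7, 7, 7, 7],
    "noisy": [3, 1, 4, 1, 5, 9],
}
labels = [0, 0, 0, 1, 1, 1]

selected = top_k_features(columns, labels, k=1)
```

A correlation filter only catches linear, one-feature-at-a-time relationships, so treat it as a first pass rather than a final answer.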

You can cut a large dataset down to a smaller, more manageable subset by sampling it.
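A minimal sketch of random sampling without replacement, using only the standard library. The helper name sample_rows and the 10% fraction are assumptions for illustration; the fixed seed is there so the subset is reproducible.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a random subset containing roughly `fraction` of the rows."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)  # sampling without replacement


rows = list(range(1000))  # stand-in for 1000 data records
subset = sample_rows(rows, 0.1)
```

If the classes are imbalanced, plain random sampling can under-represent the rare class, so a stratified sample may be a better fit.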

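It’s standard procedure to further divide the input dataset into train and test splits, with shuffling. But some datasets, such as time series, do not do well with shuffling. We cannot simply mix past data with present and future data: a model trained on shuffled rows would see the future during training, and its test score would be overly optimistic.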
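The difference between the two kinds of split can be sketched as follows. Both helpers are hypothetical and assume, for the chronological case, that the rows are already sorted from oldest to newest.

```python
import random


def shuffled_split(rows, test_fraction=0.2, seed=0):
    """Shuffle, then split. Fine when rows are independent of each other."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def chronological_split(rows, test_fraction=0.2):
    """Split without shuffling: train on the past, test on the most recent rows."""
    cut = len(rows) - round(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]


days = list(range(1, 11))  # stand-in for 10 daily observations, oldest first

train, test = chronological_split(days)
mixed_train, mixed_test = shuffled_split(days)
```

With the chronological split, every training row is older than every test row. If you use scikit-learn, train_test_split accepts shuffle=False, and TimeSeriesSplit provides cross-validation folds that respect time order.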

Is the data linearly separable? If not, an SVM can use a non-linear kernel (such as RBF or polynomial) to handle it. Neural networks get their non-linearity from activation functions such as ReLU and sigmoid.
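To make "linearly separable" concrete, here is a small sketch using XOR, the classic non-separable case. It is not an SVM: a plain perceptron stands in as the linear classifier, and an explicit extra feature (the product x1 * x2) stands in for what a kernel does implicitly. All names are made up for the example.

```python
def perceptron_converges(samples, labels, max_epochs=1000):
    """Return True if a perceptron reaches an epoch with zero training mistakes.

    Labels must be +1 or -1. On linearly separable data the perceptron is
    guaranteed to converge; on non-separable data it never does.
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified (or on the boundary)
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


xor_points = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [-1, 1, 1, -1]

# Raw XOR: no straight line separates the two classes.
raw_separable = perceptron_converges(xor_points, xor_labels)

# Add a non-linear feature; the same linear model can now separate the classes.
mapped_points = [(x1, x2, x1 * x2) for x1, x2 in xor_points]
mapped_separable = perceptron_converges(mapped_points, xor_labels)
```

The data did not change, only its representation. A kernel SVM applies the same idea without building the extra features by hand.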

Data Transformation

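sklearn.preprocessing.Imputer: imputation transformer for completing missing values. It handles missing values by replacing NaN with the mean, median, or most_frequent value of each column. Note that scikit-learn deprecated Imputer in version 0.20 and removed it in 0.22; its replacement is sklearn.impute.SimpleImputer, which supports the same three strategies.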
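What the three strategies do to a single column can be sketched by hand with the standard library. This is an illustration of the idea, not scikit-learn's implementation, and the impute_column helper and the sample values are invented for the example.

```python
import math
import statistics


def impute_column(values, strategy="mean"):
    """Replace NaN entries in one numeric column with a statistic of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if math.isnan(v) else v for v in values]


nan = float("nan")
ages = [20.0, nan, 30.0, 30.0, nan, 60.0]  # two missing entries

mean_filled = impute_column(ages, "mean")
median_filled = impute_column(ages, "median")
mode_filled = impute_column(ages, "most_frequent")
```

Here the mean is pulled up by the outlier 60.0 while the median and the most frequent value stay at 30.0, which is why the choice of strategy matters for skewed columns. Whichever strategy you use, compute the fill value on the training split only and reuse it on the test split.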
