Friday, December 22, 2017

Prepping the data, data preprocessing in machine learning

Post construction in progress. This is a draft.
Things to consider when prepping data:

Use exploratory data analysis to understand the distribution of your data. It helps you scout out opportunities for regularization, normalization, feature selection, and model selection.

If you have hundreds of features, feature selection can be very effective. Dimensionality reduction can simplify the data and often yields better results, faster. It also makes sense to retain only the features that actually help predict the labels.
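As a rough illustration of both ideas, here is a minimal sketch using scikit-learn. The synthetic dataset, the choice of 20 features/components, and the ANOVA F-score (f_classif) as the selection criterion are all assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumed sizes): 200 features, only 10 of them informative.
X, y = make_classification(
    n_samples=500,
    n_features=200,
    n_informative=10,
    n_redundant=0,
    random_state=0,
)

# Feature selection: keep the 20 original features that score highest
# against the labels. This step uses y.
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X, y)

# Dimensionality reduction: project all 200 features onto 20 principal
# components. This step ignores y and produces new, combined features.
pca = PCA(n_components=20, random_state=0)
X_reduced = pca.fit_transform(X)

print(X.shape, X_selected.shape, X_reduced.shape)
```

Note the difference: feature selection keeps a subset of the original columns, while PCA builds new columns from combinations of all of them.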

You can cut your dataset into smaller, more manageable subsets by sampling the data.
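A minimal sketch of simple random sampling without replacement, using NumPy. The array sizes and the sample size of 1,000 are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in dataset (assumed shape): 10,000 rows, 5 features, binary labels.
X = rng.normal(size=(10_000, 5))
y = rng.integers(0, 2, size=10_000)

# Draw 1,000 distinct row indices, then use them to index both arrays
# so features and labels stay aligned.
sample_size = 1_000
idx = rng.choice(len(X), size=sample_size, replace=False)
X_sample, y_sample = X[idx], y[idx]

print(X_sample.shape, y_sample.shape)
```

If the classes are imbalanced, a plain random sample may under-represent the rare class, so a stratified sample may be a better fit.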

It’s standard procedure to further divide the input dataset into train and test splits with shuffling. But some datasets, such as time series data, do not do well with shuffling. We cannot simply mix past data with present and future data, because the model would then be trained on information that comes after the points it is tested on.
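The sketch below contrasts the two cases with scikit-learn. The toy array of 20 time-ordered observations is an assumption for the example; train_test_split and TimeSeriesSplit are existing scikit-learn utilities.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Toy data (assumed): 20 observations already sorted by time.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Independent samples: a shuffled split is fine.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

# Time series: turn shuffling off so the test set is the most recent 25%.
X_past, X_future, y_past, y_future = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# For cross-validation, TimeSeriesSplit keeps every training fold
# strictly earlier than its test fold.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    print("train up to", train_idx.max(), "| test from", test_idx.min())
```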

Is the data linearly separable? SVMs can employ different kernels to handle data that is not linearly separable. In neural networks, activation functions such as ReLU and sigmoid likewise introduce non-linearity.
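To make the kernel point concrete, here is a small sketch that fits a linear-kernel and an RBF-kernel SVM on two concentric circles, a dataset that no straight line can separate. The dataset parameters are assumptions chosen to make the effect obvious.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings (assumed parameters): not linearly separable.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same model family, different kernels.
linear_score = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_score = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_score:.2f}")
print(f"rbf kernel accuracy:    {rbf_score:.2f}")
```

On data like this the linear kernel should hover near chance, while the RBF kernel should separate the rings almost perfectly.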

Data Transformation

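sklearn.preprocessing.Imputer is an imputation transformer for completing missing values. It handles missing values by replacing each NaN with the mean, median, or most frequent value (strategy "mean", "median", or "most_frequent"). Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; sklearn.impute.SimpleImputer is its replacement.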
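A minimal sketch of mean imputation. It uses SimpleImputer from sklearn.impute, the class shipped in recent scikit-learn releases, rather than the older Imputer; the tiny input array is invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix (assumed) with one missing value in each column.
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
])

# Replace each NaN with the mean of its column.
# Other strategies include "median" and "most_frequent".
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
# [[1. 2.]
#  [4. 4.]
#  [7. 3.]]
```

In a real workflow the imputer would be fit on the training split only and then applied to the test split, so that test data does not influence the fill values.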


Regularization in Machine Learning, Deep Learning

Regularization can prevent overfitting and can potentially make an algorithm converge faster and perform better. Useful in deep learning tasks, in...