
Friday, December 22, 2017

Prepping the data, data preprocessing in machine learning

Post construction in progress. This is a draft.
Things to consider when prepping data:

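Downloading and extracting data
!wget -O data.tar.gz url_to.tar.gz
!mkdir -p folder
!tar -zxf data.tar.gz -C folder
These are shell commands. In a Jupyter Notebook, the exclamation mark prefix runs the rest of the line as a shell command. Here url_to.tar.gz stands for the download URL and data.tar.gz is a placeholder for the local file name passed to -O. The target folder must exist before tar -C can extract into it.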

Use exploratory data analysis (EDA) to understand the distribution of your data. It helps you spot opportunities for regularization, normalization, feature selection and model selection.

If you have hundreds of features, feature selection can be very effective. Dimensionality reduction can simplify the data and often gives better results, faster. It also makes sense to retain only the features that actually help predict the labels.
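
As a rough sketch of both ideas, the snippet below uses scikit-learn's SelectKBest for feature selection and PCA for dimensionality reduction. The synthetic dataset, the choice of the ANOVA F-score, and the value of 20 retained features are assumptions made for illustration, not recommendations from this post.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (illustrative): 200 features, only 10 of which carry signal.
X, y = make_classification(
    n_samples=500,
    n_features=200,
    n_informative=10,
    n_redundant=0,
    random_state=0,
)

# Feature selection: keep the 20 features with the highest ANOVA F-score
# against the labels. This step uses y.
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X, y)

# Dimensionality reduction: project onto the first 20 principal components.
# This step ignores y and builds new features from combinations of the old ones.
pca = PCA(n_components=20, random_state=0)
X_reduced = pca.fit_transform(X)

print(X.shape, X_selected.shape, X_reduced.shape)
```

Feature selection keeps a subset of the original columns, so the result stays interpretable. PCA builds new columns, which can be harder to explain but does not need labels.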

You can cut your dataset into smaller, more manageable subsets by sampling the data.
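
A minimal sketch of random sampling with the Python standard library is shown here. The list of integers is a stand-in for real dataset rows, and the sample size and seed are arbitrary choices for the example.

```python
import random

# Stand-in for a dataset: each integer represents one row (hypothetical data).
rows = list(range(10_000))

# A seeded generator makes the sample reproducible between runs.
rng = random.Random(42)

# Draw 1,000 rows uniformly at random, without replacement.
sample = rng.sample(rows, k=1_000)

print(len(sample))
```

With a pandas DataFrame, DataFrame.sample does the same job. For imbalanced labels, a stratified sample may be a better fit than a uniform one.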

It’s standard procedure to divide the input dataset into train and test splits, usually with shuffling. But some datasets, such as time series data, do not work well with shuffling. Mixing past data with present and future data lets the model train on information from the future, so the split should keep chronological order.
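
For time series, one simple option is to split by position instead of shuffling. The helper below, chronological_split, is a hypothetical name written for this example, and it assumes the records are already sorted from oldest to newest.

```python
def chronological_split(records, test_fraction=0.2):
    """Split time-ordered records into train and test sets without shuffling.

    Assumes `records` is sorted from oldest to newest. The most recent
    `test_fraction` of the records becomes the test set.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = int(len(records) * test_fraction)
    split_at = len(records) - n_test
    return records[:split_at], records[split_at:]


# Hypothetical daily observations, oldest first.
days = list(range(1, 11))
train, test = chronological_split(days, test_fraction=0.2)
print(train, test)
```

Every training record is older than every test record, so the model is never evaluated on data that precedes what it was trained on. In scikit-learn, train_test_split accepts shuffle=False for a similar effect.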

Is the data linearly separable? An SVM can use different kernels to handle data that is not linearly separable. In neural networks, activation functions such as ReLU and sigmoid serve a similar purpose by introducing non-linearity.
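
To illustrate the effect of the kernel, the sketch below trains two SVMs on concentric circles, a dataset that no straight line can separate. The make_circles dataset and its parameters are assumptions chosen for the demonstration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings of points: not linearly separable in the original space.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same model family, different kernels.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"rbf kernel accuracy:    {rbf_acc:.2f}")
```

The RBF kernel should score clearly higher here, because it can draw a closed boundary around the inner ring while the linear kernel is limited to a straight line.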

Data Transformation

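sklearn.preprocessing.Imputer: imputation transformer for completing missing values. It handles missing values by replacing NaN with the column mean, median or most_frequent value. Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; sklearn.impute.SimpleImputer is its replacement.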
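
A small sketch of mean imputation follows. It uses sklearn.impute.SimpleImputer, the replacement for Imputer in current scikit-learn releases, so it assumes a version of 0.20 or later. The tiny array is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with two missing values (illustrative data).
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
])

# Replace each NaN with the mean of its column.
# Other strategies include "median" and "most_frequent".
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The column means are learned in fit, so the same fitted imputer can be applied to the test set with transform, which avoids leaking test statistics into training.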

