Prepping the data, data preprocessing in machine learning

Friday, December 22, 2017

Post construction in progress. This is a draft.

Things to consider when prepping data:

Downloading and extracting (unzipping) data

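!wget -O archive.tar.gz url_to.tar.gz

!mkdir -p folder && tar -zxf archive.tar.gz -C folder

These are shell commands. In a Jupyter Notebook, the exclamation mark prefix runs the rest of the line in the system shell. Here url_to.tar.gz stands for the download URL, archive.tar.gz is a placeholder name for the saved file (wget's -O option takes the output filename, and the URL follows it), and folder is the destination directory, which must exist before tar -C can extract into it.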
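The same download-and-extract step can also be done from Python, which helps when the notebook has no shell access. The following is a minimal sketch using only the standard library; the function names `download` and `extract_tar_gz` are hypothetical helpers, and the URL and paths you pass in are your own.

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url, archive_path):
    """Save the file at `url` to `archive_path` (rough equivalent of wget -O)."""
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_tar_gz(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir (rough equivalent of tar -zxf ... -C)."""
    dest = Path(dest_dir)
    # Unlike tar -C, create the destination directory if it is missing
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    # Return the extracted paths, relative to dest, as a quick sanity check
    return sorted(str(p.relative_to(dest)) for p in dest.rglob("*"))
```

Only extract archives from sources you trust, since a tar file can contain paths that point outside the destination directory.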

Use exploratory data analysis (EDA) to understand the distribution of your data. It helps you spot opportunities for regularization, normalization, feature selection, and model selection.
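As a rough illustration of a first look at the data, the sketch below computes per-feature summary statistics with NumPy. The helper `summarize_features` and the toy data are hypothetical; in practice a library such as pandas offers richer summaries.

```python
import numpy as np


def summarize_features(X):
    """Return per-column summary statistics for a 2-D numeric array."""
    X = np.asarray(X, dtype=float)
    return {
        "mean": np.nanmean(X, axis=0),
        "std": np.nanstd(X, axis=0),
        "min": np.nanmin(X, axis=0),
        "max": np.nanmax(X, axis=0),
        "missing": np.isnan(X).sum(axis=0),
    }


rng = np.random.default_rng(0)
# Hypothetical toy data: two features on very different scales
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(1000, 250, 500)])
X[::50, 0] = np.nan  # inject some missing values into the first feature

summary = summarize_features(X)
for name, values in summary.items():
    print(name, np.round(values, 2))
```

Features whose spread differs by orders of magnitude, as here, are candidates for normalization, and a non-zero missing count signals that imputation is needed.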

If you have hundreds of features, feature selection can be very effective. Dimensionality reduction can simplify the data and often produces better results faster. It also makes sense to retain only the features that actually help predict the labels.
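A minimal sketch of both ideas with scikit-learn, on a synthetic dataset. The choices here (univariate selection with `SelectKBest`, PCA, `k=20`, 10 components) are assumptions for illustration, not recommendations from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 200 features, only 10 of which carry signal
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=10,
    n_redundant=0, random_state=0,
)

# Feature selection: keep the 20 features most associated with the labels
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X, y)

# Dimensionality reduction: project all features onto 10 principal components
pca = PCA(n_components=10, random_state=0)
X_reduced = pca.fit_transform(X)

print(X.shape, X_selected.shape, X_reduced.shape)
```

Feature selection keeps a subset of the original columns and uses the labels, while PCA builds new columns from combinations of all features and ignores the labels.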

You can cut your dataset into smaller, more manageable subsets by sampling the data.
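One simple way to do this is uniform random sampling without replacement. The helper `sample_rows` below is a hypothetical sketch; it assumes the features and labels are NumPy arrays of equal length.

```python
import numpy as np


def sample_rows(X, y, fraction, seed=0):
    """Return a random subset of rows, keeping features and labels aligned."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(X) * fraction))
    # Draw row indices without replacement so no row appears twice
    idx = rng.choice(len(X), size=n_keep, replace=False)
    return X[idx], y[idx]


X = np.arange(2000).reshape(1000, 2)  # toy features: row i is [2i, 2i + 1]
y = np.arange(1000) % 2               # toy labels
X_small, y_small = sample_rows(X, y, fraction=0.1)
print(X_small.shape, y_small.shape)
```

For imbalanced labels, a stratified sample that preserves class proportions is usually a safer choice than a uniform one.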

It is standard procedure to divide the input dataset into train and test splits with shuffling. But some datasets, such as time series data, do not work well with shuffling: mixing past data with present and future data lets the model train on information from the future. For such data, split chronologically instead.
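The difference between the two kinds of split can be sketched with scikit-learn's `train_test_split`. The toy arrays and the 20% test size are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy features, rows in time order
y = np.arange(10)                 # toy targets, also in time order

# Independent samples: shuffle before splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# Time-ordered samples: no shuffling, so the test set is the most recent 20%
X_train_ts, X_test_ts, y_train_ts, y_test_ts = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print("shuffled test targets:", y_test)
print("chronological test targets:", y_test_ts)
```

For cross-validation on time-ordered data, scikit-learn also provides `TimeSeriesSplit`, which always trains on earlier samples and validates on later ones.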

Is the data linearly separable? An SVM can employ different kernels to handle data that is not linearly separable. In neural networks, activation functions such as ReLU and sigmoid likewise introduce non-linearity.
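To make the kernel point concrete, here is a small sketch comparing a linear and an RBF kernel on a dataset of concentric circles, which no straight line can separate. The dataset and its parameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: one class inside the other
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same model family, different kernels
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"rbf kernel accuracy:    {rbf_acc:.2f}")
```

The linear kernel should do little better than chance here, while the RBF kernel can separate the rings.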

Data Transformation

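sklearn.preprocessing.Imputer: imputation transformer for completing missing values. It handles missing values by replacing NaN with the mean, median, or most frequent value of the column (strategy: mean, median, most_frequent). Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; its replacement is sklearn.impute.SimpleImputer.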
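A minimal imputation sketch follows. It uses `sklearn.impute.SimpleImputer`, the class that took over from `Imputer` in later scikit-learn releases, so it assumes a recent scikit-learn version; the toy array is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value in each column
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 20.0],
])

# Replace each NaN with the mean of its column
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# The same transformer with other strategies
X_median = SimpleImputer(strategy="median").fit_transform(X)
X_frequent = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(X_mean)
```

Fit the imputer on the training split only and reuse it to transform the test split, so that test-set statistics do not leak into training.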


Silicon Vanity | Tech lifestyle in Silicon Valley
