
Wednesday, April 3, 2019

Getting Started with Automated Data Pipelines, Day 2: Validation and URL...







  • Data validation, and creating data from a URL
  • When do you need data from a URL? One example is maps: fetching the shapes used to draw them
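As a rough illustration of validating data that comes from a URL, here is a minimal sketch that assumes the source is a CSV file. The function names (fetch_csv, validate_rows), the column names, and the sample data are hypothetical and not taken from the video; only standard-library calls are used.

```python
import csv
import io
import urllib.request


def fetch_csv(url, timeout=10):
    """Download a CSV file from a URL and return its text (UTF-8 is assumed)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def validate_rows(csv_text, required_columns, numeric_columns=()):
    """Split CSV text into valid rows and a list of (line_number, bad_columns)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in required_columns if c not in header]
    if missing:
        # A missing column is a schema problem, so fail early and loudly.
        raise ValueError(f"missing required columns: {missing}")

    valid_rows, errors = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Required columns must not be empty.
        problems = [c for c in required_columns if not (row.get(c) or "").strip()]
        # Numeric columns must parse as floats.
        for c in numeric_columns:
            try:
                float(row.get(c))
            except (TypeError, ValueError):
                problems.append(c)
        if problems:
            errors.append((line_no, sorted(set(problems))))
        else:
            valid_rows.append(row)
    return valid_rows, errors


# Hypothetical sample standing in for text returned by fetch_csv(url).
sample = "region,lat,lon\nNorth,47.6,-122.3\nSouth,,abc\n"
rows, errors = validate_rows(
    sample, ["region", "lat", "lon"], numeric_columns=["lat", "lon"]
)
```

Keeping the download and the validation in separate functions means the checks can be exercised on a small inline sample without touching the network.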

Kaggle Challenge (LIVE)





  • Architecture: UNet
  • Use Google Colab to avoid dependency issues
  • Salt deposits are correlated with oil and gas where the salt is heavy
  • !pip install imageio (for image processing)
  • !pip install torch






Kaggle Live-Coding: Code Reviews!







  • Make code robust and reproducible: if column names change later, can your code still handle it?
  • Use R's column-selection helpers starts_with(), ends_with(), and contains(); selecting columns by name pattern makes the query more robust and harder to break downstream.
  • Avoid numeric column indexing, since the order of columns may change.
  • Avoid redundancy in code and comments.
  • To make a file a bit shorter, avoid inline images and use a script to generate the images instead.
  • Make sure the logic matches the code comments and the function signature.
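The advice on selecting columns by name rather than by position can be sketched in Python as well. The helpers below are a hypothetical, plain-Python analogue of the R selection helpers named in the notes, not the R functions themselves, and the column names and data are made up for illustration.

```python
def starts_with(columns, prefix):
    """Return the column names that begin with the given prefix."""
    return [c for c in columns if c.startswith(prefix)]


def ends_with(columns, suffix):
    """Return the column names that end with the given suffix."""
    return [c for c in columns if c.endswith(suffix)]


def contains(columns, text):
    """Return the column names that contain the given text."""
    return [c for c in columns if text in c]


def select(rows, names):
    """Keep only the named columns from each row (rows are dicts)."""
    return [{name: row[name] for name in names} for row in rows]


# Hypothetical data; the key order could change without breaking the code.
rows = [
    {"id": 1, "score_math": 90, "score_art": 75, "name_first": "Ada"},
    {"id": 2, "score_math": 82, "score_art": 88, "name_first": "Lin"},
]
columns = list(rows[0])

score_cols = starts_with(columns, "score_")
scores = select(rows, score_cols)

# Fragile alternative for comparison: positional indexing such as
# columns[1:3] silently picks the wrong columns if the order changes.
```

Because the selection depends on names only, inserting a new column before score_math or reordering the file leaves score_cols unchanged.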

Machine Learning Workflow

Data cleaning

  • Missing data
  • Outliers
  • Others: duplicates, typos, special characters
  • Strategy for missing data: imputation, mean, median...
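Two of the cleaning steps mentioned here, imputing missing values and dropping duplicates, can be sketched with the standard library. This is an illustrative sketch only: the function names (impute, drop_duplicates), the use of None to mark a missing value, and the sample values are assumptions, not part of the original notes.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


def drop_duplicates(rows):
    """Remove exact duplicate rows (tuples), keeping the first occurrence."""
    return list(dict.fromkeys(rows))


# Hypothetical column with one missing value and one outlier (100.0).
column = [1.0, None, 3.0, 100.0]
filled_median = impute(column, strategy="median")
filled_mean = impute(column, strategy="mean")

# Hypothetical rows with one exact duplicate.
records = [("a", 1), ("b", 2), ("a", 1)]
unique_records = drop_duplicates(records)
```

With an outlier present, the median fill stays near the typical values while the mean fill is pulled toward the outlier, which is one reason to check for outliers before choosing an imputation strategy.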