Data Science beyond the basics

Exploratory Data Analysis

  • Histogram plotting: the input is a dataset, or a list of datasets, whose distributions we want to plot. You can specify the bins and give each sample its own weight, so a sample does not have to count as 1. The hist function returns the total in each bin along with the plot.
  • Feature extraction is also important: simplify the data and reduce its dimensionality before feeding it into a machine learning algorithm. This lowers computational cost, so algorithms run faster, use less memory, and in some cases even perform better.
  • Anomaly (outlier) detection: handle or remove outliers and abnormal values in the data so that the model generalizes better and represents the data more accurately.
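
The histogram notes above can be sketched as follows. This is a minimal example that assumes the hist function in question is matplotlib's; the sample values, weights, and bin range are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical samples and per-sample weights (illustrative values only).
samples = [1.0, 1.5, 2.2, 2.5, 3.5, 3.8]
weights = [1.0, 1.0, 2.0, 0.5, 1.0, 3.0]

fig, ax = plt.subplots()

# Three equal-width bins over [1, 4]. With weights, each bin holds the sum
# of the weights of its samples instead of a plain count.
counts, bin_edges, patches = ax.hist(
    samples, bins=3, range=(1.0, 4.0), weights=weights
)

print(counts)     # weighted total per bin
print(bin_edges)  # the four edges that delimit the three bins
plt.close(fig)
```

Leaving out weights gives every sample a weight of 1, so the returned totals become ordinary counts.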

Machine Learning

Machine learning is emerging as a popular field of data science. It offers predictive power and draws on applied statistics and pattern recognition.

Machine learning is taking data mining to the next level.

Major machine learning tasks include classification, regression and clustering.

Questions that Business Analysts and Decision Makers are Interested In

  • Who are the best customers? That is, which customers have the highest Customer Lifetime Value?
  • Causal relationships:
    • Results of recent experiments (more prevalent in startup culture)
    • Hypothesis tests of whether one segment actually differs from another
    • Is the result significant, or is it random chance?
    • Note that determining a causal relationship requires controlled studies that control for extraneous variables. In many industries, such as biotech, statistical significance is a prerequisite for next-step analysis or further business investment.
  • Customer demographics: summary statistics, customer segmentation and more.
  • How to measure profitability and other Key Performance Indicators (KPIs)

Statistical Hypothesis Testing
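
As a minimal sketch of the "significant or random chance" question raised earlier, the example below compares two customer segments with a two-sample t-test. The choice of Welch's t-test, the 0.05 threshold, and the segment values are all assumptions made for illustration, not something these notes prescribe.

```python
from scipy import stats

# Hypothetical order values for two customer segments (made-up numbers).
segment_a = [20.1, 22.4, 19.8, 21.7, 23.0, 20.9, 22.2, 21.5]
segment_b = [24.3, 25.1, 23.8, 26.0, 24.7, 25.5, 23.9, 25.2]

# Welch's t-test: compares the two means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(segment_a, segment_b, equal_var=False)

alpha = 0.05  # assumed significance level
is_significant = p_value < alpha

print(f"t = {t_stat:.2f}, p = {p_value:.4g}, significant: {is_significant}")
```

A small p-value only says the observed difference is unlikely to be random chance. On its own it does not establish a causal relationship, which still requires a controlled study.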

Python for Data Science

  • Use the conda command, which is similar to pip, for installing and launching packages
  • Anaconda comes with a wonderful Python IDE called Spyder

Scientific Computing using Scipy

  • scipy.integrate.quad: use the quad function to compute a definite integral. Its arguments are the function to integrate, the lower bound, and the upper bound. It returns a tuple containing an approximation of the result and an estimate of the error.
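
A short sketch of the quad call described above. The integrand x^2 and the bounds 0 and 3 are arbitrary choices for illustration; the exact value of this integral is 9.

```python
from scipy import integrate


def integrand(x):
    """Example function to integrate (an arbitrary choice for illustration)."""
    return x ** 2


# quad takes the function, the lower bound, and the upper bound, and returns
# a tuple: (approximate value of the integral, estimate of the absolute error).
value, abs_error = integrate.quad(integrand, 0.0, 3.0)

print(value, abs_error)
```

The second element of the tuple is an estimate of the absolute error, so it can be used as a quick check on how far to trust the approximation.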


