Data Visualization in Machine Learning — Beyond the Basics
This is not a tutorial. These are my notes from various Machine Learning articles and tutorials. My personal cheatsheet for interviews and reviews. Any feedback and corrections are welcome. If you’d like to read more, please let me know as well. These notes are more applicable for python users. Does not include ggplot, great for R.
Prerequisites and Dependencies
This tutorial and overview is python based so we use matplotlib.pyplot. These commands can be run in command line and in Python Notebook with just a bit of modifications. Any reference to plt means the function is from the matplotlib library.
import matplotlib.pyplot as plt # will get object does not have bar, scatter.... function_name error # if not imported
Plot a Bar Chart
Bar chart, bin chart: useful for frequency analysis, distributions and counts.
labels = ['A','B','C','D','E','F','G'] nums = [13,24,5,8,7,10,11] xs = range(len(nums)) #[0, 1, 2, 3, 4, 5, 6] #xs is a convention variable name for x axis plt.bar(xs,nums) plt.ylabel("Customize y label") plt.title("Customize graph label") plt.show() #display the plot
Don’t be deceived by its simple look. Frequency analysis is very powerful in data EDA, stats and machine learning.
Plot a Histogram
Histogram will automatically divide data into bins.
import matplotlib.pyplot as plt import pandas as pd nums = [99, 1, 3, 5, 7,33, 23,684, 13, 3 ,0, 4] pd.Series(nums).hist(bins=30) # <matplotlib.axes._subplots.AxesSubplot object at 0x10d340d90> # returns object in memory plt.show()
Also useful for visualizing distribution and outliers.
How is scatter plot beyond the basics? Scatter plot is extremely intuitive yet powerful. Just plot the vertical coordinate and horizontal coordinate of each data point in the sample to get its scatter plot. If the relationship is non-linear, or there may be the presence of an outlier, these targets will be clearly visible in the scatter plot. In the case of many features i.e. dimensions, a scatterplot matrix can be used.
Below is a screenshot of pandas scatterplot matrix in the official documentation.
Clearly the relationship is not linear. The diagonal is the variable vs itself, so it’s showing a distribution graph instead of scatter plot. Neat, looks like the variable is normally distributed.
Scatterplot is a great first visual. Too many features? Try sampling or generating data subsets before visualizing.
Use pandas.DataFrame.describe() to summarize and describe datasets that are simply too big. This function will generate summary stats.
Scatterplots are useful for pairwise comparison of features.
Scatterplots can go beyond two dimensions. We can use marker size and color to illustrate the 3rd dimension, even 4th dimension as in the famous TED talk of economical inequality. The presenter even used timeline (animation) as the 5th dimension.
Youtube deep learning star Sraj shows a 3D visual of error function while altering y intercept aka bias and slope for linear regression. The global optima i.e. the global minimum in this case is the goal of gradient descent algorithm.
Error functions have shapes and can be visualized. Local optima which prevents your model from improving can potentially be visualized.
Gradient can be visualize as directional arrows that travel in the direction of the global minima along the shape of the 3D plot. It can also be visualized as a field of arrows in a matrix.
Each residual (y_i — y_hat) can be visualize as a vertical line connecting the data point with the fitted line in linear regression.
Data Scientists Love Box Plots
Why? It displays essential stats about distribution in a concise visual form. Aka candle stick plot. Also popular in finance.
Max, 3rd Quartile, Median, 1st Quartile, min.
This is known as the box and whisker graph too. It’s popular among statisticians. Used to visualize range. It can be drawn horizontally.
What’s between Q3 and Q1? The interquartile range, which used in analyzing outliers. Q1–1.5*IQR is too low, Q3+1.5*IQR is too high.
Box whisker plot displays outliers as a dot!
Check out Boston University’s Blood Pressure dataset box whisker plot with outliers.
Did you say heat map? Heat map has been in and out of favor. Web analytics still use heat map to track events and clicks on a webpage to identify key screen real estates. Why should we use heat map for machine learning?
It turns out that generating a heat map of all the feature variables — feature variables as row headers and column headers, and the variable vs itself on the diagonal— is extremely powerful way to visualize relationships between variables in high dimensional space.
For example, a correlation matrix with heat map coloring. A covariance matrix with heat map coloring. Even a massive confusion matrix with coloring.
Think less about the traditional use of heat map, but more like color is another dimension that can visually summarize the underlining data.
Correlation Matrix Heat Maps are frequently seen on Kaggle, for exploratory data analysis (EDA).
More Data Visualization Magic
Did you know that you can visualize decision trees using graphviz. It may output a very large PNG file. Remember the split of decision tree is not always stable — consistent over time. Take it with a grain of salt. The benefit of visualizing a decision tree is to understand where and how machines made decision splits. Decision tree boundaries can be visualized too, see screenshot below from Sklearn documentation.
Visualizing models, decision boundaries and prediction results may give hints whether the model is indeed a good fit or it is a poor fit for the data. For example, it is high bias to ignore the nature of our data if use a straight line to fit a circular scatter of dots.
Researchers even visualized different optimizers to see their descend to minimize loss.
Did you know you can create interactive plots using Plotly right in Jupyter Notebook? Interactive plots allow you to visualize complex data, toggle and change parameters. For example you can slide to change values of your hyperparameters and visualize how the model performance change in gridsearch and other systematic search of the space.