Tuesday, November 22, 2016

A complete guide to the election day data prediction mishap - oops, we forgot about Johnson and errors


  • The New York Times, home of a top data visualization team, completely failed to predict a Trump win. It gave Clinton an 85% chance of victory.
  • Nate Silver, the founder of FiveThirtyEight and the guy who famously called Obama's 2008 election results almost perfectly, gave Clinton roughly a 2/3 chance of winning. That left Trump with about a 30% chance, significantly higher than most people expected. People with an extreme distaste for Trump probably expected something close to 0.
    • Nate Silver later posted this exchange on Twitter. A: "There's a 30% chance of an earthquake." B: "LOL ur crazy no way it's that high." {{earthquake}} B: "Idiot! You said a 70% chance of no earthquake."
  • We all forgot about Johnson, the third-party contender whose presence might have "stolen" the 1-5% margin Clinton so badly needed; instead she narrowly lost to Trump in key Democratic-leaning states. Many forgot that these marginal votes for Johnson may have cost her key states like Pennsylvania, where her vote share was only about a percentage point behind Trump's. Personally, I believe that the existence of a third candidate caused Clinton to lose swing states. By exercising their right to vote for an alternative, many voters may have accidentally handed the presidency to Trump, whom they would not have voted for at all: an unintended mathematical result of voting for an alternative.
  • Practitioners of traditional sampling and polling methods claimed, post-mortem, to have been "blindsided". They claim the root data was wrong. Specifically, many blamed the early exit polls.
    • Post-mortem examination of this method shows clear selection bias: more vocal people were more likely to reveal their votes, as were those whose votes matched popular expectations.
    • Margin of error. Columbia University researcher Andrew Gelman found that the margin of error of such polling can be as high as +/- 7 points, a 14-point range. An estimated 50% vote share could actually be anywhere from 43% to 57%. Crazy: that's the difference between a comfortable win and a clear loss.
  • The Huffington Post, a Silicon Valley- and women-friendly media outlet, claimed that Clinton had a more than 90% chance of winning.
  • On election night, people watched in utter shock as Trump racked up Electoral College votes in what looked like a landslide.
  • The updated NYTimes visualization below shows the stunning "upset" where Trump overtook Clinton in an Election Night victory
  • Just like we cannot predict stock market win/loss, we cannot predict election win/loss.  
  • Data analysis results were wrong, but there were some new data visualization charts. Nate Silver came up with a snake-like, board-game-style "intestine" chart for Electoral College votes and the states. Some criticism of the snake chart: it carries no useful info, it is just a fancy map of electoral votes, and it's a design spin on a vintage game.
  • Subjective perspective and inherent bias. Many media and data outlets were criticized after the election. While no statistician would ever unwisely predict a 100% chance of a Clinton win, nearly all liberal outlets were prematurely hailing a Clinton victory. Even Nate Silver, who gave Trump a 30% chance of winning, did not step up to help the public understand this statistic until after the election.
  • Simulation and randomness. Nearly all respectable data outlets ran simulations with random factors. They built in scenarios such as swing states flipping and margins of error. Yet the models still fell short.
    • Imagine simulating vote percentages versus Electoral College votes. A vote percentage is continuous and can change by a fraction of a percentage point at a time. Electoral votes are much more discrete: they are allocated in lots ranging from 3 (Alaska) to 55 (California), so a small error in a state's vote share can swing a large block of electoral votes. This "landslide effect" in Clinton's upset loss was very apparent on election night, when chunks of electoral votes went to Trump, pushing him quickly toward the threshold and swiftly making him the apparent winner.
  • Personally, I watched the live visualization on the Wall Street Journal website on election night. Large cities were voting as expected, leaning either Democratic or Republican, yet many lesser-known counties from California to New York were voting overwhelmingly for Trump. Few studies and data analyses were granular down to the county level. We were so focused on state-level results.
  • This was an election in which swing states mattered a great deal more than usual. Nate Silver used two numbers, tipping-point chance and voter power index, to highlight the important states that played a crucial role on election night.
  • Who's winning the popular vote? Nate Silver estimated an average of 48.5% for Clinton and 44.9% for Trump, really not bad at all. And we forgot Johnson at 5%! That is enough margin to make Trump the winner! If Clinton fails to capture her full 48.5%, and Johnson fails to capture his full 5%, that's enough margin going to Trump. Plus the margin of error of the estimates... Wow, a Trump win versus a Hillary win was more like the flip of a coin, 50/50. (Visit Nate Silver's blog to see this useful visualization.) Again, my personal opinion is that we forgot about Johnson.
  • In my personal opinion, Trump's win was not a landslide; it only appeared to be one because of our Electoral College system. The actual vote (the popular vote) was a more even split. I personally think we really forgot about errors and Johnson. A landslide victory was unlikely (Obama had a true landslide), so the margin of error and Johnson's presence were extremely important. Yet we forgot about them. We still don't think about them when we simply claim there was a landslide victory, and now we are studying what Trump did right and justifying why it was right. Really, he did a lot of things right, and Clinton came close to doing a lot of other things right. One of them won by chance. No one predicted that.
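
The margin-of-error and electoral-vote points above can be made concrete with a small simulation. This is only a sketch under loud assumptions: the five swing states, their polled leads, the 230 "safe" electoral votes and the single shared, normally distributed polling error are all made-up illustration values. They are not real 2016 data and not how the New York Times or FiveThirtyEight built their models.

```python
import random


def poll_range(estimate, margin):
    """Range implied by a poll estimate and its margin of error (in points)."""
    return (estimate - margin, estimate + margin)


# Hypothetical swing states: name -> (electoral votes, polled lead of candidate A in points).
SWING_STATES = {
    "State 1": (29, 2.0),
    "State 2": (20, 3.0),
    "State 3": (18, -1.0),
    "State 4": (16, 4.0),
    "State 5": (10, 5.0),
}
SAFE_VOTES_A = 230  # hypothetical electoral votes assumed safe for candidate A
VOTES_TO_WIN = 270  # majority of the 538 electoral votes


def electoral_votes_for_a(shift):
    """Electoral votes for A if every polled lead is off by `shift` points.

    Each state is winner-take-all, so a small continuous shift in vote share
    moves electoral votes in large discrete chunks.
    """
    won = sum(votes for votes, lead in SWING_STATES.values() if lead + shift > 0)
    return SAFE_VOTES_A + won


def win_probability(n_sims=10000, error_sd=3.5, seed=0):
    """Share of simulated elections that A wins.

    Assumption: one shared polling error per election, drawn from a normal
    distribution with standard deviation `error_sd` points.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        shift = rng.gauss(0.0, error_sd)
        if electoral_votes_for_a(shift) >= VOTES_TO_WIN:
            wins += 1
    return wins / float(n_sims)


print(poll_range(50.0, 7.0))        # a 50% estimate with a 7-point margin of error
print(electoral_votes_for_a(0.0))   # polls exactly right
print(electoral_votes_for_a(-3.5))  # polls overstate A by 3.5 points everywhere
print(win_probability())
```

With these made-up numbers, a uniform polling miss of 3.5 points (half of the 7-point margin of error) drops candidate A from 305 electoral votes to 256, below the 270 needed to win, and A's simulated chance of winning comes out at roughly 80% rather than anything close to certainty.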


Sources and Further Reading
  • Fast Company
    • https://www.fastcodesign.com/3065750/why-we-had-no-idea-trump-would-win
  • Nate Silver 
    • http://projects.fivethirtyeight.com/2016-election-forecast/

Monday, October 31, 2016

Udacity Machine Learning Nanodegree Udacity Connect Intensive Syllabus


  • 1. Model evaluation and validation
    • 1.1 STATISTICAL ANALYSIS
    • 1.2 DATA MODEL
    • 1.3 EVALUATION AND VALIDATION
    • 1.4 MANAGING ERROR AND COMPLEXITY
    • 1.5 PROJECT

  • 1.3 EVALUATION AND VALIDATION
    • 1.3.1 TRAINING AND TESTING
      • 1.3.1.1 Benefit of testing
      • 1.3.1.2 Train / Test Split in sklearn
      • Useful concepts : train_test_split function
    • 1.3.2 EVALUATION METRICS
      • 1.3.2.1 Metrics
      • 1.3.2.2 Classification and Regression
        • Useful concepts: Categorical data vs continuous data
      • 1.3.2.3 Classification metrics
        • Useful concepts: discrete predictions
      • 1.3.2.4 Accuracy 
        • Useful concepts: proportion of items classified or labeled correctly, my_model.score(X_test, y_test). Shortcomings of accuracy: it is misleading if the data is skewed, or if you need to err on the side of innocence or guilt. Accuracy: number of items labeled correctly / total number of items (in the Enron example, one class has very few people)
      • Picking the Most Suitable Metric
        • Concept: information asymmetry
      • Confusion Matrix
        • Concept: if you care about asymmetric errors, you may want to shift the decision boundary up or down to include certain results
      • Decision Tree: confusion matrix
      • Precision and Recall
      • Equation for Precision
        • Concept: precision = true positives / (true positives + false positives)
      • Equation for Recall
        • Concept: recall = true positives / (true positives + false negatives)
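
The accuracy, precision and recall definitions above can be checked by hand on a tiny skewed dataset. This is a minimal sketch in plain Python: the labels are made up for illustration, and the helper names are hypothetical rather than part of the course material (sklearn's `precision_score` and `recall_score`, listed in the cheatsheet, compute the same quantities).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Share of all items labeled correctly."""
    return (tp + tn) / float(tp + fp + fn + tn)


def precision(tp, fp):
    """true positives / (true positives + false positives)."""
    return tp / float(tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """true positives / (true positives + false negatives)."""
    return tp / float(tp + fn) if (tp + fn) else 0.0


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up, skewed labels: only 2 of 10 items belong to the positive class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_negative = [0] * 10  # a "model" that never predicts the positive class

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, fn, tn), precision(tp, fp), recall(tp, fn))

tp0, fp0, fn0, tn0 = confusion_counts(y_true, always_negative)
print(accuracy(tp0, fp0, fn0, tn0), recall(tp0, fn0))
```

Both sets of predictions score 0.8 on accuracy, but the always-negative one has a recall of 0. That is the shortcoming of accuracy on skewed data in miniature.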



Precision vs Recall
F1 Score
Regression metrics

Mean Absolute Error
Mean Squared Error
Regression Scoring Function

Managing Error and Complexity
Cause of Error
Error due to bias
Linear Learner, Quadratic Data (programming learning curve)
Error due to Variance - Precision and Overfitting

Representative Power of a Model
1.1. Curse of Dimensionality
1.2. Curse of Dimensionality Two
Learning Curves and Model Complexity
1.1 Learning Curves
1.2 Learning Curves II
1.3 Ideal Learning Curves
1.4 Model Complexity
1.5 Learning Curves and Model Complexity
1.6 Practical Use of Model Complexity


Sunday, October 30, 2016

Udacity Machine Learning Nanodegree Udacity Connect Intensive Review PROS CONS


  • This blog post is a work in progress
  • POSITIVE
  • Industry ready. Pandas, Numpy, Python are industry standards. The course gets your hands dirty right away in industry-standard competitive software packages and libraries.
  • Online content is made by folks who actually work professionally in the field, have invented things, and are at the top of their field. Being a good tutor is different from being a good professional. Udacity tends to have professional engineers from top tech firms.
  • Great motivation, easy to stay on track. My past experience with Nanodegrees was that it was hard to stay on track and easy to get stuck. When a class of people is moving ahead together in person, and the classes are day-long, it becomes easier for me personally to stay on track. There are classmates who move faster as well as slower. It's easy to find help and engage in discussions before falling too far behind.
  • Amazing in-person instructor. I have Nick Hoh. He is an experienced instructor. His material supplements and even exceeds the online videos, making it very helpful study material. During his sessions he also talks about how he would approach a problem and break it down. It's helpful to get a perspective different from the online videos, and it's very helpful to be able to chat in person, ask questions and chat on Slack occasionally. The help makes a big difference.
  • Offline instructor as a point of contact and a great mentor. Having that one point of contact is really reassuring. While plenty of work needs to be done through extra research and online forums like StackOverflow, having that one point of key contact makes all the difference for me. 

  • NEGATIVE
  • Course work seems patched together from existing Udacity courses. The content is not always cohesive, and some of it is out of date or inaccurate, so students may get stuck without additional help. For example, the Boston Housing project has a data attribute called PTRATIO. One section calls it the ratio of students to teachers, another calls it the pupil-student ratio. Pupil is a British word for young students in secondary schools. The actual name is pupil-to-teacher ratio. Another section wants us to import a Python library from a newer release using a new API call, but the Python installed on the Udacity server is from an older release. Beginners will be stuck here forever before realizing that the bug comes from the configuration, not their code. The courses are constantly improving, but the quality of the content needs to be better: more cohesive, consistent and accurate.
  • I find myself Googling a lot for external materials to study. A lot, as in well above 50% of the time. While it is a common "industry practice" to google and learn additional information, above 50% also means that the course is not doing its job.
  • Lots of implicit prerequisites. While not mandatory, the course implicitly requires previous course work in Statistics, Probability, Linear Algebra, data analysis and Python coding. Basic statistics will be used a lot. Linear Algebra is a big part. Python coding is a must. Experience with data analysis and pivot tables helps too. I found that doing a massive and comprehensive review beforehand was very helpful. Most of my classmates are engineers. I come from an Economics background at Stanford, which, thank goodness, forced me to take linear algebra.

Udacity Machine Learning Nanodegree Udacity Connect Intensive Cheatsheet Key Concepts


Udacity Machine Learning Nanodegree Cheatsheet Useful Functions and Libraries

# libraries
import numpy as np
import pandas as pd
# data processing
# scikit-learn 0.18+ location; older releases used sklearn.cross_validation
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split

# scoring
from sklearn.metrics import r2_score

 # visualizations code visuals.py
import visuals as vs

 # visual display for Jupyter notebooks
%matplotlib inline

 # Load dataset
data = pd.read_csv('xyz.csv')
target = data['col_xyz']
features = data.drop('col_xyz', axis = 1)

# data processing: keep 80% of the rows for training, hold out the rest for testing
X_train, X_test, y_train, y_test = train_test_split(features, target, train_size=0.8, random_state=1)

# Success
print("XYZ dataset has {} data points with {} variables each.".format(*data.shape))

# Exploring dataset
data.head()      # first 5 rows (the default)
data.head(5)     # same, with the row count given explicitly
data.describe()  # summary statistics for the numeric columns


# learning curve
from sklearn.model_selection import learning_curve

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import recall_score as recall
from sklearn.metrics import precision_score as precision

# model selection

# from sklearn.grid_search import GridSearchCV  # legacy, before scikit-learn 0.18
from sklearn.model_selection import GridSearchCV  # new release, scikit-learn 0.18+
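
To see how the cheatsheet imports fit together, here is a minimal end-to-end sketch. Everything specific in it is an assumption for illustration: the data is synthetic (a noisy sine curve instead of a CSV file), the model is a `DecisionTreeRegressor`, the `max_depth` grid is arbitrary, and it assumes a scikit-learn release (0.18 or later) that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a real dataset: one feature, noisy sine-shaped target.
rng = np.random.RandomState(1)
features = rng.uniform(0, 10, size=(200, 1))
target = np.sin(features[:, 0]) + rng.normal(0, 0.1, size=200)

# Hold out 20% of the rows for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=1)

# Cross-validation splits used only inside the grid search.
cv_sets = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Try several tree depths and keep the one with the best cross-validated R^2.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5, 6]},
    scoring="r2",
    cv=cv_sets,
)
grid.fit(X_train, y_train)

best_depth = grid.best_params_["max_depth"]
test_r2 = r2_score(y_test, grid.predict(X_test))  # score on unseen data
print("best max_depth:", best_depth)
print("R^2 on the test set:", test_r2)
```

The test set is touched only once, at the very end, so the reported R^2 is not inflated by the depth search itself.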

Wednesday, October 19, 2016

Udacity Machine Learning Nanodegree Udacity Connect in-Person Experience

Last Saturday, I started my first Udacity Connect class. For a discounted price, the Udacity Machine Learning Nanodegree and a plethora of in-person help, office hours, and study sessions become available to students. The classroom is a rather simple, gloomy, windowless room at Cogswell College (a digital arts and gaming school) in San Jose. The staff is very friendly, and our session lead was extremely friendly and energetic. My classmates are on the older side. Many worked for giants of the previous dot-com boom. My first observation is that the course material is new but still "patched" together - they re-used many Udacity courses, video clips and exercises. Unfortunately, recycling and reusing does not always lend the material more clarity. Officially the prerequisites are brief. Like all Nanodegrees, this one just requires some prior knowledge of math, data and coding. The reality is that I find myself scrambling to review Linear Algebra, Statistics, Probability, Python, data analysis, data science, and basic courses on machine learning and AI. The course does not have all the materials one needs to succeed, even if one does not mind burning time on the course. You will find yourself jumping from branch to branch and googling a lot. While this is a "learning experience" per Udacity, it can be overwhelming for beginners. Just watch out. More information will be published here!

Wednesday, September 21, 2016

Monetizing the Pokemon Go Craze in 20 Examples

Why do marketers and brand ambassadors try so hard on social media to get “fake internet points” or “15 seconds of fame”? The fame and value are fleeting unless we hit the holy grail of virality: buzz-worthy, blanket, worldwide free-keting.
One way to go viral is to ride a viral trend, and today the trend is Pokemon Go. Here are some clever ideas from around the world for monetizing Pokemon Go. For the first time, the glued-to-phones millennials are going outdoors, and the stay-at-home, glued-to-computers gaming crowd is scavenger hunting the city. It is a prime time for small businesses to get noticed.
The only tools you need are a smartphone and a clever idea.
A clever idea can stick, make people laugh and remember your business. Your loyal patrons will now know that you care about their hobbies. If you think these are penniless, jobless gamers, you are wrong. While a popular mobile game title can gross about $3 million a month, Pokemon Go players spent about $3 million a day on the franchise during launch. They will be hungry and thirsty after chasing Pokemon for kilometers. It really takes a lot of effort to hatch a lousy egg.
Final bill discounts based on player level. Example 1: level 3 gets a free soda, level 5 a free slice. Careful: level 20 is no longer hard to reach. Example 2: give an incremental discount of 10 cents per level.
Pokemon Go merchandise. Nothing can erase the image of a Magikarp mask on a jacked, six-pack body. Cannot be unseen. Currently, you cannot order the Pokemon Go wearable device, so people are charging a premium for guaranteed pre-orders.
Market your location if you have spotted a rare Pokemon. Players will go far and abandon their cars for a monster (as seen in the Central Park example).
Upgrade your kid’s lemonade stand. A smart kid has been selling blue Doritos, red Doritos and yellow Lay’s, as well as lemonade served from colored bottles, to play into the Pokemon team rivalry. Remember, these trainers are out there for cute monsters. Your kids can totally be sweet and melt many hearts. That lemonade stand will win as long as it is near a Pokestop.
Encourage your employees and your employer to set up lures. Lures, especially in a Bermuda triangle or a lagoon of concentrated Pokestops, mean heavy, heavy foot traffic. “Location, location, location!” This game just turned your street into a hot spot. Setting up lures by spending dollars in the app store can mean a 10x return in revenue and tips!
A blackboard stand outside your bar that reads “Something clever about Pokemon that I don’t know. I am too old. Come in and get drunk”. Yes, we are adults, and we play because it is a sentimental part of our childhood. Just like Totoro. We may be 20 years behind the Japanese anime industry, but the US is the first place they chose to launch the game, well before Japan. We are the jackpot.
Offer your restaurant or coffee shop as a place for trainers to “charge up, cool off and catch ’em all”. It’s a hot summer out here. Ice-cold drinks make great bait.
User testing bait using lures. The Down to Lunch mobile app team set up a lure on UC Berkeley campus to get users to try their app and give feedback. Startup hustle prime example.
Create a new meme worthy Pokemon Go banner. Schrödinger’s Pokeball: When the app freezes the Pokemon is both caught and not caught.
YouTube videos on top 10s, cheats and the history of Pokemon. Popular YouTube channels have been cranking out top 10 videos: cutest, most useless, most unexpected, weirdest Pokemon. Cheats and tips too: even Forbes came up with an article on 10 things I wish I knew beforehand. Tips and cheatsheets have surfaced on Imgur and Quora.
Poking fun at Pokemon Go in your comic strip or graphic novel. Redoing the launch screen concept art, making fun of the enormous number of Zubats infesting the city, poking fun at the fact that even a useless Magikarp can run away. And of course, the server crashes.
Doing Pokemon characters in your illustrations. Pokemon Go character infographics, web comic series, Pokemon characters re-drawn as chibi, characters redone as zombies. You name it.
Artists draw landscape concepts in the style of Pokemon Go: pale blue pastels with rough edges.
A zoo updated the profile of each zoo animal using the Pokemon Go character card UI and design. They know how to connect with their young animal-loving patrons.
Be nice to your new visitors if your house happens to be close to a Pokestop or a gym. You do not want an internet mob to shame you online or storm your house in real life.
Purchasing Nintendo stock. It has fluctuated 10% up and down within a week.
Create a new mobile game. The game grosses $3 million a day in the Apple App Store. Apple customers are known to spend more. Clearly mobile gaming is going strong.
Code a new website and API for Pokemon Go sightings. Flex that programmer muscle and get some GitHub stars and forks. Example: Pokemon Go Map
Copying the Pokemon Go game mechanics. Without a doubt, players think Pokemon Go has assembled known game mechanics into a whole new kind of gameplay. Venture capitalists have already been pitched all kinds of Pokemon Go copycat games. Why not? Augmented reality gaming has just started. And apparently it doesn’t cost $3000 for a headset.
Sauce: these gems are from Reddit and Imgur.