Silicon Vanity | Tech lifestyle in Silicon Valley: March 2019

Sunday, March 31, 2019

Django Girls - friendly events that teach women to build websites using Django


Two amazing ladies from Poland teamed up with coaches around the world to teach girls and women how to use the Django web development framework.

Sunday, March 17, 2019

Machine Learning with No Code

Ludwig, an AutoML-style deep learning toolbox from Uber, lets users train deep learning models and run inference without writing code (caveat: you still have to run commands on the command line). It was previously an internal tool at Uber and is now open source to gather contributions. It is a Python library.

Siraj Raval gives this tutorial using Ludwig in Google Colab.

In Siraj Raval's words: "Don't hate. Copy & Paste." To install Ludwig, copy and paste the installation commands from the Uber GitHub page.

Wednesday, March 13, 2019

Flatten, Reshape, and Squeeze Explained - Tensors for Deep Learning with...

A matrix is a rank 2 tensor. It has two axes: the first indexes the arrays (rows), the second indexes the individual numbers within each array.
Check the dimensions of a tensor using .size() or .shape (in PyTorch, shape is an attribute, not a method)
Obtain the rank of the tensor by checking the length of its shape
len(tensor.shape) #returns 2 for a matrix
The number of elements in the tensor is the product of the component values in the shape: torch.tensor(my_tensor.shape).prod()
my_tensor.numel() #number of elements
The number of elements is important in reshaping: the new shape must account for exactly the same number of elements
Reshaping does not change the underlying data, only the shape
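
The bookkeeping above can be checked without PyTorch installed. The sketch below is a plain-Python stand-in using nested lists, not the torch API: shape_of, numel, flatten, and reshape are hypothetical helpers written only for illustration.

```python
import math

def shape_of(nested):
    """Return the shape of a regular, non-empty nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)

def numel(shape):
    """Number of elements is the product of the sizes along each axis."""
    return math.prod(shape)

def flatten(nested):
    """Flatten a nested list into a flat list, row by row."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat

def reshape(nested, rows, cols):
    """Regroup the same elements into rows x cols; the data is unchanged."""
    flat = flatten(nested)
    if rows * cols != len(flat):
        raise ValueError("new shape must keep the number of elements")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# A rank 2 tensor (matrix) with shape (3, 4), so 12 elements.
matrix = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
reshaped = reshape(matrix, 2, 6)
```

Because 3 x 4 and 2 x 6 both multiply to 12, the reshape succeeds; asking for 5 x 2 would raise an error, which is why the element count matters.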

Monday, March 11, 2019

Kaggle Challenge (LIVE)

Taxi Duration Challenge by Siraj Raval.

The model he will use is U-Net.

Sunday, March 10, 2019

How to Build a Compelling Data Science Portfolio & Resume | Kaggle Quora

Make every single line of the resume count, because recruiters and hiring managers may spend only about 20 seconds on it.
A Quora engineer's advice for Kaggle CareerCon attendees.
Resume basics: one page, one column, clean, simple. The best resumes are easy to skim. Remove distractions. Use bullet points that a reviewer can dig into. The less busy the better.

There are even LaTeX templates.

Make the resume easier to skim.

For tech and data science roles, hiring managers tend to care more about the skill set than about the cover letter or the objective statement.

Relevant coursework includes: Machine Learning, Linear Algebra, Data Analysis, Statistics, Statistical Modeling, NLP ... Order it by relevance to the job.

This is the relevant coursework that William Chen of Quora recommends. He also recommends ordering it by what is most relevant to the technical job. List Python and R first; SAS or Excel carries different connotations.

Word Embeddings Word2Vec LSTM, Recurrent Neural Network, GRU Review and Notes - Udacity Deep Learning Nanodegree Part 2

Word embeddings can use math to represent relations between words, such as man and woman, or work and worked

Embedding weights are learned during training

Embedding lookup means finding the corresponding row in the embedding layer

The embedding dimension is the number of hidden units

Encode each word as an integer

The embedding matrix is a weight matrix

The embedding layer is a hidden layer

Each row of the learned embedding matrix is a vector representation of the input word

The number of columns of the embedding matrix is the embedding dimension, i.e. the number of hidden units, usually in the hundreds

Words that appear in similar contexts are expected to have similar embeddings. Examples: I drink water throughout the day, I drink coffee in the morning, I drink tea in the afternoon.

Here water, coffee, and tea share a context, so their embeddings should be similar

The same goes for morning, throughout the day, and afternoon

Vector arithmetic

Map verb A from present to past

Map verb B from present to past

Both mappings should be the same vector transformation in embedding space
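
The lookup and the vector arithmetic above can be sketched with a toy embedding matrix. Everything here is an assumption for illustration: the four-word vocabulary, the three embedding dimensions, and the hand-picked weights (real weights are learned during training and would not line up this neatly).

```python
# Toy vocabulary: each word is encoded as an integer (made-up values).
vocab = {"work": 0, "worked": 1, "play": 2, "played": 3}

# Toy embedding matrix: one row per word, one column per embedding dimension.
# Hand-picked for illustration; real weights are learned while training.
embedding = [
    [1.0, 0.0, 0.0],  # work
    [1.0, 0.0, 1.0],  # worked
    [0.0, 1.0, 0.0],  # play
    [0.0, 1.0, 1.0],  # played
]

def lookup(word):
    """Embedding lookup: pick the row whose index is the word's integer code."""
    return embedding[vocab[word]]

def subtract(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]

# In this toy setup the present -> past offset is the same for both verbs.
tense_offset = subtract(lookup("worked"), lookup("work"))
predicted_played = add(lookup("play"), tense_offset)
```

With trained embeddings the two offsets would only be approximately equal, so in practice one looks for the nearest word vector to the result rather than an exact match.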

Saturday, March 9, 2019

Kaggle Earthquake Prediction Challenge

Objective:

Think like a data scientist

Categorical Gradient Boosting. Cat Boost Algorithm

Support Vector Machine for regression (it is more commonly known for classification)

Syllabus
Earthquake prediction background & helpful resources
Step 1 - installing dependencies
Step 2 - importing dataset
Step 3 - Exploratory data analysis
Step 4 - Feature engineering (statistical features added)
Step 5 - Implement CatBoost model
Step 6 - Implement support vector machine + radial basis function (RBF) kernel model
Step 7 - Future Directions (Genetic programming, recurrent networks etc.)
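
Step 4 in the syllabus can be sketched as follows. The notes do not say which statistical features were added, so the choice of mean, standard deviation, min, and max is an assumption, as are the helper names and the made-up signal values.

```python
import statistics

def segment_features(segment):
    """Summarize one window of signal readings with simple statistics."""
    return {
        "mean": statistics.mean(segment),
        "std": statistics.pstdev(segment),  # population standard deviation
        "min": min(segment),
        "max": max(segment),
    }

def build_features(signal, window):
    """Split the signal into non-overlapping windows and featurize each one."""
    rows = []
    for start in range(0, len(signal) - window + 1, window):
        rows.append(segment_features(signal[start:start + window]))
    return rows

# Made-up readings standing in for the seismic signal (assumption).
signal = [4, 6, 8, 6, 1, 3, 5, 3]
features = build_features(signal, window=4)
```

Each row of features would then become one training example for a model such as CatBoost or an SVR.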

Comment: maybe we can use an advanced RNN for earthquake prediction, since the data has a time series element

Install important libraries. Installations & Dependencies

!pip install kaggle

!pip install numpy==1.15.0

!pip install catboost

import pandas as pd

import numpy as np

from catboost import CatBoostRegressor, Pool

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV

from sklearn.svm import NuSVR, SVR

# kernel ridge regression (a kernel method like SVR, not itself an SVM)

from sklearn.kernel_ridge import KernelRidge

"Kernel methods are a way of improving Support Vector Machine Predictions. Make sure we can create a classifier line or regression line in a feature space we can visualize. You know? A lower dimension feature space"

#data visualization

import matplotlib.pyplot as plt

# Google Colab file access feature
# allows Colab to import data directly into colab
from google.colab import files
# retrieve uploaded file
uploaded = files.upload()
# move kaggle.json into the folder where the Kaggle API expects to find it
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
# we upload the kaggle.json file here so that Colab knows our Kaggle authentication
# to get it: go to My Account and create a new API token, which is downloaded as a JSON file
# now we can access the Kaggle competition list
!kaggle competitions list

Advanced Udacity Deep Learning Nanodegree

I have noticed that some of my classmates, Facebook PyTorch scholars, went fast and far beyond what the nanodegree requires. Here are some of the impressive things they did.

Train a model from scratch rather than using a pretrained model. "Try using more convolution layers, increasing the depth, decreasing learning rate, and keep the final fc layer simple, I used only one fc layer after the convolution layers" "It depends on how many epochs to train. I just did 35 epochs. Similar to VGG."
How long did it take everyone to train a CNN from scratch? It depends on GPU use. Some moved the training to Google Colab.

Codecademy tutorials and classes

New Codecademy Pro offers some great classes.

Python3
Minimax
Machine Learning Neural Networks
Recursion - Python
Web Design
Test Driven Development
C++ Vectors and Functions
Build Projects Using C++
Additional Codecademy Offerings https://pro.codecademy.com/offerings/

Intermediate Machine Learning Deep Learning Cheat Sheet

Traditional machine learning algorithms are mostly not designed for sequential data: do B after A, then C. This kind of stepwise output cannot be comfortably generated by traditional machine learning algorithms.

Deep Learning Deployment - Udacity Deep Learning Nanodegree Part 6

Notes on deployment in the machine learning workflow, from the Udacity Deep Learning Nanodegree. Tool: Amazon SageMaker service https://aws.amazon.com/sagemaker/

Problem introduction: the Kaggle Boston Housing competition, which tries to predict the median house value based on features such as the number of rooms. It makes sense that a house is more expensive if it has more rooms. However, there is always variance: noise in the data causes the results to fluctuate around the true trend.

Kaggle Intermediate Cheat Sheet

Intermediate Concepts. Source: Kaggle Live Coding

Bloom filter (a data structure): tests whether an element is a member of a set. Here it is used to look for overlap in data, for example checking whether there is any overlap or crossover between train and test data.
Use in NLP with n-grams. The choice of n is somewhat arbitrary: 8-grams are one option, 20-grams are typical because sentences are 20ish words, and 7-grams match a human memory span of around seven words (average spoken language may be 7-grams). Can do both to see the amount of overlap. Look at all sets of n-grams. Pairwise comparison: how many n-grams already exist in the set. An empty Bloom filter is a bit array of m bits, all set to 0 (Wikipedia). Each of the k hash functions maps an element to one of the m bit positions. k is much smaller than m.
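
A minimal sketch of the idea above, using only the standard library. The class and helper names, the sizes m=1024 and k=3, the salted SHA-256 hashing scheme, and the two example sentences are all assumptions for illustration, not what was used in the Kaggle live coding session.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m            # number of bits
        self.k = k            # number of hash functions
        self.bits = [0] * m   # empty filter: all bits set to 0

    def _positions(self, item):
        # Derive k bit positions by salting one hash with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

def ngrams(tokens, n):
    """All contiguous runs of n tokens, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up train and test sentences (assumption).
train = "i drink coffee in the morning".split()
test = "she will drink coffee in the evening".split()

seen = BloomFilter(m=1024, k=3)
for gram in ngrams(train, 3):
    seen.add(gram)

# Test n-grams that (probably) already exist in the training set.
overlap = [gram for gram in ngrams(test, 3) if gram in seen]
```

A Bloom filter never misses an element that was added, but it can report an element that was not, so the overlap count is an upper bound on the true overlap.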

Kaggle competition with Google Cloud New York Taxi Fare Competition https://www.kaggle.com/c/new-york-city-taxi-fare-prediction
Playground competition in partnership with Google Cloud, Coursera and Kaggle

Using Kaggle on Google Colab

Install Kaggle, and also install catboost

!pip install kaggle

# Google Colab file access feature

# allows Colab to import data directly into colab

from google.colab import files

# retrieve uploaded file

uploaded = files.upload()

# move kaggle.json into the folder where the Kaggle API expects to find it

!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

# we upload the kaggle.json file here so that Colab knows our Kaggle authentication

# to get it: go to My Account and create a new API token, which is downloaded as a JSON file

# now we can access the Kaggle competition list

!kaggle competitions list

Friday, March 8, 2019

Google Cloud Next 2019 Conference - Extremely Exciting Sessions

Here I highlight a list of amazing sessions mostly in DATA ENGINEERING and unique GOOGLE CLOUD CLIENT SPACE, GOOGLE CLOUD USE CASES.

Google Cloud Next 2019
Interesting Data Engineer Sessions at Google Cloud:
Moving from Cassandra to Auto-Scaling Bigtable at Spotify
Data Management: The New Best Practice for Incident Response
Google Cloud Platform from 1 to 100 Million Users
Google Cloud: Data Protection and Regulatory Compliance
Organizing Your Resources for Cost Management on GCP
TensorFlow 2.0 on Google Cloud Platform
Chatbots Will Empower Students and Teachers
Deploy Your Next Application to Google Kubernetes Engine
Fast and Lean Data Science With TPUs
Creating Interactive Cost and KPI Dashboards Using BigQuery
From Blobs to Tables: Where and How to Store Your Stuff
Data Processing in Google Cloud: Hadoop, Spark, and Dataflow
Enabling Healthcare in the Cloud: Mitigating Risks and Addressing Security and Compliance Requirements with GCP
An Insider's Look: Google's Data Centers
Data Integration at Google Cloud
G Suite Data Controls and Transparency
How AI Computer Vision and IoT Is Transforming Businesses
Smart Pallets for a Smart Warehouse: Building Advanced Computer Vision Systems Using Google Cloud IoT
30 Ways Google Sheets Can Help Your Company Uncover and Share Data Insights
Data for Good: Driving Social & Environmental Impact with Big Data Solutions
Data Warehousing With BigQuery: Best Practices
Extracting Value With a Cloud Clinical Data Warehouse
Future of Google Sites
Best Practices for Storage Classes, Reliability, Performance, and Scalability
Best Practices in Building a Cloud-Based SaaS Application
Building a Global Data Presence
End-to-end Training of a Model and Prediction Generation Using BigQuery ML
How to Secure and Protect Your Data in Cloud Storage
Rethinking Business: Data Analytics With Google Cloud
The Future of Health. Powered by Google
Accelerating Machine Learning App Development with Kubeflow Pipelines
Backup, Disaster Recovery and Archival in the Cloud
Bringing the Cloud to You AMA (Ask-Me-Anything)
Building and Securing a Data Lake on Google Cloud Platform
Deploy and Manage Virtual Workstations on GCP
Google Cloud DevOps: Speed With Reliability and Security
Understanding Google Cloud IoT: Connectivity Options and Examples
Unlocking the Power of Google BigQuery
Data Analytics
Building AI-Powered Customer Service Virtual Agents for Healthcare
Case Study: Using GCP to Measure Package Sizes in 3D Images
Cloud Native Application Development, Delivery and Persistent Storage
How to Run Millions of Self-Driving Car Simulations on GCP
Cruise Automation is a leading developer of autonomous vehicle technology. In this session, we will dive into the infrastructure which allows us to run hundreds of thousands of autonomous simulations every day and analyze the results quickly and efficiently. Cruise runs the vast majority of our testing on Google Cloud, taking advantage of high scalability of compute and GPU resources for our diverse workloads. Our simulation frameworks allow us to replay data gathered from road testing or generate complex variations
Integrating Smart Devices With the Google Assistant and Google Cloud IoT
Kaggle: Where 2 Million+ Data Scientists Learn, Compete, and Collaborate on AI Projects
Kaggle's the world's largest community of data scientists and AI engineers. You'll learn how 2 million+ users leverage Kaggle to learn AI, sharpen their skills on public competitions, incorporate 10,000's of public datasets into their projects, and analyze data in hosted Jupyter notebooks.
Ben Hamner, CTO, Kaggle
Migrating Data Analytics Solutions to Google Cloud
Take Care of Data Privacy in a Serverless World with Firebase
Machine Learning with TensorFlow and PyTorch on Apache Hadoop using Cloud Dataproc
Medical Imaging 2.0

Medical imaging is one of the largest sources of healthcare data. Join us in this talk to learn how cloud technologies and artificial intelligence enable new applications in the medical imaging domain, improving patient care and reducing physician burnout.
Python 3 and Me: Upgrading Your Python 2 Application
The Path From Cloud AutoML to Custom Model
Transforming Healthcare With Machine Learning
With the wealth of medical imaging and text data available, there’s a big opportunity for machine learning to optimize healthcare workflows. In this talk, we’ll provide an overview of the Cloud ML products that can help with healthcare scenarios, including AutoML Vision, Natural Language, and BQML. Then we’ll hear from IDEXX, a veterinary diagnostics company using AutoML Vision to classify radiology images.
How to Grow a Spreadsheet into an Application
Integrate Firebase into Your Existing Infrastructure
Customer Case for Anomaly Detection in MMORPG
Genomic Analyses on Google Cloud Platform
Using Google Cloud Platform and other open source tools such as GATK Best Practices and DeepVariant, learn how to perform end-to-end analysis of genomic data. Starting with raw files from a sequencer, progress through variant calling, importing to BigQuery, variant annotation, quality control, BigQuery analysis and visualization with phenotypic data. All the datasets will be publicly available and all the work done will be provided for participants to explore on their own.

Saving Even More Money on Compute Engine

Notable Clients of Google Cloud:
Journey to the Cloud Confidently With Citrix and Google Cloud
Square's Move to Cloud Spanner
Forbes' Road to the Cloud
Why Small and Medium Businesses are Going Google
Clorox Data Cleanup Using Advanced Cloud Dataprep Techniques
How Gordon Food Service Reimagined Collaboration Using G Suite
How Airbnb Secured Access to Their Cloud With Context-Aware Access

How Twitter Is Migrating 300 PB of Hadoop Data to GCP

Twitter has been migrating their complex Hadoop workload to Google Cloud. In this session, we deep dive into how Twitter's components use Cloud Storage Connector and describe our initial usage, features we implemented, and how Google helped us build those features in open source. We describe how Cloud Storage fits into our ecosystems and the experience and features which have helped us. We'll also talk about unique challenges we discovered in data management at scale.

Optimizing File Storage for Your Use Case

Music Recommendations at Scale with Cloud Bigtable
Spotify serves personalized music recommendations to hundreds of millions of happy customers worldwide, and powers a lot of this infrastructure with Google Cloud Bigtable. In this talk, we'll go into detail about how Cloud Bigtable allows us to deliver recommendations at scale, roll out experiments quickly, and ingest terabytes every day via Cloud Dataflow. We'll discuss a number of challenges we overcame when designing our recommendations infrastructure on top of Cloud Bigtable, including tips about how to design a good schema, how to avoid latency when ingesting new data, and effective caching strategies to scale to tens of millions of data points per second.
Real-Time, Serverless Predictions With Google Cloud Healthcare API
Target's Application Platform (TAP)

Google Cloud for Its Business Partners, Use Case Showcase
Automate Cancer MCA using Cloud Vision API and GCMLE
Learn how Pluto7 built a model to extract the text from Clinical protocols using Cloud Vision API and automatically predicted whether clinical treatments, based on their criteria, were classified, covered by researcher of clinical trial, or by the patient's insurance. We used Cancer Clinical trial protocols by the customer to train word-embeddings and we constructed a dataset of short free-text labeled R or S (Researcher or Sponsor).
GitLab's Journey from Azure to GCP and How We Made it Happen
How Booking.com Uses BigQuery ML to Assess Data Quality and Other Features
How News Corp Transformed into a Data-Driven Organisation
Future of Work With Cisco and Google
How Schlumberger is Building Enterprise Solutions for the Future with Google
Kaiser Permanente's Journey Towards an API-First IT Strategy
Everyone Flies Faster When BigQuery Fuels the BI Engines at AirAsia
How Pandora is Migrating Its On-Premises BI & Analytics to GCP

Composing Pandora's Move to GCP With Composer
A Glimpse Into CBS Interactive’s AI/ML Group
State of the Art: SAP on Google Cloud
What Did the Doctor Say? Mining Clinical Notes With GCP
Marianne Slight, Product Manager, Google Cloud Healthcare & Life Sciences, Google Cloud

How Equifax Accelerates Time-to-Market with Microservices and APIs
How Macy's Executes DevOps at Scale on GCP

How HSBC Leverages GCP For Regulatory Reporting
Using Google's Data and AI Technologies with Kaggle

HSBC Invents New Technology as They Migrate to BigQuery
Learn How Cardinal Health Migrated Thousands of VMs to GCP
Using AI to Transform Your Fleet Operations

What Did the Doctor Say? Mining Clinical Notes With GCP
Data Analytics
April 11 | 2:35–3:25 PM
With note bloat now at 80%, it has become harder than ever to trace medical decision-making in the electronic medical record. But the physician's clinical notes provide that context along with nuggets of gold that aren't easily documented in the structured EMR. Join this session to discover how to mine clinical concepts from the physician notes, map them to standard vocabularies, augment the EHR data with them, and use them in your CDW analysis or FHIR applications.

Breakout, Intermediate, Healthcare
Speaker: Marianne Slight, Product Manager, Google Cloud Healthcare & Life Sciences, Google Cloud

Google Cloud Basics Cheatsheet

Project: use a project to organize a collection of your Google Cloud resources and services. For example, if you have a WordPress site, it should be one project. If you use BigQuery to view its data, put the BigQuery instance under the same project. Project IDs are unique. If your resources or services need a unique name, you can prepend the project ID, which is guaranteed to be unique.

Products
Google Colab: great for machine learning deep learning, GPU availability, without need to worry about dependencies.
Google Datastore: Cloud Datastore is a scalable NoSQL database. Cloud Datastore handles sharding and replication. Supports ACID transactions, SQL-like queries, indexes.

Training
- Free, live, instructor-led Google Cloud Platform (GCP) onboarding. Google Cloud OnBoard offers free classes on Google Cloud Platform Fundamentals: Core Infrastructure; Google Cloud Platform Fundamentals: Big Data and Machine Learning; and Kubernetes.

While these free classes have limited availability, the paid online versions are always available on Coursera.

Google Cloud Engineer certification
All Google Cloud Courses on Coursera https://www.coursera.org/googlecloud
Machine Learning with TensorFlow on Google Cloud Platform Specialization https://www.coursera.org/specializations/machine-learning-tensorflow-gcp
Google Cloud NEXT free coursera course credit https://www.coursera.org/promo/NEXTExtended
Data Engineering on Google Cloud Platform Specialization
https://www.coursera.org/specializations/gcp-data-machine-learning
Preparing for the Google Cloud Professional Data Engineer Exam
https://www.coursera.org/learn/preparing-cloud-professional-data-engineer-exam

Associate Cloud Engineer
https://cloud.google.com/certification/guides/cloud-engineer/

Machine Learning with Spark on Google Cloud Dataproc in data lab

We have a disclaimer article that applies to all our articles and the entire site. Please read for usage and disclaimer statement. Generally all information is only for personal information, not meant for commercial nor production usage.

Google cloud pricing

Terminology:

ingress: traffic entering or uploaded into Google Cloud Platform
egress: traffic exiting or downloaded from Google Cloud Platform

Ingress traffic can be free, while egress traffic is charged based on the source and destination of such traffic

https://stackoverflow.com/questions/27627630/what-does-compute-engine-network-internet-egress-mean-to-google-cloud

GAN - Udacity Deep Learning Nanodegree Part 5

A GAN, given a training dataset, can generate new images or outputs that have never been seen before.

StackGAN can take a description of an image, such as a bird, and generate a photo of that bird. iGAN converts sketches to images. Pix2Pix does image-to-image translation: a blueprint for a building turns into the building. #edges2cats turns doodles of cats into real cats. GANs can be trained in unsupervised ways. CartoonGAN is trained on faces and cartoons but does not need to be trained on face and cartoon pairs; it learns how to convert without being explicitly told. GANs can also turn photos of day scenes into photos of night scenes. CycleGAN (Berkeley) is especially good at unsupervised image-to-image translation. The best example is a video of a horse turned into a video of a zebra; even the surroundings changed from grassland to savannah. See links to the networks below. Generating simulated training sets: the Apple example turns unrealistic synthetic eye images into realistic ones and trains models to learn where the user is looking. Imitation learning, reinforcement learning (data): imitate the action that would be taken by experts. GANs can generate adversarial examples: images that look normal to humans but can fool neural networks.

StackGAN - https://arxiv.org/abs/1612.03242
iGAN - https://github.com/junyanz/iGAN

Other generative models
Fully visible belief networks: output is generated one element at a time, for example, one pixel at a time. Aka autoregressive models, known since the 90s.
The breakthrough is generating in one shot: GANs generate an entire image in parallel, using a differentiable function in the form of a neural network.

"Generator Network takes random noise as input, runs that noise through a differentiable function to transform the noise, reshape it so it has recognizable structure." - Ian Goodfellow

The output of a generator network is a realistic image. The choice of the noise input determines which image will come out of the network. "The goal is to have these (output) images to be a fair sample of real image data" - Ian Goodfellow

The generator network has to be trained. The training process is very different from that of a supervised model. The generator network is not supervised. "We just show it a lot of images. And ask it to make more images that come from the same probability distribution."

The second network: the discriminator, a normal neural network classifier, guides the generator network. The discriminator is shown real images half of the time, and fake images the other half of the time. It classifies whether the image is real or not.

The generator network's goal is to make compelling images to which the discriminator assigns a 100% probability of being real.

Over time, the generator has to produce realistic outputs, almost replicas of real data. The generator network takes in noise z and generates an output x. Wherever the generator maps more values of z, the generated distribution over x becomes denser. The discriminator outputs high numbers (higher probability) wherever the real data density is higher than the generated data density. The generator then changes its output to catch up.
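
The tug of war described above can be written down as two loss functions. The notes do not give formulas, so this sketch assumes the commonly used binary cross-entropy form with the non-saturating generator loss; the discriminator outputs below are made-up probabilities, not results from a trained network.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images should score 1, generated images 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """The generator wants the discriminator to score its images as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs: probability that an image is real.
d_real = [0.9, 0.8]
early_fakes = [0.1, 0.2]   # early in training: fakes are easy to spot
later_fakes = [0.7, 0.8]   # later: the generator has caught up

early_g = generator_loss(early_fakes)
later_g = generator_loss(later_fakes)
```

As the generated images get more convincing, the generator's loss falls and the discriminator's loss rises, which is the "catch up" dynamic in the paragraph above.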

ROC Curve Basics Cheatsheet

A receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings (source: Wikipedia). It is a measure of how good the decision boundary is at each threshold.

Aka the sensitivity vs. (1 - specificity) curve: TPR is the sensitivity, and FPR is 1 - specificity.

The horizontal axis plots the false positive rate. The vertical axis plots the true positive rate.

The benchmark is a random guess: the 45 degree line, with an area under the curve of 0.5. In the best scenario, a perfect split, the area under the curve is 1.

There are two "extreme" points. If we classify everything as positive, (FPR, TPR) = (1, 1). If we classify nothing as positive, (FPR, TPR) = (0, 0), because the true positive rate is true positives / all positives = zero / all positives.

True Positive Rate
True positive rate = true positive / all positives

False Positive Rate
False positive rate = false positive / all negatives

To plot the entire curve, compute the two rates for as many thresholds (splits) as possible.

Area under the curve is an important metric.
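
A small worked example of the definitions above, in plain Python. The labels and scores are made up, and roc_points and auc are hypothetical helper names; libraries such as scikit-learn provide production versions of the same calculation.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up ground truth (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)  # 8/9 here: 8 of the 9 positive/negative pairs are ranked correctly
```

The first point is (0, 0) (nothing classified as positive) and the last is (1, 1) (everything classified as positive), matching the two extreme points described above.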

Tensorflow Basics cheat sheet, reviews, notes

Also read our latest Tensorflow 2.0 Alpha blog post March 2019.

TensorFlow Data Validation is great for previewing summary statistics and data distributions.

Tensorflow Lite
Tensorflow for mobile, portable devices and embedded devices. Works especially well on Android.

"An embedded device is a highly specialized device meant for one or very few specific purposes and is usually embedded or included within another object or as part of a larger system."
Tensorflow Use Case

Additional Resources

Tensorflow youtube channel and twitter.

Thursday, March 7, 2019

WiDS Datathon Webinar 2019 notes and tutorials

Presentation by Jennifer Marsman Principal Engineer at Microsoft blog.msdn.microsoft.com/jennifer

Supervised Learning and Unsupervised Learning
Supervised Learning - Requires labeled data
Unsupervised Learning - Does not require labeled data
Supervised Learning: using historic data to make new inferences.
Unsupervised learning: make sense of data without labels, cluster similar things together, look for relationships, form groups, explore the data.

Reinforcement learning: an agent learns from rewards instead of labeled examples; it is sometimes loosely described as semi-supervised. Look at our Markov Decision Process post. Also AlphaGo.

For this Kaggle competition, supervised learning will be helpful.

Regression vs Classification
Regression: predicts a number, such as the cost of a home or sales figures (for example with the Zillow dataset), by developing a mathematical model.
Classification: YES/NO, categorical, discrete, team A, B, C.

Machine Learning Terminologies
Features: number of bathrooms, zip code, etc.
Training vs inference: training fits the model on historic data; inference uses the trained model to make predictions on new data.
Data split: don't use up all the data for training. Withhold some data to test how the model performs on data it has never seen. 15 years ago a 70%/30% data split was the gold standard.
Cross validation, K-fold: divide the training data into 10 chunks, use 9 for training and 1 for validation, then repeat with a different chunk held out for validation, until every chunk has been the hold-out once.
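
The K-fold rotation can be sketched in a few lines of plain Python. The function name k_fold_indices is hypothetical, the 20 samples are an arbitrary choice, and this version does not shuffle the data first (in practice the indices are usually shuffled once before splitting).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 20 samples, 10 folds: each fold trains on 18 samples and validates on 2.
folds = list(k_fold_indices(20, 10))
```

Every sample lands in the validation set exactly once, so averaging the 10 validation scores uses all of the data without ever testing on a sample the model was trained on.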

Precision, Recall, Accuracy

Azure for ML
- Data Science Virtual Machine includes all the common data tools, pre-installed: Anaconda, R, Python, TensorFlow, common deep learning frameworks and tools.
- Batch AI: train models at scale.
- Azure credits.

Transfer Learning:
Deep neural networks require a lot of data. Transfer learning can help when that is not available, for example if we only have 2,000 images.

In the earlier layers the NN recognizes edges and shapes. Not until much later does it recognize actual objects in images.

Use a known dataset and architecture that has already figured out edges and shapes, and only fine-tune the final layers.

Learning Resources on Microsoft
Microsoft School of AI
Microsoft tutorials for analytics, Microsoft AI learning via github, Microsoft for AI, Microsoft Program for Data Science, Microsoft Program for Data Analysis. https://azure.github.io/learnAnalytics-public/

Tutorial Presentation Getting Started on Kaggle

Vani Mandava Director Data Science Microsoft Research @vanimt

Kaggle Competition Overview and Tutorial

Oil palm plantations produce palm oil in Africa and South and Central America, often at the cost of deforestation. High resolution satellite images and computer vision help us track this environmental issue. The dataset was created by the West Big Data Hub and the WiDS Datathon Committee. The task: develop a model to detect whether an oil palm plantation is present. This competition has ended.

A supervised learning problem in computer vision. 15K images in training, 4K images in the test set, 2K images in holdout.

Lessons from 200+ Data Science Interviews

Follow us for beginner friendly tutorials. We will also publish a better quality version of this article on medium.

Notes from the Metis webinar of the same title, with my own commentary and opinions. Metis is a data science training bootcamp with technical portfolio projects. Metis also offers bootcamp prep courses, for example Python prep.

At his previous job at Capital One, the presenter started the machine learning powerhouse team that grew from 2 to 80, so he has lots of experience hiring machine learning engineers and data scientists.

Understand the hiring process and hiring funnel
Employer: Define roles, build a funnel, attract applicants, interview & hire, pro tips

Define Roles
Roles: Data Scientist, Data Engineer, Machine Learning Engineer
A data engineer may be a software engineer who codes and is ready to tackle a large, complex dataset, doing ETL and transformations, and massaging data so it is ready for model building.
Data scientists are ready to handle cleaner data that is ready for analysis.

Machine learning engineer: a more senior job title, a sweet marriage of data engineer and data scientist. Knows Java or C++ and can explain algorithms such as KNN and k-means.

8:35 In finance, the job descriptions, roles, and skill definitions are well defined, even mandated. Experience requirements are set and even strict. Startup role definitions tend to be loose, even chaotic.

Set internal expectations. Set expectations for the people who conduct interviews. For hiring, it's better to set expectations right to attract the right candidates. This results in a smoother experience for candidates, who also feel that the experience is tailored. A Google onsite panel is assembled based on the candidate's strengths and interests.

12:00 A data analyst may know SQL but may not know a programming language.
Data analysts and data engineers working on ETL infrastructure may also need to know data visualization.

Building a funnel and attract talents
13:40
Generally, with funnel, one starts with a large number of people and quickly decreases to a small number of people - monotonically decreasing.

The goal is to build a funnel, attract people to the funnel, and optimizing it. Comment: It is an important startup growth, product management technique.

Application Funnel
Funnel does not really differ by company size.
Entry point: sourcer, referral, cold applications. Sourcer will actively reach out to candidates, but it is a quick process. They won't spend more than 1 minute at your profile.

If you have a person to follow up with, you are already further down the funnel.

A cold application can land in a pool of thousands, or tens of thousands, of applications.

Referrals are much deeper in the funnel, even "half way there already". At a meetup, someone you talk to may know someone who is hiring even if they are not hiring directly.

Non-technical phone screen: pulse check, culture check, is this person generally agreeable, broad skill check with a recruiter, how are their communication skills. Check if this person knows the company language.

17:00
Technical phone screen: perhaps everyone's least favorite part. Culture check, a little CS problem, a data science or algorithms problem. It is a "coarse filter": does the candidate have enough skills to justify an on-site interview? 4-6 hours of engineering time is valuable, so it's best not to spend it on candidates who are not ready. Recommended book: Cracking the Coding Interview.

On-site: 4-8 hours, a proper day, including technical and non-technical interviews. Sometimes there is even a post-on-site step. Discuss candidate feedback, make an offer, set expectations. Comment: I heard that Microsoft sometimes does this on the day of the interview.

Offer includes salary, compensation, starting date.

First day: it would be nice to have a small celebration.

Sometimes there's a take-home. The presenter's view: "If someone asks me to do 8 hours of work for free, I will just say no."

"Just make sure you have a good pipeline. Constantly get people through. With well defined steps." - Presenter on building a great application funnel.

Presenter's preferred method of finding best candidate: organic means of finding candidates. Conferences of relevant topics (shows that they are committed and passionate if they are spending time on a Thursday night for hours learning about a topic), meetups, speaker reception, exchange business cards, rolodex? Existing contacts, networks.

Wednesday, March 6, 2019

Tensorflow 2.0 cheatsheets notes and reviews

#TFDevSummit #tensorflow 2.0 alpha is here: eager execution by default, keras is now the higher level API (best practice), lots of changes to tf.contrib, tensorflow.js, tensorflow lite for mobile. #MachineLearning #data #keras

Keras is the high level API
Tutorials: deeplearning.ai and Udacity offer free courses on Tensorflow 2.0 alpha

Product Manager Basics Part 1

Important product management skills

Notes from #productcon london 2019 afternoon session

vision, idea, roadmap
customers, stakeholders internal and external, investors, talent future employees

Required skill set for product managers: need to sell to a lot of stakeholders
Founder/ VP / Director product
group product manager / senior PMs
product managers / associate PMs

Even convincing potential employees to join the organization requires the PM to be persuasive.

Being good at selling means convincing the right audience, and can even help find investors for the product.

Important things to track: task owner, due dates, change in status

Tuesday, March 5, 2019

LSTM, Recurrent Neural Network, GRU Review and Notes - Udacity Deep Learning Nanodegree

RNN: averaging noisy samples yields less noisy results.

RNN with music input: each row is a new note, like a beat in the song, so the number of rows is the number of beats in the song. The columns are mostly zero except for one entry in each row, which is the note, one-hot encoded.

timestep is a segment of the song length we are using to train. Ensures same sequence length
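A minimal sketch of the two ideas above, one-hot rows and fixed-length timesteps, in plain Python. The song, the 4-note vocabulary, and the helper names `one_hot_rows` and `split_into_timesteps` are made up for illustration; they are not from the Udacity notebook.

```python
def one_hot_rows(note_ids, num_notes):
    """Return one row per beat; each row is all zeros except a 1 at the note's index."""
    rows = []
    for note in note_ids:
        row = [0] * num_notes
        row[note] = 1
        rows.append(row)
    return rows


def split_into_timesteps(rows, seq_length):
    """Cut the song into equal-length segments, dropping any leftover beats."""
    usable = len(rows) - len(rows) % seq_length
    return [rows[i:i + seq_length] for i in range(0, usable, seq_length)]


# Hypothetical song: 7 beats, notes drawn from a 4-note vocabulary.
song = [0, 2, 1, 3, 3, 0, 2]
encoded = one_hot_rows(song, num_notes=4)
segments = split_into_timesteps(encoded, seq_length=3)

print(encoded[1])     # [0, 0, 1, 0]
print(len(segments))  # 2 segments of 3 beats each; the 7th beat is dropped
```

Dropping the leftover beats is only one way to guarantee equal sequence lengths; padding the last segment would be another.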

https://github.com/udacity/deep-learning-v2-pytorch/blob/master/recurrent-neural-networks/char-rnn/Character_Level_RNN_Solution.ipynb

LSTM overcomes the vanishing gradient problem of RNN. Back propagation through time can make the gradient too small; LSTM avoids this loss of information.

LSTM allows learning across many different steps. 1000 steps.
The cell is fully differentiable. All its functions have a derivative, and hence a gradient. That can be computed. Including: sigmoid, hyperbolic tangent, multiplication, addition. Easy use of backpropagation or SGD to update the weights.

Sigmoid gates are the key to managing what goes into the cell, what is retained within the cell, and what passes to the output.

If the hidden state passed to an RNN is None, the initial hidden state values will all be zero.
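To make the gating idea concrete, here is a rough sketch of a single LSTM step on scalar values, using only the standard library. The gate names follow the usual LSTM terminology, but the function `lstm_step` and the all-zero weights are illustrative assumptions, not code from the course.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One time step of a scalar LSTM cell.

    weights maps a gate name to (w_x, w_h, bias); the values are illustrative only.
    """
    def pre_activation(name):
        w_x, w_h, bias = weights[name]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))          # what is retained in the cell
    write = sigmoid(pre_activation("input"))            # what goes into the cell
    candidate = math.tanh(pre_activation("candidate"))  # proposed new content
    read = sigmoid(pre_activation("output"))            # what passes to the output

    c = forget * c_prev + write * candidate
    h = read * math.tanh(c)
    return h, c


gate_names = ("forget", "input", "candidate", "output")
zero_weights = {name: (0.0, 0.0, 0.0) for name in gate_names}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, weights=zero_weights)
print(c)  # 0.5: with zero weights every sigmoid gate is 0.5, so half the old cell state is kept
```

Every operation here (sigmoid, tanh, multiplication, addition) has a derivative, which is why the whole cell can be trained with backpropagation.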

At first the blue line is just flat; it hasn’t learned anything yet. As it learns, it starts to track the red line well. Eventually it gets close. But suddenly, in this Udacity lecture, the graph looks like it flipped upside down?! This is the same graph, but for better visualization it is flipped, so that the two lines look like they track each other nicely on the new axis. The lecturer didn’t point this out, so it looked surprising.

https://pytorch.org/docs/stable/nn.html#recurrent-layers

If you detach the hidden variable by assigning hidden.data to a new variable, back propagation no longer runs through that detached variable.

GRU hidden state dimensions: (num_layers, batch size, hidden dimension)

One nuance is that the tanh activation function may work better than sigmoid with RNN.

Gated Recurrent Unit

Works well in practice. Only has one working memory, not two (LSTM has long term and short term memory). Has UPDATE GATE (combines learn and forget gate) and runs through COMBINE GATE.

LSTM with peephole connections:

long term memory (LTM) also contributes to decisions made by short term memory (STM) and current event (E). Previously, there's a NN just on those two. Now the NN activates all three with a bias.

Tech Digest - Spring 2019

Machine Learning and Human Bias https://youtu.be/59bMh59JQDo via @YouTube #machinelearning #google #datascience

Udacity Launches Data Engineering Nanodegree for $999: learn how to build data pipelines and do data analysis at scale.

I just published Anaconda Miniconda Cheatsheet for Data Scientists https://link.medium.com/6fzPm7ojVU #anaconda #data #datascience #machinelearning #deeplearning

Tensorflow 2.0 or Pytorch on Quora https://www.quora.com/Which-is-better-TensorFlow-2-0-or-PyTorch

Difference between fit(), transform() and fit_transform(). Answered on StackOverflow

Getting Started with Natural Language Processing NLP for Beginners

Natural Language Processing (NLP) is not supposed to be easy! But let’s try to simplify for beginners. Follow us for more beginner friendly articles like this.

Natural Language Processing or NLP is a subset of the field of Artificial Intelligence. It is a field that analyzes human language and takes text as input. The entire text dataset, the input data, is called the corpus. For example, we can calculate how many times a word appears in the corpus. This count is called term frequency.

“Hi there! It’s good to see you. I just wanted to say hi.” # The sentence is the corpus. The term frequency of ‘hi’ is 2, because it appears twice in the corpus, if our analysis is case insensitive (‘Hi’ equals ‘hi’). If it is case sensitive, then the term frequency of ‘Hi’ is one, and the TF of ‘hi’ is also one.
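The counting described above can be sketched with the standard library's `Counter`. The regular expression used to split words is a simplifying assumption, not a full tokenizer.

```python
import re
from collections import Counter

corpus = "Hi there! It's good to see you. I just wanted to say hi."

# Treat any run of letters or apostrophes as a word (a simplifying assumption).
tokens = re.findall(r"[A-Za-z']+", corpus)

case_sensitive = Counter(tokens)
case_insensitive = Counter(token.lower() for token in tokens)

print(case_sensitive["Hi"], case_sensitive["hi"])  # 1 1
print(case_insensitive["hi"])                      # 2
```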

We will elaborate on term frequency later.

Practical tip: Sometimes it is important to be case sensitive. For example, Trump may refer to Donald Trump, while trump is a verb often used in card games to describe one card outranking another. When case doesn’t matter, a common preprocessing and data cleaning technique is to change all text of the corpus to lower case. Lowering: lower_case_corpus = corpus.lower() The function .lower() is a python string method. For example “Hello there!” will become “hello there!”.

Bag of Words — a common, introductory model for Natural Language Processing NLP

Codecademy.com explains bag-of-words model: “A bag-of-words model totals the frequencies of each word in a document, with each unique word being its own feature and its frequency being the value.”

If you haven’t studied Machine Learning, the word feature may make no sense. There are tricks that may help you understand it. We can imagine the output of a bag-of-words model as a python dictionary / hashmap of key value pairs, or as an Excel sheet. The features are the keys in the dictionary or the column headers in the Excel sheet. Features are meaningful representations of the data. Machine Learning learns from features and predicts outcomes called labels.

For example useful features of Person data — information that describes people — may include: height, gender, name, government issued ID number etc.

Pro Tip: what is the feature dimension? What is the size or the number of features? It equals the size of the vocabulary found in the corpus.
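Before reaching for a library, the dictionary and spreadsheet picture can be sketched in plain Python. The two tiny documents are hypothetical, and splitting on whitespace is a simplification.

```python
from collections import Counter

# Two tiny hypothetical documents.
documents = ["the cat sat", "the cat saw the dog"]

# One Counter (a dictionary of word -> frequency) per document.
bags = [Counter(doc.split()) for doc in documents]

# The vocabulary is every unique word in the corpus; each word is one feature.
vocabulary = sorted(set(word for bag in bags for word in bag))

# One row per document, one column per feature, like a spreadsheet.
matrix = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)       # ['cat', 'dog', 'sat', 'saw', 'the']
print(matrix)           # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
print(len(vocabulary))  # 5, the feature dimension
```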

corpus = ["You are reading a tutorial by Uniqtech. We are talking about Natural Language Processing aka NLP. Would you like to learn more? Learn more about Machine Learning today!"]
# if use corpus = "..."
# receive error
# ValueError: Iterable over raw text documents expected, string object received.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
bow = count_vect.fit_transform(corpus)
bow.shape #(1,22)

count_vect.get_feature_names_out()
# scikit-learn 1.0 and later; older versions use count_vect.get_feature_names()
#array(['about', 'aka', 'are', 'by', 'language', 'learn', 'learning', 'like', 'machine', 'more', 'natural', 'nlp', 'processing', 'reading', 'talking', 'to', 'today', 'tutorial', 'uniqtech', 'we', 'would', 'you'], dtype=object)

bow.toarray()
#array([[2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]])

Pro tip: what does CountVectorizer do per the Sklearn documentation? It will “Convert a collection of text documents to a matrix of token counts” and return a sparse matrix, scipy.sparse.csr_matrix. Just an FYI. Don’t think too hard about it now.

The feature names are returned by count_vect.get_feature_names() (get_feature_names_out() in newer scikit-learn versions) and bow.toarray() gives us the frequency of the corresponding features. For example, the first word ‘about’ appears twice in the corpus so its frequency is 2. The last word ‘you’ also appears twice.

How is it useful? This common model is surprisingly powerful. There is some criticism of the author of 50 Shades of Grey on the internet: the claim is that she is not a sophisticated author because her books use only a limited English vocabulary. Apparently people have found that she uses some simple non-descriptive words too often, such as love and gasp. Below is a meme that makes fun of 50 Shades of Grey.

How did people know the author uses gasp a lot? Word count, word frequency of course!

If we read through this Word Frequency Analysis of the 50 Shades of Grey Trilogy, indeed we have to scroll down quite far to see a complex word that is also frequently used such as murmur.

Some argue, however, that precisely because the author uses an easy-to-read colloquial style, the series has gained wide readership and popularity.

Surprisingly, this simple model is quite insightful and already generates a good discussion.

Sample natural language processing workflow and NLP pipeline:

Data cleaning pipeline for text data

cleaning (regular expressions)
sentence splitting
change to lower case
stopword removal (most frequent words in a language)
stemming — demo porter stemmer
POS tagging (part of speech) — demo
noun chunking
NER (name entity recognition) — demo opencalais
deep parsing — try to “understand” text.

Important Natural Language Processing Concepts

Stop Words Removal

Stop words are words that may not carry valuable information.

In some cases stop words matter. For example researchers found that stop words are useful in identifying negative reviews or recommendations. People use sentences such as “This is not what I want.” “This may not be a good match.” People may use stop words more in negative reviews. Researchers found this out by keeping the stop words and achieving better prediction results.

While it is common practice to remove stop words and return only the cleaned text, removing stop words does not always give better prediction results. For example, not is considered a stop word in some NLP libraries, but not is a very significant word in negative reviews or recommendations in sentiment analysis. For example, if a customer states “I would not buy this product again, and would not accept any refund. Really not a good match at all.”, the word “not” is a strong signal that this review is negative. A positive review may sound, well, positive! “I really like the product! I enjoyed it very much. Not what I expected at all.” In this case, the negative review uses the word “not” three times, versus once in the positive review.
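A small sketch of why blindly removing stop words can hurt, using the two reviews above. The stop word list here is a tiny hypothetical one chosen to include "not"; real library lists are much longer.

```python
import re

negative = ("I would not buy this product again, and would not accept any refund. "
            "Really not a good match at all.")
positive = "I really like the product! I enjoyed it very much. Not what I expected at all."


def tokenize(text):
    """Lower-case the text and keep runs of letters or apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def count_word(text, word):
    return tokenize(text).count(word)


# A tiny hypothetical stop word list that happens to include "not".
stop_words = {"i", "a", "the", "and", "this", "it", "at", "not"}


def remove_stop_words(text):
    return [token for token in tokenize(text) if token not in stop_words]


print(count_word(negative, "not"))           # 3
print(count_word(positive, "not"))           # 1
print("not" in remove_stop_words(negative))  # False: the signal is gone
```

One option is to keep a custom stop word list with "not" taken out when the task is sentiment analysis.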

Removing punctuation may also yield better results in some situations.

NLP Techniques — Removing punctuations with Regex

Punctuation is not always useful in predicting the meaning of texts. Often it is removed along with stop words. What does removing punctuation mean? It means keeping only the alphanumeric characters. Regex programming lessons can fill books! Just use the nifty function below for short texts. For longer texts that require more processing power, use generators to iterate through each line of text and keep only alphanumeric characters. For big data, use parallel processing to process multiple lines of text at once.

This process of removing numbers and punctuation is called pruning.

Regex removes punctuation

#import regex
import re

corpus = "You are reading a tutorial by Uniqtech. We are talking about Natural Language Processing aka NLP. Would you like to learn more? Learn more about Machine Learning today!"

# keep only letters and digits
no_spaces = re.sub(r"[^a-zA-Z0-9]+", "", corpus)
no_spaces
# 'YouarereadingatutorialbyUniqtechWearetalkingaboutNaturalLanguageProcessingakaNLPWouldyouliketolearnmoreLearnmoreaboutMachineLearningtoday'
# note: spaces are removed too

# adding \s to the negated set means whitespace is kept
# apply it to the original corpus, not to the result above
cleaned = re.sub(r"[^a-zA-Z0-9\s]+", "", corpus)
cleaned
# returns 'You are reading a tutorial by Uniqtech We are talking about Natural Language Processing aka NLP Would you like to learn more Learn more about Machine Learning today'

Go ahead, just use the above method and avoid reinventing the wheel.
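For longer texts, the line-by-line generator approach mentioned earlier might look like the sketch below. The name `clean_lines` and the sample lines are made up for illustration; any iterable of strings, such as an open file, could be passed in.

```python
import re

# Anything that is not a letter, a digit, or whitespace.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]+")


def clean_lines(lines):
    """Yield each line with punctuation removed, one line at a time."""
    for line in lines:
        yield NON_ALNUM.sub("", line)


# Any iterable of strings works here, including an open file object.
sample_lines = ["Hello, world!\n", "Is this NLP? Yes: 100%.\n"]
cleaned_lines = list(clean_lines(sample_lines))
print(cleaned_lines)  # ['Hello world\n', 'Is this NLP Yes 100\n']
```

Because the generator yields one line at a time, the whole text never has to sit in memory at once.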

Pro Tip: python also has a built-in alphanumeric checker, the string method .isalnum(). Another method, .isalpha(), only returns true for letters; a string containing a number will not evaluate to true.
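A quick illustration of these built-in string methods, including a regex-free way to strip punctuation. The sample sentence is arbitrary.

```python
print("abc123".isalnum())  # True: letters and digits only
print("abc123".isalpha())  # False: contains digits
print("abc".isalpha())     # True
print("hello!".isalnum())  # False: "!" is not alphanumeric

# Keep only alphanumeric characters and whitespace, without regex.
text = "Would you like to learn more? Learn more today!"
kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
print(kept)  # Would you like to learn more Learn more today
```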

There are always hackers coming up with fancy regex code! It keeps getting fancier.

from nltk.tokenize import RegexpTokenizer  # a regex-based tokenizer
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(corpus)

# \w+ matches runs of one or more word characters (letters, digits, underscore),
# effectively removing all punctuation

Tokenization

Tokenization: breaking texts into tokens. Example: breaking sentences into words, or grouping words together depending on the scenario. There’s also the n gram model and the skip gram model.

Basic tokenization is 1 gram. An n gram or multi gram is useful when a phrase yields a better result than one word. For example, for “I do not like Banana.” the one grams are I _space_ do _space_ not _space_ like _space_ banana. It may yield a better result with a 3 gram model: I do not, do not like, not like banana. With padding at the end of the sentence, like banana _space_ and banana _space_ appear as well.

ngram : n is the number of words we want in each token. Frequently, n = 1.

Did you know that Google digitized many books and analyzed the literature based on the n gram model? Nice work Google!
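A minimal n gram helper in plain Python, without any end-of-sentence padding. The function name `ngrams` here is our own small helper, not a library import.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I do not like banana".split()

print(ngrams(tokens, 1))
# [('I',), ('do',), ('not',), ('like',), ('banana',)]
print(ngrams(tokens, 3))
# [('I', 'do', 'not'), ('do', 'not', 'like'), ('not', 'like', 'banana')]
```

A sentence of 5 words gives 5 one grams and 3 three grams; in general there are len(tokens) - n + 1 of them.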

Lemmatization

Lemmatization: transforming words into their roots. Example: economics, micro-economics, macro-economists, economists, economist, economy, economical, economic forum can all be transformed back to the root econ, which can mean this text or article is largely about economics, finance or economic issues. Useful in situations such as topic labeling. Common tools: WordNetLemmatizer for lemmatization, and the Porter stemmer for stemming, a cruder rule-based way of trimming words down to a root.

Sentence Tagging

Sentence tagging is like the part of speech exercises your grammar teacher made you do in high school. Here’s an illustration of that:

Sections Coming soon…

To be notified, sign up here: subscribe@uniqtech.co

Information Retrieval Basics : Term Frequency Inverse Document Frequency TFIDF

Shameless self plug below, please support us :)

Like what you read so far? Join our $5/month membership to get in-depth Silicon Valley job intelligence, beginner friendly tutorials, training courses for a tech career in Silicon Valley. subscribe@uniqtech.co

Our members only blog includes searchable in-depth analysis of Silicon Valley job postings such as Product Manager, Machine Learning Engineer. Information on tech interviews, technical interviews for bootcamp graduates. Tips and tricks to pass phone interviews. Our tutorials aim to be fast and beginner friendly. Check out our Medium article and Youtube video on Softmax — a function frequently used in Deep Learning, Artificial Intelligence and Machine Learning.

NLP Use Cases

Sentiment analysis of tweets, amazon reviews. Classifying whether a short text is positive or negative.
Writing style analysis: authors’ favorite vocabulary choices, singers’ lyric styles. For example, style analysis identified JK Rowling as the author of a book even though she used a male pen name, after passionate readers analyzed the text and found parallels and similarities in style.
Entity tagging: find organizations or people’s names in articles
Text summarization: summarize main points of news articles

Getting Started with NLP Now!

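You can use the Python nltk library to analyze texts. It’s a popular and powerful library. It includes lists of stop words in several languages.

from nltk.corpus import stopwords
# requires a one-time nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# tokens is a list of word strings, for example the output of a tokenizer
clean_tokens = [token for token in tokens if token not in stop_words]

#important pattern for removing stop words iteratively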

#source: Towards Data Science  Emma Grimaldi How Machines understand our language

Sklearn conveniently has a built-in text dataset for you to experiment with! These news articles can be classified into different topics. Sklearn provides cleaned training data for this classification task.

Link here

Glossary

SOS start of sentence
EOS end of sentence
padding usually 0
word2index
index2word
word2count
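The glossary terms usually show up together in a small vocabulary helper for sequence models. The sketch below is one hypothetical way to wire them up: padding is 0 as in the glossary, while the ids chosen for SOS and EOS, the `<pad>`/`<sos>`/`<eos>` token strings, and the `Vocabulary` class itself are assumptions for illustration.

```python
PAD, SOS, EOS = 0, 1, 2  # padding is 0 as in the glossary; the SOS/EOS ids are an assumption


class Vocabulary:
    def __init__(self):
        self.word2index = {"<pad>": PAD, "<sos>": SOS, "<eos>": EOS}
        self.index2word = {PAD: "<pad>", SOS: "<sos>", EOS: "<eos>"}
        self.word2count = {}

    def add_sentence(self, sentence):
        """Register every word: assign an index on first sight, and count occurrences."""
        for word in sentence.lower().split():
            if word not in self.word2index:
                index = len(self.word2index)
                self.word2index[word] = index
                self.index2word[index] = word
            self.word2count[word] = self.word2count.get(word, 0) + 1

    def encode(self, sentence, length):
        """Wrap the sentence in SOS/EOS, then pad with 0 up to a fixed length."""
        ids = [SOS] + [self.word2index[w] for w in sentence.lower().split()] + [EOS]
        return ids + [PAD] * (length - len(ids))


vocab = Vocabulary()
vocab.add_sentence("learn more about machine learning")
vocab.add_sentence("learn more today")

print(vocab.word2count["learn"])            # 2
print(vocab.encode("learn more today", 8))  # [1, 3, 4, 8, 2, 0, 0, 0]
```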
