
Sunday, January 26, 2020

Handling Imbalanced Datasets in Machine Learning


  • An imbalanced dataset can distort machine learning decision boundaries.
  • The distribution of the target class matters!
  • Why an imbalanced class is bad - source 1
  • Sometimes an imbalanced dataset is inherent to the problem, as in fraud detection, where finding the rare anomalies is exactly the point.
  • "Imbalanced classes put accuracy out of business." It is important not to choose accuracy as the metric, because the model can cheat by always guessing the majority class and still achieve high accuracy. The high accuracy in this case is an illusion. Instead, use the confusion matrix and the trade-off between precision and recall.
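To see the accuracy illusion concretely, here is a small hypothetical example (the 95/5 class split is made up for illustration):

```python
# Hypothetical dataset: 95 negatives, 5 positives (e.g., fraud cases).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "cheating" model that always predicts the majority class

# Accuracy looks great...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# ...but recall on the minority class is zero.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The model catches zero fraud cases yet reports 95% accuracy, which is why precision, recall, and the confusion matrix are the metrics to watch here.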

Related Concepts:
  • Imbalanced datasets
  • Confusion matrix
  • Resampling
  • Random under-sampling
  • Random over-sampling
  • Python imbalanced-learn module
  • Random under-sampling and over-sampling with imbalanced-learn
  • Under-sampling: Tomek links
  • Under-sampling: Cluster Centroids
  • Over-sampling: SMOTE
Source 4
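As a rough sketch of the random over-sampling idea above (the imbalanced-learn module provides this as RandomOverSampler; the function below is a hand-rolled stand-in for illustration only):

```python
import random

def random_over_sample(X, y, minority_label, seed=0):
    """Duplicate minority-class samples at random until classes are balanced."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    # Add random duplicates of minority samples to match the majority count.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    data = majority + minority + extra
    rng.shuffle(data)
    Xr, yr = zip(*data)
    return list(Xr), list(yr)

X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
Xr, yr = random_over_sample(X, y, minority_label=1)
print(yr.count(0), yr.count(1))  # 4 4
```

Random under-sampling is the mirror image: drop majority samples instead of duplicating minority ones, at the cost of discarding data.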


Wednesday, January 22, 2020

Applying to Graduate School: Master's Degree and PhD

After transitioning to machine learning, I had to learn the hard way that it is now very competitive and difficult to get into graduate school. Engineering departments are crammed. Here are some hard lessons I have learned. I am not an admissions officer and I don't know what works best, but I know what didn't work.

  • Not having strong recommendation letters
    • Without strong recommendation letters, nothing works. It's important to build that network over the long run: peers and supervisors in work or academic settings are especially important
  • Not having a strong personal statement
    • I made the mistake of focusing on my technical skills. There are always people who are better than me. I should have had a narrative, a bigger picture.
    • Even when writing papers, you are required to explain why the research project matters to you personally.

Know your AWS SageMaker by Amazon Web Services

How AWS Describes SageMaker:

"Amazon SageMaker provides a fully managed service for data science and machine learning workflows. One of the most important capabilities of Amazon SageMaker is its ability to run fully managed training jobs to train machine learning models." Source 1

The Estimator Object

S3 Storage

AWS SageMaker instance types

Note that AWS SageMaker instance types are now separate from EC2 instance types, and availability can differ by region. SageMaker has accelerated computing options, more commonly known as GPU instances, such as ml.p2.xlarge.

See the full list of AWS SageMaker instance types here. Source 2

There's a comprehensive table of instance type, vCPU count, GPU, memory (GiB), GPU memory (GiB), and a simple description of network performance.

Optimization: Bring your data to AWS
Previously, all training files had to be stored in S3; now you can use Amazon's distributed file systems.
"Training machine learning models requires providing the training datasets to the training job. Until now, when using Amazon S3 as the training datasource in File input mode, all training data had to be downloaded from Amazon S3 to the EBS volumes attached to the training instances at the start of the training job. A distributed file system such as Amazon FSx for Lustre or EFS can speed up machine learning training by eliminating the need for this download step."
Amazon FSx for Lustre or Amazon Elastic File System (EFS). Source 1


Can train as well as deploy models

Fully managed and serverless. You can train as well as deploy a model. Note that you can choose the instance type for each step, training or deployment, and each can use the same or a different compute instance type.

Get SageMaker AWS AI Certified

The certification exam is not easy and requires study as well as years of hands-on experience. But SageMaker is a relatively new technology, so maybe the standard isn't too high yet. Source 3.

Sources:
2. https://aws.amazon.com/sagemaker/pricing/instance-types/
3. https://aws.amazon.com/certification/certified-machine-learning-specialty/

Regularization in Machine Learning, Deep Learning

Regularization can prevent overfitting and potentially make an algorithm converge faster and perform better. Useful in deep learning tasks, in...
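For example, L2 regularization adds a penalty on the squared weights, shrinking them toward zero during training. A toy sketch (my own illustration for 1-D linear regression, not taken from any particular framework):

```python
def fit_ridge(xs, ys, lam, lr=0.1, steps=500):
    """Fit y ~ w*x by gradient descent with an L2 penalty lam * w**2."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean squared error plus the L2 penalty term 2*lam*w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * w
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # true slope is 2

w_plain = fit_ridge(xs, ys, lam=0.0)  # converges to ~2.0
w_reg = fit_ridge(xs, ys, lam=1.0)    # penalty shrinks the weight below 2.0
print(w_plain, w_reg)
```

The regularized weight ends up smaller than the unregularized one, trading a little training-set fit for a simpler model that is less prone to overfitting.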