Understand the Softmax Function in Minutes
In deep learning, the term logits layer is popularly used for the last neuron layer of a neural network for a classification task, which produces raw prediction values as real numbers ranging over (-infinity, +infinity). — Wikipedia
Logits are the raw scores output by the last layer of a neural network, before any activation takes place.
At its core, Softmax is e raised to some power, divided by a sum of such terms: softmax(y_i) = exp(y_i) / sum_j exp(y_j), where y_i refers to each element in the logits vector y. Python and NumPy code will be used in this article to demonstrate the math operations. Let's see it in code:
import numpy as np

logits = [2.0, 1.0, 0.1]
exps = [np.exp(i) for i in logits]
We use numpy.exp(power) to take the special number e to any power we want. We use a Python list comprehension to iterate through each element i of the logits and compute np.exp(i). (Logit is just another name for a numeric score.) The result is stored in a list called exps; the variable name is short for exponentials. Spelling out logit is another, more verbose way to write the same thing: exps = [np.exp(logit) for logit in logits]. Note the use of plural and singular nouns. It's intentional.
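For reference, here is roughly what the intermediate exps list holds for the example logits above (a quick check, values rounded to three decimals; the rounding line is mine, not part of the original walkthrough):

import numpy as np

logits = [2.0, 1.0, 0.1]
exps = [np.exp(i) for i in logits]
print([round(float(e), 3) for e in exps])
# [7.389, 2.718, 1.105] -- that is e**2.0, e**1.0, e**0.1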
Each logit has now been transformed into a power of e. Each transformed logit j needs to be normalized by another number so that all of the final outputs, which are probabilities, sum to one. Again, it is this normalization that gives us nice probabilities summing to one! We first compute that normalizer, sum_of_exps, which we will use to normalize each of the transformed logits.
sum_of_exps = sum(exps)
Each transformed logit j needs to be normalized by sum_of_exps, which is the sum of all the transformed logits, including j itself.
softmax = [j/sum_of_exps for j in exps]
Again we use a list comprehension: we iterate through each of the transformed logits, [j for j in exps], and divide each j by sum_of_exps.
>>> softmax
[0.6590011388859679, 0.2424329707047139, 0.09856589040931818]
>>> sum(softmax)
1.0
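Putting the whole computation together as one small helper function (a quick sketch; the function name softmax_list and the final print are mine, not from the original walkthrough):

import numpy as np

def softmax_list(logits):
    # exponentiate each logit, then normalize by the sum of the exponentials
    exps = [np.exp(i) for i in logits]
    sum_of_exps = sum(exps)
    return [j / sum_of_exps for j in exps]

print(softmax_list([2.0, 1.0, 0.1]))
# approximately [0.659, 0.242, 0.099], and the values sum to 1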
Extra — Understanding List Comprehension
You can follow along in an interactive console by typing python after the dollar sign ($) at your terminal prompt.
sample_list = [1,2,3,4,5] # console returns None
sample_list # console returns [1,2,3,4,5]
# print the sample list using list comprehension
[i for i in sample_list]
# console returns [1,2,3,4,5]
# note anything before the keyword 'for' will be evaluated
# in this case we just display 'i', each item in the list, as is
# 'for i in sample_list' is a shorthand for the
# Python for loop used in list comprehension
[i+1 for i in sample_list]
# returns [2,3,4,5,6]
# can you guess what the above code does?
# yes, 1) it will iterate through each element of the sample_list
# that is the second half of the list comprehension
# we are reading the second half first
# what do we do with each item in the list?
# 2) we add one to it and then display the value
# 1 becomes 2, 2 becomes 3, and so on
# note the entire expression, 1st half & 2nd half, is wrapped in []
# so the final return type of this expression is also a list
# hence the name list comprehension
# my tip to understand list comprehension is:
# read the 2nd half of the expression first
# understand what kind of list we are iterating through
# and what the individual item, aka 'each', is
# then read the 1st half
# what do we do with each item?
# can you guess the list comprehension for
# squaring each item in the list?
[i*i for i in sample_list]
# returns [1, 4, 9, 16, 25]
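If list comprehensions are still new to you, here is the same squaring example written as a plain for loop (an equivalent sketch for comparison; the variable name squares is mine):

sample_list = [1, 2, 3, 4, 5]

squares = []
for i in sample_list:        # the 2nd half: the list we iterate through
    squares.append(i * i)    # the 1st half: what we do with each item

print(squares)
# [1, 4, 9, 16, 25] -- the same result as [i*i for i in sample_list]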
Intuition and Behaviors of Softmax Function
For example, in a cat/dog/bird classification task the ground truth labels are one hot encoded:

[[1,0,0],  #cat
 [0,1,0],  #dog
 [0,0,1]]  #bird
1 star yelp review, 2 stars, 3 stars, 4 stars, 5 stars can also be one hot encoded, but note that the five are related: they may be better encoded as 1 2 3 4 5, since we can infer that 4 stars is twice as good as 2 stars. Can we say the same about names of dogs? Ginger, Mochi, Sushi, Bacon, Max: is Bacon 2x better than Mochi? There is no such relationship. In the particular encoding above, the first column represents cat, the second column dog, and the third column bird.
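If you want to generate one hot vectors like the cat/dog/bird matrix above in code, one common approach is the NumPy identity matrix (a quick sketch; the labels list is just for illustration):

import numpy as np

labels = ['cat', 'dog', 'bird']
one_hot = np.eye(len(labels))
# row 0 encodes cat, row 1 dog, row 2 bird:
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
print(one_hot)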
A Softmax output is a probability distribution, for example [0.7, 0.2, 0.1]. Can we compare this with the ground truth of cat, [1,0,0], as in one hot encoding? Yes! That is what is commonly done in cross entropy loss (we have a cool trick for understanding cross entropy loss and will write a tutorial about it soon). In fact, cross entropy loss is the "best friend" of Softmax. It is the most commonly used cost function, aka loss function, aka criterion, used with Softmax in classification problems. More on that in a different article.
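As a small preview of that cross entropy article, here is a minimal sketch of comparing a Softmax output against a one hot ground truth; the formula is the standard cross entropy -sum(truth * log(prediction)), and the helper name cross_entropy is mine:

import numpy as np

def cross_entropy(truth, prediction):
    # -sum over classes of truth * log(predicted probability)
    return -sum(t * np.log(p) for t, p in zip(truth, prediction))

truth = [1, 0, 0]              # one hot ground truth: cat
prediction = [0.7, 0.2, 0.1]   # Softmax output
print(cross_entropy(truth, prediction))
# approximately 0.357, i.e. -log(0.7)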
Watch this Softmax tutorial on YouTube.
Deeper Dive into Softmax
Say the expected output is [0.7, 0.2, 0.1], but you predicted [0.3, 0.3, 0.4] on the first try and [0.6, 0.2, 0.2] on the second try. You can expect the cross entropy loss of the first try, which is totally inaccurate, almost like a random guess, to be higher than that of the second scenario, where you aren't too far off from the expected output.
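Here is a quick numerical sketch of that comparison, reusing the same cross entropy formula as above (the loss values in the comments are approximate):

import numpy as np

def cross_entropy(truth, prediction):
    return -sum(t * np.log(p) for t, p in zip(truth, prediction))

expected = [0.7, 0.2, 0.1]
first_try = [0.3, 0.3, 0.4]
second_try = [0.6, 0.2, 0.2]

print(cross_entropy(expected, first_try))   # approximately 1.18 -- higher loss
print(cross_entropy(expected, second_try))  # approximately 0.84 -- lower loss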
Deep Dive: Properties of the Softmax Output
Each element of the Softmax output lies in the range [0,1]: the numbers are zero or positive, and the entire output vector sums to 1. That is to say, when all probabilities are accounted for, they total 100%.
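A quick sketch to illustrate these properties: even with negative logits, every Softmax output lands between 0 and 1 and the vector still sums to 1 (the example logit values are mine):

import numpy as np

logits = [-3.0, 0.0, 5.0]            # any real numbers, including negatives
exps = [np.exp(i) for i in logits]
softmax = [j / sum(exps) for j in exps]

print(softmax)       # approximately [0.0003, 0.0067, 0.9930] -- all in [0, 1]
print(sum(softmax))  # 1.0 (up to floating point rounding)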
- May 11 2019, in progress: a deep dive on Softmax source code; Softmax Beyond the Basics (post under construction): implementation of Softmax in PyTorch and TensorFlow, Softmax in practice and in production.
- Coming soon: a discussion on graphing Softmax function
- Coming soon: a discussion on cross entropy evaluation of Softmax
- In progress, May 11 2019: Softmax Beyond the Basics, Softmax in textbooks and university lecture slides.
- Coming soon: cross entropy loss tutorial
- April 16 2019 added explanation for one hot encoding.
- April 12 2019 added additional wording explaining the outputs of Softmax function: a probability distribution of potential outcomes. In other words, a vector or a list of probabilities associated with each outcome. The higher the probability the more likely the outcome. The highest probability wins — used to classify the final result.
- April 4 2019 updated word choices, advanced use of Softmax in Bahdanau attention, assumptions, clarifications, 1800 claps. Logits are useful too.
- Jan 2019 best loss function / cost function / criterion to go with Softmax.