
Saturday, March 24, 2018

tf–idf term frequency inverse document frequency Natural Language Processing Python Sklearn

TF-IDF models how important a keyword is within a document, and also in the context of a collection of documents and texts known as a corpus. TF-IDF is a key algorithm used in information retrieval. The importance score is proportional to how frequently the keyword appears in the document (and can be normalized by the length of the document), and then comes the inverse part: the score is offset by how frequently the word appears in other documents in the corpus. This way, words that naturally appear often across the whole collection, such as "economics" or "economy" in an Economist magazine archive, get discounted, while topic-specific words, rarely appearing words, and specialized vocabulary get highlighted.

Note that there may be frequently appearing words that are stop words, like "the", "a", "and", "however". Best-practice text preprocessing often already includes removing stop words. It is also common to transform all text into lower case using .lower().
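For example, here is a minimal preprocessing sketch (the sentence and the choice of sklearn's built-in English stop word list are illustrative, not from the original post):

# Lowercase the text, then drop common English stop words.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

text = "The Economist publishes articles about the economy and economics."
tokens = text.lower().split()
filtered = [t.strip(".,") for t in tokens if t.strip(".,") not in ENGLISH_STOP_WORDS]
print(filtered)  # ['economist', 'publishes', 'articles', 'economy', 'economics']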

A bit of a tangent: sometimes lowercasing removes nuances of meaning, as with a word like Anthropology. Capitalized, Anthropology could mean the subject at college or the brand, while lower-case anthropology could simply mean the study of human societies, cultures, and development. Even if we use stemming, we may never know whether the author of a social media post is actually referring to a proper noun and a brand such as Anthropology or Fossil.

tf-idf is a popular term weighting scheme. Think Google Search, SEO, ranking of search results, NYTimes article text summarization. One can definitely develop fancier algorithms on top of this elegant and powerful concept.

Term frequency (TF): It's intuitive. The more often a word appears in a document, the more likely it is part of the document's main topic. Caveat 1: keyword spamming. Caveat 2: what if document_1 is much longer than document_2? You can normalize the term frequency by document length.
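As a rough sketch of why normalization matters (the two documents below are made up), the same raw count means much less in a longer document:

# Raw vs. length-normalized term frequency on two toy documents.
from collections import Counter

short_doc = "trade trade trade economy growth jobs".split()
long_doc = ("filler " * 294 + "trade trade trade economy growth jobs").split()

for doc in (short_doc, long_doc):
    counts = Counter(doc)
    raw_tf = counts["trade"]              # 3 in both documents
    norm_tf = counts["trade"] / len(doc)  # 0.5 vs. 0.01 once we divide by length
    print(raw_tf, round(norm_tf, 3))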

Inverse document frequency (IDF): Stop words like "the", "and", "a" appear very frequently in English text, so regardless of whether they are useful in determining the actual meaning of the document, they will score high in term frequency. Remember our Economist magazine? The word "Economist" may appear in the margin of every page spread. It doesn't help us distinguish article_1 from article_2, so we may have to discount it.

How to calculate TF-IDF by hand?

See the worked example on the Wikipedia page: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Note the very interesting case where "the" appears in every document, so inverse document frequency = log(number of docs in the corpus / number of docs containing the word "the") = log(2/2) = log(1), which is 0! So this stop word does not matter at all in our text analysis task.
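Here is a by-hand computation of that case in Python (the two tiny documents are made up to mirror the example):

import math

docs = [
    "the economy is growing".split(),
    "the markets are volatile".split(),
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

for term in ("the", "economy"):
    for i, doc in enumerate(docs, 1):
        print(term, "doc", i, round(tf(term, doc) * idf(term, docs), 3))

# idf("the") = log(2/2) = 0, so "the" scores 0 in every document, while
# "economy" appears only in doc 1, so idf("economy") = log(2/1) > 0 and it gets weight there.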

Natural Language Processing (NLP) in general and with sklearn:
Tokenization: breaking sentences into words, and often taking counts of those words; see sklearn's CountVectorizer().
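A quick sketch of counting tokens with CountVectorizer (the sentences are invented; get_feature_names_out assumes a recent scikit-learn, older versions used get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["The economy is growing.", "The economy of growing cities."]
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
counts = vectorizer.fit_transform(corpus)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(counts.toarray())                     # per-document word counts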

Here's a nice tutorial on how to tokenize, stem, and remove stop words using NLTK, a popular Python natural language processing library.
https://www2.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html

It also shows how to marry tokenization and stemming with sklearn's tf-idf (term frequency inverse document frequency) TfidfVectorizer.
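In that spirit, here is a sketch of plugging an NLTK stemming tokenizer into TfidfVectorizer (the tokenizer and sentences are illustrative; word_tokenize needs NLTK's "punkt" data downloaded first):

from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def tokenize_and_stem(text):
    # run nltk.download("punkt") once beforehand for word_tokenize
    return [stemmer.stem(token) for token in word_tokenize(text.lower()) if token.isalpha()]

corpus = ["The economist writes about economics.",
          "Economies grow when trade grows."]

tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem)
weights = tfidf.fit_transform(corpus)   # tf-idf weighted document-term matrix
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))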
