Note that there may be frequently appearing words that are stop words, like the, a, and, however. Best practice for preprocessing text data usually includes removing stop words. It is also common to transform all text into lower case using .lower().
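Here's a minimal sketch of that kind of preprocessing, assuming nltk's English stop word list is available (the example sentence is made up):

```python
# Lowercase the text, then drop stop words.
# Assumes nltk is installed and the stop word list has been
# downloaded once via nltk.download('stopwords').
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

text = "The quick brown fox jumps over the lazy dog"
tokens = [w for w in text.lower().split() if w not in stop_words]
print(tokens)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```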
A bit of a tangent: sometimes lower-casing strips away nuances of meaning. Capitalized, Anthropology could mean the subject at college or the brand, while lower case anthropology could just mean the study of human societies, cultures, and development. Even if we use stemming, we may never know whether the author of a social media post was actually referring to a proper noun and a brand, Anthropology or Fossil.
tf-idf is a popular term weighting scheme. Think Google Search, SEO, ranking of search results, NYTimes article text summarization. One can definitely develop fancier algorithms on top of this elegant and powerful concept.
Term frequency (TF): It's intuitive. The more often a word appears in a document, the more likely it's part of the document's main topic. Caveat 1: keyword spamming. Caveat 2: what if document_1 is much longer than document_2? You can normalize the term frequency by document length.
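In code, a length-normalized term frequency might look like this quick sketch (the toy document is made up for illustration):

```python
# Length-normalized term frequency: raw count divided by the
# total number of tokens in the document.
from collections import Counter

def term_frequency(doc_tokens):
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}

doc = "this is a sample this is another example".split()
print(term_frequency(doc))
# {'this': 0.25, 'is': 0.25, 'a': 0.125, 'sample': 0.125,
#  'another': 0.125, 'example': 0.125}
```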
Inverse document frequency (IDF): Stop words like the, and, a appear very frequently in English texts, so regardless of whether they are useful in determining the actual meaning of a document, they will score high in term frequency. Remember our Economist magazine example? The word "Economist" may appear in the margin of every page spread. That doesn't help us distinguish article_1 from article_2, so we may have to discount it.
How to calculate TF-IDF by hand?
See the worked example on the Wikipedia page: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
Note the very interesting case where "the" appears in every document: its inverse document frequency = log(number of docs in the corpus divided by number of docs containing the word "the") = log(2/2) = log(1) = 0! So this stop word does not matter at all in our text analysis task.
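To make the arithmetic concrete, here is a by-hand sketch over a tiny made-up two-document corpus (not the exact Wikipedia example), using idf = log(N / df):

```python
# tf-idf by hand. "the" appears in both documents, so its
# idf = log(2/2) = 0 and its tf-idf score vanishes.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
]
N = len(docs)

def idf(term):
    df = sum(1 for d in docs if term in d)   # number of docs containing term
    return math.log(N / df)

def tfidf(term, doc):
    tf = Counter(doc)[term] / len(doc)       # length-normalized tf
    return tf * idf(term)

print(tfidf("the", docs[0]))  # 0.0 -- appears in every document
print(tfidf("mat", docs[0]))  # ~0.116 -- distinguishes document 1
```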
Natural Language Processing (NLP) in general and with sklearn:
Tokenization: breaking sentences into words, often followed by taking counts of the words. In sklearn this is sklearn.feature_extraction.text.CountVectorizer().
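A minimal usage sketch (the two toy sentences are placeholders; get_feature_names_out assumes a reasonably recent sklearn version):

```python
# Tokenize and count words across a small corpus.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # learned vocabulary
print(counts.toarray())                      # one row of counts per document
```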
Here's a nice tutorial series on how to tokenize, stem, and remove stop words using nltk, a popular Python natural language processing library.
https://www2.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
It also shows how to marry tokenization and stemming with sklearn's tf-idf (term frequency-inverse document frequency) TfidfVectorizer.
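Borrowing that idea, here's one way the combination might look (a sketch rather than the tutorial's exact code; assumes nltk's punkt tokenizer data has been downloaded via nltk.download('punkt')):

```python
# Plug an nltk tokenize-and-stem function into sklearn's
# TfidfVectorizer through its tokenizer parameter.
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stemming_tokenizer(text):
    return [stemmer.stem(token) for token in word_tokenize(text)]

corpus = ["the cats sat on the mats", "the dogs chased the cats"]
vectorizer = TfidfVectorizer(
    tokenizer=stemming_tokenizer,
    token_pattern=None,   # unused when a custom tokenizer is supplied
)
tfidf_matrix = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())    # stemmed vocabulary
print(tfidf_matrix.toarray())                # tf-idf weights per document
```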