Sunday, February 23, 2020

SpaCy for Natural Language Processing (NLP)

Documents are tokenized to sentences, then to words. Additional or readily available features can be made from these documents to work a task. One SpaCy task could be identify whether a tweet is positive or negative and among its texts, is there a specific product that is mentioned.


Install SpaCy with Python like any other python package using pip. The easiest way to install and configure is to use Tensorflow Colab. 

pip install and then import SpaCy. It works on Tensorflow Colab too. Perhaps the fastest way to get started.

Step 1. Need to import a language model before proceeding. SpaCy supports many language models.

Step 2. Load the English model:


Supports other models too, include en_core_web_sm

Update the model (optional)

Step 3. Init the model, wrap it in an nlp object
doc = nlp(u"document sentence here")

print out items from the spacy nlp model
- print out tokens
# Iterate over tokens in a Doc
for token in doc:

- print out entities
for ent in doc.ents:
print(ent.text, ent.start_char, ent.end_char, ent.label_)

this will print out the entity as well as the beginning and ending index and its label

Can also query doc using slicing example [1:4]
Can access .text attribute fo the token

Additional functions



Disable pipeline for custom training
nlp.disable_pTraining src 13

source -13 :
Why disable the pipeline? Say if you are using SpaCy for just one task such as NER, you can disable the pipeline to avoid some of the tasks.
Check the list of pipeline labels nlp.pipe_names()

Save the trained model

Bloom embedding, a type of optimized word embedding
1D CNN 1D convolutional neural network

Not a part of Spacy is the entry level tool kit to NLP. It can do basic part-of-speech tagging. But does not have advanced functionality like spacy

Spacy Deep Learning

Training src 13
source -13 :

SpaCy word embedding

To print out the word embedding 
print out vector, access word embedding use .vector method

SpaCy for Biomedical Research

scispaCy, a python package that provides SpaCy models for biomedical, clinical texts and scientific literature. Pre-processing. 

Why Natural Language Processing is hard?

Limitation of pre-trained models

" A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text." - Spacy documentation

1 comment:

  1. Thanks for sharing this article
    Leanpitch provides online training in Scrum master during this lockdown period everyone can use it wisely.
    Scrum Master Interview Questions


Algolia Search API Basics Tutorial

I write full time now for write me to say hi, request content or be notified of new tutorials like this. Unqitech writes abou...