Sunday, February 23, 2020

SpaCy for Natural Language Processing (NLP)

Documents are tokenized into sentences, and sentences into words. From these tokens, SpaCy exposes ready-made features, and you can derive additional ones to support a task. For example, one SpaCy task could be to identify whether a tweet is positive or negative and whether its text mentions a specific product.
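As a quick illustration of the sentence-then-word split, here is a minimal sketch. It assumes SpaCy 3.x is installed and uses a blank English pipeline with the rule-based sentencizer, so no trained model is needed. The example text is made up.

```python
import spacy

# Blank English pipeline: tokenizer only, nothing to download.
nlp = spacy.blank("en")
# Rule-based sentence splitter (adding a component by name is SpaCy 3.x style).
nlp.add_pipe("sentencizer")

doc = nlp("I love my new phone. The battery life is terrible though.")

# Document -> sentences -> words
sentences = [[token.text for token in sent] for sent in doc.sents]
for words in sentences:
    print(words)
```

Each inner list holds the tokens of one sentence, punctuation included.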

Installation

Install SpaCy with pip, like any other Python package, and then import it:

pip install spacy
import spacy

SpaCy also works on Google Colab, which is perhaps the fastest way to get started.

Step 1. Get a language model before proceeding. SpaCy supports models for many languages.

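Step 2. Load the English model:

nlp = spacy.load('en_core_web_sm')  # trained English model; the 'en' shortcut only works in SpaCy 2.x
nlp = spacy.blank('en')             # alternative: an empty English pipeline with only a tokenizer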

Other trained models are supported too, for example en_core_web_sm (small), en_core_web_md (medium) and en_core_web_lg (large).
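A small sketch of Step 2 that runs whether or not a trained model has been downloaded. It assumes SpaCy is installed, tries en_core_web_sm first, and falls back to a blank English pipeline. The fallback is my own convenience for this post, not something SpaCy does for you.

```python
import spacy

try:
    # Trained pipeline; needs: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # Model package not installed: fall back to a tokenizer-only pipeline.
    nlp = spacy.blank("en")

print(nlp.lang, nlp.pipe_names)
```

With the blank pipeline the list of component names is empty, because only the tokenizer is present.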

Update (further train) the model on your own examples (optional)
nlp.update
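To make the update step concrete, here is a minimal sketch that trains a tiny text classifier from scratch. It assumes SpaCy 3.x (the Example class and nlp.initialize are 3.x APIs), and the two training sentences and their labels are made up purely for illustration. A real model would need far more data.

```python
import random

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("textcat")  # text classifier; labels are inferred from the examples

# Tiny, made-up training set for illustration only.
train_data = [
    ("I love this phone", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("This phone is awful", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]

# Initialize the weights, then run a few update steps.
optimizer = nlp.initialize(lambda: examples)
losses = {}
for _ in range(5):
    random.shuffle(examples)
    nlp.update(examples, sgd=optimizer, losses=losses)

print(losses)
print(nlp("I love it").cats)
```

The printed scores are not meaningful with two examples; the point is only the shape of the training loop.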

Step 3. Run the loaded model on a text. Calling the nlp object returns a processed Doc:
doc = nlp("document sentence here")

Print out items from the processed Doc:
- Print out tokens
# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

- Print out entities
# Iterate over named entities in a Doc
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

This prints each entity's text, its start and end character offsets, and its label.
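Here is a runnable sketch of both loops. Note the swap: a trained model would find entities statistically, but to keep this self-contained I use a blank pipeline plus SpaCy 3.x's rule-based entity_ruler with a single hand-written pattern. The sentence and the PRODUCT pattern are made up.

```python
import spacy

nlp = spacy.blank("en")
# Rule-based stand-in for a trained NER component (an assumption for this demo).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "iPhone"}])

doc = nlp("My iPhone stopped charging yesterday.")

tokens = [token.text for token in doc]
entities = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

print(tokens)
print(entities)
```

The entity tuple carries the character offsets into the original string, so doc.text[3:9] gives back "iPhone".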

A Doc can also be sliced, for example doc[1:4], which returns a Span covering tokens 1 to 3.
Each token's text is available through its .text attribute.
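A short sketch of slicing and the .text attribute, assuming SpaCy is installed. A blank pipeline is enough here because only the tokenizer is involved.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("SpaCy makes natural language processing easier")

span = doc[1:4]      # a Span: tokens at positions 1, 2 and 3
first = doc[0]       # a single Token

print(span.text)
print(first.text)
```

Like Python lists, token indices start at 0 and the end of the slice is exclusive.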


Additional functions

Pipeline

print(nlp.pipeline)

Disable pipeline components for custom training
nlp.disable_pipes (renamed nlp.select_pipes in SpaCy 3)

Source: https://spacy.io/usage/training
Why disable pipeline components? If you are using SpaCy for just one task, such as NER, you can disable the other components so they are not run or updated.
Check the list of pipeline component names with nlp.pipe_names (an attribute, not a method).
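A minimal sketch of temporarily disabling a component, assuming SpaCy 3.x, where the context manager is called select_pipes. The two components here are just convenient ones that work without a trained model.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

before = list(nlp.pipe_names)

# Inside the block the sentencizer is switched off; it comes back afterwards.
with nlp.select_pipes(disable=["sentencizer"]):
    active = list(nlp.pipe_names)

restored = list(nlp.pipe_names)

print(before, active, restored)
```

Training code would go inside the with block, so only the components left active are touched.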


Save the trained model to a directory
nlp.to_disk('/path/to/model')
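A sketch of a save and reload round trip, assuming SpaCy 3.x. The directory name my_model is hypothetical, and a temporary folder is used so nothing is left behind.

```python
import tempfile
from pathlib import Path

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "my_model"   # hypothetical directory name
    nlp.to_disk(model_dir)               # writes config, vocab and components
    reloaded = spacy.load(model_dir)     # load the pipeline back from disk

print(reloaded.pipe_names)
```

spacy.load accepts a path to a saved directory as well as the name of an installed model package.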


NER
SpaCy's NER model builds on:
- Bloom embeddings, a memory-efficient type of word embedding that hashes words into a fixed-size table
- a 1D convolutional neural network (1D CNN)
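To give a feel for the Bloom embedding idea, here is a plain-Python sketch. It is not SpaCy's implementation: the table size, the number of hashes and the use of MD5 are all my own assumptions. The idea is that each word is hashed several times into a small shared table and the selected rows are summed, so the table stays the same size however many words there are.

```python
import hashlib
import random


def bloom_embed(word, table, num_hashes=4):
    """Sum the table rows that several seeded hashes of `word` point to."""
    num_rows = len(table)
    vector = [0.0] * len(table[0])
    for seed in range(num_hashes):
        digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).hexdigest()
        row = int(digest, 16) % num_rows
        vector = [v + t for v, t in zip(vector, table[row])]
    return vector


# A small random table: 50 rows of width 8 (in a real model these are learned).
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(50)]

vec_a = bloom_embed("apple", table)
vec_b = bloom_embed("apple", table)

print(len(vec_a), vec_a == vec_b)
```

Two different words may collide on one row, but they are unlikely to collide on all of their rows, which is why several hashes are used.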


NLTK
NLTK is a separate library, not part of SpaCy. It is an entry-level toolkit for NLP that can do basic part-of-speech tagging, but it does not have the more advanced functionality that SpaCy offers.

SpaCy Deep Learning

Training source: https://spacy.io/usage/training

SpaCy word embedding

To print out a word embedding, access the .vector attribute of a token (a Doc or Span has one too) and print it. Note that .vector is an attribute, not a method.
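A small sketch of reading .vector, assuming SpaCy and NumPy are installed. Real vectors come with models such as en_core_web_md. To stay self-contained, this example sets a made-up 3-dimensional vector for one word by hand on a blank pipeline.

```python
import numpy
import spacy

nlp = spacy.blank("en")
# Made-up toy vector; trained models ship real ones with hundreds of dimensions.
nlp.vocab.set_vector("apple", numpy.asarray([1.0, 0.0, 0.5], dtype="float32"))

doc = nlp("apple")
token = doc[0]

print(token.has_vector)
print(token.vector)
```

For a word without a vector, has_vector is False, which is worth checking before using the values.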

SpaCy for Biomedical Research

scispaCy is a Python package that provides SpaCy models for pre-processing biomedical and clinical texts and scientific literature.
Source: https://allenai.github.io/scispacy/

Why is Natural Language Processing hard?

Limitation of pre-trained models

"A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text." - SpaCy documentation
