SpaCy for Natural Language Processing (NLP)

Sunday, February 23, 2020

A document is tokenized into sentences and then into words. Features can be derived from those tokens, or taken from the ones spaCy already provides, to solve a task. For example, one task could be to decide whether a tweet is positive or negative and whether its text mentions a specific product.

Installation

Install spaCy like any other Python package, using pip (pip install spacy), and then import it. It also works on Google Colab, which is perhaps the fastest way to get started.

Step 1. Download a language model before proceeding, for example with python -m spacy download en_core_web_sm. spaCy provides models for many languages.

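Step 2. Load the English model:

nlp = spacy.load('en_core_web_sm')

In spaCy 2.x the shortcut spacy.load('en') also worked. spaCy 3 no longer supports shortcut names, so use the full package name. To start from an empty English pipeline (tokenizer only) instead:

nlp = spacy.blank('en')

Other model packages are loaded the same way, by passing their package name to spacy.load.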
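A minimal sketch of the two loading paths above, assuming spaCy 3.x is installed. The helper name load_english and the fallback to a blank pipeline are assumptions added for illustration, so the snippet still runs when en_core_web_sm has not been downloaded.

```python
import spacy


def load_english(name="en_core_web_sm"):
    """Load a pretrained English pipeline, or fall back to a blank one.

    Hypothetical helper for illustration; not part of spaCy.
    """
    try:
        # Requires: python -m spacy download en_core_web_sm
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package is not installed.
        # A blank pipeline has a tokenizer but no tagger, parser or NER.
        return spacy.blank("en")


nlp = load_english()
print(nlp.lang, nlp.pipe_names)
```

With the pretrained package installed, pipe_names lists its components; with the blank fallback the list is empty.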

Update the model (optional)

nlp.update takes a batch of annotated examples and performs one training step on the pipeline.
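A rough sketch of what an update loop can look like, assuming the spaCy 3.x training API (spacy.training.Example and nlp.initialize). The single training sentence, the PRODUCT label and the number of iterations are made up for illustration. One example is far too little to train a useful model.

```python
import spacy
from spacy.training import Example

# Start from a blank English pipeline and add an untrained NER component.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("PRODUCT")

# Illustrative training data: character offsets 14-20 cover "iPhone".
TRAIN_DATA = [
    ("I love my new iPhone", {"entities": [(14, 20, "PRODUCT")]}),
]
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]

# initialize() sets up the model weights and returns an optimizer.
optimizer = nlp.initialize(lambda: examples)

losses = {}
for _ in range(5):
    # Each call performs one update step on the batch of examples.
    nlp.update(examples, sgd=optimizer, losses=losses)

print(losses)
```

The losses dictionary accumulates the loss per component, keyed by component name.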

Step 3. Call the nlp object on a text to get a processed Doc:

doc = nlp("document sentence here")

Print items from the Doc

- Print the tokens
# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

- Print the entities

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

This prints each entity's text, its start and end character offsets in the document, and its label.
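A self-contained sketch of reading doc.ents, assuming spaCy 3.x. So that it runs without downloading a model, the entities here come from a rule-based entity_ruler with two hand-written patterns, not from a trained statistical NER component. The patterns and the sentence are illustrative.

```python
import spacy

nlp = spacy.blank("en")

# Rule-based stand-in for a trained NER component (assumption for this sketch).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": "U.K."},
])

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Collect text, character offsets and label for every entity.
entities = [
    (ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents
]
for entity in entities:
    print(*entity)
```

With a pretrained pipeline such as en_core_web_sm, the same loop over doc.ents works unchanged. Only the source of the entities differs.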

A Doc can also be sliced, for example doc[1:4], which returns a Span.
Both tokens and spans expose their text through the .text attribute.
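A short sketch of token iteration and slicing, assuming spaCy 3.x. It uses a blank English pipeline, since tokenization does not need a trained model. The sentence is illustrative.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterating over a Doc yields Token objects.
tokens = [token.text for token in doc]

# Slicing a Doc by token index returns a Span (tokens 1, 2 and 3 here).
span = doc[1:4]

print(tokens)
print(span.text)
```

Slice indices count tokens, not characters, so doc[1:4] covers the second through fourth tokens.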

Additional functions

Pipeline

print(nlp.pipeline)

Disable pipeline components for custom training

nlp.disable_pipes

Why disable components? If you are using spaCy for just one task, such as NER, you can disable the components you do not need and skip their work. In spaCy 3, nlp.select_pipes is the preferred form and disable_pipes is deprecated.
List the names of the pipeline components with nlp.pipe_names (a property, not a method).
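A small sketch of inspecting and temporarily disabling components, assuming spaCy 3.x, where nlp.select_pipes is the current replacement for disable_pipes. The two components added here (sentencizer and entity_ruler) are picked only because they need no downloaded model.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# pipe_names is a property that lists active component names in order.
all_names = list(nlp.pipe_names)

# Inside the with block the listed components are disabled.
with nlp.select_pipes(disable=["sentencizer"]):
    active_names = list(nlp.pipe_names)

# On leaving the block the disabled components are restored.
restored_names = list(nlp.pipe_names)

print(all_names, active_names, restored_names)
```

Using the context manager keeps the disabling scoped to a training loop, so the full pipeline is available again afterwards.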

Save the trained model

nlp.to_disk
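A minimal sketch of saving a pipeline and loading it back, assuming spaCy 3.x. A temporary directory stands in for a real model path, and the blank pipeline with a sentencizer stands in for a trained model.

```python
import tempfile

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the config, vocabulary and components to a directory.
    nlp.to_disk(tmp_dir)
    # spacy.load also accepts a path to a saved pipeline directory.
    reloaded = spacy.load(tmp_dir)

print(reloaded.pipe_names)
```

In practice the path would be a permanent directory, so the trained pipeline can be loaded again in a later session.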

NER
spaCy's NER model uses Bloom embeddings, a memory-efficient type of word embedding, together with a 1D convolutional neural network (CNN).

NLTK
NLTK is not part of spaCy. It is an entry-level NLP toolkit that can do basic part-of-speech tagging, but it does not have advanced functionality like spaCy's.

spaCy Deep Learning

Training

Source: https://spacy.io/usage/training

SpaCy word embedding

To print a word embedding, access the .vector attribute of a token (it is an attribute, not a method), for example print(token.vector).
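A sketch of reading token vectors, assuming spaCy 3.x and NumPy. Real embeddings come from a model package that ships vectors. To stay runnable without a download, this example registers two hand-made 3-dimensional vectors in a blank vocabulary, and the values are invented.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")

# Create an empty vectors table of width 3, then add illustrative vectors.
nlp.vocab.reset_vectors(width=3)
nlp.vocab.set_vector("cat", np.array([1.0, 0.0, 0.5], dtype="float32"))
nlp.vocab.set_vector("dog", np.array([0.9, 0.1, 0.4], dtype="float32"))

doc = nlp("cat dog")
cat = doc[0]

# .vector is a NumPy array; .has_vector says whether the word is in the table.
print(cat.has_vector)
print(cat.vector)
```

With a vectors-enabled model loaded through spacy.load, the same token.vector access returns the pretrained embedding.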

SpaCy for Biomedical Research

scispaCy is a Python package that provides spaCy models for processing biomedical, clinical and scientific text.

Source: https://allenai.github.io/scispacy/

Why is Natural Language Processing hard?

Limitation of pre-trained models

"A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text." - spaCy documentation
