
Problem-statement-dependent processing for text in NLP

In this blog, we cover some basic text processing when handling text data in natural language processing.

If we are trying to solve a text comprehension or NLU (Natural Language Understanding) problem with machine learning, it is very difficult to blindly apply any algorithm. To avoid that, we first have to understand our data and apply some targeted processing to get insights.

Also, sometimes we do not have enormous data, say 10k data points, on which to apply a deep learning model.

In my experience, processing depends largely on your problem statement and the kind of data available.

So, let’s deep dive into some nitty-gritty processing.

  • Understanding POS tagging from different libraries
  • Contracting or expanding text for better and clearer understanding
  • Spell checking

I will explain all of this with the help of Python and its libraries.

If you do not have these libraries set up on your machine, then just use:

Requirements to be installed:

  1. Install NLTK: pip install nltk
  2. Install spaCy: pip install spacy

For some other NLTK resources, just use nltk.download('package_name'), e.g. nltk.download('punkt') for the tokenizer; for the spaCy model used below, run python -m spacy download en_core_web_sm.


Understanding POS tagging from different libraries

For sure, in any programming language there are many libraries for the same task, and text processing/NLP is no exception. Suppose we want to perform NER (Named Entity Recognition): the most commonly used library is NLTK, and apart from that we have other common libraries like spaCy, AllenNLP, etc. For NER with NLTK, we can use simple POS tagging and then chunking to extract the entities; apart from that, there is the Stanford NER model from Stanford University, which works in a very different fashion. The same is the case with spaCy's built-in NER model or POS tagging.

So, which one to use depends upon the subproblem of your bigger problem statement and on what suits you best.

```python
import nltk

nltk.pos_tag(nltk.word_tokenize('hundreds thousand'))
```

## Output: This is the output from the NLTK library.

[('hundreds', 'NNS'), ('thousand', 'VBP')]

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Hundreds thousand')
for token in doc:
    print(token.text, token.pos_)
```

## Output: This is the output from the spaCy library.

Hundreds NOUN
thousand NUM

So, you can see from the above outputs that both POS taggers are working differently.

Sometimes it is useful to apply POS tagging and then regex/chunking to extract the information, and sometimes to use an NER model and, on top of that, apply regex or rule-based logic to get results according to your requirement.

```python
# A sample pattern; which tags to chunk depends on your data
grammar = "NP: {<PRP><NN.*>}  # Chunk for extracting the person from an opening sentence"
cp = nltk.RegexpParser(grammar)
print(cp.parse(nltk.pos_tag(nltk.word_tokenize('myself benoy'))))
```

This is chunking after POS tagging, here used to extract the person's name from an opening sentence.

## Output:

```python
import re

doc = nlp('Hi, Your admission fees is for a month 20000')

for ent in doc.ents:
    if ent.label_ == 'DATE' or ent.label_ == 'CARDINAL':
        for x in ent.text.split():
            if re.findall("[a-z]+", x) != []:
                print(re.findall("[a-z]+", x))
```

## Output:

['a']

['month']

The above code helps in detecting a duration that appears alongside a cardinal number, as in "you are paying rent of 20k per month".

It is applicable in many use cases. The above code uses NER, then regex, and some filtering.

So, you can see that how to approach it depends upon the problem.

Contracting or expanding text for better and clearer understanding

In most cases, we need to replace different writing forms with a standard one, so that our model interprets them the same way every time.

For example, some will type "I'm doing well" while others will type "I am doing well". Of these, "I am" is the standard form compared to "I'm". There can be many cases like this, such as wouldn't, don't, etc.

You have to make your own list for this kind of item. These are examples of expansion.
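As a minimal sketch, such a hand-made expansion list can be applied with a regex. The CONTRACTIONS map below contains only a few illustrative entries, not a complete list, and the expanded text is normalized to lowercase:

```python
import re

# Hand-made contraction map; extend it for your own data
# (the entries here are only illustrative)
CONTRACTIONS = {
    "i'm": "i am",
    "don't": "do not",
    "wouldn't": "would not",
    "can't": "cannot",
}

def expand_contractions(text):
    """Replace each known contraction with its expanded (lowercase) form."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm doing well, don't worry"))
# -> i am doing well, do not worry
```

The word boundaries (\b) keep the pattern from matching inside longer words, and the case-insensitive flag catches variants like "I'M" or "Don't".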

Sometimes you are handling abbreviations or chat data, such as Twitter/Facebook comments; then you have to replace short forms like LOL with "laughing out loud", etc.

There are a couple of ways to handle this type of thing: either make your own list, if the short forms repeat and new ones are unlikely to appear, or refer to websites that give the full forms of commonly occurring short forms for your domain.

A few you can refer to are:

https://www.webopedia.com/quick_ref/textmessageabbreviations.asp

https://www.webopedia.com/quick_ref/Twitter_Dictionary_Guide.asp

More links like these are available.
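A simple token-level replacement built from such a list can be sketched as below; the SLANG dictionary entries are only illustrative examples, not a complete mapping:

```python
# Illustrative short-form dictionary; in practice, build it from your own
# data or from reference lists like the ones linked above
SLANG = {
    "lol": "laughing out loud",
    "brb": "be right back",
    "idk": "i do not know",
}

def expand_slang(text):
    """Replace known short forms token by token, case-insensitively."""
    return " ".join(SLANG.get(token.lower(), token) for token in text.split())

print(expand_slang("LOL I will brb"))
# -> laughing out loud I will be right back
```

Because the lookup is per whitespace-separated token, it never rewrites substrings inside longer words; punctuation attached to a token would need extra handling.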

Spell Checker

One more thing you can use is spell checking. This, too, depends on the problem statement. A spell checker replaces spelling mistakes with the correct form.

This spell checker is useful in tasks such as question answering or finding the similarity between two statements.

Install pyspellchecker from PyPI using pip install pyspellchecker, then:

```python
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['hapenning', 'rann', 'hello'])

for word in misspelled:
    # Get the one `most likely` answer
    print("Correct Spelling for {0} word is {1}".format(word, spell.correction(word)))
    # Get a list of `likely` options
    print("Possible corrections:", spell.candidates(word))
    print("\n")
```

## Output

Correct Spelling for rann word is ran Possible corrections: {'renn', 'crann', 'gann', 'ranh', 'hann', 'rand', 'wann', 'dann', 'mann', 'vann', 'ran', 'ann', 'cann', 'rant', 'rank', 'rain', 'rang', 'bann', 'sann', 'rana'}

Correct Spelling for hapenning word is happening Possible corrections: {'happening', 'penning', 'henning'}

Conclusion

In this blog, I hope you came to understand what type of processing is necessary for your problem, apart from basic processing like punctuation removal, lemmatization, stemming, etc.
