Problem-statement-dependent text processing in NLP

In this blog we will cover some basic text processing steps for handling text data in Natural Language Processing (NLP).

If we are trying to solve a text-understanding or NLU (Natural Language Understanding) problem with machine learning, it is rarely a good idea to apply an algorithm blindly. To avoid that, we first have to understand our data and try a few small experiments to get insights.

Also, sometimes we do not have enormous amounts of data, say 10k data points, with which to apply a deep learning model.

In my experience, the right processing depends very much on your problem statement and the kind of data available.

So, let’s dive deep into the nitty-gritty of processing.

  • Understanding POS tagging from different libraries
  • Contracting or expanding text for a better and clearer understanding
  • Spell checker

I will explain all of this with the help of Python and its libraries.

If Python is not installed on your machine, install it first.

Requirements to be installed:

  1. Install NLTK: pip install nltk
  2. Install spaCy: pip install spacy

For other NLTK resources (tokenizers, taggers, etc.), just use nltk.download('package_name')

  • Understanding POS tagging from different libraries

    For any given task there are usually many libraries in any programming language, and text processing/NLP is no different. Suppose we want to perform NER (Named Entity Recognition): the most commonly used library is NLTK, and apart from that we have other popular libraries like spaCy, AllenNLP, etc. With NLTK we can use simple POS tagging followed by chunking to extract the entities; apart from that there is the Stanford NER model from Stanford University, which works in a very different fashion, and the same is the case with spaCy's built-in NER model and POS tagging.

    So, which one to use depends on the sub-problem within your bigger problem statement and on what suits your needs.

    import nltk
    nltk.pos_tag(nltk.word_tokenize('hundreds thousand'))

    ## Output: This is the output from the NLTK library.

    [('hundreds', 'NNS'), ('thousand', 'VBP')]

    import spacy
    nlp = spacy.load('en_core_web_sm')
    doc = nlp('Hundreds thousand')
    for token in doc:
        print(token.text, token.pos_)

    ## Output: This is the output from the spaCy library.

    Hundreds NOUN

    thousand NUM

    So, you can see from the outputs above that the two POS taggers work in different ways.

    Sometimes it is useful to POS-tag and then apply regex/chunking to extract the information; at other times it is better to use an NER model and apply some regex or rules on top of it to get results according to your requirement.

    # Note: the original chunk pattern was lost; <PRP><NN> below is an illustrative guess.
    grammar = "NP: {<PRP><NN>} # Chunk for extracting a person from an opening sentence"
    cp = nltk.RegexpParser(grammar)
    cp.parse(nltk.pos_tag(nltk.word_tokenize('myself benoy')))

    This is chunking after POS tagging, used here to extract the person's name.


    import re

    doc = nlp('Hi, Your admission fees is for a month 20000')

    for ent in doc.ents:
        if ent.label_ == 'DATE' or ent.label_ == 'CARDINAL':
            for x in ent.text.split():
                if re.findall("[a-z]+", x) != []:
                    print(re.findall("[a-z]+", x))

    ## Output:



    The code above helps detect a duration mentioned alongside a cardinal number, as in "you are paying rent 20k per month", etc.

    It can be applied in many use cases. The code uses NER first, then regex and some filtering.

    So, you can see that how you approach it depends on the problem.
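    When a full NER model is not available, a plain-regex sketch using only the standard library can pull out similar duration-plus-amount information. The patterns below are my own illustrative assumptions, not from the original code:

```python
import re

text = "Hi, Your admission fees is for a month 20000"

# Illustrative patterns: a duration word, and a standalone number of 3+ digits.
durations = re.findall(r"\b(day|week|month|year)s?\b", text, flags=re.IGNORECASE)
amounts = re.findall(r"\b\d{3,}\b", text)

print(durations, amounts)
```

    This trades the generality of a trained NER model for zero dependencies, which can be enough for narrow, well-understood inputs.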

  • Contracting or expanding text for a better and clearer understanding

    In most cases we need to replace the different writing forms with a standard one, so that our model interprets them the same way every time.

    For example, some people will type "I'm doing well", others "I am doing well". Of these, "I am" is the standard form compared to "I'm". There are many cases like this, such as wouldn't, don't, etc.

    You have to make your own list for these kinds of items. These are basically examples of expansion.

    Sometimes you are handling abbreviations or chat data, such as Twitter/Facebook comments; then you have to replace short forms like LOL with "laughing out loud", etc.

    There are a couple of ways to handle this: you can build your own list if the forms repeat and new short forms are rare, or you can refer to websites that give full forms of commonly occurring short forms for your domain.
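    A minimal sketch of such a replacement table, using only the standard library. The mapping below is a tiny illustrative sample, not a complete list:

```python
import re

# Tiny illustrative mapping; a real list would be much larger.
EXPANSIONS = {
    "I'm": "I am",
    "don't": "do not",
    "wouldn't": "would not",
    "LOL": "laughing out loud",
}

# One alternation over all known short forms, bounded by word breaks.
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in EXPANSIONS) + r")\b")

def expand(text):
    """Replace each known short form with its standard expanded form."""
    return _pattern.sub(lambda m: EXPANSIONS[m.group(1)], text)

print(expand("I'm doing well, don't worry LOL"))
```

    Compiling one combined pattern keeps the replacement a single pass over the text, no matter how large the list grows.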


  • Spell Checker

    One more thing you can use is spell checking. This too is problem-statement dependent. A spell checker replaces spelling mistakes with the correct form.

    It is useful in tasks like question answering and finding the similarity between two statements.

    Install pyspellchecker from PyPI using:

    pip install pyspellchecker

    from spellchecker import SpellChecker

    spell = SpellChecker()

    # find those words that may be misspelled
    misspelled = spell.unknown(['hapenning', 'rann', 'hello'])

    for word in misspelled:
        # Get the one `most likely` answer
        print("Correct Spelling for {0} word is {1}".format(word, spell.correction(word)))
        # Get a list of `likely` options
        print("Possible corrections:", spell.candidates(word))
        print("\n")

    ## Output

    Correct Spelling for rann word is ran
    Possible corrections: {'renn', 'crann', 'gann', 'ranh', 'hann', 'rand', 'wann', 'dann', 'mann', 'vann', 'ran', 'ann', 'cann', 'rant', 'rank', 'rain', 'rang', 'bann', 'sann', 'rana'}

    Correct Spelling for hapenning word is happening
    Possible corrections: {'happening', 'penning', 'henning'}
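    If installing a package is not an option, a rough stdlib-only alternative is difflib.get_close_matches. This is only a sketch, not a replacement for a real spell checker, and the vocabulary below is a made-up sample:

```python
import difflib

# Made-up sample vocabulary; in practice use a large word-frequency list.
vocabulary = ["happening", "penning", "ran", "rain", "rank", "hello"]

def suggest(word, n=3):
    """Return up to n vocabulary words closest to `word` by similarity ratio."""
    return difflib.get_close_matches(word, vocabulary, n=n, cutoff=0.6)

print(suggest("hapenning"))
print(suggest("rann"))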

  • Conclusion

    In this blog, I hope you were able to understand what type of processing is necessary for your problem, beyond basic processing like punctuation removal, lemmatization, stemming, etc.

