
Overview of Named Entity Recognition Using CRF

Introduction


In this blog, we are going to discuss one of the major tasks of Natural Language Processing: Named Entity Recognition (NER).

As the name suggests, it helps in recognizing entity types in text, e.g., detecting whether an organization is mentioned and, if so, what its name is.

Generally, we deal with five to seven basic entity types, such as organization, person, time, date, number, and money.

NER (Named Entity Recognition)

In order to build an NER model for basic or custom entities, you will definitely require a large labeled dataset.

There are different labeling methods: Stanford NER uses IOB encoding, while spaCy uses a start-index/end-index format.
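To make the difference concrete, here is an illustrative sketch showing the same sentence under both schemes (the sentence and labels are invented for demonstration):

sentence = "John works at Google"

## IOB encoding: one tag per token; B- marks the beginning of an entity,
## I- marks its continuation, O marks tokens outside any entity
iob_tags = [("John", "B-PER"), ("works", "O"), ("at", "O"), ("Google", "B-ORG")]

## spaCy-style annotation: character start/end offsets plus a label
spacy_format = ("John works at Google", {"entities": [(0, 4, "PERSON"), (14, 20, "ORG")]})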

Many prebuilt NER models are readily available, such as those from Stanford CoreNLP, spaCy, and AllenNLP.
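As a quick illustration of how easy the prebuilt models are to use, here is a minimal spaCy sketch (it assumes the en_core_web_sm model has been downloaded):

import spacy

## Load a pretrained English pipeline and print its entity predictions
nlp = spacy.load("en_core_web_sm")
doc = nlp("Google was founded by Larry Page and Sergey Brin.")
for ent in doc.ents:
    print(ent.text, ent.label_)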

Today we will see how to train our own custom model, in order to get some idea of how these prebuilt NER models are built.

To train a custom model, i.e., one for new entities, you need to annotate the data.
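The dataset used below follows a one-token-per-row convention, with the columns Sent_ID, Word, and tag that the training script relies on. A few rows might look like this (the tag values here are illustrative; the real dataset defines its own tag set):

Sent_ID, Word, tag
1, Obesity, B-Disease
1, increases, O
1, the, O
1, risk, O
1, of, O
1, diabetes, B-Disease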

Implementation in Python

The dataset used is medical data; you can find it here:

https://www.dropbox.com/s/ef5g11fdq7igi74/hackathon_disease_extraction.zip?dl=0

Required libraries: pandas, NumPy, Keras, and keras_contrib (for the CRF layer).

You can choose your own architecture, e.g., add more layers, and you can also apply hyperparameter tuning.
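For example, a common variant (a sketch, not the architecture used below) wraps the LSTM in a Bidirectional layer so the tagger sees context on both sides of each token; the LSTM line in the script below would be replaced with:

from keras.layers import Bidirectional

## Bidirectional wrapper around the same LSTM configuration (illustrative)
model = Bidirectional(LSTM(units=150, return_sequences=True, recurrent_dropout=0.1))(model)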

## Importing all required packages
import pandas as pd
import numpy as np
import unicodedata
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras_contrib.layers import CRF
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout

## Loading data
train_data = pd.read_csv("./data/input/train.csv")
test_data = pd.read_csv("./data/input/test.csv")

## ------------------------------ Data Analysis ------------------------------ ##
print("Training data summarization:\n", train_data.nunique())

## Creating the vocabulary from the union of train and test words,
## normalized to ASCII form
words = list(set(train_data["Word"].append(test_data["Word"]).values))
words = [unicodedata.normalize('NFKD', str(w)).encode('ascii', 'ignore') for w in words]
n_words = len(words)

## Creating the list of tags
tags = list(set(train_data["tag"].values))
n_tags = len(tags)

## Mapping words and tags to indices so we can refer to them later
word_idx = {word: index for index, word in enumerate(words)}
tag_idx = {tag: index for index, tag in enumerate(tags)}

## --------------------------- Preparing the dataset --------------------------- ##
## Group the training data into sentences of (word, tag) pairs
word_tag_func = lambda s: [(word, tag) for word, tag in zip(s["Word"].values, s["tag"].values)]
grouped_word_tag = train_data.groupby("Sent_ID").apply(word_tag_func)
sentences = [s for s in grouped_word_tag]

## Test sentences contain words only
word_func = lambda s: [word for word in s["Word"].values]
grouped_word = test_data.groupby("Sent_ID").apply(word_func)
test_sentences = [s for s in grouped_word]

## ------------------------ Preparing data for modelling ------------------------ ##
## Preparing input training data: words -> indices, padded to length 180
X_train = [[word_idx[unicodedata.normalize('NFKD', str(w[0])).encode('ascii', 'ignore')] for w in s] for s in sentences]
X_train = pad_sequences(sequences=X_train, maxlen=180, padding='post')
print(len(X_train))

## Preparing input test data
X_test = [[word_idx[unicodedata.normalize('NFKD', str(w)).encode('ascii', 'ignore')] for w in s] for s in test_sentences]
X_test = pad_sequences(sequences=X_test, maxlen=180, padding='post')

## Preparing output training data: tags -> indices, padded with the "O" tag,
## then one-hot encoded
y_train = [[tag_idx[w[1]] for w in s] for s in sentences]
y_train = pad_sequences(sequences=y_train, maxlen=180, padding='post', value=tag_idx["O"])
y_train = np.array([to_categorical(i, num_classes=n_tags) for i in y_train])

## ------------------------------ Model Creation ------------------------------ ##
input = Input(shape=(180,))
model = Embedding(input_dim=n_words, output_dim=180, input_length=180)(input)
model = Dropout(0.1)(model)
model = LSTM(units=150, return_sequences=True, recurrent_dropout=0.1)(model)
model = TimeDistributed(Dense(n_tags, activation="relu"))(model)
crf_model = CRF(n_tags)
output = crf_model(model)  ## CRF output layer

model = Model(input, output)
## The keras_contrib CRF layer supplies its own loss and accuracy
model.compile(optimizer='adam', loss=crf_model.loss_function, metrics=[crf_model.accuracy])
print(model.summary())

fitted_model = model.fit(X_train, y_train, batch_size=48, epochs=5, verbose=1)
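Once training finishes, the model's per-token outputs can be decoded back into tag names. A minimal sketch, reusing tag_idx, X_test, and model from above:

## Invert the tag index so predictions are human-readable
idx_tag = {index: tag for tag, index in tag_idx.items()}

## Predict on the padded test data and take the most likely tag per token
predictions = model.predict(X_test)
pred_tags = [[idx_tag[np.argmax(token)] for token in sentence] for sentence in predictions]
print(pred_tags[0][:10])  ## tags for the first ten tokens of the first test sentence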

Conclusion

We have seen how to create our own NER model for some basic entities. You can add more entities to build a custom NER model.

Some parts of the above program are generic; the rest, in particular the model architecture, you can change according to your needs.
