Introduction
In this blog post, we will discuss one of the major tasks in Natural Language Processing (NLP): Named Entity Recognition.
As the name suggests, it recognizes entity types in text, e.g., it detects whether an organization is mentioned and, if so, extracts its name.
Generally, we deal with 5-7 basic entity types such as organization, person, time, date, number, and money.
NER (Named Entity Recognition)
Building an NER model, whether for basic or custom entities, requires a large labeled dataset.
Labeling schemes differ from tool to tool: Stanford NER uses IOB encoding, while spaCy uses a start-index/end-index span format.
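To make the difference concrete, here is the same toy sentence annotated both ways (an illustrative example; the entity labels are made up):
## IOB encoding: one tag per token (B- begins an entity, I- continues it, O = outside)
tokens = ["John", "Smith", "works", "at", "Google"]
iob_tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
## spaCy-style span annotation: character start/end offsets plus a label
text = "John Smith works at Google"
annotations = {"entities": [(0, 10, "PERSON"), (20, 26, "ORG")]}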
Many prebuilt NER models are readily available, such as those from Stanford CoreNLP, spaCy, and AllenNLP.
Today we will train our own custom model to get an idea of how these prebuilt NER models are built.
To train a custom model, i.e., one covering new entity types, you need to annotate the data yourself.
The dataset used here is medical data. You can find it here:
https://www.dropbox.com/s/ef5g11fdq7igi74/hackathon_disease_extraction.zip?dl=0
Required libraries:
- pandas
- numpy
- Keras
- TensorFlow
- unicodedata (Python standard library)
- keras-contrib (for the CRF layer)
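Note that keras-contrib is not published on PyPI; at the time of writing it installs straight from GitHub with `pip install git+https://www.github.com/keras-team/keras-contrib.git`.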
You can choose your own architecture, i.e., add more layers, and you can also apply hyperparameter tuning.
## Importing all required packages
import pandas as pd
import numpy as np
import unicodedata
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout
from keras_contrib.layers import CRF
## Loading data
train_data = pd.read_csv("./data/input/train.csv")
test_data = pd.read_csv("./data/input/test.csv")
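Before analyzing anything, it is worth confirming the file layout. A minimal check, assuming the CSVs contain the columns the rest of this post relies on (Sent_ID, Word, and tag):
## Quick look at the raw files
print(train_data.head())
print(train_data.columns.tolist())  # expected to include 'Sent_ID', 'Word', 'tag'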
##-------------------------------------- Data Analysis ----------------------------------- ##
print("Training data summarization\n:",train_data.nunique())
## Getting the list of words
words = list(set(train_data["Word"].append(test_data["Word"]).values))
## Creating the vocabulary
## Converting into ascii form
words = [unicodedata.normalize('NFKD', str(w)).encode('ascii','ignore') for w in words]
n_words = len(words)
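The NFKD normalization plus ASCII encoding strips accents and other Unicode variants, so equivalent spellings map to a single vocabulary entry. For example:
## 'café' loses its accent and becomes plain ASCII bytes
print(unicodedata.normalize('NFKD', 'café').encode('ascii', 'ignore'))  # b'cafe'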
## Creating the list of tags
tags = list(set(train_data["tag"].values))
n_tags = len(tags)
## Mapping each word and tag to an index for later lookup
word_idx = {word:index for index, word in enumerate(words)}
tag_idx = {tag:index for index, tag in enumerate(tags)}
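A quick sanity check on the mappings (the exact tags and counts depend on the dataset):
## Inspecting vocabulary size and the tag mapping
print(n_words, n_tags)
print(tag_idx)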
##------------------------------------ Preparing the dataset --------------------------------------------##
## Grouping training tokens into sentences as (word, tag) pairs
word_tag_func = lambda s: [(word, tag) for word, tag in zip(s["Word"].values, s["tag"].values)]
grouped_word_tag = train_data.groupby("Sent_ID").apply(word_tag_func)
sentences = [s for s in grouped_word_tag]
## Grouping test tokens into sentences (words only -- the test set carries no tags)
word_func = lambda s: [word for word in s["Word"].values]
grouped_word = test_data.groupby("Sent_ID").apply(word_func)
test_sentences = [s for s in grouped_word]
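At this point each training sentence is a list of (word, tag) tuples, and each test sentence is a plain list of words:
## Peeking at the grouped sentences
print(sentences[0][:5])       # first five (word, tag) pairs of the first training sentence
print(test_sentences[0][:5])  # first five words of the first test sentence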
##------------------------------- Preparing data for modelling ----------------------------##
X_train = [[word_idx[unicodedata.normalize('NFKD', str(w[0])).encode('ascii', 'ignore')]
            for w in s] for s in sentences]
## Preparing input training data
X_train = pad_sequences(sequences= X_train, maxlen=180, padding='post')
print(len(X_train))
X_test = [[word_idx[unicodedata.normalize('NFKD', str(w)).encode('ascii', 'ignore')]
           for w in s] for s in test_sentences]
## Preparing input test data
X_test = pad_sequences(sequences= X_test, maxlen= 180, padding='post')
## Preparing output training data
y_train = [[tag_idx[w[1]] for w in s] for s in sentences]
y_train = pad_sequences(sequences=y_train, maxlen=180, padding= 'post', value= tag_idx["O"])
y_train = [to_categorical(i, num_classes=n_tags) for i in y_train]
y_train = np.array(y_train)
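Before building the model, verify the tensor shapes: inputs should be (n_sentences, 180) and labels (n_sentences, 180, n_tags).
## Shape check
print(X_train.shape, y_train.shape)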
##------------------------------------------------ Model Creation -----------------------------------------------##
input_layer = Input(shape=(180,))
model = Embedding(input_dim=n_words, output_dim=180, input_length=180)(input_layer)
model = Dropout(0.1)(model)
model = LSTM(units=150, return_sequences=True, recurrent_dropout=0.1)(model)
model = TimeDistributed(Dense(n_tags, activation="relu"))(model)
crf_model = CRF(n_tags)  # one CRF state per tag; labels were one-hot encoded with n_tags classes
output = crf_model(model)
model = Model(input_layer, output)
## The CRF layer ships its own loss and accuracy metric in keras-contrib
model.compile(optimizer='adam', loss=crf_model.loss_function, metrics=[crf_model.accuracy])
model.summary()
fitted_model = model.fit(X_train, y_train, batch_size=48, epochs=5, verbose=1)
print(fitted_model.history)  # per-epoch loss and accuracy
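Once training finishes, you will want tag names rather than index vectors. Here is a minimal decoding sketch; the inverse mapping idx_tag is a helper introduced here, not something defined earlier:
## Predicting on the test set and mapping indices back to tag names
idx_tag = {index: tag for tag, index in tag_idx.items()}
pred = model.predict(X_test)  # shape: (n_test_sentences, 180, n_tags)
pred_tags = [[idx_tag[i] for i in np.argmax(p, axis=-1)] for p in pred]
print(pred_tags[0][:10])  # predicted tags for the first ten tokens of the first test sentence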
Conclusion
We have seen how to create our own NER model for some basic entities, and you can add more entities to build a custom NER model.
Much of the program above is general boilerplate; beyond that, you can change the model architecture according to your needs.