Introduction
In this blog post, we will discuss one of the major tasks in Natural Language Processing (NLP): Named Entity Recognition.
As the name suggests, it recognizes entity types in text, e.g., it detects whether an organization is mentioned and, if so, extracts its name.
Generally, we deal with 5-7 basic entity types such as organization, person, time, date, number, and money.
NER (Named Entity Recognition)
Building an NER model, whether for basic or custom entities, requires a large labeled dataset.
Labeling schemes differ from tool to tool: Stanford NER uses IOB encoding, while spaCy uses a start-index/end-index span format.
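To make the difference concrete, here is the same toy sentence annotated both ways (an illustrative example; the entity labels are made up):
## IOB encoding: one tag per token (B- begins an entity, I- continues it, O = outside)
tokens = ["John", "Smith", "works", "at", "Google"]
iob_tags = ["B-PER", "I-PER", "O", "O", "B-ORG"]
## spaCy-style span annotation: character start/end offsets plus a label
text = "John Smith works at Google"
annotations = {"entities": [(0, 10, "PERSON"), (20, 26, "ORG")]}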
Many prebuilt NER models are readily available, such as those from Stanford CoreNLP, spaCy, and AllenNLP.
Today we will train our own custom model to get an idea of how these prebuilt NER models are built.
To train a custom model, i.e., one covering new entity types, you need to annotate the data yourself.
The dataset used here is medical data. You can find it here:
https://www.dropbox.com/s/ef5g11fdq7igi74/hackathon_disease_extraction.zip?dl=0
Required libraries:
- pandas
- numpy
- Keras
- TensorFlow
- unicodedata (Python standard library)
- keras-contrib (for the CRF layer)
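Note that keras-contrib is not published on PyPI; at the time of writing it installs straight from GitHub with `pip install git+https://www.github.com/keras-team/keras-contrib.git`.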
You can choose your own architecture, i.e., add more layers, and you can also apply hyperparameter tuning.
## Importing all required packages
import pandas as pd
import numpy as np
import unicodedata
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout
from keras_contrib.layers import CRF
## Loading data
train_data = pd.read_csv("./data/input/train.csv")
test_data = pd.read_csv("./data/input/test.csv")
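Before analyzing anything, it is worth confirming the file layout. A minimal check, assuming the CSVs contain the columns the rest of this post relies on (Sent_ID, Word, and tag):
## Quick look at the raw files
print(train_data.head())
print(train_data.columns.tolist())  # expected to include 'Sent_ID', 'Word', 'tag'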
##-------------------------------------- Data Analysis ----------------------------------- ##
print("Training data summarization\n:",train_data.nunique())
## Getting the list of words
words = list(set(train_data["Word"].append(test_data["Word"]).values))
## Creating the vocabulary
## Converting into ascii form
words = [unicodedata.normalize('NFKD', str(w)).encode('ascii','ignore') for w in words]
n_words = len(words)
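The NFKD normalization plus ASCII encoding strips accents and other Unicode variants, so equivalent spellings map to a single vocabulary entry. For example:
## 'café' loses its accent and becomes plain ASCII bytes
print(unicodedata.normalize('NFKD', 'café').encode('ascii', 'ignore'))  # b'cafe'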
## Creating the list of tags
tags = list(set(train_data["tag"].values))
n_tags = len(tags)
## Mapping each word and tag to an index for later lookup
word_idx = {word:index for index, word in enumerate(words)}
tag_idx = {tag:index for index, tag in enumerate(tags)}
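A quick sanity check on the mappings (the exact tags and counts depend on the dataset):
## Inspecting vocabulary size and the tag mapping
print(n_words, n_tags)
print(tag_idx)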
##------------------------------------ Preparing the dataset --------------------------------------------##
## Grouping training tokens into sentences as (word, tag) pairs
word_tag_func = lambda s: [(word, tag) for word, tag in zip(s["Word"].values, s["tag"].values)]
grouped_word_tag = train_data.groupby("Sent_ID").apply(word_tag_func)
sentences = [s for s in grouped_word_tag]
## Grouping test tokens into sentences (words only -- the test set carries no tags)
word_func = lambda s: [word for word in s["Word"].values]
grouped_word = test_data.groupby("Sent_ID").apply(word_func)
test_sentences = [s for s in grouped_word]
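At this point each training sentence is a list of (word, tag) tuples, and each test sentence is a plain list of words:
## Peeking at the grouped sentences
print(sentences[0][:5])       # first five (word, tag) pairs of the first training sentence
print(test_sentences[0][:5])  # first five words of the first test sentence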
##------------------------------- Preparing data for modelling ----------------------------##
X_train = [[word_idx[unicodedata.normalize('NFKD', str(w[0])).encode('ascii', 'ignore')]
            for w in s] for s in sentences]
## Preparing input training data
X_train = pad_sequences(sequences= X_train, maxlen=180, padding='post')
print(len(X_train))
X_test = [[word_idx[unicodedata.normalize('NFKD', str(w)).encode('ascii', 'ignore')]
           for w in s] for s in test_sentences]
## Preparing input test data
X_test = pad_sequences(sequences= X_test, maxlen= 180, padding='post')
## Preparing output training data
y_train = [[tag_idx[w[1]] for w in s] for s in sentences]
y_train = pad_sequences(sequences=y_train, maxlen=180, padding= 'post', value= tag_idx["O"])
y_train = [to_categorical(i, num_classes=n_tags) for i in y_train]
y_train = np.array(y_train)
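Before building the model, verify the tensor shapes: inputs should be (n_sentences, 180) and labels (n_sentences, 180, n_tags).
## Shape check
print(X_train.shape, y_train.shape)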
##------------------------------------------------ Model Creation -----------------------------------------------##
input_layer = Input(shape=(180,))
model = Embedding(input_dim=n_words, output_dim=180, input_length=180)(input_layer)
model = Dropout(0.1)(model)
model = LSTM(units=150, return_sequences=True, recurrent_dropout=0.1)(model)
model = TimeDistributed(Dense(n_tags, activation="relu"))(model)
crf_model = CRF(n_tags)  # one CRF state per tag; labels were one-hot encoded with n_tags classes
output = crf_model(model)
model = Model(input_layer, output)
## The CRF layer ships its own loss and accuracy metric in keras-contrib
model.compile(optimizer='adam', loss=crf_model.loss_function, metrics=[crf_model.accuracy])
model.summary()
fitted_model = model.fit(X_train, y_train, batch_size=48, epochs=5, verbose=1)
print(fitted_model.history)  # per-epoch loss and accuracy
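Once training finishes, you will want tag names rather than index vectors. Here is a minimal decoding sketch; the inverse mapping idx_tag is a helper introduced here, not something defined earlier:
## Predicting on the test set and mapping indices back to tag names
idx_tag = {index: tag for tag, index in tag_idx.items()}
pred = model.predict(X_test)  # shape: (n_test_sentences, 180, n_tags)
pred_tags = [[idx_tag[i] for i in np.argmax(p, axis=-1)] for p in pred]
print(pred_tags[0][:10])  # predicted tags for the first ten tokens of the first test sentence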
Conclusion
We have seen how to create our own NER model for some basic entities, and you can add more entities to build a custom NER model.
Much of the program above is general boilerplate; beyond that, you can change the model architecture according to your needs.