Bidirectional RNN and Bidirectional LSTM (Hands-On)
17 Jan 2019
In this post, we use a Bidirectional LSTM network for a hands-on exercise in Part-of-Speech tagging (PoS tagging). This example was written with reference to the PyTorch tutorial (see Reference below).
Preliminaries
- Unlike the previous post, this hands-on uses PyTorch instead of numpy.
- This post uses the Natural Language Toolkit library.
Installing the Natural Language Toolkit and downloading the data set
If the Natural Language Toolkit (NLTK) is not installed yet, install it with a command such as
conda install nltk
Even with the nltk library installed, downloading the data set requires the following additional step.
import nltk
nltk.download('brown')             # Brown corpus
nltk.download('universal_tagset')  # mapping to the universal tagset
This example uses the brown corpus with the universal tagset, so download them as shown above.
Exploring the data
Load the brown corpus tagged with the universal tagset and assign it to a variable, as follows.
from nltk.corpus import brown
brown_news_tagged = brown.tagged_sents(tagset='universal')
import numpy as np
print('The number of sentences in this corpus: %d \n' % len(brown_news_tagged))
print('Sample sentence: ')
print(np.array(brown_news_tagged[0]).T)
The number of sentences in this corpus: 57340
Sample sentence:
[['The' 'Fulton' 'County' 'Grand' 'Jury' 'said' 'Friday' 'an'
'investigation' 'of' "Atlanta's" 'recent' 'primary' 'election'
'produced' '``' 'no' 'evidence' "''" 'that' 'any' 'irregularities'
'took' 'place' '.']
['DET' 'NOUN' 'NOUN' 'ADJ' 'NOUN' 'VERB' 'NOUN' 'DET' 'NOUN' 'ADP'
'NOUN' 'ADJ' 'NOUN' 'NOUN' 'VERB' '.' 'DET' 'NOUN' '.' 'ADP' 'DET'
'NOUN' 'VERB' 'NOUN' '.']]
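As a quick sanity check (added here for illustration, not part of the original tutorial), the tag distribution of the corpus can be inspected with collections.Counter:
from collections import Counter

# Count how often each universal tag occurs across all tagged sentences.
tag_counts = Counter(tag for sent in brown_news_tagged for (word, tag) in sent)
print(tag_counts.most_common(5))   # the five most frequent tags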
Building the word dictionary!
As a preliminary step toward building the input data set, let's build a word dictionary. We will use a method similar to the one from the Vanilla RNN hands-on.
def extract_sentence(pair):
    return [s for (s, t) in pair]

sentences = [extract_sentence(pair) for pair in brown_news_tagged]
flatten = [word for sentence in sentences for word in sentence]
unique_words = list(set(flatten))
word_to_idx = {word: i for i, word in enumerate(unique_words)}
idx_to_word = {i: word for i, word in enumerate(unique_words)}
vocab_size = len(unique_words)
print('There are %d unique words in the data set.' % vocab_size)
There are 56057 unique words in the data set.
import torch
def word2id(sentence):
    return torch.tensor([word_to_idx[word] for word in sentence], dtype=torch.long)
word2id(['Hello', 'my','name','is', 'john', 'hello', 'Hello'])
tensor([48582, 36790, 35040, 35000, 12817, 41889, 48582])
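Note that word_to_idx raises a KeyError for words that never appear in the corpus. If you later want to tag arbitrary text, one common workaround (a sketch, not part of the original code) is to reserve an extra index for unknown words:
UNK_IDX = vocab_size  # hypothetical extra id reserved for out-of-vocabulary words

def word2id_with_unk(sentence):
    # Fall back to UNK_IDX whenever a word is missing from the dictionary.
    return torch.tensor([word_to_idx.get(word, UNK_IDX) for word in sentence],
                        dtype=torch.long)
If you use this variant, the embedding layer below would need vocab_size + 1 rows.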
Building the tag dictionary!
As a preliminary step toward building the target data set, let's build a tag dictionary.
def extract_tag(pair):
    return [t for (s, t) in pair]

sentence_tags = [extract_tag(pair) for pair in brown_news_tagged]
flatten = [tag for sentence_tag in sentence_tags for tag in sentence_tag]
unique_tags = list(set(flatten))
tag_to_idx = {tag: i for i, tag in enumerate(unique_tags)}
tag_size = len(unique_tags)
print('There are %d unique tags in the data set.' % tag_size)
print(tag_to_idx)
There are 12 unique tags in the data set.
{'ADP': 0, 'X': 1, 'VERB': 2, '.': 3, 'ADJ': 4, 'ADV': 5, 'PRT': 6, 'NOUN': 7, 'DET': 8, 'PRON': 9, 'NUM': 10, 'CONJ': 11}
def tag2id(tags):
    return torch.tensor([tag_to_idx[tag] for tag in tags], dtype=torch.long)
tag2id(['ADJ', 'ADP','ADV','CONJ', 'CONJ', 'PRT', 'X'])
tensor([ 4, 0, 5, 11, 11, 6, 1])
Building the training and test data sets
Let's use 75% of the data for training and the remaining 25% for testing.
import random
shuffled = list(brown_news_tagged)
random.seed(1)
random.shuffle(shuffled)
split_idx = int(0.75 * len(brown_news_tagged))
training_data = shuffled[:split_idx]
test_data = shuffled[split_idx:]
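A quick check of the split sizes (added here for illustration):
print('training: %d sentences, test: %d sentences' % (len(training_data), len(test_data)))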
Trying out PyTorch word embeddings
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)
embeds = nn.Embedding(vocab_size, 5) # (vocab_size, embedding_size)
lookup_tensor = torch.tensor([word_to_idx["words"]], dtype=torch.long)
words_embed = embeds(lookup_tensor)
print(words_embed)
tensor([[ 0.1171, 0.0513, 0.8599, -0.8551, -0.3351]],
grad_fn=<EmbeddingBackward>)
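The embedding layer is simply a trainable lookup table; a short check (added for illustration) shows its size and confirms that it receives gradients during training:
print(embeds.weight.shape)          # (vocab_size, 5)
print(embeds.weight.requires_grad)  # True: the table is updated by backprop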
Next, let's pass an embedded sentence through an nn.LSTM layer:
embedding_dim = 5                 # must match the embedding size of embeds above
lstm = nn.LSTM(embedding_dim, 2)  # hidden state size 2
first_hidden = (torch.zeros(1, 1, 2),
                torch.zeros(1, 1, 2))

sent = word2id(sentences[0])      # the first sentence of the corpus
length = len(sent)
embedded_sent = embeds(sent)
lstm(embedded_sent.view(length, 1, -1), first_hidden)
sentence = sentences[0][:5]
print('The original sentence: %s \n' % sentence)
word_ids = word2id(sentence)
print('Ids: %s \n' % word_ids)
sentence_embed = embeds(word_ids)
print('sentence_embed:\n%s' % str(sentence_embed))
The original sentence: ['The', 'Fulton', 'County', 'Grand', 'Jury']
Ids: tensor([23217, 54967, 31112, 23732, 24743])
sentence_embed:
tensor([[-0.6171, -0.1046, -1.2225, 2.1271, 1.8479],
[-0.5408, -0.0800, -0.3077, -0.6716, 0.1900],
[-0.1758, -0.3613, -0.2091, 0.0833, -0.6361],
[-1.3710, 1.6590, -0.2266, 0.4978, 0.1015],
[-1.0120, -0.0862, -2.0142, -2.0842, 0.6107]],
grad_fn=<EmbeddingBackward>)
Building the model
For PoS tagging, we will build the following model: each word is embedded, one LSTM reads the sentence left-to-right and another reads it right-to-left, the two hidden states at each position are concatenated, and a linear layer maps them to tag scores.
from torch.nn.parameter import Parameter
from torch.nn.modules import Module
from torch.nn import init
import math

class BD_LSTM_Tagger(Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(BD_LSTM_Tagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Two unidirectional LSTMs: one reads the sentence left-to-right,
        # the other right-to-left. Each outputs hidden states of size hidden_dim.
        self.lstm_LtoR = nn.LSTM(embedding_dim, hidden_dim)
        self.lstm_RtoL = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer maps the concatenated hidden states to tag space.
        self.hidden2y = nn.Linear(hidden_dim * 2, tagset_size)
        self.hidden_LtoR = self.init_hidden()
        self.hidden_RtoL = self.init_hidden()

    def init_hiddens(self):
        self.hidden_LtoR = self.init_hidden()
        self.hidden_RtoL = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim).
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        length = len(sentence)
        embedded_sent = self.word_embeddings(sentence)

        # Left-to-right pass over the sentence.
        lstm_out_LtoR, self.hidden_LtoR = self.lstm_LtoR(
            embedded_sent.view(length, 1, -1), self.hidden_LtoR)

        # Right-to-left pass: feed the sentence in reverse order, then flip the
        # outputs back so both directions are aligned per word position.
        reversed_embeds = torch.flip(embedded_sent, [0])
        lstm_out_RtoL, self.hidden_RtoL = self.lstm_RtoL(
            reversed_embeds.view(length, 1, -1), self.hidden_RtoL)
        lstm_out_RtoL = torch.flip(lstm_out_RtoL, [0])

        hidden_full = torch.cat((lstm_out_LtoR, lstm_out_RtoL), 2)
        y = self.hidden2y(hidden_full).view(length, -1)
        return F.log_softmax(y, dim=1)
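For reference, nn.LSTM can also handle both directions internally via the bidirectional=True flag. A minimal sketch of an equivalent tagger (assuming the same dimensions as above) could look like this:
class BD_LSTM_Tagger_Builtin(nn.Module):
    # A sketch using the built-in bidirectional flag instead of two manual LSTMs.
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(BD_LSTM_Tagger_Builtin, self).__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)
        self.hidden2y = nn.Linear(hidden_dim * 2, tagset_size)  # 2x: both directions

    def forward(self, sentence):
        length = len(sentence)
        embedded = self.word_embeddings(sentence).view(length, 1, -1)
        lstm_out, _ = self.lstm(embedded)  # defaults to a zero initial hidden state
        y = self.hidden2y(lstm_out.view(length, -1))
        return F.log_softmax(y, dim=1)
It can be trained with the same loop as below, except there are no per-direction hidden states to reset manually.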
Training
def prepare_xy(tagged_sentence):
    sentence = extract_sentence(tagged_sentence)
    tags = extract_tag(tagged_sentence)
    return word2id(sentence), tag2id(tags)
embedding_dim = 100
hidden_dim = 200
model = BD_LSTM_Tagger(embedding_dim, hidden_dim, vocab_size, tag_size)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
# sentence, tags = prepare_xy(training_data[0])
def get_accuracy(prepared_data):
    correct_case = 0
    total_case = 0
    with torch.no_grad():
        for sentence, tags in prepared_data:
            tag_scores = model(sentence)
            # The predicted tag is the argmax of the log-probabilities.
            selected_result = torch.max(tag_scores, dim=1)[1]
            diff = torch.abs(selected_result - tags)
            size = len(tags)
            correct_case += size
            total_case += size
            if (sum(diff) != 0):
                # Subtract the number of mismatched positions.
                wrong_case = torch.sum((diff != 0).int())
                correct_case -= wrong_case
    return float(correct_case) / float(total_case)
prepared_test_data = [prepare_xy(test_sentence) for test_sentence in test_data]
import signal
import matplotlib.pyplot as plt
from IPython import display
# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
# with torch.no_grad():
# inputs, outputs = prepare_xy(training_data[0])
# tag_scores = model(inputs)
# print(tag_scores)
cnt = 0
loss_trace = []
accuracy_trace = []
smooth_loss = 0
total_epoch = 3
display_step = int(total_epoch * len(training_data) / 1000)
display_loss = 0.
prepared_training_data = [prepare_xy(tagged_sentence) for tagged_sentence in training_data]
for epoch in range(total_epoch):
    try:
        for (sentence, tags) in prepared_training_data:
            # Step 1. Remember that PyTorch accumulates gradients.
            # We need to clear them out before each instance.
            model.zero_grad()

            # Step 2. Clear out the hidden states of the two LSTMs,
            # detaching them from their history on the last instance.
            model.hidden_LtoR = model.init_hidden()
            model.hidden_RtoL = model.init_hidden()

            # Step 3. Run our forward pass.
            tag_scores = model(sentence)

            # Step 4. Compute the loss and gradients, and update the
            # parameters by calling optimizer.step().
            loss = loss_function(tag_scores, tags)
            loss.backward()
            optimizer.step()

            display_loss = display_loss + float(loss)
            cnt += 1
            if (cnt % display_step == 0):
                # Display the smoothed loss.
                if (smooth_loss == 0):
                    smooth_loss = display_loss
                else:
                    smooth_loss = 0.99 * smooth_loss + 0.01 * display_loss
                loss_trace.append(smooth_loss)
                display.clear_output(wait=True)

                # Display accuracy on the test data.
                accuracy = get_accuracy(prepared_test_data)
                accuracy_trace.append(accuracy)
                plt.figure()
                plt.plot(accuracy_trace)
                plt.show()
                print('accuracy (test data):\t%f' % accuracy)

                plt.plot(loss_trace)
                plt.show()
                print('loss: %f' % float(display_loss / display_step))
                display_loss = 0
    except KeyboardInterrupt:
        break
accuracy (test data): 0.934136
loss: 0.135792
Evaluating on the test data
print('accuracy (training data):\t%f' % get_accuracy(prepared_training_data))
print('accuracy (test data):\t%f' % get_accuracy(prepared_test_data))
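If you want to see which tags the model struggles with, a small sketch like the following (not part of the original post) counts errors per gold tag:
from collections import Counter

def per_tag_errors(prepared_data):
    # Count, for each gold tag, how often the prediction was wrong.
    errors = Counter()
    with torch.no_grad():
        for sentence, tags in prepared_data:
            predicted = torch.max(model(sentence), dim=1)[1]
            for gold, pred in zip(tags.tolist(), predicted.tolist()):
                if gold != pred:
                    errors[unique_tags[gold]] += 1
    return errors

print(per_tag_errors(prepared_test_data).most_common(3))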
Saving the model
torch.save(model.state_dict(), "bidirectional_learned.model")
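To reuse the trained tagger later, the state dict can be loaded back into a freshly constructed model; a minimal sketch (assuming the same hyperparameters as above):
loaded_model = BD_LSTM_Tagger(embedding_dim, hidden_dim, vocab_size, tag_size)
loaded_model.load_state_dict(torch.load("bidirectional_learned.model"))
loaded_model.eval()  # switch to evaluation mode before inference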
Word Embedding
Let's also look at the word embeddings the model learned as a by-product of tagging. For example, we can measure the distance between king - man + woman and queen:
def get_embedding(word):
    return model.word_embeddings(word2id([word]))
king = get_embedding ('king')
queen = get_embedding ('queen')
man = get_embedding ('man')
woman = get_embedding ('woman')
torch.dist(king-man+woman, queen)
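Euclidean distance depends on vector magnitude; cosine similarity is another common way to compare embeddings (added here as an illustration, using torch.nn.functional):
# Cosine similarity between (king - man + woman) and queen along the embedding dimension.
print(F.cosine_similarity(king - man + woman, queen))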
Finally, let's find the nearest neighbor of a word in the embedding space:
word_idx = list(range(len(unique_words)))
term = 'computer'
term_idx = word_to_idx[term]
word_idx.remove(term_idx)  # exclude the query word itself
long_tensor = torch.Tensor(word_idx).long()
term_idx = torch.Tensor([term_idx]).long()
embedded_words = model.word_embeddings(long_tensor)
embedded_term = model.word_embeddings(term_idx)

# Squared Euclidean distance from the query word to every other word.
diffs = embedded_words - embedded_term
square_diffs = torch.mul(diffs, diffs)
square_sum = torch.sum(square_diffs, 1)
dist = torch.rsqrt(square_sum)  # reciprocal sqrt: the largest value is the nearest word

# argmax(dist) indexes into the shortened word_idx list, so map it back
# to the original vocabulary id before looking up the word.
nearest_neighbor = word_idx[int(torch.argmax(dist))]
idx_to_word[nearest_neighbor]
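To list several neighbors instead of one, torch.topk can be applied to the same dist tensor (a small sketch building on the variables above):
# The k largest reciprocal distances correspond to the k nearest words.
values, indices = torch.topk(dist, 5)
print([idx_to_word[word_idx[int(i)]] for i in indices])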
Reference
https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html