Word processing in NLP with TensorFlow

Tokenization and serialization turn raw text into numeric input for training a neural network; I use a few TensorFlow samples to walk through the process.

Preprocessing

Tokenizer

Source code: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/text.py#L490-L519

Some important functions and variables:

  • __init__
  • def fit_on_texts(self, texts) # texts can be a string, a list of strings, or a list of lists of strings
  • self.word_index # a dictionary that maps each word to a unique index
  • self.index_word # the reverse of word_index: it maps each index back to its word (see the sketch after the sample below)

Sample

import tensorflow as tf
from tensorflow import keras
# the Tokenizer class used to tokenize text
from tensorflow.keras.preprocessing.text import Tokenizer
'''
transform each word into a number
'''
sentences = ['i love my dog', 'i love my cat', 'you love my dog!']
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
# result: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
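
As mentioned in the list above, index_word is simply the reverse mapping of word_index. A minimal check, reusing the tokenizer fitted above (the expected output is just the word_index shown there with keys and values swapped):

index_word = tokenizer.index_word
print(index_word)
# expected: {1: 'love', 2: 'my', 3: 'i', 4: 'dog', 5: 'cat', 6: 'you'}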

Serialization

Sample

sentences = ['i love my dog', 'i love my cat', 'you love my dog!', 'do you think my dog is amazing']
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
'''
result: [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4], [6, 2, 4]]
words such as 'do', 'think', 'is' and 'amazing' get no encoding at all,
because they did not appear in the texts the tokenizer was fitted on
'''

To solve this problem, we can set an oov_token in the Tokenizer so that words never seen during fitting are still encoded with a shared placeholder index.

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
'''
after refitting on the original three sentences and rerunning texts_to_sequences, we get
[[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5], [1, 7, 1, 3, 5, 1, 1]]
every unseen word is now mapped to index 1, the index of the OOV token
'''
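
A quick sanity check of why every index shifted by one compared with the first example: the oov_token itself is added to the vocabulary at index 1. Assuming the tokenizer was refit on the original three sentences with the "<OOV>" token above:

print(tokenizer.word_index)
# expected: {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}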

But the sequences now have different lengths, which makes them hard to feed to a neural network, so we need to make all sequences the same length.

from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(sequences,
                                 padding = 'post',     # add the padding zeros at the end
                                 maxlen = 5,           # maximum length of each sequence
                                 truncating = 'post')  # cut longer sequences at the end
print(padded_sequences)
'''
then we get the result (here the tokenizer was refit on all four sentences,
so the indices differ from the three-sentence example above)
array([[5, 3, 2, 4, 0],
       [5, 3, 2, 7, 0],
       [6, 3, 2, 4, 0],
       [8, 6, 9, 2, 4]])
'''
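
For comparison, a minimal sketch of the default behavior: when padding and truncating are not specified, pad_sequences pads and truncates at the beginning (padding='pre', truncating='pre'). Assuming the same four sequences as above (tokenizer fitted on all four sentences):

default_padded = pad_sequences(sequences, maxlen = 5)
print(default_padded)
'''
expected result: zeros are prepended and the long sequence loses its first elements
array([[ 0,  5,  3,  2,  4],
       [ 0,  5,  3,  2,  7],
       [ 0,  6,  3,  2,  4],
       [ 9,  2,  4, 10, 11]])
'''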

Original: https://www.cnblogs.com/linkcxt/p/14986517.html
Author: linkcxt
Title: word processing in nlp with tensorflow
