
## Language Model

The main idea of a language model is to infer the most likely next word given the words that precede it. For example, given `The fat cat sat on the`, we judge the next word to be `mat` with higher probability than `hat`, because a cat is more likely to sit on a mat than on a hat.
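
As a toy illustration (the corpus below is made up), the simplest such model, a bigram model, estimates the probability of the next word from counts of adjacent word pairs:

```python
from collections import Counter

corpus = "the cat sat on the mat . the cat sat on the hat .".split()

# count (previous word, next word) pairs
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(n for (prev, _), n in bigrams.items() if prev == 'the')

# P(next | 'the') for every word observed after 'the'
for (prev, nxt), n in bigrams.items():
    if prev == 'the':
        print('P(%s | the) = %.2f' % (nxt, n / total))
```

The model built below replaces such raw counts with an embedding plus a recurrent network, which can generalize to contexts it has never seen verbatim.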

## Data Preparation

The official TensorFlow documentation uses the PTB dataset prepared by Mikolov. Download and extract it:

```bash
$ wget http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
$ tar xvf simple-examples.tgz
```

The extracted `simple-examples/data` directory contains the training, validation, and test sets. A few lines from the data:

```
we 're talking about years ago before anyone heard of asbestos having any questionable properties
there is no asbestos in our products now
neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes
we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute
the total of N deaths from malignant <unk> lung cancer and <unk> was far higher than expected the researchers said
```

First, define a helper that reads a file, replaces line breaks with the `<eos>` (end of sentence) marker, and splits the text into words:

```python
import os
from collections import Counter

import numpy as np
import tensorflow as tf


def _read_words(filename):
    """Read the file and replace line breaks with the <eos> marker."""
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read().replace('\n', ' <eos> ').split()
```
```python
f = _read_words('simple-examples/data/ptb.train.txt')
print(f[:20])
```

```
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim']
```

Next, build the vocabulary from the training data, sorting words by frequency in descending order and assigning each an integer id:

```python
def _build_vocab(filename):
    """Build the vocabulary, ordered by word frequency."""
    data = _read_words(filename)

    counter = Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])

    words, _ = list(zip(*count_pairs))
    word_to_id = dict(zip(words, range(len(words))))

    return words, word_to_id
```
```python
words, word_to_id = _build_vocab('simple-examples/data/ptb.train.txt')
print(words[:10])
print(list(map(lambda x: word_to_id[x], words[:10])))
```

```
('the', '<unk>', '<eos>', 'N', 'of', 'to', 'a', 'in', 'and', "'s")
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With the vocabulary in hand, a file can be converted to a sequence of word ids (words not in the vocabulary are skipped):

```python
def _file_to_word_ids(filename, word_to_id):
    """Convert the words of a file to their integer ids."""
    data = _read_words(filename)
    return [word_to_id[x] for x in data if x in word_to_id]
```
```python
words_in_file = _file_to_word_ids('simple-examples/data/ptb.train.txt', word_to_id)
print(words_in_file[:20])
```

```
[9980, 9988, 9981, 9989, 9970, 9998, 9971, 9979, 9992, 9997, 9982, 9972, 9993, 9991, 9978, 9983, 9974, 9986, 9999, 9990]
```

The reverse mapping turns a sequence of ids back into words:

```python
def to_words(sentence, words):
    """Map a sequence of ids back to words."""
    return list(map(lambda x: words[x], sentence))
```
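
As a quick sanity check, mapping the first ids of `words_in_file` back through `to_words` should reproduce the opening words of the training file listed earlier:

```python
print(' '.join(to_words(words_in_file[:10], words)))
# aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec
```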

`ptb_raw_data()` wraps the steps above, producing the id sequences for the three splits:

```python
def ptb_raw_data(data_path=None):
    train_path = os.path.join(data_path, 'ptb.train.txt')
    valid_path = os.path.join(data_path, 'ptb.valid.txt')
    test_path = os.path.join(data_path, 'ptb.test.txt')

    words, word_to_id = _build_vocab(train_path)
    train_data = _file_to_word_ids(train_path, word_to_id)
    valid_data = _file_to_word_ids(valid_path, word_to_id)
    test_data = _file_to_word_ids(test_path, word_to_id)

    return train_data, valid_data, test_data, words, word_to_id
```

`ptb_producer()` slices the id sequence into fixed-length sentences, pairs each with the word that follows it, and groups the pairs into batches:

```python
def ptb_producer(raw_data, batch_size=64, num_steps=20, stride=1):
    """Slice raw_data into (sentence, next word) pairs and batch them."""
    data_len = len(raw_data)

    sentences = []
    next_words = []
    for i in range(0, data_len - num_steps, stride):
        sentences.append(raw_data[i:(i + num_steps)])
        next_words.append(raw_data[i + num_steps])

    sentences = np.array(sentences)
    next_words = np.array(next_words)

    batch_len = len(sentences) // batch_size
    x = np.reshape(sentences[:(batch_len * batch_size)],
                   [batch_len, batch_size, -1])
    y = np.reshape(next_words[:(batch_len * batch_size)],
                   [batch_len, batch_size])

    return x, y
```

• raw_data: the data produced by `ptb_raw_data()`
• batch_size: the network is trained with stochastic gradient descent, so the data is delivered in batches; this is the number of samples per batch
• num_steps: the length of each sentence, corresponding to the n described earlier; in a recurrent neural network this is also called the number of time steps
• stride: the step size used when slicing the data, which determines how many samples are produced; the sketch after this list makes it concrete
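
A minimal sketch with a made-up toy sequence, purely to illustrate the slicing:

```python
toy = [0, 1, 2, 3, 4, 5]
x_toy, y_toy = ptb_producer(toy, batch_size=2, num_steps=2, stride=1)
# windows of length num_steps, each paired with the word that follows:
#   [0, 1] -> 2,  [1, 2] -> 3,  [2, 3] -> 4,  [3, 4] -> 5
# four pairs grouped into batches of two:
#   x_toy.shape == (2, 2, 2), y_toy.shape == (2, 2)
```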

```python
train_data, valid_data, test_data, words, word_to_id = ptb_raw_data('simple-examples/data')
x_train, y_train = ptb_producer(train_data)
print(x_train.shape)
print(y_train.shape)
```

```
(14524, 64, 20)
(14524, 64)
```

Map one sentence from a batch back to words to check the data:

```python
print(' '.join(to_words(x_train[100, 3], words)))
```

```
despite steady sales growth <eos> magna recently cut its quarterly dividend in half and the company 's class a shares
```
And the corresponding target, the word that should follow (`y_train[100, 3]` is already a word id, so it indexes `words` directly):

```python
print(words[y_train[100, 3]])
```

```
the
```

## Building the Model

### Configuration

```python
class LMConfig(object):
    """Configuration for the language model."""
    batch_size = 64       # number of samples per batch
    num_steps = 20        # length of each sentence
    stride = 3            # step size used when slicing the data

    embedding_dim = 64    # word embedding dimension
    hidden_dim = 128      # RNN hidden layer dimension
    num_layers = 2        # number of RNN layers
    rnn_model = 'gru'     # cell type, 'lstm' or 'gru'; referenced by the model
                          # below but missing from the original, 'gru' assumed

    learning_rate = 0.05  # learning rate
    dropout = 0.2         # dropout probability after each layer

    # vocab_size is attached at runtime in run_epoch(), once the vocabulary is built
```

### Reading Input

```python
class PTBInput(object):
    """Reads the data in batches."""
    def __init__(self, config, data):
        self.batch_size = config.batch_size
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size  # vocabulary size

        self.input_data, self.targets = ptb_producer(data,
            self.batch_size, self.num_steps, config.stride)

        self.batch_len = self.input_data.shape[0]  # total number of batches
        self.cur_batch = 0  # current batch index

    def next_batch(self):
        """Fetch the next batch."""
        x = self.input_data[self.cur_batch]
        y = self.targets[self.cur_batch]

        # convert the targets to one-hot encoding
        y_ = np.zeros((y.shape[0], self.vocab_size), dtype=bool)
        for i in range(y.shape[0]):
            y_[i][y[i]] = 1

        # wrap around to the first batch after the last one
        self.cur_batch = (self.cur_batch + 1) % self.batch_len

        return x, y_
```
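
As an aside, the per-row loop in `next_batch()` can be vectorized with NumPy fancy indexing; a sketch of an equivalent helper (the function name is mine, not part of the original code):

```python
import numpy as np

def to_one_hot(y, vocab_size):
    """Vectorized equivalent of the one-hot loop in next_batch()."""
    y_ = np.zeros((len(y), vocab_size), dtype=bool)
    y_[np.arange(len(y)), y] = True
    return y_
```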

### Model

```python
class PTBModel(object):
    def __init__(self, config, is_training=True):

        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size

        self.embedding_dim = config.embedding_dim
        self.hidden_dim = config.hidden_dim
        self.num_layers = config.num_layers
        self.rnn_model = config.rnn_model

        self.learning_rate = config.learning_rate
        self.dropout = config.dropout

        self.placeholders()  # input placeholders
        self.rnn()           # rnn model construction
        self.cost()          # cost function
        self.optimize()      # optimizer
        self.error()         # error rate

    def placeholders(self):
        """Placeholders for the input data."""
        self._inputs = tf.placeholder(tf.int32, [None, self.num_steps])
        # the one-hot targets must be float32: softmax_cross_entropy_with_logits
        # requires labels with the same dtype as the logits
        self._targets = tf.placeholder(tf.float32, [None, self.vocab_size])

    def input_embedding(self):
        """Convert the inputs to word embeddings."""
        with tf.device("/cpu:0"):
            embedding = tf.get_variable(
                "embedding", [self.vocab_size,
                    self.embedding_dim], dtype=tf.float32)
            _inputs = tf.nn.embedding_lookup(embedding, self._inputs)

        return _inputs

    def rnn(self):
        """Build the rnn model."""
        def lstm_cell():  # basic lstm cell
            return tf.contrib.rnn.BasicLSTMCell(self.hidden_dim,
                state_is_tuple=True)

        def gru_cell():   # gru cell, faster to train
            return tf.contrib.rnn.GRUCell(self.hidden_dim)

        def dropout_cell():    # add dropout after each cell
            if self.rnn_model == 'lstm':
                cell = lstm_cell()
            else:
                cell = gru_cell()
            # config.dropout is a drop probability, so keep 1 - dropout
            return tf.contrib.rnn.DropoutWrapper(cell,
                output_keep_prob=1.0 - self.dropout)

        cells = [dropout_cell() for _ in range(self.num_layers)]
        cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)  # multi-layer rnn

        _inputs = self.input_embedding()
        _outputs, _ = tf.nn.dynamic_rnn(cell=cell,
            inputs=_inputs, dtype=tf.float32)

        # _outputs has shape [batch_size, num_steps, hidden_dim]
        last = _outputs[:, -1, :]  # only the last output is needed

        # dense + softmax classify the next word, giving each word's probability
        logits = tf.layers.dense(inputs=last, units=self.vocab_size)
        prediction = tf.nn.softmax(logits)

        self._logits = logits
        self._pred = prediction

    def cost(self):
        """Cross-entropy cost function."""
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
            logits=self._logits, labels=self._targets)
        cost = tf.reduce_mean(cross_entropy)
        self.cost = cost

    def optimize(self):
        """Optimizer. The original leaves `optimizer` undefined; plain
        gradient descent with the configured learning rate is assumed."""
        optimizer = tf.train.GradientDescentOptimizer(self.learning_rate)
        self.optim = optimizer.minimize(self.cost)

    def error(self):
        """Error rate."""
        mistakes = tf.not_equal(
            tf.argmax(self._targets, 1), tf.argmax(self._pred, 1))
        self.errors = tf.reduce_mean(tf.cast(mistakes, tf.float32))
```
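
Since the cost above is the mean cross-entropy in nats, the standard language-model metric, perplexity, follows directly from it. A minimal sketch (the helper name is mine, not part of the original code):

```python
import numpy as np

def perplexity(mean_cross_entropy):
    """exp of the mean cross-entropy; lower is better. A uniform guess over
    a 10,000-word vocabulary would score a perplexity of 10,000."""
    return np.exp(mean_cross_entropy)

# e.g. perplexity(sess.run(model.cost, feed_dict=feed_dict))
```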

### Training

```python
def run_epoch(num_epochs=10):
    config = LMConfig()   # load the configuration

    # load the raw data; only the training set is needed here
    train_data, _, _, words, word_to_id = \
        ptb_raw_data('simple-examples/data')
    config.vocab_size = len(words)

    # split the data into batches
    input_train = PTBInput(config, train_data)
    batch_len = input_train.batch_len

    # build the model
    model = PTBModel(config)

    # create the session and initialize variables
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    print('Start training...')
    for epoch in range(num_epochs):  # training epochs
        for i in range(batch_len):   # number of batches per epoch
            x_batch, y_batch = input_train.next_batch()

            # run one optimization step on this batch
            feed_dict = {model._inputs: x_batch, model._targets: y_batch}
            sess.run(model.optim, feed_dict=feed_dict)

            # print intermediate results every 500 batches
            if i % 500 == 0:
                cost = sess.run(model.cost, feed_dict=feed_dict)

                msg = "Epoch: {0:>3}, batch: {1:>6}, Loss: {2:>6.3}"
                print(msg.format(epoch + 1, i + 1, cost))

                # show some of the predictions
                pred = sess.run(model._pred, feed_dict=feed_dict)
                # np.argmax instead of sess.run(tf.argmax(...)) so that no
                # new graph op is created on every report
                word_ids = np.argmax(pred, 1)
                print('Predicted:', ' '.join(words[w] for w in word_ids))
                true_ids = np.argmax(y_batch, 1)
                print('True:', ' '.join(words[w] for w in true_ids))

    print('Finished training.')
    sess.close()
```
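
Once trained, the model can generate text by feeding its own predictions back in. A minimal, hypothetical sketch, not part of the original code: it assumes a live `sess`, the trained `model`, the `words` tuple from `ptb_raw_data()`, and a seed of at least `num_steps` word ids:

```python
def generate(sess, model, words, seed_ids, length=20):
    """Greedily predict `length` words following the seed sequence."""
    ids = list(seed_ids)
    for _ in range(length):
        # feed the most recent num_steps ids as a batch of one sentence
        x = np.array([ids[-model.num_steps:]])
        pred = sess.run(model._pred, feed_dict={model._inputs: x})
        ids.append(int(np.argmax(pred, 1)[0]))  # greedy: most likely next word
    return ' '.join(words[w] for w in ids)
```

Sampling from the softmax distribution instead of taking the argmax would give more varied output.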
