word2vec

The development of natural language models and the motivation for word2vec

https://www.cnblogs.com/guoyaohua/p/9240336.html
Frequency-based vs. prediction-based models: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
A language model assigns a high probability to a well-formed sentence.
Unigram model: $P(w_{1},w_{2},\dots,w_{n})=\prod_{i=1}^{n}P(w_{i})$
However, the next word depends heavily on the words that precede it, and the independence assumption lets nonsensical sentences also receive a high probability. This motivates the
Bigram model: $P(w_{1},w_{2},\dots,w_{n})=\prod_{i=2}^{n}P(w_{i}\mid w_{i-1})$
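
To make the contrast concrete, here is a minimal counting sketch (the toy corpus and helper names are illustrative only, not from the post): the unigram model scores a sentence the same under any word order, while the bigram model penalizes unlikely word sequences.

from collections import Counter

corpus = "i like cats i like dogs dogs like cats".split()

# Unigram model: P(w) = count(w) / N, words treated as independent
unigram = Counter(corpus)
N = len(corpus)

def p_unigram(sentence):
    p = 1.0
    for w in sentence:
        p *= unigram[w] / N
    return p

# Bigram model: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(sentence):
    p = 1.0
    for prev, w in zip(sentence, sentence[1:]):
        p *= bigram[(prev, w)] / unigram[prev]
    return p

print(p_unigram("i like cats".split()), p_bigram("i like cats".split()))
# Same words, scrambled order: the unigram probability is unchanged, the bigram probability collapses to 0.
print(p_unigram("like i cats".split()), p_bigram("like i cats".split()))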

Model analysis

Concepts

The corpus is the entire text, repeated words included; the dictionary $D$ is extracted from the corpus and contains no duplicate words.
The one-hot representation maps every word to a vector that is orthogonal to every other word's vector in a high-dimensional space, so in one-hot space there is no relationship between words at all. This clearly does not match reality, where words are related by synonymy, antonymy and so on. Word2vec cannot learn high-level semantics such as antonymy, but it cleverly exploits the idea that "words with similar contexts have similar meanings", so that semantically close words end up with a high cosine similarity after being mapped into Euclidean space.

The overall idea is still dimensionality reduction: the one-hot representation has far too many dimensions, and SVD decomposes a large word matrix (typically a co-occurrence matrix) into two low-dimensional factors.
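
As a minimal sketch of that count-based route (the toy corpus and window size are made up for illustration): build a word-word co-occurrence matrix, keep only the top singular directions of its SVD as dense vectors, and compare words by cosine similarity, something one-hot vectors cannot express because they are all mutually orthogonal.

import numpy as np

corpus = "i like deep learning i like nlp i enjoy flying".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}

# Word-word co-occurrence counts within a +/-1 window
V = len(vocab)
C = np.zeros((V, V))
for idx, w in enumerate(corpus):
    for j in (idx - 1, idx + 1):
        if 0 <= j < len(corpus):
            C[w2i[w], w2i[corpus[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors
k = 2
U, S, Vt = np.linalg.svd(C)
dense = U[:, :k] * S[:k]          # (V, k) low-dimensional embeddings

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# One-hot vectors of two distinct words are always orthogonal (similarity 0) ...
print(cos(np.eye(V)[w2i["like"]], np.eye(V)[w2i["enjoy"]]))
# ... while words with similar contexts ("like" and "enjoy" both follow "i") get similar dense vectors
print(cos(dense[w2i["like"]], dense[w2i["enjoy"]]))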

Weight matrices

https://blog.csdn.net/itplus/article/details/37969979
Taking skip-gram as an example, there are two weight matrices: the center-word embedding matrix $V$ and the context-word embedding matrix $U$, both of size $D\times V$ (embedding dimension by vocabulary size).
The computation can be summarized as follows:
1. Feed in the one-hot encoding of the center word $\omega_{t}$ ($V\times 1$).
2. Multiply it by the matrix $V$ to get the center word's representation $v_{c}=V\,\omega_{t}$ ($D\times 1$).
3. Multiply the center-word vector $v_{c}$ against the matrix $U$: each score $u_{o}^{T}v_{c}$, where $u_{o}$ is a column of $U$ (equivalently, a context word's one-hot vector multiplied by $U$ gives its representation), is one entry of the resulting $V\times 1$ score vector.
The cs224n lectures walk through this step in detail.
4. Finally, softmax converts these similarity scores into probabilities; a small numpy sketch of the whole forward pass follows below.
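
A minimal numpy sketch of these four steps, with made-up sizes (nothing here is tied to the PyTorch code further down):

import numpy as np

V_size, D = 10, 4                      # vocabulary size and embedding dimension
rng = np.random.default_rng(0)
V = rng.normal(size=(D, V_size))       # center-word matrix  (D x V)
U = rng.normal(size=(D, V_size))       # context-word matrix (D x V)

# 1. one-hot encoding of the center word w_t  (V x 1)
w_t = np.zeros(V_size)
w_t[3] = 1.0

# 2. center-word representation v_c = V w_t  (D x 1): simply column 3 of V
v_c = V @ w_t

# 3. scores u_o^T v_c for every possible context word o  (V x 1)
scores = U.T @ v_c

# 4. softmax turns the scores into probabilities P(o | c)
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.sum())        # 1.0
print(probs.argmax())     # index of the most likely context word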

A detailed worked example: https://cloud.tencent.com/developer/article/1591734

Output layer

The output layer corresponds to a binary tree: the words of the dictionary are the leaf nodes and their occurrence counts are the weights used to build a Huffman tree, so there are $|D|$ leaves in total. Each branching decision is a binary classification, with the left branch as the negative class and the right branch as the positive class (a sketch follows below). Detailed derivation: https://www.cnblogs.com/neopenx/p/4571996.html
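
A minimal sketch of that idea with toy counts and random parameters (this is not the code from the training script below): build the Huffman tree, record each word's path, and compute its probability as a product of per-node sigmoid decisions; with one parameter vector per internal node the probabilities over the vocabulary sum to 1.

import heapq
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy word counts used as Huffman weights; frequent words end up near the root
counts = {"the": 10, "cat": 4, "sat": 3, "mat": 2, "dog": 1}

# Build the Huffman tree with a min-heap; internal nodes get ids 0,1,2,... in creation order
heap = [(c, i, w) for i, (w, c) in enumerate(counts.items())]
heapq.heapify(heap)
tiebreak, internal_id = len(heap), 0
while len(heap) > 1:
    c1, _, left = heapq.heappop(heap)
    c2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (c1 + c2, tiebreak, (internal_id, left, right)))
    tiebreak += 1
    internal_id += 1
root = heap[0][2]

# For each word, record the internal-node ids on its path and the left/right code (0 = left, 1 = right)
paths, codes = {}, {}
def walk(node, path, code):
    if isinstance(node, str):
        paths[node], codes[node] = path, code
    else:
        nid, left, right = node
        walk(left, path + [nid], code + [0])
        walk(right, path + [nid], code + [1])
walk(root, [], [])

# Hierarchical softmax: each internal node is a binary classifier with its own parameter vector
D = 8
rng = np.random.default_rng(0)
v_c = rng.normal(size=D)                   # hypothetical center-word vector
theta = rng.normal(size=(internal_id, D))  # one parameter vector per internal node

def hs_probability(word):
    """P(word | center) = product of sigmoid decisions along the word's Huffman path."""
    p = 1.0
    for nid, d in zip(paths[word], codes[word]):
        s = sigmoid(theta[nid] @ v_c)
        p *= s if d == 1 else (1.0 - s)
    return p

print(sum(hs_probability(w) for w in counts))  # sums to 1.0: a proper distribution over the vocabulary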

Code

Source code (PyTorch skip-gram with negative sampling)
# Build a neural network whose input is a word and whose output is its word vector

import numpy as np
import torch
from torch import nn, optim
import random
from collections import Counter
import matplotlib.pyplot as plt

# Training data
#text = "I like dog i like cat i like animal dog cat animal apple cat dog like dog fish milk like dog \
#cat eyes like i like apple apple i hate apple i movie book music like cat dog hate cat dog like"
with open('/content/text8') as f:  # path on Colab
    text = f.read()

# Hyperparameters
EMBEDDING_DIM = 2      # word-vector dimension
PRINT_EVERY = 1000     # how often to print the loss
EPOCHS = 3             # number of training epochs
BATCH_SIZE = 5         # number of center words per batch
N_SAMPLES = 3          # number of negative samples
WINDOW_SIZE = 5        # context window size
FREQ = 0               # minimum word frequency
DELETE_WORDS = False   # whether to subsample very frequent words

# Text preprocessing
def preprocess(text, FREQ):
    text = text.lower()
    words = text.split()
    # drop low-frequency words
    word_counts = Counter(words)
    trimmed_words = [word for word in words if word_counts[word] > FREQ]
    return trimmed_words

words = preprocess(text, FREQ)

# Build the vocabulary
vocab = set(words)
vocab2int = {w: c for c, w in enumerate(vocab)}
int2vocab = {c: w for c, w in enumerate(vocab)}

# Convert the text to integer ids
int_words = [vocab2int[w] for w in words]

# Word frequencies
int_word_counts = Counter(int_words)
total_count = len(int_words)
word_freqs = {w: c/total_count for w, c in int_word_counts.items()}  # items() yields (word_id, count) pairs

# Subsample very frequent words
if DELETE_WORDS:
    t = 1e-5
    prob_drop = {w: 1 - np.sqrt(t / word_freqs[w]) for w in int_word_counts}
    train_words = [w for w in int_words if random.random() < (1 - prob_drop[w])]
else:
    train_words = int_words

# Noise distribution for negative sampling: unigram distribution raised to the 3/4 power
word_freqs = np.array(list(word_freqs.values()))
unigram_dist = word_freqs / word_freqs.sum()
noise_dist = torch.from_numpy(unigram_dist ** 0.75 / np.sum(unigram_dist ** 0.75))

# Collect the context (target) words around position idx
def get_target(words, idx, WINDOW_SIZE):
    target_window = np.random.randint(1, WINDOW_SIZE + 1)
    start_point = idx - target_window if (idx - target_window) > 0 else 0
    end_point = idx + target_window
    targets = set(words[start_point:idx] + words[idx+1:end_point+1])
    return list(targets)

# Batch generator
def get_batch(words, BATCH_SIZE, WINDOW_SIZE):
    n_batches = len(words) // BATCH_SIZE
    words = words[:n_batches * BATCH_SIZE]
    for idx in range(0, len(words), BATCH_SIZE):
        batch_x, batch_y = [], []
        batch = words[idx:idx+BATCH_SIZE]
        for i in range(len(batch)):
            x = batch[i]
            y = get_target(batch, i, WINDOW_SIZE)
            batch_x.extend([x] * len(y))
            batch_y.extend(y)
        yield batch_x, batch_y

# Model definition
class SkipGramNeg(nn.Module):
    def __init__(self, n_vocab, n_embed, noise_dist):
        super().__init__()
        self.n_vocab = n_vocab
        self.n_embed = n_embed
        self.noise_dist = noise_dist
        # embedding layers for center (input) and context (output) words
        self.in_embed = nn.Embedding(n_vocab, n_embed)
        self.out_embed = nn.Embedding(n_vocab, n_embed)
        # initialize the embedding weights
        self.in_embed.weight.data.uniform_(-1, 1)
        self.out_embed.weight.data.uniform_(-1, 1)

    # forward pass for the input (center) words
    def forward_input(self, input_words):
        input_vectors = self.in_embed(input_words)
        return input_vectors

    # forward pass for the target (context) words
    def forward_output(self, output_words):
        output_vectors = self.out_embed(output_words)
        return output_vectors

    # forward pass for the negative-sample words
    def forward_noise(self, size, N_SAMPLES):
        noise_dist = self.noise_dist
        # sample negative words from the noise distribution
        noise_words = torch.multinomial(noise_dist,
                                        size * N_SAMPLES,
                                        replacement=True)
        noise_vectors = self.out_embed(noise_words).view(size, N_SAMPLES, self.n_embed)
        return noise_vectors

# Loss function
class NegativeSamplingLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, input_vectors, output_vectors, noise_vectors):
        BATCH_SIZE, embed_size = input_vectors.shape
        # reshape so that bmm produces one dot product per example
        input_vectors = input_vectors.view(BATCH_SIZE, embed_size, 1)
        output_vectors = output_vectors.view(BATCH_SIZE, 1, embed_size)
        # log-sigmoid of the positive (true context) scores
        out_loss = torch.bmm(output_vectors, input_vectors).sigmoid().log()
        out_loss = out_loss.squeeze()
        # log-sigmoid of the negated negative-sample scores
        noise_loss = torch.bmm(noise_vectors.neg(), input_vectors).sigmoid().log()
        noise_loss = noise_loss.squeeze().sum(1)
        # combine the two terms and average over the batch
        return -(out_loss + noise_loss).mean()

# Instantiate the model, loss and optimizer
model = SkipGramNeg(len(vocab2int), EMBEDDING_DIM, noise_dist=noise_dist)
criterion = NegativeSamplingLoss()
optimizer = optim.Adam(model.parameters(), lr=0.003)

# Training loop
steps = 0
for e in range(EPOCHS):
    # iterate over batches of (center word, context word) pairs
    for input_words, target_words in get_batch(train_words, BATCH_SIZE, WINDOW_SIZE):
        steps += 1
        inputs, targets = torch.LongTensor(input_words), torch.LongTensor(target_words)
        # input, output and negative-sample vectors
        input_vectors = model.forward_input(inputs)
        output_vectors = model.forward_output(targets)
        size, _ = input_vectors.shape
        noise_vectors = model.forward_noise(size, N_SAMPLES)
        # compute the loss
        loss = criterion(input_vectors, output_vectors, noise_vectors)
        # print the loss periodically
        if steps % PRINT_EVERY == 0:
            print("loss:", loss.item())
        # backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Visualize the word vectors (EMBEDDING_DIM is 2, so they can be plotted directly)
vectors = model.state_dict()["in_embed.weight"]
for i, w in int2vocab.items():
    x, y = float(vectors[i][0]), float(vectors[i][1])
    plt.scatter(x, y)
    plt.annotate(w, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()
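
For reference, NegativeSamplingLoss above implements the skip-gram negative-sampling objective: for a (center word $c$, context word $o$) pair with $K$ negative words $k$ drawn from the noise distribution, the per-pair loss is $-\log\sigma(u_{o}^{T}v_{c})-\sum_{k=1}^{K}\log\sigma(-u_{k}^{T}v_{c})$, averaged over the batch; out_loss is the first term and noise_loss is the second.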

Evaluating the word vectors

  1. Visualization
  2. Similarity computation (cosine similarity)
  3. Analogy tasks (a sketch of points 2 and 3 follows below)
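
A minimal sketch of points 2 and 3, reusing the model, vocab2int and int2vocab objects from the training script above; the helper names most_similar and analogy are mine, and with EMBEDDING_DIM = 2 the results will be rough:

import torch
import torch.nn.functional as F

emb = model.in_embed.weight.data            # (V, EMBEDDING_DIM)

def most_similar(word, topk=5):
    """Rank the vocabulary by cosine similarity to `word`."""
    v = emb[vocab2int[word]]
    sims = F.cosine_similarity(v.unsqueeze(0), emb)   # (V,)
    best = sims.topk(topk + 1).indices.tolist()
    return [int2vocab[i] for i in best if int2vocab[i] != word][:topk]

def analogy(a, b, c, topk=1):
    """a : b :: c : ?   e.g. man : king :: woman : queen (with good embeddings)."""
    v = emb[vocab2int[b]] - emb[vocab2int[a]] + emb[vocab2int[c]]
    sims = F.cosine_similarity(v.unsqueeze(0), emb)
    best = sims.topk(topk + 3).indices.tolist()
    return [int2vocab[i] for i in best if int2vocab[i] not in (a, b, c)][:topk]

print(most_similar('king'))
print(analogy('man', 'king', 'woman'))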

Limitations of word vectors

  1. Polysemy: each word gets a single vector, so different senses of the same word are conflated.
Author: Sunxin
Link: https://sunxin18.github.io/2020/03/06/skipgram/
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 4.0. Please credit lalala when reposting.