
Natural Language Processing: Text Representation

2016-12-05  kakasyw

1. Introduction

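Text representation means converting a text string, in some form, into a numeric vector that a computer can process. Why is this needed? The fundamental reason is that computers cannot operate on text strings directly, so the text must first be turned into numbers or vectors. Traditional machine learning algorithms need this step, and so does deep learning, although there the step may be built directly into the network. A good text representation can also greatly improve the performance of the algorithms that consume it.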

1.1 Categories of representation methods

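Text representation has long been a hot topic in natural language processing research. Broadly speaking, the methods fall into two major categories.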

2. Distributional representation algorithms

2.1 Matrix-based models

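The idea behind this model is to build a term-context matrix from the text: each row corresponds to a term and each column to a document or context, so each row can serve as the representation of that term.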

doc1 : "NBA2K16 视频 设置 存储 位置 _NBA 视频 设置 存储 位置 解析 攻略 玩游戏"
doc2 : "NBA2K16 ncaa 豪门 大学 选择 推荐 NBA ncaa 大学 选择 游戏网 攻略"
doc3 : "NBA2K16 学好 NBA2K16 大学 名校 选择 攻略 攻略 心得 单机"

1) Build the term-document matrix; each element is the number of times a term occurs in a given doc.

term-DocMatrix ^T= [[1, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1, 2, 1, 2, 0, 0],
                    [1, 1, 2, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 2],
                    [0, 2, 0, 0, 1, 1, 1, 0, 1, 1, 0, 2, 0, 0, 0, 0, 0, 0, 1]] 
term^T = [ nba , nba2k16 , ncaa , 位置 , 单机 ,名校 , 大学 , 存储 , 
学好 , 心得 , 推荐 , 攻略 , 游戏网 , 玩游戏 , 视频 , 解析 , 设置 , 豪门 , 选择]

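As shown above, term-DocMatrix is the term-document matrix: each column is a doc, and each row holds the frequency of one term across the docs. The printout above is the transpose, so each printed row is a doc and the printed columns follow the order of term^T. (This example uses pre-segmented text, with tokens separated by spaces.)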
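The counting step can be sketched in plain Python. This is a minimal illustration, not the author's code: the `tokenize` helper is hypothetical, and it assumes tokens are lowercased and stripped of underscores so that `_NBA` and `NBA` both map to `nba`, which is what the vocabulary above suggests.

```python
from collections import Counter

# The three pre-segmented documents from the example above.
docs = [
    "NBA2K16 视频 设置 存储 位置 _NBA 视频 设置 存储 位置 解析 攻略 玩游戏",
    "NBA2K16 ncaa 豪门 大学 选择 推荐 NBA ncaa 大学 选择 游戏网 攻略",
    "NBA2K16 学好 NBA2K16 大学 名校 选择 攻略 攻略 心得 单机",
]


def tokenize(doc):
    # Assumption: lowercase every token and strip surrounding underscores,
    # so "_NBA" and "NBA" collapse into the single term "nba".
    return [tok.lower().strip("_") for tok in doc.split()]


tokenized = [tokenize(d) for d in docs]

# Vocabulary sorted by code point, which reproduces the order of term^T.
vocab = sorted({tok for toks in tokenized for tok in toks})

# counts[i][j] = occurrences of vocab[j] in document i (the transposed layout).
counts = []
for toks in tokenized:
    freq = Counter(toks)
    counts.append([freq[term] for term in vocab])

for row in counts:
    print(row)
```

Each printed row corresponds to one doc, matching the transposed layout used in the printout above.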
2) Fill each element of term-docMatrix with its TF-IDF weight.

term-docValueMatrix^T = [
       [ 0.17322273,  0.1345229 ,  0.        ,  0.45553413,  0.        ,
         0.        ,  0.        ,  0.45553413,  0.        ,  0.        ,
         0.        ,  0.1345229 ,  0.        ,  0.22776707,  0.45553413,
         0.22776707,  0.45553413,  0.        ,  0.        ],
       [ 0.21172122,  0.16442041,  0.55677592,  0.        ,  0.        ,
         0.        ,  0.42344244,  0.        ,  0.        ,  0.        ,
         0.27838796,  0.16442041,  0.27838796,  0.        ,  0.        ,
         0.        ,  0.        ,  0.27838796,  0.42344244],
       [ 0.        ,  0.41900794,  0.        ,  0.        ,  0.35472106,
         0.35472106,  0.26977451,  0.        ,  0.35472106,  0.35472106,
         0.        ,  0.41900794,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.26977451]]
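The source does not say which TF-IDF variant produced these numbers. They are consistent with a smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document row, which is also the default behaviour of scikit-learn's `TfidfVectorizer`. The sketch below applies that assumed formula to the raw counts from step 1, using only the standard library.

```python
import math

# Raw counts from step 1: one row per document, one column per term.
counts = [
    [1, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1, 2, 1, 2, 0, 0],
    [1, 1, 2, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 2],
    [0, 2, 0, 0, 1, 1, 1, 0, 1, 1, 0, 2, 0, 0, 0, 0, 0, 0, 1],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Assumed weighting (not stated in the source): smoothed IDF.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))  # L2 norm of this document
    tfidf.append([x / norm for x in weighted])

for row in tfidf:
    print([round(x, 8) for x in row])
```

Terms that occur in every doc (such as nba2k16) get the smallest IDF, so their weights shrink relative to terms that are specific to one doc.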

3) Reduce dimensionality with SVD
term-docValueMatrix^T = U · Sigma · V (U is 3×3, Sigma holds the 3 singular values, V is 19×19)

U^T = [[-0.31592434,  0.94651327, -0.06560826],
       [-0.66440088, -0.27006617, -0.6968757 ],
       [-0.67732067, -0.17656981,  0.71418472]]

Sigma^T = [ 1.18821321,  0.97769309,  0.79515131]

V^T = [[-0.16444274, -0.36655279, -0.31132663, -0.12111826, -0.20220269,
        -0.20220269, -0.39055228, -0.12111826, -0.20220269, -0.20220269,
        -0.15566331, -0.36655279, -0.15566331, -0.06055913, -0.12111826,
        -0.06055913, -0.12111826, -0.15566331, -0.39055228],
       [ 0.10921512,  0.00914312, -0.15379708,  0.4410066 , -0.06406206,
        -0.06406206, -0.16568749,  0.4410066 , -0.06406206, -0.06406206,
        -0.07689854,  0.00914312, -0.07689854,  0.2205033 ,  0.4410066 ,
         0.2205033 ,  0.4410066 , -0.07689854, -0.16568749],
       [-0.19984651,  0.22114366, -0.48796198, -0.03758631,  0.31860145,
         0.31860145, -0.12880305, -0.03758631,  0.31860145,  0.31860145,
        -0.24398099,  0.22114366, -0.24398099, -0.01879315, -0.03758631,
        -0.01879315, -0.03758631, -0.24398099, -0.12880305],
       [-0.44741306,  0.03857492,  0.08025905,  0.81630905, -0.00975885,
        -0.00975885,  0.03422628, -0.18369095, -0.00975885, -0.00975885,
         0.02738115, -0.04960114,  0.02738115, -0.09184548, -0.18369095,
        -0.09184548, -0.18369095,  0.02738115,  0.03422628],
       [ 0.01801602, -0.25523251,  0.25570126,  0.02316318,  0.89589932,
        -0.10410068, -0.09155289,  0.02316318, -0.10410068, -0.10410068,
        -0.00814018, -0.12093452, -0.00814018,  0.01158159,  0.02316318,
         0.01158159,  0.02316318, -0.00814018, -0.09155289],
       [ 0.01801602, -0.25523251,  0.25570126,  0.02316318, -0.10410068,
         0.89589932, -0.09155289,  0.02316318, -0.10410068, -0.10410068,
        -0.00814018, -0.12093452, -0.00814018,  0.01158159,  0.02316318,
         0.01158159,  0.02316318, -0.00814018, -0.09155289],
       [-0.02484861, -0.41515297, -0.22222725,  0.03237328,  0.00507016,
         0.00507016,  0.84492088,  0.03237328,  0.00507016,  0.00507016,
        -0.10449028, -0.04616452, -0.10449028,  0.01618664,  0.03237328,
         0.01618664,  0.03237328, -0.10449028, -0.15507912],
       [-0.44741306,  0.03857492,  0.08025905, -0.18369095, -0.00975885,
        -0.00975885,  0.03422628,  0.81630905, -0.00975885, -0.00975885,
         0.02738115, -0.04960114,  0.02738115, -0.09184548, -0.18369095,
        -0.09184548, -0.18369095,  0.02738115,  0.03422628],
       [ 0.01801602, -0.25523251,  0.25570126,  0.02316318, -0.10410068,
        -0.10410068, -0.09155289,  0.02316318,  0.89589932, -0.10410068,
        -0.00814018, -0.12093452, -0.00814018,  0.01158159,  0.02316318,
         0.01158159,  0.02316318, -0.00814018, -0.09155289],
       [ 0.01801602, -0.25523251,  0.25570126,  0.02316318, -0.10410068,
        -0.10410068, -0.09155289,  0.02316318, -0.10410068,  0.89589932,
        -0.00814018, -0.12093452, -0.00814018,  0.01158159,  0.02316318,
         0.01158159,  0.02316318, -0.00814018, -0.09155289],
       [-0.02534448, -0.14532188, -0.2739517 ,  0.0097019 ,  0.05538366,
         0.05538366, -0.05617876,  0.0097019 ,  0.05538366,  0.05538366,
         0.93537401,  0.03011687, -0.06462599,  0.00485095,  0.0097019 ,
         0.00485095,  0.0097019 , -0.06462599, -0.05617876],
       [-0.12581243, -0.37592682,  0.16394342, -0.02115422, -0.09313846,
        -0.09313846, -0.131218  , -0.02115422, -0.09313846, -0.09313846,
        -0.03969872,  0.86028813, -0.03969872, -0.01057711, -0.02115422,
        -0.01057711, -0.02115422, -0.03969872, -0.131218  ],
       [-0.02534448, -0.14532188, -0.2739517 ,  0.0097019 ,  0.05538366,
         0.05538366, -0.05617876,  0.0097019 ,  0.05538366,  0.05538366,
        -0.06462599,  0.03011687,  0.93537401,  0.00485095,  0.0097019 ,
         0.00485095,  0.0097019 , -0.06462599, -0.05617876],
       [-0.22370653,  0.01928746,  0.04012952, -0.09184548, -0.00487943,
        -0.00487943,  0.01711314, -0.09184548, -0.00487943, -0.00487943,
         0.01369058, -0.02480057,  0.01369058,  0.95407726, -0.09184548,
        -0.04592274, -0.09184548,  0.01369058,  0.01711314],
       [-0.44741306,  0.03857492,  0.08025905, -0.18369095, -0.00975885,
        -0.00975885,  0.03422628, -0.18369095, -0.00975885, -0.00975885,
         0.02738115, -0.04960114,  0.02738115, -0.09184548,  0.81630905,
        -0.09184548, -0.18369095,  0.02738115,  0.03422628],
       [-0.22370653,  0.01928746,  0.04012952, -0.09184548, -0.00487943,
        -0.00487943,  0.01711314, -0.09184548, -0.00487943, -0.00487943,
         0.01369058, -0.02480057,  0.01369058, -0.04592274, -0.09184548,
         0.95407726, -0.09184548,  0.01369058,  0.01711314],
       [-0.44741306,  0.03857492,  0.08025905, -0.18369095, -0.00975885,
        -0.00975885,  0.03422628, -0.18369095, -0.00975885, -0.00975885,
         0.02738115, -0.04960114,  0.02738115, -0.09184548, -0.18369095,
        -0.09184548,  0.81630905,  0.02738115,  0.03422628],
       [-0.02534448, -0.14532188, -0.2739517 ,  0.0097019 ,  0.05538366,
         0.05538366, -0.05617876,  0.0097019 ,  0.05538366,  0.05538366,
        -0.06462599,  0.03011687, -0.06462599,  0.00485095,  0.0097019 ,
         0.00485095,  0.0097019 ,  0.93537401, -0.05617876],
       [-0.02484861, -0.41515297, -0.22222725,  0.03237328,  0.00507016,
         0.00507016, -0.15507912,  0.03237328,  0.00507016,  0.00507016,
        -0.10449028, -0.04616452, -0.10449028,  0.01618664,  0.03237328,
         0.01618664,  0.03237328, -0.10449028,  0.84492088]]
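The shapes printed above (a 3×3 U, three singular values, a 19×19 V) match what `numpy.linalg.svd` returns for a 3×19 input with `full_matrices=True`. The following sketch assumes NumPy and the TF-IDF weighting described in step 2 (smoothed IDF plus L2 normalization); the `tfidf` helper is hypothetical. The signs of individual singular vectors may differ between platforms, since an SVD is only unique up to sign.

```python
import numpy as np

# Raw counts from step 1 (documents x terms).
counts = np.array([
    [1, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1, 2, 1, 2, 0, 0],
    [1, 1, 2, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 2],
    [0, 2, 0, 0, 1, 1, 1, 0, 1, 1, 0, 2, 0, 0, 0, 0, 0, 0, 1],
], dtype=float)


def tfidf(counts):
    # Assumed weighting: smoothed IDF, then L2-normalize each document row.
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)
    idf = np.log((1 + n_docs) / (1 + df)) + 1
    weighted = counts * idf
    return weighted / np.linalg.norm(weighted, axis=1, keepdims=True)


A = tfidf(counts)

# full_matrices=True yields a square 19 x 19 Vh, like the V printed above.
U, S, Vh = np.linalg.svd(A, full_matrices=True)

print(U.shape, S.shape, Vh.shape)
print(S)
```

NumPy returns the right factor already transposed (`Vh`), so its rows, not its columns, pair with the singular values.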

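4) Reconstruct the matrix from the SVD result. Because there are only 3 singular values, the reconstruction needs only the first three columns of U, U[:,:3], and the first three rows of V, V[:3,:], together with Sigma. The reconstructed result: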

term-DocMatrix^T = [[-0.11278275,  0.33070808, -0.30923603,  0.01264451,  0.35411661,
          0.35411661,  0.03413284,  0.01264451,  0.35411661,  0.35411661,
         -0.15461801,  0.33070808, -0.15461801,  0.00632225,  0.01264451,
          0.00632225,  0.01264451, -0.15461801,  0.03413284],
        [ 0.29332642,  0.1084434 ,  0.64500943,  0.12636854, -0.10103917,
         -0.10103917,  0.4137034 ,  0.12636854, -0.10103917, -0.10103917,
          0.32250471,  0.1084434 ,  0.32250471,  0.06318427,  0.12636854,
          0.06318427,  0.12636854,  0.32250471,  0.4137034 ],
        [ 0.05335364,  0.38241006,  0.05768725,  0.08262001,  0.28866144,
          0.28866144,  0.26340711,  0.08262001,  0.28866144,  0.28866144,
          0.02884362,  0.38241006,  0.02884362,  0.04131001,  0.08262001,
          0.04131001,  0.08262001,  0.02884362,  0.26340711]]

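After the SVD step, each term's representation in the regenerated term-docMatrix already carries some semantic information, so it can be used directly in downstream NLP tasks. Note that multiplying the factors back together with all three singular values reproduces the TF-IDF matrix of step 2 up to floating-point error, which the values printed above do not; the smoothing effect described here comes from keeping fewer singular values than the rank of the matrix (k < 3 in this example).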
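The effect of the number of retained singular values can be checked with a short NumPy sketch. It assumes the same TF-IDF weighting as in step 2; `tfidf` and `reconstruct` are hypothetical helper names, and k = 2 is an arbitrary choice for illustration.

```python
import numpy as np

# Raw counts from step 1 (documents x terms).
counts = np.array([
    [1, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1, 2, 1, 2, 0, 0],
    [1, 1, 2, 0, 0, 0, 2, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 2],
    [0, 2, 0, 0, 1, 1, 1, 0, 1, 1, 0, 2, 0, 0, 0, 0, 0, 0, 1],
], dtype=float)


def tfidf(counts):
    # Assumed weighting: smoothed IDF, then L2-normalize each document row.
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)
    idf = np.log((1 + n_docs) / (1 + df)) + 1
    weighted = counts * idf
    return weighted / np.linalg.norm(weighted, axis=1, keepdims=True)


A = tfidf(counts)
U, S, Vh = np.linalg.svd(A, full_matrices=False)


def reconstruct(k):
    # Keep the k largest singular values with their matching columns and rows.
    return U[:, :k] @ np.diag(S[:k]) @ Vh[:k, :]


full = reconstruct(3)  # nothing discarded
low = reconstruct(2)   # rank-2 approximation

print(np.allclose(full, A))     # k = 3 gives back the TF-IDF matrix
print(np.linalg.norm(A - low))  # Frobenius error equals the dropped S[2]
```

With k = 3 nothing is discarded, so the product is just the input again. Only a truncated product such as `reconstruct(2)` mixes information across docs, and its columns are the smoothed term representations.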

2.1.1 Matrix-based algorithms

  • HAL
  • GloVe
  • Jones & Mewhort

Clustering-based:

  • Brown Clustering

Neural network-based:

  • Skip-gram
  • CBOW
  • Order
  • LBL
  • NNLM
  • C&W

3. A closer look at the principles of each word-vector algorithm

4. Comparison of the algorithms

5. Summary
