gensim: Corpora and Vector Spaces

Study notes on gensim, covering corpora and vector spaces. Follow the link to read the original tutorial on the official site.

First, set up logging:

In [1]: import logging

In [2]: logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

From Strings to Vectors

This time, we start from documents represented as strings:

In [8]:  from gensim import corpora 

In [7]:  documents = ['Human machine interface for lab abc computer applications',
   ...:  'A survey of user opinion of computer system response time',
   ...:  'The EPS user interface management system',
   ...:  'System and human system engineering testing of EPS',
   ...:  'Relation of user perceived response time to error measurement',
   ...:  'The generation of random binary unordered trees',
   ...:  'The intersection graph of paths in trees',
   ...:  'Graph minors IV Widths of trees and well quasi ordering',
   ...:  'Graph minors A survey']

This corpus contains nine documents, each consisting of a single sentence.

First, we tokenize the corpus, removing common words (using a stoplist) as well as words that appear only once in the whole corpus:

In [9]: stoplist = set('for a of the and to in'.split())

In [10]: texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]

In [11]: from collections import defaultdict

In [12]: frequency = defaultdict(int)

In [13]: for text in texts:
   ....:     for token in text:
   ....:         frequency[token] += 1
   ....:         
In [14]: texts = [[token for token in text if frequency[token] > 1] for text in texts]

In [15]: from pprint import pprint 

In [16]: pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

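Your way of processing the documents will likely differ; here, I only split on whitespace to tokenize and then lowercase each word. In fact, I use this particular (simplistic and inefficient) setup to mimic the original experiment by Deerwester et al.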
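The preprocessing steps above can be wrapped into a small reusable function. The following is only a sketch in plain Python; the name `preprocess` and the `min_count` parameter are illustrative assumptions, not part of gensim, and the three documents are a subset of the nine used above.

```python
from collections import Counter

def preprocess(documents, stoplist, min_count=2):
    """Tokenize on whitespace, lowercase, drop stopwords and rare tokens."""
    texts = [
        [word for word in document.lower().split() if word not in stoplist]
        for document in documents
    ]
    # Count every token occurrence across the whole corpus.
    frequency = Counter(token for text in texts for token in text)
    # Keep only tokens seen at least `min_count` times in total.
    return [[t for t in text if frequency[t] >= min_count] for text in texts]

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "Graph minors A survey",
]
stoplist = set("for a of the and to in".split())
texts = preprocess(documents, stoplist)
print(texts)
```

With only three documents, far fewer tokens reach the threshold than in the nine-document run, which shows how strongly the result depends on corpus size.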

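The ways to process documents are so varied, and so dependent on the application and the language, that I decided not to constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its "surface" string form: how you get to the features is up to you. Below I describe one common, general-purpose approach (called bag-of-words), but keep in mind that different application domains call for different features and, as always, it's garbage in, garbage out...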

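To convert documents to vectors, we use a document representation called bag-of-words. In this representation, each document is represented by one vector, where each vector element stands for a question-answer pair in the style of: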

"How many times does the word system appear in the document? Once."

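It is advantageous to represent the questions only by their (integer) IDs. The mapping between the questions and the IDs is called a dictionary: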

In [17]: dictionary = corpora.Dictionary(texts)
2016-11-28 19:11:28,521 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2016-11-28 19:11:28,522 : INFO : built Dictionary(12 unique tokens: ['minors', 'human', 'time', 'system', 'computer']...) from 9 documents (total 29 corpus positions)

In [18]: dictionary.save('/tmp/deerwester.dict')  # store the dictionary for later use
2016-11-28 19:12:21,738 : INFO : saving Dictionary object under /tmp/deerwester.dict, separately None
2016-11-28 19:12:21,739 : INFO : saved /tmp/deerwester.dict

In [19]: print(dictionary)
Dictionary(12 unique tokens: ['minors', 'human', 'time', 'system', 'computer']...)

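Here we used the gensim.corpora.dictionary.Dictionary class to assign a unique integer ID to every word that appears in the corpus. This sweeps across the texts, collecting word counts and related statistics. In the end, we see that there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (that is, by a 12-dimensional vector). To see the mapping between words and their IDs: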

In [20]: print(dictionary.token2id)
{'minors': 11, 'human': 0, 'time': 7, 'system': 6, 'computer': 1, 'interface': 2, 'survey': 5, 'response': 4, 'graph': 10, 'user': 3, 'trees': 9, 'eps': 8}
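Conceptually, such a dictionary is just a mapping from tokens to integers. Below is a minimal plain-Python sketch, assuming IDs are handed out in order of first appearance; `build_token2id` is a hypothetical helper, and the concrete ID numbers gensim assigns will generally differ from what this produces.

```python
def build_token2id(texts):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                # The next free id equals the number of tokens seen so far.
                token2id[token] = len(token2id)
    return token2id

sample_texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer'],
]
sample_mapping = build_token2id(sample_texts)
print(sample_mapping)
```

Only the uniqueness of each ID matters for the bag-of-words representation, not which word gets which number.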

To actually convert a tokenized document to a vector:

In [21]: new_doc = "Human computer interaction"

In [22]: new_vec = dictionary.doc2bow(new_doc.lower().split())

In [23]: print(new_vec)
[(0, 1), (1, 1)]

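The function doc2bow() simply counts the number of occurrences of each distinct word, converts each word to its integer ID and returns the result as a sparse vector. The sparse vector [(0, 1), (1, 1)] therefore reads: in the document "Human computer interaction", the words human (id 0) and computer (id 1) each appear once; the other ten dictionary words appear (implicitly) zero times. The word "interaction" is not in the dictionary, so it is ignored.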
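The counting step can be illustrated without gensim. The sketch below is not gensim's implementation; `simple_doc2bow` is a hypothetical stand-in that assumes the token-to-id mapping printed by dictionary.token2id above.

```python
from collections import Counter

# Token-to-id mapping copied from the dictionary.token2id output above.
token2id = {'minors': 11, 'human': 0, 'time': 7, 'system': 6, 'computer': 1,
            'interface': 2, 'survey': 5, 'response': 4, 'graph': 10,
            'user': 3, 'trees': 9, 'eps': 8}

def simple_doc2bow(tokens, token2id):
    """Return a sorted list of (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

new_vec = simple_doc2bow("Human computer interaction".lower().split(), token2id)
print(new_vec)  # [(0, 1), (1, 1)]
```

Tokens missing from the mapping, such as "interaction", simply never reach the counter, which is why they leave no trace in the resulting vector.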

In [25]: corpus = [dictionary.doc2bow(text) for text in texts]

In [26]: corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # store to disk for later use
2016-11-28 19:29:06,023 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm
2016-11-28 19:29:06,023 : INFO : saving sparse matrix to /tmp/deerwester.mm
2016-11-28 19:29:06,023 : INFO : PROGRESS: saving document #0
2016-11-28 19:29:06,024 : INFO : saved 9x12 matrix, density=25.926% (28/108)
2016-11-28 19:29:06,024 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index

In [27]: print(corpus)
[[(0, 1), (1, 1), (2, 1)], [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (3, 1), (6, 1), (8, 1)], [(0, 1), (6, 2), (8, 1)], [(3, 1), (4, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(5, 1), (10, 1), (11, 1)]]

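By now it should be clear that the vector feature with id=10 stands for the question "How many times does the word graph appear in the document?" The answer is zero for the first six documents and one for each of the remaining three.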
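To make the implicit zeros visible, a sparse vector can be expanded into a dense one. This is a plain-Python sketch with a hypothetical helper `to_dense`; the corpus literal is copied from the print(corpus) output above.

```python
def to_dense(sparse_vec, num_terms):
    """Expand a list of (token_id, count) pairs into a dense list of length num_terms."""
    dense = [0] * num_terms
    for token_id, count in sparse_vec:
        dense[token_id] = count
    return dense

corpus = [[(0, 1), (1, 1), (2, 1)],
          [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
          [(2, 1), (3, 1), (6, 1), (8, 1)],
          [(0, 1), (6, 2), (8, 1)],
          [(3, 1), (4, 1), (7, 1)],
          [(9, 1)],
          [(9, 1), (10, 1)],
          [(9, 1), (10, 1), (11, 1)],
          [(5, 1), (10, 1), (11, 1)]]

fourth_doc = to_dense(corpus[3], 12)
# Answer to "how often does id=10 (graph) occur?" for every document.
graph_counts = [dict(doc).get(10, 0) for doc in corpus]
print(fourth_doc)
print(graph_counts)
```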

Corpus Streaming: One Document at a Time

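Note that the corpus above resides fully in memory, as a plain Python list. In this simple example that doesn't matter much, but to make the point clear, let's assume the corpus contains millions of documents. Storing all of them in RAM won't do. Instead, let's assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time: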

In [28]: class MyCorpus(object):
   ....:     def __iter__(self):
   ....:         for line in open('mycorpus.txt'):
   ....:             # assume one document per line, with tokens separated by whitespace
   ....:             yield dictionary.doc2bow(line.lower().split())
   ....:

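The sample file mycorpus.txt can be downloaded here. The assumption that each document occupies one line in a single file is not important; you can adapt the __iter__ method to fit your input format, whatever it is. Walking directories, parsing XML, accessing the network... Just parse your input to retrieve a clean list of tokens for each document, convert the tokens to their IDs via the dictionary, and yield the resulting sparse vector inside __iter__.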

In [29]: corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!

In [30]: print(corpus_memory_friendly)
<__main__.MyCorpus object at 0xb69e07ac>

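The corpus is now an object. We didn't define any way to print it, so print just outputs the address of the object in memory. Not very useful. To see the constituent vectors, let's iterate over the corpus and print each document vector (one at a time):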

In [31]: for vector in corpus_memory_friendly:
   ....:     print(vector)
   ....:     
[(1, 1), (5, 1), (11, 1)]
[(2, 1), (3, 1), (4, 1), (5, 1), (7, 1), (9, 1)]
[(3, 1), (8, 1), (9, 1), (11, 1)]
[(1, 1), (3, 2), (8, 1)]
[(2, 1), (7, 1), (9, 1)]
[(10, 1)]
[(6, 1), (10, 1)]
[(0, 1), (6, 1), (10, 1)]
[(0, 1), (4, 1), (6, 1)]

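Although the output looks the same as for the plain Python list, the corpus is now much more memory-friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want. (The integer IDs in this listing differ from the token2id mapping printed earlier, presumably because the listing was produced with a dictionary built in a different run; the documents and counts are the same.)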
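The same streaming idea can be tried without gensim. In the sketch below, `StreamedCorpus` is a hypothetical class that takes the file path and the token-to-id mapping as arguments, and a small temporary file stands in for mycorpus.txt; both the file contents and the mapping are assumptions made for illustration.

```python
import os
import tempfile
from collections import Counter

class StreamedCorpus:
    """Yield one bag-of-words vector per line of a text file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.token2id[t] for t in line.lower().split() if t in self.token2id
                )
                yield sorted(counts.items())

# Hypothetical two-line file standing in for mycorpus.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Human computer interface\nGraph of trees\n")

token2id = {"human": 0, "computer": 1, "interface": 2, "trees": 9, "graph": 10}
streamed = StreamedCorpus(tmp.name, token2id)
first_pass = list(streamed)
second_pass = list(streamed)
os.remove(tmp.name)
print(first_pass)
```

Defining `__iter__` on a class, rather than handing around a one-shot generator, is what allows several passes over the data; algorithms that need multiple passes over a corpus rely on this.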

Similarly, the dictionary can be built without loading all the texts into memory:

In [31]: from six import iteritems

In [32]: dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
2016-11-28 19:51:27,132 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2016-11-28 19:51:27,133 : INFO : built Dictionary(42 unique tokens: ['engineering', 'and', 'perceived', 'time', 'intersection']...) from 9 documents (total 69 corpus positions)

In [33]: stop_ids = [dictionary.token2id[stopword] for stopword in stoplist if stopword in dictionary.token2id]

In [34]: ones_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]

In [37]: dictionary.filter_tokens(stop_ids + ones_ids)

In [38]: dictionary.compactify()  # remove the gaps in the id sequence left by the removed words

In [39]: print(dictionary)
Dictionary(12 unique tokens: ['minors', 'human', 'time', 'computer', 'system']...)
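The filtering steps above (drop stopwords, drop words that occur in only one document, then close the gaps in the ID sequence) can be sketched in plain Python as a single pass over a token stream. `build_filtered_vocab` and its `min_docfreq` parameter are hypothetical names, and the three input lines are made up for illustration.

```python
def build_filtered_vocab(token_streams, stoplist, min_docfreq=2):
    """Count document frequencies in one streaming pass, then keep frequent non-stopwords."""
    docfreq = {}
    for tokens in token_streams:              # one document at a time
        for token in sorted(set(tokens)):     # each document counts a token once
            docfreq[token] = docfreq.get(token, 0) + 1
    kept = [t for t, df in docfreq.items() if t not in stoplist and df >= min_docfreq]
    # Re-number the surviving tokens consecutively, so the id sequence has no gaps.
    return {token: new_id for new_id, token in enumerate(kept)}

lines = ["The EPS user interface", "A survey of user opinion", "The EPS survey"]
vocab = build_filtered_vocab(
    (line.lower().split() for line in lines),  # generator: no list of all texts is built
    stoplist={"the", "a", "of"},
)
print(vocab)
```

Only the frequency table is held in memory, never the documents themselves.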

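And that is all there is to it! At least as far as the bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn't, and we will need to apply a transformation to this simple representation before we can use it to compute any meaningful document-to-document similarities. Transformations are covered in another tutorial; before that, let's briefly turn our attention to corpus persistence.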

Corpus Formats

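Several file formats exist for serializing a vector space corpus (a sequence of vectors) to disk. Gensim implements them via the streaming corpus interface mentioned earlier: documents are read from (or stored to) disk lazily, one document at a time, without reading the whole corpus into memory at once.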

One of the more notable file formats is the Matrix Market format. To save a corpus in the Matrix Market format:

In [41]: corpus = [[(1,0.5)],[]]

In [42]: corpora.MmCorpus.serialize('/tmp/corpus.mm',corpus)
2016-11-28 20:15:43,361 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm
2016-11-28 20:15:43,361 : INFO : saving sparse matrix to /tmp/corpus.mm
2016-11-28 20:15:43,361 : INFO : PROGRESS: saving document #0
2016-11-28 20:15:43,361 : INFO : saved 2x2 matrix, density=25.000% (1/4)
2016-11-28 20:15:43,362 : INFO : saving MmCorpus index to /tmp/corpus.mm.index
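For orientation, a Matrix Market coordinate file is plain text: a banner line, a size line (rows, columns, non-zero entries), then one "row column value" line per non-zero entry, with 1-based indices. The sketch below renders the toy corpus that way; `to_matrix_market` is a hypothetical helper that buffers everything in memory for simplicity, and the exact bytes gensim writes may differ in number formatting.

```python
def to_matrix_market(corpus, num_terms):
    """Render a list of sparse documents as Matrix Market coordinate text."""
    entries = []
    for doc_no, doc in enumerate(corpus):
        for term_id, value in doc:
            # Matrix Market indices are 1-based: row = document, column = term.
            entries.append(f"{doc_no + 1} {term_id + 1} {value}")
    header = "%%MatrixMarket matrix coordinate real general"
    size_line = f"{len(corpus)} {num_terms} {len(entries)}"
    return "\n".join([header, size_line] + entries) + "\n"

mm_text = to_matrix_market([[(1, 0.5)], []], num_terms=2)
print(mm_text)
```

The size line "2 2 1" corresponds to the "saved 2x2 matrix, density=25.000% (1/4)" message in the log above.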

Other formats include Joachims' SVMlight format, Blei's LDA-C format and the GibbsLDA++ format.

In [43]: corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight',corpus)
2016-11-28 20:19:02,491 : INFO : converting corpus to SVMlight format: /tmp/corpus.svmlight
2016-11-28 20:19:02,492 : INFO : saving SvmLightCorpus index to /tmp/corpus.svmlight.index

In [44]: corpora.BleiCorpus.serialize('/tmp/corpus.lda-c',corpus)
2016-11-28 20:19:23,526 : INFO : no word id mapping provided; initializing from corpus
2016-11-28 20:19:23,526 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
2016-11-28 20:19:23,528 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
2016-11-28 20:19:23,528 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index

In [45]: corpora.LowCorpus.serialize('/tmp/corpus.low',corpus)
2016-11-28 20:19:42,338 : INFO : no word id mapping provided; initializing from corpus
2016-11-28 20:19:42,339 : INFO : storing corpus in List-Of-Words format into /tmp/corpus.low
2016-11-28 20:19:42,339 : WARNING : List-of-words format can only save vectors with integer elements; 1 float entries were truncated to integer value
2016-11-28 20:19:42,339 : INFO : saving LowCorpus index to /tmp/corpus.low.index
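As an illustration of how simple these formats are, SVMlight stores one document per line as a target label followed by feature:value pairs, with feature numbers starting at 1. The sketch below assumes a dummy label of 0 for every document; `to_svmlight_line` is a hypothetical helper, not gensim's writer.

```python
def to_svmlight_line(doc, label=0):
    """Format one sparse document as an SVMlight line: '<label> <feature>:<value> ...'."""
    # SVMlight feature numbers are 1-based and listed in increasing order.
    pairs = " ".join(f"{term_id + 1}:{value}" for term_id, value in sorted(doc))
    return f"{label} {pairs}".rstrip()

svmlight_lines = [to_svmlight_line(doc) for doc in [[(1, 0.5)], []]]
print(svmlight_lines)
```

An empty document still produces a line (just the label), so the line number keeps identifying the document.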

Conversely, to load a corpus iterator from a Matrix Market file:

In [46]: corpus = corpora.MmCorpus('/tmp/corpus.mm')
2016-11-28 20:22:02,531 : INFO : loaded corpus index from /tmp/corpus.mm.index
2016-11-28 20:22:02,532 : INFO : initializing corpus reader from /tmp/corpus.mm
2016-11-28 20:22:02,532 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries
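What reading lazily means can be sketched with a generator that walks over Matrix Market text line by line and yields one document at a time. `iter_matrix_market` is a hypothetical function, not gensim's reader; it assumes the entries are sorted by document (row) number.

```python
def iter_matrix_market(lines):
    """Yield one sparse document at a time from Matrix Market coordinate text."""
    lines = iter(lines)
    # Skip the banner and comment lines (they start with '%'); stop at the size line.
    for line in lines:
        if not line.startswith("%"):
            break
    num_docs, num_terms, num_nnz = (int(x) for x in line.split())
    current, doc = 1, []
    for line in lines:
        if not line.strip():
            continue
        row, col, value = line.split()
        row = int(row)
        while current < row:          # emit finished (possibly empty) documents
            yield doc
            current, doc = current + 1, []
        doc.append((int(col) - 1, float(value)))   # back to 0-based term ids
    while current <= num_docs:        # flush the last document and trailing empties
        yield doc
        current, doc = current + 1, []

mm_lines = ["%%MatrixMarket matrix coordinate real general", "2 2 1", "1 2 0.5"]
loaded = list(iter_matrix_market(mm_lines))
print(loaded)
```

Because the function is a generator, only the document currently being assembled is held in memory; passing an open file object instead of a list would stream straight from disk.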

Corpus objects are streams, so typically you won't be able to print them directly:

In [47]: print(corpus)
MmCorpus(2 documents, 2 features, 1 non-zero entries)

Instead, to view the contents of a corpus, convert it to a plain Python list:

In [48]: print(list(corpus))
[[(1, 0.5)], []]

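Alternatively, iterate over the corpus and print one document at a time (for doc in corpus: print(doc)), which makes use of the streaming interface. This second way is obviously more memory-friendly, but for testing and development nothing beats the simplicity of calling list(corpus).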

To save the same Matrix Market document stream in Blei's LDA-C format:

In [49]: corpora.BleiCorpus.serialize('/tmp/corpus.lda-c',corpus)
2016-11-28 20:27:53,736 : INFO : no word id mapping provided; initializing from corpus
2016-11-28 20:27:53,737 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c
2016-11-28 20:27:53,738 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab
2016-11-28 20:27:53,738 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index

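In this way, gensim can also be used as a memory-efficient I/O format conversion tool: just load a document stream in one format and immediately save it in another. Adding new formats is easy; see the code for the SVMlight corpus for an example.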