Natural Language Processing

2017-03-05  _Randolph_

Chapter 1 & 2: Language Processing and Python & Accessing Text Corpora and Lexical Resources


1.Key:

What's NLTK?

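NLTK is a natural language toolkit. It was first created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science at the University of Pennsylvania, and it is the tool many NLP researchers start with.

This book is an introduction to natural language processing with Python. It can largely be read as a handbook for the NLTK library, and the methods it uses all come from nltk. To consult the API documentation or to download and install NLTK, go to the official website. The API documentation there covers every module, class, and function in the toolkit, describes the parameters in detail, and gives usage examples, so I will not repeat it here.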

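Language processing task    NLTK modules                 Functionality
Accessing corpora           nltk.corpus                  Standardized interfaces to corpora and lexicons
String processing           nltk.tokenize, nltk.stem     Tokenizers, sentence tokenizers, stemmers
Collocation discovery       nltk.collocations            t-test, chi-squared, pointwise mutual information
Part-of-speech tagging      nltk.tag                     n-gram, backoff, Brill, HMM, TnT
Classification              nltk.classify, nltk.cluster  Decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking                    nltk.chunk                   Regular expression, n-gram, named entity
Parsing                     nltk.parse                   Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation     nltk.sem, nltk.inference     Lambda calculus, first-order logic, model checking
Evaluation metrics          nltk.metrics                 Precision, recall, agreement coefficients
Probability and estimation  nltk.probability             Frequency distributions, smoothed probability distributions
Applications                nltk.app, nltk.chat          Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork        nltk.toolbox                 Manipulate data in SIL Toolbox format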

concordance function.

>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo

The function is implemented as follows:

def concordance(self, word, width=79, lines=25):
    """
    Print a concordance for ``word`` with the specified context window.
    Word matching is not case-sensitive.
    :seealso: ``ConcordanceIndex``
    """
    if '_concordance_index' not in self.__dict__:
        print("Building index...")
        self._concordance_index = ConcordanceIndex(self.tokens,
                                                   key=lambda s:s.lower())

    self._concordance_index.print_concordance(word, width, lines)
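The idea behind the index can be sketched in plain Python. This is a simplified stand-in, not NLTK's ConcordanceIndex: the names build_index and concordance_lines are hypothetical, the token list is made up, and the context window is counted in tokens rather than in characters as the width parameter above does.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each lowercased token to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return index

def concordance_lines(tokens, index, word, context=3):
    """Return each occurrence of `word` with `context` tokens on either side."""
    lines = []
    for position in index.get(word.lower(), []):
        left = tokens[max(0, position - context):position]
        right = tokens[position + 1:position + 1 + context]
        lines.append(" ".join(left + [tokens[position]] + right))
    return lines

# Toy token list (made up for illustration, not taken from text1).
tokens = "The monstrous whale met a Monstrous wave near the ship".split()
index = build_index(tokens)
for line in concordance_lines(tokens, index, "monstrous", context=2):
    print(line)
```

Lowercasing the key when the index is built is what makes the lookup case-insensitive, which matches the key=lambda s:s.lower() argument in the method above.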

Word Sense Disambiguation & Pronoun Resolution

a. serve: help with food or drink; hold an office; put ball into play

b. dish: plate; course of a meal; communications device


Text Corpus Structure

Here are several common corpus structures:



WordNet

WordNet synsets correspond to abstract concepts, and they don’t always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, Event; these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific.

A fragment of the WordNet concept hierarchy: each node corresponds to a synset; edges represent the hypernym/hyponym relation, that is, the relation between a superordinate concept and a subordinate one.
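The hypernym/hyponym relation can be illustrated with a toy hierarchy in plain Python. The links below are hand-written and heavily simplified (the intermediate nodes are my assumption, and real WordNet synsets can have several hypernyms); the function names are hypothetical and this is not the nltk.corpus.wordnet API.

```python
# Toy hypernym links (child -> parent), hand-written for illustration only;
# real WordNet synsets can have several hypernyms and far more levels.
HYPERNYM = {
    "hatchback": "motorcar",
    "gas guzzler": "motorcar",
    "motorcar": "motor vehicle",
    "motor vehicle": "artifact",
    "artifact": "entity",
}

def hypernym_path(concept):
    """Walk upward from `concept` to the root, returning the path root-first."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return list(reversed(path))

def hyponyms(concept):
    """Return the direct hyponyms (children) of `concept`, sorted by name."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == concept)

print(" > ".join(hypernym_path("hatchback")))
print(hyponyms("motorcar"))
```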



2.Correct errors in printing:

P19.

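In the [Your Turn] box:

Try the preceding frequency distribution example using text2. ... If you get the error message NameError: name 'FreqDist' is not defined, you need to start your work with from nltk.book import *.

should be corrected to:

Try the preceding frequency distribution example using text2. ... If you get the error message NameError: name 'FreqDist' is not defined, you need to start your work with from nltk import *.

Reason: FreqDist is defined in nltk.probability and exported by the top-level nltk package; it is not defined in nltk.book itself. Note that nltk.book does appear to re-import FreqDist in NLTK 3, so the original line may work as well; check your installed version.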


P48.

In the code for the [Inaugural Address Corpus] section:

 >>> cfd = nltk.ConditionalFreqDist(
     ...           (target, file[:4])
     ...           for fileid in inaugural.fileids()

should be corrected to:

 >>> cfd = nltk.ConditionalFreqDist(
     ...           (target, fileid[:4])
     ...           for fileid in inaugural.fileids()
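Why the variable must be fileid can be seen in a small plain-Python sketch of conditional counting. It uses collections.Counter as a stand-in for nltk.ConditionalFreqDist. The token lists are invented, and the prefix filter on the target words is my assumption about the part of the listing not quoted here.

```python
from collections import Counter, defaultdict

# Toy stand-in for the corpus: fileid -> tokens. The fileid pattern follows the
# "YYYY-Name.txt" style; the token lists are invented for illustration.
speeches = {
    "1789-Washington.txt": ["Citizens", "of", "America", "and", "fellow", "citizens"],
    "1801-Jefferson.txt": ["American", "citizens", "assembled"],
}
targets = ["america", "citizen"]

# condition -> Counter of events, the same shape a conditional frequency
# distribution has: here condition = target word, event = year (fileid[:4]).
cfd = defaultdict(Counter)
for fileid, words in speeches.items():
    for w in words:
        for target in targets:
            if w.lower().startswith(target):
                cfd[target][fileid[:4]] += 1

print(dict(cfd["citizen"]))
print(dict(cfd["america"]))
```

The loop variable is fileid, so the year has to be sliced from fileid[:4]; a name such as file is never bound inside the generator.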

3.Practice:

6.○ In the discussion of comparative wordlists, we created an object called translate, which you could look up using words in both German and Italian in order to get corresponding words in English. What problem might arise with this approach? Can you suggest a way to avoid this problem?
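One possible answer, offered as a sketch rather than the book's solution: a word spelled the same way in both languages overwrites the earlier entry when the two wordlists are merged into one dictionary. The word pairs below are hypothetical toy data (not taken from the Swadesh lists), and translate_by_lang is a name introduced only for this illustration.

```python
# Hypothetical word pairs (foreign word, English word); "tasse" is spelled the
# same in both languages once lowercased but means different things.
de2en = [("hund", "dog"), ("tasse", "cup")]
it2en = [("cane", "dog"), ("tasse", "taxes")]

# Merging into one flat dict, as in the single-lookup approach:
translate = dict(de2en)
translate.update(dict(it2en))
print(translate["tasse"])  # the German entry has been overwritten by the Italian one

# One way around it: key on (language, word) so entries cannot collide.
translate_by_lang = {}
for lang, pairs in (("de", de2en), ("it", it2en)):
    for foreign, english in pairs:
        translate_by_lang[(lang, foreign)] = english

print(translate_by_lang[("de", "tasse")], translate_by_lang[("it", "tasse")])
```

Keeping one dictionary per language would work equally well; the point is that the language has to be part of the lookup.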

8.◑ Define a conditional frequency distribution over the Names Corpus that allows you to see which initial letters are more frequent for males versus females (see Figure 2-7).

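import nltk
from nltk.corpus import names
# Condition on the file (male.txt or female.txt); the event is the initial letter, name[0].
cfd = nltk.ConditionalFreqDist((fileid, name[0])
                               for fileid in names.fileids()
                               for name in names.words(fileid))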

14.◑ Define a function supergloss(s) that takes a synset s as its argument and returns a string consisting of the concatenation of the definition of s, and the definitions of all the hypernyms and hyponyms of s.

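from nltk.corpus import wordnet as wn  # used to build the argument, e.g. wn.synset('car.n.01')

def supergloss(s):
    """Join the definition of synset s with those of its hypernyms and hyponyms."""
    # s is already a Synset, so it must not be looked up again by name.
    related = s.hypernyms() + s.hyponyms()
    # definition() is a method in NLTK 3.
    return ' '.join(syn.definition() for syn in [s] + related)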

17.◑ Write a function that finds the 50 most frequently occurring words of a text that are not stopwords.

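import nltk
from nltk import FreqDist

def most_fifty_words(text):
    # A set makes the stopword membership test fast.
    stopwords = set(nltk.corpus.stopwords.words('english'))
    # Keep alphabetic tokens only, so punctuation does not crowd out real words.
    content = [w.lower() for w in text if w.isalpha() and w.lower() not in stopwords]
    fdist = FreqDist(content)
    # In NLTK 3, keys() is not ordered by frequency; most_common() is.
    return [word for word, count in fdist.most_common(50)]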
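The counting step in this exercise can be checked without downloading any NLTK data. The sketch below uses collections.Counter as a stand-in for FreqDist, with a tiny hypothetical stopword list and a made-up token list; top_content_words is a name introduced only for this illustration.

```python
from collections import Counter

# Tiny hypothetical stopword list; NLTK's English list is much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def top_content_words(tokens, n=50):
    """Return the n most frequent alphabetic, non-stopword tokens, lowercased."""
    content = [t.lower() for t in tokens if t.isalpha() and t.lower() not in STOPWORDS]
    return [word for word, count in Counter(content).most_common(n)]

tokens = "The whale and the ship , the whale and the sea .".split()
print(top_content_words(tokens, n=2))
```

The ranking has to come from an explicit frequency sort such as most_common(); slicing the keys only gives the most frequent words if the container happens to keep them in frequency order.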

4.Remaining Questions:
