Detecting Specific Words or Phrases in Scientific Abstracts (Self-Study Day 43)

2020-03-24  天明豆豆


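The approach used in the previous post can be reused to detect a word or phrase in scientific abstracts. More generally, this example also suits very simple text mining, much like the "Find" tool in Microsoft Word.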
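Before any network access is involved, the "Find"-like behaviour can be tried on a plain string. The following is a minimal sketch, assuming a made-up `sample` text in place of a downloaded abstract; it shows what a match object reports and where the case-insensitivity flag goes.

```python
import re

# Hypothetical sample text standing in for a downloaded abstract.
sample = "Schistosoma mansoni is a parasitic worm. Infection with schistosoma is common."

# Flags belong in re.compile(); the second positional argument of a
# compiled pattern's search() is a start position, not a flag.
keyword = re.compile('schistosoma', re.IGNORECASE)

match = keyword.search(sample)
if match:
    # group() is the matched text; start() and end() are character offsets.
    print(match.group(), match.start(), match.end())
```

search() stops at the first hit, so only the occurrence at the start of the text is reported here.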

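import re
from urllib.request import urlopen  # Python 3 replacement for urllib2

# word to be searched (the flag goes in compile(), not in search())
keyword = re.compile('schistosoma', re.IGNORECASE)

# patterns for the title and the abstract; they assume the PubMed
# page layout at the time of writing
title_regexp = re.compile('<h1>.{5,400}</h1>')
abstract_regexp = re.compile('<h3>Abstract</h3><p>.{20,3000}</p></div>')

# list of PMIDs where we want to search the word
pmids = ['18235848', '22607149', '22405002', '21630672']

for pmid in pmids:
    url = 'http://www.ncbi.nlm.nih.gov/pubmed?term=%s' % pmid
    handler = urlopen(url)
    html = handler.read().decode('utf-8', errors='replace')
    title = title_regexp.search(html)
    abstract = abstract_regexp.search(html)
    if title is None or abstract is None:
        continue  # page did not match the expected layout
    word = keyword.search(abstract.group())
    if word:
        # display title and where the keyword was found
        print(title.group())
        print(word.group(), word.start(), word.end())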
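Since the target may be a phrase rather than a single word, it can help to build the pattern from the phrase itself. The sketch below is only one possible approach: `phrase_pattern` is a hypothetical helper that is not part of the original script, and `text` is a made-up sample. It escapes each word, tolerates any amount of whitespace between words, and anchors the match on word boundaries.

```python
import re

def phrase_pattern(phrase):
    """Compile a case-insensitive pattern for a word or multi-word phrase.

    Hypothetical helper for illustration only.
    """
    # Escape each word so characters such as '.' or '(' are matched literally.
    words = [re.escape(w) for w in phrase.split()]
    # Allow any run of whitespace between words; \b keeps the match
    # on word boundaries at both ends.
    return re.compile(r'\b' + r'\s+'.join(words) + r'\b', re.IGNORECASE)

# Hypothetical sample text standing in for an abstract.
text = "We studied Schistosoma  mansoni infection and schistosomal antigens."

pattern = phrase_pattern('schistosoma mansoni')
match = pattern.search(text)
if match:
    print(match.group(), match.start(), match.end())

# The word boundary prevents a hit inside the longer word 'schistosomal'.
single_hits = phrase_pattern('schistosoma').findall(text)
print(single_hits)
```

The word-boundary anchors assume the phrase starts and ends with a letter or digit; a phrase that begins or ends with punctuation would need a different anchor.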

To find every occurrence of the word in the text, use the finditer() method:

import re
from urllib.request import urlopen  # Python 3 replacement for urllib2

# word to be searched
word_regexp = re.compile('schistosoma')

# patterns for the title and the abstract; they assume the PubMed
# page layout at the time of writing
title_regexp = re.compile('<h1>.{5,400}</h1>')
abstract_regexp = re.compile('<h3>Abstract</h3><p>.{20,3000}</p></div>')

# list of PMIDs where we want to search the word
pmids = ['18235648', '22607149', '22405002', '21630672']

for pmid in pmids:
    url = 'http://www.ncbi.nlm.nih.gov/pubmed?term=%s' % pmid
    handler = urlopen(url)
    html = handler.read().decode('utf-8', errors='replace')
    title = title_regexp.search(html)
    abstract = abstract_regexp.search(html)
    if title is None or abstract is None:
        continue  # page did not match the expected layout
    # finditer() returns an iterator, which is always truthy,
    # so collect the matches in a list before testing
    words = list(word_regexp.finditer(abstract.group()))
    if words:
        # display title and where the keyword was found
        print(title.group())
        for word in words:
            print(word.group(), word.start(), word.end())
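
The behaviour of finditer() can also be checked offline. The following is a minimal sketch, assuming a made-up `abstract` string and an arbitrary context width of 15 characters; it lists every hit together with a little surrounding text, similar to stepping through "Find next" in a word processor.

```python
import re

# Hypothetical sample text standing in for an abstract.
abstract = ("Schistosoma mansoni infects humans. Eggs of schistosoma species "
            "are found in stool. SCHISTOSOMA haematobium affects the bladder.")

word_regexp = re.compile('schistosoma', re.IGNORECASE)

# finditer() returns an iterator, which is always truthy even with no hits,
# so the matches are collected in a list before testing for emptiness.
matches = list(word_regexp.finditer(abstract))

if matches:
    for m in matches:
        # Show up to 15 characters of context on each side of the hit.
        left = max(m.start() - 15, 0)
        right = min(m.end() + 15, len(abstract))
        print(m.start(), m.end(), '...' + abstract[left:right] + '...')
```

Each match object carries its own offsets, so the context slice is computed per hit without rescanning the text.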

