project 5 movies question

2021-01-07 jiarf

1. Downloading the data

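Tutorial: Biopython学习笔记（四）访问NCBI Entrez数据库 [Biopython study notes (4): accessing the NCBI Entrez databases] - 简书 (jianshu.com)
Bio.Entrez package, Biopython 1.75 documentation
Python：利用Entrez库筛选下载PubMed文献摘要 [Python: filtering and downloading PubMed abstracts with the Entrez library] - v林三岁 - 博客园 (cnblogs.com). This is the main tutorial I followed; the two further tutorials linked at the bottom of it are also good.
Another option is Python抓取简单的web内容,爬取,网页内容 [Scraping simple web content with Python] (pythonf.cn).
That approach needs regular expressions, see python中的正则表达式（re模块） [Regular expressions in Python (the re module)] - tina.py - 博客园 (cnblogs.com).
I have not really understood that part yet, so I set it aside for now.
The Entrez package
esearch: runs an Entrez search.
efilter: filters the results of an esearch (as far as I can tell this is an EDirect command-line tool; Bio.Entrez itself has no efilter function).
efetch: fetches (downloads) the results of the two operations above.
Example task: from NCBI PubMed, get the PubMed ID, title, and abstract of every article with "aging" in its title published between 2019/01/01 and 2020/12/01.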
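The example task above boils down to one query string. A minimal sketch of how such a string could be assembled is shown below; `build_term` is a hypothetical helper name of my own, and the field tags (`[Title]`, `[Date - Publication]`) are the ones used in the esearch call later in these notes. No quotation marks are used, in line with the note about quotes further down.

```python
from datetime import date


def build_term(word: str, start: date, end: date) -> str:
    """Build a PubMed query: `word` in the title, published from start to end."""
    if start > end:
        raise ValueError("start must not be later than end")
    fmt = "%Y/%m/%d"  # PubMed date ranges are written as YYYY/MM/DD
    return (
        f"{word}[Title] AND "
        f"({start.strftime(fmt)}[Date - Publication] : "
        f"{end.strftime(fmt)}[Date - Publication])"
    )


term = build_term("aging", date(2019, 1, 1), date(2020, 12, 1))
print(term)
```

The resulting string can be passed as the `term` argument of `Entrez.esearch`.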

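Entrez.esearch runs the search. Its parameters:
db: the database to search. As the comments in the code also note, Entrez is a general search interface, so besides PubMed literature it can also search the other NCBI databases.
term: the query string. In my experience the query must not contain quotation marks, not even single quotes, otherwise nothing is found. If you are unsure how to write a query, or whether a field is defined, pick the fields and type the values at https://pubmed.ncbi.nlm.nih.gov/advanced/ and the query is generated for you. The generated query is not very smart, though: it contains many brackets you do not need, so trim them when you write your own code.
ptyp: the publication type; I use Review here.
usehistory: set to y, so that the server remembers this search and later requests can retrieve from that search history.
retmax: the maximum number of records returned. If it is not set, only a limited default number comes back (I thought the cap was about 1000; the E-utilities documentation lists a default of 20 for esearch). My search returns more than that, so I have to set the number myself.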

handle_0 = Entrez.esearch(db="pubmed", term="ageing[Title] AND humans[MeSH Terms] AND (2020/01/01[Date - Publication] : 2021/12/31[Date - Publication])")
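To make the parameters less abstract, here is a rough sketch of the HTTP request that such a search amounts to. As far as I understand, Bio.Entrez forwards its keyword arguments as URL parameters to the E-utilities esearch endpoint; the helper name `esearch_url` and the retmax value 5000 are my own choices for illustration, and the sketch only builds the URL without contacting NCBI.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Public E-utilities endpoint for esearch
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, retmax: int = 20, use_history: bool = True) -> str:
    """Return the esearch URL for a PubMed query (hypothetical helper)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    if use_history:
        # Ask the server to keep the result set so efetch can reuse it
        params["usehistory"] = "y"
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = esearch_url("ageing[Title] AND humans[MeSH Terms]", retmax=5000)
print(url)
# Decode the query string again to check what the server would receive
print(parse_qs(urlsplit(url).query))
```

Spaces and square brackets in the term are percent-encoded by `urlencode`, so the query can be written exactly as it appears in the PubMed advanced search box.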

from Bio import Entrez
from collections import Counter
import http.client
import numpy as np

# Settings: NCBI asks for a contact email with every request
Entrez.email = "1873350971@qq.com"

# Force HTTP/1.0 (a workaround often used against IncompleteRead errors on large downloads)
http.client.HTTPConnection._http_vsn = 10
http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'

# Search: reviews from the last 365 days with "ageing" in the title
hd_esearch = Entrez.esearch(db="pubmed", term="ageing[Title]", reldate=365, ptyp="Review", usehistory="y")
read_esearch = Entrez.read(hd_esearch)
total = int(read_esearch["Count"])
webenv = read_esearch["WebEnv"]
query_key = read_esearch["QueryKey"]
# For this demo, cap total at 50 and fetch 6 records per request
total = min(total, 50)
step = 6
print("Result items: ", total)
# The ./data directory must already exist
with open("./data/ageing_pubmed.txt", "w") as file:
    for start in range(0, total, step):
        print("Download record %i to %i" % (start + 1, min(start + step, total)))
        hd_efetch = Entrez.efetch(db="pubmed", retstart=start, retmax=step, webenv=webenv, query_key=query_key, rettype="medline", retmode="text")
        file.write(hd_efetch.read())
        hd_efetch.close()
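The download above saves records in MEDLINE text format, while the goal is only the PubMed ID, title, and abstract. The sketch below shows one way to pull out those three fields with the standard library alone. It assumes the usual MEDLINE layout (a tag padded to four characters, then "- ", with continuation lines indented by six spaces and a blank line between records); `parse_medline` and the sample records are made up for illustration.

```python
from typing import Dict, List

WANTED = {"PMID", "TI", "AB"}  # PubMed ID, title, abstract


def parse_medline(text: str) -> List[Dict[str, str]]:
    """Extract the WANTED fields from MEDLINE-formatted text."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record
            if current:
                records.append(current)
            current, tag = {}, None
        elif line.startswith("      ") and tag is not None:
            # Continuation of the previous field; ignored for unwanted tags
            if tag in current:
                current[tag] += " " + line.strip()
        elif len(line) > 5 and line[4:6] == "- ":
            tag = line[:4].strip()
            if tag in WANTED:
                current[tag] = line[6:].strip()
    if current:
        records.append(current)
    return records


# Made-up sample in MEDLINE layout, not real PubMed data
SAMPLE = """\
PMID- 00000001
TI  - A made-up title that wraps
      onto a second line.
AB  - A made-up abstract.
AU  - Doe J

PMID- 00000002
TI  - Another made-up title.
"""

records = parse_medline(SAMPLE)
for rec in records:
    print(rec.get("PMID"), "|", rec.get("TI"), "|", rec.get("AB", "(no abstract)"))
```

To use it on the downloaded file, read `./data/ageing_pubmed.txt` into a string and pass it to `parse_medline`. Biopython also provides `Bio.Medline.parse`, which handles the full format and is probably the better choice for real work.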

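While running it I hit the following problem:

(screenshot of the error: image.png)

This happened because too much data was being written, so the output file filled up.

So I switched to different code.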
