
A super-simple Python script for batch work under the Selenium + XPath framework

2021-01-01  by 瓶瓶瓶平平

In the last few days of this extraordinary year 2020, Miss Lin's share of her group project was to judge, for 155 papers, whether each one was an RCT (Randomized Controlled Trial) study.


[image: Dang! That's a lot of papers]

Miss Lin said she would read them one by one and screen them carefully.
"I'm in awe. Amazing! As expected of you."
Then she lay down and fell asleep.
Fine! Looking at Miss Lin's not-so-dazzling sleeping face, I decided to dust off the Selenium I hadn't touched in thousands of hours (I originally wanted to use Scrapy, but sadly found I had forgotten most of it).


First, of course, find the pattern!
I searched a few papers on PubMed and found that RCTs are already labeled there. So let's just get to it.


[image: PubMed's RCT label]
from selenium import webdriver
import time

goin = "C:/Users/LIFANGPING/Desktop/inclusion1.txt"    # your list of paper titles
goout = "C:/Users/LIFANGPING/Desktop/inclusionout.csv" # your output file

file = open(goin, "r")
lines = list(file.readlines())
file.close()
outfile = open(goout, "w")

chromedriver = r"C:\Users\LIFANGPING\AppData\Local\Google\Chrome\Application\chromedriver"  # start the browser
driver = webdriver.Chrome(chromedriver)
url = "https://pubmed.ncbi.nlm.nih.gov/29747957/"  # the page of some paper on PubMed, used as the starting point

driver.get(url)
time.sleep(2)  # leave enough time for the server to respond; every time.sleep below serves the same purpose

for i in lines:
    driver.refresh()
    time.sleep(1)
    print(i.strip(), end=",", file=outfile)

    time.sleep(2)
    need = i.strip()
    scan = driver.find_element_by_xpath('/html/body/form/div/div[1]/div/span/input')  # locate the search box
    scan.send_keys(need)  # type the search query

    time.sleep(2)
    scanclick = driver.find_element_by_xpath('/html/body/form/div/div[1]/div/button')  # locate the search button
    scanclick.click()  # click the search button
    time.sleep(2)
    try:  # if the search returns multiple papers, pick the best match
        bestmeet = driver.find_element_by_xpath('/html/body/main/div[9]/div[2]/section[1]/div[1]/article/div/a')
        bestmeet.click()
        time.sleep(2)
        doi = driver.find_element_by_xpath('.//*[@class="citation-doi"]').text  # grab the DOI
        time.sleep(1)
        print(doi, file=outfile)

    except Exception:
        try:  # the search may have landed directly on a single article page
            time.sleep(4)
            doi = driver.find_element_by_xpath('.//*[@class="citation-doi"]').text
            time.sleep(2)
            print(doi, file=outfile)

        except Exception:  # no result at all: reset and leave the DOI field empty
            driver.get(url)
            time.sleep(2)
            print("", file=outfile)
            continue

outfile.close()
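The text of PubMed's citation-doi element usually looks like "doi: 10.xxxx/yyyy. Epub ...", while the download step later wants a bare DOI. A small helper could strip it down; this is my own hypothetical addition, not part of the original script, and it assumes the element text follows that common pattern.

```python
import re

def extract_doi(citation_text):
    """Pull a bare DOI out of PubMed's 'citation-doi' element text.

    Hypothetical helper, not in the original script; assumes text like
    'doi: 10.1056/NEJMoa1812389. Epub 2018 Nov 10.'.
    """
    match = re.search(r'10\.\d{4,9}/[-._;()/:A-Za-z0-9]+', citation_text)
    if match:
        # the sentence-ending period is usually captured; strip it off
        return match.group(0).rstrip('.')
    return ""
```

With a cleaned DOI per line, the manual tidying of the DOI list later could largely be skipped.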

The results were genuinely moving, with a youthful glow!


[image: partial results — nearly all of the RCT papers!]

Of course, some papers were either not categorized or not indexed by PubMed at all. Miss Lin, now awake, opened her big little eyes and innocently insisted on going through those carefully by hand. (Fine! As it later turned out, barely any of the leftovers were RCTs.)

Alright, the next thing a script can do is download by DOI. A download needs a source, so enter the work of science goddess Alexandra Elbakyan, i.e. Sci-Hub (it seems that in the new site version she no longer waves — it snows instead).


[image: manually tidying up the DOI list]

Let's go! Dang!
A note here: I run my scripts in the Ubuntu subsystem on Windows, so I can use wget directly.

from selenium import webdriver
import os

download_dir = "C:/Users/LIFANGPING/Desktop/allpdf/"  # for Linux/*nix, e.g. download_dir = "/usr/Public"
options = webdriver.ChromeOptions()

profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],  # disable Chrome's built-in PDF viewer
           "download.default_directory": download_dir, "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome("C:/Users/LIFANGPING/AppData/Local/Google/Chrome/Application/chromedriver", options=options)
# the driver path argument is optional; if omitted, Selenium searches PATH

file = open("C:/Users/LIFANGPING/Desktop/doi-part2.txt", "r")
lines = list(file.readlines())
file.close()

srclist = []
for i in lines:
    doi = i.strip()
    print(doi)
    try:
        driver.get("https://sci-hub.se/" + doi)
        src = driver.find_element_by_xpath("//*[@id='pdf']").get_attribute("src")  # the PDF sits inside an iframe
        srclist.append(src)
    except Exception:
        continue

for i in srclist:  # iterate over srclist, not the last src
    command = "wget " + i
    os.system(command)

Note that the PDF shown on the Sci-Hub page is not actually part of that page; it is pulled in from another page through an iframe. So you first locate the iframe with XPath to get its src, then download that URL directly with wget.
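The loop above shells out with os.system. A variant of my own (not in the original post) builds an argument list for subprocess instead, which sidesteps shell-quoting issues; it also assumes the iframe src can be protocol-relative (starting with "//"), which I've seen on such pages, and prepends https: in that case.

```python
def build_wget_command(src, download_dir="."):
    """Turn an iframe src into a wget argument list.

    Sketch under assumptions: src may be protocol-relative ('//host/...'),
    so we prepend 'https:'; -P tells wget where to save the file.
    """
    if src.startswith("//"):
        src = "https:" + src
    return ["wget", "-P", download_dir, src]
```

Usage would be something like `subprocess.run(build_wget_command(src, "allpdf"))` inside the download loop.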
The results are quite excellent.


[image: all downloaded — but I don't actually want to read them]

Is there anything that can read the PDFs for me? (What follows is not entirely reliable.)
Let's use PyPDF2 and search for keywords. Mine are:

word_list = ['randomly assigned', 'randomlyassigned', 'random assi', 'randomass', 'randomizedcontrolledtrial', 'randomized controlled trial', 'randomlyallo', 'randomallo', 'random allo', 'Randomizedcontrolledtrial', 'Randomizedclinicaltrial', 'randomizedclinicaltrial']

Full code:

import PyPDF2
import os

path = r"C:\Users\LIFANGPING\Desktop\newpdf"
pdflist = os.listdir(path)

for pdfgo in pdflist:

    pdf_File = open(path + "/" + pdfgo, 'rb')
    print(pdfgo, end=",")
    try:
        pdf_Obj = PyPDF2.PdfFileReader(pdf_File)
        pages = pdf_Obj.getNumPages()

        # space-free variants are included because extractText() often drops spaces
        word_list = ['randomly assigned', 'randomlyassigned', 'random assi', 'randomass',
                     'randomizedcontrolledtrial', 'randomized controlled trial',
                     'randomlyallo', 'randomallo', 'random allo',
                     'Randomizedcontrolledtrial', 'Randomizedclinicaltrial',
                     'randomizedclinicaltrial']

        for w in word_list:
            page_list = []
            for p in range(0, pages):
                text = pdf_Obj.getPage(p).extractText().strip()

                if text.find(w) != -1:
                    page_list.append(p + 1)

            print(w, page_list, end=",")

        print()
    except Exception:
        continue
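The keyword list above carries space-collapsed variants because extractText() tends to mangle whitespace. An alternative sketch of my own (not in the post) normalizes the extracted text once, so a single keyword form matches regardless of how the PDF's spacing came out:

```python
import re

def normalize(text):
    """Lowercase and remove all whitespace, so one keyword form suffices."""
    return re.sub(r'\s+', '', text.lower())

def find_keyword_pages(page_texts, keyword):
    """Return 1-based page numbers whose normalized text contains the keyword.

    page_texts: list of per-page strings as extracted from the PDF.
    """
    key = normalize(keyword)
    return [i + 1 for i, t in enumerate(page_texts) if key in normalize(t)]
```

With this, the twelve-entry word_list could shrink to a few canonical phrases like "randomly assigned" and "randomized controlled trial".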

It's actually fairly reliable: non-RCTs match nothing at all, while the RCTs each match several keywords!

[image: non-RCTs match nothing; RCTs match several keywords]

Still, I couldn't quite bring myself to trust it, so I read them one by one anyway (why don't I trust the machine?).
Tired now. So tired.

Miss Lin's group project leader: "Mr. Li, who isn't even in our group, is truly a treasure!"
Meow meow meow?

Happy New Year, everyone!
