
A super-simple Python script for batch work under the Selenium + XPath framework

2021-01-01  by 瓶瓶瓶平平

In the last few days of this extraordinary year 2020, Miss Lin's share of her group project was to judge, for 155 papers, whether each one was an RCT (Randomized Controlled Trial) study.


[image: Dang! That's a lot of papers]

Miss Lin said she would read them one by one and screen them carefully.
"I'm in awe. Amazing! As expected of you."
Then she lay down and fell asleep.
Fine! Looking at Miss Lin's not-so-dazzling sleeping face, I decided to dust off the Selenium I hadn't touched in thousands of hours (I originally wanted to use Scrapy, but sadly found I had forgotten most of it).


First, of course, find the pattern!
I searched a few papers on PubMed and found that RCTs are already labeled there. So let's just get to it.


[image: PubMed's RCT label]
from selenium import webdriver
import time

goin = "C:/Users/LIFANGPING/Desktop/inclusion1.txt"    # your list of paper titles
goout = "C:/Users/LIFANGPING/Desktop/inclusionout.csv" # your output file

file = open(goin, "r")
lines = list(file.readlines())
file.close()
outfile = open(goout, "w")

chromedriver = r"C:\Users\LIFANGPING\AppData\Local\Google\Chrome\Application\chromedriver"  # start the browser
driver = webdriver.Chrome(chromedriver)
url = "https://pubmed.ncbi.nlm.nih.gov/29747957/"  # the page of some paper on PubMed, used as the starting point

driver.get(url)
time.sleep(2)  # leave enough time for the server to respond; every time.sleep below serves the same purpose

for i in lines:
    driver.refresh()
    time.sleep(1)
    print(i.strip(), end=",", file=outfile)

    time.sleep(2)
    need = i.strip()
    scan = driver.find_element_by_xpath('/html/body/form/div/div[1]/div/span/input')  # locate the search box
    scan.send_keys(need)  # type the search query

    time.sleep(2)
    scanclick = driver.find_element_by_xpath('/html/body/form/div/div[1]/div/button')  # locate the search button
    scanclick.click()  # click the search button
    time.sleep(2)
    try:  # if the search returns multiple papers, pick the best match
        bestmeet = driver.find_element_by_xpath('/html/body/main/div[9]/div[2]/section[1]/div[1]/article/div/a')
        bestmeet.click()
        time.sleep(2)
        doi = driver.find_element_by_xpath('.//*[@class="citation-doi"]').text  # grab the DOI
        time.sleep(1)
        print(doi, file=outfile)

    except Exception:
        try:  # the search may have landed directly on a single article page
            time.sleep(4)
            doi = driver.find_element_by_xpath('.//*[@class="citation-doi"]').text
            time.sleep(2)
            print(doi, file=outfile)

        except Exception:  # no result at all: reset and leave the DOI field empty
            driver.get(url)
            time.sleep(2)
            print("", file=outfile)
            continue

outfile.close()
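The text of PubMed's citation-doi element usually looks like "doi: 10.xxxx/yyyy. Epub ...", while the download step later wants a bare DOI. A small helper could strip it down; this is my own hypothetical addition, not part of the original script, and it assumes the element text follows that common pattern.

```python
import re

def extract_doi(citation_text):
    """Pull a bare DOI out of PubMed's 'citation-doi' element text.

    Hypothetical helper, not in the original script; assumes text like
    'doi: 10.1056/NEJMoa1812389. Epub 2018 Nov 10.'.
    """
    match = re.search(r'10\.\d{4,9}/[-._;()/:A-Za-z0-9]+', citation_text)
    if match:
        # the sentence-ending period is usually captured; strip it off
        return match.group(0).rstrip('.')
    return ""
```

With a cleaned DOI per line, the manual tidying of the DOI list later could largely be skipped.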

The results were genuinely moving, with a youthful glow!


[image: partial results — nearly all of the RCT papers!]

Of course, some papers were either not categorized or not indexed by PubMed at all. Miss Lin, now awake, opened her big little eyes and innocently insisted on going through those carefully by hand. (Fine! As it later turned out, barely any of the leftovers were RCTs.)

Alright, the next thing a script can do is download by DOI. A download needs a source, so enter the work of science goddess Alexandra Elbakyan, i.e. Sci-Hub (it seems that in the new site version she no longer waves — it snows instead).


[image: manually tidying up the DOI list]

Let's go! Dang!
A note here: I run my scripts in the Ubuntu subsystem on Windows, so I can use wget directly.

from selenium import webdriver
import os

download_dir = "C:/Users/LIFANGPING/Desktop/allpdf/"  # for Linux/*nix, e.g. download_dir = "/usr/Public"
options = webdriver.ChromeOptions()

profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],  # disable Chrome's built-in PDF viewer
           "download.default_directory": download_dir, "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome("C:/Users/LIFANGPING/AppData/Local/Google/Chrome/Application/chromedriver", options=options)
# the driver path argument is optional; if omitted, Selenium searches PATH

file = open("C:/Users/LIFANGPING/Desktop/doi-part2.txt", "r")
lines = list(file.readlines())
file.close()

srclist = []
for i in lines:
    doi = i.strip()
    print(doi)
    try:
        driver.get("https://sci-hub.se/" + doi)
        src = driver.find_element_by_xpath("//*[@id='pdf']").get_attribute("src")  # the PDF sits inside an iframe
        srclist.append(src)
    except Exception:
        continue

for i in srclist:  # iterate over srclist, not the last src
    command = "wget " + i
    os.system(command)

Note that the PDF shown on the Sci-Hub page is not actually part of that page; it is pulled in from another page through an iframe. So you first locate the iframe with XPath to get its src, then download that URL directly with wget.
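The loop above shells out with os.system. A variant of my own (not in the original post) builds an argument list for subprocess instead, which sidesteps shell-quoting issues; it also assumes the iframe src can be protocol-relative (starting with "//"), which I've seen on such pages, and prepends https: in that case.

```python
def build_wget_command(src, download_dir="."):
    """Turn an iframe src into a wget argument list.

    Sketch under assumptions: src may be protocol-relative ('//host/...'),
    so we prepend 'https:'; -P tells wget where to save the file.
    """
    if src.startswith("//"):
        src = "https:" + src
    return ["wget", "-P", download_dir, src]
```

Usage would be something like `subprocess.run(build_wget_command(src, "allpdf"))` inside the download loop.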
The results are quite excellent.


[image: all downloaded — but I don't actually want to read them]

Is there anything that can read the PDFs for me? (What follows is not entirely reliable.)
Let's use PyPDF2 and search for keywords. Mine are:

word_list = ['randomly assigned', 'randomlyassigned', 'random assi', 'randomass', 'randomizedcontrolledtrial', 'randomized controlled trial', 'randomlyallo', 'randomallo', 'random allo', 'Randomizedcontrolledtrial', 'Randomizedclinicaltrial', 'randomizedclinicaltrial']

Full code:

import PyPDF2
import os

path = r"C:\Users\LIFANGPING\Desktop\newpdf"
pdflist = os.listdir(path)

for pdfgo in pdflist:

    pdf_File = open(path + "/" + pdfgo, 'rb')
    print(pdfgo, end=",")
    try:
        pdf_Obj = PyPDF2.PdfFileReader(pdf_File)
        pages = pdf_Obj.getNumPages()

        # space-free variants are included because extractText() often drops spaces
        word_list = ['randomly assigned', 'randomlyassigned', 'random assi', 'randomass',
                     'randomizedcontrolledtrial', 'randomized controlled trial',
                     'randomlyallo', 'randomallo', 'random allo',
                     'Randomizedcontrolledtrial', 'Randomizedclinicaltrial',
                     'randomizedclinicaltrial']

        for w in word_list:
            page_list = []
            for p in range(0, pages):
                text = pdf_Obj.getPage(p).extractText().strip()

                if text.find(w) != -1:
                    page_list.append(p + 1)

            print(w, page_list, end=",")

        print()
    except Exception:
        continue
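The keyword list above carries space-collapsed variants because extractText() tends to mangle whitespace. An alternative sketch of my own (not in the post) normalizes the extracted text once, so a single keyword form matches regardless of how the PDF's spacing came out:

```python
import re

def normalize(text):
    """Lowercase and remove all whitespace, so one keyword form suffices."""
    return re.sub(r'\s+', '', text.lower())

def find_keyword_pages(page_texts, keyword):
    """Return 1-based page numbers whose normalized text contains the keyword.

    page_texts: list of per-page strings as extracted from the PDF.
    """
    key = normalize(keyword)
    return [i + 1 for i, t in enumerate(page_texts) if key in normalize(t)]
```

With this, the twelve-entry word_list could shrink to a few canonical phrases like "randomly assigned" and "randomized controlled trial".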

It's actually fairly reliable: non-RCTs match nothing at all, while the RCTs each match several keywords!

[image: non-RCTs match nothing; RCTs match several keywords]

Still, I couldn't quite bring myself to trust it, so I read them one by one anyway (why don't I trust the machine?).
Tired now. So tired.

Miss Lin's group project leader: "Mr. Li, who isn't even in our group, is truly a treasure!"
Meow meow meow?

Happy New Year, everyone!
