Python Scraper: Reading an Article List from a Word File and Looking Up Citations in Web of Science

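This article shows how to write a Python scraper that reads a list of article titles from a Word file, looks each one up in Web of Science, and collects its citation data: self-citations, citations by others, and the list of citing articles. The results are saved to a new Word file.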

Implementation

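import re

import docx
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Read the article list, one title per paragraph, skipping blank paragraphs
doc = docx.Document(r'e:\Users\Administrator\Desktop\文章列表.docx')
article_list = [para.text.strip() for para in doc.paragraphs if para.text.strip()]

# Create the Word document that collects the results
new_doc = docx.Document()

# Open Web of Science (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service(executable_path=r'chromedriver.exe'))
driver.get('https://www.webofknowledge.com')

# Log in
input('Log in manually, open the search page, then press Enter to continue...')
# Remember the current page so that every iteration can return to the search form
# (this assumes the search page is open when Enter is pressed)
search_url = driver.current_url

# Query each article in turn
for article in article_list:
    # Return to the search form; the previous iteration may have navigated away
    driver.get(search_url)

    # Enter the article title in the search box
    search_box = driver.find_element(By.ID, 'value(input1)')
    search_box.clear()
    search_box.send_keys(article)
    driver.find_element(By.CLASS_NAME, 'searchButton').click()

    # Read the citation data; fall back to placeholders if any element is missing
    try:
        cited_num = driver.find_element(By.CSS_SELECTOR, '.snowplow-citedref-times-cited-count-link').text
        self_cited_num = driver.find_element(By.CSS_SELECTOR, '.snowplow-citedref-self-cite-count-link').text
        other_cited_num = driver.find_element(By.CSS_SELECTOR, '.snowplow-citedref-related-records-count-link').text
        citing_article = driver.find_element(By.CSS_SELECTOR, '.snowplow-citedref-times-cited-full-record-link').get_attribute('href')
    except NoSuchElementException:
        cited_num = 'No data available'
        self_cited_num = 'No data available'
        other_cited_num = 'No data available'
        citing_article = None

    # Write the citation data to the new Word file
    new_doc.add_paragraph('Article title: ' + article)
    new_doc.add_paragraph('Times cited: ' + cited_num)
    new_doc.add_paragraph('Self-citations: ' + self_cited_num)
    new_doc.add_paragraph('Citations by others: ' + other_cited_num)
    new_doc.add_paragraph('Citing articles:')
    if citing_article is None:
        new_doc.add_paragraph('No data available')
    else:
        driver.get(citing_article)
        # Replace characters that Windows does not allow in file names
        file_name = re.sub(r'[\\/:*?"<>|]', '_', article) + '.docx'
        citing_doc = docx.Document()
        for para in driver.find_elements(By.CSS_SELECTOR, '.l-column-content div p'):
            citing_doc.add_paragraph(para.text)
        citing_doc.save(file_name)
        new_doc.add_paragraph('Saved locally: ' + file_name)

new_doc.save('引用情况.docx')
driver.quit()
print('Citation report saved locally: 引用情况.docx')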

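How the code works

The code uses Selenium to drive a browser and the python-docx package (imported as docx) to read and write Word files.
It first reads the article list from the Word file, treating each paragraph as one title, and then loops over the titles.
For each title, it uses Selenium to search for the article in Web of Science and reads the citation counts from the result page.
It then writes the citation counts to a new Word file and saves the list of citing articles to a separate file.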
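
The title-reading step described above can be sketched without a browser or a Word file. This is a minimal sketch rather than part of the original script: `clean_titles` is a hypothetical helper, and the decision to collapse repeated whitespace and drop duplicate titles is an assumption about what a typical article list needs.

```python
def clean_titles(paragraph_texts):
    """Normalize paragraph texts into a list of unique, non-empty titles.

    Whitespace runs are collapsed to a single space, blank paragraphs are
    dropped, and duplicates are removed while the original order is kept.
    """
    seen = set()
    titles = []
    for text in paragraph_texts:
        title = ' '.join(text.split())  # strips the ends and collapses inner whitespace
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles
```

With python-docx, it could be called as `clean_titles(para.text for para in doc.paragraphs)`, so that a blank line or a repeated title in the Word file does not trigger an extra search.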

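Notes

The code calls input() to pause until the user has logged in to Web of Science manually; the searches themselves are then performed by the script.
The code uses a try...except statement to handle result pages on which the expected citation elements are missing.
The code saves each article's list of citing articles to a local Word file and records that file's name (not a hyperlink) in the new Word file.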
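
One consequence of the try...except note above is that a single missing element replaces all four values with placeholders. A per-field fallback is one possible alternative. The sketch below is an assumption-laden illustration, not the script's method: `read_fields`, its parameters, and the `'No data available'` placeholder are hypothetical, and the lookup function is passed in so the example runs without Selenium.

```python
NO_DATA = 'No data available'  # hypothetical placeholder text


def read_fields(find_text, selectors, missing_error=LookupError, default=NO_DATA):
    """Look up each field on its own so one missing element does not blank the rest.

    find_text     -- callable that takes a selector and returns the element text
    selectors     -- mapping of field name to selector
    missing_error -- exception type raised by find_text when nothing matches
    """
    result = {}
    for name, selector in selectors.items():
        try:
            result[name] = find_text(selector)
        except missing_error:
            result[name] = default
    return result


# A dictionary stands in for the page: a missing key raises KeyError,
# which is a subclass of LookupError.
fake_page = {'.total': '12'}
fields = read_fields(fake_page.__getitem__, {'total': '.total', 'self': '.self'})
```

With Selenium, `find_text` could be `lambda css: driver.find_element(By.CSS_SELECTOR, css).text` and `missing_error` could be `NoSuchElementException`.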

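Summary

This article showed how to write a Python scraper that reads an article list from a Word file, looks up each article's citation data in Web of Science, and saves the results to a new Word file. The script lets a user check citations for a large number of articles in one run and keeps the results in a form that is easy to organize.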
