Python in Practice: Merging WOS Literature Records and Running a Keyword Frequency Analysis

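I wanted a basic picture of what my discipline (Public Administration) has been working on over the past decade, and one of the most direct ways to get it is to look at which topics the main journals publish. The task breaks down into three steps: 1) download the literature records; 2) clean them into a raw dataset for frequency analysis, with one journal article as one observation and the keywords as the main variable; 3) run a frequency analysis on the keywords. (Note 1: these three steps could probably be handled in a single script, but for the sake of learning I split the task up, with one piece of code per sub-task. Note 2: my purpose here is not a formal literature analysis but to get an intuitive feel for my discipline, as a reference for decisions. For dedicated literature analysis and visualized results, specialized software such as CiteSpace can help.)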

Downloading the literature

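Using WOS (Web of Science), on March 4, 2021 we first downloaded about 7,500 records from the main journals of the discipline over the past decade. The procedure is recorded below.

WOS Advanced Search -> set the time span to 2010-2021, the language to English, and the document type to Article -> enter the following in the search box: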

　　SO=(Public Administration Review) OR

　　SO=(Journal of Public Administration Research and Theory) OR

　　SO=(Public Management Review) OR

　　SO=(International Journal of Public Administration) OR

　　SO=(International Public Management Journal) OR

　　SO=(Public Administration) OR

　　SO=(Governance-An International Journal of Policy Administration and Institution) OR

　　SO=(International Review of Administrative Science) OR

　　SO=(Journal of Human Resources) OR

　　SO=(Administrative Science Quarterly) OR

　　SO=(American Review of Public Administration) OR

　　SO=(Review of Public Personnel Administration) OR

　　SO=(Local Government Studies) OR

　　SO=(Social Policy & Administration) OR

　　SO=(Public Policy and Administration) OR

　　SO=(Nonprofit Management & Leadership) OR

　　SO=(Administration & Society)
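
With this many journals the query is tedious to type by hand. As a minimal sketch, the string could be assembled from a list of journal titles. The helper name build_source_query is hypothetical, and only three of the journals above are shown:

```python
def build_source_query(journals):
    """Join journal titles into a search string of SO=(...) clauses."""
    clauses = [f"SO=({title})" for title in journals]
    return " OR ".join(clauses)


# Three titles taken from the list above; extend the list as needed
journals = [
    "Public Administration Review",
    "Public Management Review",
    "Administration & Society",
]
query = build_source_query(journals)
print(query)
```

The printed string can then be pasted into the search box as a single line.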

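Building the raw dataset

The task in this step is to merge 15 Excel files that share the same format. WOS can save only 500 records per export, so we ended up with 15 Excel files, and building a raw dataset of about 7,500 records means merging all 15. The code is as follows: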

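import os

import pandas as pd

# Folder that holds the exported files
pwd = "./0304"
# Name of the merged output; skipped below so that a re-run does not read it back in
output_name = "raw_data.xlsx"

# Read each Excel file (all share the same structure) into a DataFrame and collect them in a list
dfs = []
for root, dirs, files in os.walk(pwd):
    for file in files:
        file_path = os.path.join(root, file)
        # print(file_path)  # Printing the paths explained the error "xlrd.biffh.XLRDError: Unsupported format, or corrupt file": the folder contained a hidden .DS_Store file that xlrd cannot read, hence the check below
        # Only read Excel files; this avoids the xlrd error
        if file.lower().endswith((".xls", ".xlsx")) and file != output_name:
            df = pd.read_excel(file_path)
            dfs.append(df)

# Stack all tables into one and save the result
df = pd.concat(dfs, ignore_index=True)
# Saved as .xlsx because pandas 2.0 and later can no longer write the legacy .xls format
df.to_excel(os.path.join(pwd, output_name), index=False)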
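
A plain substring test for "xls" on the full path would also accept a path whose folder name happens to contain those letters. As a minimal sketch, a stricter check can look at the file extension only. The helper is_excel_file and the sample file names are hypothetical:

```python
import os


def is_excel_file(file_name):
    """Return True only when the extension is .xls or .xlsx (case-insensitive)."""
    extension = os.path.splitext(file_name)[1].lower()
    return extension in {".xls", ".xlsx"}


# Made-up file names, including a hidden macOS file and a misleading .txt name
names = [".DS_Store", "savedrecs.xls", "savedrecs(1).XLS", "notes_xls.txt", "raw_data.xlsx"]
excel_names = [name for name in names if is_excel_file(name)]
print(excel_names)
```

os.path.splitext treats a leading dot as part of the name, so .DS_Store has an empty extension and is filtered out.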

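Keyword frequency analysis

Looking at the records WOS produces, some journals do not require authors to supply keywords, so there are missing values. Next to the "author_keywords" variable, however, there is also a "keywords_plus" variable, whose keywords are generated from the titles of the works each article cites. The task therefore has two parts: 1) for articles with no "author_keywords" value, fill in the value from "keywords_plus"; 2) run a frequency analysis on "author_keywords". Starting from the dataset built in step two, prepare an Excel file that contains only the "author_keywords" and "keywords_plus" columns.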

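I ran into the following pitfalls:

1) If "author_keywords" is empty, fill it with the value of "keywords_plus";

2) Letter case: the same keyword can appear with different capitalization, so everything is converted to lowercase;

3) Keywords are separated by a semicolon followed by a space.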
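
The three pitfalls can be sketched as one small function that cleans a single article's keywords. The name extract_keywords and the sample keyword strings are made up for illustration:

```python
def extract_keywords(author_keywords, keywords_plus):
    """Return the cleaned keyword list for one article."""
    # Pitfall 1: fall back to keywords_plus when author_keywords is empty
    raw = author_keywords if author_keywords.strip() != "" else keywords_plus
    # Pitfalls 2 and 3: split on ";", trim the surrounding spaces, lowercase
    return [part.strip().lower() for part in raw.split(";") if part.strip() != ""]


print(extract_keywords("Public Service Motivation; Red Tape", "PERFORMANCE; MANAGEMENT"))
print(extract_keywords("", "PERFORMANCE; MANAGEMENT"))
```

The full script that follows applies the same three rules row by row.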

import xlrd
from collections import Counter
from operator import itemgetter

# step 0. Read the Excel file
# Note: xlrd 2.0 and later read only .xls files; opening this .xlsx file needs xlrd < 2.0
file_path = "./Book1.xlsx"
table = xlrd.open_workbook(file_path).sheets()[0]  # read the first sheet
nrow, ncol = table.nrows, table.ncols  # number of rows and columns; the first row is the header
results = Counter()

# Process each row
for index in range(1, nrow):  # row 0 is the header
    # step 1. Get the value
    value = table.cell_value(index, 0)  # index is the row number from the loop; 0 is the first column ("author_keywords")
    if value == "":
        value = table.cell_value(index, 1)  # if empty, take the value from the "keywords_plus" column
    # step 2. Split and count
    keywords = []
    for word in value.split(";"):
        if word.strip() != "":
            keywords.append(word.strip().lower())
    # keywords = [word.strip().lower() for word in value.split(";") if word.strip() != ""]  # this single line can replace the four lines above with the same effect, and is more concise
    results.update(keywords)

# step 3. Sort the results
# key and reverse are parameters of sorted; with key=itemgetter(0) the sort would be by the keyword itself instead of by its count
ordered_results = sorted(results.items(), key=itemgetter(1), reverse=True)

# step 4. Write the results
save_file_path = "./frequency.txt"
with open(save_file_path, "w", encoding="utf-8") as f:
    for key, value in ordered_results:
        f.write(f"{key} : {value}\n")
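
Because results is a collections.Counter, its most_common method is a possible shortcut for the sorting step: it returns (keyword, count) pairs ordered from most to least frequent. A minimal sketch on made-up keywords:

```python
from collections import Counter

# Made-up keyword lists standing in for three articles
results = Counter()
results.update(["governance", "performance"])
results.update(["performance", "trust"])
results.update(["performance", "governance"])

# The two most frequent keywords with their counts
top_two = results.most_common(2)
print(top_two)
```

Calling most_common() without an argument returns every keyword, which matches what the script writes to frequency.txt.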

Final result
