Scraping blogger information from CSDN's Python blog posts with selenium + xpath
from selenium.webdriver import Chrome
from lxml import etree
import time
import requests
import json
import os


class CSDN_Spider():
    def __init__(self):
        self.url = "https://www.csdn.net/nav/python"
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
        }
        # executable_path is the Selenium 3 way to locate chromedriver;
        # Selenium 4 expects a Service object instead.
        self.browser = Chrome(executable_path="D:/chromedriver_win32/chromedriver.exe")

    def get_all_articles_url(self):
        """Scroll the listing page repeatedly and collect the article links."""
        items = []
        self.browser.get(self.url)
        for i in range(100):
            text = self.browser.page_source
            html = etree.HTML(text)
            tag_a_url = html.xpath("//div[@class='title']//a/@href")
            items += tag_a_url
            # Scroll far down so the page loads more articles.
            js = "var q=document.documentElement.scrollTop=100000"
            self.browser.execute_script(js)
            time.sleep(1)
        # quit() closes the window and also stops the chromedriver process.
        self.browser.quit()
        # Every pass re-reads links seen earlier, so deduplicate.
        return list(set(items))

    def get_detail_page_text(self, url):
        """Return the HTML of an article page, or None if the request failed."""
        response = requests.get(url, headers=self.headers)
        if response.status_code == 200:
            return response.text
        return None

    def parse_detail_page(self, text):
        """Extract the author's profile fields from an article page."""
        html = etree.HTML(text)
        item = {}
        try:
            item["name"] = html.xpath("//a[@class='follow-nickName ']/text()")[0]
            item["code_age"] = html.xpath("//span[@class='personal-home-page personal-home-years']/text()")[0]
            item["authentication"] = html.xpath("//a[@class='personal-home-certification']/@title")[0]
            # The nine counters appear in a fixed order in the author sidebar.
            digital_data = html.xpath("//dl[@class='text-center']//span[@class='count']/text()")
            item["original"] = digital_data[0]
            item["week_rank"] = digital_data[1]
            item["all_rank"] = digital_data[2]
            item["vivstor_num"] = digital_data[3]
            item["integral"] = digital_data[4]
            item["fans_num"] = digital_data[5]
            item["be_praised_num"] = digital_data[6]
            item["review_num"] = digital_data[7]
            item["collection"] = digital_data[8]
        except (IndexError, AttributeError):
            # A missing element (or an unparsable page) marks the record as invalid.
            print("this item is erroneous")
            item["name"] = "invalidity"
        return item

    def save_data(self, item):
        """Append the item to the output file as one JSON object per line."""
        os.makedirs("./data", exist_ok=True)
        with open("./data/csdn_authors.json", "a", encoding="utf-8") as fp:
            json.dump(item, fp, ensure_ascii=False)
            fp.write("\n")

    def main(self):
        articles_urls = self.get_all_articles_url()
        for url in articles_urls:
            text = self.get_detail_page_text(url)
            if text is not None:
                item = self.parse_detail_page(text)
                self.save_data(item)
                print(item["name"] + " saved to the local JSON file successfully")


if __name__ == '__main__':
    cs = CSDN_Spider()
    cs.main()
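Because save_data appends one JSON object per line, and several articles can belong to the same author, the output file may contain repeated authors as well as the "invalidity" placeholder records. The following is a minimal sketch of how the file could be read back and cleaned up; the helper name load_authors is hypothetical and not part of the original script, and keeping the first record per name is only one possible choice.

```python
import json


def load_authors(path):
    """Read a JSON-lines file and return one record per distinct author name.

    Hypothetical helper: it is not part of the original spider.
    """
    authors = {}
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            item = json.loads(line)
            name = item.get("name")
            # "invalidity" is the placeholder the spider writes when parsing fails.
            if name is None or name == "invalidity":
                continue
            # Keep the first record seen for each author.
            authors.setdefault(name, item)
    return list(authors.values())
```

A call such as load_authors("./data/csdn_authors.json") would then return a list of dictionaries with the same keys that parse_detail_page produces.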
The scraping results are as follows: