Crawling the author info of CSDN Python blog posts with Selenium + XPath
from selenium.webdriver import Chrome
from lxml import etree
import time
import requests
import json


class CSDN_Spider():
    def __init__(self):
        self.url = "https://www.csdn.net/nav/python"
        self.headers = {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
        }
        self.browser = Chrome(executable_path="D:/chromedriver_win32/chromedriver.exe")

    def get_all_articles_url(self):
        """Scroll the feed page repeatedly and collect article links."""
        items = []
        self.browser.get(self.url)
        for i in range(100):
            text = self.browser.page_source
            html = etree.HTML(text)
            tag_a_url = html.xpath("//div[@class='title']//a/@href")
            items += tag_a_url
            # Scroll to the bottom so the page loads the next batch of articles.
            js = "var q=document.documentElement.scrollTop=100000"
            self.browser.execute_script(js)
            time.sleep(1)
        self.browser.close()
        # Deduplicate the collected links (order is not preserved).
        return list(set(items))

    def get_detail_page_text(self, url):
        """Fetch an article page; return its HTML, or None on failure."""
        response = requests.get(url, headers=self.headers)
        if response.status_code == 200:
            return response.text
        return None

    def parse_detail_page(self, text):
        """Extract the author's profile fields from an article page."""
        html = etree.HTML(text)
        item = {}
        try:
            item["name"] = html.xpath("//a[@class='follow-nickName ']/text()")[0]
            item["code_age"] = html.xpath("//span[@class='personal-home-page personal-home-years']/text()")[0]
            item["authentication"] = html.xpath("//a[@class='personal-home-certification']/@title")[0]
            digital_data = html.xpath("//dl[@class='text-center']//span[@class='count']/text()")
            item["original"] = digital_data[0]
            item["week_rank"] = digital_data[1]
            item["all_rank"] = digital_data[2]
            item["visitor_num"] = digital_data[3]
            item["integral"] = digital_data[4]
            item["fans_num"] = digital_data[5]
            item["be_praised_num"] = digital_data[6]
            item["review_num"] = digital_data[7]
            item["collection"] = digital_data[8]
        except IndexError:
            print("this item is erroneous")
            item["name"] = "invalidity"
        return item

    def save_data(self, item):
        """Append one author record per line (JSON Lines) to the output file."""
        with open("./data/csdn_authors.json", "a", encoding="utf-8") as fp:
            json.dump(item, fp, ensure_ascii=False)
            fp.write("\n")

    def main(self):
        articles_urls = self.get_all_articles_url()
        for url in articles_urls:
            text = self.get_detail_page_text(url)
            if text is not None:
                item = self.parse_detail_page(text)
                self.save_data(item)
                print(item["name"] + " saved to the local JSON file")


if __name__ == '__main__':
    cs = CSDN_Spider()
    cs.main()
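The XPath predicates used in parse_detail_page (note the trailing space in 'follow-nickName ', which must match the page's class attribute exactly) can be sanity-checked offline without a browser. A minimal sketch using only the standard library's xml.etree, whose limited XPath support covers the same attribute-predicate shape; the snippet markup and values are made up:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the author-card markup; the class names mirror
# the selectors in parse_detail_page, the values are invented.
snippet = """
<div>
  <a class="follow-nickName ">some_author</a>
  <span class="count">42</span>
</div>
"""

root = ET.fromstring(snippet)
# xml.etree supports the same [@attr='value'] predicate shape;
# text content is read via .text instead of lxml's text().
name = root.find(".//a[@class='follow-nickName ']").text
count = root.find(".//span[@class='count']").text
print(name, count)  # → some_author 42
```

lxml evaluates the identical predicates, so a mismatch here usually means the class name (including whitespace) no longer matches the live page.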
The crawl results are as follows:
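Because save_data appends one JSON object per line (the JSON Lines convention), the output file can be read back record by record. A minimal sketch, with made-up sample records standing in for the real file:

```python
import io
import json

# Simulate the ./data/csdn_authors.json file produced by save_data:
# one JSON object per line (the records here are invented).
sample_file = io.StringIO(
    '{"name": "author_a", "fans_num": "120"}\n'
    '{"name": "author_b", "fans_num": "45"}\n'
)

# Read the file back one record at a time.
records = [json.loads(line) for line in sample_file if line.strip()]
print(records[0]["name"])  # → author_a
```

Appending line by line means a crash mid-crawl loses at most the current record, and already-saved records stay intact.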