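Scraping the 91job Employment Knowledge Competition Question Bank
Recently there was a college student employment knowledge competition (you need a student account to log in). Reading the question bank on the site itself makes it hard to memorize, so my approach was to scrape the question bank with a crawler and then look answers up directly while doing the quiz. I had some free time today, so I am writing up how I scraped it. For learning purposes only.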
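- In the site's network traffic, find the request whose response carries the data. As the screenshot shows, it is the request named question, and its response contains the data we want.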
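- Once you have located that response, look at the request URL and at the headers sent with the request. Logging in requires a school student account, so the cookie in those headers certainly contains the account information, which means the code must send the cookie too, as shown in the figure.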
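A minimal sketch of building those request headers without hard-coding the cookie in the script. The environment variable name JOB_COOKIE and the helper build_headers are hypothetical and not part of the original code; the two User-Agent strings are taken from the list used later in this post.

```python
import os
import random

# Two entries borrowed from the User-Agent list used later in this post.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
]


def build_headers(cookie, user_agents=USER_AGENTS):
    """Return the two headers the site needs: a User-Agent and the login cookie."""
    if not cookie:
        raise ValueError("a logged-in Cookie header value is required")
    return {
        "User-Agent": random.choice(user_agents),
        "Cookie": cookie,
    }


# JOB_COOKIE is a hypothetical variable name; the fallback is a dummy value.
headers = build_headers(os.environ.get("JOB_COOKIE", "PHPSESSID2=placeholder"))
```

The resulting dict can be passed as headers=headers to requests.get. Keeping the cookie out of the source also avoids publishing account details along with the code.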
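- With the request URL found, we can work out where in the page the data we need lives (this takes some XPath knowledge). We need four things: 1. the question 2. the question's options 3. the correct answer 4. the link to the next question; the figure below shows exactly what we are looking for.
After looking at a few questions, it is clear that the URL of the next question follows a pattern, so there is no need to find its XPath.
XPath for the question: "//div[@class='title']//text()"
XPath for the options: "//div[@class='answer']//text()"
XPath for the correct answer: "//div[@class='right']//text()"
URL of the next question: http://ccit.91job.org.cn/contest/question?page=i Note: in page=i, i is a number, and that number is the number of the question; there are 530 questions in total.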
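The URL pattern and the three expressions above can be tried offline before writing the crawler. The sketch below is only an illustration: the sample markup is hypothetical (only the three class names come from this post), and it uses the XPath subset of the standard library's xml.etree.ElementTree as a stand-in for lxml, which works here only because the sample is well-formed XML. Real pages need lxml's etree.HTML as in the scripts that follow.

```python
import xml.etree.ElementTree as ET

PAGE_URL = "http://ccit.91job.org.cn/contest/question?page={}"

# Hypothetical, simplified page: only the class names are from the post.
SAMPLE = """
<html><body>
  <div class="title">1. Sample question text</div>
  <div class="answer"><p>A. first option</p><p>B. second option</p></div>
  <div class="right">A</div>
</body></html>
"""


def texts(root, css_class):
    """Return the non-empty text nodes under every <div class=css_class>."""
    found = []
    for div in root.findall(f".//div[@class='{css_class}']"):
        found.extend(t.strip() for t in div.itertext() if t.strip())
    return found


root = ET.fromstring(SAMPLE)
item = {
    "test": texts(root, "title"),
    "answer": texts(root, "answer"),
    "right": texts(root, "right"),
}
# One URL per question, pages 1 through 530.
urls = [PAGE_URL.format(page) for page in range(1, 531)]
print(item)
print(len(urls), urls[0], urls[-1])
```

The dictionary keys match the ones used by the scripts in this post, so the shape of item is the same as what ends up in the output file.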
- Writing the code
# coding=utf-8
import requests
from lxml import etree
import json
import time
import random


class Job():
    """Innovation and entrepreneurship quiz questions"""

    def __init__(self, url):
        self.url = url
        self.comment_url = "http://ccit.91job.org.cn/contest/question?page={}"  # URL template for the next page
        self.USER_AGENT_LIST = [
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
"Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
"Mozilla/2.02E (Win95; U)",
"Mozilla/3.01Gold (Win95; I)",
"Mozilla/4.8 [en] (Windows NT 5.1; U)",
"Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
"HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
"Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        ]  # lets the program pose as different devices when requesting the site
        self.headers = {
            "User-Agent": random.choice(self.USER_AGENT_LIST),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Cookie": "<paste the Cookie header copied from your own logged-in browser session>",
        }  # Accept, Accept-Encoding and Accept-Language are optional; User-Agent and Cookie alone are enough
    def parse_url(self, url):
        """Send the request and return the decoded page"""
        # Pick a fresh User-Agent per request so the rotation actually happens
        headers = {**self.headers, "User-Agent": random.choice(self.USER_AGENT_LIST)}
        html = requests.get(url, headers=headers)
        return html.content.decode()

    def get_content_list(self, html):
        """Extract the data"""
        html_xml = etree.HTML(html)
        item = dict()
        item["test"] = [i.replace("\n", "") for i in html_xml.xpath("//div[@class='title']//text()")]
        item["answer"] = [i.replace("\r\n", "") for i in html_xml.xpath("//div[@class='answer']//text()")]
        item["right"] = html_xml.xpath("//div[@class='right']//text()")
        return item

    def save_content(self, content):
        """Save the data"""
        file_path = "job.json"
        # Append mode: each question is written as its own JSON object
        with open(file_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(content, ensure_ascii=False, indent=2))
            f.write('\n')
        print("Saved successfully")

    def run(self):
        """Main flow control"""
        for i in range(1, 531):
            # 1. Build the URL of question i; building it before the request
            #    covers pages 1 to 530 exactly once (the earlier order fetched
            #    the first question twice and never reached page 530)
            next_url = self.comment_url.format(i)
            # 2. Send the request and get the page
            html = self.parse_url(next_url)
            # 3. Extract the data
            content = self.get_content_list(html)
            # 4. Save the data
            self.save_content(content)
            print(i)
            time.sleep(5)


if __name__ == '__main__':
    url = "http://ccit.91job.org.cn/contest/question"
    job = Job(url)
    job.run()
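The script above appends one pretty-printed JSON object per question to job.json, so the file as a whole is a sequence of JSON documents rather than a single valid one, and json.load on it would fail. A minimal sketch of reading such a file back, assuming that layout; the helper name iter_json_objects is hypothetical:

```python
import json


def iter_json_objects(text):
    """Yield each value from a string holding several JSON documents back to back."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over the
        # newlines between documents ourselves.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Build a small sample in the same layout save_content produces.
sample = ""
for number in (1, 2):
    record = {"test": [f"question {number}"], "answer": ["A", "B"], "right": ["A"]}
    sample += json.dumps(record, ensure_ascii=False, indent=2) + "\n"

questions = list(iter_json_objects(sample))
print(len(questions))
```

With the real file, read it with open("job.json", encoding="utf-8").read() and pass the text to the same function.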
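- After running the code above, I found that scraping was very slow. The cause turned out to be the site's very slow responses, so I rewrote the code to use multiple threads, which made it much faster. The code is as follows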
# coding=utf-8
import requests
from lxml import etree
import json
import time
import random
from queue import Queue
import threading


class Job():
    """Innovation and entrepreneurship quiz questions"""

    def __init__(self, url):
        """
        Constructor; sets up the instance
        :param url:
        """
        self.url = url  # not actually used
        self.session = requests.session()
        self.comment_url = "http://ccit.91job.org.cn/contest/question?page={}"
        self.USER_AGENT_LIST = [
"Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
"Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
"Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
"Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
"Mozilla/2.02E (Win95; U)",
"Mozilla/3.01Gold (Win95; I)",
"Mozilla/4.8 [en] (Windows NT 5.1; U)",
"Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
"HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
"Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
"Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
"Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
]
        self.headers = {
            "User-Agent": random.choice(self.USER_AGENT_LIST),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Cookie": "<paste the Cookie header copied from your own logged-in browser session>",
        }
        self.url_queue = Queue()
        self.html_queue = Queue()
        self.content_queue = Queue()

    def get_url_list(self):
        """
        Build the URLs and put them in the queue
        :return: None
        """
        for i in range(1, 531):
            self.url_queue.put(self.comment_url.format(i))

    def get_url(self):
        """
        Request each URL and queue the response
        :return: None
        """
        while True:
            url = self.url_queue.get()
            print(url)
            response = self.session.get(url, headers=self.headers)
            self.html_queue.put(response.content.decode())
            self.url_queue.task_done()  # mark this queue item as processed

    def get_content_list(self):
        """
        Extract the data and format it
        :return: None
        """
        while True:
            html_xml = etree.HTML(self.html_queue.get())
            item = dict()
            item["test"] = [i.replace("\n", "") for i in html_xml.xpath("//div[@class='title']//text()")]
            item["answer"] = [i.replace("\r\n", "") for i in html_xml.xpath("//div[@class='answer']//text()")]
            item["right"] = html_xml.xpath("//div[@class='right']//text()")
            self.content_queue.put(item)
            self.html_queue.task_done()  # mark this queue item as processed

    def save_content(self):
        """
        Save the data
        :return: None
        """
        file_path = "job1.json"
        while True:
            with open(file_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(self.content_queue.get(), ensure_ascii=False, indent=2))
                f.write('\n')
            self.content_queue.task_done()
            print("Saved successfully")

    def run(self):
        """
        Main flow control
        :return: None
        """
        # Fill the URL queue before starting the workers; if this ran in its
        # own thread, url_queue.join() below could return before anything
        # had been queued
        self.get_url_list()
        thread_list = list()
        for i in range(5):
            # Send the requests
            get_url = threading.Thread(target=self.get_url)
            thread_list.append(get_url)
        for i in range(4):
            # Extract the data
            content = threading.Thread(target=self.get_content_list)
            thread_list.append(content)
        # Save
        save = threading.Thread(target=self.save_content)
        thread_list.append(save)
        for t in thread_list:
            t.daemon = True  # daemon threads are stopped when the main thread exits
            t.start()
        for q in [self.url_queue, self.html_queue, self.content_queue]:
            q.join()  # block the main thread until every item in the queue is marked done
        print("Main thread finished")


if __name__ == '__main__':
    url = "http://ccit.91job.org.cn/contest/question"  # not actually used; the first question is page=1
    job = Job(url)
    job.run()
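The queue-plus-daemon-thread pattern used above can be tried without touching the network. The sketch below is a simplified, single-stage version under stated assumptions: fetch_all and fake_fetch are hypothetical names, and fake_fetch stands in for the real HTTP request. It shows the two details that keep join() reliable: all work is queued before join() is called, and task_done() is reached even when a fetch fails.

```python
import threading
from queue import Queue


def fetch_all(pages, fetch, workers=5):
    """Run fetch(page) for every page on `workers` daemon threads."""
    page_queue = Queue()
    results = {}
    lock = threading.Lock()

    # Queue all work before join() is called, so join() cannot return early.
    for page in pages:
        page_queue.put(page)

    def worker():
        while True:
            page = page_queue.get()
            try:
                body = fetch(page)
            except Exception as exc:  # keep the worker alive on a failed page
                body = exc
            with lock:
                results[page] = body
            page_queue.task_done()  # reached even when fetch raised

    for _ in range(workers):
        threading.Thread(target=worker, daemon=True).start()

    page_queue.join()  # returns once task_done() was called for every page
    return results


def fake_fetch(page):
    """Stand-in for the real request; returns a dummy page body."""
    return f"<html>question {page}</html>"


pages = fetch_all(range(1, 531), fake_fetch)
print(len(pages))
```

Storing the exception instead of re-raising it is a design choice for this sketch: a worker that dies without calling task_done() would leave join() waiting forever.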
- Scraping results
If this is not detailed enough, see the following notes
Crawler notes