Crawlers never rest
Recently I improved the crawler from my previous post. Instead of Douban, it now crawls JD.com product reviews. Here are a few screenshots to study first.
After some digging, I found that the product id is simply the number before `.html` in the product page link. Copy it down for later use.
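If you'd rather pull the id out programmatically instead of copying it by hand, a minimal sketch could look like this (the helper name and sample link are my own, not part of the crawler):

```python
import re

def extract_product_id(url):
    # JD item links end in "<product id>.html"; grab the digits before it
    m = re.search(r'/(\d+)\.html', url)
    return m.group(1) if m else None

print(extract_product_id('https://item.jd.com/6784500.html'))  # -> 6784500
```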
1. Modify and extend the code from the previous post
class Spider():
    def __init__(self):
        # score: 1 = negative reviews; 2 = neutral; 3 = positive; 0 = all reviews
        # This is the URL of JD's jsonp callback; we fill in the product id and the review page number.
        self.start_url = 'https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv30672&productId={}&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&rid=0&fold=1'
        # add a new queue for (product id, page) pairs
        self.pageQurl = Queue()
        # the previous post used a list; this time we use a dict keyed by product id
        self.data = dict()
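To see what a concrete request looks like, here is a quick sketch of filling the two `{}` placeholders (product id and page number) in `start_url`:

```python
start_url = ('https://sclub.jd.com/comment/productPageComments.action'
             '?callback=fetchJSON_comment98vv30672&productId={}&score=0'
             '&sortType=5&page={}&pageSize=10&isShadowSku=0&rid=0&fold=1')

# fill in a product id and the first review page
url = start_url.format('6784500', 1)
print(url)
```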
2. Replace the entire parse_first function from the previous post
def parse_first(self):
    if not self.qurl.empty():
        goodsid = self.qurl.get()
        url = self.start_url.format(goodsid, 1)
        print('parse_first', url)
        try:
            r = requests.get(url, headers={'User-Agent': random.choice(self.user_agent)}, proxies=self.notproxy, verify=False)
            # the response is encoded as GBK, not UTF-8
            r.encoding = 'GBK'
            if r.status_code == 200:
                # strip the jsonp wrapper from the callback response
                res = r.text.replace('fetchJSON_comment98vv30672(', '').replace(');', '').replace('false', '0').replace('true', '1')
                res = json.loads(res)
                lastPage = int(res['maxPage'])
                # queue up review pages 1-4 for the second stage
                for i in range(lastPage)[1:5]:
                    temp = str(goodsid) + ',' + str(i)
                    self.pageQurl.put(temp)
                arr = []
                for j in res['hotCommentTagStatistics']:
                    arr.append({'name': j['name'], 'count': j['count']})
                self.data[str(goodsid)] = {
                    'hotCommentTagStatistics': arr,
                    'poorCountStr': res['productCommentSummary']['poorCountStr'],
                    'generalCountStr': res['productCommentSummary']['generalCountStr'],
                    'goodCountStr': res['productCommentSummary']['goodCountStr'],
                    'goodRate': res['productCommentSummary']['goodRate'],
                    'comments': []
                }
                self.parse_first()
            else:
                self.first_running = False
                print('ip blocked')
        except:
            self.first_running = False
            print('proxy request failed')
    else:
        self.first_running = False
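A note on the string handling above: chaining `.replace()` calls works, but the `'false'`/`'true'` substitutions would also mangle those words if they ever appeared inside review text, and `json.loads` already understands JSON's `true`/`false` natively. A more defensive sketch (the function name is mine) strips only the callback wrapper:

```python
import json
import re

def parse_jsonp(text):
    # strip a wrapper like fetchJSON_comment98vv30672( ... ); and parse the body
    m = re.search(r'^\s*[\w$]+\((.*)\)\s*;?\s*$', text, re.S)
    body = m.group(1) if m else text
    return json.loads(body)

sample = 'fetchJSON_comment98vv30672({"maxPage": 42, "comments": []});'
print(parse_jsonp(sample)['maxPage'])  # -> 42
```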
3. Replace the entire parse_second function from the previous post
def parse_second(self):
    while self.first_running or not self.pageQurl.empty():
        if not self.pageQurl.empty():
            arr = self.pageQurl.get().split(',')
            url = self.start_url.format(arr[0], arr[1])
            print(url)
            try:
                r = requests.get(url, headers={'User-Agent': random.choice(self.user_agent)}, proxies=self.notproxy, verify=False)
                r.encoding = 'GBK'
                if r.status_code == 200:
                    res = r.text.replace('fetchJSON_comment98vv30672(', '').replace(');', '').replace('false', '0').replace('true', '1')
                    try:
                        res = json.loads(res)
                        for i in res['comments']:
                            images = []
                            videos = []
                            # collect the reviewer's images and videos, if any
                            if i.get('images'):
                                for j in i['images']:
                                    images.append({'imgUrl': j['imgUrl']})
                            if i.get('videos'):
                                for k in i['videos']:
                                    videos.append({'mainUrl': k['mainUrl'], 'remark': k['remark']})
                            # record the details of this review
                            mydict = {
                                'referenceName': i['referenceName'],
                                'content': i['content'],
                                'creationTime': i['creationTime'],
                                'score': i['score'],
                                'userImage': i['userImage'],
                                'nickname': i['nickname'],
                                'userLevelName': i['userLevelName'],
                                'productColor': i['productColor'],
                                'productSize': i['productSize'],
                                'userClientShow': i['userClientShow'],
                                'images': images,
                                'videos': videos
                            }
                            self.data[arr[0]]['comments'].append(mydict)
                        # sleep this thread for a random interval
                        time.sleep(random.random() * 5)
                    except:
                        print('could not parse response as JSON', res)
            except Exception as e:
                print('request failed', str(e))
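One fragility in parse_second: indexing `i['productColor']` and friends raises KeyError when a review lacks that field, which skips the whole page. A sketch of the same extraction using `dict.get` with defaults (the helper name is mine; the field list comes from the code above):

```python
def comment_to_dict(i):
    # pull out the fields we keep, tolerating missing keys
    fields = ['referenceName', 'content', 'creationTime', 'score', 'userImage',
              'nickname', 'userLevelName', 'productColor', 'productSize',
              'userClientShow']
    mydict = {f: i.get(f, '') for f in fields}
    mydict['images'] = [{'imgUrl': j['imgUrl']} for j in i.get('images', [])]
    mydict['videos'] = [{'mainUrl': k['mainUrl'], 'remark': k['remark']}
                        for k in i.get('videos', [])]
    return mydict

sample = {'content': 'great phone', 'score': 5}
print(comment_to_dict(sample)['score'])  # -> 5
```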
4. Modify part of the run function
@run_time
def run(self):
    # the JD product ids to crawl, stored in a list
    goodslist = ['6784500', '31426982482', '7694047']
    for i in goodslist:
        self.qurl.put(i)
    ths = []
    th1 = Thread(target=self.parse_first, args=())
    th1.start()
    ths.append(th1)
    for _ in range(self.thread_num):
        th = Thread(target=self.parse_second)
        th.start()
        ths.append(th)
    for th in ths:
        # wait for every thread to finish
        th.join()
    s = json.dumps(self.data, ensure_ascii=False, indent=4)
    with open('jdComment.json', 'w', encoding='utf-8') as f:
        f.write(s)
    print('Data crawling is finished.')
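Why `ensure_ascii=False` together with `encoding='utf-8'` on the file? Without them, `json.dumps` escapes every Chinese character into `\uXXXX` sequences, making the output file unreadable. A quick sketch with made-up data:

```python
import json

data = {'6784500': {'goodRate': 0.98, 'comments': [{'content': '非常好'}]}}

s = json.dumps(data, ensure_ascii=False, indent=4)
# the Chinese text survives as-is instead of \uXXXX escapes
print('非常好' in s)  # -> True
```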
5. Finally, here is the crawled data. This is only part of the code; swap it into the previous post's code and it will run.
JD's original data:
Format of the crawled data:
JD's original reviews:
Format of the crawled data:
Below is a mini program I wrote; its data also comes from a crawler. I hope everyone takes a look and offers some feedback.