Scraping a JS-rendered page (using a marriage website as an example)

Posted by wanghandou on 2017-11-22

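This site is rendered with JavaScript, so there are two options: drive a PhantomJS browser, or look in the browser's Network panel for the query-string parameters that the page POSTs and send that request ourselves. The response comes back as JSON.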
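As a rough sketch of the second option in Python 3 (an assumption; the post's own code targets Python 2), the form fields seen in the Network panel can be URL-encoded and attached to a request object. The endpoint and field names are copied from the post; the helper name `build_search_request` and the trimmed field set are illustrative only, and whether the site still accepts them has not been verified. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the post; it may have changed since 2017.
SEARCH_URL = "http://search.jiayuan.com/v2/search_v2.php"


def build_search_request(page, user_agent="Mozilla/5.0"):
    """Build (but do not send) the POST request for one page of results."""
    # A subset of the form fields used in the post; "p" is the page number.
    form = {
        "sex": "f",
        "sn": "default",
        "sv": "1",
        "p": page,
        "listStyle": "bigPhoto",
        "jsversion": "v5",
    }
    # In Python 3 the request body must be bytes, not str.
    body = urlencode(form).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(SEARCH_URL, data=body, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = build_search_request(3)
    print(req.get_method(), req.full_url)
    print(req.data)
```

Passing the returned object to `urllib.request.urlopen` would actually send it; that step is left out so the sketch can be run without touching the network.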

#!/usr/bin/python
# -*- encoding: UTF-8 -*-
# Python 2 script (uses urllib2 and print statements).
import json
import urllib
import urllib2

import jsonpath
from lxml import etree


class we():
    def __init__(self):
        self.page = 3
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36",}

    # "meiyiye" = each page: fetch one page of search results
    def meiyiye(self):
        item = []
        headers = self.headers
        url = "http://search.jiayuan.com/v2/search_v2.php"
        fromdata = {
            "sex": "f",
            "key": "",
            "stc": "1%3A41%2C2%3A19.27%2C3%3A155.170%2C23%3A1",
            "sn": "default",
            "sv": "1",
            "p": self.page,  # which page of results to fetch
            "f": "",
            "listStyle": "bigPhoto",
            "pri_uid": "170703614",
            "jsversion": "v5"}
        data = urllib.urlencode(fromdata)
        # A POST request needs a data value; a GET request does not
        request = urllib2.Request(url, data=data, headers=headers)
        response = urllib2.urlopen(request)
        # The body is a JSON-formatted string (dict-like)
        html = response.read()
        # Strip the ##jiayser## markers and the "//" so the text parses as JSON
        html1 = html.replace("##jiayser##", "").replace("//", "")
        # Convert the JSON text into Python objects (unicode strings, lists)
        content = json.loads(html1)
        # Match every user id in content; each id is later used to build
        # the link to that user's profile page.
        # jsonpath returns False when nothing matches, hence the "or []".
        id_list = jsonpath.jsonpath(content, "$..uid") or []
        for i in id_list:
            item.append(i)
        self.page += 1
        self.meiyigeren(item)

    # "meiyigeren" = each person: process the profile page of every user on this page
    def meiyigeren(self, item):
        for id in item:
            print "*******************************************"
            print u"User id: " + str(id)
            # Build the profile link, request it, and pull out the useful fields
            url = "http://www.jiayuan.com/" + str(id) + "?fxly=search_v2_index"
            print u"Profile link: " + url
            headers = self.headers
            request = urllib2.Request(url, headers=headers)
            response = urllib2.urlopen(request)
            html = response.read()
            # Parse the HTML into a DOM tree so XPath can be used below
            content = etree.HTML(html)
            username = content.xpath('//div[@class="main_1000 bg_white mt15"]//h4/text()')
            if len(username) == 1:
                print username[0]
            else:
                print u"No name"
            a = content.xpath('//div[@class="main_1000 bg_white mt15"]//h6[@class="member_name"]/text()')
            we = " ".join(a)
            # u"，" must be a unicode literal: lxml returns unicode text, and
            # mixing it with a non-ASCII byte string raises UnicodeDecodeError
            ni = we.replace(u"，", " ").replace(',', ' ')
            ha = ni.split(" ")
            print u"Age: " + ha[0]
            header_url = content.xpath('//div[@class="big_pic fn-clear"]//li[2]//tr//img[@class="img_absolute"]//@_src')
            if len(header_url) == 1:
                header_urll = header_url[0]
            else:
                header_urll = u"no avatar link"
            print u"Avatar link: " + header_urll
            image_url = content.xpath('//div[@class="small_pic_box fn-clear"]//div[@class="small_pic fn-clear"]//li//img//@src')
            print u"Album links:",
            print image_url
            content1 = content.xpath('//div[@class="main_1000 mt15 fn-clear"]//div[@class="bg_white"]//div[@class="js_text"]//text()')
            content2 = ""
            for i in content1:
                content2 += i
            content3 = content2
            print u"Self-introduction: " + content3.strip()
            place = content.xpath('//div[@class="main_1000 bg_white mt15"]//h6[@class="member_name"]/a[2]/text()')
            if len(place) == 1:
                where = place[0]
            else:
                where = u"Henan"  # default used when the page gives no place
            print u"From: " + where + u" Province"
            xueli = content.xpath('//div[@class="main_1000 bg_white mt15"]//ul[@class="member_info_list fn-clear"]/li[1]//div[@class="fl pr"]/em/text()')
            if len(xueli) == 1:
                print u"Education: " + xueli[0]
            else:
                print u"Education: bachelor's degree"  # default when missing
            print "***********************************************"
        # After finishing this page, move on to the next one (up to page 5)
        if self.page <= 5:
            self.meiyiye()


if __name__ == "__main__":
    ni = we()
    ni.meiyiye()
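The script removes every `//` from the response before parsing it, which also mangles any `http://` link contained in the JSON. A more conservative variant, sketched here in Python 3, trims the wrapper only at the two ends of the text and then collects the `uid` values with a small recursive walk similar to the `$..uid` JSONPath query. It assumes the `##jiayser##` marker appears as a prefix and again as a suffix followed by `//`; that layout is inferred from the `replace` calls, not confirmed. The function names and the sample payload are made up for illustration.

```python
import json

WRAPPER = "##jiayser##"


def unwrap_payload(raw):
    """Trim the wrapper from both ends of the text and parse the JSON inside."""
    text = raw.strip()
    if text.startswith(WRAPPER):
        text = text[len(WRAPPER):]
    if text.endswith("//"):
        text = text[:-2]
    if text.endswith(WRAPPER):
        text = text[:-len(WRAPPER)]
    return json.loads(text)


def collect_uids(node):
    """Recursively gather every value stored under a "uid" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "uid":
                found.append(value)
            found.extend(collect_uids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_uids(item))
    return found


if __name__ == "__main__":
    # Invented sample, not a real response from the site.
    sample = ('##jiayser##{"users": [{"uid": 1, "img": "http://example.com/a.jpg"},'
              ' {"uid": 2}]}##jiayser##//')
    data = unwrap_payload(sample)
    print(collect_uids(data))       # [1, 2]
    print(data["users"][0]["img"])  # http://example.com/a.jpg
```

Because only the ends are trimmed, URLs inside the payload keep their `//`, and an empty result is simply an empty list rather than `False`.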
