001 - Reading web pages with urllib

Posted by weixin_34087301 on 2018-12-16

Original article: https://blog.csdn.net/weixin_34087301/article/details/86955866

Regular expressions (re)

With (), we get back only what is inside the parentheses.
Without (), we get back the entire matched text.

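import re

# Sample fragment; the Chinese text reads "5830 positions match the search"
mystr = """<span class="search_yx_t j">
  共<em>5830</em>個職位滿足條件
  </span>"""

restr = r"<em>(\d+)</em>"  # \d+ matches one or more digits; () keeps only what is inside
regex = re.compile(restr, re.IGNORECASE)
mylist = regex.findall(mystr)  # ['5830']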
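
To make the effect of the parentheses concrete, here is a minimal sketch that runs the same pattern with and without a capturing group. The one-line html string is a shortened stand-in for the sample above, not the real page.

```python
import re

# Shortened stand-in for the sample fragment above
html = "<span>共<em>5830</em>個職位滿足條件</span>"

# With a capturing group, findall returns only the text inside the parentheses.
with_group = re.findall(r"<em>(\d+)</em>", html)

# Without a group, findall returns the entire matched text.
without_group = re.findall(r"<em>\d+</em>", html)

print(with_group)     # ['5830']
print(without_group)  # ['<em>5830</em>']
```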

Three ways to read a web page

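# py2
# encoding: utf-8
import urllib2

url = "http://www.baidu.com"  # urlopen handles http and https URLs (https needs a Python build with SSL support)

def download1(url):
    return urllib2.urlopen(url).read()  # read the whole page at once

def download2(url):
    return urllib2.urlopen(url).readlines()  # read every line of the page into a list

def download3(url):
    response = urllib2.urlopen(url)  # treat the page as a file-like object
    while True:
        line = response.readline()  # read one line at a time
        if not line:
            break
        print line  # handle each line here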
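
For comparison, the same three approaches might look like this in Python 3, where urllib2 has been folded into urllib.request. This is a sketch rather than the author's code: the function names are made up for illustration, and the demo uses a data: URL (which urlopen accepts) so it runs without network access.

```python
from urllib.request import urlopen


def download_all(url):
    """Read the whole response body at once, as bytes."""
    with urlopen(url) as response:
        return response.read()


def download_lines(url):
    """Read every line of the response into a list of bytes objects."""
    with urlopen(url) as response:
        return response.readlines()


def download_iter(url):
    """Yield the response one line at a time, like reading a file."""
    with urlopen(url) as response:
        while True:
            line = response.readline()
            if not line:
                break
            yield line


# Offline demo: a data: URL carrying two short lines (%0A is a newline)
sample = "data:text/plain,first%0Asecond%0A"
print(download_all(sample))
print(download_lines(sample))
print(list(download_iter(sample)))
```

Replacing sample with a real http or https address exercises the same functions against a live site.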

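Differences between Python 2 and Python 3

Encoding

As a rule of thumb, Python 2 is fine for scraping English-only data, while data that contains Chinese is better handled with Python 3, whose strings are Unicode by default.

To print Chinese text in Python 2, the source file needs an encoding:utf-8 declaration on its first line.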
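
The encoding point can be illustrated with a small Python 3 sketch. A downloaded page arrives as bytes and has to be decoded before it can be treated as text. The helper name decode_page and the choice of UTF-8 are assumptions for illustration; a real page may declare a different encoding.

```python
def decode_page(raw, encoding="utf-8"):
    """Turn the bytes returned by read() into a Python 3 str."""
    return raw.decode(encoding)


text = "共5830個職位"
raw = text.encode("utf-8")  # roughly what travels over the network

print(type(raw).__name__)        # bytes
print(decode_page(raw) == text)  # True
print(len(text), len(raw))       # 8 16 (each Chinese character takes 3 bytes in UTF-8)
```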

urllib2

Fetching a page's data is written differently in Python 2 and Python 3.

python2

urllib2.urlopen(url).read()
# urllib2 handles http and https URLs (https needs a Python build with SSL support)

python3

urllib.request.urlopen(url).read()
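
If one script has to run under both versions, a common pattern is to try the Python 3 import first and fall back to the Python 2 module. The sketch below assumes nothing beyond the two standard-library modules named above; the fetch helper is a hypothetical name.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch(url):
    """Return the raw body of the page at url."""
    return urlopen(url).read()
```

After this import, fetch("http://www.baidu.com") is written the same way under either interpreter.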

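Python blocked by a website: referer

When we send a request, the server can inspect the request headers, such as referer, to work out who is making the request.
If the server sees that the request comes from a Python script rather than a browser, it may reject it outright, for example with a 502 response.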
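
One lightweight way to look more like a browser is to send the headers a browser would send. The sketch below uses urllib.request.Request from Python 3; the User-Agent string and the build_request helper are illustrative assumptions, and whether a particular site accepts such a request is not guaranteed.

```python
from urllib.request import Request


def build_request(url, referer):
    """Build a request carrying browser-like headers (values are examples only)."""
    headers = {
        # Example browser identification string, chosen for illustration
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
        # The page we claim to have come from
        "Referer": referer,
    }
    return Request(url, headers=headers)


req = build_request("http://www.baidu.com", "http://www.baidu.com/")
print(req.get_header("Referer"))
# The request object can then be passed to urllib.request.urlopen(req).
```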

selenium

We need a tool that simulates a browser. selenium can drive a real browser for us.

Scraping Zhaopin (智聯招聘)

Built on the selenium library and selenium.webdriver

import re
import selenium  # testing framework
import selenium.webdriver  # drives a real browser

# Sample fragment; the Chinese text reads "5830 positions match the search"
mystr = """<span class="search_yx_t j">
  共<em>5830</em>個職位滿足條件
  </span>"""

restr = r"<em>(\d+)</em>"  # \d+ matches one or more digits; () keeps only what is inside
regex = re.compile(restr, re.IGNORECASE)

def getnumberbyname(searchname):
    url = "https://sou.zhaopin.com/?jl=613&kw=" + searchname + "&kt=3"
    driver = selenium.webdriver.Firefox()  # launch Firefox
    driver.get(url)  # open the link
    pagesource = driver.page_source  # grab the page source
    driver.close()  # close the browser
    mylist = regex.findall(pagesource)  # search only after the page source exists
    return mylist[0]

# print(getnumberbyname("python"))  # quick test of the function

# Search keywords: python, python ops, python testing, python data, python web
pythonlist = ["python", "python 運維", "python 測試", "python 資料", "python web"]
for pystr in pythonlist:
    print(pystr, getnumberbyname(pystr))
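
Because some of the keywords contain spaces and Chinese characters, it can help to URL-encode them and to keep the parsing step separate from the browser step so it can be checked without launching Firefox. The following is a sketch under those assumptions; build_search_url and extract_count are hypothetical helper names, and the query parameters are copied from the URL above.

```python
import re
from urllib.parse import urlencode


def build_search_url(keyword):
    """Build the search URL with the keyword safely percent-encoded."""
    query = urlencode({"jl": "613", "kw": keyword, "kt": "3"})
    return "https://sou.zhaopin.com/?" + query


def extract_count(page_source):
    """Return the number inside the first <em>...</em>, or None if absent."""
    match = re.search(r"<em>(\d+)</em>", page_source)
    return int(match.group(1)) if match else None


print(build_search_url("python web"))
# https://sou.zhaopin.com/?jl=613&kw=python+web&kt=3
print(extract_count("共<em>5830</em>個職位滿足條件"))  # 5830
```

Returning None instead of indexing into an empty list avoids an IndexError when the page layout changes or the request is blocked.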

Scraping 51job

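import re
import selenium  # testing framework
import selenium.webdriver  # drives a real browser

# Sample fragment; the Chinese text reads "67 positions in total"
mystr = """<div class = "rt">
  共67條職位
  </div>"""


# Author's note: this part may be a little muddled; it was written without a Python environment at hand and has not been tested.
def getnumberbyname(searchname):
    url = ("https://search.51job.com/list/240200,000000,0000,00,9,99," + searchname +
           ",2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99"
           "&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0"
           "&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=")
    driver = selenium.webdriver.Firefox()  # launch Firefox
    driver.get(url)  # open the link
    pagesource = driver.page_source  # grab the page source
    driver.close()  # close the browser
    # Grab the big block first, then the small piece inside it; this matters most when whitespace is involved.
    # Assumption: the outer pattern is reconstructed from the sample above; only (\d+) survived in the original.
    restr = r'<div class\s*=\s*"rt">([\s\S]*?)</div>'
    regex = re.compile(restr, re.IGNORECASE)
    mylist = regex.findall(pagesource)
    newstr = mylist[0].strip()
    return re.findall(r"\d+", newstr)[0]

# Search keywords: python, python ops, python testing, python data, python web
pythonlist = ["python", "python 運維", "python 測試", "python 資料", "python web"]
for pystr in pythonlist:
    print(pystr, getnumberbyname(pystr))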
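
The "grab the big block first, then the small piece" idea can be tried on the sample fragment alone, without a browser. This is a sketch: the helper name count_from_block is hypothetical, and the outer pattern is an assumption based on the sample markup, which may differ from the live page.

```python
import re

# The sample fragment from above ("67 positions in total")
page = """<div class = "rt">
  共67條職位
  </div>"""


def count_from_block(page_source):
    """Step 1: isolate the div block. Step 2: pull the digits out of it."""
    block = re.search(r'<div class\s*=\s*"rt">(.*?)</div>', page_source, re.DOTALL)
    if block is None:
        return None
    number = re.search(r"\d+", block.group(1))
    return int(number.group()) if number else None


print(count_from_block(page))  # 67
```

Matching the surrounding block first keeps the digit search from picking up unrelated numbers elsewhere on the page, and re.DOTALL lets the block span the line breaks around the text.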
