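Python: A Minimal Tutorial from Basics to Web Crawling

Posted by Yujiaao on 2019-02-16

Original URL: https://flycode.co/archives/78955

Python Crawling and Data Analysis

"You study too much and practice too little." - 古典 (Gu Dian)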

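Scraping Data Without Python

Not writing code is the first choice

Octoparse (八爪魚採集器)

Features: embedded browser, visual element selection, can extract JavaScript-generated content, XPath-based data extraction, templates for popular websites, cloud crawling, export to multiple data formats and to databases
http://www.bazhuayu.com/
5-minute demo: https://v.youku.com/v_show/id…
Automatic recognition of some CAPTCHAs: http://www.bazhuayu.com/faq/c…
Free edition: 2 concurrent threads, up to 10 tasks

火車採集器 (LocoySpider)

Features: connects to databases, can import directly into a CMS
http://www.locoy.com/

Many CMSs ship with their own article-collection tools

For example jeecms, phpCMS, dedeCMS, and 帝國 cms (EmpireCMS)
(omitted)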

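Why Learn Python

Data analysis involves many stages, and scraping is only one of them: data has to be continuously collected, updated, cleaned, analyzed, and visualized. Python handles every one of these stages comfortably, which makes it a well-balanced tool.

vs C

Compared with C, Python is slower, but only at run time; development is much faster. In most projects development takes the larger share of the time, and "develop constantly, run occasionally" is the norm.

vs Java

No compile step, which saves a lot of hassle. It is better suited to one-off applications and small teams, and it is more flexible.

Life Is Short, Use Python

AI and machine learning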

Python Language Basics

Versions

Differences

Python 2.x and 3.x differ significantly

2to3

2to3 can automatically convert most Python 2 code

New features in 3.x

https://www.asmeurer.com/pyth…

Version isolation with virtualenv

$ pip3 install virtualenv
$ virtualenv venv
$ source venv/bin/activate
(venv)$ 
(venv)$ deactivate
$

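Common data structures

{} curly braces: dictionary (dict), key-value pairs, keys are unique, fast random lookup by key
[] square brackets: list, compact element storage, fixed order, sortable
(1,) round brackets: tuple
set() set: elements are unique, no duplicates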
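
As a small illustration of the four structures above, here is a minimal sketch. The variable names and sample values are made up for this example and do not come from the original article.

```python
# Dictionary: key-value pairs, keys are unique, fast lookup by key.
page_sizes = {"index.html": 1024, "about.html": 512}
page_sizes["index.html"] = 2048          # same key: the value is overwritten

# List: ordered, mutable, sortable.
status_codes = [404, 200, 301]
status_codes.sort()                      # sorts in place

# Tuple: note the trailing comma needed for a one-element tuple.
single = (1,)
host_port = ("example.com", 80)

# Set: unique elements only, handy for de-duplicating crawled URLs.
seen_urls = set(["/a", "/b", "/a"])

print(page_sizes, status_codes, single, host_port, sorted(seen_urls))
```

A set of already-visited URLs is a typical use in a crawler, since membership tests are fast and duplicates are dropped automatically.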

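Input/output, text processing, array processing

input: terminal input

Reading files

open(), read(), seek()

Writing files

Writing a file works the same way as reading one; the only difference is that you pass the mode 'w' or 'wb' to open() to write a text file or a binary file:

>>> f = open('/Users/michael/test.txt', 'w')
>>> f.write('Hello, world!')
>>> f.close()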
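
The same write can be sketched with a with block, which closes the file even if an error occurs, followed by a read that uses read() and seek(). The temporary path is an assumption made so the sketch runs anywhere; the session above writes to /Users/michael/test.txt.

```python
import os
import tempfile

# A throwaway path for this sketch (hypothetical, not from the article).
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# "w" opens for writing text; the with block closes the file automatically.
with open(path, "w", encoding="utf-8") as f:
    f.write("Hello, world!")

# "r" opens for reading; seek(0) rewinds to the start of the file.
with open(path, "r", encoding="utf-8") as f:
    first = f.read(5)      # read the first 5 characters
    f.seek(0)
    whole = f.read()       # read everything from the beginning

print(first, "|", whole)
```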

Arrays

Object-oriented basics and usage

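How to Learn Python the Easy, Fun Way

Learn programming through games to get familiar with syntax, control flow, functions, and so on: https://codecombat.com/
IDEs: PyCharm, VS Code; breakpoint debugging

Python tutorial

Exercises

Guess the random number
Probability of forming a triangle
Several levels of finding prime numbers
Probability of a prime
A brief analysis of the PNG format

Image formats

png, gif, jpg, svg, webp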

Distinctive Features and Tricky Parts

Decorators

decorator @

Generators

generator

yield
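
A minimal sketch of both ideas follows. The names log_calls, square, and countdown are invented for this illustration.

```python
import functools

def log_calls(func):
    """Decorator: record every call to func in log_calls.history."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log_calls.history.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper

log_calls.history = []

@log_calls                      # same as: square = log_calls(square)
def square(x):
    return x * x

def countdown(n):
    """Generator: yield n, n-1, ..., 1 lazily, one value per next()."""
    while n > 0:
        yield n
        n -= 1

print(square(4))                # 16
print(list(countdown(3)))       # [3, 2, 1]
print(log_calls.history)        # ['square']
```

The generator computes each value only when asked, which matters when iterating over a long list of pages without holding them all in memory.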

lambda expressions

Some commonly used functions

zip()

map()

filter()
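
These three functions are often combined with lambda expressions. A short sketch, using made-up file names and sizes:

```python
names = ["a.html", "b.html", "c.html"]
sizes = [300, 1200, 50]

# zip() pairs up items from several iterables.
pairs = list(zip(names, sizes))

# map() applies a function to every item; the lambda is an inline function.
upper_names = list(map(lambda name: name.upper(), names))

# filter() keeps only the items for which the function returns True.
large = list(filter(lambda pair: pair[1] >= 300, pairs))

print(pairs)
print(upper_names)
print(large)
```

In Python 3 all three return lazy iterators, which is why the sketch wraps them in list() before printing.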

Network Protocols and File Formats

URL

scheme://host:port/path/file?param1=value1&param2=value2#fragment
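
The standard library can split a URL into exactly these parts. The URL below is made up to exercise every component of the pattern.

```python
from urllib.parse import urlsplit, parse_qs

# A hypothetical URL containing every part of the pattern above.
url = "https://example.com:8080/docs/index.html?page=2&lang=en#intro"

parts = urlsplit(url)
print(parts.scheme)     # https
print(parts.hostname)   # example.com
print(parts.port)       # 8080
print(parts.path)       # /docs/index.html
print(parts.query)      # page=2&lang=en
print(parts.fragment)   # intro

# parse_qs turns the query string into a dict of lists.
params = parse_qs(parts.query)
print(params)           # {'page': ['2'], 'lang': ['en']}
```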

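The HTTP protocol

https://www.tutorialspoint.co…

Connectionless: no connection needs to be kept open between requests
Media independent: the MIME type determines what the data contains
Stateless: state is tracked with cookies or parameters

Request headers

Learn them by inspecting requests in the browser's developer tools

The key ones to master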

Cookie
Referer
User-Agent
Content-Type
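
As a sketch of how these four headers are attached to a request in Python, the standard library's urllib.request.Request accepts a headers dict. Every header value below is a placeholder chosen for this example, and no network traffic is generated.

```python
import urllib.request

# All header values are placeholders for this sketch.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; tutorial-sketch)",
    "Referer": "http://example.com/list",
    "Cookie": "sessionid=abc123",
    "Content-Type": "application/x-www-form-urlencoded",
}

# Building a Request does not touch the network; urlopen(req) would send it.
req = urllib.request.Request("http://example.com/detail", headers=headers)

# urllib stores header names capitalized as "User-agent", "Content-type", ...
for name, value in sorted(req.header_items()):
    print(name + ": " + value)
```

Copying the real values from the browser's developer tools into such a dict is usually the quickest way to make a script's requests resemble the browser's.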

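Request methods

GET

The most common method; parameters are usually passed in the URL; idempotent

POST

Used for submit operations, for large amounts of data, and for file uploads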
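
The difference in where the parameters travel can be sketched with urllib. The endpoint and parameters are hypothetical, and the requests are only constructed, not sent.

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only.
base_url = "http://example.com/search"
params = {"q": "python", "page": 2}

# GET: the parameters travel in the URL's query string.
get_req = urllib.request.Request(base_url + "?" + urlencode(params))

# POST: the parameters travel in the request body, which must be bytes.
post_req = urllib.request.Request(base_url, data=urlencode(params).encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```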

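Response status codes

200: the request succeeded. Handling: read the response body and process it.

301: the resource has been assigned a permanent new URL, which should be used for future access. Check the Location header.
302: the resource temporarily resides at a different URL. Check the Location header.

400: bad request
401: unauthorized
403: forbidden

404: not found

500: internal server error
502: bad gateway. A server acting as a gateway or proxy received an invalid response from the upstream server while trying to fulfil the request.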
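
One way a crawler might branch on these codes is sketched below. The next_action function and its policy are assumptions made for this example, not rules from the HTTP specification; http.HTTPStatus is from the standard library.

```python
from http import HTTPStatus

def next_action(status_code):
    """Suggest what a crawler might do for a given status code.

    The policy is an assumption made for this sketch.
    """
    if status_code == 200:
        return "parse"            # read the body and process it
    if status_code in (301, 302):
        return "follow-location"  # look at the Location header
    if status_code in (401, 403):
        return "check-auth"       # login, cookies, or headers may be missing
    if 500 <= status_code < 600:
        return "retry-later"      # server-side problem, try again later
    return "skip"                 # 400, 404 and anything else

for code in (200, 301, 302, 403, 404, 502):
    print(code, HTTPStatus(code).phrase, "->", next_action(code))
```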

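Testing tools

curl

Use it alongside the browser; the -o option saves the response to a file

wget

The -c option resumes an interrupted download; wildcards help with batch downloads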

chromium, telnet, netcat

HTML

Learning resources

w3cschool.com

JSON

Format
Tools
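
In Python the usual tool for this format is the standard json module. A minimal round-trip sketch, using a made-up record:

```python
import json

# A made-up record such as a crawler might produce.
record = {"title": "Python 教程", "tags": ["python", "crawler"], "views": 120}

# dumps() serializes to a JSON string; ensure_ascii=False keeps non-ASCII
# characters readable instead of escaping them as \uXXXX.
text = json.dumps(record, ensure_ascii=False, indent=2)
print(text)

# loads() parses the string back into Python objects.
restored = json.loads(text)
print(restored["tags"][0])
```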

JavaScript & CSS

A basic understanding is enough

Common Python Scraping Tools and Libraries

urllib

# Python 2 only: urllib2 and the print statement do not exist in Python 3
import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.read()

2to3 urllib.py

import urllib.request, urllib.error, urllib.parse
 
response = urllib.request.urlopen("http://example.com")
print(response.read())

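Practice notes:

Start Python 3; exit with Ctrl+D
Run 2to3 --help to find the -w (write back) option
Two ways to run code: from the command line and interactively

Reference: https://cuiqingcai.com/947.html

The Requests library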

Scrapy

$ pip install Scrapy lxml

PySpider

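A very convenient and powerful crawler framework. It supports multi-threaded crawling and JavaScript rendering, and provides a web UI, retry on failure, scheduled crawls, and more. It is very user-friendly.

Official site

Installation

$ pip install pyspider

Usage

$ pyspider all

Then open http://localhost:5000 in a browser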

Selenium & PhantomJS

$ pip install selenium

Load a page with a browser

from selenium import webdriver
browser = webdriver.Chrome()
browser.get('http://www.baidu.com/')

Drive the browser to run a search

import unittest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

class PythonOrgSearch(unittest.TestCase):

    def setUp(self):
        self.driver = webdriver.Chrome()

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        # find_element_by_name was removed in Selenium 4; use a By locator
        elem = driver.find_element(By.NAME, "q")
        elem.send_keys("pycon")
        elem.send_keys(Keys.RETURN)
        assert "No results found." not in driver.page_source
 
    def tearDown(self):
        self.driver.close()
 
if __name__ == "__main__":
    unittest.main()

Save a page as an image with PhantomJS

PhantomJS is effectively a headless browser: it runs scripts and renders CSS in memory

phantomjs helloworld.js

var page = require('webpage').create();
page.open('http://cuiqingcai.com', function (status) {
    console.log("Status: " + status);
    if (status === "success") {
        page.render('example.png');
    }
    phantom.exit();
});

Data Extraction Tools

html, xml, xpath, selector, json

Regular expressions

Somewhat hard to master, supported by most editors, and widely applicable, but not well suited to structured data (XML, JSON, HTML)

Provided by Python's re module

# Returns a pattern object
re.compile(pattern[, flags])
# The functions below perform matching
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count])
re.subn(pattern, repl, string[, count])

See: https://cuiqingcai.com/912.html
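
A short sketch of several of these functions on a made-up HTML snippet. Regular expressions are used here only because the snippet is tiny and regular; for real pages a structured parser is usually the safer choice.

```python
import re

html = '<a href="/post/1">First</a> <a href="/post/22">Second</a>'

# compile() builds a reusable pattern object; the groups capture id and text.
link = re.compile(r'<a href="/post/(\d+)">([^<]+)</a>')

links = link.findall(html)             # list of (id, text) tuples
first = link.search(html)              # first match anywhere in the string
leading = re.match(r"\d+", "42 pages").group()   # match() anchors at the start
fields = re.split(r"\s*,\s*", "a , b,c")         # split on commas
masked = re.sub(r"\d+", "N", "page 3 of 10")     # replace every number

print(links)
print(first.group(1), first.group(2))
print(leading, fields, masked)
```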

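jQuery selectors, based on the DOM model

Usable directly in Selenium or in the browser

The XPath standard, based on a query language

XPath is a standard that models an XML document as a tree and provides the means to navigate it, selecting nodes by a variety of criteria.
XPath is a tool for extracting data from XML, so HTML has to be corrected into a well-formed tree first

Correction tools: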

Parse HTML with lxml

>>> from lxml import etree
>>> doc = '<foo><bar></bar></foo>'
>>> tree = etree.HTML(doc)  # the HTML parser wraps the fragment in <html><body>

>>> r = tree.xpath('//foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'

>>> r = tree.xpath('//bar')
>>> r[0].tag
'bar'

The most robust results come from lxml.html's soupparser. You need to install python-lxml and python-beautifulsoup; then you can do the following:

from lxml.html.soupparser import fromstring
tree = fromstring('<mal form="ed"><html/>here!')
matches = tree.xpath(".//mal[@form='ed']")

XPath documentation

Wikipedia: https://en.wikipedia.org/wiki…
W3C: https://www.w3.org/TR/xpath-30/

Beginner tutorial

https://www.w3schools.com/xml…

Online XPath tester

https://codebeautify.org/Xpat…

Feature: can load a URL directly

<root xmlns:foo="http://www.foo.org/" xmlns:bar="http://www.bar.org">
 <employees>
  <employee id="1">Johnny Dapp</employee>
  <employee id="2">Al Pacino</employee>
  <employee id="3">Robert De Niro</employee>
  <employee id="4">Kevin Spacey</employee>
  <employee id="5">Denzel Washington</employee>
  
 </employees>
 <foo:companies>
  <foo:company id="6">Tata Consultancy Services</foo:company>
  <foo:company id="7">Wipro</foo:company>
  <foo:company id="8">Infosys</foo:company>
  <foo:company id="9">Microsoft</foo:company>
  <foo:company id="10">IBM</foo:company>
  <foo:company id="11">Apple</foo:company>
  <foo:company id="12">Oracle</foo:company>
 </foo:companies>
</root>

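Examples:
1. Select the document node.
/
2. Select the "root" element.
/root
3. Select all "employee" elements that are direct children of the "employees" element.
/root/employees/employee
4. Select all "company" elements, wherever they are in the document.
//foo:company
5. Select the "id" attribute of the "company" elements, wherever they are in the document.
//foo:company/@id
6. Select the text value of the first "employee" element.
//employee[1]/text()
7. Select the last "employee" element.
//employee[last()]
8. Select the first and second "employee" elements by position.
//employee[position() < 3]
9. Select all "employee" elements that have an "id" attribute.
//employee[@id]
10. Select the "employee" element whose "id" attribute equals "3".
//employee[@id='3']
11. Select all "employee" nodes whose "id" attribute is less than or equal to "3".
//employee[@id<=3]
12. Select all children of the "companies" node.
/root/foo:companies/*
13. Select all elements in the document.
//*
14. Select all "employee" elements and all "company" elements.
//employee|//foo:company
15. Select the name of the first element in the document.
name(//*[1])
16. Select the numeric value of the "id" attribute of the first "employee" element.
number(//employee[1]/@id)
17. Select the string value of the "id" attribute of the first "employee" element.
string(//employee[1]/@id)
18. Select the length of the text value of the first "employee" element.
string-length(//employee[1]/text())
19. Select the local name of the first "company" element, that is, without the namespace.
local-name(//foo:company[1])
20. Select the number of "company" elements.
count(//foo:company)
21. Select the sum of the "id" attributes of the "company" elements.
sum(//foo:company/@id)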

http://www.xpathtester.com/xpath
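
A few of these expressions can be tried offline with the standard library's xml.etree.ElementTree. This is a sketch under one important assumption: ElementTree implements only a subset of XPath, so paths are written relative to the root element and functions such as count() and sum() are replaced by plain Python. For full XPath, lxml is the usual choice.

```python
import xml.etree.ElementTree as ET

xml_text = """
<root xmlns:foo="http://www.foo.org/" xmlns:bar="http://www.bar.org">
 <employees>
  <employee id="1">Johnny Dapp</employee>
  <employee id="2">Al Pacino</employee>
  <employee id="3">Robert De Niro</employee>
  <employee id="4">Kevin Spacey</employee>
  <employee id="5">Denzel Washington</employee>
 </employees>
 <foo:companies>
  <foo:company id="6">Tata Consultancy Services</foo:company>
  <foo:company id="7">Wipro</foo:company>
  <foo:company id="8">Infosys</foo:company>
  <foo:company id="9">Microsoft</foo:company>
  <foo:company id="10">IBM</foo:company>
  <foo:company id="11">Apple</foo:company>
  <foo:company id="12">Oracle</foo:company>
 </foo:companies>
</root>
""".strip()

root = ET.fromstring(xml_text)
ns = {"foo": "http://www.foo.org/"}       # prefix-to-URI map for foo: paths

employees = root.findall("./employees/employee")       # like example 3
companies = root.findall(".//foo:company", ns)         # like example 4
company_ids = [c.get("id") for c in companies]         # like example 5
first_name = root.find(".//employee[1]").text          # like example 6
last_name = root.find(".//employee[last()]").text      # like example 7
third_name = root.find(".//employee[@id='3']").text    # like example 10
company_count = len(companies)                         # like example 20
id_sum = sum(int(i) for i in company_ids)              # like example 21

print(len(employees), first_name, last_name, third_name)
print(company_count, id_sum)
```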

Worked example: how to extract one element from among repeated elements with XPath

<div class="container">
  <div class="col-12 col-sm-3">
    <p class="title">序號</p>
    <p>001</p>
  </div>
  <div class="col-12 col-sm-3">
    <p class="title">編號</p>
    <p>999</p>
  </div>
  <div class="col-12 col-sm-3">
    <p class="title">列號</p>
    <p>321</p>
  </div>
</div>

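//p[text()="編號"]/following-sibling::p[1]
For example, getting the text with Python + Selenium (By is imported from selenium.webdriver.common.by):
driver.find_element(By.XPATH, '//p[text()="編號"]/following-sibling::p[1]').text
Note: Selenium supports XPath, jQuery-like selectors, and several other ways of locating elements.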

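Firefox and XPath

Firefox versions before 2017: Firebug
After 2017: Firefox Developer Edition with the ChroPath add-on
https://addons.mozilla.org/en…

Chromium and XPath

Open the site in Chrome or Firefox

Press Ctrl+Shift+I (Alt+Cmd+I on macOS) to open the developer tools
Select "Elements" at the top of the tools window
Select the magnifying glass at the bottom of the tools window
Select the desired element in the browser
Right-click the selected line in the DOM tree and choose "Copy XPath"

Chrome extension: XPath Helper (downloading it may require a proxy or VPN)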

Data Storage

CSV and Excel formats

Watch out for quote escaping; use an existing library
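
Python's standard csv module is one such library. The sketch below, with made-up rows, shows it escaping commas, double quotes, and line breaks, and reading the fields back unchanged.

```python
import csv
import io

# Made-up rows containing a comma, a double quote, and a line break.
rows = [
    ["id", "title"],
    ["1", 'He said "hi", then left'],
    ["2", "line one\nline two"],
]

# An in-memory buffer keeps the sketch self-contained; a real script would
# use open("out.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())

# Reading back yields exactly the original fields.
buffer.seek(0)
restored = list(csv.reader(buffer))
print(restored == rows)
```

Passing newline="" when opening the file is what the csv documentation recommends, so that the module controls line endings itself.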

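MySQL database

Installing a MySQL driver
Because the MySQL server runs as a separate process and serves clients over the network, you need a Python MySQL driver to connect to it. MySQL officially provides the mysql-connector-python driver. Older versions of pip required the --allow-external option to install it; current versions of pip no longer accept that option, so install the package directly:

$ pip install mysql-connector-python

If the command above fails, you can try another driver:

$ pip install mysql-connector

The following session shows how to connect to the test database on a MySQL server: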

# Import the MySQL driver:
>>> import mysql.connector
# Set password to your own root password:
>>> conn = mysql.connector.connect(user='root', password='password', database='test')
>>> cursor = conn.cursor()
# Create the user table:
>>> cursor.execute('create table user (id varchar(20) primary key, name varchar(20))')
# Insert one row; note that the MySQL placeholder is %s:
>>> cursor.execute('insert into user (id, name) values (%s, %s)', ['1', 'Michael'])
>>> cursor.rowcount
1
# Commit the transaction:
>>> conn.commit()
>>> cursor.close()
# Run a query:
>>> cursor = conn.cursor()
>>> cursor.execute('select * from user where id = %s', ('1',))
>>> values = cursor.fetchall()
>>> values
[('1', 'Michael')]
# Close the cursor and the connection:
>>> cursor.close()
True
>>> conn.close()
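
The same connect, execute, commit, fetch workflow can be tried without a running server by swapping in the standard library's sqlite3 module. This substitution is made only so the sketch runs anywhere; it is not the MySQL driver, and its placeholder is ? where MySQL uses %s.

```python
import sqlite3

# sqlite3 stands in for MySQL here; ":memory:" creates a throwaway database.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Create the table and insert one row using ? placeholders.
cursor.execute("create table user (id varchar(20) primary key, name varchar(20))")
cursor.execute("insert into user (id, name) values (?, ?)", ["1", "Michael"])
inserted = cursor.rowcount

# Commit the transaction, then query with a fresh cursor.
conn.commit()
cursor.close()

cursor = conn.cursor()
cursor.execute("select * from user where id = ?", ("1",))
values = cursor.fetchall()

# Close the cursor and the connection.
cursor.close()
conn.close()

print(inserted, values)
```

Passing values through placeholders, rather than formatting them into the SQL string, lets the driver handle quoting and avoids SQL injection.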

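Common Crawler Problems

Common anti-scraping techniques

User-Agent

Xinhuanet (新華網)

Referer

Request rate

36kr.com
taobao.com

Content shown only after a user click

csdn.net blogs

Content available only after login

taobao.com

Various human-verification challenges (CAPTCHA)

IP bans and account bans

Encoding issues: GB2312, GB18030, GBK, UTF-8, ISO8859-1

GB18030 > GBK > GB2312: each is a superset of the next, so they are backward compatible
UTF-8 is not compatible with the encodings above (apart from the ASCII range)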
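
The compatibility claims above can be checked with Python's built-in codecs. A minimal sketch using the two-character string 中文:

```python
text = "中文"

gb2312_bytes = text.encode("gb2312")

# GBK and GB18030 are supersets of GB2312, so they decode its bytes correctly.
via_gbk = gb2312_bytes.decode("gbk")
via_gb18030 = gb2312_bytes.decode("gb18030")
print(via_gbk, via_gb18030)

# UTF-8 uses a different byte layout for the same characters.
utf8_bytes = text.encode("utf-8")
print(len(gb2312_bytes), len(utf8_bytes))   # 4 bytes vs 6 bytes

# Decoding these GB2312 bytes as UTF-8 fails outright.
try:
    gb2312_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
print(wrong_decode_failed)
```

When a scraped page shows garbled Chinese text, decoding the raw bytes with gb18030 instead of utf-8 is a reasonable first thing to try, since it covers the other two GB encodings.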

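Hiding your IP with a proxy

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}
url = 'https://ip.cn/'

# The site below is an API that returns a random proxy address
ip_url = 'http://proxy.w2n1ck.com:9090/random'
proxy = 'http://' + requests.get(ip_url, timeout=10).text.strip()
# Register the proxy for both schemes: url is https, so an 'http'-only
# entry would be ignored and the request would go out directly
proxies = {'http': proxy, 'https': proxy}
print(proxies)
response = requests.get(url, headers=headers, proxies=proxies, timeout=10).text
html = etree.HTML(response)
# Extract the IP address shown on the page
res = html.xpath('//*[@id="result"]/div/p[1]/code/text()')
print(res)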

Simulated login

Handling image CAPTCHAs

Baidu OCR

https://aip.baidubce.com/rest…

Tesseract + OpenCV

ML-OCR

Gives the best results

Manual OCR

Entered by hand

Data Visualization

Matplotlib

ECharts

Tableau

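Advanced Topics

Scraping data from mobile app APIs

Capturing app data with Python 3.x + Fiddler
The idea: the computer shares its Wi-Fi, the phone joins that Wi-Fi, and the phone's proxy setting is pointed at the computer's Wi-Fi IP address.
The phone trusts the proxy certificate from the computer, which completes the man-in-the-middle setup.
Intercept the network requests, then vary their parameters to complete the scrape.
https://segmentfault.com/a/11…

Distributed crawlers

Use a database or a cache as the coordination tool

Chinese word segmentation

jieba (結巴分詞)

Natural language analysis

hanlp
tlp-cloud

Face recognition

Alibaba's API

Image recognition

Where to Ask Questions

Coursera

stackoverflow.com

SegmentFault (思否)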
