Python Crawler Beginner Tutorial 55-100: Advanced Python Crawler Techniques, CAPTCHA Part

Posted by 夢想橡皮擦 on 2019-04-02

Original URL: https://juejin.im/post/5ca2a76df265da30dc7ac64d


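Exploring CAPTCHAs

If you are a data mining enthusiast, CAPTCHAs are a pit you cannot avoid, and fighting with all kinds of them is bound to be part of how you grow. Over the next few articles I will try to track down as many kinds of CAPTCHA as I can and attempt to solve each one. Some of the techniques involved are ones even I have not seen before. Come on, let's code together.

Digit + letter CAPTCHAs

I picked a random CAPTCHA from a Baidu image search, shown below.

Today we'll use the simplest approach to CAPTCHA recognition: pytesseract, a Python wrapper around the Tesseract OCR engine and one of the easier OCR options in Python.

Installing the libraries

Before using pytesseract, install the required modules with pip. You need two of them:

the pytesseract library and the image processing library pillow.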

pip install pytesseract
pip install pillow

If you write some recognition code right after installing these two libraries, you will usually get the following error:

pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path

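This happens because one piece is still missing: pytesseract only calls the Tesseract program, it does not include it.

Install the Tesseract-OCR software. It is an open-source OCR engine whose development was sponsored by Google for many years.

Download > github.com/tesseract-o…

Chinese language pack download > github.com/tesseract-o…

Pick the version you need and download it.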

Basic pillow operations

Method	Description
open()	Opens an image file, for example: from PIL import Image; im = Image.open("1.png"); im.show()
save()	Saves the image to a file
convert()	An image instance method that takes a mode argument specifying the colour mode to convert to

The mode argument of convert() accepts the following values:
· 1 (1-bit pixels, black and white, stored with one pixel per byte)
· L (8-bit pixels, grayscale)
· P (8-bit pixels, mapped to any other mode using a colour palette)
· RGB (3x8-bit pixels, true colour)
· RGBA (4x8-bit pixels, true colour with transparency mask)
· CMYK (4x8-bit pixels, colour separation)
· YCbCr (3x8-bit pixels, colour video format)
· I (32-bit signed integer pixels)
· F (32-bit floating point pixels)
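As a rough illustration of the three methods above, here is a minimal sketch. It builds a small image in memory and writes to a temporary folder so it does not depend on 1.png existing; the image size, colour, and the file name gray.png are arbitrary choices made for this example.

```python
import os
import tempfile

from PIL import Image

# Build a small red image in memory so the sketch does not need 1.png.
im = Image.new("RGB", (60, 20), color=(255, 0, 0))

gray = im.convert("L")     # 8-bit grayscale
bilevel = im.convert("1")  # 1-bit black and white

# save() picks the file format from the extension (.png here).
out_path = os.path.join(tempfile.mkdtemp(), "gray.png")
gray.save(out_path)

# open() reads the file back; mode and size survive the round trip.
reopened = Image.open(out_path)
print(reopened.mode, reopened.size)
```

With a real CAPTCHA you would replace the Image.new(...) line with Image.open("1.png").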

Filter

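from PIL import Image, ImageFilter

im = Image.open('1.png')
# Note: filter() returns a new image and leaves im unchanged,
# so assign the result (e.g. blurred = im.filter(...)) to keep it.
# Gaussian blur
im.filter(ImageFilter.GaussianBlur(radius=2))
# Simple blur
im.filter(ImageFilter.BLUR)
# Edge enhancement
im.filter(ImageFilter.EDGE_ENHANCE)
# Edge detection
im.filter(ImageFilter.FIND_EDGES)
# Emboss
im.filter(ImageFilter.EMBOSS)
# Contour
im.filter(ImageFilter.CONTOUR)
# Sharpen
im.filter(ImageFilter.SHARPEN)
# Smooth
im.filter(ImageFilter.SMOOTH)
# Detail
im.filter(ImageFilter.DETAIL)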


Format

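The format attribute holds the file format of the image; if the image was not opened from a file, its value is None. The size attribute is a tuple with the image width and height in pixels. The mode attribute is the image mode; common modes are L for grayscale, RGB for true colour, and CMYK for pre-press images. If the file cannot be opened, Image.open() raises IOError (an alias of OSError in Python 3).

There is a good blog post on this topic worth consulting > www.cnblogs.com/mapu/p/8341…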
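The attributes described above can be checked with a short sketch. It assumes nothing beyond Pillow itself: the image is created in memory and round-tripped through a BytesIO buffer, and no_such_file.png is a deliberately missing, made-up file name.

```python
import io

from PIL import Image

# An image created in memory was not loaded from a file, so format is None.
created = Image.new("RGB", (80, 30))
print(created.format, created.size, created.mode)

# After a round trip through PNG data, format is filled in.
buffer = io.BytesIO()
created.save(buffer, format="PNG")
buffer.seek(0)
loaded = Image.open(buffer)
print(loaded.format, loaded.size, loaded.mode)

# A file that cannot be opened raises OSError (IOError is the same class).
try:
    Image.open("no_such_file.png")
    open_failed = False
except OSError as exc:
    open_failed = True
    print("could not open:", exc)
```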

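CAPTCHA recognition

If you still get the error after installing everything, find the module file pytesseract.py and edit it.

It is usually located at C:\Program Files\Python36\Lib\site-packages\pytesseract\pytesseract.py

In that file, change tesseract_cmd = 'tesseract' to your own install path. Use a raw string (the r prefix), otherwise the \t in \tesseract.exe is read as a tab character.
For example: tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'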
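An alternative to editing the installed pytesseract.py is to set the same tesseract_cmd value from your own script, which survives reinstalling the package. The sketch below is only one way to do that: the helper names find_tesseract and configure_pytesseract and the candidate paths are made up for this example, so adjust the paths to your machine.

```python
import os
import shutil

# Hypothetical install locations; change these to match your machine.
CANDIDATE_PATHS = [
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
]


def find_tesseract(candidates=CANDIDATE_PATHS):
    """Return the tesseract executable from PATH or the candidates, else None."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract():
    """Point pytesseract at the executable that find_tesseract() located."""
    import pytesseract  # imported lazily; needs `pip install pytesseract`

    cmd = find_tesseract()
    if cmd is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd


print("tesseract found at:", find_tesseract())
```

Call configure_pytesseract() once at the start of your script, before the first image_to_string() call.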

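If you see the following error, read on:

Error opening data file \Program Files (x86)\Tesseract-OCR\tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable

The fix is fairly easy. As the message says, the TESSDATA_PREFIX environment variable is missing, so you only need to add it to your system environment variables. Also check that chi_sim.traineddata actually exists in the tessdata folder, since it comes from the separate Chinese language pack.

Add TESSDATA_PREFIX=C:\Program Files (x86)\Tesseract-OCR as an environment variable.

Restart your IDE or open a new CMD window so the variable is picked up, then run the code again. Note that in my setup the py script had to be run as administrator at this point.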
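Before rerunning the OCR code, it can help to confirm what Python actually sees. The following is a minimal diagnostic sketch, not part of pytesseract: find_traineddata is a helper name invented here, and checking two folder layouts is an assumption, since some Tesseract releases expect TESSDATA_PREFIX to be the parent of tessdata while others expect it to be the tessdata folder itself.

```python
import os


def find_traineddata(lang, prefix):
    """Return the path of <lang>.traineddata under prefix, or None."""
    if not prefix:
        return None
    filename = lang + ".traineddata"
    # Assumption: check both prefix/tessdata/ and prefix/ itself,
    # because the expected layout differs between Tesseract releases.
    for folder in (os.path.join(prefix, "tessdata"), prefix):
        candidate = os.path.join(folder, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


prefix = os.environ.get("TESSDATA_PREFIX")
print("TESSDATA_PREFIX =", prefix)
print("chi_sim data file:", find_traineddata("chi_sim", prefix))
```

If the first line prints None, the new environment variable has not reached the process yet (restart the IDE or terminal); if the second prints None, the language file is missing from the folder.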

The steps are:

Open the image with Image.open()
Recognize the image with pytesseract

import pytesseract
from PIL import Image

def main():
    # Open the CAPTCHA image
    image = Image.open("1.jpg")

    # Run OCR; lang="chi_sim" needs the Simplified Chinese language pack
    text = pytesseract.image_to_string(image, lang="chi_sim")
    print(text)

if __name__ == '__main__':
    main()

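In my tests, English text and digits are recognized without much trouble, but Chinese is painfully bad; only text with generous spacing gets recognized. Sigh, not great. That said, the 7364 CAPTCHA from earlier was recognized with no effort at all.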
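Since English letters and digits work best, a digit + letter CAPTCHA can be read with the English data and the raw output cleaned up afterwards. The sketch below is an assumption-laden variant, not the code used above: clean_captcha_text and read_captcha are names invented for this example, it assumes the CAPTCHA is a single line of ASCII letters and digits, and it passes --psm 7 so Tesseract treats the image as one text line.

```python
import re


def clean_captcha_text(raw):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def read_captcha(path, lang="eng"):
    """OCR a single-line CAPTCHA image and return the cleaned text."""
    # Imported lazily so clean_captcha_text() works without Tesseract installed.
    import pytesseract
    from PIL import Image

    # --psm 7: treat the image as a single line of text.
    raw = pytesseract.image_to_string(Image.open(path), lang=lang, config="--psm 7")
    return clean_captcha_text(raw)


# OCR output often carries stray spaces, newlines, or a form feed.
print(clean_captcha_text(" 73 64\n\x0c"))  # prints 7364
```

With Tesseract installed, read_captcha("1.jpg") would return the cleaned string for the image used earlier.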

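Recognizing CAPTCHAs with interference

Next, let's recognize the CAPTCHA below. As before, we first try it directly. Running the code prints nothing, so the image needs some preprocessing.

The basic principle is always the same:

Convert colour to grayscale
Convert grayscale to binary
Recognize the binary image

Colour to grayscale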

im = im.convert('L')  

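Grayscale to binary. The solution here is fairly formulaic: threshold segmentation, where threshold is the cut-off point. Gray levels below it become black (0) and the rest become white (1).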

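def initTable(threshold=140):
    # Build a 256-entry lookup table, one entry per gray level
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)  # darker than the threshold: black
        else:
            table.append(1)  # threshold or lighter: white
    return table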

Usage

binaryImage = im.point(initTable(), '1')
binaryImage.show()

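After the adjustment

We still need to deal with the interference lines. Going further from here becomes a matter of deeper image processing; for the simple CAPTCHAs on small sites, this approach is enough. That's it for this post. In the next one we continue studying CAPTCHAs.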
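The grayscale and threshold steps can be combined into one helper. This is a minimal sketch under stated assumptions: build_table and binarize are names chosen for this example, the threshold of 140 is simply the default used above, and a synthetic four-pixel strip stands in for a real CAPTCHA so the result is easy to inspect.

```python
from PIL import Image


def build_table(threshold=140):
    """Lookup table: gray levels below threshold -> 0 (black), others -> 1 (white)."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(image, threshold=140):
    """Convert an image to grayscale, then to a 1-bit image via the table."""
    return image.convert("L").point(build_table(threshold), "1")


# Synthetic 4x1 strip from black to white, standing in for a real CAPTCHA.
strip = Image.new("RGB", (4, 1))
for x, level in enumerate((0, 100, 200, 255)):
    strip.putpixel((x, 0), (level, level, level))

binary = binarize(strip)
# Report each pixel as dark (False) or light (True).
is_light = [bool(binary.getpixel((x, 0))) for x in range(4)]
print(binary.mode, is_light)
```

For a real image, pass binarize(Image.open("1.jpg")) to pytesseract.image_to_string(). The right threshold depends on the CAPTCHA, so expect to tune it.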
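As a small taste of that deeper processing, one common and simple cleanup is to erase dark pixels that have almost no dark neighbours. The sketch below is not the method of this series, only an illustration: remove_isolated_pixels is a name invented here, the image is represented as a plain list of rows with 0 for black and 1 for white, and min_neighbours=2 is an arbitrary starting value.

```python
def remove_isolated_pixels(grid, min_neighbours=2):
    """Return a copy of grid where black pixels with too few black
    8-neighbours are turned white. grid uses 0 = black, 1 = white."""
    height = len(grid)
    width = len(grid[0]) if height else 0
    cleaned = [row[:] for row in grid]
    for y in range(height):
        for x in range(width):
            if grid[y][x] != 0:
                continue
            neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and grid[ny][nx] == 0:
                        neighbours += 1
            if neighbours < min_neighbours:
                cleaned[y][x] = 1
    return cleaned


# One stray dot in the corner plus a solid 2x2 block of "ink".
noisy = [
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
for row in remove_isolated_pixels(noisy):
    print(row)
```

This removes speckle noise only. Each pixel inside a continuous one-pixel-wide line still has two dark neighbours, so real interference lines usually need a stricter threshold or a different technique.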

Reference links

tesserocr GitHub: github.com/sirfz/tesse…
tesserocr PyPI: pypi.python.org/pypi/tesser…
pytesseract GitHub: github.com/madmaze/pyt…
pytesseract PyPI: pypi.org/project/pyt…
tesseract download: digi.bib.uni-mannheim.de/tesseract
tesseract GitHub: github.com/tesseract-o…
tesseract language packs: github.com/tesseract-o…
tesseract documentation: github.com/tesseract-o…
