I'd long heard how great Node.js's asynchronous strategy is, how fast its I/O is... good at everything, apparently. So today I set out to compare Node.js with Python. Few tasks show off an async strategy and I/O strength better than a crawler, so let's settle it with a crawling project.
The crawler project
The target is zhongchou.com's listing of in-progress crowdfunding projects, http://www.zhongchou.com/brow… We crawl every project currently being funded, extract the URL of each project's detail page, and write the URLs to a txt file.
Head-to-head comparison
Python: original version
```python
# -*- coding:utf-8 -*-
'''
Created on 20160827

@author: qiukang
'''
import requests, time
from BeautifulSoup import BeautifulSoup  # HTML parser (BeautifulSoup 3)

# Request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.zhongchou.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

# Collect the detail-page URL of every project
def getItems(allpage):
    no = 0
    items = open('pystandard.txt', 'a')
    for page in range(allpage):
        if page == 0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p' + str(page + 1)
        # print url  #①
        r1 = requests.get(url, headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class": "ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a['href']
            items.write(href + "\n")
            no += 1
    items.close()
    return no

if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print('it takes %s Seconds to get %s items ' % (end - start, no))
```
Results of five runs:
```
it takes 48.1727159614 Seconds to get 720 items
it takes 45.3397999415 Seconds to get 720 items
it takes 44.4811429862 Seconds to get 720 items
it takes 44.4619293082 Seconds to get 720 items
it takes 46.669706593 Seconds to get 720 items
```
Python: multi-threaded version
```python
# -*- coding:utf-8 -*-
'''
Created on 20160827

@author: qiukang
'''
import requests, time, threading
from BeautifulSoup import BeautifulSoup  # HTML parser (BeautifulSoup 3)

# Request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.zhongchou.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

items = open('pymulti.txt', 'a')
no = 0
lock = threading.Lock()  # guards items and no

# Collect the detail-page URLs for a slice of the listing pages
def getItems(urllist):
    # print urllist  #①
    global items, no, lock
    for url in urllist:
        r1 = requests.get(url, headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class": "ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a['href']
            lock.acquire()
            items.write(href + "\n")
            no += 1
            # print no
            lock.release()

if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    allthread = 30
    per = int(allpage / allthread)  # listing pages per thread
    urllist = []
    ths = []
    for page in range(allpage):
        if page == 0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p' + str(page + 1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        th = threading.Thread(target=getItems, args=(urllist[i * per:(i + 1) * per],))
        th.start()
        th.join()
    items.close()
    end = time.clock()
    print('it takes %s Seconds to get %s items ' % (end - start, no))
```
Results of five runs:
```
it takes 45.5222291114 Seconds to get 720 items
it takes 46.7097831417 Seconds to get 720 items
it takes 45.5334646156 Seconds to get 720 items
it takes 48.0242797553 Seconds to get 720 items
it takes 44.804855018 Seconds to get 720 items
```
This multi-threaded version shows no advantage at all. Toggling the print at #① revealed why: the so-called multi-threaded version was in fact executing one thread at a time, exactly like the single-threaded one.
Improving the Python version
Single-threaded
First, the HTML-parsing step can be improved. Analysis shows that

```python
lists = soup.findAll('a', attrs={"class": "siteCardICH3"})
```

works better than

```python
lists = soup.findAll(attrs={"class": "ssCardItem"})
```

because it finds the `a` tags directly instead of first finding the `div`s and then the `a` under each `div`. Five runs after this change show a clear improvement:
```
it takes 41.0018861912 Seconds to get 720 items
it takes 42.0260390497 Seconds to get 720 items
it takes 42.249635988 Seconds to get 720 items
it takes 41.295524133 Seconds to get 720 items
it takes 42.9022894154 Seconds to get 720 items
```
Multi-threaded
Change `getItems(urllist)` to `getItems(urllist, thno)`, and bracket the function with `print thno, " begin at", time.clock()` at entry and `print thno, " end at", time.clock()` at exit. The result:
```
0 begin at 0.00100631078628
0 end at 1.28625832936
1 begin at 1.28703230691
1 end at 2.61739476075
2 begin at 2.61801291642
2 end at 3.92514717937
3 begin at 3.9255829208
3 end at 5.38870235361
4 begin at 5.38921134066
4 end at 6.670658786
5 begin at 6.67125734731
5 end at 8.01520989534
6 begin at 8.01566383155
6 end at 9.42006780585
7 begin at 9.42053340537
7 end at 11.0386755513
8 begin at 11.0391565464
8 end at 12.421359168
9 begin at 12.4218294329
9 end at 13.9932716671
10 begin at 13.9939957256
10 end at 15.3535799145
11 begin at 15.3540870354
11 end at 16.6968289314
12 begin at 16.6972665389
12 end at 17.9798803157
13 begin at 17.9804714125
13 end at 19.326706238
14 begin at 19.3271438455
14 end at 20.8744308886
15 begin at 20.8751017624
15 end at 22.5306500245
16 begin at 22.5311450156
16 end at 23.7781693541
17 begin at 23.7787245279
17 end at 25.1775114499
18 begin at 25.178350742
18 end at 26.5497330734
19 begin at 26.5501776789
19 end at 27.970799259
20 begin at 27.9712727895
20 end at 29.4595075375
21 begin at 29.4599959972
21 end at 30.9507299602
22 begin at 30.9513989679
22 end at 32.2762763982
23 begin at 32.2767182045
23 end at 33.6476256057
24 begin at 33.648137392
24 end at 35.1100517711
25 begin at 35.1104907783
25 end at 36.462657099
26 begin at 36.4632234696
26 end at 37.7908515759
27 begin at 37.7912845182
27 end at 39.4359928956
28 begin at 39.436448698
28 end at 40.9955021593
29 begin at 40.9960871912
29 end at 42.6425665264
it takes 42.6435882327 Seconds to get 720 items
```
So the threads really were not running concurrently; they executed one after another, which defeats the whole point of multi-threading. Where is the problem? In my loop the two lines

```python
th.start()
th.join()
```

sit right next to each other: `join()` blocks until the thread finishes, so each new thread only starts once the previous one is completely done. The fix:
```python
for i in range(allthread):
    # print urllist[i*(per):(i+1)*(per)]
    th = threading.Thread(target=getItems, args=(urllist[i * per:(i + 1) * per], i))
    ths.append(th)
for th in ths:
    th.start()
for th in ths:
    th.join()
```
The result:
```
0 begin at 0.0010814225325
1 begin at 0.00135201143191
2 begin at 0.00191744892518
3 begin at 0.0021311208492
4 begin at 0.00247495536449
5 begin at 0.0027334144167
6 begin at 0.00320601192551
7 begin at 0.00379011072218
8 begin at 0.00425431064445
9 begin at 0.00511692939449
10 begin at 0.0132038052264
11 begin at 0.0165926979253
12 begin at 0.0170886220634
13 begin at 0.0174665134574
14 begin at 0.018348726576
15 begin at 0.0189780790334
16 begin at 0.0201896641572
17 begin at 0.0220576606283
18 begin at 0.0231484138125
19 begin at 0.0238804034387
20 begin at 0.0273901280772
21 begin at 0.0300363009005
22 begin at 0.0362878375422
23 begin at 0.0395512329756
24 begin at 0.0431556637289
25 begin at 0.0459581249682
26 begin at 0.0482254733323
27 begin at 0.0535430117384
28 begin at 0.0584971212607
29 begin at 0.0598136762161
16 end at 65.2657542222
24 end at 66.2951247811
21 end at 66.3849747583
4 end at 66.6230160119
5 end at 67.5501632164
29 end at 67.7516992283
23 end at 68.6985322418
7 end at 69.1060433231
22 end at 69.2743398214
2 end at 69.5523713152
14 end at 69.6454986837
15 end at 69.8333400981
12 end at 69.9508018062
10 end at 70.2860348602
26 end at 70.3670659719
13 end at 70.3847232972
27 end at 70.3941635841
11 end at 70.5132838156
1 end at 70.7272351926
0 end at 70.9115253609
6 end at 71.0876563409
8 end at 71.1124805398
25 end at 71.1145248855
3 end at 71.4606034226
19 end at 71.6103622486
18 end at 71.6674453096
20 end at 71.725601862
17 end at 71.7778992318
9 end at 71.7847479301
28 end at 71.7921004837
it takes 71.7931912368 Seconds to get 720 items
```
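As an aside, the start-all-then-join-all bookkeeping can be written more compactly with a thread pool. A minimal sketch using the standard library's `multiprocessing.dummy` (not used in the experiments here; `fetch_one` is an illustrative stand-in for the per-page download-and-parse work):

```python
# Sketch: a thread pool instead of hand-rolled start/join loops.
# multiprocessing.dummy exposes the Pool API backed by threads.
from multiprocessing.dummy import Pool
import requests

def fetch_one(url):
    # illustrative stand-in for "download and parse one listing page"
    return requests.get(url).status_code

urls = ['http://www.zhongchou.com/browse/di'] + \
       ['http://www.zhongchou.com/browse/di-p%d' % p for p in range(2, 31)]

pool = Pool(30)                    # 30 worker threads, like allthread above
codes = pool.map(fetch_one, urls)  # blocks until every URL is processed
pool.close()
pool.join()
print(codes)
```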
Reflection
The threads above are finally concurrent, yet the total time is far worse than the single-threaded version... I hadn't found the cause at the time; my guess was that BeautifulSoup somehow doesn't support multi-threading (pointers welcome). A more likely culprit is CPython's GIL: parsing is CPU-bound pure-Python work, so the threads take turns holding the interpreter rather than running in parallel, and thirty threads contending for the GIL add switching overhead on top — the sketch below illustrates the effect. To test the BeautifulSoup hunch anyway, I dropped it in favour of plain string searching, starting from the single-threaded version (the listing after the sketch):
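A minimal, self-contained demonstration of the GIL effect (`cpu_work` is an illustrative stand-in for parsing work; it is not part of the crawler code):

```python
# Sketch: CPU-bound work does not parallelize across threads in CPython,
# because the GIL lets only one thread execute Python bytecode at a time.
import time, threading

def cpu_work(n=2000000):
    total = 0
    for i in range(n):  # pure-Python loop: holds the GIL while it runs
        total += i
    return total

def serial(jobs=4):
    for _ in range(jobs):
        cpu_work()

def threaded(jobs=4):
    ths = [threading.Thread(target=cpu_work) for _ in range(jobs)]
    for th in ths:
        th.start()
    for th in ths:
        th.join()

if __name__ == '__main__':
    t0 = time.time()
    serial()
    t1 = time.time()
    threaded()
    t2 = time.time()
    # Expect roughly equal times, often worse for the threaded run.
    print('serial:   %.2f Seconds' % (t1 - t0))
    print('threaded: %.2f Seconds' % (t2 - t1))
```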
```python
# -*- coding:utf-8 -*-
'''
Created on 20160827

@author: qiukang
'''
import requests, time
from BeautifulSoup import BeautifulSoup  # unused after the switch to str.find

# Request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.zhongchou.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

# Collect the detail-page URL of every project via string search
def getItems(allpage):
    no = 0
    data = set()
    for page in range(allpage):
        if page == 0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p' + str(page + 1)
        # print url  #①
        r1 = requests.get(url, headers=headers)
        html = r1.text.encode('utf8')
        start = 5000
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            # print "http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+"\n"
            # time.sleep(100)
            data.add("http://www.zhongchou.com/deal-show/" + html[index + 10:index + 19] + "\n")
            start = index + 1000
    items = open('pystandard.txt', 'a')
    items.write("".join(data))
    items.close()
    return len(data)

if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print('it takes %s Seconds to get %s items ' % (end - start, no))
```
Results of three runs:
```
it takes 11.6800132309 Seconds to get 720 items
it takes 11.3621804427 Seconds to get 720 items
it takes 11.6811991567 Seconds to get 720 items
```
Then the multi-threaded version gets the same treatment:
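(The listing published at this point repeated the single-threaded code verbatim, so the following is a minimal reconstruction: the string-search `getItems` grafted onto the corrected start/join scaffold from above. Details such as the output file name are assumptions.)

```python
# -*- coding:utf-8 -*-
# Reconstruction: multi-threaded crawler using str.find() instead of
# BeautifulSoup; threading scaffold as in the fixed multi-threaded version.
import requests, time, threading

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.zhongchou.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

data = set()             # de-duplicated project URLs
lock = threading.Lock()  # guards data

def getItems(urllist):
    global data
    for url in urllist:
        r1 = requests.get(url, headers=headers)
        html = r1.text.encode('utf8')
        start = 5000
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            item = "http://www.zhongchou.com/deal-show/" + html[index + 10:index + 19] + "\n"
            lock.acquire()
            data.add(item)
            lock.release()
            start = index + 1000

if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    allthread = 30
    per = int(allpage / allthread)  # listing pages per thread
    urllist = ['http://www.zhongchou.com/browse/di']
    for page in range(1, allpage):
        urllist.append('http://www.zhongchou.com/browse/di-p' + str(page + 1))
    ths = []
    for i in range(allthread):
        th = threading.Thread(target=getItems, args=(urllist[i * per:(i + 1) * per],))
        ths.append(th)
    for th in ths:
        th.start()
    for th in ths:
        th.join()
    items = open('pymulti.txt', 'a')
    items.write("".join(data))
    items.close()
    end = time.clock()
    print('it takes %s Seconds to get %s items ' % (end - start, len(data)))
```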
Results of three runs:
```
it takes 1.4781525123 Seconds to get 720 items
it takes 1.44905954029 Seconds to get 720 items
it takes 1.49297891786 Seconds to get 720 items
```
So multi-threading really is many times faster than single-threaded here. And for a simple scraping task, the built-in string methods beat parsing the HTML with BeautifulSoup by a wide margin.
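To put a rough number on that claim, a micro-benchmark sketch: parse one saved listing page both ways and compare (`cached_page.html` is an assumed local copy; absolute numbers will vary by machine and page size):

```python
# Sketch: time BeautifulSoup parsing vs str.find() on one saved page.
import time
from BeautifulSoup import BeautifulSoup

html = open('cached_page.html').read()  # assumed local copy of a listing page

t0 = time.time()
soup = BeautifulSoup(html)
links = [a['href'] for a in soup.findAll('a', attrs={"class": "siteCardICH3"})]
t1 = time.time()

hits, start = 0, 5000
while True:
    index = html.find("deal-show", start)
    if index == -1:
        break
    hits += 1
    start = index + 1000
t2 = time.time()

print('BeautifulSoup: %.4f Seconds, %d links' % (t1 - t0, len(links)))
print('str.find:      %.4f Seconds, %d hits' % (t2 - t1, hits))
```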
NodeJs
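(The listing originally published under this heading repeated the Python string-search code, so the parsing version is reconstructed here: a minimal cheerio-based sketch, assuming the same `a.siteCardICH3` selector as the improved Python code and mirroring the request scaffold of the string-search version further down. The output file name is an assumption.)

```javascript
var request = require("request");
var cheerio = require('cheerio');
var fs = require('fs');

var t1 = new Date().getTime();
var allpage = 30;
var urllist = [];
var urldata = [];
var mark = 0;

for (var i = 0; i < allpage; i++) {
    if (i == 0) urllist[i] = 'http://www.zhongchou.com/browse/di';
    else urllist[i] = 'http://www.zhongchou.com/browse/di-p' + (i + 1).toString();
    request(urllist[i], function (error, resp, body) {
        if (!error && resp.statusCode == 200) {
            getUrl(body);
        }
    });
}

function getUrl(data) {
    mark += 1;
    var $ = cheerio.load(data);             // parse the page, jQuery-style
    $('a.siteCardICH3').each(function () {  // same selector as the Python version
        var url = $(this).attr('href') + "\n";
        if (urldata.indexOf(url) == -1) urldata.push(url);
    });
    if (mark == allpage) { // all pages finished
        fs.writeFile('./nodeparse.txt', urldata.join(""), function (err) {
            if (err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2 - t1) / 1000).toString() +
                    " Seconds to get " + urldata.length.toString() + " items");
    }
}
```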
Results of five runs:
```
it takes 3.949 Seconds to get 720 items
it takes 3.642 Seconds to get 720 items
it takes 3.641 Seconds to get 720 items
it takes 3.938 Seconds to get 720 items
it takes 3.783 Seconds to get 720 items
```
So with the same parse-the-HTML approach, Node.js demolishes Python. What about string search?
```javascript
var request = require("request");
var cheerio = require('cheerio');
var fs = require('fs');

var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array();
var urldata = new Array();
var mark = 0;
var no = 0;

for (var i = 0; i < allpage; i++) {
    if (i == 0) urllist[i] = 'http://www.zhongchou.com/browse/di';
    else urllist[i] = 'http://www.zhongchou.com/browse/di-p' + (i + 1).toString();
    // console.log(urllist[i]);
    request(urllist[i], function (error, resp, body) {
        if (!error && resp.statusCode == 200) {
            getUrl(body);
        }
    });
}

function getUrl(data) {
    mark += 1;
    var start = 5000;
    while (true) {
        var index1 = data.indexOf("deal-show", start);
        if (index1 == -1) break;
        var url = "http://www.zhongchou.com/deal-show/" + data.substring(index1 + 10, index1 + 19) + "\n";
        // console.log(url);
        if (urldata.indexOf(url) == -1) {
            urldata.push(url);
        }
        start = index1 + 1000;
    }
    if (mark == allpage) { // all pages finished
        // console.log(urldata);
        no = urldata.length;
        fs.writeFile('./nodestandard.txt', urldata.join(""), function (err) {
            if (err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2 - t1) / 1000).toString() + " Seconds to get " + no.toString() + " items");
    }
}
```
Results of five runs:
```
it takes 3.695 Seconds to get 720 items
it takes 3.781 Seconds to get 720 items
it takes 3.94 Seconds to get 720 items
it takes 3.705 Seconds to get 720 items
it takes 3.601 Seconds to get 720 items
```
About the same time as the parsing approach. Worth noting: unlike the Python version, no lock is needed here, because all of request's callbacks run on Node's single event-loop thread, so `mark` and `urldata` are never touched concurrently.
Conclusion
Based on what I know and on this experiment, my conclusion is: with multi-threading, Python's download speed can beat Node.js, but at parsing web pages Python is slower than Node.js — after all, JS was born to work with web pages. And a complex crawler can't do everything with raw string searches anyway.