This doesn't really count as a web crawler, does it?!

Posted by 小LiAn on 2017-04-15

Original URL: https://www.cnblogs.com/cnlian/p/6711495.html

------------------------------------------------------------------------------------------------------------------

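For a program I'm writing, I needed a Cantonese dictionary (one that gives the Cantonese romanization of any Chinese character), but after searching all over the web I couldn't find a ready-made one.

With no other options, I had to "plunder" the knowledge of an existing Cantonese dictionary website :) and pull a mapping table out of it.

So I wrote the following code: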

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Threading;

namespace Yueyu_Dic_Crawler
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create the output file; the using block flushes and closes it at the end.
            using (StreamWriter sw = new StreamWriter(@"C:\Users\Lian\Desktop\Dic\Dictionary.txt"))
            {
                // Because of how this site's URLs are built, no real crawler is needed:
                // changing the four hex digits in the middle of the URL is enough to
                // walk through the entries for 60,000+ Chinese characters.
                for (int high = 0x00; high <= 0xFF; high++)
                    for (int low = 0x00; low <= 0xFF; low++)
                    {
                        // Without this sleep, the site's server-side protection blows up on me :(
                        Thread.Sleep(100);

                        // From %00%00 to %FF%FF :)
                        string url = "http://www.yueyv.cn/?keyword=%" + high.ToString("X2") + "%" + low.ToString("X2") + "&submit=%B2%E9+%D1%AF";

                        // Request and response
                        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                        request.Method = "GET";

                        // IMPORTANT: if the HttpWebResponse is not closed, no reply comes back
                        // after the first two requests. This cost me a very long time. It is most
                        // likely because an unclosed response keeps its connection busy and
                        // .NET Framework allows only 2 connections per host by default.
                        // The using statements below close the response and the reader every time.
                        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                        using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
                        {
                            // By observation, line 370 of the HTML source holds the character
                            // and line 394 holds the Cantonese romanization.
                            for (int z = 0; z < 369; z++) reader.ReadLine();
                            string hanziLine = reader.ReadLine();
                            for (int z = 0; z < 23; z++) reader.ReadLine();
                            string yuepinLine = reader.ReadLine();

                            // Write to the file and echo to the console.
                            sw.Write(hanziLine + "\t" + yuepinLine + "\r\n");
                            Console.WriteLine(hanziLine + "\t" + yuepinLine);
                        }
                    }
            }
        }
    }
}
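
The "skip to a fixed line number" trick in the code above can be pulled out into a small helper, which also makes it easy to try against a saved copy of a page. This is only a sketch: `LineExtractor` and `ReadLinesAt` are hypothetical names that do not appear in the original program, and it assumes the wanted line numbers are given in strictly ascending order.

```csharp
using System;
using System.IO;

static class LineExtractor
{
    // Returns the text at the given 1-based line numbers, which must be in
    // strictly ascending order. A line number past the end of the input
    // yields null for that position.
    public static string[] ReadLinesAt(TextReader reader, params int[] lineNumbers)
    {
        string[] result = new string[lineNumbers.Length];
        int current = 0; // number of ReadLine calls made so far

        for (int i = 0; i < lineNumbers.Length; i++)
        {
            string line = null;
            // Keep reading until the requested line has been consumed.
            while (current < lineNumbers[i])
            {
                line = reader.ReadLine();
                current++;
                if (line == null) break; // input ended before the requested line
            }
            result[i] = line;
        }
        return result;
    }
}
```

For the page layout described in the code comments, the call would be `LineExtractor.ReadLinesAt(reader, 370, 394)`, giving the character line and the romanization line in one pass.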

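In fact, there were a few small details along the way, for example:

1. Only some of the combinations actually hold data; for example, within 8000-8FFF only 8140-8FFE have entries (thanks to my partner);

2. I split 0000-FFFF into roughly 10 blocks and crawled them in 10 separate runs, because even with the sleep, the server's protection still sometimes blocks you;

3. I didn't use regular expressions; I just cleaned up the results a bit in Excel :) I will definitely use regular expressions in the future :)

4. For characters with more than one reading, only the first reading is recorded :)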
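
Details 1 and 2 can be combined into a short sketch. The 8140-8FFE observation matches the double-byte layout of GBK (lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F), which fits the gb2312 page encoding used in the code; whether the site follows GBK exactly is an assumption here, not something the original post confirms. `KeywordRange`, `GbkKeywords` and `SplitIntoBlocks` are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;

static class KeywordRange
{
    // Enumerates every (lead, trail) byte pair in the GBK double-byte range,
    // formatted the way the URL expects it, e.g. "%81%40".
    // Assumption: the site's keyword bytes follow the GBK layout.
    public static IEnumerable<string> GbkKeywords()
    {
        for (int lead = 0x81; lead <= 0xFE; lead++)
            for (int trail = 0x40; trail <= 0xFE; trail++)
            {
                if (trail == 0x7F) continue; // 0x7F is not a valid GBK trail byte
                yield return "%" + lead.ToString("X2") + "%" + trail.ToString("X2");
            }
    }

    // Splits the keywords into at most blockCount consecutive blocks of
    // nearly equal size, so that each block can be crawled in its own run.
    public static List<List<string>> SplitIntoBlocks(IEnumerable<string> keywords, int blockCount)
    {
        if (blockCount < 1) throw new ArgumentOutOfRangeException(nameof(blockCount));

        var all = new List<string>(keywords);
        var blocks = new List<List<string>>();
        int blockSize = (all.Count + blockCount - 1) / blockCount; // ceiling division

        for (int start = 0; start < all.Count; start += blockSize)
            blocks.Add(all.GetRange(start, Math.Min(blockSize, all.Count - start)));

        return blocks;
    }
}
```

Each block could then be crawled in a separate run, with the keyword spliced into the URL as `"http://www.yueyv.cn/?keyword=" + keyword + "&submit=%B2%E9+%D1%AF"`. Skipping the byte pairs outside this range would cut the number of requests from 65,536 to 23,940.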

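From having this idea at noon yesterday to getting it working tonight, two things struck me the most:

First, learning is so convenient in this day and age, and sharing knowledge is so easy!

Second, just how much knowledge and wealth is stored on the Internet!

In a few days I'll put this Cantonese dictionary online :) That shouldn't be illegal, right...

小Lian

Early morning, 2017/4/15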
