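In my two previous posts I introduced HttpClient and Jsoup, each with a simple code example.
Today let's put them to work and scrape the Top 500 chart from the Kugou Music site. Besides HttpClient and Jsoup, the code below also uses log4j for logging and ehcache for caching. If you are not familiar with those two, please look them up yourself; there are plenty of beginner examples online, so I won't write separate notes on them.
So why Kugou? It wasn't actually my idea. Not long ago I read a blog post by some expert who scraped Kugou (I can't remember exactly who, sorry~~~), and I wanted to try it with my own code. As far as I could tell, that post used neither a cache nor anti-anti-crawler tools such as proxy IPs, so I added both to my crawler. In my own tests it works through all 23 pages of songs automatically. Paid songs can only be accessed after logging in and purchasing, so they were not downloaded; only the other 400+ free songs download normally. So the folks at Kugou need not worry~~~
Then again, Kugou doesn't seem to have done anything special after that blog post came out, which left me the chance to write this code. That suggests they don't mind much: paid songs can't be scraped anyway, and the site already has some anti-crawler mechanisms in place.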
***************************************************************************
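Disclaimer:
This crawler and the content it fetches are for personal study and exchange only.
Do not use them for commercial purposes; otherwise you bear the consequences yourself.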
***************************************************************************
OK, enough talk, on to the good stuff~~
================ a very fancy divider =================
I. Design approach
First, the approach. The blog post I read didn't describe the process in detail, so let me fill that in:
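1. Open the Top 500 chart. The address bar shows: https://www.kugou.com/yy/rank/home/1-8888.html?from=homepage. The 1 in that URL is the page number, so to visit page N you just change 1 to N. This is the basis of the whole crawl.
2. Click a specific song, for example 《你的酒館對我打了烊》, and a new page opens: https://www.kugou.com/song/#hash=BE1E1D3C2A46B4CBD259ACA7FF050CD3&album_id=14913769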
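3. Press F12 and analyze the network requests. (What? Nothing shows up after opening F12? Just refresh the page.)
You will notice one request that takes a long time and whose type is media. It is very likely the request that actually fetches the mp3.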
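Look closer and, sure enough, it is. The real address of the mp3 is: http://fs.w.kugou.com/201905272134/9d4d81230e6f5c759df51618b03961a7/G126/M00/05/09/HocBAFxLAoeAT3BzAD1nWyW7V5M814.mp3
Close the page and open it again, and the real address of the MP3 becomes: http://fs.w.kugou.com/201905272139/2897cc9816b82f4cda304d927187b282/G126/M00/05/09/HocBAFxLAoeAT3BzAD1nWyW7V5M814.mp3
Comparing the two doesn't tell us much: the two path segments right after the host change on every visit, so the address can't simply be assembled by hand.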
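Let's keep digging. How does the page find this real address? Presumably one of the earlier requests returned it, so look through the earlier requests.
The response of one of them contains the real address of the MP3.
Its request URL is: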
https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery19106506492572547629_1558964792005&hash=BE1E1D3C2A46B4CBD259ACA7FF050CD3&album_id=14913769&dfid=3LWatj1PQwvn09grkH3FbFAF&mid=31adc5218ff6a510b05aacad71bc7090&platid=4&_=1558964792007
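Leave the page and capture this request again, then leave, switch to another song and capture it once more, and a pattern emerges:
The hash and album_id parameters (highlighted in pink in the original post) are the values from the address bar of the song's play page. The value highlighted in bold, apparently the trailing _ parameter, is the current date as a long. Everything else can stay unchanged. ("jQuery19106506492572547629_1558964792005" does change every time, but in my tests it makes no difference.)
So we can request this link to get the JSON that carries the real MP3 address, then request the real address to fetch the music file.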
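The steps in item 3 can be sketched roughly as follows. This is a minimal illustration rather than the code from the repository: the class name KugouPlayUrlSketch and its methods are made up for this example, the fixed dfid, mid and platid values are simply copied from the captured request above, and stripping a JSONP wrapper assumes the server wraps its reply in the callback name, which the post does not confirm.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns a play page URL into the getdata request URL.
class KugouPlayUrlSketch {

    // Endpoint taken from the request captured in the browser.
    static final String API = "https://wwwapi.kugou.com/yy/index.php?r=play/getdata";

    /** Parses the "key=value&key=value" pairs that follow '#' in the play page URL. */
    static Map<String, String> parseFragment(String playPageUrl) {
        Map<String, String> params = new LinkedHashMap<>();
        int hashIdx = playPageUrl.indexOf('#');
        if (hashIdx < 0) {
            return params;
        }
        for (String pair : playPageUrl.substring(hashIdx + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** Builds the request URL; only hash, album_id and the timestamp vary. */
    static String buildGetDataUrl(String hash, String albumId, long nowMillis) {
        return API
                + "&callback=jQuery19106506492572547629_" + nowMillis
                + "&hash=" + hash
                + "&album_id=" + albumId
                + "&dfid=3LWatj1PQwvn09grkH3FbFAF"          // copied from the captured request
                + "&mid=31adc5218ff6a510b05aacad71bc7090"   // copied from the captured request
                + "&platid=4"
                + "&_=" + nowMillis;
    }

    /** Removes a JSONP wrapper such as callback({...}); plain JSON is returned unchanged. */
    static String stripJsonp(String body) {
        String trimmed = body.trim();
        if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
            return trimmed;
        }
        int open = trimmed.indexOf('(');
        int close = trimmed.lastIndexOf(')');
        if (open < 0 || close <= open) {
            return trimmed;
        }
        return trimmed.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String page = "https://www.kugou.com/song/#hash=BE1E1D3C2A46B4CBD259ACA7FF050CD3&album_id=14913769";
        Map<String, String> p = parseFragment(page);
        System.out.println(buildGetDataUrl(p.get("hash"), p.get("album_id"), System.currentTimeMillis()));
    }
}
```

The JSON string returned by stripJsonp can then be handed to a JSON library (the project uses fastjson) to read the real MP3 address.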
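4. How do we get those pink-highlighted values (hash and album_id)? Look at the page source of the Top 500 list page and you will find the block below, which records the hash, song name, id and other basic info for every song on page N: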
// 列表資料 global.features = [{"Hash":"BE1E1D3C2A46B4CBD259ACA7FF050CD3","FileName":"\u9648\u96ea\u51dd - \u4f60\u7684\u9152\u9986\u5bf9\u6211\u6253\u4e86\u70ca","timeLen":251.048,"privilege":10,"size":4024155,"album_id":14913769,"encrypt_id":"tlk6517"},{"Hash":"9198B18815EE8CE42AE368AE29276F78","FileName":"\u9648\u96ea\u51dd - \u7eff\u8272","timeLen":269.064,"privilege":10,"size":4314636,"album_id":15270740,"encrypt_id":"txskm8f"},{"Hash":"458E9B9F362277AC37E9EEF1CB80B535","FileName":"\u738b\u742a - \u4e07\u7231\u5343\u6069","timeLen":322.011,"privilege":10,"size":5152644,"album_id":18712576,"encrypt_id":"vsdz726"},{"Hash":"7E91FDE7E8D33E8ED11C6DB4620917E2","FileName":"\u5b64\u72ec\u8bd7\u4eba - \u6e21\u6211\u4e0d\u6e21\u5979","timeLen":182.23,"privilege":10,"size":2916145,"album_id":14624971,"encrypt_id":"th6cka5"},{"Hash":"9681F4CCD830B8436DB5F8218C7DF0C7","FileName":"\u864e\u4e8c - \u4f60\u4e00\u5b9a\u8981\u5e78\u798f","timeLen":259.066,"privilege":10,"size":4155201,"album_id":12249679,"encrypt_id":"rniv71f"},{"Hash":"44ABEAA9CCE29AFB5C947D4FBD2C567F","FileName":"\u5927\u58ee - \u4f2a\u88c5","timeLen":301.004,"privilege":10,"size":4817151,"album_id":15999493,"encrypt_id":"u6n6i28"},{"Hash":"5FCE4CBCB96D6025033BCE2025FC3943","FileName":"\u5468\u6770\u4f26 - \u544a\u767d\u6c14\u7403","timeLen":215,"privilege":10,"size":3443771,"album_id":1645030,"encrypt_id":"d5c5m23"},{"Hash":"0A62227CAAB66F54D43EC084B4BDD81F","FileName":"\u5468\u6770\u4f26 - \u7a3b\u9999","timeLen":223.582,"privilege":10,"size":3577344,"album_id":960399,"encrypt_id":"74itc7"},{"Hash":"A11F7A8BD2EA5BBDB32F58A9081F27B4","FileName":"\u82b1\u59d0 - \u72c2\u6d6a","timeLen":181.037,"privilege":10,"size":2902317,"album_id":13476703,"encrypt_id":"sfzob9f"},{"Hash":"33EB8FE0DC9F70D9F7FE4CB77305D5A8","FileName":"\u6d77\u6765\u963f\u6728\u3001\u963f\u5477\u62c9\u53e4\u3001\u66f2\u6bd4\u963f\u4e14 - 
\u522b\u77e5\u5df1","timeLen":280.111,"privilege":10,"size":4482365,"album_id":16324799,"encrypt_id":"uajki71"},{"Hash":"76D04F195C1F081CC0CD027A310A7D9A","FileName":"\u738b\u742a - \u7ad9\u7740\u7b49\u4f60\u4e09\u5343\u5e74","timeLen":381.083,"privilege":10,"size":6109771,"album_id":13886090,"encrypt_id":"sunkg88"},{"Hash":"9C00A468D2658487DB2DE4ED16A12B5A","FileName":"\u738b\u8d30\u6d6a - \u50cf\u9c7c","timeLen":285.031,"privilege":10,"size":4565459,"album_id":13621986,"encrypt_id":"smhia84"},{"Hash":"4F76587A5B0B93EEF15883E54DD3E2DB","FileName":"\u6bdb\u4e0d\u6613 - \u6d88\u6101 (Live)","timeLen":179,"privilege":10,"size":2870658,"album_id":2900867,"encrypt_id":"gf96d56"},{"Hash":"8B7DF540F77042FB76DA1EE3A79EAE0A","FileName":"NCF-\u827e\u529b - \u9ece\u660e\u524d\u7684\u9ed1\u6697 (\u5973\u58f0\u7248)","timeLen":145.058,"privilege":10,"size":2329748,"album_id":17997426,"encrypt_id":"twhgf05"},{"Hash":"7A3269C36D07E88A24FB35D246856FA4","FileName":"Yusee\u897f - \u5fc3\u5982\u6b62\u6c34","timeLen":182.883,"privilege":10,"size":2926594,"album_id":19692772,"encrypt_id":"wd07h77"},{"Hash":"7995A2173ED0914868BB860F93C3D642","FileName":"\u9b4f\u65b0\u96e8 - \u4f59\u60c5\u672a\u4e86","timeLen":216.189,"privilege":10,"size":3459539,"album_id":20709823,"encrypt_id":"wnru4c8"},{"Hash":"D8E40DA7F51C0486224E008A3B6ABD45","FileName":"\u5154\u5b50\u7259 - \u5c0f\u767d\u5154\u9047\u4e0a\u5361\u5e03\u5947\u8bfa","timeLen":163.087,"privilege":10,"size":2622454,"album_id":12492325,"encrypt_id":"rrrbccf"},{"Hash":"D2462B148305FF7D990F3B6EB3F90D66","FileName":"\u5f20\u656c\u8f69 - \u53ea\u662f\u592a\u7231\u4f60","timeLen":254.302,"privilege":10,"size":4080941,"album_id":558311,"encrypt_id":"3f65bd"},{"Hash":"03FE01457005CEEF8627BE5E5313D230","FileName":"\u84dd\u4e03\u4e03 - \u9ece\u660e\u524d\u7684\u9ed1\u6697 
(\u5973\u58f0\u7248)","timeLen":111.986,"privilege":10,"size":1792253,"album_id":19842582,"encrypt_id":"w8lwi96"},{"Hash":"96E064A41AB84EBE4C03C6AAE3CB9334","FileName":"\u5f20\u7d2b\u8c6a - \u53ef\u4e0d\u53ef\u4ee5","timeLen":240.093,"privilege":10,"size":3855453,"album_id":9618875,"encrypt_id":"mkt6v7f"},{"Hash":"5D6CCE061BD65404BF5669FDD26C40B1","FileName":"\u4e01\u8299\u59ae - \u53ea\u662f\u592a\u7231\u4f60","timeLen":247.797,"privilege":10,"size":3965342,"album_id":18231730,"encrypt_id":"vhrxi30"},{"Hash":"95B48A0894FC2198B6E2B93C034AAC72","FileName":"\u5468\u6770\u4f26 - \u9752\u82b1\u74f7","timeLen":239.046,"privilege":10,"size":3825206,"album_id":979856,"encrypt_id":"7a6sd6"}];
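Once extracted, this information goes into the ehcache cache, with hash as the key and album_id as the value. When looping over the individual songs, the hash is also available from the play page link, so the album_id can simply be looked up in the cache by hash.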
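5. With the information above, the files can be crawled normally. After crawling for a while, however, downloads start to fail: the log shows that the real MP3 address can no longer be obtained, and the error_code in the returned JSON is not 0. This means the site has identified the crawler, and this is where proxy IPs come in. Whenever the crawler is identified, switch to another proxy IP, and keep doing so until all songs have been processed or the proxy IPs run out.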
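The rotation logic in item 5 can be illustrated with a small, library-free sketch. Everything here is hypothetical: the ProxyRotationSketch class, the Reply type and the fetcher callback are stand-ins for the real HTTP and JSON code, and the proxies are plain "host:port" strings. The point is only the control flow: retry the same request through the next unused proxy whenever the error code is not 0, and give up when no proxies are left.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

// Hypothetical sketch of "switch proxy when identified, stop when proxies run out".
class ProxyRotationSketch {

    /** Simplified view of the JSON reply: an error code plus the mp3 address. */
    static final class Reply {
        final int errorCode;
        final String playUrl;

        Reply(int errorCode, String playUrl) {
            this.errorCode = errorCode;
            this.playUrl = playUrl;
        }
    }

    /**
     * Calls fetcher(requestUrl, proxy) until it returns error code 0.
     * The first attempt uses no proxy (null); each failure consumes one proxy.
     * Returns empty when every proxy has been tried without success.
     */
    static Optional<String> fetchWithRotation(String requestUrl,
                                              List<String> proxies,
                                              BiFunction<String, String, Reply> fetcher) {
        Deque<String> unused = new ArrayDeque<>(proxies);
        String proxy = null; // null means a direct connection
        while (true) {
            Reply reply = fetcher.apply(requestUrl, proxy);
            if (reply.errorCode == 0) {
                return Optional.ofNullable(reply.playUrl);
            }
            if (unused.isEmpty()) {
                return Optional.empty(); // proxies exhausted
            }
            proxy = unused.poll(); // identified by the site: switch to the next proxy
        }
    }

    public static void main(String[] args) {
        // Fake fetcher for demonstration: only the second proxy "works".
        BiFunction<String, String, Reply> fake = (url, proxy) ->
                "10.0.0.2:8080".equals(proxy) ? new Reply(0, "http://example.invalid/song.mp3")
                                              : new Reply(1, null);
        Optional<String> result = fetchWithRotation("http://example.invalid/getdata",
                Arrays.asList("10.0.0.1:8080", "10.0.0.2:8080"), fake);
        System.out.println(result.orElse("no usable proxy"));
    }
}
```

Passing the fetcher in as a callback keeps the loop independent of HttpClient, which also makes it easy to exercise with a fake fetcher as in main.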
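II. Core code
With the approach settled, we can write the code. For reasons of space, only part of the core code is shown here; the complete code is available on gitee, linked below.
Code structure:
- Required dependencies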
<!-- httpclient: fetches the html -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.8</version>
</dependency>
<!-- Jsoup: parses the html -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<!-- used to download the songs, so we don't have to write the stream handling ourselves -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>
<!-- fastjson: handles json -->
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.58</version>
</dependency>
<!-- ehcache: used as the cache -->
<dependency>
    <groupId>net.sf.ehcache</groupId>
    <artifactId>ehcache</artifactId>
    <version>2.10.6</version>
</dependency>
<!-- slf4j-nop is included purely to stop ehcache from reporting errors at runtime -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-nop</artifactId>
    <version>1.7.2</version>
</dependency>
<!-- log4j: the logging system -->
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
- Main class
package com.sam.kugou.main;

import java.util.List;

import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.alibaba.fastjson.JSONObject;
import com.sam.kugou.utils.DownLoadMusic;
import com.sam.kugou.utils.EhcacheUtil;
import com.sam.kugou.utils.HttpClientUtil;

public class KugouSpiderMain {

    static final Logger logger = Logger.getLogger(KugouSpiderMain.class);

    static String URL_TEMP = "https://www.kugou.com/yy/rank/home/PAGE_NUM-8888.html?from=homepage";
    public static final int SLEEP_TIME_WHEN_DENY = 1000 * 60 * 60; // how long to sleep after the site identifies the crawler
    public static final int SPIDER_DURING = 1; // interval before crawling the next song, in ms
    public static final String DIR_NAME = "E:\\personal\\音樂\\酷狗\\"; // music download directory

    public static void main(String[] args) {
        // Kugou TOP500 pages
        try {
            for (int i = 1; i <= 23; i++) {
                String url = URL_TEMP.replace("PAGE_NUM", String.valueOf(i));

                // 1. Request the song list page
                logger.info(url);
                String html = HttpClientUtil.getHtml(url);
                logger.debug(html);

                // 2. Extract the hash and album id of every song on this page and cache them
                int beginIdx = html.indexOf("global.features = ");
                int endIdx = html.indexOf("];", beginIdx);
                String features = html.substring(beginIdx, endIdx + 1).replace("global.features = ", "");
                logger.info("containingOwnText >>>>>> " + features);
                List<JSONObject> list = JSONObject.parseArray(features, JSONObject.class);
                for (JSONObject jsonObject : list) {
                    String hash = jsonObject.getString("Hash");
                    Integer albumId = jsonObject.getInteger("album_id");
                    EhcacheUtil.setCache(hash, albumId);
                }

                // 3. Parse the list content
                Document doc = Jsoup.parse(html);
                Elements songList = doc.select(".pc_temp_songlist ul li a");
                for (Element element : songList) {
                    String title = element.attr("title");
                    String href = element.attr("href");
                    if (href.contains("https")) {
                        try {
                            Thread.sleep(SPIDER_DURING);
                        } catch (InterruptedException e) {
                            logger.error(e.getMessage());
                        }
                        logger.info("title " + title + " >>> href " + href);
                        DownLoadMusic.requestMusic(title, href);
                    }
                }
            }
        } catch (Exception ex) {
            logger.error(ex.getMessage(), ex);
        } finally {
            // 4. Shut down the cache manager
            EhcacheUtil.shutDownManager();
        }
    }
}
- Getting the real address
- Performing the download
public static void downLoad(String title, String url) {
    if (url == null || url.equals("")) {
        return;
    }
    // Songs that have already been downloaded are not downloaded again
    Element finishedCache = EhcacheUtil.getFinishedCache(title);
    logger.debug("finishedCache >>>>> " + finishedCache);
    if (finishedCache != null) {
        logger.info("Song already exists!!!");
        return;
    }
    String suffix = url.substring(url.lastIndexOf("."));
    try {
        HttpEntity httpEntity = HttpClientUtil.getHttpEntity(url);
        String filePath = KugouSpiderMain.DIR_NAME + title + suffix;
        // FileUtils.copyToFile leaves the source stream open, so close it here
        try (InputStream inputStream = httpEntity.getContent()) {
            FileUtils.copyToFile(inputStream, new File(filePath));
        }
        logger.info("***Download finished:***" + title + suffix);
        logger.info("***Total number of songs:***" + (new File(KugouSpiderMain.DIR_NAME)).list().length);
        EhcacheUtil.setFinishedCache(url, title);
    } catch (IOException e) {
        logger.error(e.getMessage());
    }
}
- Setting the proxy IP
public static boolean setProxy() {
    // Create an HttpClient
    CloseableHttpClient httpClient = HttpClients.createDefault();
    CloseableHttpResponse response = null;
    String url = "https://raw.githubusercontent.com/fate0/proxylist/master/proxy.list";
    boolean switched = false; // becomes true only once an unused, reachable proxy is found
    try {
        response = doRequest(httpClient, url);
        logger.debug("getHtml " + url + "**result:**" + response.getStatusLine());
        // Check the response status: 200 means success
        if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) {
            HttpEntity httpEntity = response.getEntity();
            String html = EntityUtils.toString(httpEntity, "utf-8");
            html = "[" + html + "]";
            List<JSONObject> list = JSONArray.parseArray(html, JSONObject.class);
            for (JSONObject jsonObject : list) {
                int port = Integer.parseInt(jsonObject.get("port").toString());
                String host = jsonObject.get("host").toString();
                logger.info(host + ":" + port);
                if (isHostConnectable(host, port)) { // the proxy IP is reachable
                    Element ipsCache = EhcacheUtil.getProxyIpsCache(host, port);
                    if (ipsCache == null) { // the proxy IP has not been used before
                        proxyIp = host;
                        proxyPort = port;
                        EhcacheUtil.setProxyIpsCache(host, port);
                        switched = true;
                        break;
                    } else {
                        logger.info("This proxy IP has already been used, switching to the next one");
                    }
                }
            }
        }
    } catch (Exception e) {
        logger.error(e.getMessage(), e);
        return false;
    } finally {
        // Release resources
        HttpClientUtils.closeQuietly(response);
        HttpClientUtils.closeQuietly(httpClient);
    }
    if (switched) {
        logger.info("Switched proxy IP successfully:>>>" + proxyIp + ":" + proxyPort);
    }
    return switched;
}
III. Source code download
The source code has been uploaded to my gitee:
https://gitee.com/sam-uncle/kugou-spider
Feel free to download it~~
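IV. Known issues
1. Only free songs can be fetched. Paid songs cannot be fetched, and really we shouldn't fetch them anyway.
2. For convenience the code uses a lot of static members, so it cannot support multithreaded or concurrent crawling.
3. The proxy IP handling could still be optimized.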
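As one possible direction for issue 2, the sketch below shows how downloads could be spread over a fixed thread pool once the shared static state is removed. It is not part of the repository: ConcurrentDownloadSketch and its downloader callback are hypothetical, and the callback stands in for a download routine that keeps no static fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical sketch: run one download task per song on a fixed-size pool.
class ConcurrentDownloadSketch {

    /** Runs downloader on every URL using the given number of threads; returns how many succeeded. */
    static int downloadAll(List<String> songUrls, int threads, Predicate<String> downloader) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (String url : songUrls) {
                Callable<Boolean> task = () -> downloader.test(url);
                futures.add(pool.submit(task));
            }
            int succeeded = 0;
            for (Future<Boolean> future : futures) {
                try {
                    if (future.get()) {
                        succeeded++;
                    }
                } catch (ExecutionException e) {
                    // The task threw an exception: count it as a failed download.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                    break;
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Fake downloader for demonstration: pretends every URL ending in ".mp3" succeeds.
        int ok = downloadAll(Arrays.asList("a.mp3", "b.mp3", "c.txt"), 2, url -> url.endsWith(".mp3"));
        System.out.println("succeeded: " + ok);
    }
}
```

A small pool size is the safer choice here, since more parallel requests would presumably make the crawler easier for the site to identify.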
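Disclaimer:
This crawler and the content it fetches are for personal study and exchange only. Do not use them for commercial purposes; otherwise you bear the consequences yourself!!! Thank you.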