Implementing a NetEase News Web Crawler with Jsoup + HtmlUnit

Posted by Liam_Fang_ on 2019-01-14

Original URL: https://blog.csdn.net/weixin_39912556/article/details/86481402

1. First, why use HtmlUnit at all? Isn't Jsoup alone enough?

With Jsoup alone, the code to fetch a page looks like this, which is about as simple as it gets.

Document docu1=Jsoup.connect(url).get();

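The code above can only crawl static pages. On a dynamic page, where the content is generated by JavaScript after the page loads, the content you want is missing from the result. That is why I used HtmlUnit.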
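To make the limitation described above concrete, here is a minimal sketch. It assumes a made-up page whose list is injected by an inline script; the markup, the class name `StaticParseDemo`, and the helper `countByClass` are illustrative and not from the original post. Jsoup parses the markup but never runs the script, so the injected element is not in the resulting document.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticParseDemo {
    // Counts the elements carrying the given class in markup parsed by Jsoup alone.
    static int countByClass(String html, String className) {
        Document doc = Jsoup.parse(html);
        return doc.getElementsByClass(className).size();
    }

    public static void main(String[] args) {
        // Hypothetical page: the headline exists only inside a script string,
        // so it would appear in the DOM only after a browser runs the script.
        String dynamicHtml = "<html><body><div id=\"list\"></div>"
                + "<script>document.getElementById('list').innerHTML ="
                + " '<p class=\"hot_text\">headline</p>';</script></body></html>";
        // The same headline written directly into the markup.
        String staticHtml = "<html><body><p class=\"hot_text\">headline</p></body></html>";

        System.out.println("dynamic page: " + countByClass(dynamicHtml, "hot_text"));
        System.out.println("static page: " + countByClass(staticHtml, "hot_text"));
    }
}
```

The script body is kept as raw text by the parser, so only the static version yields a matching element. A tool that actually executes JavaScript is needed to see the injected content.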

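The HtmlUnit-based code is as follows. The method getHtmlFromUrl(String url) returns a Document object, from which the desired content can then be extracted with Jsoup's usual methods.

For a more detailed explanation, see the reference article.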

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitUtil {
    // Loads the page in a headless HtmlUnit browser, lets its JavaScript run,
    // and returns the rendered markup as a Jsoup Document.
    public static Document getHtmlFromUrl(String url) throws Exception {
        WebClient webClient = new WebClient();
        webClient.getOptions().setJavaScriptEnabled(true);   // execute the page's scripts
        webClient.getOptions().setCssEnabled(false);         // CSS is not needed for extraction
        webClient.getOptions().setActiveXNative(false);
        // Keep going on script errors and on non-2xx status codes.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(10000);            // timeout in milliseconds
        try {
            HtmlPage htmlPage = webClient.getPage(url);
            // Wait up to 10 seconds for background JavaScript (e.g. Ajax calls) to finish.
            webClient.waitForBackgroundJavaScript(10000);
            return Jsoup.parse(htmlPage.asXml());
        } finally {
            webClient.close();
        }
    }
}

In the code below, each url in ls is the address of a page you want to crawl.

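import java.util.List;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The original post shows only the loop; this wrapper class and method are illustrative.
public class NewsCrawler {
    // ls is the list of page URLs to crawl; returns the number of news items printed.
    public static int crawl(List<String> ls) {
        int count = 0;
        for (String url : ls) {
            try {
                Document docu1 = HtmlUnitUtil.getHtmlFromUrl(url);
                Elements lis = docu1.getElementsByClass("hot_text");
                // Name of the module (section) being crawled
                Elements first_span = docu1.select("#list_wrap > div.list_content > div.area.baby_list_title > h2 > a");
                String moduleName = first_span.isEmpty() ? "" : first_span.get(0).text();
                for (Element e : lis) {
                    Elements links = e.getElementsByTag("a");
                    if (links.isEmpty()) {
                        continue; // this entry contains no link
                    }
                    Element e_a = links.get(0);
                    // News title
                    String title = e_a.text();
                    String newsUrl = e_a.attr("href");
                    // The hrefs are protocol-relative ("//host/path"), so add the scheme.
                    if (newsUrl.startsWith("//")) {
                        newsUrl = "http:" + newsUrl;
                    }
                    count++;
                    System.out.println(title + "(" + moduleName + "):" + newsUrl);
                }
            } catch (Exception e1) {
                e1.printStackTrace();
            }
        }
        return count;
    }
}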
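The loop above handles protocol-relative links by prepending `http:` to the `href`. As a possible alternative, the sketch below resolves each `href` against the URL of the page it came from using the standard `java.net.URI` class, which also covers absolute and path-relative links. The class name `HrefResolver`, the helper `toAbsolute`, and the `news.example.com` URLs are placeholders for illustration and not part of the original code.

```java
import java.net.URI;

public class HrefResolver {
    // Resolves an href against the URL of the page it was found on.
    // URI.create throws IllegalArgumentException for malformed input (e.g. unescaped spaces).
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://news.example.com/list.html"; // placeholder page URL

        // Protocol-relative link: the scheme is taken from the page URL.
        System.out.println(toAbsolute(page, "//news.example.com/a/1.html"));
        // Path-relative link: resolved against the page's directory.
        System.out.println(toAbsolute(page, "2.html"));
        // Already absolute: returned unchanged.
        System.out.println(toAbsolute(page, "http://other.example.com/3.html"));
    }
}
```

With this approach the scheme follows the page being crawled, so the same code would keep working if the site were fetched over https.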

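The code above crawls the following content: the title, module name, and URL of each news item.

The Maven dependencies are as follows (Jsoup itself is not listed here and is also required):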

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.18</version>
</dependency>

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit-core-js</artifactId>
    <version>2.9</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>commons-logging</groupId>
    <artifactId>commons-logging-api</artifactId>
    <version>1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-collections/commons-collections -->
<dependency>
    <groupId>commons-collections</groupId>
    <artifactId>commons-collections</artifactId>
    <version>3.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.5</version>
</dependency>

Give it a try if you are interested.

Reference article: https://blog.csdn.net/gx304419380/article/details/80619043
