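1 Overview
Introduction
Jsoup is a Java-based HTML parser. It provides a simple, flexible, and easy-to-use API for parsing HTML documents from a URL, a file, or a string. It helps developers extract data from HTML documents, manipulate DOM elements, and handle form submissions.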
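Key features
The main features of Jsoup are:
- Easy to use: Jsoup offers a small set of straightforward APIs that make parsing HTML easy. Developers can pick DOM elements with a jQuery-like selector syntax and extract the data they need.
- Robust HTML handling: Jsoup follows the HTML5 parsing rules and copes with incomplete or malformed HTML. It repairs errors such as unclosed tags during parsing and produces a well-formed DOM tree, much as a browser would.
- Safety: Jsoup includes an HTML sanitizer (Jsoup.clean together with a Safelist) that strips tags and attributes not on an allow-list, which helps guard against XSS when handling untrusted HTML. Parsing alone does not filter anything; the sanitizer has to be called explicitly.
- CSS selector support: Jsoup selects DOM elements with CSS selectors, so developers can locate and manipulate elements in an HTML document flexibly.
- Java integration: Jsoup is written in Java and integrates directly with Java programs. Developers can process the parsed data with the usual Java language features and libraries.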
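As a rough illustration of the selector syntax and the tolerant parsing described above, the following sketch extracts link text from deliberately malformed HTML. The class name SelectorSketch, the method newsTitles, and the sample markup are made up for this example; only Jsoup.parse, select, and text come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectorSketch {

    /** Returns the text of every link found inside a div with class "news". */
    public static List<String> newsTitles(String html) {
        // The parser repairs unclosed tags while building the DOM tree.
        Document doc = Jsoup.parse(html);
        // CSS selector: any <a> with an href attribute below div.news
        Elements links = doc.select("div.news a[href]");
        List<String> titles = new ArrayList<>();
        for (Element link : links) {
            titles.add(link.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hypothetical input: the <p> and the <div> are never closed.
        String html = "<div class=news><a href='/a'>First</a><p><a href='/b'>Second</a>";
        System.out.println(newsTitles(html));
    }
}
```

The selector string is the only part that changes between extraction tasks, which is why the jQuery comparison is often made.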
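Use cases
Typical uses of Jsoup in big-data and cloud-computing work include, but are not limited to:
- Web scraping: Jsoup helps developers extract the data they need from web pages, for example when crawling news or product information. Parsing the HTML document gives quick and accurate access to that data.
- Data cleaning and processing: cloud workloads often involve large volumes of data that must be cleaned and processed. Jsoup can parse the HTML, pull out the relevant data, and hand it on for further processing and analysis.
- Page content analysis: Jsoup helps analyze page content, for example extracting keywords or counting how often each tag occurs. This is useful for search engine optimization and web analytics.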
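The tag-counting idea mentioned in the last item can be sketched as follows. This is a minimal example under assumptions: the class name TagCountSketch and the method countTags are hypothetical, and counting is limited to the body element and its descendants.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.Map;
import java.util.TreeMap;

public class TagCountSketch {

    /** Counts how often each tag name occurs in the body of the given HTML. */
    public static Map<String, Integer> countTags(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, Integer> counts = new TreeMap<>();
        // getAllElements() returns the body element itself plus all descendants.
        for (Element el : doc.body().getAllElements()) {
            counts.merge(el.tagName(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup
        System.out.println(countTags("<p>a</p><p>b</p><div><span>c</span></div>"));
    }
}
```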
Alternatives
Tools that crawlers use to parse HTML documents include:
- [java] Jsoup
- https://github.com/jhy/jsoup
- https://jsoup.org/
- https://mvnrepository.com/artifact/org.jsoup/jsoup/1.12.2
- [python] Beautiful Soup
- https://www.crummy.com/software/BeautifulSoup/
- https://github.com/DeronW/beautifulsoup/tree/v4.4.0
- https://beautifulsoup.readthedocs.io/
- https://beautifulsoup.readthedocs.io/zh-cn/v4.4.0/
2 Usage guide
- This chapter is based on version 1.14.3.
Adding the dependency
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<!-- 1.12.2 / 1.14.3 / 1.17.2 -->
<version>1.14.3</version>
</dependency>
Core API
org.jsoup.Jsoup
package org.jsoup;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.annotation.Nullable;
import org.jsoup.helper.DataUtil;
import org.jsoup.helper.HttpConnection;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist;
import org.jsoup.safety.Whitelist;
public class Jsoup {
private Jsoup() {
}
public static Document parse(String html, String baseUri) {
return Parser.parse(html, baseUri);
}
public static Document parse(String html, String baseUri, Parser parser) {
return parser.parseInput(html, baseUri);
}
public static Document parse(String html, Parser parser) {
return parser.parseInput(html, "");
}
public static Document parse(String html) {
return Parser.parse(html, "");
}
public static Connection connect(String url) {
return HttpConnection.connect(url);
}
public static Connection newSession() {
return new HttpConnection();
}
public static Document parse(File file, @Nullable String charsetName, String baseUri) throws IOException {
return DataUtil.load(file, charsetName, baseUri);
}
public static Document parse(File file, @Nullable String charsetName) throws IOException {
return DataUtil.load(file, charsetName, file.getAbsolutePath());
}
public static Document parse(File file, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
return DataUtil.load(file, charsetName, baseUri, parser);
}
public static Document parse(InputStream in, @Nullable String charsetName, String baseUri) throws IOException {
return DataUtil.load(in, charsetName, baseUri);
}
public static Document parse(InputStream in, @Nullable String charsetName, String baseUri, Parser parser) throws IOException {
return DataUtil.load(in, charsetName, baseUri, parser);
}
public static Document parseBodyFragment(String bodyHtml, String baseUri) {
return Parser.parseBodyFragment(bodyHtml, baseUri);
}
public static Document parseBodyFragment(String bodyHtml) {
return Parser.parseBodyFragment(bodyHtml, "");
}
public static Document parse(URL url, int timeoutMillis) throws IOException {
Connection con = HttpConnection.connect(url);
con.timeout(timeoutMillis);
return con.get();
}
public static String clean(String bodyHtml, String baseUri, Safelist safelist) {
Document dirty = parseBodyFragment(bodyHtml, baseUri);
Cleaner cleaner = new Cleaner(safelist);
Document clean = cleaner.clean(dirty);
return clean.body().html();
}
/** @deprecated */
@Deprecated
public static String clean(String bodyHtml, String baseUri, Whitelist safelist) {
return clean(bodyHtml, baseUri, (Safelist)safelist);
}
public static String clean(String bodyHtml, Safelist safelist) {
return clean(bodyHtml, "", safelist);
}
/** @deprecated */
@Deprecated
public static String clean(String bodyHtml, Whitelist safelist) {
return clean(bodyHtml, (Safelist)safelist);
}
public static String clean(String bodyHtml, String baseUri, Safelist safelist, Document.OutputSettings outputSettings) {
Document dirty = parseBodyFragment(bodyHtml, baseUri);
Cleaner cleaner = new Cleaner(safelist);
Document clean = cleaner.clean(dirty);
clean.outputSettings(outputSettings);
return clean.body().html();
}
/** @deprecated */
@Deprecated
public static String clean(String bodyHtml, String baseUri, Whitelist safelist, Document.OutputSettings outputSettings) {
return clean(bodyHtml, baseUri, (Safelist)safelist, outputSettings);
}
public static boolean isValid(String bodyHtml, Safelist safelist) {
return (new Cleaner(safelist)).isValidBodyHtml(bodyHtml);
}
/** @deprecated */
@Deprecated
public static boolean isValid(String bodyHtml, Whitelist safelist) {
return isValid(bodyHtml, (Safelist)safelist);
}
}
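The clean and isValid methods in the listing above are the entry points to the sanitizer. The sketch below shows one way they might be called with the built-in Safelist.basic() allow-list; the class name CleanSketch, the helper names sanitize and isSafe, and the sample input are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanSketch {

    /** Returns the input with everything outside Safelist.basic() stripped. */
    public static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    /** Reports whether the input already conforms to Safelist.basic(). */
    public static boolean isSafe(String untrustedHtml) {
        return Jsoup.isValid(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical untrusted input with a script tag and an event handler
        String dirty = "<p>Hi<script>alert(1)</script>"
                + "<a href='http://example.com/' onclick='steal()'>link</a></p>";
        System.out.println(sanitize(dirty)); // script element and onclick attribute are dropped
        System.out.println(isSafe(dirty));
    }
}
```

Safelist replaces the deprecated Whitelist class, which is why the listing carries both overloads.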
org.jsoup.nodes.Node
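Key API
- Methods for traversing the DOM tree (most of these are defined on Element, which extends Node)
- Find an element by id: getElementById(String id)
- Find elements by tag: getElementsByTag(String tag)
- Find elements by class: getElementsByClass(String className)
- Find elements by attribute: getElementsByAttribute(String key)
- Sibling traversal: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
- Moving between levels: parent(), children(), child(int index)
These methods return Element or Elements objects, which offer the following accessors:
- attr(String key): returns the value of one attribute
- attributes(): returns all attributes of the node
- id(): returns the node's id
- className(): returns the raw value of the class attribute, which may contain several space-separated names
- classNames(): returns all class names of the node as a set
- text(): returns the combined text of the element and its descendants
- html(): returns the node's inner HTML
- outerHtml(): returns the node's outer HTML
- data(): returns the node's data content, used for script or style tags and similar
- tag(): returns the Tag object
- tagName(): returns the node's tag name
With these APIs, working with the DOM is about as convenient as it is with jQuery.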
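A short sketch of how the lookup, traversal, and accessor methods above might be combined. The class name AccessorSketch, the method describe, and the sample markup are hypothetical, and the code assumes the looked-up element has at least two child elements and contains a link.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class AccessorSketch {

    /** Collects a few properties of the element with the given id. */
    public static Map<String, String> describe(String html, String id) {
        Document doc = Jsoup.parse(html);
        Map<String, String> info = new LinkedHashMap<>();
        Element el = doc.getElementById(id);
        if (el == null) {
            return info; // no element with that id
        }
        // Assumption: el has at least two child elements and contains an <a>.
        Element first = el.child(0);
        info.put("tag", el.tagName());
        info.put("class", el.className());
        info.put("firstChildText", first.text());
        info.put("nextSiblingTag", first.nextElementSibling().tagName());
        info.put("firstLinkHref", el.getElementsByTag("a").first().attr("href"));
        return info;
    }

    public static void main(String[] args) {
        String html = "<div id=main class='box big'>"
                + "<p>Hello <b>world</b></p><a href='/x'>go</a></div>";
        System.out.println(describe(html, "main"));
    }
}
```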
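- Jsoup also supports modifying the DOM tree:
- text(String value): sets the text content
- html(String value): replaces the inner HTML
- append(String html): parses the HTML and adds it as the last children of the element
- prepend(String html): parses the HTML and adds it as the first children of the element
- appendText(String text), prependText(String text): add a text node at the end or the start of the element's children
- appendElement(String tagName), prependElement(String tagName): create a new child element at the end or the start and return it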
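The modification methods can be tried out on a small fragment. This is a minimal sketch; the class name ModifySketch, the method buildList, and the list markup are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.List;

public class ModifySketch {

    /** Starts from a one-item list and edits it in place. */
    public static List<String> buildList() {
        Document doc = Jsoup.parseBodyFragment("<ul id=list><li>b</li></ul>");
        Element list = doc.getElementById("list");

        list.prepend("<li>a</li>");          // becomes the first child of <ul>
        list.appendElement("li").text("c");  // new last child, then set its text
        list.child(1).appendText("!");       // the original item now reads "b!"

        // Text of each <li>, in document order
        return list.select("li").eachText();
    }

    public static void main(String[] args) {
        System.out.println(buildList()); // [a, b!, c]
    }
}
```

Note that append and prepend work on the element's children, so the new nodes end up inside the element rather than next to it; before and after on Node insert siblings instead.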
Source code
package org.jsoup.nodes;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import javax.annotation.Nullable;
import org.jsoup.SerializationException;
import org.jsoup.helper.Validate;
import org.jsoup.internal.StringUtil;
import org.jsoup.select.NodeFilter;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;
public abstract class Node implements Cloneable {
static final List<Node> EmptyNodes = Collections.emptyList();
static final String EmptyString = "";
@Nullable
Node parentNode;
int siblingIndex;
protected Node() {
}
public abstract String nodeName();
protected abstract boolean hasAttributes();
public boolean hasParent() {
return this.parentNode != null;
}
public String attr(String attributeKey) {
...
}
public abstract Attributes attributes();
public int attributesSize() {
return this.hasAttributes() ? this.attributes().size() : 0;
}
public Node attr(String attributeKey, String attributeValue) {
attributeKey = NodeUtils.parser(this).settings().normalizeAttribute(attributeKey);
this.attributes().putIgnoreCase(attributeKey, attributeValue);
return this;
}
public boolean hasAttr(String attributeKey) {
Validate.notNull(attributeKey);
if (!this.hasAttributes()) {
return false;
} else {
if (attributeKey.startsWith("abs:")) {
String key = attributeKey.substring("abs:".length());
if (this.attributes().hasKeyIgnoreCase(key) && !this.absUrl(key).isEmpty()) {
return true;
}
}
return this.attributes().hasKeyIgnoreCase(attributeKey);
}
}
public Node removeAttr(String attributeKey) {
Validate.notNull(attributeKey);
if (this.hasAttributes()) {
this.attributes().removeIgnoreCase(attributeKey);
}
return this;
}
public Node clearAttributes() {
if (this.hasAttributes()) {
Iterator<Attribute> it = this.attributes().iterator();
while(it.hasNext()) {
it.next();
it.remove();
}
}
return this;
}
public abstract String baseUri();
protected abstract void doSetBaseUri(String var1);
public void setBaseUri(String baseUri) {
Validate.notNull(baseUri);
this.doSetBaseUri(baseUri);
}
public String absUrl(String attributeKey) {
Validate.notEmpty(attributeKey);
return this.hasAttributes() && this.attributes().hasKeyIgnoreCase(attributeKey) ? StringUtil.resolve(this.baseUri(), this.attributes().getIgnoreCase(attributeKey)) : "";
}
protected abstract List<Node> ensureChildNodes();
public Node childNode(int index) {
return (Node)this.ensureChildNodes().get(index);
}
public List<Node> childNodes() {
if (this.childNodeSize() == 0) {
return EmptyNodes;
} else {
List<Node> children = this.ensureChildNodes();
List<Node> rewrap = new ArrayList(children.size());
rewrap.addAll(children);
return Collections.unmodifiableList(rewrap);
}
}
public List<Node> childNodesCopy() {
List<Node> nodes = this.ensureChildNodes();
ArrayList<Node> children = new ArrayList(nodes.size());
Iterator var3 = nodes.iterator();
while(var3.hasNext()) {
Node node = (Node)var3.next();
children.add(node.clone());
}
return children;
}
public abstract int childNodeSize();
protected Node[] childNodesAsArray() {
return (Node[])this.ensureChildNodes().toArray(new Node[0]);
}
public abstract Node empty();
@Nullable
public Node parent() {
return this.parentNode;
}
@Nullable
public final Node parentNode() {
return this.parentNode;
}
public Node root() {
Node node;
for(node = this; node.parentNode != null; node = node.parentNode) {
}
return node;
}
@Nullable
public Document ownerDocument() {
Node root = this.root();
return root instanceof Document ? (Document)root : null;
}
public void remove() {
Validate.notNull(this.parentNode);
this.parentNode.removeChild(this);
}
public Node before(String html) {
this.addSiblingHtml(this.siblingIndex, html);
return this;
}
public Node before(Node node) {
Validate.notNull(node);
Validate.notNull(this.parentNode);
this.parentNode.addChildren(this.siblingIndex, node);
return this;
}
public Node after(String html) {
this.addSiblingHtml(this.siblingIndex + 1, html);
return this;
}
public Node after(Node node) {
Validate.notNull(node);
Validate.notNull(this.parentNode);
this.parentNode.addChildren(this.siblingIndex + 1, node);
return this;
}
private void addSiblingHtml(int index, String html) {
Validate.notNull(html);
Validate.notNull(this.parentNode);
Element context = this.parent() instanceof Element ? (Element)this.parent() : null;
List<Node> nodes = NodeUtils.parser(this).parseFragmentInput(html, context, this.baseUri());
this.parentNode.addChildren(index, (Node[])nodes.toArray(new Node[0]));
}
public Node wrap(String html) {
Validate.notEmpty(html);
Element context = this.parentNode != null && this.parentNode instanceof Element ? (Element)this.parentNode : (this instanceof Element ? (Element)this : null);
List<Node> wrapChildren = NodeUtils.parser(this).parseFragmentInput(html, context, this.baseUri());
Node wrapNode = (Node)wrapChildren.get(0);
if (!(wrapNode instanceof Element)) {
return this;
} else {
Element wrap = (Element)wrapNode;
Element deepest = this.getDeepChild(wrap);
if (this.parentNode != null) {
this.parentNode.replaceChild(this, wrap);
}
deepest.addChildren(new Node[]{this});
if (wrapChildren.size() > 0) {
for(int i = 0; i < wrapChildren.size(); ++i) {
Node remainder = (Node)wrapChildren.get(i);
if (wrap != remainder) {
if (remainder.parentNode != null) {
remainder.parentNode.removeChild(remainder);
}
wrap.after(remainder);
}
}
}
return this;
}
}
@Nullable
public Node unwrap() {
Validate.notNull(this.parentNode);
List<Node> childNodes = this.ensureChildNodes();
Node firstChild = childNodes.size() > 0 ? (Node)childNodes.get(0) : null;
this.parentNode.addChildren(this.siblingIndex, this.childNodesAsArray());
this.remove();
return firstChild;
}
private Element getDeepChild(Element el) {
List<Element> children = el.children();
return children.size() > 0 ? this.getDeepChild((Element)children.get(0)) : el;
}
void nodelistChanged() {
}
public void replaceWith(Node in) {
Validate.notNull(in);
Validate.notNull(this.parentNode);
this.parentNode.replaceChild(this, in);
}
protected void setParentNode(Node parentNode) {
Validate.notNull(parentNode);
if (this.parentNode != null) {
this.parentNode.removeChild(this);
}
this.parentNode = parentNode;
}
protected void replaceChild(Node out, Node in) {
Validate.isTrue(out.parentNode == this);
Validate.notNull(in);
if (in.parentNode != null) {
in.parentNode.removeChild(in);
}
int index = out.siblingIndex;
this.ensureChildNodes().set(index, in);
in.parentNode = this;
in.setSiblingIndex(index);
out.parentNode = null;
}
protected void removeChild(Node out) {
Validate.isTrue(out.parentNode == this);
int index = out.siblingIndex;
this.ensureChildNodes().remove(index);
this.reindexChildren(index);
out.parentNode = null;
}
protected void addChildren(Node... children) {
List<Node> nodes = this.ensureChildNodes();
Node[] var3 = children;
int var4 = children.length;
for(int var5 = 0; var5 < var4; ++var5) {
Node child = var3[var5];
this.reparentChild(child);
nodes.add(child);
child.setSiblingIndex(nodes.size() - 1);
}
}
protected void addChildren(int index, Node... children) {
...
}
protected void reparentChild(Node child) {
child.setParentNode(this);
}
private void reindexChildren(int start) {
if (this.childNodeSize() != 0) {
List<Node> childNodes = this.ensureChildNodes();
for(int i = start; i < childNodes.size(); ++i) {
((Node)childNodes.get(i)).setSiblingIndex(i);
}
}
}
public List<Node> siblingNodes() {
if (this.parentNode == null) {
return Collections.emptyList();
} else {
List<Node> nodes = this.parentNode.ensureChildNodes();
List<Node> siblings = new ArrayList(nodes.size() - 1);
Iterator var3 = nodes.iterator();
while(var3.hasNext()) {
Node node = (Node)var3.next();
if (node != this) {
siblings.add(node);
}
}
return siblings;
}
}
@Nullable
public Node nextSibling() {
if (this.parentNode == null) {
return null;
} else {
List<Node> siblings = this.parentNode.ensureChildNodes();
int index = this.siblingIndex + 1;
return siblings.size() > index ? (Node)siblings.get(index) : null;
}
}
@Nullable
public Node previousSibling() {
if (this.parentNode == null) {
return null;
} else {
return this.siblingIndex > 0 ? (Node)this.parentNode.ensureChildNodes().get(this.siblingIndex - 1) : null;
}
}
public int siblingIndex() {
return this.siblingIndex;
}
protected void setSiblingIndex(int siblingIndex) {
this.siblingIndex = siblingIndex;
}
public Node traverse(NodeVisitor nodeVisitor) {
Validate.notNull(nodeVisitor);
NodeTraversor.traverse(nodeVisitor, this);
return this;
}
public Node filter(NodeFilter nodeFilter) {
Validate.notNull(nodeFilter);
NodeTraversor.filter(nodeFilter, this);
return this;
}
public String outerHtml() {
StringBuilder accum = StringUtil.borrowBuilder();
this.outerHtml(accum);
return StringUtil.releaseBuilder(accum);
}
protected void outerHtml(Appendable accum) {
NodeTraversor.traverse(new OuterHtmlVisitor(accum, NodeUtils.outputSettings(this)), this);
}
abstract void outerHtmlHead(Appendable var1, int var2, Document.OutputSettings var3) throws IOException;
abstract void outerHtmlTail(Appendable var1, int var2, Document.OutputSettings var3) throws IOException;
public <T extends Appendable> T html(T appendable) {
this.outerHtml(appendable);
return appendable;
}
public String toString() {
return this.outerHtml();
}
protected void indent(Appendable accum, int depth, Document.OutputSettings out) throws IOException {
accum.append('\n').append(StringUtil.padding(depth * out.indentAmount()));
}
public boolean equals(@Nullable Object o) {
return this == o;
}
public int hashCode() {
return super.hashCode();
}
public boolean hasSameValue(@Nullable Object o) {
if (this == o) {
return true;
} else {
return o != null && this.getClass() == o.getClass() ? this.outerHtml().equals(((Node)o).outerHtml()) : false;
}
}
public Node clone() {
...
}
...
org.jsoup.nodes.Element extends Node
org.jsoup.nodes.Document extends Element
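The hasAttr and absUrl methods in the Node listing give special treatment to the "abs:" attribute prefix, which resolves a relative URL against the node's base URI. A small sketch of that behavior follows; the class name AbsUrlSketch, the method resolve, and the URLs are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsUrlSketch {

    /** Returns {raw href, absUrl("href"), attr("abs:href")} for the first link. */
    public static String[] resolve(String html, String baseUri) {
        // The base URI passed to parse() is what relative links resolve against.
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        if (link == null) {
            return new String[0];
        }
        return new String[] {
            link.attr("href"),      // value exactly as written in the markup
            link.absUrl("href"),    // resolved against the base URI
            link.attr("abs:href")   // shorthand for the same resolution
        };
    }

    public static void main(String[] args) {
        String[] r = resolve("<a href='/docs/page.html'>x</a>", "https://example.com/base/");
        for (String s : r) {
            System.out.println(s);
        }
    }
}
```

When no base URI is available (for example after Jsoup.parse(String) with a relative href), absUrl returns an empty string.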
Usage scenarios
CASE: Parse an HTML document => get a Document object
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
String html = "<html><head><title>First parse</title></head><body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
CASE: Parse an HTML fragment => get a Document object
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
CASE: Fetch and parse a URL => get a Document object
org.jsoup.Connection connection = Jsoup.connect("http://example.com/");
Document doc = connection.get();//HTTP Method = GET
String title = doc.title();
Cookies and other parameters can be sent along as well (much like a Python crawler):
Document doc = Jsoup.connect("http://example.com")
.data("query", "Java")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post(); //HTTP Method = POST
CASE: Parse a local HTML file => get a Document object
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
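/**
 * Reads the text content of a file, line by line.
 * Needs imports from java.io: BufferedReader, File, FileInputStream,
 * InputStreamReader, IOException.
 */
// ENCODE is not defined in the original snippet; UTF-8 is assumed here.
private static final String ENCODE = "UTF-8";
public static String openFile(String szFileName) {
    StringBuilder szContent = new StringBuilder();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream(new File(szFileName)), ENCODE))) {
        String szTemp;
        while ((szTemp = bis.readLine()) != null) {
            szContent.append(szTemp).append("\n");
        }
        return szContent.toString();
    } catch (IOException e) {
        // As in the original: an unreadable file yields an empty string
        return "";
    }
}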
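Reading the file into a string by hand is optional, because Jsoup.parse(File, charsetName, baseUri) decodes the file itself. The sketch below writes a small temporary file and parses it directly; the class name FileParseSketch, the helper names, and the sample page are assumptions for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileParseSketch {

    /** Parses the file as UTF-8 and returns the document title. */
    public static String titleOf(Path file) throws IOException {
        // The base URI is only used to resolve relative links in the file.
        Document doc = Jsoup.parse(file.toFile(), "UTF-8", "http://example.com/");
        return doc.title();
    }

    /** Writes a hypothetical sample page to a temp file and parses it. */
    public static String titleOfSample() {
        String html = "<html><head><title>Local</title></head>"
                + "<body><a href='a.html'>a</a></body></html>";
        try {
            Path tmp = Files.createTempFile("jsoup-sample", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return titleOf(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOfSample()); // Local
    }
}
```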