Web Scraping with LangChain

Published by 霍格沃兹测试开发学社 on 2024-07-18

Original URL: https://www.cnblogs.com/hogwarts/p/18309127


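One of LangChain's strengths is that it wraps many tools that can be used directly, which lowers the learning cost for users. Web scraping is one example.

The web scraping section of its official documentation also includes good examples.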

Use cases

Information scraping.
RAG information retrieval.

Practical application

Requirements

Get the title of each post on the ceshiren site together with its URL.
ceshiren forum address: https://ceshiren.com/

Implementation approach
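
The source code in the next section loads the page with AsyncChromiumLoader, keeps only the text inside span tags, splits that text into chunks, and passes the first chunk to the model together with a schema. As a rough illustration of the span-filtering step alone, here is a stdlib-only sketch that runs without a browser or an API key. It uses Python's html.parser in place of the BeautifulSoupTransformer used in the article, so it only approximates that step; SpanTextCollector, extract_span_text, and the sample HTML are hypothetical names and data made up for this example.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text found inside <span> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <span> elements currently open
        self.texts = []  # text fragments gathered so far

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that sits inside at least one <span>.
        if self.depth > 0 and text:
            self.texts.append(text)


def extract_span_text(html: str) -> str:
    """Return the text of all <span> elements, joined by single spaces."""
    collector = SpanTextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.texts)


# Made-up HTML standing in for a forum topic list (an assumption, not the real page).
sample_html = """
<div class="topic">
  <a href="/t/1"><span>First post title</span></a>
  <p>Text outside span tags is dropped.</p>
  <a href="/t/2"><span>Second post title</span></a>
</div>
"""
span_text = extract_span_text(sample_html)
print(span_text)
```

In this made-up sample the links live on the a elements, so the span text alone carries the titles but not the URLs. Whether the same holds for a real page depends on its markup.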

Source code


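# Define the LLM
from langchain.chains import create_extraction_chain
from langchain_openai import ChatOpenAI

# Model name as given in the original article; substitute one available to your account if needed
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")

# Define the extraction helper
def extract(content: str, schema: dict):
    # Build an extraction chain for the schema and run it on the text
    return create_extraction_chain(schema=schema, llm=llm).invoke(content)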

import pprint
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

def scrape_with_playwright(urls, schema):
    # Load the pages
    loader = AsyncChromiumLoader(urls)
    docs = loader.load()
    # Transform the documents
    bs_transformer = BeautifulSoupTransformer()
    # Keep only the content of span tags
    docs_transformed = bs_transformer.transform_documents(
        docs, tags_to_extract=["span"]
    )
    # Split the text into chunks
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=1000, chunk_overlap=0
    )
    splits = splitter.split_documents(docs_transformed)
    # The page is large, so only the first chunk is passed in, along with the schema to use
    extracted_content = extract(schema=schema, content=splits[0].page_content)
    pprint.pprint(extracted_content)
    return extracted_content

urls = ["https://ceshiren.com/"]
schema = {
    "properties": {
        "title": {"type": "string"},
        "url": {"type": "string"},
    },
    "required": ["title", "url"],
}
extracted_content = scrape_with_playwright(urls, schema=schema)
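
The function above sends only the first chunk to the model. If every chunk should be covered, one possible approach is to run the extraction per chunk, drop records that lack a required field, and remove duplicates. The sketch below shows that merging logic only. It assumes each extraction call can be reduced to a list of dicts (unwrapping the chain's return value is left to the caller), and extract_all, fake_extract, and the chunk data are hypothetical stand-ins so the example runs without an API key.

```python
from typing import Callable, Dict, Iterable, List


def extract_all(
    chunks: Iterable[str],
    extract_fn: Callable[[str], List[Dict[str, str]]],
    required: Iterable[str] = ("title", "url"),
) -> List[Dict[str, str]]:
    """Run extract_fn on every chunk, keep complete records, drop duplicates."""
    required = tuple(required)
    seen = set()
    merged = []
    for chunk in chunks:
        for record in extract_fn(chunk):
            # Skip records that are missing a required field or leave it empty.
            if not all(record.get(key) for key in required):
                continue
            identity = tuple(record[key] for key in required)
            if identity not in seen:
                seen.add(identity)
                merged.append(record)
    return merged


# Stand-in for the LLM call: returns canned, made-up records per chunk.
def fake_extract(chunk: str) -> List[Dict[str, str]]:
    canned = {
        "chunk-1": [{"title": "Post A", "url": "/t/a"}, {"title": "Post B"}],
        "chunk-2": [{"title": "Post A", "url": "/t/a"}, {"title": "Post C", "url": "/t/c"}],
    }
    return canned.get(chunk, [])


posts = extract_all(["chunk-1", "chunk-2"], fake_extract)
print(posts)
```

Passing the extractor in as a parameter keeps the merging logic testable on its own. Note that calling the model once per chunk multiplies API usage accordingly.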

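Summary

Understand the approach to web scraping and the technologies involved.
Use LangChain to scrape post titles and URLs from the ceshiren (測試人) site.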
