Using Elasticsearch and Related Python Libraries on Linux

Posted by 十方央丶 on 2024-04-08

Original URL: https://www.cnblogs.com/itsfei/p/18121695

Tags: Linux, Elasticsearch, Python

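- What is Elasticsearch?

Elasticsearch (es for short) is an open-source distributed search engine. It can be used for search, log statistics, analytics, system monitoring, and similar tasks.

- Installation

1. Download

Download the Linux tar.gz package from the official Elasticsearch download page.

2. Extract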

tar zxvf elasticsearch-8.13.0-linux-x86_64.tar.gz -C /usr/local/

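3. Resolve the JDK dependency

Recent es archives ship with a bundled JDK. If a JDK is already installed on the Linux host, es may pick up that JDK at startup instead, and a version mismatch can then cause errors. (Releases before 8.0 honor JAVA_HOME; 8.x releases read ES_JAVA_HOME instead and otherwise fall back to the bundled JDK, so this step mainly matters when one of those variables is set.)

Go to the bin directory and edit the elasticsearch startup script:

# Point JAVA_HOME at the JDK bundled with es
export JAVA_HOME=/usr/local/elasticsearch-8.13.0/jdk
export PATH=$JAVA_HOME/bin:$PATH

# Prefer the bundled java binary; otherwise fall back to the one on PATH
if [ -x "$JAVA_HOME/bin/java" ]; then
        JAVA="$JAVA_HOME/bin/java"
else
        JAVA=$(which java)
fi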

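4. Create an es user

Note: es refuses to start as the root user!

# Create the user
useradd user-es

# Recursively set the owner and group of the es directory
chown user-es:user-es -R /usr/local/elasticsearch-8.13.0

# Switch to the new user
su user-es

# Enter the bin directory
cd /usr/local/elasticsearch-8.13.0/bin

# Start elasticsearch
./elasticsearch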

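Open http://ip:9200 to test. By default a user name and password are required. (With the default 8.x security auto-configuration the HTTP endpoint may also be served over TLS, in which case use https://ip:9200.)

To disable the login password, set xpack.security.enabled to false in /usr/local/elasticsearch-8.13.0/config/elasticsearch.yml.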
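
The reachability test above can also be scripted. The following is a minimal sketch using only the Python standard library. The helper names build_request and node_info are hypothetical, and the address and credentials you pass in are your own; nothing here comes from the original setup beyond the root URL on port 9200.

```python
import base64
import json
import urllib.request


def build_request(base_url, user=None, password=None):
    """Build a GET request for the node's root endpoint, with optional basic auth."""
    request = urllib.request.Request(base_url.rstrip("/") + "/")
    if user is not None and password is not None:
        raw = f"{user}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request


def node_info(base_url, user=None, password=None, timeout=5):
    """Fetch and decode the JSON document the node serves at its root URL."""
    request = build_request(base_url, user, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling node_info("http://ip:9200", "elastic", password) against a running node should return a dictionary whose version.number field reports the es release. If security has been disabled, the user and password arguments can be omitted.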

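Settings such as the log path and port number are configured in:

/usr/local/elasticsearch-8.13.0/config/elasticsearch.yml

Additional note: you may run into the following problem.

Insufficient memory for es

vim /usr/local/elasticsearch-8.13.0/config/jvm.options

## -Xms4g
## -Xmx4g

Change these to the following (start each line in the first column, with no leading spaces):

-Xms1g
-Xmx1g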

- Using Python with es

Goal: index information about a large number of files under a directory into es, then search for a particular file.

Version 1

import os
from datetime import datetime
from elasticsearch import Elasticsearch

# Connect to es (replace ip and port with your node's address)
es = Elasticsearch(['http://ip:port'])

file_index = "test1"
directory = "/mnt/dir1"

for root, dirs, files in os.walk(directory):
    for filename in files:
        file_path = os.path.join(root, filename)
        # Skip files that are already indexed; the path doubles as the document id
        if es.exists(index=file_index, id=file_path):
            continue
        file_info = {
            "file_path": file_path,
            "file_size": os.path.getsize(file_path),
            # On Linux, getctime is the inode change time, not the creation time
            "created_at": datetime.fromtimestamp(os.path.getctime(file_path)),
        }
        # Insert the file information, one request per file
        # (document= is the 8.x client keyword; older clients use body=)
        es.index(index=file_index, id=file_path, document=file_info)

# Search
file_name = "file2"
query = {
    "match": {
        "file_path": file_name
    }
}
result = es.search(index=file_index, query=query)
for hit in result['hits']['hits']:
    print(hit['_source'])

# Uncomment to delete the index afterwards
#es.indices.delete(index=file_index)

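This covers the basic functionality, but it is extremely slow: every file costs one exists request plus one index request.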
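
The search half of Version 1 can be separated from the indexing loop. The sketch below assumes the usual search response shape (hits, then hits, then _source). build_match_clause and extract_sources are hypothetical helper names, and the sample response is hand-written for illustration rather than captured from a real node.

```python
def build_match_clause(field, text):
    """Return a full-text match clause for a single field."""
    return {"match": {field: text}}


def extract_sources(response):
    """Collect the _source document of every hit in a search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Hand-written stand-in for what es.search(...) returns (illustrative only)
sample_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_index": "test1",
                "_id": "/mnt/dir1/file2",
                "_source": {"file_path": "/mnt/dir1/file2", "file_size": 0},
            },
        ],
    }
}

print(build_match_clause("file_path", "file2"))
print(extract_sources(sample_response))
```

The dictionary from build_match_clause is the match clause that Version 1 sends to es.search. Assuming file_path is left to dynamic mapping, it is indexed as an analyzed text field, so a path such as /mnt/dir1/file2 is split into tokens like mnt, dir1, and file2. That is why matching on file2 alone can find the full path.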

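Version 2

Inserting file information one document at a time is too slow. Instead, collect the documents and submit them together with the bulk helper from the es Python client.

The bulk API adds, updates, or deletes documents in batches: a single request can carry many operations.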

import os
from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Connect to es
es = Elasticsearch(['http://ip:port'])

file_index = "test1"
directory = "/mnt/dir1"
data = []

for root, dirs, files in os.walk(directory):
    for filename in files:
        file_path = os.path.join(root, filename)
        # Note: this check is still one request per file
        if es.exists(index=file_index, id=file_path):
            continue
        file_info = {
            "file_path": file_path,
            "file_size": os.path.getsize(file_path),
            "created_at": datetime.fromtimestamp(os.path.getctime(file_path)),
        }
        # Queue a bulk action; _index and _id tell es where to store the document
        data.append({"_index": file_index, "_id": file_path, "_source": file_info})

# Submit in bulk; stats_only=True returns counts, and raise_on_error=False
# reports failures in the count instead of raising
success, failed = bulk(es, data, stats_only=True, raise_on_error=False)
print(f"Documents written: {success}, documents failed: {failed}")

# Search
file_name = "file2"
query = {
    "match": {
        "file_path": file_name
    }
}
result = es.search(index=file_index, query=query)
for hit in result['hits']['hits']:
    print(hit['_source'])

# Uncomment to delete the index afterwards
#es.indices.delete(index=file_index)
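
Version 2 still calls es.exists once per file, so part of the per-file cost remains. One possible refinement, sketched here under assumptions rather than taken from the original, is to generate the bulk actions lazily and mark each one with _op_type "create", so that es rejects documents whose id already exists and no prior lookup is needed. iter_file_actions is a hypothetical helper; the action keys (_op_type, _index, _id, _source) are the ones the elasticsearch.helpers bulk functions read.

```python
import os
import tempfile
from datetime import datetime


def iter_file_actions(directory, index_name):
    """Yield one bulk 'create' action per file found under directory."""
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            file_path = os.path.join(root, filename)
            created = datetime.fromtimestamp(os.path.getctime(file_path))
            yield {
                "_op_type": "create",  # rejected with a conflict if the id exists
                "_index": index_name,
                "_id": file_path,
                "_source": {
                    "file_path": file_path,
                    "file_size": os.path.getsize(file_path),
                    "created_at": created.isoformat(),
                },
            }


# Demonstration on a throwaway directory, without contacting es
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("file1", "file2"):
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write("x")
    demo_actions = list(iter_file_actions(demo_dir, "test1"))

print(len(demo_actions))
```

With a live client, the generator can be passed directly as bulk(es, iter_file_actions(directory, file_index), stats_only=True, raise_on_error=False). Files that were indexed on an earlier run then show up in the failed count as conflicts instead of costing an extra request each.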

Version 4
Chunking + thread pool + bulk

import os
import hashlib
import concurrent.futures
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Elasticsearch connection settings
es = Elasticsearch(['http://ip:port'])

files_index = "test4"
directory = "/mnt/dir1"

# Index the information for one chunk of files into Elasticsearch
def index_files_bulk(file_paths):
    actions = []
    for file_path in file_paths:
        file_info = {
            'file_name': os.path.basename(file_path),
            'file_path': file_path,
            # Add more file information here as needed
        }
        # Use a SHA-1 digest of the path as the document id. The built-in hash()
        # is salted per process for strings, so its ids would change between runs.
        doc_id = hashlib.sha1(os.fsencode(file_path)).hexdigest()
        action = {
            '_op_type': 'update',
            '_index': files_index,
            '_id': doc_id,
            'doc': file_info,
            'doc_as_upsert': True  # Insert the document if it does not exist
        }

        actions.append(action)

    # Send the whole chunk in one bulk call
    bulk(es, actions)

def index_files_parallel(directory):
    file_paths = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_paths.append(os.path.join(root, file))

    # Import the file information with a thread pool
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Split the paths into chunks; each task handles one chunk
        chunk_size = 1000  # Number of files per task
        futures = [
            executor.submit(index_files_bulk, file_paths[i:i + chunk_size])
            for i in range(0, len(file_paths), chunk_size)
        ]
        # Surface any exception raised inside a worker
        for future in concurrent.futures.as_completed(futures):
            future.result()

# Uncomment to import the file information into Elasticsearch in parallel
#index_files_parallel(directory)

# Search
file_name = "file4"
query = {
    "match": {
        "file_path": file_name
    }
}
result = es.search(index=files_index, query=query)
for hit in result['hits']['hits']:
    print(hit['_source'])

# Uncomment to delete the index afterwards
#es.indices.delete(index=files_index)
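
Version 4 combines three ideas: a document id derived from the file path, fixed-size chunks, and a thread pool. The sketch below isolates them so they can be tried without an es node. stable_doc_id, chunked, and run_chunks are hypothetical names, and the stand-in worker only counts its chunk where a real one would call bulk. The id uses a SHA-1 digest because Python's built-in hash() is salted per process for strings and would not give the same id on the next run.

```python
import concurrent.futures
import hashlib


def stable_doc_id(file_path):
    """Return a document id that is identical across runs and machines."""
    return hashlib.sha1(file_path.encode("utf-8")).hexdigest()


def chunked(items, chunk_size):
    """Yield consecutive slices of items, each at most chunk_size long."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def run_chunks(items, worker, chunk_size=1000, max_workers=4):
    """Run worker(chunk) on a thread pool and return results in chunk order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(worker, chunk) for chunk in chunked(items, chunk_size)]
        # result() re-raises any exception that occurred in the worker thread
        return [future.result() for future in futures]


# Stand-in worker: len just counts the chunk (a real worker would call bulk)
paths = [f"/mnt/dir1/file{i}" for i in range(2500)]
print(run_chunks(paths, len))
```

Collecting the futures and calling result() on each matters in practice: an exception raised inside a submitted task is otherwise stored on the future and never reported, so a failed chunk would go unnoticed.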

To be continued...
