An Analysis of Ceph Configuration Parameters

Posted by weixin_34365417 on 2018-11-14

Overview
Ceph has a great many configuration parameters, and plenty of tuning guides can be found online, but few of them explain why a parameter is set to a given value, or whether that value is reasonable.

This article starts from our current ceph.conf file and explains each of its entries, as a basis for future tuning and for onboarding newcomers;

Parameter details
1. Fixed configuration parameters

fsid = 6d529c3d-5745-4fa5-*-*
mon_initial_members = **, **, **
mon_host = *.*.*.*, *.*.*.*, *.*.*.*

These are usually generated by ceph-deploy; they are all ceph monitor related parameters and need not be modified;

2. Network configuration parameters

public_network = 10.10.2.0/24  default ""
cluster_network = 10.10.2.0/24 default ""
public network: the network used for monitor<->osd, client<->monitor and client<->osd traffic; ideally a high-bandwidth (10GbE) network;
cluster network: the network used for OSD<->OSD traffic (replication, recovery); usually also a high-bandwidth (10GbE) network;

Reference: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/

3. Pool size configuration parameters

osd_pool_default_size = 3       default 3
osd_pool_default_min_size = 1   default 0 // 0 means no specific default; ceph will use size - size/2

These are the default size parameters applied when a ceph pool is created; 3 and 1 are the usual choices, since 3 replicas are enough to guarantee data reliability;
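The fallback noted in the comment above (when osd_pool_default_min_size is left at 0, ceph falls back to size - size/2 with integer division) can be sketched as follows; the helper name is ours, not Ceph's:

```python
def effective_min_size(size: int, configured_min_size: int = 0) -> int:
    """Mirror the comment above: 0 means no specific default,
    so ceph falls back to size - size/2 (integer division)."""
    if configured_min_size > 0:
        return configured_min_size
    return size - size // 2

# With the 3-replica default: min_size falls back to 2,
# i.e. writes need at least 2 replicas available to be acknowledged.
print(effective_min_size(3))     # 2
print(effective_min_size(3, 1))  # 1, as in the ceph.conf above
```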
4. Authentication configuration parameters

auth_service_required = none   default "cephx"
auth_client_required = none    default "cephx, none"
auth_cluster_required = none   default "cephx"

These are the Ceph authentication parameters; by default cephx authentication is enabled;
On clusters used only internally they are often set to none, i.e. no authentication, which slightly speeds up access to the cluster (at the cost of security);

5. OSD down/out configuration parameters

mon_osd_down_out_interval = 3600  default 300 // seconds
mon_osd_min_down_reporters = 3    default 2
mon_osd_report_timeout = 900      default 900
osd_heartbeat_interval = 10       default 6
osd_heartbeat_grace = 60          default 20
mon_osd_down_out_interval: how long ceph waits after marking an osd down before marking it out (seconds);
mon_osd_min_down_reporters: the minimum number of reporters needed before the mon marks an osd down (each peer osd that reports the osd as down counts as one reporter);
mon_osd_report_timeout: the longest the mon waits for an osd's report before marking the osd down;
osd_heartbeat_interval: the interval at which an osd sends heartbeats to other osds (only osds sharing a PG exchange heartbeats);
osd_heartbeat_grace: the longest interval after which an osd reports a peer osd as down. Raising the grace has a side effect: if an osd exits abnormally, the other osds must wait out the whole grace period before reporting it, and during that window IO to the PGs served by that osd hangs, so do not set the grace too high.

Configuring these parameters sensibly for your environment helps detect down osds promptly (reducing the duration and likelihood of hung IO) while delaying the down->out transition (avoiding unnecessary data recovery caused by transient network jitter);
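The reporter rule above can be sketched as follows; this is illustrative only (the names are ours, and the real mon logic also weighs reporters by failure-domain subtree):

```python
def should_mark_down(reporters: set, min_down_reporters: int) -> bool:
    """An OSD is marked down only once enough distinct peer OSDs
    (each one a 'reporter') have reported it as down."""
    return len(reporters) >= min_down_reporters

reports = {"osd.1", "osd.2"}
print(should_mark_down(reports, 3))  # False: only 2 reporters, threshold is 3
reports.add("osd.7")
print(should_mark_down(reports, 3))  # True: the third reporter tips it over
```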

Reference:

http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/

http://blog.wjin.org/posts/ceph-osd-heartbeat.html

6. Objecter configuration parameters

objecter_inflight_ops = 10240               default 1024
objecter_inflight_op_bytes = 1048576000     default 100M
These are the throttle settings of the objecter on the osd client side; they affect the performance of librbd and RGW clients;

Recommendation:

Increase both values; the defaults throttle client-side concurrency quite conservatively.

7. Ceph RGW configuration parameters

rgw_frontends = "civetweb num_threads=500"              default "fastcgi, civetweb port=7480"
rgw_thread_pool_size = 200                              default 100
rgw_override_bucket_index_max_shards = 20               default 0

rgw_max_chunk_size = 1048576                            default 512 * 1024
rgw_cache_lru_size = 10000                              default 10000 // num of entries in rgw cache
rgw_bucket_default_quota_max_objects = *                default -1 // number of objects allowed
rgw_frontends: the rgw frontend configuration; the lightweight civetweb frontend is the usual choice; port is the port rgw listens on, set it as your deployment requires; num_threads is the number of civetweb threads;
rgw_thread_pool_size: the rgw frontend thread count, with the same meaning as num_threads in rgw_frontends; num_threads takes precedence over rgw_thread_pool_size, so only one of the two needs to be set;
rgw_override_bucket_index_max_shards: the maximum number of shards for a bucket's index object; raising it speeds up access to the bucket index (less contention per shard object), but makes bucket listing (ls) slower;
rgw_max_chunk_size: the maximum rgw chunk size; for object storage workloads dominated by large files this can be increased;

rgw_cache_lru_size: the number of entries in rgw's LRU cache; for read-heavy workloads, increasing it speeds up rgw responses;
rgw_bucket_default_quota_max_objects: caps the default maximum number of objects per bucket;
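To see why more shards spread bucket-index load, here is an illustrative sketch of shard selection. RGW uses its own string hash internally; md5 stands in here purely to show the modulo mapping, and the function name is ours:

```python
import hashlib

def bucket_index_shard(object_name: str, num_shards: int) -> int:
    """Map an object name to one of num_shards bucket index shards.
    md5 is a stand-in for RGW's real hash, for illustration only."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return h % num_shards

# With rgw_override_bucket_index_max_shards = 20, index updates for
# different objects land on different shard objects instead of one,
# so a single hot omap object stops being the bottleneck.
shards = {bucket_index_shard("obj-%d" % i, 20) for i in range(1000)}
print(sorted(shards))
```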

Reference:

http://docs.ceph.com/docs/jewel/install/install-ceph-gateway/

http://ceph-users.ceph.narkive.com/mdB90g7R/rgw-increase-the-first-chunk-size

https://access.redhat.com/solutions/2122231

8. Debug configuration parameters

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_mon = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_paxos = 0/0
debug_rgw = 0/0 

All debug logging is turned off here, which speeds the cluster up somewhat, but key logs are lost as well, making problems harder to analyze afterwards;
Reference:

http://www.10tiao.com/html/362/201609/2654062487/1.html

9. OSD op configuration parameters

osd_enable_op_tracker = true       default true
osd_num_op_tracker_shard = 32      default 32
osd_op_threads = 5                 default 2
osd_disk_threads = 1               default 1
osd_op_num_shards = 15             default 5
osd_op_num_threads_per_shard = 2   default 2
osd_enable_op_tracker: tracks the state of osd ops, default true; disabling it is not recommended, because the osd's slow_request, ops_in_flight and historic_ops statistics then stop working:
# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck.  Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck.  Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.

With the op tracker enabled, if the cluster sees high iops, osd_num_op_tracker_shard can be increased, because each shard has its own independent mutex:

class OpTracker {
...
    struct ShardedTrackingData {
        Mutex ops_in_flight_lock_sharded;
        xlist<TrackedOp *> ops_in_flight_sharded;
        explicit ShardedTrackingData(string lock_name):
            ops_in_flight_lock_sharded(lock_name.c_str()) {}
    };
    vector<ShardedTrackingData*> sharded_in_flight_list;
    uint32_t num_optracker_shards;
...
};
osd_op_threads: serves the peering_wq (osd peering requests) and recovery_gen_wq (PG recovery requests) work queues;
osd_disk_threads: serves the remove_wq (PG remove requests) work queue;
 

osd_op_num_shards and osd_op_num_threads_per_shard: together they size the osd_op_tp thread pool, whose work queue is op_shardedwq;

The requests handled there include:

OpRequestRef

PGSnapTrim

PGScrub

Increasing osd_op_num_shards raises the number of threads handling osd ops, increasing concurrency and improving OSD performance;
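The sharded queue sizing can be sketched as follows (a sketch with our own names; the real op_shardedwq pins each PG to a fixed shard so per-PG op ordering is preserved):

```python
def osd_op_tp_threads(num_shards: int, threads_per_shard: int) -> int:
    """Total threads in the osd_op_tp pool serving op_shardedwq:
    one group of threads_per_shard workers per shard."""
    return num_shards * threads_per_shard

def shard_for_pg(pg_hash: int, num_shards: int) -> int:
    """Each PG maps to one shard, keeping its ops ordered."""
    return pg_hash % num_shards

print(osd_op_tp_threads(15, 2))  # 30 threads with the tuned values above
print(osd_op_tp_threads(5, 2))   # 10 threads with the defaults
```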

10. OSD client message configuration parameters

osd_client_message_size_cap = 1048576000  default 500*1024L*1024L     // client data allowed in-memory (in bytes)
osd_client_message_cap = 1000             default 100     // num client messages allowed in-memory

These cap the client messages an osd holds in memory; larger values let the osd absorb more client traffic, at the cost of more system memory;
Recommendation:
When the server has plenty of memory, increase both values appropriately.
11. OSD scrub configuration parameters

osd_scrub_begin_hour = 10                default 0
osd_scrub_end_hour = 5                   default 24

// The time in seconds that scrubbing sleeps between two consecutive scrubs

osd_scrub_sleep = 1                      default 0        // sleep between [deep]scrub ops

osd_scrub_load_threshold = 8             default 0.5

// min/max number of objects per chunky scrub; the defaults are shown below

osd_scrub_chunk_min = 5
osd_scrub_chunk_max = 25

Ceph osd scrub is the mechanism that guarantees data consistency in ceph. Scrubbing is done per PG, and each scrub acquires the PG lock, so it can interfere with the PG's normal IO;
Ceph later introduced the chunky scrub mode: each scrub only covers a subset of the PG's objects, releases the PG lock when done, and queues the PG's next scrub chunk. This greatly shortens how long each scrub holds the PG lock and limits the impact on the PG's normal IO;
Similarly, the osd_scrub_sleep parameter makes the thread release the PG lock and sleep for a while before each scrub chunk, further reducing the impact of scrubbing on normal PG IO;
Recommendations:

osd_scrub_begin_hour and osd_scrub_end_hour: the start/end hours of the OSD scrub window; set them according to your workload;
osd_scrub_sleep: the time the osd sleeps between scrub chunks; there is a known bug related to this option, so it is safer to leave it disabled;
osd_scrub_load_threshold: the system load threshold above which the osd will not start a scrub; set it according to your typical load average;
osd_scrub_chunk_min and osd_scrub_chunk_max: set according to the number of objects per PG; for RGW workloads consisting entirely of small files, both values should be increased;
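Note that the begin/end hours above (10 and 5) describe a window that wraps past midnight. A sketch of the hour check (the real check lives in OSD::scrub_time_permit; this only shows the wraparound logic):

```python
def scrub_hour_permitted(hour: int, begin_hour: int, end_hour: int) -> bool:
    """True if `hour` falls inside the scrub window, handling windows
    that wrap past midnight (e.g. begin 10, end 5)."""
    if begin_hour < end_hour:
        return begin_hour <= hour < end_hour
    # wrapped window, e.g. [10, 24) plus [0, 5)
    return hour >= begin_hour or hour < end_hour

print(scrub_hour_permitted(23, 10, 5))  # True: inside the overnight window
print(scrub_hour_permitted(7, 10, 5))   # False: business hours, no scrub
print(scrub_hour_permitted(3, 0, 24))   # True: the default window is always open
```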

Reference:

http://www.jianshu.com/p/ea2296e1555c

http://tracker.ceph.com/issues/19497

12. OSD thread timeout configuration parameters

osd_op_thread_timeout = 100                 default 15
osd_op_thread_suicide_timeout = 300         default 150

osd_recovery_thread_timeout = 100           default 30
osd_recovery_thread_suicide_timeout = 300   default 300
osd_op_thread_timeout and osd_op_thread_suicide_timeout apply to the work queues:
op_shardedwq - handles: OpRequestRef, PGSnapTrim, PGScrub
peering_wq - handles: osd peering requests
osd_recovery_thread_timeout and osd_recovery_thread_suicide_timeout apply to the work queue:

recovery_wq - handles: PG recovery requests

All of Ceph's work queues derive from a common base class WorkQueue_, defined as:

/// Pool of threads that share work submitted to multiple work queues.
class ThreadPool : public md_config_obs_t {
...
    /// Basic interface to a work queue used by the worker threads.
    struct WorkQueue_ {
        string name;
        time_t timeout_interval, suicide_interval;
        WorkQueue_(string n, time_t ti, time_t sti)
            : name(n), timeout_interval(ti), suicide_interval(sti)
        { }
...

Here timeout_interval and suicide_interval correspond to the timeout and suicide_timeout settings described above;
While a thread processes a request from a work queue, it is subject to both limits:

timeout_interval - when exceeded, m_unhealthy_workers is incremented
suicide_interval - when exceeded, an assert fires and the OSD process crashes
The check is implemented in:

bool HeartbeatMap::_check(const heartbeat_handle_d *h, const char *who, time_t now)
{
    bool healthy = true;
    time_t was;
    was = h->timeout.read();
    if (was && was < now) {
        ldout(m_cct, 1) << who << " '" << h->name << "'"
                        << " had timed out after " << h->grace << dendl;
        healthy = false;
    }
    was = h->suicide_timeout.read();
    if (was && was < now) {
        ldout(m_cct, 1) << who << " '" << h->name << "'"
                        << " had suicide timed out after " << h->suicide_grace << dendl;
        assert(0 == "hit suicide timeout");
    }
    return healthy;
}

Currently only RGW registers a perfcounter for its workers, so only RGW can report total/unhealthy worker counts via perf dump:

[root@ node1]# ceph daemon /var/run/ceph/ceph-client.rgw.*.asok perf dump | grep worker
        "total_workers": 32,
        "unhealthy_workers": 0

The related config option is:

OPTION(rgw_num_async_rados_threads, OPT_INT, 32) // num of threads to use for async rados operations

Recommendations:

*_thread_timeout: the smaller this is, the sooner slow requests are detected, so don't set it very high; on fast devices in particular, consider lowering it;
*_thread_suicide_timeout: if this is too small, hitting it crashes the OSD, so it should be raised; especially after the corresponding throttles have been increased, raise this value as well;

13. Filestore op thread configuration parameters

filestore_op_threads = 5                    default 2
filestore_op_thread_timeout = 100           default 60
filestore_op_thread_suicide_timeout = 300   default 180
filestore_op_threads: sizes the op_tp thread pool, whose work queue is op_wq; every filestore request passes through op_wq;
increasing it raises the filestore's processing capacity and performance; tune it together with the filestore throttles;
filestore_op_thread_timeout and filestore_op_thread_suicide_timeout apply to the work queue:

op_wq
Their meaning is the same as thread_timeout/thread_suicide_timeout in the previous section;

14. Filestore merge/split configuration parameters

filestore_merge_threshold = -1    default 10
filestore_split_multiple = 10000  default 2
These two parameters govern directory splitting/merging inside the filestore. The maximum number of files allowed per filestore directory is:
filestore_split_multiple * abs(filestore_merge_threshold) * 16

In RGW small-file workloads, the default limit (320 files) is reached very quickly, and a directory split triggered while writes are in flight hurts filestore performance badly;
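The formula above can be checked directly (the helper name is ours):

```python
def max_files_per_dir(split_multiple: int, merge_threshold: int) -> int:
    """Max files a filestore directory may hold before it splits:
    filestore_split_multiple * abs(filestore_merge_threshold) * 16."""
    return split_multiple * abs(merge_threshold) * 16

print(max_files_per_dir(2, 10))      # 320 with the defaults
print(max_files_per_dir(10000, -1))  # 160000 with the tuned values above
```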

When a filestore directory splits, it splits into multiple directory levels according to the following rule, with 16 subdirectories at the lowest level:

For example, PG 31.4C0 has hash suffix 4C0; when its directory splits, it becomes DIR_0/DIR_C/DIR_4/{DIR_0, ..., DIR_F};

The objects in the original directory are distributed into the subdirectories by the same rule. An object name has the form _head_xxxxX4C0; at split time, whatever hex digit X is, the object goes into subdirectory DIR_X. For example, object _head_xxxxA4C0 goes into DIR_0/DIR_C/DIR_4/DIR_A;
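The layout rule above is easy to sketch: the hash suffix is consumed right-to-left, one hex digit per directory level (an illustrative sketch, not the filestore code itself):

```python
def split_path(hash_suffix: str, depth: int) -> str:
    """Directory path for an object whose hash ends in `hash_suffix`,
    after `depth` levels of splitting: digits taken right-to-left."""
    digits = hash_suffix[::-1][:depth]          # e.g. "A4C0" -> "0C4A"
    return "/".join("DIR_" + d for d in digits)

# PG 31.4C0 split down to the fourth level, then object _head_xxxxA4C0:
print(split_path("A4C0", 4))  # DIR_0/DIR_C/DIR_4/DIR_A
```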

Remedies:

1) Increase the merge/split parameters so a single directory can hold more files;

2) Set filestore_merge_threshold to a negative value; this triggers pre-splitting of directories ahead of time, avoiding a concentrated burst of splits in one time window (we have not investigated the exact mechanism);

3) Specify expected-num-objects when creating the pool; the split subdirectories are then created at pool-creation time according to the split rule, so later splits never hurt filestore performance;

Reference:

http://docs.ceph.com/docs/master/rados/configuration/filestore-config-ref/

http://docs.ceph.com/docs/jewel/rados/operations/pools/#create-a-pool

http://blog.csdn.net/for_tech/article/details/51251936

http://ivanjobs.github.io/page3/

15. Filestore fd cache configuration parameters

filestore_fd_cache_shards = 32   default 16     // FD number of shards
filestore_fd_cache_size = 32768  default 128    // FD lru size
The filestore fd cache speeds up access to the files backing the filestore; for anything other than write-once workloads, increasing these values noticeably improves filestore performance;

16. Filestore sync configuration parameters

filestore_wbthrottle_enable = false   default true      recommended off for SSDs
filestore_min_sync_interval = 1       default 0.01 s    minimum interval between syncs of fs data to disk, FileStore::sync_entry()
filestore_max_sync_interval = 10      default 5 s       maximum interval between syncs of fs data to disk, FileStore::sync_entry()
filestore_commit_timeout = 1000       default 600 s     used in FileStore::sync_entry() as new SyncEntryTimeout(m_filestore_commit_timeout)
filestore_wbthrottle_enable controls the filestore writeback throttle, i.e. the thresholds on the amount of data queued in the op_wq work queue; it defaults to true, and when it is enabled the XFS-related options are:
OPTION(filestore_wbthrottle_xfs_bytes_start_flusher, OPT_U64, 41943040)
OPTION(filestore_wbthrottle_xfs_bytes_hard_limit, OPT_U64, 419430400)
OPTION(filestore_wbthrottle_xfs_ios_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_ios_hard_limit, OPT_U64, 5000)
OPTION(filestore_wbthrottle_xfs_inodes_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_inodes_hard_limit, OPT_U64, 5000)

If you use ordinary HDDs, it can stay true; on SSDs it is recommended to turn it off and not use the writeback throttle;

filestore_min_sync_interval and filestore_max_sync_interval control how often the filestore flushes outstanding IO to disk; increasing them lets the system merge as much IO as possible and reduces write pressure on the filestore, but it also grows the page cache's memory footprint and widens the window in which unflushed data can be lost;

filestore_commit_timeout is the timeout for a filestore sync entry to complete; under heavy filestore load, raising it helps avoid OSD crashes caused by IO timeouts;

17. Filestore throttle configuration parameters

filestore_expected_throughput_bytes = 536870912    default 200MB   /// Expected filestore throughput in B/s
filestore_expected_throughput_ops = 2000           default 200     /// Expected filestore throughput in ops/s
filestore_queue_max_bytes = 1048576000             default 100MB
filestore_queue_max_ops = 5000                     default 50

/// Use above to inject delays intended to keep the op queue between low and high
filestore_queue_low_threshhold = 0.3               default 0.3
filestore_queue_high_threshhold = 0.9              default 0.9

filestore_queue_high_delay_multiple = 2            default 0    /// Filestore high delay multiple.  Defaults to 0 (disabled)
filestore_queue_max_delay_multiple = 10            default 0    /// Filestore max delay multiple.  Defaults to 0 (disabled)

The jewel release introduced a dynamic throttle to smooth out the long-tail latency the plain throttle can cause;
On ordinary disks the older throttle mechanism works well, which is why filestore_queue_high_delay_multiple and filestore_queue_max_delay_multiple both default to 0 (disabled);
On fast disks, run the small benchmark tool ceph_smalliobenchfs before deploying to derive suitable values;

The BackoffThrottle is documented in the source as follows:

/**
* BackoffThrottle
*
* Creates a throttle which gradually induces delays when get() is called
* based on params low_threshhold, high_threshhold, expected_throughput,
* high_multiple, and max_multiple.
*
* In [0, low_threshhold), we want no delay.
*
* In [low_threshhold, high_threshhold), delays should be injected based
* on a line from 0 at low_threshhold to
* high_multiple * (1/expected_throughput) at high_threshhold.
*
* In [high_threshhold, 1), we want delays injected based on a line from
* (high_multiple * (1/expected_throughput)) at high_threshhold to
* (high_multiple * (1/expected_throughput)) +
* (max_multiple * (1/expected_throughput)) at 1.
*
* Let the current throttle ratio (current/max) be r, low_threshhold be l,
* high_threshhold be h, high_delay (high_multiple / expected_throughput) be e,
* and max_delay (max_multiple / expected_throughput) be m.
*
* delay = 0, r \in [0, l)
* delay = (r - l) * (e / (h - l)), r \in [l, h)
* delay = e + (r - h) * (m / (1 - h)), r \in [h, 1)
*/
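The piecewise delay described in the comment's prose can be written out directly; this is a sketch of the math only, not Ceph's implementation:

```python
def backoff_delay(r, l, h, expected_throughput, high_multiple, max_multiple):
    """Delay injected at throttle ratio r (current/max), per the
    BackoffThrottle comment: 0 below l, a line from 0 to e on [l, h),
    then a steeper line from e toward e + m as r approaches 1."""
    e = high_multiple / expected_throughput   # high_delay
    m = max_multiple / expected_throughput    # max_delay
    if r < l:
        return 0.0
    if r < h:
        return (r - l) * (e / (h - l))
    return e + (r - h) * (m / (1 - h))

# filestore values from the table above: l=0.3, h=0.9, 2000 ops/s
print(backoff_delay(0.2, 0.3, 0.9, 2000, 2, 10))   # 0.0: below low_threshhold
print(backoff_delay(0.9, 0.3, 0.9, 2000, 2, 10))   # the knee: exactly e
print(backoff_delay(0.95, 0.3, 0.9, 2000, 2, 10))  # steeper past the knee
```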

Reference:

http://docs.ceph.com/docs/jewel/dev/osd_internals/osd_throttles/
http://blog.wjin.org/posts/ceph-dynamic-throttle.html
https://github.com/ceph/ceph/blob/master/src/doc/dynamic-throttle.txt
Ceph BackoffThrottle analysis

18. Filestore finisher threads configuration parameters

filestore_ondisk_finisher_threads = 2  default 1
filestore_apply_finisher_threads = 2   default 1

These set the number of finisher threads for filestore commit/apply, both 1 by default; every completed IO commit/apply passes through the corresponding ondisk/apply finisher thread;
With ordinary HDDs the disk is the bottleneck and a single finisher thread keeps up fine;

With fast disks, IO completes quickly and a single finisher thread cannot keep up with the stream of commit/apply replies, becoming the bottleneck itself; that is why the jewel release made the finisher thread pool size configurable; a value of 2 is usually sufficient;

19. Journal configuration parameters

journal_max_write_bytes = 1048576000     default 10M
journal_max_write_entries = 5000         default 100

journal_throttle_high_multiple = 2       default 0    /// Multiple over expected at high_threshhold. Defaults to 0 (disabled).
journal_throttle_max_multiple = 10       default 0    /// Multiple over expected at max.  Defaults to 0 (disabled).

/// Target range for journal fullness

OPTION(journal_throttle_low_threshhold, OPT_DOUBLE, 0.6)
OPTION(journal_throttle_high_threshhold, OPT_DOUBLE, 0.9)

journal_max_write_bytes and journal_max_write_entries limit the amount of data and the number of entries in a single journal write;

When the journal sits on an SSD partition, both values should be increased to raise journal throughput;
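The two caps above bound each journal write; a sketch of the batching decision (illustrative only, names ours):

```python
def journal_batch(pending, max_write_bytes, max_write_entries):
    """Take entry sizes (bytes) off `pending` until either the byte cap
    or the entry cap for a single journal write would be exceeded."""
    batch, batch_bytes = [], 0
    for size in pending:
        if len(batch) >= max_write_entries:
            break
        if batch_bytes + size > max_write_bytes:
            break
        batch.append(size)
        batch_bytes += size
    return batch, batch_bytes

# entry cap hit first: only 3 of 5 pending writes fit this journal write
print(journal_batch([4096] * 5, 10 * 1024 * 1024, 3))  # ([4096, 4096, 4096], 12288)
```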

journal_throttle_high_multiple and journal_throttle_max_multiple configure the JournalThrottle; JournalThrottle is a wrapper class around BackoffThrottle, so it works exactly like the dynamic throttle described in the filestore throttle section;

int FileJournal::set_throttle_params()
{
    stringstream ss;
    bool valid = throttle.set_params(
                     g_conf->journal_throttle_low_threshhold,
                     g_conf->journal_throttle_high_threshhold,
                     g_conf->filestore_expected_throughput_bytes,
                     g_conf->journal_throttle_high_multiple,
                     g_conf->journal_throttle_max_multiple,
                     header.max_size - get_top(),
                     &ss);
...
}

The relevant configuration options visible in the code above are:

journal_throttle_low_threshhold
journal_throttle_high_threshhold
filestore_expected_throughput_bytes

20. RBD cache configuration parameters

[client]
rbd_cache_size = 134217728                  default 32M // cache size in bytes
rbd_cache_max_dirty = 100663296             default 24M // dirty limit in bytes - set to 0 for write-through caching
rbd_cache_target_dirty = 67108864           default 16M // target dirty limit in bytes
rbd_cache_writethrough_until_flush = true   default true // keep writeback caching in writethrough mode until a flush is seen, to be sure the librbd user sends flushes so that writeback is safe
rbd_cache_max_dirty_age = 5                 default 1.0  // seconds in cache before writeback starts
rbd_cache_size: the client-side cache size per rbd image; it does not need to be large (64M is plenty), otherwise it eats a lot of client memory;
scale rbd_cache_max_dirty and rbd_cache_target_dirty from rbd_cache_size, following the proportions of the defaults;
rbd_cache_max_dirty: the maximum number of dirty bytes held in the cache in writeback mode, default 24MB; a value of 0 means writethrough mode;
rbd_cache_target_dirty: the dirty-byte threshold at which the cache starts writing back to the ceph cluster, default 16MB; note that it must be smaller than rbd_cache_max_dirty;
rbd_cache_writethrough_until_flush: the rbd cache stays in writethrough mode until the first flush arrives from the client kernel; only after that flush does it switch to writeback;

rbd_cache_max_dirty_age: the longest an entry may stay in the OSDC ObjectCacher before writeback starts;
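The constraints noted above can be checked with a small sketch (the helper name is ours, and we additionally assume dirty bytes must fit inside the cache):

```python
def rbd_cache_settings_sane(cache_size, max_dirty, target_dirty):
    """Writeback settings should satisfy target_dirty < max_dirty <= cache_size.
    max_dirty == 0 is the special writethrough mode."""
    if max_dirty == 0:
        return True  # writethrough: no dirty data is held back
    return target_dirty < max_dirty <= cache_size

# the [client] values from the section above
print(rbd_cache_settings_sane(134217728, 100663296, 67108864))  # True
print(rbd_cache_settings_sane(134217728, 50, 100))              # False: target >= max
```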

Reference:

https://my.oschina.net/linuxhunter/blog/541997
