Redis Core Internals: Replication from Master to Slave

Posted by 五柳-先生 on 2016-05-25

Introduction

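Replication is the copy mechanism built into Redis: it keeps a slave's data synchronized with its master.
After a slave connects, the master forks a background process to create an RDB file. When the file is complete, the master sends it to the slave, and the slave loads it to obtain a copy of the master's data set.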

Slave Side: Structure Definitions

Before looking at the replication internals, here are the replication options in redis.conf.

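#slaveof [masterip] [masterport]  IP and port of the master
#masterauth [master-password]     password to send if the master requires AUTH
#slave-serve-stale-data yes       whether the slave keeps serving requests after losing its link to the master
#slave-read-only yes              whether the slave is read-only
#repl-ping-slave-period 10        interval in seconds at which the master pings its slaves to check that the link is alive
#repl-timeout 60                  timeout in seconds for the replication link
#slave-priority 100               slave priority used by Redis Sentinel: when the master is down, the slave with the lowest value is preferred for promotion, and 0 means never promote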

Slave-related members of the server structure:

/* Slave specific fields */
    char *masterauth;               /* AUTH with this password with master */
    char *masterhost;               /* Hostname of master */
    int masterport;                 /* Port of master */
    int repl_ping_slave_period;     /* Master pings the slave every N seconds */
    int repl_timeout;               /* Timeout after N seconds of master idle */
    redisClient *master;     /* Client that is master for this slave */
    int repl_syncio_timeout; /* Timeout for synchronous I/O calls */
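    int repl_state;          /* One of the slave-side REDIS_REPL_* state macros listed below */
    off_t repl_transfer_size; /* Size of the RDB file the master is sending */
    off_t repl_transfer_read; /* Bytes of the RDB file read from the master so far */
    off_t repl_transfer_last_fsync_off; /* Offset up to which the received RDB data has been fsynced to disk */
    int repl_transfer_s;     /* Socket on which the slave receives the RDB file */
    int repl_transfer_fd;    /* fd of the on-disk file the received RDB data is written to */
    char *repl_transfer_tmpfile; /* Name of the temp file holding the received RDB data */
    time_t repl_transfer_lastio; /* Unix time of the last RDB data received from the master */
    int repl_serve_stale_data; /* Keep serving requests after the link to the master is lost? */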
    int repl_slave_ro;          /* Slave is read only? */
    time_t repl_down_since; /* Unix time at which link with master went down */
    int slave_priority;             /* Reported in INFO and used by Sentinel. */

Replication state macros: slave-side replication state

#define REDIS_REPL_NONE 0 /* No replication active */
#define REDIS_REPL_CONNECT 1 /* slaveof received, but not yet connected to the master */
#define REDIS_REPL_CONNECTING 2 /* Connecting; PING is sent once the socket is writable */
#define REDIS_REPL_RECEIVE_PONG 3 /* PING sent, waiting for the reply */
#define REDIS_REPL_TRANSFER 4 /* SYNC sent, RDB file not fully received yet */
#define REDIS_REPL_CONNECTED 5 /* Successfully connected to the master */

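The state macros show the steps a slave goes through when it connects to a master:
1. Receive the instruction to replicate.
2. Open a socket to the master and prepare to send it a PING command.
3. Send PING to the master and wait for its reply.
4. Wait for the master to send the RDB file; once the file has been received, replication is established.
The PING exchange is an extra application-layer check that the connection really works.
Redis uses the replication state flag to drive these stages asynchronously.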
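
The steps above can be sketched as a small state machine. This is only an illustration: the state values are copied from the macros, but the event names and the repl_next() function are hypothetical and do not exist in Redis, and sending every error back to the CONNECT state is a simplification of the error paths.

```c
/* State values copied from the slave-side macros above. */
enum repl_state {
    REPL_NONE = 0, REPL_CONNECT = 1, REPL_CONNECTING = 2,
    REPL_RECEIVE_PONG = 3, REPL_TRANSFER = 4, REPL_CONNECTED = 5
};

/* Hypothetical event names, invented for this sketch only. */
enum repl_event {
    EV_SLAVEOF,        /* slaveof command or config option seen */
    EV_SOCKET_OPENED,  /* non-blocking connect started */
    EV_PING_SENT,      /* socket became writable, PING written */
    EV_SYNC_SENT,      /* reply to PING checked, SYNC written */
    EV_RDB_LOADED,     /* RDB file fully received and loaded */
    EV_ERROR           /* timeout or I/O error */
};

/* Returns the state that follows `s` when event `e` happens. */
enum repl_state repl_next(enum repl_state s, enum repl_event e) {
    if (e == EV_SLAVEOF) return REPL_CONNECT;
    if (e == EV_ERROR) return (s == REPL_NONE) ? REPL_NONE : REPL_CONNECT;
    switch (s) {
    case REPL_CONNECT:      return e == EV_SOCKET_OPENED ? REPL_CONNECTING : s;
    case REPL_CONNECTING:   return e == EV_PING_SENT ? REPL_RECEIVE_PONG : s;
    case REPL_RECEIVE_PONG: return e == EV_SYNC_SENT ? REPL_TRANSFER : s;
    case REPL_TRANSFER:     return e == EV_RDB_LOADED ? REPL_CONNECTED : s;
    default:                return s;
    }
}
```

Each callback only has to look at the current state to know which stage it is in, which is what lets the stages run from separate event-loop callbacks.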

Slave Side: Initiating the Sync Request

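First, when a Redis instance that is about to become a slave receives the slaveof command, or starts with the slaveof option configured, it runs slaveofCommand() (replication.c). If the command is slaveof no one, replication is cancelled.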

void slaveofCommand(redisClient *c) {
    if (!strcasecmp(c->argv[1]->ptr,"no") &&
        !strcasecmp(c->argv[2]->ptr,"one")) {
        if (server.masterhost) {
            sdsfree(server.masterhost);
            server.masterhost = NULL;
            if (server.master) freeClient(server.master);
            if (server.repl_state == REDIS_REPL_TRANSFER)
                replicationAbortSyncTransfer();
            else if (server.repl_state == REDIS_REPL_CONNECTING ||
                     server.repl_state == REDIS_REPL_RECEIVE_PONG)
                undoConnectWithMaster();
            server.repl_state = REDIS_REPL_NONE;
            redisLog(REDIS_NOTICE,"MASTER MODE enabled (user request)");
        }

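Otherwise, once the master's IP and port are known, the function checks whether this instance is already attached to that master. If it is not, any existing replication link is dropped and the replication members of server are initialized.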

        long port;

        if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
            return;

        /* Check if we are already attached to the specified master */
        if (server.masterhost && !strcasecmp(server.masterhost,c->argv[1]->ptr)
            && server.masterport == port) {
            redisLog(REDIS_NOTICE,"SLAVE OF would result into synchronization with the master we are already connected with. No operation performed.");
            addReplySds(c,sdsnew("+OK Already connected to specified master\r\n"));
            return;
        }
        sdsfree(server.masterhost);
        server.masterhost = sdsdup(c->argv[1]->ptr);
        server.masterport = port;
        if (server.master) freeClient(server.master);
        disconnectSlaves(); /* Force our slaves to resync with us as well. */
        if (server.repl_state == REDIS_REPL_TRANSFER)
            replicationAbortSyncTransfer();
        server.repl_state = REDIS_REPL_CONNECT;

This completes the slaveof initialization. From here on, server.repl_state tracks how far replication has progressed.

serverCron, the core periodic callback in Redis, calls replicationCron() (replication.c) once per second:

    /* Replication cron function -- used to reconnect to master and
     * to detect transfer failures. */
    run_with_period(1000) replicationCron();

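replicationCron() inspects server.repl_state to detect a timeout while connecting to the master, a timeout while transferring the RDB file, and an idle timeout after the link has been established. If server.repl_state is REDIS_REPL_CONNECT, the state set by slaveofCommand(), it starts the connection by calling connectWithMaster().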

    /* Check if we should connect to a MASTER */
    if (server.repl_state == REDIS_REPL_CONNECT) {
        redisLog(REDIS_NOTICE,"Connecting to MASTER...");
        if (connectWithMaster() == REDIS_OK) {
            redisLog(REDIS_NOTICE,"MASTER <-> SLAVE sync started");
        }
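
The timeout checks described above come down to comparing the time of the last I/O with the configured timeout. A minimal sketch, assuming the check is "more than timeout seconds since the last I/O"; repl_io_timed_out() is a hypothetical helper, not a Redis function, and it only covers the handshake and transfer states.

```c
#include <time.h>

/* State values copied from the slave-side macros. */
#define REPL_CONNECTING   2
#define REPL_RECEIVE_PONG 3
#define REPL_TRANSFER     4

/* Returns 1 when the handshake or the RDB transfer has been silent for
 * more than `timeout` seconds, 0 otherwise. */
int repl_io_timed_out(int state, time_t now, time_t lastio, int timeout) {
    if (state != REPL_CONNECTING && state != REPL_RECEIVE_PONG &&
        state != REPL_TRANSFER) return 0;
    return (now - lastio) > timeout;
}
```

In the article's terms, lastio corresponds to server.repl_transfer_lastio and timeout to server.repl_timeout.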

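connectWithMaster() starts a non-blocking connection to the master and registers a file event so that syncWithMaster() (replication.c) is called when the socket becomes readable or writable. It then sets server.repl_state to REDIS_REPL_CONNECTING, meaning that the socket connection is being set up and the PING command is about to be sent to the master.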

int connectWithMaster(void) {
    int fd;

    fd = anetTcpNonBlockConnect(NULL,server.masterhost,server.masterport);
    if (fd == -1) {
        redisLog(REDIS_WARNING,"Unable to connect to MASTER: %s",
            strerror(errno));
        return REDIS_ERR;
    }

    if (aeCreateFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE,syncWithMaster,NULL) ==
            AE_ERR)
    {
        close(fd);
        redisLog(REDIS_WARNING,"Can't create readable event for SYNC");
        return REDIS_ERR;
    }

    server.repl_transfer_lastio = server.unixtime;
    server.repl_transfer_s = fd;
    server.repl_state = REDIS_REPL_CONNECTING;
    return REDIS_OK;
}

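syncWithMaster() is the core function with which the slave establishes the slave-master relationship once replication has started. After connectWithMaster(), the slave sends a PING command to the master and checks whether it answers, which confirms that the peer on the connected socket really is a Redis instance. Here Redis departs from its otherwise asynchronous reads and writes and uses syncWrite(): it writes the data to fd and, if the write would block, Redis blocks too until everything has been written to the send buffer or the timeout (100 ms in this call) expires.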

if (server.repl_state == REDIS_REPL_CONNECTING) {
        redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");
        /* Delete the writable event so that the readable event remains
         * registered and we can wait for the PONG reply. */
        aeDeleteFileEvent(server.el,fd,AE_WRITABLE);
        server.repl_state = REDIS_REPL_RECEIVE_PONG;
        /* Send the PING, don't check for errors at all, we have the timeout
         * that will take care about this. */
        syncWrite(fd,"PING\r\n",6,100);
        return;
    }
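
To illustrate what a blocking write with a timeout looks like, here is a minimal sketch built on poll(). It is not the Redis implementation of syncWrite(): sketch_sync_write() is a hypothetical name, and as a simplification the timeout applies to each wait for writability rather than to the whole call.

```c
#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly `size` bytes to `fd`. Each wait for the fd to become
 * writable lasts at most `timeout_ms` milliseconds.
 * Returns `size` on success, -1 on error or timeout (errno = ETIMEDOUT). */
ssize_t sketch_sync_write(int fd, const char *ptr, ssize_t size,
                          int timeout_ms) {
    ssize_t total = size;

    while (size > 0) {
        struct pollfd pfd;
        int ready;
        ssize_t n;

        pfd.fd = fd;
        pfd.events = POLLOUT;
        pfd.revents = 0;
        ready = poll(&pfd, 1, timeout_ms);
        if (ready == -1) {
            if (errno == EINTR) continue;   /* interrupted, wait again */
            return -1;
        }
        if (ready == 0) {                   /* not writable in time */
            errno = ETIMEDOUT;
            return -1;
        }
        n = write(fd, ptr, (size_t)size);
        if (n == -1) {
            if (errno == EAGAIN || errno == EINTR) continue;
            return -1;
        }
        ptr += n;                           /* handle short writes */
        size -= n;
    }
    return total;
}
```

A call such as sketch_sync_write(fd, "PING\r\n", 6, 100) mirrors the PING write above: the caller does not continue until the six bytes are in the socket's send buffer or the wait times out.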

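After the write, the function returns. When the master's reply to PING arrives, the slave calls syncWithMaster() again and takes the following branch. Because the reply is already available, the readable file event on fd is no longer needed: it is removed, and the reply is read synchronously and checked.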

    if (server.repl_state == REDIS_REPL_RECEIVE_PONG) {
        char buf[1024];

        aeDeleteFileEvent(server.el,fd,AE_READABLE);

        /* Read the reply with explicit timeout. */
        buf[0] = '\0';
        if (syncReadLine(fd,buf,sizeof(buf),
            server.repl_syncio_timeout*1000) == -1)
        {
            redisLog(REDIS_WARNING,
                "I/O error reading PING reply from master: %s",
                strerror(errno));
            goto error;
        }
        /* A '-' error reply is accepted as well: the master may answer that AUTH is required */
        if (buf[0] != '-' && buf[0] != '+') {
            redisLog(REDIS_WARNING,"Unexpected reply to PING from master.");
            goto error;
        } else {
            redisLog(REDIS_NOTICE,
                "Master replied to PING, replication can continue...");
        }
    }
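
The check on the reply only looks at the first byte. A minimal sketch of that test, where ping_reply_looks_like_redis() is a hypothetical helper name and the sample reply strings used with it are illustrative:

```c
/* Returns 1 when the first byte of the reply line suggests that a Redis
 * server answered: '+' for a status reply such as +PONG, '-' for an
 * error reply such as one asking for authentication. */
int ping_reply_looks_like_redis(const char *line) {
    return line[0] == '+' || line[0] == '-';
}
```

A reply from some other kind of server, or an empty line, fails the test, which is why the PING step can tell a Redis instance from an arbitrary listening socket.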

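If the master requires authentication, the slave now sends the AUTH command. sendSynchronousCommand() (replication.c) is a synchronous send: it makes sure that everything written to fd has gone out and then blocks until the master replies. When it returns, the master has therefore already accepted or rejected the authentication.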

/* AUTH with the master if required. */
    if(server.masterauth) {
        err = sendSynchronousCommand(fd,"AUTH",server.masterauth,NULL);
        if (err) {
            redisLog(REDIS_WARNING,"Unable to AUTH to MASTER: %s",err);
            sdsfree(err);
            goto error;
        }
    }

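After completing the application-layer handshake with the master, the slave sends the SYNC command with a blocking write to request synchronization.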

    if (syncWrite(fd,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
        redisLog(REDIS_WARNING,"I/O error writing to MASTER: %s",
            strerror(errno));
        goto error;
    }

The slave also creates a temporary file to hold the RDB data it receives. maxtries limits how many times it tries to create that file.

/* Prepare a suitable temp file for bulk transfer */
    while(maxtries--) {
        snprintf(tmpfile,256,
            "temp-%d.%ld.rdb",(int)server.unixtime,(long int)getpid());
        dfd = open(tmpfile,O_CREAT|O_WRONLY|O_EXCL,0644);
        if (dfd != -1) break;
        sleep(1);
    }
    if (dfd == -1) {
        redisLog(REDIS_WARNING,"Opening the temp file needed for MASTER <-> SLAVE synchronization: %s",strerror(errno));
        goto error;
    }
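
The retry loop can be tried in isolation. The sketch below follows the same pattern (a name built from the current time and the pid, O_EXCL so an existing file is never reused, one second of sleep between attempts); open_sync_tmpfile() and its dir parameter are hypothetical additions for this example.

```c
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Try up to `maxtries` times to create a fresh temp file inside `dir`.
 * Returns the fd, or -1 if every attempt failed. `name` receives the
 * path of the last attempt. */
int open_sync_tmpfile(const char *dir, char *name, size_t namelen,
                      int maxtries) {
    int fd = -1;

    while (maxtries-- > 0) {
        snprintf(name, namelen, "%s/temp-%d.%ld.rdb",
                 dir, (int)time(NULL), (long)getpid());
        /* O_EXCL makes open() fail if the file already exists. */
        fd = open(name, O_CREAT | O_WRONLY | O_EXCL, 0644);
        if (fd != -1) break;
        sleep(1); /* the timestamp in the name changes after a second */
    }
    return fd;
}
```

Because the name contains a second-resolution timestamp, sleeping one second before retrying gives the next attempt a different file name.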

Slave Side: Receiving the Data

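At this point the slave is ready to receive the RDB file from the master. It registers a file event whose callback is readSyncBulkPayload() (replication.c), sets server.repl_state to REDIS_REPL_TRANSFER, and initializes the variables that track the RDB transfer. With that, syncWithMaster() has done its job.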

    if (aeCreateFileEvent(server.el,fd, AE_READABLE,readSyncBulkPayload,NULL)
            == AE_ERR)
    {
        redisLog(REDIS_WARNING,"Can't create readable event for SYNC");
        goto error;
    }
    server.repl_state = REDIS_REPL_TRANSFER;
    server.repl_transfer_size = -1;
    server.repl_transfer_read = 0;
    server.repl_transfer_last_fsync_off = 0;
    server.repl_transfer_fd = dfd;
    server.repl_transfer_lastio = server.unixtime;
    server.repl_transfer_tmpfile = zstrdup(tmpfile);
    return;

error:
    close(fd);
    server.repl_transfer_s = -1;
    server.repl_state = REDIS_REPL_CONNECT;
    return;
}

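When the ae event library in Redis detects that the master's fd is readable, the master has finished creating the RDB file and has started sending it to the slave. The event library calls readSyncBulkPayload(), which receives the RDB file asynchronously. On the first call it has to obtain the total size of the RDB file. The master prefixes the file with "$<rdb size>\r\n", so syncReadLine(), which reads a single line, should pick up exactly this header from the socket buffer.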

if (server.repl_transfer_size == -1) {
        if (syncReadLine(fd,buf,1024,server.repl_syncio_timeout*1000) == -1) {
            redisLog(REDIS_WARNING,
                "I/O error reading bulk count from MASTER: %s",
                strerror(errno));
            goto error;
        }

        if (buf[0] == '-') {
            redisLog(REDIS_WARNING,
                "MASTER aborted replication with an error: %s",
                buf+1);
            goto error;
        } else if (buf[0] == '\0') {
            /* At this stage just a newline works as a PING in order to take
             * the connection live. So we refresh our last interaction
             * timestamp. */
            server.repl_transfer_lastio = server.unixtime;
            return;
        } else if (buf[0] != '$') {
            redisLog(REDIS_WARNING,"Bad protocol from MASTER, the first byte is not '$', are you sure the host and port are right?");
            goto error;
        }
        server.repl_transfer_size = strtol(buf+1,NULL,10);
        redisLog(REDIS_NOTICE,
            "MASTER <-> SLAVE sync: receiving %ld bytes from master",
            server.repl_transfer_size);
        return;
    }
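
The header handling above has four outcomes, which the following sketch makes explicit. parse_bulk_header() and the enum names are hypothetical; the branches follow the excerpt, and the input is assumed to be one reply line with the trailing "\r\n" already removed.

```c
#include <stdlib.h>

enum bulk_hdr {
    BULK_OK,            /* "$<size>" header parsed */
    BULK_KEEPALIVE,     /* empty line: the master is only keeping the link alive */
    BULK_MASTER_ERROR,  /* "-..." error reply from the master */
    BULK_BAD_PROTOCOL   /* anything else */
};

/* Classifies one reply line and, for BULK_OK, stores the announced
 * RDB size in *size. */
enum bulk_hdr parse_bulk_header(const char *line, long *size) {
    if (line[0] == '-') return BULK_MASTER_ERROR;
    if (line[0] == '\0') return BULK_KEEPALIVE;
    if (line[0] != '$') return BULK_BAD_PROTOCOL;
    *size = strtol(line + 1, NULL, 10);
    return BULK_OK;
}
```

For example, the line "$1048576" yields BULK_OK with a size of 1048576 bytes, while an empty line only refreshes the last-interaction timestamp in the original code.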

Once the size of the RDB file is known, the RDB content is read each time fd becomes readable again.

    left = server.repl_transfer_size - server.repl_transfer_read;
    readlen = (left < (signed)sizeof(buf)) ? left : (signed)sizeof(buf);
    nread = read(fd,buf,readlen);
    if (nread <= 0) {
        redisLog(REDIS_WARNING,"I/O error trying to sync with MASTER: %s",
            (nread == -1) ? strerror(errno) : "connection lost");
        replicationAbortSyncTransfer();
        return;
    }
    server.repl_transfer_lastio = server.unixtime;
    if (write(server.repl_transfer_fd,buf,nread) != nread) {
        redisLog(REDIS_WARNING,"Write error or short write writing to the DB dump file needed for MASTER <-> SLAVE synchronization: %s", strerror(errno));
        goto error;
    }
    server.repl_transfer_read += nread;

    /* Sync the data to disk from time to time, so that one large flush at the very end does not cause a long delay */
    if (server.repl_transfer_read >=
        server.repl_transfer_last_fsync_off + REPL_MAX_WRITTEN_BEFORE_FSYNC)
    {
        off_t sync_size = server.repl_transfer_read -
                          server.repl_transfer_last_fsync_off;
        rdb_fsync_range(server.repl_transfer_fd,
            server.repl_transfer_last_fsync_off, sync_size);
        server.repl_transfer_last_fsync_off += sync_size;
    }
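
The bookkeeping behind the periodic sync can be followed with a few numbers. In this sketch the 8 MB threshold is an assumed value for REPL_MAX_WRITTEN_BEFORE_FSYNC, and struct transfer and account_and_sync_size() are hypothetical names that stand in for the server.repl_transfer_* fields.

```c
#include <sys/types.h>

/* Assumed threshold for this sketch: 8 MB. */
#define SKETCH_MAX_WRITTEN_BEFORE_FSYNC ((off_t)8 * 1024 * 1024)

struct transfer {
    off_t read;            /* bytes received so far */
    off_t last_fsync_off;  /* bytes already synced to disk */
};

/* Accounts for `nread` newly written bytes. Returns the number of bytes
 * that should be synced now (the range starts at the old last_fsync_off),
 * or 0 when the threshold has not been reached yet. */
off_t account_and_sync_size(struct transfer *t, off_t nread) {
    t->read += nread;
    if (t->read >= t->last_fsync_off + SKETCH_MAX_WRITTEN_BEFORE_FSYNC) {
        off_t sync_size = t->read - t->last_fsync_off;
        t->last_fsync_off += sync_size;
        return sync_size;
    }
    return 0;
}
```

With this threshold, two reads of 4 MB each trigger one sync of 8 MB after the second read, and the next small read triggers nothing.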

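This repeats until the whole RDB file has been received. The temporary file is then renamed, emptyDb() clears the database, the readable file event on the master's socket is deleted, and the database is rebuilt by loading the RDB file.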

if (server.repl_transfer_read == server.repl_transfer_size) {
        if (rename(server.repl_transfer_tmpfile,server.rdb_filename) == -1) {
            redisLog(REDIS_WARNING,"Failed trying to rename the temp DB into dump.rdb in MASTER <-> SLAVE synchronization: %s", strerror(errno));
            replicationAbortSyncTransfer();
            return;
        }
        redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Loading DB in memory");
        emptyDb();
        aeDeleteFileEvent(server.el,server.repl_transfer_s,AE_READABLE);
        if (rdbLoad(server.rdb_filename) != REDIS_OK) {
            redisLog(REDIS_WARNING,"Failed trying to load the MASTER synchronization DB from disk");
            replicationAbortSyncTransfer();
            return;
        }

Finally, the related members of server are initialized and AOF is restarted.

        /* Final setup of the connected slave <- master link */
        zfree(server.repl_transfer_tmpfile);
        close(server.repl_transfer_fd);
        server.master = createClient(server.repl_transfer_s);
        server.master->flags |= REDIS_MASTER;
        server.master->authenticated = 1;
        server.repl_state = REDIS_REPL_CONNECTED;
        redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Finished with success");
        
        if (server.aof_state != REDIS_AOF_OFF) {
            int retry = 10;

            stopAppendOnly();
            while (retry-- && startAppendOnly() == REDIS_ERR) {
                redisLog(REDIS_WARNING,"Failed enabling the AOF after successful master synchrnization! Trying it again in one second.");
                sleep(1);
            }
            if (!retry) {
                redisLog(REDIS_WARNING,"FATAL: this slave instance finished the synchronization with its master, but the AOF can't be turned on. Exiting now.");
                exit(1);
            }
        }
    }

That completes the slave side of replication. Now let us look at what the master does during replication.

Master Side: Accepting the Sync Request

Replication-related members of redisClient:

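    int replstate;          /* Replication state */
    int repldbfd;           /* fd of the RDB file being sent to this slave */
    long repldboff;         /* Offset into the RDB file sent to this slave so far */
    off_t repldbsize;       /* Size of the RDB file being sent to this slave */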
    int slave_listening_port; /* As configured with: SLAVECONF listening-port */

The master-side replication state of a slave, stored in redisClient.replstate:

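#define REDIS_REPL_WAIT_BGSAVE_START 3 /* Waiting for the master to start a BGSAVE */
#define REDIS_REPL_WAIT_BGSAVE_END 4 /* Waiting for the BGSAVE to finish so that the RDB transfer can start */
#define REDIS_REPL_SEND_BULK 5 /* Master is sending the RDB file */
#define REDIS_REPL_ONLINE 6 /* RDB sent; the slave now receives updates continuously */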

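From these states we can see how the master replicates to a slave: it receives the SYNC command sent by the slave and starts a BGSAVE that creates the RDB file in the background -> when the BGSAVE finishes, it starts sending the RDB file to the slave -> when the RDB file has been sent, it keeps streaming updates to the slave.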

Once the slave has passed the application-layer PING check, it sends the SYNC command to start replication. On receiving SYNC, the master calls syncCommand() (replication.c).

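First, Redis checks whether a BGSAVE is already running. If one is, and another slave is already waiting for it to finish, the RDB file produced by that BGSAVE can be sent to the current slave as well. The waiting slave's output buffer, which holds the update commands Redis has executed since that RDB file was started, is copied to the current slave, and replstate is set to REDIS_REPL_WAIT_BGSAVE_END. If a BGSAVE is running but no slave is waiting on it, the current slave has to wait for the next BGSAVE, and replstate is set to REDIS_REPL_WAIT_BGSAVE_START.
If no BGSAVE is running, rdbSaveBackground() (rdb.c) is called to start one, and replstate is set to REDIS_REPL_WAIT_BGSAVE_END.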

    if (server.rdb_child_pid != -1) {
        redisClient *slave;
        listNode *ln;
        listIter li;

        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            slave = ln->value;
            if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_END) break;
        }
        if (ln) {
            /* Perfect, the server is already registering differences for
             * another slave. Set the right state, and copy the buffer. */
            copyClientOutputBuffer(c,slave);
            c->replstate = REDIS_REPL_WAIT_BGSAVE_END;
            redisLog(REDIS_NOTICE,"Waiting for end of BGSAVE for SYNC");
        } else {
            /* No way, we need to wait for the next BGSAVE in order to
             * register differences */
            c->replstate = REDIS_REPL_WAIT_BGSAVE_START;
            redisLog(REDIS_NOTICE,"Waiting for next BGSAVE for SYNC");
        }
    } else {
        /* Ok we don't have a BGSAVE in progress, let's start one */
        redisLog(REDIS_NOTICE,"Starting BGSAVE for SYNC");
        if (rdbSaveBackground(server.rdb_filename) != REDIS_OK) {
            redisLog(REDIS_NOTICE,"Replication failed, can't BGSAVE");
            addReplyError(c,"Unable to perform background save");
            return;
        }
        c->replstate = REDIS_REPL_WAIT_BGSAVE_END;
    }
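
The three branches can be condensed into one decision function. This is a sketch of the logic only: initial_replstate() and its parameters are hypothetical, and the state values are copied from the master-side macros.

```c
/* State values copied from the master-side macros. */
enum { WAIT_BGSAVE_START = 3, WAIT_BGSAVE_END = 4 };

/* Returns the initial replstate for a slave that just sent SYNC, or -1
 * when a BGSAVE had to be started and could not be.
 * *copy_buffer is set to 1 when the output buffer of the slave that is
 * already waiting should be copied to the new slave. */
int initial_replstate(int bgsave_running, int peer_waiting_end,
                      int bgsave_started_ok, int *copy_buffer) {
    *copy_buffer = 0;
    if (bgsave_running) {
        if (peer_waiting_end) {
            *copy_buffer = 1;           /* share the running BGSAVE */
            return WAIT_BGSAVE_END;
        }
        return WAIT_BGSAVE_START;       /* wait for the next BGSAVE */
    }
    return bgsave_started_ok ? WAIT_BGSAVE_END : -1;
}
```

A running BGSAVE can only be shared when some slave has been collecting the stream of changes since it began; a BGSAVE started for another reason has no such buffer to copy, so the new slave must wait for a fresh one.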

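server.dirty counts the update operations that have not yet been written to disk. A BGSAVE writes every update made before it started to disk, so the current value is stored in server.dirty_before_bgsave and server.dirty can be corrected when the save finishes. After fork(), the child process calls rdbSave(); the parent records the child's state and returns.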

int rdbSaveBackground(char *filename) {
    pid_t childpid;
    long long start;

    if (server.rdb_child_pid != -1) return REDIS_ERR;

    server.dirty_before_bgsave = server.dirty;

    start = ustime();
    if ((childpid = fork()) == 0) {
        int retval;

        /* Child */
        if (server.ipfd > 0) close(server.ipfd);
        if (server.sofd > 0) close(server.sofd);
        retval = rdbSave(filename);
        exitFromChild((retval == REDIS_OK) ? 0 : 1);
    } else {
        /* Parent */
        server.stat_fork_time = ustime()-start;
        if (childpid == -1) {
            redisLog(REDIS_WARNING,"Can't save in background: fork: %s",
                strerror(errno));
            return REDIS_ERR;
        }
        redisLog(REDIS_NOTICE,"Background saving started by pid %d",childpid);
        server.rdb_save_time_start = time(NULL);
        server.rdb_child_pid = childpid;
        updateDictResizePolicy();
        return REDIS_OK;
    }
    return REDIS_OK; /* unreached */
}

Back in syncCommand(), Redis finishes registering the slave.

    c->repldbfd = -1;
    c->flags |= REDIS_SLAVE;
    c->slaveseldb = 0;
    listAddNodeTail(server.slaves,c);
    return;
}

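serverCron likewise checks, whenever a BGSAVE child process exists, whether that child has finished. If it has, serverCron collects the exit status and the reason for termination and then calls backgroundSaveDoneHandler() (rdb.c).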

    if (server.rdb_child_pid != -1 || server.aof_child_pid != -1) {
        int statloc;
        pid_t pid;

        if ((pid = wait3(&statloc,WNOHANG,NULL)) != 0) {
            int exitcode = WEXITSTATUS(statloc);
            int bysignal = 0;
            
            if (WIFSIGNALED(statloc)) bysignal = WTERMSIG(statloc);

            if (pid == server.rdb_child_pid) {
                backgroundSaveDoneHandler(exitcode,bysignal);
            } else {
                backgroundRewriteDoneHandler(exitcode,bysignal);
            }
            updateDictResizePolicy();
        }
    } 

If the BGSAVE process ended normally and was not killed by a signal, server.dirty is updated. The handler then calls updateSlavesWaitingBgsave() (replication.c).

void backgroundSaveDoneHandler(int exitcode, int bysignal) {
    if (!bysignal && exitcode == 0) {
        redisLog(REDIS_NOTICE,
            "Background saving terminated with success");
        server.dirty = server.dirty - server.dirty_before_bgsave;
        server.lastsave = time(NULL);
        server.lastbgsave_status = REDIS_OK;
    } else if (!bysignal && exitcode != 0) {
        redisLog(REDIS_WARNING, "Background saving error");
        server.lastbgsave_status = REDIS_ERR;
    } else {
        redisLog(REDIS_WARNING,
            "Background saving terminated by signal %d", bysignal);
        rdbRemoveTempFile(server.rdb_child_pid);
        server.lastbgsave_status = REDIS_ERR;
    }
    server.rdb_child_pid = -1;
    server.rdb_save_time_last = time(NULL)-server.rdb_save_time_start;
    server.rdb_save_time_start = -1;
    /* Possibly there are slaves waiting for a BGSAVE in order to be served
     * (the first stage of SYNC is a bulk transfer of dump.rdb) */
    updateSlavesWaitingBgsave(exitcode == 0 ? REDIS_OK : REDIS_ERR);
}
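
A worked example of the dirty-counter arithmetic, using a hypothetical struct counters in place of the server fields: the value is saved when the BGSAVE starts and subtracted on success, so only writes made during the save remain counted.

```c
/* Hypothetical stand-in for server.dirty and server.dirty_before_bgsave. */
struct counters {
    long long dirty;               /* updates not yet saved to disk */
    long long dirty_before_bgsave; /* value of dirty when BGSAVE began */
};

/* Remember how many unsaved updates the snapshot will cover. */
void on_bgsave_start(struct counters *c) {
    c->dirty_before_bgsave = c->dirty;
}

/* Every write command increments the counter. */
void on_write(struct counters *c) {
    c->dirty++;
}

/* On success, the updates covered by the snapshot are no longer dirty. */
void on_bgsave_success(struct counters *c) {
    c->dirty -= c->dirty_before_bgsave;
}
```

If 100 updates are pending when the BGSAVE starts and 20 more arrive while it runs, dirty is 120 at the end of the save and 20 after the correction.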

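updateSlavesWaitingBgsave() checks the replication state of every slave. For a slave in REDIS_REPL_WAIT_BGSAVE_END, it registers a writable file event whose callback is sendBulkToSlave(), which sends the RDB file to the slave. A slave in REDIS_REPL_WAIT_BGSAVE_START joined later and still needs an RDB file to be created, so rdbSaveBackground() is called again to run another BGSAVE.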

void updateSlavesWaitingBgsave(int bgsaveerr) {
    listNode *ln;
    int startbgsave = 0;
    listIter li;

    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;

        if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START) {
            startbgsave = 1;
            slave->replstate = REDIS_REPL_WAIT_BGSAVE_END;
        } else if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_END) {
            struct redis_stat buf;

            if (bgsaveerr != REDIS_OK) {
                freeClient(slave);
                redisLog(REDIS_WARNING,"SYNC failed. BGSAVE child returned an error");
                continue;
            }
            if ((slave->repldbfd = open(server.rdb_filename,O_RDONLY)) == -1 ||
                redis_fstat(slave->repldbfd,&buf) == -1) {
                freeClient(slave);
                redisLog(REDIS_WARNING,"SYNC failed. Can't open/stat DB after BGSAVE: %s", strerror(errno));
                continue;
            }
            slave->repldboff = 0;
            slave->repldbsize = buf.st_size;
            slave->replstate = REDIS_REPL_SEND_BULK;
            aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
            if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE, sendBulkToSlave, slave) == AE_ERR) {
                freeClient(slave);
                continue;
            }
        }
    }

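When a slave's fd becomes writable, the event library calls sendBulkToSlave(). On the first call for an RDB file, slave->repldboff == 0, and the size of the RDB file has to be sent first.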

void sendBulkToSlave(aeEventLoop *el, int fd, void *privdata, int mask) {
    redisClient *slave = privdata;
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(mask);
    char buf[REDIS_IOBUF_LEN];
    ssize_t nwritten, buflen;

    if (slave->repldboff == 0) {
        sds bulkcount;

        bulkcount = sdscatprintf(sdsempty(),"$%lld\r\n",(unsigned long long)
            slave->repldbsize);
        if (write(fd,bulkcount,sdslen(bulkcount)) != (signed)sdslen(bulkcount))
        {
            sdsfree(bulkcount);
            freeClient(slave);
            return;
        }
        sdsfree(bulkcount);
    }

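Then the content of the RDB file is sent. Because the file goes out in pieces, sendBulkToSlave() is called many times; each call uses lseek to move to the position where the previous call stopped and sends the data that follows.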

    lseek(slave->repldbfd,slave->repldboff,SEEK_SET);
    buflen = read(slave->repldbfd,buf,REDIS_IOBUF_LEN);
    if (buflen <= 0) {
        redisLog(REDIS_WARNING,"Read error sending DB to slave: %s",
            (buflen == 0) ? "premature EOF" : strerror(errno));
        freeClient(slave);
        return;
    }
    if ((nwritten = write(fd,buf,buflen)) == -1) {
        redisLog(REDIS_VERBOSE,"Write error sending DB to slave: %s",
            strerror(errno));
        freeClient(slave);
        return;
    }
    slave->repldboff += nwritten;

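Sending repeats until the file is complete, at which point replication to the slave is established and slave->replstate = REDIS_REPL_ONLINE. A writable file event with the callback sendReplyToClient() (networking.c) is registered as well: when the master receives write commands and the slave's data has to be updated, sendReplyToClient() sends those commands to the slave.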

    if (slave->repldboff == slave->repldbsize) {
        close(slave->repldbfd);
        slave->repldbfd = -1;
        aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
        slave->replstate = REDIS_REPL_ONLINE;
        if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE,
            sendReplyToClient, slave) == AE_ERR) {
            freeClient(slave);
            return;
        }
        redisLog(REDIS_NOTICE,"Synchronization with slave succeeded");
    }

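The master has now completed the replication process as well. The updates the master executed between the start of the BGSAVE and the end of the transfer were written to the slave's output buffer, and they are now propagated to the slave.
From then on, whenever a command executed by the master performs a write, call() (redis.c) calls propagate() to propagate the command, and propagate() calls replicationFeedSlaves().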

    if (flags & REDIS_CALL_PROPAGATE) {
        int flags = REDIS_PROPAGATE_NONE;

        if (c->cmd->flags & REDIS_CMD_FORCE_REPLICATION)
            flags |= REDIS_PROPAGATE_REPL;
        if (dirty)
            flags |= (REDIS_PROPAGATE_REPL | REDIS_PROPAGATE_AOF);
        if (flags != REDIS_PROPAGATE_NONE)
            propagate(c->cmd,c->db->id,c->argv,c->argc,flags);
    }

void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    if (server.aof_state != REDIS_AOF_OFF && flags & REDIS_PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    if (flags & REDIS_PROPAGATE_REPL && listLength(server.slaves))
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}

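The command is written to the output buffer of every slave that is not in REDIS_REPL_WAIT_BGSAVE_START. slave->slaveseldb is the database the slave currently has selected; if it differs from the database the command was written to, the master also has to send a select command first.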

void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) {
    listNode *ln;
    listIter li;
    int j;

    listRewind(slaves,&li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;

        /* Don't feed slaves that are still waiting for BGSAVE to start */
        if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START) continue;

        if (slave->slaveseldb != dictid) {
            robj *selectcmd;

            if (dictid >= 0 && dictid < REDIS_SHARED_SELECT_CMDS) {
                selectcmd = shared.select[dictid];
                incrRefCount(selectcmd);
            } else {
                selectcmd = createObject(REDIS_STRING,
                    sdscatprintf(sdsempty(),"select %d\r\n",dictid));
            }
            addReply(slave,selectcmd);
            decrRefCount(selectcmd);
            slave->slaveseldb = dictid;
        }
        addReplyMultiBulkLen(slave,argc);
        for (j = 0; j < argc; j++) addReplyBulk(slave,argv[j]);
    }
}
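
To see what ends up in a slave's output buffer, the sketch below builds the same byte stream into a plain char buffer: an inline select line when the database changes, followed by the command in multi-bulk form. struct sketch_slave, encode_multibulk() and feed_slave() are hypothetical names, and arguments are plain C strings, so unlike Redis this sketch is not binary-safe.

```c
#include <stdio.h>
#include <string.h>

/* Writes argv[0..argc-1] to `out` in multi-bulk form:
 * "*<argc>\r\n" then "$<len>\r\n<arg>\r\n" per argument.
 * Returns the number of bytes written, or -1 if `cap` is too small. */
int encode_multibulk(char *out, size_t cap, int argc, const char **argv) {
    size_t used;
    int j, n = snprintf(out, cap, "*%d\r\n", argc);

    if (n < 0 || (size_t)n >= cap) return -1;
    used = (size_t)n;
    for (j = 0; j < argc; j++) {
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[j]), argv[j]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}

struct sketch_slave { int seldb; }; /* database the slave has selected */

/* Builds what would be appended to the slave's output buffer for one
 * command executed in database `dbid`. Returns the length or -1. */
int feed_slave(struct sketch_slave *s, int dbid, int argc,
               const char **argv, char *out, size_t cap) {
    size_t used = 0;
    int n;

    if (s->seldb != dbid) {
        /* Switch the slave to the right database first. */
        n = snprintf(out, cap, "select %d\r\n", dbid);
        if (n < 0 || (size_t)n >= cap) return -1;
        used = (size_t)n;
        s->seldb = dbid;
    }
    n = encode_multibulk(out + used, cap - used, argc, argv);
    if (n < 0) return -1;
    return (int)used + n;
}
```

Feeding SET k v for database 1 to a slave that currently has database 0 selected produces the select line followed by the encoded command; feeding the same command again produces only the command, because seldb already matches.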

The master iterates over all of its slaves in both replicationCron and propagate, so a single master should not keep too many slaves. With many slaves, a chained topology, slave->slave->master, is recommended.

Replication shows the asynchronous design of Redis very clearly: state flags keep the callback functions of the asynchronous flow from becoming tangled.

Reference: Redis source code analysis: replication (http://www.mysqlops.com/2011/09/01/redis-replication.html). The diagrams in that blog post help in understanding the replication process.

Reposted from: http://www.wzxue.com/redis%E6%A0%B8%E5%BF%83%E8%A7%A3%E8%AF%BB-%E4%BB%8Emaster%E5%88%B0slave%E7%9A%84replicantion/
