Building a High-Performance File Uploader in Go

Posted by banq on 2024-06-16

In this article, we walk through building a high-performance file uploader in Go. The uploader splits large files into smaller chunks, uploads those chunks in parallel, and synchronizes only the chunks that have changed. We also implement file watching so that updates are handled automatically.

Our file uploader is built from the following components:

  1. File chunking: split large files into smaller chunks.
  2. Parallel processing: read and upload chunks concurrently.
  3. Metadata management: track chunks with metadata to detect changes.
  4. File watching: automatically re-upload modified chunks.


Step-by-Step Process
1. File chunking

  • Goal: split a large file into smaller, manageable chunks.
  • Process:
  • The file is opened and read in fixed-size chunks (e.g. 1 MB).
  • Each chunk is saved as a separate file, named after the original file plus the chunk index (e.g. file.txt.chunk.0).
  • Metadata (an MD5 hash) is computed for each chunk to uniquely identify its contents.

2. Parallel processing

  • Goal: speed up chunking and uploading by processing several chunks at once.
  • Process:
  • A worker-pool pattern is used to process multiple chunks concurrently (a minimal sketch of the pattern follows this list).
  • Chunks are read and uploaded by multiple goroutines running in parallel.
  • Channels distribute work and collect results, while a mutex keeps updates to shared state thread-safe.
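
To make the pattern concrete before the full implementation, here is a minimal, self-contained worker-pool sketch; the job type and worker count are illustrative and not part of the uploader's API.

// workerpool_sketch.go — a minimal worker-pool sketch (illustrative only).
package main

import (
	"fmt"
	"sync"
)

func main() {
	jobs := make(chan int, 8) // buffered channel of work items
	var wg sync.WaitGroup

	// Start a fixed number of workers.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for job := range jobs { // each worker drains the channel
				fmt.Printf("worker %d processed job %d\n", id, job)
			}
		}(w)
	}

	// Send work, then close the channel so workers exit their loops.
	for i := 0; i < 8; i++ {
		jobs <- i
	}
	close(jobs)

	wg.Wait() // block until every worker has finished
}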

3. Metadata management

  • Goal: track and persist metadata for each chunk so changes can be detected and unchanged chunks are not re-uploaded.
  • Process:
  • Metadata for each chunk (file name, MD5 hash) is saved to a JSON file.
  • On re-chunking, the metadata is loaded and the new hashes are compared against the existing ones.
  • Only changed chunks are re-uploaded, cutting out unnecessary transfers.

Chunk metadata → compare hashes → upload changed chunks

4. File watching

  • Goal: automatically detect file changes and trigger re-upload of the modified chunks.
  • Process:
  • A file watcher monitors the source file for changes.
  • When a change is detected, it triggers the re-chunking process.
  • Only the modified chunks are identified and uploaded.

Step 1: Interfaces
To keep the code clean and modular, we define interfaces for chunking files, uploading chunks, and managing metadata. We also create a configuration struct for settings such as chunk size and server URL.

// struct_interface.go
package fileuploader

type ChunkMeta struct {
	FileName string `json:"file_name"`
	MD5Hash  string `json:"md5_hash"`
	Index    int    `json:"index"`
}

type Config struct {
	ChunkSize int
	ServerURL string
}

type DefaultFileChunker struct {
	chunkSize int
}

type DefaultUploader struct {
	serverURL string
}

type DefaultMetadataManager struct{}

type FileChunker interface {
	ChunkFile(filePath string) ([]ChunkMeta, error)
	ChunkLargeFile(filePath string) ([]ChunkMeta, error)
}

type Uploader interface {
	UploadChunk(chunk ChunkMeta) error
}

type MetadataManager interface {
	LoadMetadata(filePath string) (map[string]ChunkMeta, error)
	SaveMetadata(filePath string, metadata map[string]ChunkMeta) error
}
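
As a small, optional safeguard (not in the original article), compile-time assertions can verify that the default types actually satisfy these interfaces:

// Optional compile-time checks; they produce no runtime cost.
var (
	_ FileChunker     = (*DefaultFileChunker)(nil)
	_ Uploader        = (*DefaultUploader)(nil)
	_ MetadataManager = (*DefaultMetadataManager)(nil)
)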

Step 2: Implement DefaultFileChunker
DefaultFileChunker implements the FileChunker interface: ChunkFile reads the file sequentially, while ChunkLargeFile processes chunks in parallel. Both record metadata for every chunk.

// chunk_file.go
package fileuploader

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"sync"
)

// ChunkFile splits a file into smaller chunks and returns metadata for each chunk.
// It reads the file sequentially and chunks it based on the specified chunk size.
func (c *DefaultFileChunker) ChunkFile(filePath string) ([]ChunkMeta, error) {
	var chunks []ChunkMeta // metadata for each chunk

	// Open the file for reading.
	file, err := os.Open(filePath)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	buffer := make([]byte, c.chunkSize) // holds one chunk of data
	index := 0                          // current chunk index

	// Loop until EOF is reached.
	for {
		// Read up to chunkSize bytes from the file into the buffer.
		bytesRead, err := file.Read(buffer)
		if err != nil && err != io.EOF {
			return nil, err
		}
		if bytesRead == 0 {
			break // EOF reached
		}

		// Generate a unique hash for the chunk data.
		hash := md5.Sum(buffer[:bytesRead])
		hashString := hex.EncodeToString(hash[:])

		// Construct the chunk file name, e.g. file.txt.chunk.0.
		chunkFileName := fmt.Sprintf("%s.chunk.%d", filePath, index)

		// Create the chunk file and write the buffer data to it.
		chunkFile, err := os.Create(chunkFileName)
		if err != nil {
			return nil, err
		}
		if _, err = chunkFile.Write(buffer[:bytesRead]); err != nil {
			chunkFile.Close()
			return nil, err
		}
		chunkFile.Close()

		// Record metadata for the chunk.
		chunks = append(chunks, ChunkMeta{FileName: chunkFileName, MD5Hash: hashString, Index: index})

		index++ // move to the next chunk
	}

	return chunks, nil
}
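
A quick usage sketch (the file name and chunk size here are illustrative assumptions):

// Illustrative usage of ChunkFile; assumes "data.bin" exists on disk.
chunker := &DefaultFileChunker{chunkSize: 1024 * 1024} // 1 MB chunks
chunks, err := chunker.ChunkFile("data.bin")
if err != nil {
	log.Fatal(err)
}
for _, ch := range chunks {
	fmt.Printf("chunk %d: %s (md5 %s)\n", ch.Index, ch.FileName, ch.MD5Hash)
}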

// ChunkLargeFile splits a large file into smaller chunks in parallel and returns metadata for each chunk.
// It divides the file into chunks and processes them concurrently using multiple goroutines.
func (c *DefaultFileChunker) ChunkLargeFile(filePath string) ([]ChunkMeta, error) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	var chunks []ChunkMeta // metadata for each chunk

	// Open the file for reading.
	file, err := os.Open(filePath)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	// Get file information to determine the number of chunks.
	fileInfo, err := file.Stat()
	if err != nil {
		return nil, err
	}

	numChunks := int(fileInfo.Size() / int64(c.chunkSize))
	if fileInfo.Size()%int64(c.chunkSize) != 0 {
		numChunks++
	}

	// Channels for distributing work and collecting errors.
	errChan := make(chan error, numChunks)
	indexChan := make(chan int, numChunks)

	// Populate the index channel with chunk indices.
	for i := 0; i < numChunks; i++ {
		indexChan <- i
	}
	close(indexChan)

	// Start a fixed pool of workers to process chunks in parallel.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for index := range indexChan {
				// Calculate the offset for this chunk.
				offset := int64(index) * int64(c.chunkSize)
				buffer := make([]byte, c.chunkSize)

				// ReadAt is safe for concurrent use on a shared *os.File,
				// unlike Seek followed by Read, which would race between
				// goroutines.
				bytesRead, err := file.ReadAt(buffer, offset)
				if err != nil && err != io.EOF {
					errChan <- err
					return
				}
				if bytesRead == 0 {
					continue // nothing to do past EOF
				}

				// Generate a unique hash for the chunk data.
				hash := md5.Sum(buffer[:bytesRead])
				hashString := hex.EncodeToString(hash[:])

				// Construct the chunk file name and write the chunk out.
				chunkFileName := fmt.Sprintf("%s.chunk.%d", filePath, index)
				chunkFile, err := os.Create(chunkFileName)
				if err != nil {
					errChan <- err
					return
				}
				if _, err = chunkFile.Write(buffer[:bytesRead]); err != nil {
					chunkFile.Close()
					errChan <- err
					return
				}
				chunkFile.Close()

				// Record metadata under the mutex; the slice is shared.
				mu.Lock()
				chunks = append(chunks, ChunkMeta{FileName: chunkFileName, MD5Hash: hashString, Index: index})
				mu.Unlock()
			}
		}()
	}

	// Close the error channel once all workers have finished.
	go func() {
		wg.Wait()
		close(errChan)
	}()

	// Return the first error reported by any worker.
	for err := range errChan {
		if err != nil {
			return nil, err
		}
	}

	return chunks, nil
}
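
A note on usage: a caller might pick between the sequential and parallel chunkers based on file size. The helper below is an illustration only, and the 64 MB threshold is an arbitrary assumption, not a tuned value:

// Illustrative helper (not part of the original article): choose the
// chunking strategy based on file size. The 64 MB threshold is arbitrary.
func chunkAuto(c *DefaultFileChunker, filePath string) ([]ChunkMeta, error) {
	info, err := os.Stat(filePath)
	if err != nil {
		return nil, err
	}
	if info.Size() > 64*1024*1024 {
		return c.ChunkLargeFile(filePath)
	}
	return c.ChunkFile(filePath)
}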

Step 3: Implement DefaultUploader
DefaultUploader implements the Uploader interface and handles uploading chunks to the server.

// upload_chunk.go
package fileuploader

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func (u *DefaultUploader) UploadChunk(chunk ChunkMeta) error {
	// Read the chunk file from disk.
	data, err := os.ReadFile(chunk.FileName)
	if err != nil {
		return err
	}

	// POST the raw chunk bytes to the server.
	req, err := http.NewRequest("POST", u.serverURL, bytes.NewReader(data))
	if err != nil {
		return err
	}

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("failed to upload chunk: %s", resp.Status)
	}

	return nil
}
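
The article does not show the receiving end. Here is a minimal, hypothetical server sketch that accepts these POSTs, just to make the example testable end to end; the path, port, and output file name are assumptions, not part of the article's design:

// server_sketch.go — a minimal, hypothetical receiving endpoint.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	http.HandleFunc("/upload", func(w http.ResponseWriter, r *http.Request) {
		// Append each received chunk body to a local file. A real server
		// would authenticate the client and track chunk identity.
		f, err := os.OpenFile("received.bin", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer f.Close()
		if _, err := io.Copy(f, r.Body); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}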

Step 4: Implement DefaultMetadataManager
DefaultMetadataManager handles loading and saving chunk metadata.

// load_save_metadata.go
package fileuploader

import (
	"encoding/json"
	"os"
)

func (m *DefaultMetadataManager) LoadMetadata(filePath string) (map[string]ChunkMeta, error) {
	metadata := make(map[string]ChunkMeta)

	data, err := os.ReadFile(filePath)
	if err != nil {
		return metadata, err
	}

	if err = json.Unmarshal(data, &metadata); err != nil {
		return metadata, err
	}

	return metadata, nil
}

func (m *DefaultMetadataManager) SaveMetadata(filePath string, metadata map[string]ChunkMeta) error {
	data, err := json.MarshalIndent(metadata, "", "  ")
	if err != nil {
		return err
	}

	return os.WriteFile(filePath, data, 0644)
}
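
Usage is straightforward; the file name below mirrors the <file>.metadata.json convention used later in main.go, and is illustrative:

// Illustrative usage of the metadata manager.
mm := &DefaultMetadataManager{}
metadata, err := mm.LoadMetadata("data.bin.metadata.json")
if err != nil {
	// A missing file on the first run is expected; start with an empty map.
	metadata = make(map[string]ChunkMeta)
}
// ... chunk the file and synchronize, updating metadata ...
if err := mm.SaveMetadata("data.bin.metadata.json", metadata); err != nil {
	log.Fatal(err)
}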

Step 5: Implement Synchronization
We implement a synchronization function that uploads chunks in parallel, uploading only the chunks that have changed.

// synchronizer.go
package fileuploader

import "sync"

func synchronizeChunks(chunks []ChunkMeta, metadata map[string]ChunkMeta, uploader Uploader, wg *sync.WaitGroup, mu *sync.Mutex) error {
	// Channels for distributing chunks to workers and collecting errors.
	chunkChan := make(chan ChunkMeta, len(chunks))
	errChan := make(chan error, len(chunks))

	// Send every chunk to the chunk channel, then close it so the
	// workers' range loops terminate once the work is drained.
	for _, chunk := range chunks {
		chunkChan <- chunk
	}
	close(chunkChan)

	// Start a fixed pool of workers. Note: the WaitGroup counts workers,
	// not chunks; calling wg.Add once per chunk with a deferred wg.Done
	// inside the loop would deadlock if a worker returned early.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range chunkChan {
				newHash := chunk.MD5Hash // hash was computed during chunking

				// Look up the chunk in the metadata map under the mutex.
				mu.Lock()
				oldChunk, exists := metadata[chunk.FileName]
				mu.Unlock()

				// Upload only if the chunk is new or its hash has changed.
				if !exists || oldChunk.MD5Hash != newHash {
					if err := uploader.UploadChunk(chunk); err != nil {
						errChan <- err
						continue // keep processing the remaining chunks
					}

					// Record the new chunk information.
					mu.Lock()
					metadata[chunk.FileName] = chunk
					mu.Unlock()
				}
			}
		}()
	}

	wg.Wait()      // wait for all workers to finish processing chunks
	close(errChan) // no more errors can arrive

	// Return the first error encountered, if any.
	for err := range errChan {
		if err != nil {
			return err
		}
	}

	return nil
}
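
To see the change-detection behavior in isolation, a small sketch with a mock Uploader (hypothetical, not from the article) shows that a second synchronization against unchanged metadata uploads nothing:

// countingUploader is an illustrative mock that records upload calls.
type countingUploader struct{ calls int }

func (c *countingUploader) UploadChunk(chunk ChunkMeta) error {
	c.calls++
	return nil
}

// After chunking once:
//   first run:  every chunk is new     -> calls == len(chunks)
//   second run: hashes are unchanged   -> calls stays the same
// e.g.
//   up := &countingUploader{}
//   _ = synchronizeChunks(chunks, metadata, up, &wg, &mu)
//   _ = synchronizeChunks(chunks, metadata, up, &wg, &mu)
//   fmt.Println(up.calls) // equals len(chunks), not 2*len(chunks)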

Step 6: Implement File Watching
We use fsnotify to watch for file changes and trigger synchronization when the file is modified.

// watcher.go
package fileuploader

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

func watchFile(filePath string, changeChan chan bool) {
	// Create a new file watcher.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err) // terminate if the watcher cannot be created
	}
	defer watcher.Close() // close the watcher when the function exits

	// Add the specified file to the watcher's list of watched files.
	if err = watcher.Add(filePath); err != nil {
		log.Fatal(err) // terminate if the file cannot be watched
	}

	// Loop forever, handling events from the watcher.
	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return // events channel closed
			}
			// React to write operations on the watched file.
			if event.Op&fsnotify.Write == fsnotify.Write {
				log.Println("Modified file:", event.Name)
				changeChan <- true // signal that the file was modified
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return // errors channel closed
			}
			log.Println("Error:", err) // log watcher errors
		}
	}
}

This function uses the fsnotify package to monitor the given file for changes. It listens continuously for events such as modifications, and when a write event is detected it sends a signal on the changeChan channel to indicate that the file has changed. The function runs in an infinite loop, monitoring the file until it is explicitly stopped or an error occurs.
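
One practical caveat not covered above: many editors and writers emit several write events in quick succession. A simple debounce on the receiving side, sketched below with an assumed 500 ms window, keeps the uploader from re-chunking repeatedly for a single logical change:

// debounce is an illustrative wrapper around changeChan (requires "time").
// The window duration is an assumption, not a recommendation.
func debounce(in <-chan bool, window time.Duration) <-chan bool {
	out := make(chan bool)
	go func() {
		var timer *time.Timer
		for range in {
			if timer != nil {
				timer.Stop() // restart the quiet-period timer on each event
			}
			timer = time.AfterFunc(window, func() { out <- true })
		}
	}()
	return out
}

// Usage: changes := debounce(changeChan, 500*time.Millisecond)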

Step 7: The Main Function
We tie everything together in the main function, initializing the components and driving the workflow. (Note that main calls unexported helpers such as synchronizeChunks and watchFile directly, so for this example to compile as-is, all of the files above need to live in a single package, or those helpers need to be exported from the fileuploader package.)

// main.go
package main

import (
	"fmt"
	"log"
	"os"
	"sync"
	"time"

	"github.com/joho/godotenv"
)

const (
	defaultChunkSize = 1024 * 1024 // 1 MB chunks
	maxRetries       = 3           // reserved for retry logic (not implemented here)
)

func loadEnv() error {
	return godotenv.Load()
}

func main() {
	if err := loadEnv(); err != nil {
		log.Println("No .env file found, using default configuration")
	}

	chunkSize := defaultChunkSize
	if size, ok := os.LookupEnv("CHUNK_SIZE"); ok {
		fmt.Sscanf(size, "%d", &chunkSize)
	}

	serverURL, ok := os.LookupEnv("SERVER_URL")
	if !ok {
		log.Fatal("SERVER_URL environment variable is required")
	}

	if len(os.Args) < 2 {
		log.Fatal("Usage: go run main.go <file_path>")
	}
	filePath := os.Args[1]

	config := Config{ChunkSize: chunkSize, ServerURL: serverURL}

	chunker := &DefaultFileChunker{chunkSize: config.ChunkSize}
	uploader := &DefaultUploader{serverURL: config.ServerURL}
	metadataManager := &DefaultMetadataManager{}

	chunks, err := chunker.ChunkFile(filePath)
	if err != nil {
		log.Fatal(err)
	}

	metadata, err := metadataManager.LoadMetadata(fmt.Sprintf("%s.metadata.json", filePath))
	if err != nil {
		log.Println("Could not load metadata, starting fresh.")
		metadata = make(map[string]ChunkMeta)
	}

	var wg sync.WaitGroup
	var mu sync.Mutex

	// synchronizeChunks waits for its workers internally before returning.
	if err = synchronizeChunks(chunks, metadata, uploader, &wg, &mu); err != nil {
		log.Fatal(err)
	}

	if err = metadataManager.SaveMetadata(fmt.Sprintf("%s.metadata.json", filePath), metadata); err != nil {
		log.Fatal(err)
	}

	changeChan := make(chan bool)
	go watchFile(filePath, changeChan)

	for {
		select {
		case <-changeChan:
			log.Println("File changed, re-chunking and synchronizing...")
			chunks, err = chunker.ChunkFile(filePath)
			if err != nil {
				log.Fatal(err)
			}

			if err = synchronizeChunks(chunks, metadata, uploader, &wg, &mu); err != nil {
				log.Fatal(err)
			}

			if err = metadataManager.SaveMetadata(fmt.Sprintf("%s.metadata.json", filePath), metadata); err != nil {
				log.Fatal(err)
			}
		case <-time.After(10 * time.Second):
			log.Println("No changes detected, checking again...")
		}
	}
}
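
A minimal .env file for the configuration above might look like this (the values are illustrative; the URL matches the hypothetical server sketch from Step 3):

# .env (illustrative values)
CHUNK_SIZE=1048576
SERVER_URL=http://localhost:8080/upload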
// sample metadata.json (field names follow the json tags on ChunkMeta)
{
    "file1.txt.chunk.0": {
        "file_name": "file1.txt.chunk.0",
        "md5_hash": "e7d620b64e3151947828cd5ca2b1b628",
        "index": 0
    },
    "file1.txt.chunk.1": {
        "file_name": "file1.txt.chunk.1",
        "md5_hash": "2d7115b627b4b61b4e39604e7d3e1e84",
        "index": 1
    },
    "file1.txt.chunk.2": {
        "file_name": "file1.txt.chunk.2",
        "md5_hash": "eb24f90d7d6d3cf7b285b94e2af59c2a",
        "index": 2
    },
    "file1.txt.chunk.3": {
        "file_name": "file1.txt.chunk.3",
        "md5_hash": "4c9f5959bbfd2b67eacfb805c8b24635",
        "index": 3
    }
}


Reconstructing the File from Chunks

// reconstruct.go
package fileuploader

import (
	"io"
	"os"
	"sort"
)

// ReconstructFile reads each chunk file, concatenates their contents in
// index order, and writes the result to the output file.
func ReconstructFile(metadata map[string]ChunkMeta, outputFilePath string) error {
	// Create or truncate the output file.
	outputFile, err := os.Create(outputFilePath)
	if err != nil {
		return err
	}
	defer outputFile.Close()

	// Collect the chunks from the metadata and sort them by index,
	// since map iteration order is not deterministic.
	var chunks []ChunkMeta
	for _, chunk := range metadata {
		chunks = append(chunks, chunk)
	}
	sort.Slice(chunks, func(i, j int) bool {
		return chunks[i].Index < chunks[j].Index
	})

	// Concatenate the sorted chunks to reconstruct the file.
	for _, chunk := range chunks {
		chunkFile, err := os.Open(chunk.FileName)
		if err != nil {
			return err
		}

		// Copy the chunk's content into the output file.
		if _, err = io.Copy(outputFile, chunkFile); err != nil {
			chunkFile.Close()
			return err
		}
		chunkFile.Close() // close immediately rather than deferring in a loop
	}

	return nil
}
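
To verify the round trip (file names here are illustrative assumptions), load the metadata, rebuild the file, and compare checksums of the original and the reconstruction:

// Illustrative round-trip check.
mm := &DefaultMetadataManager{}
metadata, err := mm.LoadMetadata("data.bin.metadata.json")
if err != nil {
	log.Fatal(err)
}
if err := ReconstructFile(metadata, "data.bin.restored"); err != nil {
	log.Fatal(err)
}
// Compare "data.bin" and "data.bin.restored" (e.g. by hashing each with MD5)
// to confirm the reconstruction matches the original.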

Summary
By breaking the upload process into chunking, parallel processing, metadata management, and file watching, we can build an efficient, high-performance file uploader in Go. This approach handles large files efficiently, re-uploads only the chunks that have changed, and keeps the overall process fast.

The architecture is highly extensible and applies to a range of distributed-system scenarios, making it a solid tool for managing large files.
