[Design Pattern] Upload big file - 4. Code Design - part 2 & Summary

Zhentiw, published 2024-12-05

How to Control Requests?

Controlling requests involves addressing several key issues:

1. How to Maximize Bandwidth Utilization

  • In chunked uploads, a large number of requests are sent. Sending them all at once congests the network, while sending them one at a time wastes bandwidth.
  • Solution: Use the foundational TaskQueue to implement concurrency control.
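
To make the idea of concurrency control concrete, here is a minimal sketch of a queue that runs at most N tasks at a time. The class name `SimpleTaskQueue` and its `add` method are hypothetical and only illustrate the technique; the actual `TaskQueue` and `Task` from the earlier parts of this series may have a different API.

```typescript
// Hypothetical sketch, not the SDK's real TaskQueue.
type AsyncFn<T> = () => Promise<T>;

class SimpleTaskQueue {
  private waiting: Array<() => Promise<void>> = [];
  private running = 0;

  constructor(private readonly concurrency = 4) {}

  // Queue a task; it starts as soon as a slot is free.
  add<T>(fn: AsyncFn<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = async (): Promise<void> => {
        this.running++;
        try {
          resolve(await fn());
        } catch (err) {
          reject(err);
        } finally {
          // Free the slot and let the next waiting task start.
          this.running--;
          this.next();
        }
      };
      this.waiting.push(run);
      this.next();
    });
  }

  // Start waiting tasks while there are free slots.
  private next(): void {
    while (this.running < this.concurrency && this.waiting.length > 0) {
      const run = this.waiting.shift()!;
      void run();
    }
  }
}
```

With a limit of 4, for example, the fifth chunk request only starts once one of the first four has settled, so the connection stays busy without being flooded.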

2. How to Decouple from Upper-Layer Request Libraries

  • For versatility, upper-layer applications may use different request libraries to send requests. Therefore, the frontend SDK should not bind itself to any specific request library.
  • Solution: Use the Strategy Pattern to decouple from the request library.

The implementation of the request control mechanism can be complex. Below is the core code structure:

Request strategy:

// requestStrategy.ts

import { Chunk } from "./chunk";

export interface RequestStrategy {
  // create file request, return token
  createFile(file: File): Promise<string>;
  // chunk upload request
  uploadChunk(chunk: Chunk): Promise<void>;
  // merge file request, return url
  mergeFile(token: string): Promise<string>;
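  // hash check request: the file-level check also returns the chunks still
  // to upload and the final url; the chunk-level check only reports existence
  patchHash<T extends "file" | "chunk">(
    token: string,
    hash: string,
    type: T
  ): Promise<
    T extends "file"
      ? { hasFile: boolean; rest: number[]; url: string }
      : { hasFile: boolean }
  >;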
}
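
As an illustration of how this decoupling works, the sketch below has the SDK code depend only on an interface while the upper layer supplies the concrete strategy. Everything here is an assumption made for the example: `MiniRequestStrategy` is a reduced copy of the interface above (no `patchHash`), the `Chunk` and `FileInfo` shapes are simplified, and `InMemoryStrategy` is a hypothetical stand-in for a strategy that would normally wrap fetch, axios, or another request library.

```typescript
// Simplified shapes for illustration; the real Chunk type lives in ./chunk.
interface Chunk { index: number; hash: string; data: Uint8Array; }
interface FileInfo { name: string; size: number; }

// Reduced version of RequestStrategy, kept small for the example.
interface MiniRequestStrategy {
  createFile(file: FileInfo): Promise<string>;
  uploadChunk(chunk: Chunk): Promise<void>;
  mergeFile(token: string): Promise<string>;
}

// Hypothetical strategy that keeps everything in memory (handy for tests).
class InMemoryStrategy implements MiniRequestStrategy {
  readonly stored = new Map<string, Uint8Array>();

  async createFile(file: FileInfo): Promise<string> {
    return `token-${file.name}`;
  }
  async uploadChunk(chunk: Chunk): Promise<void> {
    this.stored.set(chunk.hash, chunk.data);
  }
  async mergeFile(token: string): Promise<string> {
    return `/files/${token}`;
  }
}

// SDK-side code: it never references a concrete request library.
async function uploadWith(
  strategy: MiniRequestStrategy,
  file: FileInfo,
  chunks: Chunk[]
): Promise<string> {
  const token = await strategy.createFile(file);
  for (const chunk of chunks) {
    await strategy.uploadChunk(chunk);
  }
  return strategy.mergeFile(token);
}
```

Swapping the request library then means writing another class that implements the same interface; `uploadWith` stays unchanged.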

Request control:

import { Task, TaskQueue } from "../upload-core/TaskQueue";
import { Chunk } from "./chunk";
import { ChunkSplitor } from "./chunkSplitor";
import { RequestStrategy } from "./requestStrategy";

export class UploadController {
  private requestStrategy: RequestStrategy;
  private splitStrategy: ChunkSplitor;
  private taskQueue: TaskQueue;
  // other properties and strategies
  // ...

  constructor(
    private file: File,
    private token: string,
    requestStrategy: RequestStrategy,
    splitStrategy: ChunkSplitor
  ) {
    this.requestStrategy = requestStrategy;
    this.splitStrategy = splitStrategy;
    this.taskQueue = new TaskQueue();
    // other properties and strategies
  }

  async init() {
    this.token = await this.requestStrategy.createFile(this.file);
    this.splitStrategy.on("chunks", this.handleChunks.bind(this));
    this.splitStrategy.on("wholeHash", this.handleWholeHash.bind(this));
  }

  private handleChunks(chunks: Chunk[]) {
    chunks.forEach((chunk) => {
      this.taskQueue.addAndStart(new Task(this.uploadChunk.bind(this), chunk));
    });
  }

  async uploadChunk(chunk: Chunk) {
    const resp = await this.requestStrategy.patchHash(
      this.token,
      chunk.hash,
      "chunk"
    );
    if (resp.hasFile) {
      return;
    }

    await this.requestStrategy.uploadChunk(chunk);
  }

  private async handleWholeHash(hash: string) {
    const resp = await this.requestStrategy.patchHash(this.token, hash, "file");
    if (resp.hasFile) {
      this.emit.emit("end", resp.url);
      return;
    }

    // upload the remaining chunks listed in resp.rest
    // ...
  }
}

Key Issues for the Backend

Compared to the client, the server faces greater challenges.

How to isolate different file uploads?

In the file creation protocol, the server uses a combination of UUID and JWT to generate a tamper-proof unique identifier, which is used to distinguish different file uploads.
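
One possible way to build such an identifier is sketched below using only Node's built-in `crypto` module: a random UUID is placed in the payload of a JWT-style token signed with HMAC-SHA256, so the server can detect any modification. The function names, the payload field `uploadId`, and the choice of HS256 are assumptions made for illustration; the article does not specify them, and a production service would more likely use an established JWT library.

```typescript
import { createHmac, randomUUID, timingSafeEqual } from "node:crypto";

// Base64url-encode a UTF-8 string (the encoding JWT uses for its segments).
const b64url = (input: string): string =>
  Buffer.from(input, "utf8").toString("base64url");

// HMAC-SHA256 signature over the "header.payload" part of the token.
function sign(data: string, secret: string): string {
  return createHmac("sha256", secret).update(data).digest("base64url");
}

// Hypothetical helper: create a signed upload token that carries a fresh UUID.
function createUploadToken(secret: string): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const payload = b64url(
    JSON.stringify({ uploadId: randomUUID(), iat: Math.floor(Date.now() / 1000) })
  );
  return `${header}.${payload}.${sign(`${header}.${payload}`, secret)}`;
}

// Returns the uploadId if the signature is valid, otherwise null.
function verifyUploadToken(token: string, secret: string): string | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, payload, signature] = parts;
  const expected = Buffer.from(sign(`${header}.${payload}`, secret), "utf8");
  const actual = Buffer.from(signature, "utf8");
  // Constant-time comparison; lengths must match before calling timingSafeEqual.
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) {
    return null;
  }
  const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8"));
  return typeof claims.uploadId === "string" ? claims.uploadId : null;
}
```

The UUID keeps uploads apart, and the signature means a client cannot forge or alter a token to write into someone else's upload.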


How to ensure chunks are not duplicated?

Here, duplication refers to:

  1. Not saving duplicate chunks
  2. Not uploading duplicate chunks

This requires chunks to be uniquely identifiable across files and never deleted.


Chunk file storage, chunk database, upload database

Chunk file storage: stores every chunk from every file, because two files may share the same chunk.

Chunk database: records each chunk's metadata (name, hash, size).

Upload database: records each file's metadata (token, filename, hash, URL).

In other words, the server does not store the merged file but only records the order of chunks within the file.
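
The storage layout described above can be sketched with in-memory maps. This is only an illustration under stated assumptions: the record shapes, the `ChunkStore` class, and the `chunkHashes` field are hypothetical names, and the maps stand in for the real chunk file storage and databases.

```typescript
// Hypothetical record shapes; field names follow the metadata listed above.
interface ChunkRecord { name: string; hash: string; size: number; }
interface UploadRecord {
  token: string;
  filename: string;
  hash: string;
  url: string;
  chunkHashes: string[]; // order of chunks within the file
}

class ChunkStore {
  private files = new Map<string, Uint8Array>(); // stand-in for chunk file storage
  private records = new Map<string, ChunkRecord>(); // stand-in for the chunk database

  has(hash: string): boolean {
    return this.records.has(hash);
  }

  // Stores the chunk only if its hash is new; returns true when data was written.
  save(hash: string, data: Uint8Array): boolean {
    if (this.records.has(hash)) return false;
    const name = `${hash}.chunk`;
    this.files.set(name, data);
    this.records.set(hash, { name, hash, size: data.length });
    return true;
  }

  get storedCount(): number {
    return this.files.size;
  }
}

// Two files that share the chunk "h1": it is stored once and referenced twice.
const store = new ChunkStore();
store.save("h1", new Uint8Array([1, 2, 3]));
store.save("h2", new Uint8Array([4, 5]));
const wroteAgain = store.save("h1", new Uint8Array([1, 2, 3])); // false, already stored

const fileA: UploadRecord = {
  token: "token-a", filename: "a.bin", hash: "hash-a", url: "/file/token-a",
  chunkHashes: ["h1", "h2"],
};
const fileB: UploadRecord = {
  token: "token-b", filename: "b.bin", hash: "hash-b", url: "/file/token-b",
  chunkHashes: ["h1"],
};
```

Because a file is just an ordered list of chunk hashes, deleting or re-uploading one file never touches chunk data that another file still references.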

What exactly does chunk merging do?

Merging causes several problems, the most significant being:

  • Extremely time-consuming
  • Data redundancy

Therefore, the server does not perform actual merging. Instead, it records in the database which chunks the file contains.

As a result, the merge operation only involves a few simple tasks:

  1. Validates the file size
  2. Verifies the file hash
  3. Marks the file status
  4. Generates the file access URL
  5. ..

These operations are highly efficient.
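
The steps above can be sketched as a single function that touches only metadata. This is a hedged illustration: `PendingUpload`, `mergeUpload`, and the URL shape are hypothetical, and because the article does not say how the file hash is verified, that step is passed in as a callback.

```typescript
type UploadStatus = "uploading" | "complete";

// Hypothetical upload record; field names are assumptions for this sketch.
interface PendingUpload {
  token: string;
  declaredSize: number; // size reported by the client when the file was created
  fileHash: string;
  chunkHashes: string[]; // chunks in file order
  status: UploadStatus;
  url?: string;
}

function mergeUpload(
  upload: PendingUpload,
  chunkSizes: Map<string, number>, // chunk hash -> size, from the chunk database
  verifyFileHash: (upload: PendingUpload) => boolean // verification method is assumed
): string {
  // 1. Validate the file size from chunk metadata, without reading chunk data.
  let total = 0;
  for (const hash of upload.chunkHashes) {
    const size = chunkSizes.get(hash);
    if (size === undefined) throw new Error(`missing chunk ${hash}`);
    total += size;
  }
  if (total !== upload.declaredSize) throw new Error("size mismatch");

  // 2. Verify the file hash.
  if (!verifyFileHash(upload)) throw new Error("hash mismatch");

  // 3. Mark the file status and 4. generate the access URL.
  upload.status = "complete";
  upload.url = `/file/${upload.token}`;
  return upload.url;
}
```

No chunk data is copied or concatenated here, which is why the cost of the merge is independent of the file size apart from the metadata lookups.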

How about file access?

Since the server does not perform actual file merging, it needs to handle dynamic processing when subsequent requests for the file are made. The specific approach is as follows:

  1. Receive File Request:

    • The server receives a request for the file and retrieves the corresponding file metadata from the database.
  2. Locate All Chunks:

    • The server retrieves the list of all chunk IDs for the file from the metadata and locates the corresponding chunk files in storage.
  3. Stream File Using TaskQueue:

    • The server utilizes the TaskQueue to control concurrency during file processing.
    • Chunks are read sequentially or in parallel as needed, and a continuous read stream is created.
    • The stream is piped directly to the network I/O to serve the file to the client.
import fs from 'fs';
import { TaskQueue } from './taskQueue'; // Assume TaskQueue is implemented

// Shared by all requests: at most 4 chunk reads run at the same time
const taskQueue = new TaskQueue(4);

// Simulated database with metadata
const fileMetadata = {
	fileId: '12345',
	chunks: ['chunk1.dat', 'chunk2.dat', 'chunk3.dat'],
};

// Serve file dynamically
async function serveFile(req, res) {
	const { fileId } = req.params;

	// Validate and fetch file metadata
	if (fileId !== fileMetadata.fileId) {
		res.status(404).send('File not found');
		return;
	}

	// Create readable stream and pipe to response
	res.setHeader('Content-Type', 'application/octet-stream');
	res.setHeader('Content-Disposition', 'attachment; filename="output-file.dat"');

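	try {
		// Chunks must reach the client in order, so wait for each read
		// to finish before starting the next one
		for (const chunk of fileMetadata.chunks) {
			await taskQueue.addTask(() => {
				return new Promise((resolve, reject) => {
					const chunkStream = fs.createReadStream(`./storage/${chunk}`);
					chunkStream
						.on('end', resolve)
						.on('error', reject)
						.pipe(res, { end: false }); // Keep the response open for the next chunk
				});
			});
		}

		res.end(); // End the response after all chunks
	} catch (err) {
		// A chunk could not be read: report it if nothing was sent yet,
		// otherwise abort so the client does not wait on a partial file
		if (!res.headersSent) {
			res.status(500).send('Failed to read file');
		} else {
			res.destroy(err);
		}
	}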
}

// Express server setup
import express from 'express';
const app = express();

app.get('/file/:fileId', serveFile);

app.listen(3000, () => {
	console.log('Server running on http://localhost:3000');
});

Summary

I developed the entire upload SDK from scratch, providing comprehensive support for file uploads, particularly large file uploads, across both frontend and backend. The SDK unifies the development approach for file uploads, covering everything from low-level protocols and utility classes to frontend components and backend middleware.

In terms of implementation, various design patterns were used to keep the SDK flexible and fully decoupled from upper-layer applications. Additionally, the server's storage structure was carefully designed so that each chunk is stored and transmitted only once.

Design choice

The common solution for large file uploads is file chunking. File chunking essentially breaks the large file upload process, which is a single large transaction, into multiple smaller chunk upload transactions, thereby reducing the risk of upload failures.

Implementing large file uploads involves numerous technical details. For example, defining the low-level protocol standard is critical as it determines how the frontend and backend interact, which in turn influences how the frontend and backend code are developed. Beyond the protocol, other considerations include how the frontend handles concurrency control, how to efficiently split files into chunks, and how the backend stores chunks, efficiently merges them, and ensures their uniqueness, among other challenges.

There isn’t a universal solution available on the market for these issues. While public cloud services like OSS (Object Storage Service) provide their own implementations, considering that our product may be deployed in customers' private clouds, the most reliable approach is to implement the entire large file upload process ourselves.

Technical implementation

My initial focus was on designing the upload process.

Traditional large file upload processes typically involve the client completing all chunking first, then calculating the hash for each chunk and the entire file. The hash is then used to exchange file information with the server. However, since hash calculation is a CPU-intensive operation, this approach can lead to prolonged client-side blocking. While using Web Workers can accelerate hash computation, my tests showed that even with multithreading, calculating the hash for extremely large files (e.g., files over 10 GB) on less powerful client machines can take more than 30 seconds, which is unacceptable.

To address this, I optimized the upload process. Assuming most uploads involve new files, I modified the workflow so that chunk uploads can start before the hash of the complete file has been calculated, which brings the startup delay close to zero. Once the full file hash is computed, it is sent to the server afterward.

Based on this workflow, I designed a standardized file upload protocol

The protocol consists of four communication standards:

  1. File Creation Protocol:
    The frontend sends a GET request to the server with the basic file information and receives a unique upload token in response. All subsequent requests must include this token.

  2. Hash Verification Protocol:
    The frontend sends the hash of a specific chunk or the entire file to the server to obtain the status of the chunk or file.

  3. Chunk Upload Protocol:
    The frontend uploads the binary data of each chunk to the server for storage.

  4. Chunk Merging Protocol:
    The frontend notifies the server that all chunks have been uploaded and the server can proceed with merging the chunks.

After designing the protocol, the next step was to implement it in code

For the frontend, the main challenges centered around two areas: how to split files into chunks and how to control the request flow.

File Chunking

Considering different scenarios, various chunking modes might be needed, such as:

  • Multithreaded chunking
  • Time-sliced chunking (similar to React Fiber)
  • Custom chunking modes defined by upper-layer applications.

To address this, I used the Template Method pattern, with a TypeScript abstract class defining the overall chunking process. Subclasses only need to implement the chunk hash calculation, which keeps the design flexible.
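
A minimal sketch of this structure is shown below. It is an assumption-laden illustration rather than the SDK's `ChunkSplitor`: the class names are hypothetical, the input is a `Uint8Array` instead of a browser `File`, the event-based interface (`chunks`, `wholeHash`) is left out, and the FNV-1a hash is only a dependency-free stand-in for a real digest such as MD5 or SHA-256.

```typescript
// Simplified chunk shape for this example.
interface Chunk {
  index: number;
  start: number;
  end: number;
  data: Uint8Array;
  hash: string;
}

abstract class BaseChunkSplitter {
  constructor(
    protected readonly data: Uint8Array,
    protected readonly chunkSize: number
  ) {}

  // Template method: the slicing flow is fixed here.
  async split(): Promise<Chunk[]> {
    const chunks: Chunk[] = [];
    for (let start = 0, index = 0; start < this.data.length; start += this.chunkSize, index++) {
      const end = Math.min(start + this.chunkSize, this.data.length);
      const slice = this.data.subarray(start, end);
      chunks.push({ index, start, end, data: slice, hash: await this.calcHash(slice) });
    }
    return chunks;
  }

  // The only step subclasses must provide (worker-based, time-sliced, custom, ...).
  protected abstract calcHash(data: Uint8Array): Promise<string>;
}

// Hypothetical subclass that hashes on the current thread with FNV-1a (32-bit).
class Fnv1aSplitter extends BaseChunkSplitter {
  protected async calcHash(data: Uint8Array): Promise<string> {
    let h = 0x811c9dc5;
    for (let i = 0; i < data.length; i++) {
      h ^= data[i];
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h.toString(16).padStart(8, "0");
  }
}
```

A multithreaded or time-sliced splitter would override only `calcHash`, so the slicing logic and chunk bookkeeping are written once.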

Request Flow Control

Since many requests need to be sent, I developed a concurrent request control class to make full use of network bandwidth.

Additionally, the request process required exposing various hooks to the upper layer, such as:

  • Progress updates
  • Request state changes

To handle this, I implemented a generic EventEmitter class using the publish-subscribe pattern. This allows the request process to emit various events, which the upper-layer application can handle by listening to these events, enabling seamless integration.
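
For illustration, a generic publish-subscribe emitter can be sketched as follows. The class name `TypedEmitter` and the event map `UploadEvents` are hypothetical; they only show the pattern and are not the SDK's actual EventEmitter.

```typescript
type Listener<T> = (payload: T) => void;

class TypedEmitter<Events extends Record<string, unknown>> {
  private listeners = new Map<keyof Events, Set<Listener<any>>>();

  // Subscribe to an event; the returned function unsubscribes again.
  on<K extends keyof Events>(event: K, listener: Listener<Events[K]>): () => void {
    let set = this.listeners.get(event);
    if (!set) {
      set = new Set();
      this.listeners.set(event, set);
    }
    set.add(listener);
    return () => {
      this.listeners.get(event)?.delete(listener);
    };
  }

  // Publish an event to every current subscriber.
  emit<K extends keyof Events>(event: K, payload: Events[K]): void {
    this.listeners.get(event)?.forEach((listener) => listener(payload));
  }
}

// Hypothetical event map for an upload.
type UploadEvents = {
  progress: { uploaded: number; total: number };
  end: string; // final file url
};

const uploadEvents = new TypedEmitter<UploadEvents>();
const progressLog: number[] = [];
const stopListening = uploadEvents.on("progress", (p) => {
  progressLog.push(p.uploaded / p.total);
});
uploadEvents.emit("progress", { uploaded: 1, total: 4 });
uploadEvents.emit("progress", { uploaded: 2, total: 4 });
stopListening();
uploadEvents.emit("progress", { uploaded: 3, total: 4 }); // no longer recorded
```

The upper layer only needs the event names and payload types; it does not have to know how or when the requests are actually sent.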

Of course, the most complex part of the system lies in the backend

Since our project includes a BFF (Backend for Frontend) layer, file handling must be done in the BFF, requiring me to write corresponding server-side code.

The biggest challenge for the server is ensuring the uniqueness of each chunk. This uniqueness includes both storage uniqueness and transmission uniqueness:

  • Storage Uniqueness: Ensures that chunks are not stored redundantly, avoiding data duplication.
  • Transmission Uniqueness: Ensures that chunks are not uploaded redundantly, avoiding communication overhead.

To ensure that chunks are not stored redundantly, chunks and files must be decoupled. Chunks are stored independently and do not belong to any specific file, while files are independently recorded and point to their respective chunks in order.

This design means that even if two different files share the same chunk, the server avoids duplicate storage because chunks are independent entities.

When a user requests a file, I retrieve the chunk records for the corresponding file from the database and sequentially read the chunk data using file streams. The data is then directly streamed to the client via a pipeline.

This approach ensures extremely high efficiency for both merging and file access, while eliminating any storage redundancy on the server.
