2024-05-25
[Node.js] How to Handle Large Files on AWS S3
When handling large files in Node.js, it is essential to use stream processing to keep memory usage low.
Streams let you process data in chunks instead of loading the entire file into memory at once, which makes uploading and downloading large files efficient and stable.
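For example, copying a file through streams keeps only one chunk in memory at a time (a minimal sketch; the file paths are placeholders):
const fs = require('fs');

// Copy a large file without ever holding the whole file in memory
fs.createReadStream('path/to/large/input')
  .pipe(fs.createWriteStream('path/to/large/output'))
  .on('close', () => {
    console.log('Copy finished');
  });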
About highWaterMark in Node.js Stream Processing
In Node.js stream processing, you can specify a parameter called highWaterMark. highWaterMark sets the size of the stream's internal buffer, in bytes. The default is 16KB for most streams (fs read streams default to 64KB), but increasing this value can improve performance.
Increasing highWaterMark improves read and write throughput at the cost of higher memory consumption, so choose a value that fits your system's resources. In this article, 256KB (256 * 1024 bytes) is used.
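For example, the buffer size is passed as an option when creating a file stream (a minimal sketch; the file path is a placeholder):
const fs = require('fs');

// Read with a 256KB internal buffer instead of the default
const readStream = fs.createReadStream('path/to/large/file', {
  highWaterMark: 256 * 1024
});

readStream.on('data', (chunk) => {
  // Each chunk is at most 256KB
  console.log(chunk.length);
});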
Steps
Prerequisites
- Node.js is installed
- You have an AWS account and an S3 bucket has been created
Installing Required Packages
The download example uses the v2 aws-sdk package, while the upload example uses the v3 @aws-sdk/client-s3 and @aws-sdk/lib-storage packages.
npm install aws-sdk @aws-sdk/client-s3 @aws-sdk/lib-storage
Downloading Files
Create a read stream from the S3 object and pipe it into a write stream that saves the file locally.
const AWS = require('aws-sdk');
const fs = require('fs');

// S3 configuration
AWS.config.update({
  accessKeyId: 'your_access_key_id',
  secretAccessKey: 'your_secret_access_key',
  region: 'your_aws_region'
});

const s3 = new AWS.S3();

const downloadFile = (s3Key, downloadFilePath) => {
  const params = {
    Bucket: 'your_s3_bucket_name',
    Key: s3Key
  };
  // 256KB chunk size for the local write stream
  const fileStream = fs.createWriteStream(downloadFilePath, { highWaterMark: 256 * 1024 });

  return new Promise((resolve, reject) => {
    // Note: the v2 SDK's createReadStream() takes no options; buffering is tuned on the write stream
    s3.getObject(params).createReadStream()
      .on('error', (err) => {
        console.error(err);
        reject(err);
      })
      .pipe(fileStream)
      .on('close', () => {
        console.log('File downloaded: ' + downloadFilePath);
        resolve();
      });
  });
};
downloadFile('file/key/in/s3', 'path/to/downloaded/file');
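As a design note, Node's promise-based pipeline (available from Node.js 15 via stream/promises) can replace the manual .pipe() and error/close wiring above: it rejects if either stream fails and cleans both up. A minimal sketch reusing the s3 client from the example above (downloadFileWithPipeline is a hypothetical name):
const { pipeline } = require('stream/promises');

const downloadFileWithPipeline = async (s3Key, downloadFilePath) => {
  const params = {
    Bucket: 'your_s3_bucket_name',
    Key: s3Key
  };
  // pipeline resolves once the S3 read stream has been fully written to the local file
  await pipeline(
    s3.getObject(params).createReadStream(),
    fs.createWriteStream(downloadFilePath, { highWaterMark: 256 * 1024 }) // 256KB chunk size
  );
  console.log('File downloaded: ' + downloadFilePath);
};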
Uploading Files
When uploading, in addition to stream processing, @aws-sdk/lib-storage enables multipart upload (splitting the file into multiple parts and uploading them separately).
With multipart upload, you can specify the part size with partSize and the number of parts uploaded concurrently with queueSize. Each part is uploaded independently, and once all parts have been uploaded, S3 assembles them into the final object. Note that S3 requires every part except the last to be at least 5MB, and a single upload can consist of at most 10,000 parts.
Here is the code for uploading:
const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');
const fs = require('fs');
const path = require('path');

// S3 configuration (the v3 client takes credentials directly instead of AWS.config.update)
const s3Client = new S3Client({
  region: 'your_aws_region',
  credentials: {
    accessKeyId: 'your_access_key_id',
    secretAccessKey: 'your_secret_access_key'
  }
});

const uploadFile = (filePath) => {
  const uploadParams = {
    Bucket: 'your_s3_bucket_name',
    Key: path.basename(filePath),
    Body: fs.createReadStream(filePath, { highWaterMark: 256 * 1024 }) // 256KB chunk size
  };

  return new Promise((resolve, reject) => {
    const parallelUploads3 = new Upload({
      client: s3Client,
      params: uploadParams,
      leavePartsOnError: false, // Abort the multipart upload if any part fails
      queueSize: 4, // Maximum number of parts uploaded concurrently
      partSize: 1024 * 1024 * 20 // 20MB per part
    });

    parallelUploads3.on('httpUploadProgress', (progress) => {
      console.log(progress);
    });

    parallelUploads3.done().then(() => {
      console.log('File uploaded: ' + uploadParams.Key);
      resolve();
    }).catch((err) => {
      console.error(err);
      reject(err);
    });
  });
};
uploadFile('path/to/large/file');
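Since uploadFile returns a promise, the call can also be wrapped with async/await to handle failures explicitly (a short usage sketch):
(async () => {
  try {
    await uploadFile('path/to/large/file');
  } catch (err) {
    // Because leavePartsOnError is false, a failed multipart upload is aborted rather than leaving parts behind
    console.error('Upload failed:', err);
  }
})();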
Conclusion
By leveraging stream processing and multipart upload, you can handle large files efficiently even with limited memory.
Adjust parameters such as highWaterMark, partSize, and queueSize to match your system's resources.