How We Built a Scalable File and Image Management Service Using Google Cloud Storage

At GoApptiv, our engineering teams have built multiple internal applications that store images and other files. Over time, the total size of stored files and images grew past 1 TB, and we started running into pain points that pushed us to look for a new file management solution.

Before getting to the solution, let's look at the technical concerns we faced while handling files and images on the server.

Technical concerns

  • Authorizing users to view the file.
  • Authorizing users to upload the file.
  • Uploading the file to the server in a scalable way.
  • Validating the file data.
  • Creating file variants such as thumbnails or compressing the image size.
  • Archiving old files.

Previously, all files and images were stored on the server, and a request handler was tied up for the entire duration of each upload. We needed a new architecture focused on reducing the time the server spends handling an upload request.

The Solution

We implemented the solution using a File Management Service as an interface, Cloud Storage (in our case, Google Cloud Storage), signed URLs, and Google Cloud Functions.

We're using Google Cloud Storage to reduce the burden on our web servers for uploading and serving the files.

Google Cloud Storage

Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure.

Signed URL

A signed URL is a URL that carries authorization parameters in its query string. We use signed URLs for both uploads and downloads so that only clients we have handed a URL to can perform these actions, and only until the URL expires.

https://storage.googleapis.com/bucket/document/1/ufo.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=xxx-cloud-storage%40xxx-project.iam.gserviceaccount.com%2F20210909%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210909T114520Z&X-Goog-Expires=604700&X-Goog-SignedHeaders=content-type%3Bhost&X-Goog-Signature=61acf1c8f90380e3cfb330e5d02978d659423bd82b5a3d78a9ece945c0253639d148f8cb2b2cc0241f9c112be33f3c70029a0f8cd9820d9dda2b942cbde0805de9c2e08d3cc7f93c76075e639a151ef53f5334cd9cf454b6b235b5ef4e0610f252ba0983d53daf61d426b903d67de663f48bb34a86b670b2f47bc45f7076a10c89259b8b2d27635a9f79057f07a097efbb93e06a38157066c646f92ccccc99da82d3cedd4849d629
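To make the time limit concrete, here is a minimal sketch that reads the X-Goog-Date and X-Goog-Expires parameters from a V4 signed URL and works out when the URL stops being valid. The helper name signedUrlExpiry is our own, chosen for illustration; it is not part of any Google library.

```typescript
// Hypothetical helper: derive the expiry instant of a V4 signed URL from
// X-Goog-Date (signing time) and X-Goog-Expires (lifetime in seconds).
function signedUrlExpiry(signedUrl: string): Date {
  const params = new URL(signedUrl).searchParams;
  const date = params.get('X-Goog-Date');
  const expires = params.get('X-Goog-Expires');
  if (!date || !expires) {
    throw new Error('Missing X-Goog-Date or X-Goog-Expires');
  }

  // X-Goog-Date uses the ISO 8601 basic format, e.g. 20210909T114520Z.
  const match = /^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})Z$/.exec(date);
  if (!match) {
    throw new Error(`Unexpected X-Goog-Date format: ${date}`);
  }
  const [, year, month, day, hour, minute, second] = match.map(Number);

  const signedAtMs = Date.UTC(year, month - 1, day, hour, minute, second);
  return new Date(signedAtMs + Number(expires) * 1000);
}
```

For the sample URL above (signed at 2021-09-09T11:45:20Z with a lifetime of 604700 seconds), this yields 2021-09-16T11:43:40Z, just under the seven-day maximum for V4 signed URLs.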

Storage Classes

A storage class is a piece of metadata set on every object in Cloud Storage. Put simply, GCP offers several storage classes: STANDARD, NEARLINE, COLDLINE, and ARCHIVE.

Each storage class has its own pricing model, and every class except STANDARD has a minimum storage duration.
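As a rough illustration of what the minimum storage duration means in practice, the sketch below encodes the durations documented by Google at the time of writing (30 days for NEARLINE, 90 for COLDLINE, 365 for ARCHIVE) and computes how long an object must still be kept before it can be deleted or moved without an early deletion charge. The helper name is hypothetical, and the numbers should be checked against the current documentation.

```typescript
type StorageClass = 'STANDARD' | 'NEARLINE' | 'COLDLINE' | 'ARCHIVE';

// Minimum storage durations in days (assumption: values as documented
// by Google at the time of writing; verify before relying on them).
const MIN_STORAGE_DAYS: Record<StorageClass, number> = {
  STANDARD: 0,
  NEARLINE: 30,
  COLDLINE: 90,
  ARCHIVE: 365,
};

// Hypothetical helper: days remaining before an object can be deleted
// or moved without incurring an early deletion charge.
function daysUntilFreeDeletion(
  storageClass: StorageClass,
  ageInDays: number,
): number {
  return Math.max(0, MIN_STORAGE_DAYS[storageClass] - ageInDays);
}
```

This matters for archiving: a file moved to ARCHIVE and deleted after 100 days would still be billed as if it had been stored for the remaining 265 days.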

You can refer to the Google Cloud Storage classes documentation for more details.

Google Cloud Functions

Google Cloud Functions is a serverless execution environment for building and connecting cloud services. With Cloud Functions, you write simple, single-purpose functions that are attached to events emitted from your cloud infrastructure and services.

You can refer to the Google Cloud Functions documentation for more details.

File Management Service

File Management Service is a separate service used to encapsulate the logic of creating the signed read URL, signed upload URL, and processing the files.
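For the read side, a minimal sketch of generating a signed read URL might look like the following. It is written against a small structural interface rather than the real File class from @google-cloud/storage so that it stays self-contained; the real getSignedUrl method accepts a config of this shape and resolves to a one-element array. The function name, the lifetimeMs parameter, and the injected clock are assumptions made for illustration, not the service's actual code.

```typescript
// Structural stand-in for the `File` object from @google-cloud/storage.
interface SignableFile {
  getSignedUrl(config: {
    version: 'v4';
    action: 'read';
    expires: Date;
  }): Promise<[string]>;
}

// V4 signed URLs cannot be valid for longer than seven days.
const MAX_V4_LIFETIME_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical helper: sign a read URL that expires `lifetimeMs` from now.
async function generateReadSignedUrl(
  file: SignableFile,
  lifetimeMs: number,
  now: Date = new Date(),
): Promise<string> {
  if (lifetimeMs <= 0 || lifetimeMs > MAX_V4_LIFETIME_MS) {
    throw new RangeError('lifetimeMs must be between 1 ms and 7 days');
  }
  const [url] = await file.getSignedUrl({
    version: 'v4',
    action: 'read',
    expires: new Date(now.getTime() + lifetimeMs),
  });
  return url;
}
```

Keeping the lifetime short limits how long a leaked read URL remains usable, at the cost of clients having to request fresh URLs more often.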

We've open-sourced the File Management Service; you can find the source code on GitHub.

The Process

  1. Client requests an upload URL from the file management server.
  2. Client uploads the file data to the upload URL.
  3. Client tells the file management server the upload is completed.
  4. A Google Cloud Function processes the file in the background once the file is uploaded.
  5. Client requests to archive the file after a few days or months.
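From the client's point of view, the first three steps can be sketched roughly as follows. The endpoint paths match the curl examples in this article, but the response shape (a JSON body with uuid and url fields) is an assumption, as is the injected http function, which stands in for a thin wrapper around fetch so the sketch can run without a network.

```typescript
// Minimal HTTP function shape; a wrapper around fetch or a test double fits.
type HttpFn = (
  url: string,
  init: {
    method: string;
    headers: Record<string, string>;
    body: string | Uint8Array;
  },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

interface FileInfo {
  name: string;
  size: number;
  type: string;
}

async function uploadViaSignedUrl(
  http: HttpFn,
  baseUrl: string,
  referenceNumber: string,
  templateCode: string,
  file: FileInfo,
  data: Uint8Array,
): Promise<string> {
  const jsonHeaders = { 'Content-Type': 'application/json' };

  // Step 1: ask the file management server for a signed upload URL.
  const created = await http(`${baseUrl}/api/v1/files`, {
    method: 'POST',
    headers: jsonHeaders,
    body: JSON.stringify({ referenceNumber, templateCode, file }),
  });
  if (!created.ok) throw new Error('Could not obtain an upload URL');
  // ASSUMPTION: the response body carries `uuid` and `url` fields.
  const { uuid, url } = (await created.json()) as { uuid: string; url: string };

  // Step 2: send the bytes straight to Cloud Storage. The Content-Type
  // header has to match the one the URL was signed for.
  const uploaded = await http(url, {
    method: 'PUT',
    headers: { 'Content-Type': file.type },
    body: data,
  });
  if (!uploaded.ok) throw new Error('Upload to Cloud Storage failed');

  // Step 3: tell the file management server the upload is complete.
  const confirmed = await http(`${baseUrl}/api/v1/files/confirm`, {
    method: 'PUT',
    headers: jsonHeaders,
    body: JSON.stringify({ uuid }),
  });
  if (!confirmed.ok) throw new Error('Upload confirmation failed');
  return uuid;
}
```

Note that only the two small JSON requests reach the file management server; the file bytes go directly to Cloud Storage.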

Client requests an upload URL from the file management server

flow-1.png

The client asks the file management server to provide the upload URL. This URL points to Google Cloud Storage. The file can now be directly uploaded to Google Cloud Storage using this upload URL.

During this step, the file management server generates a signed URL using the metadata provided by the client. The signed URL is time-limited and is only valid for the specified object path and content type.

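import { Bucket, GetSignedUrlConfig } from '@google-cloud/storage';
import * as moment from 'moment';

// The enclosing class name is illustrative; the method relies on `this.bucket`.
export class StorageService {
  constructor(private readonly bucket: Bucket) {}

  async generateUploadSignedUrl(
    path: string,
    contentType: string,
    // Assumed default: the original leaves the fallback expiry unspecified.
    expiryTime: moment.Moment = moment().add(15, 'minutes'),
  ): Promise<string> {
    const options: GetSignedUrlConfig = {
      version: 'v4',
      action: 'write',
      expires: expiryTime.toDate(),
      contentType,
    };

    // Get a v4 signed URL for uploading the file
    const [url] = await this.bucket.file(path).getSignedUrl(options);
    return url;
  }
}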

After generating the signed URL, the file management server also creates a database record, identified by a UUID, to track the file upload.

The server returns the upload URL along with the UUID to the client.

curl --location --request POST 'localhost:3000/api/v1/files' \
--header 'Content-Type: application/json' \
--data-raw '{
    "referenceNumber": "REFERENCE-1",
    "templateCode": "COVER_IMAGE",
    "file": {
        "name": "ufo.png",
        "size": 2048,
        "type": "image/png"
    }
}'
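Because the request above carries a templateCode together with the file's declared name, size, and type, the server has a chance to validate the metadata before signing anything. The sketch below shows one way this could work. The Template shape, the limits for COVER_IMAGE, and the function name are all assumptions made for illustration; this article does not describe what a template actually contains.

```typescript
interface FileMeta {
  name: string;
  size: number;
  type: string;
}

// HYPOTHETICAL template shape and values, for illustration only.
interface Template {
  allowedTypes: string[];
  maxSizeBytes: number;
}

const TEMPLATES: Record<string, Template> = {
  COVER_IMAGE: {
    allowedTypes: ['image/png', 'image/jpeg'],
    maxSizeBytes: 5 * 1024 * 1024,
  },
};

// Returns a list of validation errors; an empty list means the request
// passed and an upload URL may be signed.
function validateUploadRequest(templateCode: string, file: FileMeta): string[] {
  const template = TEMPLATES[templateCode];
  if (!template) {
    return [`Unknown template: ${templateCode}`];
  }
  const errors: string[] = [];
  if (!template.allowedTypes.includes(file.type)) {
    errors.push(`Type ${file.type} is not allowed for ${templateCode}`);
  }
  if (file.size <= 0 || file.size > template.maxSizeBytes) {
    errors.push(`Size ${file.size} is out of range for ${templateCode}`);
  }
  return errors;
}
```

The declared size is only what the client claims, so a check like this is a first filter; the uploaded object should be checked again once it lands in the bucket.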

With this process, the file data never touches our servers, which frees up web server resources and reduces our exposure to denial-of-service through large or slow uploads. The actual upload is offloaded to another service (in our case, Google Cloud Storage).

Client uploads the file data to the upload URL

flow-2.png

This step is straightforward: the client uploads the file directly to the upload URL with a PUT request. The Content-Type header must match the content type the URL was signed for.

curl --location --request PUT 'https://storage.googleapis.com/bucket/document/1/ufo.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=xxx-cloud-storage%40xxx-project.iam.gserviceaccount.com%2F20210909%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210909T114520Z&X-Goog-Expires=604700&X-Goog-SignedHeaders=content-type%3Bhost&X-Goog-Signature=61acf1c8f90380e3cfb330e5d02978d659423bd82b5a3d78a9ece945c0253639d148f8cb2b2cc0241f9c112be33f3c70029a0f8cd9820d9dda2b942cbde0805de9c2e08d3cc7f93c76075e639a151ef53f5334cd9cf454b6b235b5ef4e0610f252ba0983d53daf61d426b903d67de663f48bb34a86b670b2f47bc45f7076a10c89259b8b2d27635a9f79057f07a097efbb93e06a38157066c646f92ccccc99da82d3cedd4849d629' \
--header 'Content-Type: image/png' \
--data-binary '@/Users/sagarvaghela/Desktop/ufo.png'

Client tells the File Management Server the upload is completed

flow-3.png

Once the file is uploaded to Google Cloud Storage, the client sends a confirmation request to the server with the UUID.

curl --location --request PUT 'localhost:3000/api/v1/files/confirm' \
--header 'Content-Type: application/json' \
--data-raw '{
    "uuid": "81713b3c17277dd1cf76ae0c9cef55d902083437"
}'

Google Cloud Function processes the file in the background once the file is uploaded

A Google Cloud Function runs a piece of code in response to an event.

So once a file is uploaded to the bucket, we can process it with Cloud Functions. The processing can be anything, such as creating thumbnails or running OCR on the image.

All of this runs asynchronously in the background.
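As an illustration, the decision logic inside such a function could look like the sketch below. It only plans the work from the event payload, using the bucket, name, and contentType fields that Cloud Storage includes in its object events; the actual image processing is left out. The thumbnails/ prefix and the function name are hypothetical. The prefix check guards against a common pitfall: if thumbnails are written back to the same bucket, each one triggers the function again.

```typescript
// Subset of the object metadata delivered with a Cloud Storage event.
interface StorageObjectEvent {
  bucket: string;
  name: string;
  contentType?: string;
}

type ProcessingPlan =
  | { action: 'skip'; reason: string }
  | { action: 'thumbnail'; source: string; target: string };

// HYPOTHETICAL location for generated thumbnails.
const THUMBNAIL_PREFIX = 'thumbnails/';

function planProcessing(event: StorageObjectEvent): ProcessingPlan {
  // Avoid reprocessing our own output when it lands in the same bucket.
  if (event.name.startsWith(THUMBNAIL_PREFIX)) {
    return { action: 'skip', reason: 'already a thumbnail' };
  }
  // Only images get thumbnails; a missing content type is treated as non-image.
  if (!event.contentType?.startsWith('image/')) {
    return { action: 'skip', reason: 'not an image' };
  }
  return {
    action: 'thumbnail',
    source: event.name,
    target: `${THUMBNAIL_PREFIX}${event.name}`,
  };
}
```

Keeping this planning step as a pure function makes it easy to unit test separately from the code that actually reads and writes objects.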

Client requests to archive file after a few days or months

archive.png

Sometimes we need a file for only a few days; after that, we no longer want to access it but also don't want to delete it. This is easily handled by changing the file's storage class to ARCHIVE.

curl --location --request PUT 'localhost:3000/api/v1/files/archive' \
--header 'Content-Type: application/json' \
--data-raw '{
    "uuid": "81713b3c17277dd1cf76ae0c9cef55d902083437"
}'
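On the server side, handling the archive request above could boil down to something like the following sketch. It is written against a small structural interface in place of the File object from @google-cloud/storage, which exposes a setStorageClass method of this shape. The FileRecord fields, the in-memory Map standing in for the database, and the function name are assumptions for illustration.

```typescript
// Structural stand-in for the `File` object from @google-cloud/storage.
interface StorageClassTarget {
  setStorageClass(storageClass: string): Promise<unknown>;
}

// HYPOTHETICAL tracking record; the field names are assumptions.
interface FileRecord {
  uuid: string;
  path: string;
  archived: boolean;
}

async function archiveByUuid(
  uuid: string,
  records: Map<string, FileRecord>,
  fileAt: (path: string) => StorageClassTarget,
): Promise<FileRecord> {
  const record = records.get(uuid);
  if (!record) {
    throw new Error(`No file tracked for UUID ${uuid}`);
  }
  // The key line: move the object to the ARCHIVE storage class.
  await fileAt(record.path).setStorageClass('ARCHIVE');
  record.archived = true;
  return record;
}
```

With the real client library, fileAt would typically be something like (path) => bucket.file(path).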

Conclusion

We implemented a file management service on top of Google Cloud Storage that scales well and addresses many of the technical concerns listed earlier. Upload traffic no longer passes through our web servers, which reduces their exposure to DDoS attacks, and upload URLs expire after a limited time.

Read URLs can likewise be made to expire after a set time, which addresses the authorization concern.

Archiving files also becomes easy, with just one line of code, and archived files can be retrieved in the future without any complex procedure.