Files

This lesson covers the reference implementation of file uploads in Codifly projects. It is meant as a guide to read before making changes to existing code, but it also lays a solid foundation for the concepts of file handling in backends.

If you run into problems later on, be sure to check the FAQ section at the bottom. For the purpose of this lesson and the exercises, you can safely skip the FAQ.

Requests of type multipart/form-data

As you are very familiar with making HTTP requests with a JSON body, you might think it a good idea to send files to backends by encoding the file contents as base64 and embedding the resulting value in a JSON string. Consider the following example of an HTTP request created in this manner. As you know, every HTTP request starts with a request line, followed by the request headers and then the request body, which is preceded by a blank line.

POST / HTTP/1.1
Content-Type: application/json

{ "fileData": "TmllbWFuZCB3ZWV0IGRhdCBpayByZXBlbHN0ZWVsdGplIGhlZXQu" }

Unfortunately, base64-encoding file contents inflates them: every 3 input bytes become 4 output bytes, an overhead of roughly 33% (plus padding). That is not acceptable in a network context. As a result, for uploading files from the browser, we do not use the Content-Type: application/json header.
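You can verify the overhead yourself. The snippet below (plain Node, nothing seed-specific) encodes 3000 arbitrary bytes and shows the 4000-character result:

```typescript
// Base64 expands data: every 3 input bytes become 4 output characters.
const raw = Buffer.alloc(3000, 0x41); // 3000 arbitrary bytes
const encoded = raw.toString("base64");
console.log(raw.length, encoded.length); // 3000 4000 — a 33% increase
```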

Instead, the best method for uploading files is by using Content-Type: multipart/form-data. In such requests, we can have a request body consisting of multiple parts, some being fields and some being files. There are many guides on the internet, so I will only give a short example request below.

In the following example, we have a multipart/form-data request. It starts, like all HTTP requests, with a request line and headers. Then the body follows. Unlike our JSON example, the body consists of parts, three in total in this example. The first part is named someField and has as value the string "someField value" (it is thus a field, not a file). The second part is named file1 and has as value a file file1.txt. The third part is named file2 and has as value a file file2.html.

POST / HTTP/1.1
Host: localhost:9300
Content-Type: multipart/form-data; boundary=---------------------------0123456789012345678901234567

-----------------------------0123456789012345678901234567
Content-Disposition: form-data; name="someField"

someField value
-----------------------------0123456789012345678901234567
Content-Disposition: form-data; name="file1"; filename="file1.txt"
Content-Type: text/plain

This is a text file.

-----------------------------0123456789012345678901234567
Content-Disposition: form-data; name="file2"; filename="file2.html"
Content-Type: text/html

<!DOCTYPE html><title>This is an ugly HTML-file.</title>

-----------------------------0123456789012345678901234567--

Learn more about multipart/form-data in RFC 7578 (which obsoletes the older RFC 2388), RFC 1341 section 7.2, and the blog post by Adam Chalmers.
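In practice you rarely assemble such a body by hand. As an illustration (not how the seed's tests do it), the standard FormData API available in browsers and in Node 18+ builds an equivalent body for the example above, and fetch() generates the boundary automatically:

```typescript
// Build the three-part body from the example with the standard FormData API.
const form = new FormData();
form.append("someField", "someField value"); // a field, not a file
form.append("file1", new Blob(["This is a text file.\n"], { type: "text/plain" }), "file1.txt");
form.append("file2", new Blob(["<!DOCTYPE html><title>An ugly HTML file.</title>"], { type: "text/html" }), "file2.html");

// Sending it; fetch() sets Content-Type: multipart/form-data with the boundary:
// await fetch("http://localhost:9300/", { method: "POST", body: form });
```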

Streaming and buffering

When uploading files, we can make use of two different approaches: streaming and buffering.

Suppose that a user uploads a file to our server, which should then be stored on AWS S3. When using a buffering approach, the server first stores the entire file at some location it can easily access. It can keep the file in working memory (in which case its size is essentially limited by the free memory available in the Docker container) or in the file system (in which case write permission to some directory and sufficient disk space are needed).

Given the obviously high demand buffering approaches have on the runtime environment, we prefer streaming approaches. In a streaming approach, bytes in a file flow through the system like water flowing through a river. The file in its entirety is never fully stored on the server, only consecutive chunks of bytes in the file. A chunk comes in, is forwarded, and we are ready for the next chunk to come in (more or less).

It is easy to see that streaming is, from an engineering perspective, superior to buffering. However, it is harder to implement. For example, if we want to accept only files smaller than a certain size, we do not know the length of the file in advance, so we cannot tell up front whether it will eventually be too large. Furthermore, in a multipart/form-data request there can be multiple parts, as the name implies, and we do not know in advance which parts will still follow after the file's contents. It is thus impossible to validate the full body for correctness in advance.
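The size-limit problem does have a streaming solution: count bytes as they flow past and abort the moment the total exceeds the maximum. The sketch below is a hypothetical illustration using plain Node streams, not the seed's actual code:

```typescript
import { Readable, Transform, Writable } from "node:stream";
import { pipeline } from "node:stream/promises";

// Hypothetical sketch: enforce a size limit while streaming, without ever
// buffering the whole file. Aborts as soon as the running total exceeds max.
function sizeLimit(maxBytes: number): Transform {
  let seen = 0;
  return new Transform({
    transform(chunk: Buffer, _encoding, callback) {
      seen += chunk.length;
      if (seen > maxBytes) callback(new Error("file too large"));
      else callback(null, chunk); // forward the chunk downstream immediately
    },
  });
}

// Usage: a 10-byte "file" against a 5-byte limit fails mid-stream.
const devNull = new Writable({ write(_chunk, _enc, cb) { cb(); } });
pipeline(Readable.from([Buffer.alloc(4), Buffer.alloc(6)]), sizeLimit(5), devNull)
  .catch((err) => console.log(err.message)); // "file too large"
```

Note that the first 4-byte chunk is already forwarded before the error occurs; a streaming server must be prepared to clean up a partially transferred file.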

Most libraries supporting file uploads use a buffering approach. In our seed, we use busboy, one of the few well-known packages that supports streaming. Note that we cannot use koa-busboy, as that package... uses busboy to implement a buffering approach. Yes. It does.

Check out the documentation of busboy. See how it uses callback functions that are called while (parts of) a request body "flows" in.

Reference implementation in the seed

Configuration

The reference implementation can store uploaded files at S3 (backend S3) or at a directory in the local file system (backend LOCAL). The former is to be used in production environments, the latter is great for local development and testing.

Configuration is straightforward:

  • FILES_BACKEND: LOCAL to store uploaded files in a folder in the local file system, S3 to store files on an AWS S3 bucket.
  • FILES_LOCAL_FILE_DIR: directory name of the folder to use (in case of backend LOCAL) or null (in case of backend S3)
  • FILES_AWS_REGION: region where the S3 bucket is located (in case of backend S3) or null (in case of backend LOCAL)
  • FILES_AWS_BUCKET_NAME: name of the S3 bucket to use (in case of backend S3) or null (in case of backend LOCAL)
  • FILES_MAX_SIZE_IN_BYTES: maximum size of uploaded files in bytes. Don't forget that a KiB is 1024 bytes, not 1000.
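For a local development setup, the variables above might be combined as follows. The values are illustrative assumptions, not seed defaults:

```shell
FILES_BACKEND=LOCAL
FILES_LOCAL_FILE_DIR=uploads
FILES_AWS_REGION=null
FILES_AWS_BUCKET_NAME=null
FILES_MAX_SIZE_IN_BYTES=10485760  # 10 MiB = 10 * 1024 * 1024 bytes
```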

Architectural overview

We have three big submodules in the files service:

  • Local file abstractions: contains internal helpers for storing files locally in the file system. Function names have form localXxx().
  • S3 abstractions: contains internal helpers for uploading files to S3. Function names have form s3Xxx().
  • File service: the core file service with publicly accessible functions.

When we look at the file service itself,

  • Regarding initialization, initialize() calls either localBackendInit() or s3BackendInit().
  • Function createFileByMultipartFormData() extracts the file's ReadStream from an HTTP request of type multipart/form-data. This stream is then passed to createFileByReadStream(), which either stores the file locally by calling localBackendUpload() or uploads it to S3 by calling s3BackendUpload().
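The dispatch described above can be sketched as follows. The function names follow the text, but the signatures and stub bodies are assumptions purely for illustration:

```typescript
type FilesBackend = "LOCAL" | "S3"; // mirrors the FILES_BACKEND setting

// Stubs standing in for the real backend helpers (hypothetical signatures).
const localBackendUpload = (name: string): string => `local:${name}`;
const s3BackendUpload = (name: string): string => `s3:${name}`;

// The core service picks a backend and delegates; it never cares where
// the bytes actually end up.
function createFileByReadStream(backend: FilesBackend, name: string): string {
  return backend === "LOCAL" ? localBackendUpload(name) : s3BackendUpload(name);
}

console.log(createFileByReadStream("LOCAL", "file1.txt")); // local:file1.txt
```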

Testing

Automatic testing

In ./api/tests/specs/files.spec.ts, you can find some pre-made tests for file upload functionality. Those tests run with yarn run test, as usual.

They make intensive use of the function generateReadable(), which returns a stream of a given length in bytes. You can think of this stream as the contents of a file to upload, streamed from, for example, the local file system.

Under the hood, this function works by leveraging an async generator function to generate "chunks" of bytes, of varying lengths. Each chunk is between 500 and 999 bytes long, except for the last chunk which can be up to 1000 bytes.
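A minimal sketch of how such a helper might look (the real generateReadable() lives in the seed; this version only illustrates the async-generator technique):

```typescript
import { Readable } from "node:stream";

// Hypothetical generateReadable()-style helper: an async generator yields
// random-sized chunks until exactly `length` bytes have been produced.
function generateReadable(length: number): Readable {
  async function* chunks(): AsyncGenerator<Buffer> {
    let remaining = length;
    while (remaining > 0) {
      // chunks of 500-999 bytes; once 1000 bytes or fewer remain, emit the rest
      const size = remaining <= 1000 ? remaining : 500 + Math.floor(Math.random() * 500);
      remaining -= size;
      yield Buffer.alloc(size, 0x61); // "a"-filled bytes stand in for file data
    }
  }
  return Readable.from(chunks());
}

// Usage: whatever the chunk sizes, the stream delivers exactly `length` bytes.
async function totalLength(stream: Readable): Promise<number> {
  let total = 0;
  for await (const chunk of stream) total += (chunk as Buffer).length;
  return total;
}
totalLength(generateReadable(12345)).then((n) => console.log(n)); // 12345
```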

Manual testing

During the implementation, I had to test a multitude of different scenarios. For example, the upload of an unacceptably large file should be aborted as soon as it exceeds the maximum size. To test this functionality, I used bash scripts employing curl under the hood.

You can use the following code snippet as a quick start for writing your own test scripts. Note that the script depends on jq for JSON parsing, which can easily be installed via brew install jq.

The first request retrieves an auth token by logging you in. The second request uploads a file (with progress bar!) and retrieves the file id, as stored in our database. The third request downloads the file. The variables are:

  • As ORIGIN, you can use the local dev server's origin http://localhost:9300, or the one from a remote environment, like https://stg6969-api.sandbox.codifly.be.
  • As EMAIL and PASSWORD, use the credentials of a seeded user as found in ./api/migrations/seeds (if seeding is enabled for your environment), or use credentials of any registered user with sufficient privileges.
  • As FILE, enter the name of the file to upload.
export ORIGIN=http://localhost:9300
export EMAIL=...
export PASSWORD=...
export FILE=someFile.txt

echo Logging you in...
export AUTH_TOKEN=$(curl -X POST $ORIGIN/api/account/login \
   -H 'Content-Type: application/json' \
   -d "{\"email\":\"${EMAIL}\",\"password\":\"${PASSWORD}\"}" \
   -s | jq -r '.authToken')

echo Uploading the file...
export FILE_ID=$(curl --progress-bar \
   -X POST $ORIGIN/api/files \
   -H "Authorization: Bearer ${AUTH_TOKEN}" \
   -F file=@$FILE | jq -r '.id')

echo Downloading the file...
curl -X GET $ORIGIN/api/files/${FILE_ID}

FAQ

What are the limitations for multipart/form-data requests in the seed?

In our reference implementation, we only support a multipart/form-data request with exactly one part named file, which should have as value a file. In other words, we do not support uploading multiple files. We do not support using a part name other than file. We do not support parts containing fields instead of files. In all these unsupported cases, the API responds with an appropriate error status code.

I get a 301 response when using the S3 backend. What should I do?

The bucket you specified in FILES_AWS_BUCKET_NAME is not located in the region FILES_AWS_REGION you specified. Open the AWS console, go to S3, and look up the region where your bucket is located.

I do not have to configure an AWS profile name or an AWS Access Key. How does this bloody thing have access to the S3 bucket?

Every container running in AWS ECS (including the staging, acceptance and production containers in our traditional setup) has two roles associated with it: the execution role and the task role. The execution role is used by AWS internally to get the container going. Inside the container, however, the running process automatically assumes the task role at runtime.

In our ops-modules repository, we have configured the task role to have access to S3... but, and this is very important, only if you have included S3 in the runtime_permissions array you provided to the backend ops module!

Should the bucket be a public bucket?

No. As we are using the aforementioned task role, we act as an authorized (and not an anonymous) user. As a result, we do not need to make the bucket public in order to access its file contents.

Exercises

Clone the seed project and start the backend. Then, answer the following questions.

  1. List 3 typical problems you might expect at runtime on servers using a buffered approach for file uploads.
  2. What is the meaning of "0123456789012345678901234567" in the example of a multipart/form-data request above? What is the meaning of the trailing -- on the last line?
  3. Where are uploaded files stored in the local development environment?
  4. From where did we receive permissions to create that directory and write into it? Tip: docker exec into your container and check the permissions and owner of the root directory with ls -al /.
  5. What is the maximum file size for files stored in the local development environment?
  6. What happens if the request contains a file that is larger than the allowed limit? What is the server's response to such a request? Which curl command did you use to test this?