Streaming Unzip with Go & AWS Lambda

piaras
4 min read · Apr 12, 2021

Credit where credit’s due: Inspired by this post by Dávid Mikuš and leveraging the really nice zipstream package by @krolaw

Photo by omar jabri on Unsplash

We talk sometimes about the consolations of philosophy and I think a similar notion can apply to the practice of writing code. I reckon I’m not the only one who derives some measure of contentment from spiking a simple tool or technique away from the pressures or realities of the boiler-room.

Of course, working a solution through from end-to-end should provide some learnings that you can carry forward to improve your general skill-set. But shouldn’t a practice entail more than the constant repetitious circuits of self-improvement?

The process of completing a discrete self-contained chunk of software might also reward you with a tasty slice of validation pie. Getting something simple finished and working can be an end in itself sometimes and you may find the process can be comforting and reassuring; it may help tend and sustain your love for the craft. This short write-up is the product of such code consolations.

Mission

If you find yourself working on an application that requires users to submit packages of files for ingest into a system, then Zip archives and an object store such as S3 are a natural (and user-friendly) choice. But once a user has submitted the ingest package, what next? It feels ugly to download the entire thing to disk, unzip it and then re-upload each file to the object store. Double ugly if the only local operation we perform is the decompression of the archive file.

Let’s look at a quick method for stream decompression of a zip archive using Go, and for the sheer hell of it we’ll throw deployment to AWS Lambda into the bargain. The streaming approach works particularly well in a serverless context owing to the memory quota applied to Lambda functions. Although that ceiling was recently raised significantly (to 10 GB), if we can unzip the file using a pipe we shouldn’t have to worry about memory constraints at all. It’s worth pointing out that this approach works equally well in a VM or containerized runtime and offers the same benefits.

Go

The guts of our code will consist of two functions running concurrently: one to read data from a remote file and write it to a pipe; and a second function that will read data from the pipe, process it and write the result to a remote file.

We can leverage the AWS Golang SDK to manage the upload and download of files and io.Pipe from the Go standard library to transit the bits from one location in our program to another. However, in order to use io.Pipe with the AWS SDK we need to ensure our pipe satisfies the io.WriterAt interface, which we can achieve thusly:

Tip of the hat to Dávid Mikuš for this insight
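A minimal sketch of such a wrapper, using only the standard library. The type and function names are hypothetical and this is not necessarily the code from the original post. The wrapper ignores the offset entirely, so it is only correct when chunks are written strictly in order:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// sequentialWriterAt adapts any io.Writer to the io.WriterAt interface by
// ignoring the offset. This is only correct when the caller writes chunks
// strictly in order, which is an assumption about how it is used.
type sequentialWriterAt struct {
	w io.Writer
}

// WriteAt discards the offset and appends p to the underlying writer.
func (s sequentialWriterAt) WriteAt(p []byte, _ int64) (int, error) {
	return s.w.Write(p)
}

// writeInOrder simulates a caller that writes consecutive chunks at
// increasing offsets, and returns what ended up in the underlying writer.
func writeInOrder(chunks ...string) string {
	var buf bytes.Buffer
	var w io.WriterAt = sequentialWriterAt{w: &buf}
	var off int64
	for _, c := range chunks {
		n, err := w.WriteAt([]byte(c), off)
		if err != nil {
			panic(err)
		}
		off += int64(n)
	}
	return buf.String()
}

func main() {
	fmt.Println(writeInOrder("stream", "ing ", "unzip")) // prints "streaming unzip"
}
```

Because an io.PipeWriter is itself an io.Writer, the same struct can wrap the write end of a pipe.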

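The idea is to wrap an io.Writer in a small struct with a custom WriteAt method that ignores the offset and simply writes. We can then wrap our io.PipeWriter and pass it into the S3 download manager. This is only safe when the manager fetches parts sequentially, so its concurrency should be set to 1: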

Using AWS S3 download manager with io.Pipe

Now that there is data in the pipe we can start processing it. The zipstream package takes an io.Reader, meaning we can plug the io.PipeReader directly in and start reading from the pipe:

Reading data from io.Pipe using zipstream

The zipstream package provides us with the header and bytes for each file. We can use these to construct the PutRequest for S3:

Writing data from zipstream Reader to S3 via upload manager
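The read-then-upload loop can be sketched without either dependency by hiding them behind two small, hypothetical abstractions. entryReader captures what the text says the zipstream reader gives us (a header per file, then that file's bytes), and putFunc stands for a call to the S3 upload manager; in aws-sdk-go v1 that would likely be s3manager's Upload with an UploadInput carrying Bucket, Key and Body. Both are assumptions rather than the post's actual code, and fakeArchive exists only to exercise the loop:

```go
package main

import (
	"archive/zip"
	"fmt"
	"io"
	"strings"
)

// entryReader is what the loop needs from a streaming zip reader: Next
// advances to the next file and returns its header, Read yields that file's
// decompressed bytes. Assumption: the zipstream reader is used this way.
type entryReader interface {
	Next() (*zip.FileHeader, error)
	io.Reader
}

// putFunc (hypothetical) uploads body under key, e.g. via an upload manager.
type putFunc func(key string, body io.Reader) error

// extractAll walks the archive entry by entry and hands each file to put.
func extractAll(zr entryReader, put putFunc) (int, error) {
	count := 0
	for {
		hdr, err := zr.Next()
		if err == io.EOF {
			return count, nil // no more entries
		}
		if err != nil {
			return count, fmt.Errorf("reading zip entry: %w", err)
		}
		if strings.HasSuffix(hdr.Name, "/") {
			continue // directory entry: no data to upload
		}
		if err := put(hdr.Name, zr); err != nil {
			return count, fmt.Errorf("uploading %s: %w", hdr.Name, err)
		}
		count++
	}
}

// fakeArchive is an in-memory stand-in used to exercise extractAll.
type fakeArchive struct {
	names []string
	data  map[string]string
	cur   io.Reader
}

func (f *fakeArchive) Next() (*zip.FileHeader, error) {
	if len(f.names) == 0 {
		return nil, io.EOF
	}
	name := f.names[0]
	f.names = f.names[1:]
	f.cur = strings.NewReader(f.data[name])
	return &zip.FileHeader{Name: name}, nil
}

func (f *fakeArchive) Read(p []byte) (int, error) { return f.cur.Read(p) }

// demo runs extractAll over a fake archive and collects the "uploads".
func demo() map[string]string {
	arc := &fakeArchive{
		names: []string{"docs/", "docs/a.txt", "b.txt"},
		data:  map[string]string{"docs/a.txt": "alpha", "b.txt": "beta"},
	}
	out := map[string]string{}
	_, err := extractAll(arc, func(key string, body io.Reader) error {
		b, err := io.ReadAll(body)
		out[key] = string(b)
		return err
	})
	if err != nil {
		panic(err)
	}
	return out
}

func main() {
	fmt.Println(demo()) // prints "map[b.txt:beta docs/a.txt:alpha]"
}
```

Skipping names that end in "/" is a guess at a sensible default, since directory entries carry no data. Whether it resolves the sub-directory problem mentioned in the conclusion has not been verified.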

That’s the nuts and bolts of our streaming unzip. Now we just need to wrap this with some concurrency in order to read from the pipe as soon as some data is present. Go’s built-in concurrency makes this very easy:

Our functions made concurrent. Note that we should close the pipe in our download function once we’re finished with it. As the saying goes, “it’s the old broom that knows all the dirty corners”.

You can see here we have wrapped our download and unzip code in goroutines, allowing them to run concurrently. We use a WaitGroup to co-ordinate matters and to signal that we are finished.
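The coordination described above might look roughly like the following, with the download and unzip stages reduced to plain functions over the two ends of the pipe. runPipeline and demo are hypothetical names, and the stages simply copy a string where the real code would call S3 and zipstream:

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"sync"
)

// runPipeline connects produce and consume with an io.Pipe, runs them in
// separate goroutines and returns the error from each side.
func runPipeline(produce func(io.Writer) error, consume func(io.Reader) error) (perr, cerr error) {
	pr, pw := io.Pipe()
	var wg sync.WaitGroup
	wg.Add(2)

	go func() { // "download" side
		defer wg.Done()
		perr = produce(pw)
		// Closing the writer is what lets the reader see EOF (or the error).
		pw.CloseWithError(perr)
	}()

	go func() { // "unzip and upload" side
		defer wg.Done()
		cerr = consume(pr)
		// Closing the reader unblocks the producer if we stopped early.
		pr.CloseWithError(cerr)
	}()

	wg.Wait()
	return perr, cerr
}

// demo pushes a string through the pipeline and returns what came out.
func demo() string {
	var sb strings.Builder
	perr, cerr := runPipeline(
		func(w io.Writer) error {
			_, err := io.Copy(w, strings.NewReader("zip bytes would flow here"))
			return err
		},
		func(r io.Reader) error {
			_, err := io.Copy(&sb, r)
			return err
		},
	)
	if perr != nil || cerr != nil {
		panic(fmt.Sprint(perr, cerr))
	}
	return sb.String()
}

func main() {
	fmt.Println(demo()) // prints "zip bytes would flow here"
}
```

Closing both ends with CloseWithError matters more than it looks: if one side fails and leaves its end open, the other side blocks forever and the WaitGroup never returns.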

That’s all the plumbing we need in order to perform streaming unzip on a remote file.

Lambda

We’ll do two things in order to ready this for our serverless environment: 1) set up our handler and 2) configure AWS SAM to make deploys trivial.

Our handler function will configure an S3 client and parse the trigger event:

Nothing much to see here: we read an environment variable specifying our output bucket, configure the S3 client, set up the managers and iterate over the trigger event.
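To make the event-parsing step concrete, here is a standard-library sketch that pulls the bucket and key out of an S3 notification. In a real Lambda the aws-lambda-go events package offers a ready-made S3 event type; the hand-rolled struct below, the job type and the OUTPUT_BUCKET variable name are all assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"os"
)

// s3Event mirrors only the fields of an S3 notification that we need.
type s3Event struct {
	Records []struct {
		S3 struct {
			Bucket struct {
				Name string `json:"name"`
			} `json:"bucket"`
			Object struct {
				Key string `json:"key"`
			} `json:"object"`
		} `json:"s3"`
	} `json:"Records"`
}

// job (hypothetical) is one archive to unzip and where to put the results.
type job struct {
	SrcBucket, SrcKey, DstBucket string
}

// jobsFromEvent decodes the raw event and pairs each record with dstBucket.
func jobsFromEvent(raw []byte, dstBucket string) ([]job, error) {
	var ev s3Event
	if err := json.Unmarshal(raw, &ev); err != nil {
		return nil, fmt.Errorf("decoding event: %w", err)
	}
	jobs := make([]job, 0, len(ev.Records))
	for _, rec := range ev.Records {
		// Object keys arrive URL-encoded in S3 notifications
		// (a space becomes "+"), so decode before using them.
		key, err := url.QueryUnescape(rec.S3.Object.Key)
		if err != nil {
			return nil, fmt.Errorf("decoding key %q: %w", rec.S3.Object.Key, err)
		}
		jobs = append(jobs, job{
			SrcBucket: rec.S3.Bucket.Name,
			SrcKey:    key,
			DstBucket: dstBucket,
		})
	}
	return jobs, nil
}

func main() {
	raw := []byte(`{"Records":[{"s3":{"bucket":{"name":"ingest"},"object":{"key":"my+package.zip"}}}]}`)
	// OUTPUT_BUCKET is a hypothetical variable name; the post does not name it.
	jobs, err := jobsFromEvent(raw, os.Getenv("OUTPUT_BUCKET"))
	fmt.Println(jobs, err)
}
```

The URL-decoding step is easy to miss: a key such as "my package.zip" arrives as "my+package.zip", and fetching it undecoded fails with a missing-object error.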

We’ll assume the installation of the AWS SAM CLI is taken care of and skip on to configuring the SAM template. The following sets up an ingest bucket and an output bucket, configures a trigger on the ingest bucket and sets some (pretty open) permissions on the buckets:

SAM template deploying two S3 buckets, lambda function and S3 trigger

In the AWS::Serverless::Function configuration we’re setting the CodeUri and Handler to point at a zip file and binary generated using the following Makefile:

You can now deploy using something like the following (assuming you have your profiles configured correctly):

AWS_PROFILE={PROFILE_NAME} make deploy

Conclusion

That’s it. It’s worth pointing out that this is a pretty simple demo; we should add some integration tests, better logging and do some load testing before unleashing it into the wild. One issue I’ve found is that this method doesn’t handle sub-directories well and tends to puke when encountering them in the zip file.

Other archive formats may support this type of operation with greater ease but I find this to be an elegant and interesting solution. It provided me with an afternoon’s contentment and that’ll ‘bout do it betimes.

You can find the full code in the following repo: https://github.com/phoban01/streaming-unzipper.
