Musings of Performance Past

I once worked in a bit of a dream scenario. My job was simple: render a web page. Lots of times. Generally this involved background processing and caching, which I found to be fun and full of surprises. The evolution of this solution is an interesting story, to be sure, and one I'm excited to tell. Many long hours were spent poring over log messages and building tools, agonizing over every detail. I'm proud of what was accomplished, and I hope that others can appreciate the hard-learned lessons and unabashed creativity involved.

Simple, First

The project was a week or two old by the time I was asked to join the effort. It was a fresh start, from scratch, to build something we could use to serve content at our large scale. My first commit probably involved fixing a couple of minor bugs in the script we ran via cron. Actually, it wasn't really even cron! It was a cron-like module contending with the event loop to schedule things. It worked for the modest start we had, but I had been down a road like this before.

As more data was added to the job, it ran longer, eventually running past the cadence set in cron. Before one run finished, another began. Like any good programmer confronted with an inconvenient timeout scenario, we extended it. Every thirty seconds became every forty-five, then every sixty. When would the madness end? I had experienced similar issues in my .NET days while using the stock timer that comes with the framework (sadly, I've been both the creator and the inheritor of these bastardizations). Being far from my first rodeo, I had a clear idea of what wouldn't work.

Aggregators

One of the first innovations happened when we realized what the data ingest was really trying to do. For any given page being built in the app, there were a number of distinct requests made to the news API. The requests were related by the page, meaning we could aggregate all of them into a single view model. Emergent patterns like this are the bread and butter of refinement.

Aggregators, as they would come to be known, provided a fixture that abstracted away all of the common code for making requests. From this we got a nice little package with a name, fully decoupled from the rest of the application. It knew nothing about storing data or scheduling execution, only how to export a key/value when everything was said and done.

Here's a basic example of how an aggregator was created:

const Aggregator = require('./Aggregator');
const aggregator = new Aggregator('some-clever-name');

// Shared state for this aggregator; endpoint callbacks mutate it as they complete.
const model = {};

// Declare a request; complete only runs when the request succeeds.
aggregator.endpoints.push({
  url: 'http://www.timeapi.org/utc/now.json',
  complete: (data) => {
    model.time = data;
  },
});

// Expose the collected data under a key the worker can persist.
aggregator.export('current_time', () => {
  return model.time;
});

module.exports = aggregator;

These aggregators were teeming with introspection. In many ways, you can think of the worker process as one big aggregator introspector. When building out an aggregator, you're providing declarative information, like what URLs you want to request. You're also exporting keys, which is how data is retrieved from the aggregators at any point in time. The worker process code is free to look at that data, allowing it to organize the workloads as it chooses. More on this later.
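
Here's roughly what that introspection looked like from the worker's side. Treat this as a sketch: the ./aggregators directory convention and the exports map keyed by exported name are assumptions, not the exact internals.

const fs = require('fs');
const path = require('path');

// Load every aggregator module from a conventional directory.
const dir = path.join(__dirname, 'aggregators');
const aggregators = fs
  .readdirSync(dir)
  .map((file) => require(path.join(dir, file)));

for (const aggregator of aggregators) {
  // Purely declarative information: what the aggregator intends to request...
  console.log(`${aggregator.name}: ${aggregator.endpoints.length} endpoint(s)`);
  aggregator.endpoints.forEach((endpoint) => console.log(`  -> ${endpoint.url}`));

  // ...and which keys it will export once a run is complete (assumed shape).
  console.log(`  exports: ${Object.keys(aggregator.exports).join(', ')}`);
}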

It is worth pointing out that the aggregators weren't exclusively declarative. You'll notice that the module closes over some state in the form of the variable model. The export should do nothing but return some kind of data; there are plenty of hooks for processing it at whatever point you need. Every time an aggregator runs, it executes the callback associated with each endpoint's URL, and that callback can modify the model. The callback is only invoked if the request was successful, which is important because it provides a level of robustness for the worker process! Note that common sense should prevail while modifying this shared state. Willy-nilly calls to Array.push() on the model's arrays are a recipe for disaster: the arrays grow without bound, which is essentially a memory leak, and should be avoided! You might wonder how I know that. I did it on a couple of occasions, and helped others avoid it through code reviews and a few hard production fire drills.
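
To make the pitfall concrete, here's a sketch continuing the example above with a hypothetical stories endpoint; the first callback grows the model forever, the second keeps it bounded.

model.stories = [];

// BAD: appending on every successful run grows model.stories without bound
// across runs, which is effectively a memory leak in a long-lived worker.
const leakyComplete = (data) => {
  data.stories.forEach((story) => model.stories.push(story));
};

// BETTER: replace the array wholesale so each run reflects only the latest data.
const boundedComplete = (data) => {
  model.stories = data.stories;
};

aggregator.endpoints.push({
  url: 'http://example.com/stories.json', // hypothetical endpoint for illustration
  complete: boundedComplete,
});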

Consuming APIs is anything but fully reliable. Because the aggregator closed over its state, each run was able to start from the last known good state. The result was that if there were intermittent failures while consuming an endpoint, the model would simply remain at the last known good state. The rendering application enjoyed a long period of stability from this, though it was not without its problems. One major issue that manifested from this approach was the perception of reversions in the data model: the service provided the last known good version because some portion of the newest data was unavailable (API gateway timeouts, error responses, etc.). I believe this was a reconcilable flaw.

Over time we realized a need to make a second round of requests within aggregators. As it turned out, the aggregators could be leveraged to carry out these subsequent tasks. That is when I added an after function that could be specified for an endpoint, which takes a context (the child aggregator). You could even specify a complete function to execute after the aggregator was done (this always happened at the end of a run, and was guaranteed to finish before export functions were called).
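
I don't remember the exact signatures, so consider this a rough reconstruction of the shape rather than the precise API; the child aggregator context and the way the complete hook is attached are assumptions.

aggregator.endpoints.push({
  url: 'http://www.timeapi.org/utc/now.json',
  complete: (data) => {
    model.time = data;
  },
  // Runs after the endpoint finishes; the context is a child aggregator,
  // so follow-up requests can be declared against it.
  after: (child) => {
    child.endpoints.push({
      url: 'http://example.com/related.json', // hypothetical follow-up request
      complete: (data) => {
        model.related = data;
      },
    });
  },
});

// Runs once at the end of every run, before any export functions are called.
aggregator.complete = () => {
  model.updatedAt = Date.now();
};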

The aggregators also had a built-in capability to measure the time it took to make each request and the status of the last response received. At various points in the worker's heyday this information was critical for tracking down upstream issues in the news API, and even in the worker itself. Sometimes they were quirky API gateway bugs, and in some cases they were traceable to document database replication errors.

Arguably the largest triumph of the aggregators was how much flexibility they provided as we tried different ways to iterate and refine our performance strategy.

Declarative Payoff

Every aggregator began its life with a name, which was useful for referring to specific groups of data or tasks in the worker. It allowed us to specify a whitelist or blacklist of aggregator names for the worker to run, configurable from a command line parameter (aggregators=mainpage) or, more commonly, the environment variable ENABLED_WORKER_AGGREGATORS.
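
The filtering itself was trivial once every aggregator carried a name. A sketch of the whitelist case, assuming the worker already has the full list of aggregators loaded:

// Prefer the command line parameter (aggregators=mainpage), fall back to the
// environment variable, and run everything when neither is set.
const cliArg = process.argv.find((arg) => arg.startsWith('aggregators='));
const rawList = cliArg
  ? cliArg.split('=')[1]
  : process.env.ENABLED_WORKER_AGGREGATORS;

const enabled = rawList ? rawList.split(',') : null;

const toRun = enabled
  ? aggregators.filter((aggregator) => enabled.includes(aggregator.name))
  : aggregators;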

The worker process also used this information to generate a dashboard website, which let you explore the vast amounts of collected data. You could see each request the worker was making, make the request through the app to see the current state of the endpoint, and fetch the redis key to see the current state of the model. I can't express how truly useful this was to me over time.

Another angle on this was a development-time feature for the rendering app (available at ~/dashboard/refresh/ui) that allowed you to kick off a background process to run a specific aggregator. This was developed primarily for lightweight hosting providers, in the absence of a dedicated, always-running worker.

Since the aggregators provided all this great information, we were able to leverage it to generate some documentation, such as the keys for data that would ultimately be stored in redis, and the URLs requested to produce the data for those keys. Time and again this documentation helped communicate what data the rendering application was consuming from the news API.
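
Generating that documentation amounted to another walk over the declared data. Something along these lines, with the markdown output and the exports map being assumptions of the sketch:

const fs = require('fs');

// Emit a simple reference of exported keys and the upstream URLs behind them.
const lines = [];
for (const aggregator of aggregators) {
  lines.push(`## ${aggregator.name}`);
  lines.push(`Keys: ${Object.keys(aggregator.exports).join(', ')}`);
  aggregator.endpoints.forEach((endpoint) => lines.push(`- ${endpoint.url}`));
}

fs.writeFileSync('AGGREGATORS.md', lines.join('\n'));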

Spikes

One time a spike was created to leverage the existing aggregators and expose them as endpoints of their own API. It took fifteen minutes to spin up an express app and write some generalized code to create a route for each aggregator by its name. As a proof of concept it demonstrated the high degree of flexibility that the aggregators possessed.
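
The spike was more or less the following; the route naming and the readModel() lookup (whatever pulled the exported keys back out of redis) are assumptions from memory.

const express = require('express');
const app = express();

aggregators.forEach((aggregator) => {
  // One route per aggregator, keyed by its name.
  app.get(`/${aggregator.name}`, async (req, res) => {
    const model = await readModel(aggregator); // hypothetical redis lookup
    res.json(model);
  });
});

app.listen(3000);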

Another time I added a compression step to the persistence layer around aggregators, yielding 60%-90% reductions in the payloads transferred to and from redis. If I remember right, I read somewhere that StackOverflow used this technique to eke out some extra performance with a similar setup.
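
The compression step was nothing exotic: gzip the serialized model on the way into redis and gunzip it on the way out. Roughly this, assuming a node redis client configured to hand back buffers:

const zlib = require('zlib');

function saveModel(redisClient, key, model, callback) {
  const json = JSON.stringify(model);
  zlib.gzip(json, (err, buffer) => {
    if (err) return callback(err);
    // Store the compressed buffer; large view models shrank dramatically.
    redisClient.set(key, buffer, callback);
  });
}

function loadModel(redisClient, key, callback) {
  redisClient.get(key, (err, buffer) => {
    if (err || !buffer) return callback(err, null);
    zlib.gunzip(buffer, (unzipErr, json) => {
      if (unzipErr) return callback(unzipErr);
      callback(null, JSON.parse(json));
    });
  });
}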

Cadence

Originally the worker was set up to run as a cron job. There was an npm package that emulated cron for the purpose of scheduling workloads. As the worker ramped up in the variety and amount of data it had to aggregate, cron became increasingly irrelevant and even problematic. Running all aggregators to completion a single time could take longer than the desired cron schedule, and choosing to skip the current cycle or cancel the previous one could result in extensive delays in updating the view models.

The next change was to simply let the process run to completion and then incur a throttle time, essentially sleeping the worker for a configured duration. This worked out much better, especially when we had opportunities to spin up workers dedicated to a subset of aggregators. Specifically, it was a great success while covering an election night.
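
The loop ended up being about as simple as it sounds. A sketch, where runAll() and the WORKER_THROTTLE_MS variable are stand-ins rather than the real names:

const THROTTLE_MS = parseInt(process.env.WORKER_THROTTLE_MS, 10) || 30000;

function loop() {
  // Run every enabled aggregator to completion, then sleep for the throttle duration.
  runAll(toRun, () => {
    setTimeout(loop, THROTTLE_MS);
  });
}

loop();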

A problem we experienced with the worker was that it would inexplicably hang for unbounded periods of time. Admittedly, this could probably be narrowed down to a workload too large to be efficiently coordinated in a single instance, and likely some blatantly poor coding around the asynchronous tasks. It could have been handled more effectively with strict timeouts and more robust handling of unexpected or unusual behavior in the service.
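
Strict timeouts could have looked something like this: race each request against a deadline so a hung endpoint can no longer stall the entire run. We never shipped this, so it's purely a sketch.

const TIMEOUT_MS = 10000;

function withTimeout(promise, label) {
  const deadline = new Promise((resolve, reject) => {
    setTimeout(() => {
      reject(new Error(`${label} timed out after ${TIMEOUT_MS}ms`));
    }, TIMEOUT_MS);
  });
  // Whichever settles first wins.
  return Promise.race([promise, deadline]);
}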

One attempt at solving this inexplicable hanging problem (spoiler alert: it solved nothing) was to use a heartbeat to detect when the worker ran over on its usual workload. The heartbeat module took a timeout duration and expected to be notified with a "beat" or "tick" to indicate normal behavior. That is to say, the application would say "Hey heartbeat, checking in because I finished my work. About to start again." and the heartbeat would reply "Okay, see you next tick." In the absence of a timely tick, the heartbeat would log a single exception and wait around to see if the latent tick would eventually happen. When it did, the heartbeat would happily return to normal function. It was simultaneously enlightening and annoying. I always intended to do something more sophisticated when the module tripped, but never got around to it. Many Splunk alerts were sent on behalf of this heartbeat!
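
The heartbeat itself was only a few lines. The exact module is lost to time, so treat this as a reconstruction of the behavior described above:

function createHeartbeat(timeoutMs, onMiss) {
  let timer = null;

  function arm() {
    clearTimeout(timer);
    // If no tick arrives before the timeout, log once and keep waiting.
    timer = setTimeout(onMiss, timeoutMs);
  }

  return {
    // "Hey heartbeat, checking in because I finished my work."
    tick: arm,
    start: arm,
  };
}

const heartbeat = createHeartbeat(5 * 60 * 1000, () => {
  console.error('worker heartbeat: missed the expected tick');
});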

Rather than being a brainless scraper, the worker should have become event driven. I later learned that there were only a few thousand changes throughout the day that would necessitate updates to our view models. The aggregation could still happen, but rather than being scheduled it could be initiated by editorial events. That would have been a significant reduction in the total amount of work needed.

Worker as a Dev Tool

Rather than demand that a developer run redis in their local environment, the worker could write to static JSON files. This made the designers' lives much easier: the files were readily available for modifying the data used to render. This continued for basically the full lifetime of the worker, and was eventually leveled up by creating a development proxy service devoted to this abstraction.
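
The switch was just an alternate persistence target behind the same interface. A sketch, with the WORKER_STORE flag and createRedisStore() being assumptions:

const fs = require('fs');
const path = require('path');

function createStore() {
  if (process.env.WORKER_STORE === 'file') {
    // Development: write each exported key to a static JSON file that the
    // rendering app (and designers) can open and edit directly.
    return {
      set: (key, value, callback) => {
        const file = path.join(__dirname, 'data', `${key}.json`);
        fs.writeFile(file, JSON.stringify(value, null, 2), callback);
      },
    };
  }
  // Production: the usual redis-backed store (details omitted here).
  return createRedisStore();
}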

Analyzing Performance

I would be remiss not to mention how we analyzed the performance and behavior of the worker through Splunk. The worker wrote some good information out to its logs (mind you, this was pre-JSON-formatted logs). As aggregators ran, it reported when each one finished, its name, and the amount of time it took. It also logged "render complete" entries with the total time it took for the worker to finish one full iteration. By charting this over time, we were able to identify periods of low performance, and even correlate them to upstream problems caused by document database replication and the news API. It was essential to establishing an accurate picture of what normal looked like for the worker.

Fate

The worker was eventually retired in favor of a more on-demand API transformation layer. It, too, was an interesting project, but that's probably a story for another time. If I could do it all over again, I'd start with the event-driven approach, but I would explore adapting this worker's aggregators to see how well they would hold up. Rather than making a request, the aggregator would be handed the event data to act on, and it would continue to use its notion of exported keys to persist the view model in some data store such as redis.