Over the last couple of weeks, we’ve done some major re-architecting and have hit many of our performance goals. If you’re not familiar with Poptip, this is what we do.
Taking inspiration from the Etsy Code as Craft post on performance, we decided to highlight some findings and results from our performance and sustainability project called “Project Winter.” This is PART ONE covering our needs and the tools we used to address those needs. PART TWO will cover our performance results.
Having built the first iteration of Poptip in a bit of a hurry, we never set up continuous integration or a uniform testing suite. We couldn’t handle traffic the way we wanted to, and as our data set grew, it was time to figure out long-term solutions for data processing and code sustainability.
Why we needed Project Winter
Poptip’s initial application design was not optimal for scalability, speed, or stability for a number of reasons. For example…
- Multiple application layer instances could not be brought up in tandem, leaving a single point of failure if the machine hosting the instance were to go down or an application crash were to occur.
- All functions of the service were handled by a single process, resulting in exceptionally high CPU and network load on an instance during peak usage (such as a high-volume poll), increasing latency for all users.
- Many tasks, such as serving HTML, processing Twitter data, and maintaining socket connection states, could be separated into their own processes to provide cleaner abstraction layers, reduce single points of failure, and decrease latency by allowing certain components to scale horizontally.
Continuous Integration + Testing Needs
In addition to adhering to the documentation and test-writing practices laid out in our JSDoc guidelines, we needed to decide on a CI server, tools for unit tests, and tools for integration tests so that most (if not all) of our tests would be uniform.
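For context on the documentation side, our JSDoc guidelines boil down to annotating every exported function with its parameters and return value. A representative (hypothetical, not actual Poptip code) example:

```javascript
/**
 * Tallies votes for each poll option.
 * (Hypothetical example of our JSDoc conventions.)
 *
 * @param {Array.<string>} votes Raw option names, one entry per vote.
 * @return {Object.<string, number>} Map of option name to vote count.
 */
function tallyVotes(votes) {
  var counts = {};
  votes.forEach(function (option) {
    counts[option] = (counts[option] || 0) + 1;
  });
  return counts;
}
```

So `tallyVotes(['yes', 'no', 'yes'])` returns `{ yes: 2, no: 1 }`.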
Continuous Integration + Testing Tools
- CI Server: Atlassian Bamboo - well documented, most of our tools work directly with Bamboo, and it accepts JUnit output as well.
- AWS CI Server Plugin: Atlassian Bamboo plugin.
- Unit Tests: Nodeunit - easy unit testing in Node.js and the browser, based on the assert module. Has JUnit output for Bamboo.
- Automation plugin (Cucumber + Bamboo): Cucumber plugin.
Instrumentation + Monitoring Needs
Multiple tiers of instrumentation and monitoring were needed:
- Machine-based. If a machine goes down, has high CPU usage, etc., alerts need to fire on the appropriate email/HipChat/paging channels. Whoever is “on duty” should be able to see at a glance which machine is ailing and take steps to mitigate the problem without much overhead.
- Process-based. EVERYTHING needed to be instrumented. From the initial request to the database call and back again, all actions were to be tracked and timed. This helps us identify performance bottlenecks, and the information collected should also be useful when a machine starts acting up, whether during an episode or for the post-mortem.
- User-based. All user actions should be recorded and instrumented.
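To make “instrument everything” concrete, the core pattern is a timer wrapped around each unit of work, from request handling down to database calls. A minimal sketch (`record` is a hypothetical stand-in for whatever metrics sink you use, such as StatsD or logs):

```javascript
// Hypothetical metrics sink; swap in StatsD, logging, etc.
function record(name, ms) {
  console.log(name + ': ' + ms.toFixed(2) + 'ms');
}

// Wrap a callback-style async function so every call is timed.
function timed(name, fn) {
  return function () {
    var start = process.hrtime();
    var args = Array.prototype.slice.call(arguments);
    var cb = args.pop();
    args.push(function () {
      var diff = process.hrtime(start);
      record(name, diff[0] * 1e3 + diff[1] / 1e6); // seconds + ns, as ms
      cb.apply(null, arguments);
    });
    fn.apply(null, args);
  };
}

// Usage: wrap a (hypothetical) database call.
var findUser = timed('db.users.find', function (id, cb) {
  setImmediate(function () { cb(null, { id: id }); });
});
```

Wrapping at the boundaries (HTTP handler in, DB driver out) is usually enough to localize a bottleneck without touching every function.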
Note: We tried New Relic and were pleased with the initial results. However, because we run a Node.js and MongoDB stack and New Relic’s Node support is still in alpha/beta, we’re going to hold off on using New Relic in production.
Code Review Needs + Guidelines
Because we don’t have an enormous engineering team (we’re still growing), pair programming just isn’t practical, and we’re not sure we’d want to pair even if we could. However, code review as a means of learning, code sustainability, and familiarity with the codebase was important. If you need more reasons why code review is good, check here.
The below is taken directly from our internal doc on code review practices. Some of this may seem obvious, but it was important for us to articulate so that all team members (whether seasoned or not) were aware of expectations.
You are expected to respond to a review request immediately. If you have something else you need to get done or won’t be able to get to it soon, it is your responsibility to communicate that to the author. Even if you’re reviewing it right then, saying so at least lets the author know that a response is coming soon.
The reason for this is that we want to raise the quality of our code while keeping development velocity high. Don’t block other people’s work, and don’t encourage going around the review system because getting code committed takes too long. It will take extra effort on our part at first, but it will pay off as it becomes reflex.
Who reviews what?
Chances are we’ll be posting reviews in HipChat on a whoever-is-available basis, but in some cases (e.g., when you’re making substantial changes to code you did not originally write), the original author (or domain expert) will need to accept the patch before it can be committed.
What requires a review?
Everything. The only exception is trivial changes like whitespace, but if possible, just roll those into a larger change. This, combined with unit testing, will enable us to deploy continuously to production with far less worry than before.
Size of reviews
Enormous reviews put a significant burden on the reviewer. Where possible, try to keep them at a manageable size, where “manageable” is something the team determines through trial and error. You’ll know an unmanageable review when you see it.
Code Review Tools
- Phabricator - review others’ code with Differential to see what has changed, with easy commenting on a per-line or top-level basis.
Look for PART TWO, where we’ll release the results of our performance improvements. They are dramatic, and they truly show the importance of development best practices after a period of rapid (lean) development.