Welcome to Yammer Engineering

We're on a mission to create enterprise software that people actually want to use. We are a social media startup backed by exceptional leadership and a validated revenue-generating business model. At Yammer, we value engineering excellence and product clarity above all else.

Sign up for Yammer
Yammer Jobs

Yammer is hiring!
Flickr
« New API Documentation | Main | X-env JavaScript and Testing »
Tuesday
Aug032010

Our Realtime Future

When I started at Yammer, Adam (our CTO and mascot) and I sat down and worked out a rough architectural roadmap for Yammer. High on our list was a stable, scalable system for the realtime (i.e., sub-second latency) delivery of messages for two major reasons: user experience and operational efficiency.

First, part of our World Domination Plan is to extend Yammer to include conversations of all types. We use Yammer internally as an almost complete replacement for email, and its pub/sub messaging style allows group messaging to scale gracefully. But realtime conversations like group chat or instant messaging are impossible when messages can take up to 30 seconds to arrive. With realtime delivery of messages, we can take Yammer's existing functionality—searchability, attachments, bookmarks, etc.—and apply it to realtime conversations, allowing Yammer to encompass the entire range of conversation types.

Second, the delivery of ephemeral messages can be done at much greater scale with much less resources than the constant querying of persistent state. Consider a common case: if a user watches their following feed for an hour and 40 new messages arrive, we need to respond to 120 polling requests (one poll every 30 seconds). With a push-based system, we only need to deliver the 40 messages.

And so I started to plan out Artie, our realtime message delivery service.

Using Scala

After an overview of various server-side technologies we could use, we knew early on that we wanted to run on the Java Virtual Machine (JVM). Among other things, it has

  • a high performance Just-In-Time compiler which after a few iterations produces code approaching C in speed
  • a highly concurrent and incredibly tunable generational garbage collector
  • appealing operational capabilities, like sampling remote profiling, remote monitoring and management via JMX, etc.
  • a wide variety of mature, well-reviewed libraries

Our initial prototype of Artie was in Java, but as a weekend experiment I tried reimplementing it in Scala 2.8. After a day, I had dropped about half the lines of code and added several tricky features. I was sold. It might be easier to hire Java developers, but a Scala team will be able to get a lot more done.

Artie uses simple-build-tool (SBT), an amazingly flexible build tool written in Scala, to manage the compilation, the dependencies, running tests, etc., and I personally use IntelliJ IDEA as an IDE. (Though that's more a matter of taste than anything.)

Despite its tight integration with Java, it's entirely possible to deploy a Scala application sanely. Instead of using an application server like Tomcat or Glassfish, we use SBT to build a JAR file containing all of Artie's dependencies with Jetty embedded and a simple runner class. Starting the Artie server is just an init.d script which calls java -jar /opt/artie/current.jar with some options. To deploy a new version, we use Capistrano to log into the Artie machines and have them pull the latest build from Hudson, swap a symlink, and restart the process.

Using Bayeux

After choosing a platform, we then needed to determine what sort of interface our push service would have. After looking at the myriad options available for Comet web services, we chose Bayeux for two major reasons.

First, it has robust, maintained implementations for both clients and servers. Artie is based on CometD, a Java implementation of a Bayeux server which uses the wonderful, stable, Jetty HTTP server for most of the heavy lifting. The ability to piggy-back on a widely-deployed, well-reviewed open source project allowed us to implement a production-quality service in just a couple of months. Likewise, CometD has robust, high-quality Javascript client which can integrate into a variety of Javascript frameworks like Dojo and jQuery.

Second, Bayeux was designed with real, everyday browsers in mind. Simpler push technologies, like long-held, chunked HTTP responses, are conceptually much cleaner (as well as being easier to implement) at the expense of interoperability with "legacy" browsers, funky antivirus proxies, content blockers, misconfigured routers and VPNs, and other accouterments of modern corporate IT. Despite its relative complexity, Bayeux successfully works in even the most benighted desktop environments by downgrading to less elegant but more widely-supported push transports like JSONP when more the more straight-forward long-polling options are unavailable.

In broad strokes, Bayeux is a JSON-based pub/sub Comet protocol. Clients— usually Javascript running in browsers—perform a handshake with Artie, passing in an authentication token and receiving a session ID. Clients then negotiate a message transport based on each client's capability. CometD supports both long-polling, in which poll requests are held open until a message arrives or a timeout period expires, and JSONP, in which long-polling is done via dynamically generated <script> elements and Javascript callbacks.

A client determines which messages they receive by subscribing to or unsubscribing from different channels. A channel is a hierarchical identifier (e.g., /channels/news) by which messages can be routed. Wildcard channels (e.g., /channels/news/*) and recursive wildcard channels (/channels/news/**) are allowed, but not at the root level (e.g., /* or /**).

Artie maps Yammer's idea of a message feed onto Bayeux channels—each message feed has its own channel. A message feed with a feed key of f9j29Jm is equivalent to the channel /feeds/f9j29Jm. Since each Artie session is authenticated, each messages passes through a security filter before delivery to ensure that no messages are published outside of a valid authentication context.

Integrating Artie

When a message is created, liked, or bookmarked via Yammer (i.e., whenever a message is delivered to a feed) it serializes the message's JSON representation and some meta-information (e.g., message network, the feed keys of the feeds it belongs in) and publishes it as a JSON object to a RabbitMQ fanout exchange. Every Artie instance receives an identical copy of the message, deserializes it, and delivers its representation to any channels with current subscribers. This way message delivery is split between servers without requiring a global list of which sessions are subscribed to which feeds.

The result looks like this:

architectural diagram of Artie

We decided to use RabbitMQ instead of reliable UDP multicast mainly because we already have RabbitMQ installed and monitored as a critical component of our infrastructure. The number of messages generated per second (as opposed to the number of messages delivered) is well within the throughput limits for RabbitMQ, though we may move to a multicast technology like ZeroMQ to remove RabbitMQ as a single point of failure.

This broadcast approach leverages the pub/sub model to allow our main Rails application to remain unaware of the routing details of which users have sessions on which servers as well as allowing us to horizontally scale Artie instances to increase our capacity for realtime users. This broadcast architecture also allows us to build other realtime services alongside Artie, like search indexing, analytics, or other push delivery systems.

The End Result

We're currently in the process of rolling out realtime delivery to our users; over half of our more than one million users have realtime delivery enabled, and end-to-end latency is in the tens of milliseconds; from the human perspective, delivery is instantaneous.

We're gathering about 200 different metrics on Artie's behavior—from performance to user behavior to simple JVM load—via Ganglia and using a set of custom scripts to generate an aggregate dashboard. Having such rich metrics makes a rollout of a new service much less stressful, not to mention the fact that it gives us deep insight into how user behavior is changing as more and more users have their messages delivered in realtime.

messages/sec in Artie

One of the things I love about my job is that within a week or two of starting, I was tasked with designing, implementing, and deploying a brand-new service which had to scale to more than a million users from Day 1. And given the information of how Artie has changed the way we use Yammer, I can't wait to see how the product will change to take advantage of opportunities this new technology offers.

Now that the backend work for Yammer's realtime delivery is largely done, I'm starting work on a next-generation, distributed, fault-tolerant data store for our terabytes of messages. If that and Artie sound like fun things to work on, Yammer is currently hiring for backend developers!

PrintView Printer Friendly Version

EmailEmail Article to Friend