
Our Realtime Future

When I started at Yammer, Adam (our CTO and mascot) and I sat down and worked
out a rough architectural roadmap for Yammer. High on our list was a stable,
scalable system for the realtime (i.e., sub-second latency) delivery of
messages for two major reasons: user experience and operational efficiency.

First, part of our World Domination Plan is to extend Yammer to include
conversations of all types. We use Yammer internally as an almost complete
replacement for email, and its pub/sub messaging style allows group messaging to
scale gracefully. But realtime conversations like group chat or instant
messaging are impossible when messages can take up to 30 seconds to arrive. With
realtime delivery of messages, we can take Yammer’s existing functionality—searchability, attachments, bookmarks, etc.—and apply it to realtime
conversations, allowing Yammer to encompass the entire range of conversation
types.

Second, the delivery of ephemeral messages can be done at much greater scale
with far fewer resources than the constant querying of persistent state.
Consider a common case: if a user watches their following feed for an hour and
40 new messages arrive, we need to respond to 120 polling requests (one poll
every 30 seconds). With a push-based system, we only need to deliver the 40
messages.
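
Put another way: polling cost is a function of elapsed time, while push cost is
a function of traffic. The arithmetic, restated as a snippet:

    // Back-of-the-envelope numbers from the example above.
    val pollIntervalSeconds = 30
    val pollsPerUserHour    = 3600 / pollIntervalSeconds // 120 requests, even if nothing happens
    val messagesPerHour     = 40
    val pushesPerUserHour   = messagesPerHour            // 40 deliveries, only when something happens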

And so I started to plan out Artie, our realtime message delivery service.

Using Scala

After surveying the various server-side technologies we could use, we knew
early on that we wanted to run on the Java Virtual Machine (JVM). Among other
things, it has

  • a high-performance Just-In-Time compiler which, after a few iterations, produces
    code approaching C in speed
  • a highly concurrent and incredibly tunable generational garbage collector
  • appealing operational capabilities, like remote sampling profilers, remote
    monitoring and management via JMX, etc.
  • a wide variety of mature, well-reviewed libraries

Our initial prototype of Artie was in Java, but as a weekend experiment I tried
reimplementing it in Scala 2.8. After a day, I had dropped about half the lines
of code and added several tricky features. I was sold. It might be easier to
hire Java developers, but a Scala team will be able to get a lot more done.

Artie uses simple-build-tool (SBT),
an amazingly flexible build tool written in Scala, to manage compilation,
dependencies, test runs, etc., and I personally use IntelliJ IDEA as an
IDE. (Though that’s more a matter of taste than anything.)
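
For the curious, here’s a minimal sketch of what an SBT project definition of
that era might look like; the dependency coordinates and versions below are
illustrative, not Artie’s actual build:

    // project/build/ArtieProject.scala (illustrative SBT 0.7-style build)
    import sbt._

    class ArtieProject(info: ProjectInfo) extends DefaultProject(info) {
      // hypothetical dependencies; Artie's real build differs
      val jetty  = "org.eclipse.jetty" % "jetty-server" % "7.1.6.v20100715"
      val cometd = "org.cometd.java" % "cometd-java-server" % "1.1.1"
      val specs  = "org.scala-tools.testing" %% "specs" % "1.6.5" % "test"
    }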

Despite Scala’s tight integration with Java, it’s entirely possible to deploy a
Scala application sanely. Instead of using an application server like Tomcat or
Glassfish, we use SBT to build a JAR file containing all of Artie’s dependencies
with Jetty embedded and a simple runner class. Starting the Artie server is just
an init.d script which calls java -jar /opt/artie/current.jar with some
options. To deploy a new version, we use Capistrano
to log into the Artie machines and have them pull the latest build from Hudson,
swap a symlink, and restart the process.
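
A minimal sketch of what such a runner class can look like, assuming Jetty 7’s
embedded API and CometD’s servlet (the class below is illustrative, not Artie’s
actual runner):

    // Hypothetical runner: embed Jetty and mount the Bayeux endpoint at /cometd.
    import org.eclipse.jetty.server.Server
    import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}
    import org.cometd.server.CometdServlet

    object ArtieRunner {
      def main(args: Array[String]) {
        val server  = new Server(8080)
        val context = new ServletContextHandler(server, "/")
        context.addServlet(new ServletHolder(new CometdServlet), "/cometd/*")
        server.start()
        server.join() // block until the server is stopped
      }
    }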

Using Bayeux

After choosing a platform, we then needed to determine what sort of interface
our push service would have. After looking at the myriad options available for
Comet web services, we chose Bayeux
for two major reasons.

First, it has robust, maintained implementations for both clients and servers.
Artie is based on CometD, a Java implementation of a Bayeux server which uses
the wonderful, stable, Jetty HTTP server for most of the heavy lifting. The
ability to piggy-back on a widely-deployed, well-reviewed open source project
allowed us to implement a production-quality service in just a couple of months.
Likewise, CometD has a robust, high-quality Javascript client which can
integrate into a variety of Javascript frameworks like Dojo and jQuery.

Second, Bayeux was designed with real, everyday browsers in mind. Simpler push
technologies, like long-held, chunked HTTP responses, are conceptually much
cleaner (as well as being easier to implement) at the expense of
interoperability with “legacy” browsers, funky antivirus proxies, content
blockers, misconfigured routers and VPNs, and other accouterments of modern
corporate IT. Despite its relative complexity, Bayeux successfully works in even
the most benighted desktop environments by downgrading to less elegant but more
widely-supported push transports like JSONP when the more straightforward
long-polling options are unavailable.

In broad strokes, Bayeux is a JSON-based pub/sub Comet protocol. Clients—
usually Javascript running in browsers—perform a handshake with Artie,
passing in an authentication token and receiving a session ID. Clients then
negotiate a message transport based on each client’s capability. CometD supports
both long-polling, in which poll requests are held open until a message arrives
or a timeout period expires, and JSONP, in which long-polling is done via
dynamically generated <script> elements and Javascript callbacks.

A client determines which messages it receives by subscribing to or
unsubscribing from different channels. A channel is a hierarchical identifier
(e.g., /channels/news) by which messages can be routed. Wildcard channels
(e.g., /channels/news/*) and recursive wildcard channels (/channels/news/**)
are allowed, but not at the root level (e.g., /* or /**).
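
On the wire, all of this is plain JSON. Abridged, a handshake and a subscription
look roughly like the following (field details vary by implementation, and an
authentication token would typically ride in the handshake’s ext field; note
that callback-polling is Bayeux’s name for the JSONP-style transport):

    Client handshake:
      [{"channel": "/meta/handshake", "version": "1.0",
        "supportedConnectionTypes": ["long-polling", "callback-polling"],
        "ext": {"token": "..."}}]

    Server reply, carrying the new session ID:
      [{"channel": "/meta/handshake", "successful": true, "clientId": "a1b2c3",
        "supportedConnectionTypes": ["long-polling", "callback-polling"]}]

    Client subscription:
      [{"channel": "/meta/subscribe", "clientId": "a1b2c3",
        "subscription": "/channels/news"}]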

Artie maps Yammer’s idea of a message feed onto Bayeux channels—each message
feed has its own channel. A message feed with a feed key of f9j29Jm is
equivalent to the channel /feeds/f9j29Jm. Since each Artie session is
authenticated, each message passes through a security filter before delivery to
ensure that no messages are published outside of a valid authentication context.
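
In CometD, checks like this hang off the server’s SecurityPolicy hook. A
stripped-down sketch of the subscription-time half (the class and its token
check are illustrative, not Artie’s actual filter):

    // Illustrative only: authorize subscriptions against the session's auth context.
    import org.cometd.bayeux.server.{BayeuxServer, ServerChannel, ServerMessage, ServerSession}
    import org.cometd.server.DefaultSecurityPolicy

    class FeedSecurityPolicy extends DefaultSecurityPolicy {
      override def canSubscribe(server: BayeuxServer, session: ServerSession,
                                channel: ServerChannel, message: ServerMessage): Boolean =
        // placeholder check: a real filter maps the session's token to the feeds it may read
        channel.getId.startsWith("/feeds/") && session.getAttribute("token") != null
    }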

Integrating Artie

When a message is created, liked, or bookmarked via Yammer (i.e., whenever a
message is delivered to a feed), Yammer serializes the message’s JSON
representation and some meta-information (e.g., the message’s network, the feed
keys of the feeds it belongs to) and publishes them as a JSON object to a
RabbitMQ fanout exchange.
Every Artie instance receives an identical copy of the message, deserializes it,
and delivers its representation to any channels with current subscribers. This
way message delivery is split between servers without requiring a global list of
which sessions are subscribed to which feeds.
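
A sketch of the consuming side, using the RabbitMQ Java client from Scala (the
host, exchange name, and delivery step are made up for illustration):

    // Hypothetical fanout consumer: each Artie instance binds its own queue to the exchange.
    import com.rabbitmq.client.{AMQP, ConnectionFactory, DefaultConsumer, Envelope}

    object ArtieConsumer {
      def main(args: Array[String]) {
        val factory = new ConnectionFactory()
        factory.setHost("rabbit.example.com")
        val channel = factory.newConnection().createChannel()

        channel.exchangeDeclare("messages", "fanout")
        // a server-named, exclusive queue per instance; the fanout copies every message into it
        val queue = channel.queueDeclare().getQueue
        channel.queueBind(queue, "messages", "")

        channel.basicConsume(queue, true, new DefaultConsumer(channel) {
          override def handleDelivery(consumerTag: String, envelope: Envelope,
                                      properties: AMQP.BasicProperties, body: Array[Byte]) {
            val json = new String(body, "UTF-8")
            println(json) // stand-in for deserializing and delivering to subscribed channels
          }
        })
      }
    }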

We decided to use RabbitMQ instead of reliable UDP multicast mainly because we
already have RabbitMQ installed and monitored as a critical component of our
infrastructure. The number of messages generated per second (as opposed to the
number of messages delivered) is well within the throughput limits for
RabbitMQ, though we may move to a multicast technology like ZeroMQ to remove
RabbitMQ as a single point of failure.

This broadcast approach leverages the pub/sub model: our main Rails application
stays unaware of which users have sessions on which servers, and we can
horizontally scale Artie instances to increase our capacity for realtime
users. This broadcast
architecture also allows us to build other realtime services alongside Artie,
like search indexing, analytics, or other push delivery systems.

The End Result

We’re currently in the process of rolling out realtime delivery to our
users; over half of our more than one million users
have realtime delivery enabled, and end-to-end latency is in the tens of
milliseconds. From the human perspective, delivery is instantaneous.

We’re gathering about 200 different metrics on Artie’s behavior—from performance
to user behavior to simple JVM load—via Ganglia and using a set of custom
scripts to generate an aggregate dashboard. Having such rich metrics makes a
rollout of a new service much less stressful, and gives us deep insight into
how user behavior changes as more and more users have their messages delivered
in realtime.

One of the things I love about my job is that within a week or two of starting,
I was tasked with designing, implementing, and deploying a brand-new service
which had to scale to more than a million users from Day 1. And seeing how
Artie has already changed the way we use Yammer, I can’t wait to see how the
product will change to take advantage of the opportunities this new technology
offers.

Now that the backend work for Yammer’s realtime delivery is largely done, I’m
starting work on a next-generation, distributed, fault-tolerant data store for
our terabytes of messages. If that and Artie sound like fun things to work on,
Yammer is currently hiring backend developers!