Last updated on September 21st, 2015 at 03:08 pm
At the recent @Scale Conference in San Jose, Calif., leading figures and experts in computer engineering, coding and cloud computing gathered to share news, views, successes and failures of their profession. One of those experts, Arun Jayandra, software development lead at Microsoft, shared his experiences using Spark cluster computing and Cassandra database technologies for Big Data analytics.
With his involvement in Office 365, Jayandra and his team at Microsoft designed the online office productivity suite to run with three-nines and four-nines availability: 99.9 and 99.99 percent reliability.
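To put those targets in concrete terms, the short sketch below converts an availability percentage into the downtime it permits per year. The function name and the 365-day year are illustrative assumptions, not anything Jayandra presented. Under them, three nines allows roughly 8.8 hours of downtime a year and four nines roughly 53 minutes.

```python
# Convert an availability target into the yearly downtime it allows.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```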
Office 365 tenants, or customers, were not satisfied with this level of performance, according to Jayandra. But another issue underlay customer satisfaction with the Office 365 experience: analyzing the actual reliability of the applications. "To date, we're not the most experienced at measuring the availability the tenant is getting," Jayandra says.
Big believer in Big Data
Naturally, Jayandra's Microsoft team wanted to use their internal IP to create a Big Data analytics engine for Office 365. But after trying to build the analytics engine with proprietary Microsoft technology, the development team turned to open source solutions to replace their own products.
They did so at least in part because they needed both real-time and batch-mode analytics. For these purposes, a week's worth of user data insights seemed sufficient, according to Jayandra. But even seven days of user storage and retrieval information proved daunting. "With Office 365 data there is much data velocity," Jayandra says. "It's very high frequency data with 10 terabytes (TB) stored a day."
Having so much customer data on hand posed a lot of risk for the Office 365 team, which created the need for a protection methodology with resilience and redundancy. "The customer data needed to be protected in multiple geographies replicated across datacenters," Jayandra says.
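A rough back-of-the-envelope sketch shows how quickly those figures add up. The 10 TB a day and the seven-day window come from the talk. The replica count per datacenter and the number of datacenters are hypothetical values chosen only for illustration, and the estimate ignores compression and compaction overhead.

```python
def stored_terabytes(daily_tb: float, retention_days: int,
                     replicas_per_datacenter: int, datacenters: int) -> float:
    """Raw storage needed to hold `retention_days` of data with replication."""
    return daily_tb * retention_days * replicas_per_datacenter * datacenters


# One unreplicated copy of a week of data at 10 TB a day.
print(stored_terabytes(10, 7, replicas_per_datacenter=1, datacenters=1))

# Hypothetical layout: three replicas in each of two datacenters.
print(stored_terabytes(10, 7, replicas_per_datacenter=3, datacenters=2))
```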
He also spotted an issue relating to data signals. "A small set of signals tend to double every eight months. So we needed a model that can scale linearly." In other words, Microsoft wanted Cassandra, with its "continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple datacenters and cloud availability zones," as its website notes.
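The doubling claim can be turned into a simple projection. The sketch below assumes strictly exponential growth with an eight-month doubling period and assumes that node count scales in direct proportion to data volume, which is what "linear scale" implies. The starting cluster size of 10 nodes is a hypothetical figure, not one from the talk.

```python
import math


def growth_factor(months: float, doubling_period_months: float = 8.0) -> float:
    """How much a signal grows after `months` if it doubles every period."""
    return 2 ** (months / doubling_period_months)


def nodes_needed(current_nodes: int, months: float) -> int:
    """Nodes required later, assuming capacity scales linearly with node count."""
    return math.ceil(current_nodes * growth_factor(months))


# Hypothetical 10-node cluster projected one, two and three years out.
for months in (12, 24, 36):
    print(months, "months:", nodes_needed(10, months), "nodes")
```

Under these assumptions, data volume grows eightfold in two years, so a system that scales worse than linearly would need disproportionately more hardware over the same period.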
Can't start a fire without Spark
Able to run on top of Hadoop, standalone or in the cloud, Spark is made for processing large volumes of data quickly. Jayandra had particular interest in using the Spark Streaming solution for building fault-tolerant computer clusters. "We spent time building fault tolerance and resilience," he says.
Using the Spark connector to Cassandra made Office 365's performance better, according to Jayandra. For example, the gateway services for Azure, Microsoft's own cloud computing solution, can pull data from Spark and push it into Cassandra. "In the cluster, we run Spark and Cassandra," Jayandra says. "Analytics run in the other datacenter."
However, this was only for batch mode analytics. "We cannot have real-time apps," Jayandra says. "Even Spark Streaming has no support to pull real time data."
Data never rests
With geo-redundancy in Microsoft's Spark strategy, it's a matter of having a similar passive stack in a different region: one on the U.S. East Coast and one on the U.S. West Coast. "The web server that powers the interface can query both datacenters, depending on which the user is closest to," Jayandra says. That said, Office 365 does not use the analytics cluster in the passive region.
In other cases, the analytics cluster cannot access data due to legal restrictions in some countries against storing customer data abroad. "So we have to replicate data in country to make data queries faster," Jayandra says.
Lessons and mistakes with Spark, Cassandra
Overall, while building 36 nodes of Cassandra and Spark, Jayandra came to several conclusions: It is not a low-maintenance process, and it cannot be built with open source Apache products alone. The team also needed to take bits from DataStax, a leading technology provider to Big Data application developers.
Because this was Microsoft's first open source project, Jayandra says, the team made some rookie mistakes. For example, rows were too wide, which led to compaction slowing down and COM errors. Records became really big and rules were too large to load into memory. "What was a stable system had to be remodeled after just three weeks," Jayandra says.
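The talk did not describe how the team remodeled its tables. One common way to keep Cassandra rows (partitions) from growing without bound is to add a time bucket to the partition key, so that each tenant's data is split into one partition per day rather than one ever-growing row. The sketch below illustrates that general technique only. The `tenant_id` field, the daily bucket size and the function name are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Tuple


def partition_key(tenant_id: str, event_time: datetime) -> Tuple[str, str]:
    """Build a (tenant, day) partition key so no single row grows unbounded.

    `event_time` must be timezone-aware; it is normalized to UTC so that
    every writer assigns the same event to the same daily bucket.
    """
    day_bucket = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (tenant_id, day_bucket)


# Two events for the same hypothetical tenant on different days
# land in different partitions.
first = partition_key("tenant-a", datetime(2015, 9, 14, 23, 30, tzinfo=timezone.utc))
second = partition_key("tenant-a", datetime(2015, 9, 15, 0, 5, tzinfo=timezone.utc))
print(first, second)
```

The trade-off is that a query spanning a week must read seven partitions instead of one, which is usually acceptable when the alternative is a row too large to compact or load into memory.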
Although the Spark and Cassandra configuration passed stability tests, the system slowed considerably when the project was moved to bigger production servers, according to Jayandra. "You can't test for this," he says. "Instead of a manual update of tables, the admin created a state where it went up by hundreds of thousands. It got us into a state where there were 200,000 files per node." And you cannot let a node get like that. "Because there's no going back," Jayandra says.
In Azure, only limited bandwidth exists between datacenters, making it impossible to rebuild a datacenter, according to Jayandra. "Instead, we need to back up and restore." Monitoring is very important in scenarios where there are datacenter replication problems. Jayandra learned to take a datacenter out of the cluster if problems manifest themselves.
As it is today, Office 365 running on Spark and Cassandra is a low volume activity, with only tens of jobs on a daily basis. "As we increase jobs, we see there is no good job server," Jayandra says. "We have not had good luck with open source job servers."
To compensate for the lack of a reliable job server, the team created an alert that fires when performance drops by 10 to 15 percent. "That way we use Cassandra data as a deterministic test to check on the pipeline."
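A threshold alert of this kind can be sketched in a few lines. The talk did not say which metric the team watches, so the example assumes a hypothetical "rows written per job run" count read back from Cassandra and compares it with a known-good baseline. The function names and the 10 percent default are illustrative.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Fraction by which `current` falls below `baseline` (0.0 if it does not)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - current) / baseline)


def should_alert(baseline: float, current: float, threshold: float = 0.10) -> bool:
    """True when the metric has dropped by at least `threshold` of the baseline."""
    return relative_drop(baseline, current) >= threshold


# Hypothetical numbers: the last healthy run wrote 1,000 rows.
print(should_alert(baseline=1000, current=880))  # 12 percent drop
print(should_alert(baseline=1000, current=950))  # 5 percent drop
```

Because the check reads the data the pipeline actually produced, it catches jobs that silently failed or ran partially, which is the gap a missing job server would otherwise leave.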
Photo via Derek Handova