This is the first in a series of developer blog posts on Azure, analytics, and Big Data in the Colligo Engage cloud solution.
The Colligo Mission
The Colligo Engage platform collects and stores actions performed by users against content for engagement and governance analysis. For example, whenever files are synced to a device or a user shares a file with another user, an event is generated that captures that activity. The activities are stored as blobs in Azure Storage, which is both secure and cost-effective. By processing the activities through various queries and filters, we are able to power real-time dashboards and Business Intelligence reports. Through these dashboards and reports we provide customers with useful insights into how data is being used in the organization, and identify usage patterns, including those representing potential risk of data loss or other breaches in data security.
During initial development, we discovered our custom data pipeline based on an Azure queue and worker role(s) would be insufficient. For one thing, it couldn't produce real-time or close-to-real-time reports. Second, we had to tackle scalability challenges: the worker roles simply couldn't ingest activities fast enough. Third, and perhaps most importantly, our custom data pipeline had limited support for monitoring and retry.
In March 2016, Colligo processed on average 1.6 million blobs per day, where each blob contains multiple activities. (We try to be chunky rather than chatty.)
Given our time-to-market requirement, we knew using Hadoop (IaaS) for this purpose would be out of the question. Though the Hadoop framework is highly applicable to the data analysis problem we're tackling and is massively scalable, we wanted to provide some early insights into the data using a lower-overhead (cost-of-entry) service like Azure Stream Analytics (ASA). We were also looking for something flexible and lightweight that we could plug into the existing pipeline; it was not practical to totally dismantle and reconstruct our custom pipeline. Fortunately for us, ASA fits our requirements well. We chose ASA because it:
- Offers scalable processing of real-time data
- Uses familiar SQL syntax
- Supports blob and event hub as input sources
- Plugs easily into our existing pipeline. Specifically, we configure ASA to read from our partitioned blobs (more on partitioning later) and output to Azure Table storage.
- Integrates with Azure monitoring infrastructure
- Is a SaaS offering supported by Microsoft
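To make the blob-in, table-out configuration above concrete, here is a minimal sketch of the kind of ASA query such a job might run. All names (ActivityInput, ActivityTable, EventType, EventTime) are hypothetical, not our production schema:

```sql
-- Hypothetical example: count activities per event type in 5-minute
-- tumbling windows, reading from a blob input and writing to an
-- Azure Table output.
SELECT
    EventType,
    System.Timestamp AS WindowEnd,
    COUNT(*) AS ActivityCount
INTO
    ActivityTable      -- Azure Table output sink
FROM
    ActivityInput      -- blob input source
TIMESTAMP BY EventTime
GROUP BY
    EventType,
    TumblingWindow(minute, 5)
```

The `TIMESTAMP BY` clause tells ASA to window on the event's own timestamp rather than its arrival time, which matters when replaying or backfilling historical blobs.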
We'll do a deep dive and share implementation details in future posts. In this introductory post, based on our experience so far, I want to highlight some design decisions to consider before adopting ASA.
Only blob, event and IoT hubs
Out of the box, ASA can read data from Azure blob and the event-hub family (event hub and IoT hub). This doesn't mean ASA is off the table if your data is stored in, say, Azure Table storage: a worker role could read the table data and send it to an event hub. In other words, you can replay the events. This approach of course introduces additional latency, which may not be acceptable in some cases.
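A sketch of that replay idea, with the Azure SDK calls stubbed out since the exact client code depends on your setup. Both `read_table_rows` and `send_to_event_hub` are hypothetical hooks, not real SDK names; in a worker role they would wrap the Azure Table and Event Hubs clients:

```python
def batch_entities(entities, batch_size=100):
    """Group table entities into fixed-size batches for replay.

    Event hub sends are cheaper per batch than per event (chunky
    rather than chatty), so we replay in chunks.
    """
    batch = []
    for entity in entities:
        batch.append(entity)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def replay(read_table_rows, send_to_event_hub, batch_size=100):
    """Replay historical rows from Table storage into an event hub.

    `read_table_rows` returns an iterable of entities;
    `send_to_event_hub` accepts one batch. Both are stand-ins for
    real Azure SDK calls. Returns the number of entities replayed.
    """
    sent = 0
    for batch in batch_entities(read_table_rows(), batch_size):
        send_to_event_hub(batch)
        sent += len(batch)
    return sent
```

The batch size is a tuning knob: larger batches reduce per-send overhead but increase the latency before the first replayed event reaches ASA.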
No auto scale for jobs
ASA does not yet offer the ability to auto-scale based on metrics such as resource utilization. Instead, you have to benchmark your workload and decide how many Streaming Units (SUs) to assign to the job. In general, you should not pay for more resources than you need, but you must still keep resource utilization below 80%. Refer to Microsoft's stream analytics documentation for more details. Here's a highlight:
> The utilization of the Streaming Unit(s) assigned to a job from the Scale tab of the job. Should this indicator reach 80%, or above, there is high probability that event processing may be delayed or stopped making progress.
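Since there is no auto-scale, we keep an eye on that metric ourselves. A minimal sketch of the kind of sizing check involved; the 80% threshold comes from the documentation quoted above, but the sizing heuristic is our own illustration, not an official formula:

```python
import math


def needs_more_streaming_units(su_utilization_percent, threshold=80.0):
    """True once SU utilization reaches the level at which Microsoft
    warns event processing may be delayed or stop making progress."""
    return su_utilization_percent >= threshold


def suggest_streaming_units(current_su, su_utilization_percent,
                            target_percent=60.0):
    """Rough sizing helper (our own heuristic): scale the SU count so
    projected utilization lands near a comfortable target, rounding up.
    ASA sells SUs in fixed steps, so round the result up to a valid
    step before applying it to the job."""
    if su_utilization_percent <= 0:
        return current_su
    projected = current_su * su_utilization_percent / target_percent
    return max(current_su, math.ceil(projected))
```

In practice this kind of check runs against the SU% utilization metric pulled from Azure's monitoring API, and the "scale up" action is still a manual (or scripted) job update.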
Partition your input
If reading from blobs, consider creating a virtual hierarchy with '/'. Keeping each virtual directory reasonably sized helps with performance and troubleshooting, not to mention it lets you avoid the performance issues ASA can hit when reading a container with millions of blobs.
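As one illustration of such a hierarchy (this layout is an example, not Colligo's actual scheme), blob names can embed the tenant and the event's date and hour, which also lines up with the `{date}`/`{time}` tokens an ASA blob input's path pattern can use:

```python
from datetime import datetime


def partitioned_blob_name(tenant_id, event_time, blob_id):
    """Build a virtual-directory blob name of the form
    tenant/yyyy/MM/dd/HH/blob_id.json, so each 'directory' holds
    only one hour of one tenant's blobs. Illustrative layout only."""
    return "{0}/{1:%Y/%m/%d/%H}/{2}.json".format(tenant_id, event_time, blob_id)
```

With names like these, an ASA input path pattern such as `contoso/{date}/{time}` lets the job scan only the relevant hour's worth of blobs instead of the whole container.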
Automate your deployments
In the Colligo example, all deployments were manual in the early stages of our ASA development. Not only is this approach painful and error-prone, it is soul-crushing during a fast dev cycle. Consider utilizing ARM deployment via VSTS. In most cases, it should be inexpensive or free if you have an MSDN subscription.
Use multiple queries in one job
Try to combine similar queries into one job. We found through experimentation that a single ASA job can write to up to five different output sinks. Doing so can save money on I/O when reading from blob or event hub. For example, instead of having five different jobs reading from the same input, consider having a combined job that reads the input once but generates different output to five different output sinks.
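ASA's query language supports this directly: a `WITH` step can read the input once, and multiple `SELECT ... INTO` statements can fan the same step out to different sinks. A minimal sketch, with all names (ActivityInput, SyncOutput, ShareOutput, AllActivityOutput) illustrative rather than our production schema:

```sql
-- One job, one read of the input, multiple output sinks.
WITH Activities AS (
    SELECT EventType, UserId, System.Timestamp AS EventTime
    FROM ActivityInput
)

SELECT * INTO SyncOutput FROM Activities
WHERE EventType = 'FileSync'

SELECT * INTO ShareOutput FROM Activities
WHERE EventType = 'FileShare'

SELECT * INTO AllActivityOutput FROM Activities
```

Each `SELECT ... INTO` targets its own configured output, but the input is only read (and billed) once.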
Note that we did run across a few issues with ASA (one involving reading from a non-partitioned blob container and another involving backdated jobs). Azure support has been very responsive, and they have since made several fixes. This again validates the advantage of a SaaS model and how it can help reduce time-to-market.
Below are the resources we recommend:
- Azure Stream Analytics Team Blog – THE most useful resource IMO. Good mix of technical detail and examples.
- Official Stream Analytics Documentation – It’s official, you should read it.
- MSDN Forum – This forum is actively monitored by Microsoft staff. Ironically, we find it more useful than StackOverflow.
I would also recommend following @AzureStreaming on Twitter for the latest features and tips.