An Ops Detective🕵️ can tell you what’s going on

ELI5 is an acronym people use when they encounter a complicated concept and want it explained simply. Some things are pretty straightforward in the Ops world – you know you have a new subscription in SaaS when Stripe tells you that you do – but many other things are not immediately obvious.

Records in your data warehouse that change in unexpected ways? Records that don’t show up where you expect? Duplicates created in your system even though you have a solid deduping process? A customer subscription that’s scheduled to end on a date but doesn’t end automatically? All of these require a data detective to troubleshoot and resolve.

Playing the data detective is not always easy, requiring you to:

  • understand the internal rules of engagement – how should this record change according to our expected processes and norms?

  • check the history of this record – what happened to this record over time?

  • view the order of operations across systems – as many ops processes touch multiple systems, you need to know what happened and when

To find out what happened, you need an Ops Detective.

Wait a minute, don’t you already have logs? You do! But they do a lousy job of telling you what happens across systems when a common workflow causes changes to multiple types of records. Troubleshooting this can take a lot of time and effort, and it gets much easier when the lineage and reporting information are better.

How would you make the reporting process better?

Using a product lens, there are some specific ways to make reporting these anomalies clearer. Think about how this would work if every process self-documented its results in a standard format. At key points, each workflow could broadcast a timestamped message to capture the transaction.

If you think this sounds like a fancy way of describing a transaction log, you’re not wrong. As a detective, you need to know what happened to an object over time and across systems. This gives you the ability to aggregate rows into a time series and understand the timeline as it happens.
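A self-documenting workflow step could be as simple as a small helper that every process calls at key points. This is a minimal sketch, not a real implementation – the function name, event names, and IDs are all hypothetical, and a production version would publish to a queue or log table rather than print:

```python
import json
from datetime import datetime, timezone

def emit_event(workflow, event, metadata):
    """Broadcast one self-documenting, timestamped message for a workflow step.

    Sketch only: printing stands in for a real transport
    (a message queue, a log table, an HTTP endpoint).
    """
    record = {
        "workflow": workflow,
        "event": event,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata,
    }
    print(json.dumps(record))  # stand-in for a real transport
    return record

# A hypothetical subscription workflow announcing a key step
evt = emit_event(
    "Subscription Start",
    "subscription.created",
    {"customerID": "c_123", "subscriptionID": "sub_456"},
)
```

Because every record carries the same keys, rows from different systems can be aggregated into one time series without per-system translation.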

Building a successful transaction log has a few challenges:

  1. Granularity – there’s a lot of noise here, so which details do you care about? Storage is inexpensive, so storing a lot of data is not usually an issue, but you want to store things you actually use.

  2. Explanation – when a change happens, how do you summarize it in the context of a larger workflow? Calling it an event name and defining events uniquely might be one way, but it’s not always easy to know what’s happening without an event sequence.

  3. Transaction – some things that happen can be reversed and other items are immutable, so it would be helpful to know all of the events that belong together in a transaction.
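The three challenges above map naturally onto columns of a log schema. Here is one possible shape, sketched as a Python dataclass – the field names are assumptions, not a prescribed standard:

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    # Explanation: a uniquely named event within a larger workflow
    event_name: str
    workflow: str
    # Transaction: groups every event that belongs together,
    # and marks whether this particular step can be reversed
    transaction_id: str
    reversible: bool
    # Granularity: a timestamp plus only the details you actually use
    timestamp: str
    metadata: dict = field(default_factory=dict)

entry = LogEntry(
    event_name="subscription.created",
    workflow="Subscription Start",
    transaction_id="sub_start_c123",
    reversible=False,
    timestamp="2024-01-01T00:00:01Z",
    metadata={"customerID": "c_123"},
)
```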

Now that we have the schema for a potential transaction log, how would the mechanics of this work?

An Idea to Unify Ops Reporting

Adding a log for every process (at important events) would add a small amount of overhead to existing workflows, and could look like an API that performs the following steps.

There needs to be a catalog of processes or workflows. When first run, each workflow registers itself with the logging app, letting it know what it does, how often it normally runs (if known), and the objects it typically affects.

For example, a “subscription start” workflow might register itself in this way:

  • Type: On Demand

  • Name: “Subscription Start”

  • WorkflowGroup: “Subscription Start”

  • Description: “Runs when accounts begin their subscription”

  • Event time: time stamp

  • Metadata: A JSON package containing the typical keys to be expected, e.g. customerID, subscriptionID, subscriptionStatus
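The registration call itself could be a small, idempotent function over a catalog. This is a sketch under assumed names (`register_workflow`, the key spellings) mirroring the fields listed above:

```python
# Hypothetical registration payload mirroring the fields above
registration = {
    "type": "On Demand",
    "name": "Subscription Start",
    "workflow_group": "Subscription Start",
    "description": "Runs when accounts begin their subscription",
    "metadata_keys": ["customerID", "subscriptionID", "subscriptionStatus"],
}

catalog = {}

def register_workflow(catalog, reg):
    """Register a workflow the first time it runs; later calls are no-ops."""
    catalog.setdefault(reg["name"], reg)
    return catalog

register_workflow(catalog, registration)
register_workflow(catalog, registration)  # idempotent: still one catalog entry
```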

When this workflow runs, it will send a message. You can use that message to aggregate events into a group that corresponds to the Subscription Start for this account. Ordering these events by timestamp lines up the transaction so it can be observed sequentially, letting you see whether things happened in the expected order and with the expected content.
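Reconstructing that sequence is a group-then-sort over the log. A minimal sketch, with invented transaction IDs and event names (ISO-8601 timestamps in a uniform format sort correctly as strings):

```python
from collections import defaultdict

# Hypothetical log rows arriving out of order
events = [
    {"txn": "sub_start_c123", "event": "invoice.created",      "ts": "2024-01-01T00:00:05Z"},
    {"txn": "sub_start_c123", "event": "subscription.created", "ts": "2024-01-01T00:00:01Z"},
    {"txn": "sub_start_c123", "event": "welcome.email.sent",   "ts": "2024-01-01T00:00:09Z"},
]

def timelines(events):
    """Group events by transaction, then order each group by timestamp."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e["txn"]].append(e)
    return {txn: sorted(evts, key=lambda e: e["ts"]) for txn, evts in grouped.items()}

tl = timelines(events)
# The first event in the reconstructed sequence is "subscription.created"
```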

How would you use an Ops Detective?

Let’s say you had this improved transaction log. What would you do with it? I’d use it to answer some common questions and to produce a standard result for a given transaction:

  1. What happened?

  2. Was the outcome similar or different than expected?

  3. What’s the impact of the change?

Of these questions, number 3 is the hardest to generalize. You are probably creating dashboards or reporting to alert on questions 1 and 2, and using a lot of extra time to understand impact. The goal of an Ops Detective is to standardize more of the typical transaction reporting across systems so that answering a question about a new workflow publishing to this log will be an easier process.
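Questions 1 and 2 lend themselves to a standard check: compare the observed event sequence against the one the workflow registered. A sketch, with hypothetical event names (impact, question 3, still needs a human or domain-specific logic):

```python
# The sequence this workflow is expected to produce (assumed example)
EXPECTED = ["subscription.created", "invoice.created", "welcome.email.sent"]

def diagnose(actual):
    """Answer 'what happened?' and 'was it as expected?' for one transaction."""
    return {
        "what_happened": actual,
        "as_expected": actual == EXPECTED,
        "missing": [e for e in EXPECTED if e not in actual],
        "unexpected": [e for e in actual if e not in EXPECTED],
    }

report = diagnose(["subscription.created", "welcome.email.sent"])
# report["missing"] shows the invoice step never happened
```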

What’s the takeaway? Unexpected outcomes are the norm in operations, and building a structured way to troubleshoot them will make it easier to find patterns and to build an automated alert once you find a pattern. Build a self-reinforcing system by registering each workflow with the Ops Detective.

gregmeyer