After DevOps, DataOps?
I just finished a curious book, The Phoenix Project, that has me thinking about the big picture future of data analysis. First, however, I need to lay a little groundwork.
Let me start with another book called The Goal. This “business novel” written by Eliyahu Goldratt in 1984 used the format of a novel as a teaching tool for his theories about business, especially his theory of constraints. The theory of constraints is, in a nutshell, the idea that every business process has a single binding constraint (bottleneck) at any given point in time. If you want to improve the overall process, the only thing you can do that will have any impact at all is to improve that specific constraint. Any improvement you make anywhere else will have no real beneficial effect.
Imagine a convoy of ships. The convoy can only travel as fast as the slowest ship. If your second-slowest ship has a mechanical fault and you fix it (increasing the speed of that ship) you have done literally nothing to increase the speed of the convoy because that ship was already faster than the slowest ship. The only ship that matters is the slowest ship. (Although once you fix that one, then the second-slowest ship will become the new bottleneck.)
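To make the arithmetic concrete, here is a toy sketch of the convoy example (my own illustration with made-up speeds, not anything from the book): the convoy’s throughput is just the minimum over the ships, so improving a non-bottleneck ship changes nothing.

```python
# Toy illustration (made-up numbers): the convoy's speed is the minimum of
# the individual ship speeds, so only the slowest ship matters.
ship_speeds = {"A": 14, "B": 9, "C": 12, "D": 16}  # knots

def convoy_speed(speeds):
    """The convoy can only travel as fast as its slowest ship."""
    return min(speeds.values())

print(convoy_speed(ship_speeds))   # 9 -- ship B is the bottleneck

ship_speeds["C"] = 20              # "fix" the second-slowest ship
print(convoy_speed(ship_speeds))   # still 9 -- no effect on the convoy

ship_speeds["B"] = 18              # fix the actual bottleneck
print(convoy_speed(ship_speeds))   # 14 -- ship A is now the new bottleneck
```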
There’s more to the theory of constraints than this, and some of the corollaries are interesting and counterintuitive. For example, Goldratt argues that the non-bottleneck resources in the process should deliberately be designed to work below capacity.
To communicate these ideas, Goldratt invented a story about a man who had to take charge of a dysfunctional factory that was at risk of being shut down. With the help of a Yoda-like mentor, he slowly figured out the various theories Goldratt wanted to teach the audience, eventually saving his job, the factory, and his marriage. (Yeah, his marriage. We’ll get back to that little detail again at the end.)
Goldratt’s The Goal was written from the standpoint of fixing physical processes. The Phoenix Project—written by Gene Kim, Kevin Behr, and George Spafford—remakes The Goal in the context of IT. A mid-level IT manager is unexpectedly promoted to the head of IT for a struggling company, and an eccentric mentor shows up to help him figure out how to apply Goldratt’s ideas to his IT department. This is tricky, of course, since machines in a factory producing widgets aren’t the same as engineers in an IT department deploying code. Or are they? The Phoenix Project makes the case that they’re close enough that the same principles can be applied.
Now if the principles in The Goal are really helpful for improving process efficiency in manufacturing plants, and if The Phoenix Project’s thesis that those principles can be transplanted to software deployment holds up, then I think it’s very possible that they could also be applied to data analysis processes as the next step in a sort of process of increasing abstraction.
What do I mean by a process of increasing abstraction? Well, The Phoenix Project came out almost exactly 30 years after The Goal (1984 to 2013) when cloud computing was still pretty new. In fact, the concept of virtualization (which is at the heart of cloud computing) plays a minor but critical role in the book’s plot. And one way to look at virtualization is as a kind of abstraction.
Just as The Phoenix Project moves from building physical widgets to building immaterial code deployments, virtualization transforms a process that used to be physical (building servers by literally assembling physical components and then placing them in racks and running the cabling) to an immaterial one (“building” servers virtually without touching any physical components at all).
This makes IT a kind of quasi-material field. Yes, you still obviously need those underlying hardware components. “The cloud is just somebody else’s computer,” as they say. But from the standpoint of the folks deploying infrastructure, you’ve taken an erstwhile manual, physical process and made it (potentially) automatic and immaterial. Voila: “infrastructure as code.”
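As a rough illustration of what “infrastructure as code” means in practice, here is a minimal, hypothetical sketch; the declaration format and the plan function are invented for this post and don’t correspond to any real provisioning tool’s API.

```python
# Hypothetical sketch of the "infrastructure as code" idea: the desired
# servers are declared as plain data that can live in version control, and
# a reconciler compares the declaration against what currently exists.
desired = {
    "web-01": {"cpus": 4, "ram_gb": 16},
    "web-02": {"cpus": 4, "ram_gb": 16},
    "db-01":  {"cpus": 8, "ram_gb": 64},
}

current = {
    "web-01": {"cpus": 4, "ram_gb": 16},
}

def plan(desired, current):
    """Return which declared servers would need to be created to match the declaration."""
    return [name for name in desired if name not in current]

print(plan(desired, current))  # ['web-02', 'db-01'] -- these would be provisioned
```

The point is not the toy code itself but that the definition of your infrastructure is now text: diffable, reviewable, and repeatable.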
Now all of a sudden you’re doing things that would have seemed crazy at one point, like using version control software for your infrastructure definition and deploying 10 times a day instead of every three months. (That link goes to a real presentation from 2009 by John Allspaw and Paul Hammond about how they got to ten deploys a day; the presentation is referenced in The Phoenix Project.)
Can we get the same kinds of revolutionary advances in throughput for data analysis that The Goal describes for physical manufacturing and The Phoenix Project describes for IT? Maybe.
The biggest obstacle in my mind is that the tasks of data analysis don’t seem to be amenable to the kind of analysis that manufacturing tasks are. The whole point of modern manufacturing is that you take the process of assembling a physical object and break it down into discrete, repeatable, identical steps.
The earliest known example of a real assembly line comes from the Portsmouth Block Mills, which was built between 1801 and 1803 and was used to fabricate rigging blocks using 22 custom-designed tools. By the time we get to Henry Ford and his assembly lines, the processes were much more complicated, but they were still broken down into simple steps.
Some benefits of the assembly line, such as the use of unskilled labor and interchangeable parts, were obvious immediately. But in many ways the deeper benefits weren’t recognized until much later. A key figure in the modern refinement of production was W. Edwards Deming. His most influential lectures were given in 1950 (about a century and a half after the Portsmouth Block Mills went into production and about thirty years before The Goal was published).
The story of Deming—at least, the story as it was told to me by one fervent devotee who taught a semester of statistics when I was an undergrad—is mythic in its proportions. A prophet without honor in his home country (America), Deming had to go to a foreign land (Japan) to find disciples willing to hear and apply what he preached. There, according to Wikipedia:
From June–August 1950, Deming trained hundreds of engineers, managers, and scholars in SPC [statistical process control] and concepts of quality. He also conducted at least one session for top management (including top Japanese industrialists of the likes of Akio Morita, the cofounder of Sony Corp.) Deming's message to Japan's chief executives was that improving quality would reduce expenses, while increasing productivity and market share…A number of Japanese manufacturers applied his techniques widely and experienced heretofore unheard-of levels of quality and productivity. The improved quality combined with the lowered cost created new international demand for Japanese products.
There is no question that Deming helped revolutionize modern manufacturing, and projects inspired by him or related to his work continue to this day, such as The Toyota Way. (Deming also had a huge influence on Goldratt and, through him, Kim, Behr and Spafford. He is mentioned by name in both The Goal and The Phoenix Project.)
But all of Deming’s work—and that of countless other academics and professionals—is predicated on the fact that assembly line manufacturing is mathematically tractable due to its discrete steps. If you invest too much in stories—whether about fictional characters in The Goal or The Phoenix Project or mythical historical figures like Deming—you’re going to miss the underlying fundamentals. In this case the underlying fundamentals are that once you have discretized a process, then you can subject it to a vast array of quantitative analytical techniques. Basically, turning manufacturing into a discrete process allows you to do for manufacturing what calculus did for physics starting with Newton.
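To make that concrete, here is a small sketch (with invented cycle times) of the kind of statistical process control check that a discretized, repeatable step makes possible, in the spirit of the SPC Deming taught: once a step is repeatable, you can establish control limits for it and flag runs that fall outside them.

```python
from statistics import mean, stdev

# Baseline cycle times (minutes, invented data) for one repeatable step.
baseline = [4.1, 3.9, 4.0, 4.2, 3.8, 4.1, 4.0, 3.9]

m, s = mean(baseline), stdev(baseline)
upper, lower = m + 3 * s, m - 3 * s          # Shewhart-style control limits

# New observations checked against the limits; anything outside is a signal
# that the step's behavior has changed and is worth investigating.
new_runs = [4.0, 4.1, 6.5]
flagged = [t for t in new_runs if not (lower <= t <= upper)]
print(f"limits=({lower:.2f}, {upper:.2f}), flagged={flagged}")
```

None of this works unless the step is discrete and repeatable enough to have a meaningful baseline in the first place, which is exactly the point.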
So, at the level of fundamentals, the biggest obstacle I see is the disparity between the sorts of tasks in manufacturing and data analysis. Unless and until data analysis processes can be discretized the way manufacturing processes are, the whole vast literature (not just a couple of popular fictional versions) can’t be readily applied to data analysis.
This is the crux of the issue: can we discretize and systematize the steps of data analysis?
There are some who think we’re on the threshold of doing this completely any day now. I’m thinking of someone like Pedro Domingos and his quest for the Master Algorithm. Although The Master Algorithm is a great book for understanding the different approaches to data science, I’m extremely skeptical that there is, or will be in the foreseeable future, anything like a one-size-fits-all, singular “master algorithm”.
But this isn’t really a binary question. That’s the key thing The Phoenix Project persuaded me of. The move to virtualization didn’t make IT exactly like manufacturing, but it made it a lot more like it. It was close enough—because enough of the tasks were discretizable—that the theories of manufacturing optimization had at least some traction.
If you want to reduce data analysis to steps as quantifiable as “twist this specific wrench 90 degrees” you are probably waiting in vain. But we don’t have to make data analysis an exact match to start to see some revolutionary benefits.
At first it seems impossible to discretize data analysis processes at all. Data analysis tasks just do not seem repeatable. Believe me, I know. I’ve managed data analysis / data science teams for a long time. For most of that period my teams used the basic outline of agile methodology, which involves planning out your tasks, estimating the level of effort involved, and then putting them into sprints (usually 2-week planning increments). I’ve also led software projects, and in my experience estimating data analysis tasks is much harder than software tasks (which are already legendarily hard to estimate). And if you can’t estimate your tasks then obviously they aren’t repeatable steps, right? And if you don’t have repeatable steps, then you don’t have anything that looks remotely like an assembly line and none of the theories about optimizing manufacturing processes—no matter how powerful in their own domain—can be successfully transplanted to this alien soil.
But I think it’s also possible that the reason our tasks are so hard to estimate is that we’re simply not estimating the right tasks. Usually we define the task in terms of the outcome (for example: a new report) and completely skip over the intermediary steps. This is a huge problem when (as is almost always the case) you don’t actually know how to build the report yet.
The reality is that most of the work in analytical projects is the invisible stuff: finding the data, assessing the data, evaluating the infrastructure requirements, and so on. All of this is unknown when you start the project. If you don’t know where the data is, you implicitly also don’t know whether the data even exists. And yet you’re going to estimate the time it takes to deliver a report that depends on this Schrödinger’s data that may or may not exist (or be accessible, or be reliable, and so on)?
So the task “build a report” is basically impossible to estimate. But what about the task “define what data this report will need” or the task “find out if we have this data” and “test the validity of this data”? Not only are those tasks easier to estimate, they’re also much closer to being uniform.
In fact, the more mature the IT ecosystem becomes, the easier it is to standardize these tasks. If your company has no data catalog, the “find out if we have this data” step is semi-automatic at best and has relatively high variance and error. But if your company has a real data catalog with all the fields, metadata, and lineage defined, then this becomes an automatable step.
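Here’s a hypothetical sketch of what I mean; the catalog structure and the lookup function are invented for illustration and don’t reflect any particular catalog product. Once the catalog exists, “find out if we have this data” stops being an investigation and becomes a mechanical lookup.

```python
# Hypothetical data catalog: dataset names mapped to their fields and metadata.
catalog = {
    "orders":    {"fields": ["order_id", "customer_id", "order_date", "total"],
                  "owner": "sales-eng", "last_updated": "2021-05-30"},
    "customers": {"fields": ["customer_id", "region", "signup_date"],
                  "owner": "crm-team", "last_updated": "2021-05-28"},
}

def find_fields(catalog, needed):
    """Report, for each field the report needs, which datasets (if any) carry it."""
    return {field: [name for name, meta in catalog.items() if field in meta["fields"]]
            for field in needed}

needed_for_report = ["customer_id", "region", "discount_code"]
print(find_fields(catalog, needed_for_report))
# {'customer_id': ['orders', 'customers'], 'region': ['customers'], 'discount_code': []}
```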
I do not believe that data analysis will be fully automated or reduced to an assembly line in the practically relevant future, but I do believe there is the potential—especially as IT continues to automate—for data analysis to become automated enough that process optimization approaches may yield revolutionary advances in throughput. In short, I think you can take The Goal and The Phoenix Project as two points, draw a line between them, project it into the future, and see the contours of what coming decades will bring to data analysis.
However, I have a huge caveat: watch out for hype. I mentioned way up at the top that in The Goal the main character not only managed to apply the theory of constraints to save his factory, but also his marriage. That’s because Goldratt really did believe that his theories were more or less universal and could be applied to a marriage as well as a factory. That’s… probably taking things a little too far. To put it mildly.
If you work anywhere near tech, you know how bad hype can get.
And that’s really the biggest risk of hype. It makes the story too simple. This in turn gives a false sense of confidence that we’ll be able to easily see the answers or even know what the answers are ahead of time. And if you think you already know what you’re going to find, then you’re blinding yourself to what may actually be there.
When it comes to popularizing these ideas in a novel, this oversimplification is not only unavoidable, it’s the whole point. The reason a novel works to teach these ideas is that the author can tweak everything in unrealistic ways to make the principles really, really clear and obvious. As a teaching methodology that’s fine, especially if the novel is also memorable and fun (so that more people read it).
But once you put the book back down it’s time to take a step back and remember that real life is going to be a lot more ambiguous and complex than the book, and you can’t just forklift the theory out of the pages of the novel and into a business plan.
I am convinced that there are broad similarities between the three fields of (1) optimizing manufacturing processes, (2) optimizing IT processes, and (3) optimizing data analysis processes. I am equally convinced that they are not actually the same thing. Let’s keep things in perspective. It took roughly 150 years to get from the first assembly line in England to Deming’s lectures in Japan. This process was long, and no one alive in 1803 could have predicted where it would lead in 1950, let alone 2021. What’s more, the overall arc of this progression was based on fundamentals (discretizing manufacturing processes and then applying mathematical techniques). The specifics, the individual people and their theories, are historically contingent, and we should not fixate on them.
In conclusion, I think we have reason to believe that data analysis can be revolutionized in ways similar to, though not identical with, the transformation manufacturing went through and the one IT is going through right now. We can use the history of related fields to project a general trend, but if we want any applicable specifics, we’re going to have to go out and discover / invent them ourselves the hard way: by trial and error.