Massive amounts of data are being generated and collected across virtually all parts of business ecosystems – manufacturing, supply chains, marketing, online advertising, customer relationship management, social media – opening business operations to data mining for problem solving and competitive advantage.
The work of a data science team (or, if you prefer, data engineering team) may be characterized as the extraction of knowledge from data, or at least the refinement of data into increasingly higher states of value. Data science teams apply an array of practices and tools to the analysis and solution of data-centric business problems. The challenge for many teams today is how to accomplish this in an agile way: how do these teams deliver the equivalent of ‘working software’ (or working models, or working business solutions), or at least meaningful stakeholder value, every 2-4 weeks?
The ‘inspect & adapt’ principle still applies – data teams, in collaboration with other stakeholders, meet frequently to reflect, assess progress and collaborate on evolving solutions to business challenges and opportunities. While the practice of data science is not the same as the practice of software development, requiring different skills, knowledge, techniques and tools, the 12 principles enshrined in the Agile Manifesto can provide valuable guidance. Further, given the intrinsically empirical nature of data science, the basic ceremonies of scrum (iteration planning, daily stand-ups, iteration demos and retrospectives), and team organization (product owner, scrum master and team) can all be applied very effectively to improve the agility of any data science team.
For those (like me) who have not completely memorized the 12 agile principles, here they are, with those most relevant to our discussion marked with an asterisk:

1. Our highest priority is to satisfy the customer through early and continuous delivery of valuable software. *
2. Welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.
3. Deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale. *
4. Business people and developers must work together daily throughout the project. *
5. Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.
6. The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.
7. Working software is the primary measure of progress. *
8. Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.
9. Continuous attention to technical excellence and good design enhances agility.
10. Simplicity – the art of maximizing the amount of work not done – is essential.
11. The best architectures, requirements, and designs emerge from self-organizing teams.
12. At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly. *
Almost any endeavor that can benefit from an empirical approach can leverage scrum. Empirical means: proceed incrementally via a series of experiments, and iterate on the results following each increment. This is the essence of the agile approach, and indeed the philosophy of the scientific method. Set a goal for each development increment (sprint) – in science this might be to run an experiment to collect data to test a hypothesis; in software it could be to build a feature and get feedback from users. At the end of each increment the results are evaluated (the sprint review in scrum), and a decision (feedback) is made either to continue to the next goal or to discard the increment and change direction. This is how scrum’s ‘inspect & adapt’ principle is applied.
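To make this concrete, here is a minimal sketch of a single sprint treated as an experiment, in Python with scikit-learn. The bundled toy dataset, the baseline model and the accuracy threshold are all illustrative assumptions, not a prescription:

    # Sprint goal (hypothesis): a simple baseline model can reach 80% accuracy.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)  # toy stand-in for real project data

    # The 'experiment': build the increment and measure it.
    baseline = LogisticRegression(max_iter=5000)
    accuracy = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()

    # The 'sprint review': inspect the result and adapt the plan.
    HYPOTHESIS_THRESHOLD = 0.80  # hypothetical goal set at sprint planning
    if accuracy >= HYPOTHESIS_THRESHOLD:
        print(f"Hypothesis supported ({accuracy:.2%}); continue to the next goal.")
    else:
        print(f"Hypothesis rejected ({accuracy:.2%}); revisit features or change direction.")

The point is the shape of the loop, not the particular model: every increment ends with a measurable result that feeds the inspect & adapt decision.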
Many data science teams have adopted lifecycle models like CRISP-DM (Cross Industry Standard Process for Data Mining) or SEMMA (Sample, Explore, Modify, Model, and Assess). These frameworks define a set of steps or stages to guide the process of data mining and data modeling. The problem with these models is that they prescribe a purely sequential series of steps, each producing an intermediate work-product that may not be of demonstrable value to a business stakeholder.
Furthermore, few of these intermediate work-products fit within 2-4 week agile iterations – data preparation tasks alone may consume up to 80% of the effort in some data science projects. The CRISP model is essentially a waterfall. A team may come up with an approach that produces things like ‘Data Preparation Stories’, ‘EDA Stories’, and so on; however, calling these artifacts ‘stories’ does not make the process agile. These are of course essential steps for the data science team, but they are unlikely to make much sense to a business stakeholder, who may be looking for answers to questions like: which customers will not renew their wireless service contracts? Which customers will fail to pay off their credit card balances? Worse, many of these artifacts take significantly longer to produce than the 2-4 week time-boxes typical of agile practices. Teams will close out their sprints frustrated at not having much demonstrable business value to show for their efforts, and will continually push unfinished work into subsequent sprints.
The same challenge exists in software development, namely, how to support the basic agile value proposition of delivering stakeholder value on a frequent basis. Enter user stories. In software, user stories are a vehicle for breaking up product features into iteration-sized pieces that still provide demonstrable value. Frequently we have to include elements of a user interface, some business objects, and data from a database within the same story to demonstrate something of meaningful value to a potential user of the system. This is referred to as vertical slicing.
Vertical slicing is important because showing users a ‘database story’ is completely meaningless to them – they need to see a real working increment of the product’s capabilities in order to acknowledge progress and provide feedback. The longer we delay the demonstration of working software to stakeholders, the further we push out opportunities for feedback and change, and the greater the risk to project success. Collaboration and consensus-building with business stakeholders are critical. So what options do we have?
One way to frame the question is: how do we deliver the data science equivalent of ‘working software’ or ‘stakeholder value’ in 2-4 week sprints? Do partially working models make any sense? For example, a model that can classify outcomes with 50% accuracy is effectively useless, and certainly not deployable. However, the same model may be considered valuable in the sense that it has helped move the project forward, say by identifying model features that can be eliminated. (Sprint Goal: Measure the impact of including attributes x5 and x9 on model performance.)
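One way such a sprint goal might play out in code: the sketch below compares cross-validated accuracy with and without two candidate attributes. The toy dataset, and the column indices standing in for x5 and x9, are assumptions made for illustration:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)  # stand-in for the project's data
    candidate_features = [5, 9]                 # hypothetical x5 and x9

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    with_all = cross_val_score(model, X, y, cv=5).mean()
    # Drop the candidate columns and re-measure.
    without = cross_val_score(model, np.delete(X, candidate_features, axis=1), y, cv=5).mean()

    print(f"Accuracy with x5 and x9: {with_all:.3f}; without: {without:.3f}")

Even if the overall accuracy is unimpressive, the sprint still produces a demonstrable finding (keep the attributes or drop them), which is exactly the kind of partial value the sprint goal describes.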
We still have a number of options for adopting a less waterfall-like approach. Let’s explore some of these options and their associated challenges.
Approach 1: ‘Horizontal Stories’. Iterations deliver intermediate data science work-products – ‘Data Preparation Stories’, ‘EDA Stories’, and so on – and demonstration of working models, or anything of real stakeholder value, is not possible until late in the program. This is essentially waterfall with iterations, and it severely limits opportunities for stakeholder collaboration. The following diagram shows how Approach 1 could be set up in Jira.
Approach 2: ‘Vertical Stories’. Preferable but challenging. Iterations deliver (at least partially complete) working models. We start with limited data (‘thin slices’) and improve by iteration. However, even thinly sliced stories may take multiple iterations; data preparation in particular may consume the majority of the effort in a data science project.
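A thin vertical slice might look like the following sketch: one small, easily prepared extract, a minimal feature set, and a crude but complete end-to-end model that can be demonstrated at the sprint review. The file and column names here are hypothetical:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Slice thinly: one data source and a handful of readily available columns.
    df = pd.read_csv("customers_sample.csv")           # hypothetical extract
    features = df[["tenure_months", "monthly_spend"]]  # minimal feature set
    target = df["churned"]

    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.25, random_state=0)

    model = LogisticRegression().fit(X_train, y_train)
    print(f"Sprint demo accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")

Subsequent sprints widen the slice with more sources, more features and better preparation, while each iteration remains demonstrable end to end.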
Approach 3: Hybrid Horizontal/Vertical Approach. In the hybrid approach we attempt to get all or most of the time-consuming data preparation work done first – call it a ‘Sprint 0’ – and then follow with sprints that build and improve models incrementally, in thin slices. Stories need to be carefully scoped to fit in an iteration – strive for incremental changes to EDA/modeling/model evaluation in a single sprint. We iterate on the model construction work until we converge on something close to a Minimum Viable Product (MVP), at which point we hand off the data science work to an IT team for productization and deployment to actual users.
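One way the hybrid approach might be sketched: ‘Sprint 0’ prepares the data once, then each later sprint runs a single incremental experiment until a hypothetical MVP bar is cleared. The models, dataset and threshold are illustrative assumptions:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    MVP_THRESHOLD = 0.97  # hypothetical 'ready for hand-off' bar

    # 'Sprint 0': the bulk of data preparation, done once up front
    # (a toy dataset stands in for the real cleaning and joining work).
    X, y = load_breast_cancer(return_X_y=True)

    # Each later sprint is one thin, demonstrable experiment.
    sprint_backlog = [
        ("Sprint 1: linear baseline", LogisticRegression(max_iter=5000)),
        ("Sprint 2: add feature scaling",
         make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))),
        ("Sprint 3: try boosted trees", GradientBoostingClassifier(random_state=0)),
    ]

    for goal, model in sprint_backlog:
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"{goal}: cross-validated accuracy {score:.3f}")
        if score >= MVP_THRESHOLD:
            print("MVP threshold reached; candidate for productization hand-off.")
            break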
Even with the hybrid approach, where most of the required data preparation work is completed up-front to establish a solid baseline for model development, new data needs may be identified before the MVP goal is achieved, requiring teams to cycle back for more data to improve model performance and get closer to a viable solution.
A serious weakness common to all three approaches is the IT team hand-off for productization and deployment once MVP status has been achieved by the data science team. As any agilist will tell you, hand-offs are fraught with risk and the potential for significant re-work and delay.
In Agile Data Science – Part 2 we will continue to seek an answer to the question: how does a data science project deliver meaningful value with every iteration, and fully leverage collaboration and feedback with stakeholders? We will also discuss how to eliminate the ‘IT hand-off’ step from the program, and will take a look at an entirely different approach based on the ‘data value pyramid’.