In Agile Data Science – Part 1 we reviewed some of the fundamental challenges of doing data science with agility: specifically, how to produce and demonstrate ‘stakeholder value’ within a typical agile time-box of 2-4 weeks. In Part 2 we consider a different approach, based on the Data Value Pyramid. With this approach, each iteration produces a valuable increment of a business solution that is demonstrable to stakeholders, opening the program up for collaboration. Within this framework the data science team is part of a multi-team program whose goal is to deliver a deployable solution. There are no hand-offs from one team to the next, which reduces overall risk and speeds time to deployment. As part of our discussion we are going to adopt some modified definitions: specifically, we are going to redefine the meaning of ‘value’, and we are also going to define what ‘done’ means.
- Value: stories that deliver demonstrable progress toward solving a business problem, at any level of the data value pyramid, ideally demonstrated to a stakeholder in a production environment for the purpose of getting feedback. Delivery of value begins as soon as the first feedback loop involving stakeholders can occur.
- Done: The definition includes the requirement that a story runs in a production environment, with data flowing through the entire data pipeline right through to the user’s browser. This applies to stories at any level of the value pyramid.
- Stories vs. Tasks (What vs. How): this distinction continues to apply at all pyramid levels. Tasks are the work required to accomplish a given story; they are essential, but they do not themselves represent stakeholder value. For example, tasks for a story about getting data might be:
- Get SQL server credentials.
- Upload data to the SQL data warehouse.
On the other hand, an experiment that scores model performance using different combinations of features is certainly valuable and can be the basis of a productive discussion with stakeholders.
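As a sketch of what such an experiment might look like, the snippet below (all names and data invented for illustration) scores every non-empty combination of three candidate features against a churn label and ranks the combinations. The correlation-based score is a toy stand-in for real model validation such as cross-validated AUC:

```python
# Toy feature-combination experiment using only the standard library.
# Feature names, values, and labels are invented for this example.
from itertools import combinations

FEATURES = {
    "tenure_months": [1, 3, 24, 36, 2, 48],
    "support_calls": [5, 4, 0, 1, 6, 0],
    "monthly_spend": [20, 25, 22, 30, 21, 28],
}
CHURNED = [1, 1, 0, 0, 1, 0]  # 1 = customer churned

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def score(names):
    """Score a feature set: correlate the summed z-scores with the label."""
    def z(col):
        m = sum(col) / len(col)
        s = (sum((v - m) ** 2 for v in col) / len(col)) ** 0.5
        return [(v - m) / s for v in col]
    summed = [sum(vals) for vals in zip(*(z(FEATURES[n]) for n in names))]
    return abs(pearson(summed, CHURNED))

# Rank all combinations of 1, 2, or 3 features, best score first.
ranked = sorted(
    ((combo, score(combo)) for r in (1, 2, 3) for combo in combinations(FEATURES, r)),
    key=lambda item: item[1],
    reverse=True,
)
```

The ranked list, not the code, is the demonstrable artifact: it is exactly the kind of table a domain expert can react to in a review.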
The Data Value Pyramid
The data value pyramid enables us to build value incrementally, from the display of very simple records through to making predictions based on refined data and models. The pyramid provides a framework for applying a reasonable amount of structure to a data science project, and gives the team a roadmap to plan around. Some of the lower levels of the pyramid may not be ‘shippable’ in the software sense, but artifacts from this work provide a basis for dialog with stakeholders, and at least a confirmation of the team’s early direction. Stories from the lower levels are also very valuable for demonstrating, debugging, and validating the operation of the data pipeline.
Here is a summary of the levels of the pyramid and the associated data science artifacts:
- Records: Display of basic data from which models are to be constructed.
- Basic Charts: Ability to summarize basic properties of the data.
- Charts with Correlations and Relationships: Ability to demonstrate relationships and correlations within the data. Which variables are the most effective predictors for the problem? This is equivalent to the exploratory data analysis (EDA) step.
- Models that can predict things: Ability to make predictions that address business problems with new data.
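The bottom two levels need very little code to exercise. Here is a minimal standard-library sketch, using an invented churn table, that displays raw records and then summarizes a column as a crude text chart:

```python
# Level 1 (Records) and Level 2 (Basic Charts) with only the standard
# library. The rows are invented for this example.
from collections import Counter

ROWS = [
    {"plan": "basic", "tenure": 2},
    {"plan": "basic", "tenure": 5},
    {"plan": "pro", "tenure": 14},
    {"plan": "pro", "tenure": 30},
    {"plan": "basic", "tenure": 1},
]

# Level 1 - Records: show raw rows.
for row in ROWS[:3]:
    print(row)

# Level 2 - Basic Charts: distribution of plans as a text bar chart.
plan_counts = Counter(row["plan"] for row in ROWS)
for plan, count in plan_counts.most_common():
    print(f"{plan:>6} | {'#' * count}")

# A basic property of the data: mean tenure across all rows.
mean_tenure = sum(r["tenure"] for r in ROWS) / len(ROWS)
print(f"mean tenure: {mean_tenure:.1f} months")
```

Trivial as it is, output like this is enough to open the first feedback loop with stakeholders.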
Iterations deliver and demonstrate artifacts from each of these levels. The route to the top of the pyramid may not be a linear path in one direction only – frequently the team may need to cycle back to lower levels to get new data or iterate on the EDA. In these cases the work can still be represented as ‘value’, and should be demonstrated to stakeholders. Among the ranks of stakeholders will be domain experts who can comment on interim results and recommend next steps or changes in direction. These trips back to the lower levels still represent progress, as they serve to improve models, eliminate data that is not useful, and ultimately converge on a solution that meets business goals.
In agile software development the increment is the user story, and for mature agile teams the definition of done means ‘shippable’. In software development user stories are defined in the following format:
|As a <user type> I want to have <capability> so that <business value>|
For data science, with ‘value’ redefined as above, we can refine our story template as follows:
|As a <stakeholder>, I want to gain some <outcome> which furthers some <business objective>|
Here <outcome> is an artifact from the data value pyramid.
The goal is to deliver ‘value’ with every iteration, even if this means displaying raw data records from a database table in the user’s browser. In this case the target audience may not be the actual business stakeholders, but there is much to be learned from publishing, presenting, sharing, and getting feedback on this work.
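To illustrate how thin this first slice can be, here is a hypothetical standard-library sketch that serves raw records as JSON so a records-level artifact can reach a browser. In a real pipeline the rows would come from the warehouse query, not a hard-coded list:

```python
# Serve raw records over HTTP using only the standard library.
# RECORDS is a stand-in for rows pulled from the real data pipeline.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RECORDS = [
    {"customer_id": 1, "plan": "basic", "churned": False},
    {"customer_id": 2, "plan": "pro", "churned": True},
]

class RecordsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Return the raw records as JSON for display in a browser.
        body = json.dumps(RECORDS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def serve(port=0):
    """Bind a server (port 0 picks a free port); call .serve_forever() to run."""
    return HTTPServer(("127.0.0.1", port), RecordsHandler)
```

Running `serve(8000).serve_forever()` and visiting http://127.0.0.1:8000/ shows the records; the point is that even the lowest pyramid level can be ‘done’ in the sense defined above.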
Example: As a project stakeholder, I want to see the results of an EDA, so that I am confident that we understand the primary drivers of customer churn rate.
Definition of Done for Data Science Stories
Here is a typical definition of done for a software story:
- Code compiles cleanly, with all static analysis warnings resolved
- Code reviewed, with all review issues resolved
- Story has been unit tested, and unit tests are automated
- Code and automated unit test cases checked into the build system
- Build passes all automated unit tests; test coverage is measured and meets a pre-defined threshold (failing to meet the coverage threshold causes the build to be rejected)
- Build passes all predefined ‘smoke’ tests – an automated functional regression
Data science stories will likely have a different definition of done for each story type. For example, here is a definition of done for a data gathering story:
- Data has been extracted from the target data sources
- Query building and validation are complete
- Queries are saved in Git
- Data has been transformed: cleaned and tokenized where appropriate
- A consolidated dataset has been constructed and is available in the analytics environment
- Documentation has been created and saved in Git
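To make the transform-and-consolidate steps concrete, here is a hypothetical sketch: rows from two invented sources are cleaned, the free-text field is tokenized, and everything is merged into one dataset keyed by customer id. The source and field names are illustrative assumptions, not part of the checklist above:

```python
# Hypothetical extract/clean/consolidate sketch. CRM notes and billing
# balances are invented sources used only to illustrate the steps.

CRM_ROWS = [
    {"id": "1", "note": "  Cancelled PLAN  "},
    {"id": "2", "note": "Happy user"},
]
BILLING_ROWS = [
    {"id": "2", "balance": "0.00"},
    {"id": "3", "balance": "19.99"},
]

def clean(text):
    """Normalize whitespace and case, then tokenize on whitespace."""
    return text.strip().lower().split()

def consolidate(crm_rows, billing_rows):
    """Merge both sources into one record per customer id."""
    merged = {}
    for row in crm_rows:
        merged.setdefault(row["id"], {})["tokens"] = clean(row["note"])
    for row in billing_rows:
        merged.setdefault(row["id"], {})["balance"] = float(row["balance"])
    return merged

dataset = consolidate(CRM_ROWS, BILLING_ROWS)
```

The consolidated `dataset` is the artifact that must land in the analytics environment for the story to count as done.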
Or, for a model evaluation story, the definition of done might be:
- The model has been run against the validation dataset
- Area under the ROC curve (AUC) has been collected and reported
- Results have been reviewed with stakeholders
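Area under the ROC curve can be computed directly from the validation labels and model scores. Here is a minimal standard-library implementation using the rank-based (Mann-Whitney) formulation, i.e. the probability that a random positive example outscores a random negative one, with ties counted as half a win:

```python
# Standard-library AUC via the Mann-Whitney formulation: the fraction of
# (positive, negative) pairs where the positive example scores higher,
# counting ties as half a win.

def auc(labels, scores):
    """labels are 0/1 validation labels; scores are model scores."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75; in practice a library routine such as scikit-learn’s `roc_auc_score` would do the same job.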
Slicing the Pyramid into Stories
We will definitely need to break the work in each layer down into story-sized chunks. Looking at the details of each stage of a data science project, we can see further levels of detail that can be broken into discrete stories:
- Data Acquisition
  - Data cleaning
  - Loading into the analytic environment
- Exploring Data
  - Data summarizing
  - Basic reporting
  - Feature extraction
  - Model training
  - Model evaluation
  - Model tuning
- Model Deployment
  - Stories to build an application to consume model results
  - Stories to publish model output to a web service
- Retraining Models
  - Acquisition of new or improved data
  - Data preparation of new data
  - Retraining models and evaluating performance
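The retraining stories amount to a champion/challenger loop: train a challenger on the new data, evaluate it against a fixed validation set, and promote it only if it beats the current model. A toy sketch, with a simple threshold ‘model’ standing in for a real one:

```python
# Toy champion/challenger retraining loop. The "model" is just a decision
# threshold fitted as the midpoint of the two class means; all data and
# model details are invented to illustrate the loop.

def train(rows):
    """Fit the toy model: rows are (feature_value, label) pairs."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, rows):
    """Fraction of rows classified correctly by the threshold."""
    return sum((x >= threshold) == bool(y) for x, y in rows) / len(rows)

def retrain(champion, new_rows, val_rows):
    """Train a challenger on new data; promote it only if it scores better."""
    challenger = train(new_rows)
    if accuracy(challenger, val_rows) > accuracy(champion, val_rows):
        return challenger  # promote the challenger
    return champion        # keep the current model
```

Holding the validation set fixed is what makes the promotion decision a fair comparison between champion and challenger.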
A prerequisite for adopting the pyramid model is getting the required data science delivery plumbing in place. Remember, our definition of ‘done’ is based on demonstrable artifacts that pull data all the way from raw data sources right through to the user’s browser.
Cross-functional Team Composition
Having defined ‘value’ and ‘done’ to require that new stories are demonstrable in a production environment, with data being extracted and processed through a real data pipeline, we need skills on the team to support this capability from the outset. Either our data science team has the breadth to apply these skills, or we need to supplement it with software development or “IT” skills. Building the models, the data pipeline infrastructure, and the software applications that leverage model capabilities concurrently removes a huge amount of risk and potential delay in getting business solutions into production.
- The overall business solution requires contributions from all of the teams involved: Data Science, Data Pipeline, and Application Development.
- Agile means demonstrating ‘value’ frequently (every 2-4 weeks).
- Demonstration of value requires a fully integrated partial solution from all teams.
We thus need an approach similar to an Agile Release Train (ART): individual teams, each with their own roles and ceremonies, contributing to the overall solution, with orchestration at the program level adding another layer of ceremonies and roles. Teams operate within a common cadence so that their work can be integrated and demonstrated at regular intervals. This is how we vertically slice with data science.
We again emphasize that ‘done’ means a story can be demonstrated in a ‘system demo’ running in the target deployment environment, with data flowing across the entire data pipeline infrastructure.
- Agile Data Science, Russell Jurney, 2014, O’Reilly Media Inc.