Onwards to data-driven world, part 2: From programming to data-driven programming
In part 1 of this blog post series I painted a future vision where data science and software development together launch a new wave of data-driven digital solutions. In this post I discuss in more practical terms what data actually means to programming.
Data-driven programming can be defined as programming where program statements describe operations, transformations or queries over data instead of describing a sequence of steps to take. As a concept, data-driven programming is nothing new. For example, Eric S. Raymond’s The Art of Unix Programming positions data-driven programming squarely as a central concept in the philosophy of Unix programming. Raymond writes: “It's good practice to move as much of the complexity in your design as possible away from procedural code and into data.” He considers data-driven programming and high-level languages good practices, as they seriously increase the effectiveness of a programmer. Fred Brooks makes a similar observation in his classic book The Mythical Man-Month: “The programmer at wit's end ... can often do best by disentangling himself from his code, rearing back, and contemplating his data.”
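To make Raymond's advice concrete, here is a minimal sketch of the same small task written twice: once with the rules buried in control flow, once with the rules moved into a table. The file suffixes, the descriptions and both function names are invented for illustration only.

```python
# Procedural style: the rules live in the control flow.
# Adding a new format means editing the function body.
def describe_procedural(filename: str) -> str:
    if filename.endswith(".csv"):
        return "comma-separated table"
    elif filename.endswith(".json"):
        return "JSON document"
    elif filename.endswith(".nc"):
        return "NetCDF array data"
    return "unknown format"


# Data-driven style: the rules live in a table (hypothetical example data).
# Adding a new format means adding one row; the code stays untouched.
FORMATS = {
    ".csv": "comma-separated table",
    ".json": "JSON document",
    ".nc": "NetCDF array data",
}


def describe_data_driven(filename: str, formats: dict = FORMATS) -> str:
    # The function only knows how to look things up, not what the rules are.
    for suffix, description in formats.items():
        if filename.endswith(suffix):
            return description
    return "unknown format"
```

Both functions give the same answers, but in the second one the table could just as well be loaded from a configuration file or a database, which is the point of moving complexity out of procedural code and into data.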
Consider a case where programmers are working on an online dataset repository. The repository provides a web user interface and allows users to store datasets, organise them into folders and share data with each other. Users appreciate the ability to organise data into folders, but as usage and the amount of data grow, it becomes awkward to navigate among the datasets and folders of interest. The development team devises a plan to implement a “favourites” feature, giving users the ability to tag their favourite datasets and have them displayed on every page for quick navigation.
Initially one would think that adding a ‘star’ button to the dataset panel is enough, along with, of course, adding the corresponding favourite table to the database and updating all data access logic to support it.
When adding new attributes to a system, you need to consider CRUD. Typical pieces of data need to be Created, Read, Updated and Deleted, with the corresponding functionality on all layers of the system architecture. Creating is already covered by the ‘star’ button, but reading needs to be added to every view of the user interface to show the quick links, as well as to every view with datasets to show which of them are favourited. Fortunately the new feature is a simple one, so there is no need for updating entries in the favourite table. Deleting needs to be supported so that users can un-favourite datasets that are no longer relevant.
Soon the development team realises that the favourites feature is still not complete. When users start to really hit that ‘star’ button, they soon end up filling their UI views with oversized quick link lists (silly them, software should be used gently). The UI has room for only five quick links, so should it show the five best ones? To achieve this the team would need to allow users to give 0-5 ‘stars’, use the stars to sort database queries, and support updating the number of stars for each favourite. The team decides not to go with this idea, so instead they just limit the favourite table to five entries per user. For this simple and crude solution they need to add UI functionality that explains to the user that adding a sixth favourite is not possible until older ones are removed, plus some kind of dialog panel for pruning the favourite table so that the user experience is not too crappy.
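To give a feel for how much plumbing even this crude version needs, here is a rough sketch of just the data access layer, using Python's built-in sqlite3 module. The table name, the column names and all function names are assumptions made for this example; the UI, the tests and the documentation would all come on top of this.

```python
import sqlite3

MAX_FAVOURITES = 5  # the UI has room for only five quick links


def open_db():
    # In-memory database for the sketch; the schema is hypothetical.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE favourite ("
        " user_id TEXT NOT NULL,"
        " dataset_id TEXT NOT NULL,"
        " PRIMARY KEY (user_id, dataset_id))"
    )
    return db


def add_favourite(db, user_id, dataset_id):
    """Create. Returns False when the user already has five favourites."""
    (count,) = db.execute(
        "SELECT COUNT(*) FROM favourite WHERE user_id = ?", (user_id,)
    ).fetchone()
    if count >= MAX_FAVOURITES:
        return False  # the UI now has to ask the user to prune the list
    db.execute(
        "INSERT OR IGNORE INTO favourite (user_id, dataset_id) VALUES (?, ?)",
        (user_id, dataset_id),
    )
    return True


def list_favourites(db, user_id):
    """Read. Needed by every view that shows quick links or star icons."""
    rows = db.execute(
        "SELECT dataset_id FROM favourite WHERE user_id = ? ORDER BY dataset_id",
        (user_id,),
    )
    return [row[0] for row in rows]


def remove_favourite(db, user_id, dataset_id):
    """Delete. Un-favourite a dataset that is no longer relevant."""
    db.execute(
        "DELETE FROM favourite WHERE user_id = ? AND dataset_id = ?",
        (user_id, dataset_id),
    )
```

Note that there is no update function, because a favourite has no attributes to change. The star-rating variant the team rejected would have needed one.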
Now the team has implemented all the required functionality, but obviously the work has only started. They were a bit lazy and did not write tests before or while implementing the functionality, so they need to catch up and restore the test coverage. The technical documentation for the system needs the same updates. They also have user documentation, which needs an update as well, because the logic is not trivial and the UI screenshots have changed. But most importantly, every database migration, code refactoring, code review, technology update and other task they perform on the whole system needs to take into account that little bit of added functionality and primary data, until the very end of the lifecycle of our dataset repository software.
Anyone who has programmed professionally for any length of time can confirm that programming is first and foremost a lot of work. Laypeople may be surprised by how tiny and brittle the bits and pieces are that programmers use to build up complete systems. The beauty of software is not in how easy it is to make, because it is awfully hard, but in how easy it is to run. Once it is written, it can be used and copied endlessly at close to zero cost.
How could our development team have used a data-driven approach to build the “favourites” feature in a better way? They might have implemented a quick link list that shows the five last accessed datasets. This information can be read from the application log files: it could be a simple “grep and sort” on the command line. A link list is just a link list, so it can be updated overnight by a scheduled job. Any software engineer can implement it.
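The “grep and sort” idea fits in a few lines of Python as well. The sketch below assumes a made-up log line format of timestamp, user name and request path; a real application log will look different, and the function name is likewise only illustrative.

```python
import re

# Hypothetical log line format (an assumption, not a real log):
#   2024-03-01T12:00:00 alice GET /datasets/42
LINE = re.compile(r"^(\S+) (\S+) GET /datasets/(\S+)$")


def last_accessed(log_lines, user, limit=5):
    """Return the ids of the datasets `user` accessed most recently, newest first."""
    latest = {}  # dataset id -> latest timestamp seen for this user
    for line in log_lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # the "grep" part: skip lines that are not dataset accesses
        timestamp, who, dataset = match.groups()
        if who != user:
            continue
        # ISO 8601 timestamps in one timezone compare correctly as plain strings.
        if dataset not in latest or timestamp > latest[dataset]:
            latest[dataset] = timestamp
    # The "sort" part: newest access first, then keep the top few.
    ranked = sorted(latest, key=latest.get, reverse=True)
    return ranked[:limit]
```

A nightly job could run this once per user and write the result wherever the UI reads its quick links from. No new table, no new buttons and no new dialogs are needed.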
The limitation of the five last accesses is that usage patterns are forgotten too easily. You make a quick detour to your less accessed folders and suddenly the quick link list is filled with obsolete links you are not going to need. Well, it is easy to count the number of accesses and show the most popular items, but then the list gets stuck with some old items you used to access a lot but don’t need any more. So, you should somehow show quite recent and quite fresh items, because they are likely to be the most needed quick links. And here we get to the turning point of the story: translating “quite recent” and “quite fresh” into numbers and analytical formulas is a task for a data scientist.
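One possible way to put numbers on this, offered only as an illustration, is to let every access count for less the older it gets. In the sketch below each access contributes 0.5 raised to the power of its age divided by a half-life, so an access one half-life old is worth half a fresh one. The seven-day half-life, the use of day numbers as timestamps and the function name are all assumptions; choosing and validating the actual formula is exactly the data scientist's job.

```python
from collections import defaultdict


def rank_by_decayed_frequency(accesses, now, half_life_days=7.0, limit=5):
    """Rank datasets by access count, where old accesses fade away exponentially.

    accesses: iterable of (dataset_id, day) pairs; `day` and `now` are day numbers.
    """
    scores = defaultdict(float)
    for dataset, day in accesses:
        age = max(0.0, now - day)
        # A fresh access adds 1.0; one that is a half-life old adds 0.5, and so on.
        scores[dataset] += 0.5 ** (age / half_life_days)
    # Highest score first; ties are broken by dataset id to keep the order stable.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:limit]
```

With this kind of score a single detour does not push out a dataset that is used every day, and a dataset that was popular months ago slowly drops off the list on its own.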
And while the data scientist is working with the application logs, the whole challenge can be taken a step further. Often datasets and folders form clusters, which are collections of related items. It is enough to have one quick link for such a collection, and it should point to the most natural entry point in the cluster. This could be tackled by taking the graph of data items, integrating it with access pattern data mined from the application logs and identifying the clusters of related items. Obviously, the traditional software engineering solution would be to introduce a new layer of data management through a “data collection” feature and carry the load of all the CRUD coding, UI changes, documentation and testing. With a data-driven approach, the feature can be introduced overnight without any disruption to users and with minimal or no changes to the UI.
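As a rough sketch of what such an analysis could look like, the code below keeps only those links in the item graph that the access logs confirm, groups the connected items with a small union-find structure, and picks the most accessed item in each group as its entry point. The inputs (`links` as pairs of related items, `sessions` as lists of items accessed together), the co-access threshold and the function name are all assumptions for illustration; a real analysis would likely use a proper clustering method.

```python
from collections import Counter
from itertools import combinations


def find_clusters(links, sessions, min_co_access=2):
    """Return {entry_point: sorted members} for each cluster of related items."""
    hits = Counter()       # how often each item was accessed
    co_access = Counter()  # how often two items appeared in the same session
    for session in sessions:
        hits.update(session)
        co_access.update(combinations(sorted(set(session)), 2))

    parent = {}  # union-find forest over the items

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    # Join two linked items only if users really do access them together.
    for a, b in links:
        if co_access[tuple(sorted((a, b)))] >= min_co_access:
            parent[find(a)] = find(b)

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), []).append(item)

    # Entry point: the most accessed member (ties broken by name).
    return {
        max(members, key=lambda m: (hits[m], m)): sorted(members)
        for members in groups.values()
        if len(members) > 1
    }
```

The quick link list would then show one link per cluster, pointing at its entry point, and the whole thing could be recomputed by the same overnight job.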
Software is getting more personalised and capable of handling larger concepts. In other words, software is getting better. Instead of transmitting messages, we are handling social networks. Instead of forming logical queries to express what we want to search for, we let the search application ask us if this is what we meant. To keep up with this development, it is not possible to rely on the traditional way of making computers do what we need, which is writing it all out in precise program code. Instead, we need to heed Fred Brooks’ advice, rear back and contemplate our data.
The author is development manager of data analytics at CSC