Data science during COVID-19: Some reassembly required

The enormous impact of the COVID-19 pandemic is obvious. What many still haven’t realized, however, is that the impact on ongoing data science production setups has been dramatic, too. Many of the models used for segmentation or forecasting started to fail when traffic and shopping patterns changed, supply chains were interrupted, and borders were closed.

In short, when people’s behavior changes fundamentally, data science models based on prior behavior patterns will struggle to keep up. Sometimes, data science systems adapt reasonably quickly when the new data starts to represent the new reality. In other cases, the new reality is so fundamentally different that the new data is not sufficient to train a new system. Or worse, the base assumptions built into the system just don’t hold anymore, so the entire process from model creation to production deployment must be revisited.

This post describes different scenarios and a few examples of what happens when old data becomes completely outdated, base assumptions are no longer valid, or patterns in the overall system change. I then highlight some of the challenges data science teams face when updating their production system and conclude with a set of recommendations for a robust and future-proof data science setup.

Data science impact scenario: Data and process change

The most dramatic scenario is a complete change of the underlying system — one that not only requires an update of the data science process but also a revision of the assumptions that went into its design in the first place. This requires a full new data science creation and productionization cycle: understanding and incorporating business knowledge, exploring data sources (possibly to replace data that doesn’t exist anymore), and selecting and fine-tuning suitable models. Examples include traffic predictions (especially near suddenly closed borders), shopping behavior under more or less stringent lockdowns, and healthcare-related supply chains.

A subset of the above is the case where the availability of the data has changed. An illustrative example here is weather prediction, where quite a bit of data is collected by commercial passenger aircraft equipped with additional sensors. With the grounding of those aircraft, the volume of available data has been drastically reduced. Because base assumptions about weather systems remain the same (ignoring for a moment that changes in pollution and energy consumption may affect the weather as well), “only” a retraining of the existing models may be sufficient. However, if the missing data represents a significant portion of the information that went into model construction, the data science team would be wise to rerun the model selection and optimization process as well.
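The retrain-versus-reselect decision described above can be sketched in a few lines. This is a minimal illustration with entirely synthetic data and a hypothetical acceptance threshold, not a recipe for any particular forecasting system: retrain the existing model on the reduced data first, and only rerun model selection if quality drops below what the team considers acceptable.

```python
# Sketch: retrain on reduced data, fall back to model re-selection if
# quality degrades. Data, model choice, and threshold are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Stand-in for the original, data-rich training feed.
X_full = rng.normal(size=(1000, 5))
y_full = X_full @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=1000)
model = Ridge(alpha=1.0).fit(X_full, y_full)

# New reality: only a fraction of the data is still being collected.
X_new, y_new = X_full[:100], y_full[:100]
X_tr, X_te, y_tr, y_te = train_test_split(X_new, y_new, random_state=0)

# Step 1: simple retraining of the existing model on the reduced data.
retrained = Ridge(alpha=1.0).fit(X_tr, y_tr)
score = r2_score(y_te, retrained.predict(X_te))

# Step 2: if quality falls below an acceptance level, rerun model
# selection and optimization as well (here: a small grid search).
THRESHOLD = 0.9  # hypothetical acceptance level
if score < THRESHOLD:
    search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=3)
    search.fit(X_tr, y_tr)
    retrained = search.best_estimator_
```

In practice the threshold and the fallback search space would come from the team’s original model-selection work, and the check would run as part of the production monitoring loop rather than ad hoc.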

Data science impact scenario: Data changes, process remains the same

In many other cases, the base assumptions remain the same. For example, recommendation engines will still work very much the same, but some of the dependencies extracted from the data will change. This is not necessarily very different from, say, a new bestseller entering the charts, but the speed and magnitude of change may be far bigger — as we saw with the sudden spike in demand for health-related supplies. If the data science process has been designed flexibly enough, its built-in change detection mechanism should quickly identify the shift and trigger a retraining of the underlying rules. Of course, that presupposes that change detection was in fact built in and that the retrained system achieves sufficient quality levels.
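One common way to build the kind of change detection mentioned above is to compare the live distribution of a feature against the training-time baseline and flag a retrain when the divergence crosses a threshold. The sketch below uses the Population Stability Index (PSI) with synthetic data; the feature, the sample sizes, and the 0.2 cutoff (a common rule of thumb) are illustrative assumptions, not a specific product’s behavior.

```python
# Sketch: drift detection via the Population Stability Index (PSI).
# All data here is synthetic; the 0.2 cutoff is a common rule of thumb.
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between a baseline sample and a current sample of one feature."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    # Open the outer bins so out-of-range live values are still counted.
    edges[0], edges[-1] = -np.inf, np.inf
    b_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    c_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Small floor avoids log(0) for empty bins.
    b_frac = np.clip(b_frac, 1e-6, None)
    c_frac = np.clip(c_frac, 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, size=5000)  # demand feature at training time
shifted = rng.normal(2.0, 1.5, size=5000)   # sudden spike in demand

psi = population_stability_index(baseline, shifted)
retrain_needed = psi > 0.2  # rule of thumb: PSI > 0.2 = significant drift
```

In a production setup this check would run on a schedule over each model input, and `retrain_needed` would feed into the pipeline that kicks off retraining and quality validation.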

Copyright © 2020 IDG Communications, Inc.
