How to move data science into production

Deploying data science into production is still a big challenge. Not only does the deployed data science need to be updated frequently but available data sources and types change rapidly, as do the methods available for their analysis. This continuous growth of possibilities makes it very limiting to rely on carefully designed and agreed-upon standards or work solely within the framework of proprietary tools.

KNIME has always focused on delivering an open platform, integrating the latest data science developments by either adding our own extensions or providing wrappers around new data sources and tools. This allows data scientists to access and combine all available data repositories and apply their preferred tools, unlimited by a specific software supplier’s preferences. When using KNIME workflows for production, access to the same data sources and algorithms has always been available, of course. Just like many other tools, however, transitioning from data science creation to data science production involved some intermediate steps.

In this post, we are describing a recent addition to the KNIME workflow engine that allows the parts needed for production to be captured directly within the data science creation workflow, making deployment fully automatic while still allowing every module to be used that is available during data science creation.

Why is deploying data science in production so hard?

At first glance, putting data science in production seems trivial: Just run it on the production server or chosen device! But on closer examination, it becomes clear that what was built during data science creation is not what is being put into production.

I like to compare this to the chef of a Michelin star restaurant who designs recipes in his experimental kitchen. The path to the perfect recipe involves experimenting with new ingredients and optimizing parameters: quantities, cooking times, etc. Only when satisfied, are the final results — the list of ingredients, quantities, procedure to prepare the dish — put into writing as a recipe. This recipe is what is moved “into production,” i.e., made available to the millions of cooks at home that bought the book.

This is very similar to coming up with a solution to a data science problem. During data science creation, different data sources are investigated; that data is blended, aggregated, and transformed; then various models (or even combinations of models) with many possible parameter settings are tried out and optimized. What we put into production is not all of that experimentation and parameter/model optimization — but the combination of chosen data transformations together with the final best (set of) learned models.

Source link