Dataiku review: Data science fit for the enterprise

Dataiku Data Science Studio (DSS) is a platform that tries to span the needs of data scientists, data engineers, business analysts, and AI consumers. It mostly succeeds. In addition, Dataiku DSS tries to span the machine learning process from end to end, i.e. from data preparation through MLOps and application support. Again, it mostly succeeds.

The Dataiku DSS user interface is a mixture of graphical elements, notebooks, and code, as we'll see later in the review. As a user, you often have a choice of how you'd like to proceed, and you're usually not locked into your initial choice, given that graphical choices can generate editable notebooks and scripts.

During my initial discussion with Dataiku, their senior product marketing manager asked me point blank whether I preferred a GUI or writing code for data science. I said, "I usually wind up writing code, but I'll use a GUI whenever it's faster and easier." That met with approval: Many of their customers have the same pragmatic attitude.

Dataiku competes with virtually every data science and machine learning platform, but also partners with several of them, including Microsoft Azure, Databricks, AWS, and Google Cloud. I consider KNIME similar to DSS in its use of flow diagrams, and at least half a dozen platforms are similar to DSS in their use of Jupyter notebooks, including the four partners I mentioned. DSS is similar to DataRobot, H2O.ai, and others in its implementation of AutoML.

Dataiku DSS features

Dataiku says that its key capabilities are data preparation, visualization, machine learning, DataOps, MLOps, analytic apps, collaboration, governance, explainability, and architecture. It supports additional capabilities through plug-ins.

Dataiku data preparation features a visual flow where users can build data pipelines with datasets, recipes to join and transform datasets, plus code and reusable plug-in components.

Dataiku performs quick visual analysis of columns, including the distribution of values, top values, outliers, invalids, and overall statistics. For categorical data, the visual analysis includes the distribution by value, with the count and percentage of rows for each value. The visualization capabilities let you perform exploratory data analysis without resorting to Tableau, although Dataiku and Tableau are partners.
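If you prefer code, the same kind of column analysis is easy to express with Pandas. Here's a minimal sketch on invented stand-in data (the column names and values are hypothetical, not from any Dataiku dataset):

```python
import pandas as pd

# Hypothetical stand-in for a DSS dataset; names and values are invented.
df = pd.DataFrame({
    "country": ["US", "US", "FR", "DE", None, "US", "FR"],
    "revenue": [120.0, 85.5, 240.0, 60.0, 99.9, 310.0, 75.0],
})

# Categorical column: distribution by value, with count and percentage,
# much like Dataiku's quick column analysis.
counts = df["country"].value_counts(dropna=False)
percents = (counts / len(df) * 100).round(1)
print(pd.DataFrame({"count": counts, "percent": percents}))

# Numeric column: overall statistics (count, mean, std, quartiles, min/max).
print(df["revenue"].describe())
```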

Dataiku machine learning includes AutoML and feature engineering, as shown in the figure below. Every Dataiku project has a DataOps visual flow, including the pipeline of datasets and recipes associated with the project.

[Screenshot: dataiku 02. Credit: IDG]

Dataiku DSS offers three kinds of AutoML models and three kinds of expert models.

For MLOps, the Dataiku unified deployer manages project files’ movement between Dataiku design nodes and production nodes for batch and real-time scoring. Project bundles package everything a project needs from the design environment to run on the production environment.

Dataiku makes it easy to create project dashboards and share them with business users. The Dataiku visual flow is the canvas where teams collaborate on data projects; it also represents the DataOps and provides an easy way to access the details of individual steps. Dataiku permissions control who on the team can access, read, and change a project.

Dataiku provides critical capabilities for explainable AI, including reports on feature importance, partial dependence plots, subpopulation analysis, and individual prediction explanations. These are in addition to providing interpretable models.
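DSS presents these reports through its UI, but the underlying concepts map to standard open source techniques. Here's a hedged sketch of two of them, permutation-based feature importance and partial dependence plots, using scikit-learn on synthetic data; this illustrates the ideas, not Dataiku's own implementation:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

# Synthetic stand-in data; Dataiku computes these reports on project datasets.
X, y = make_classification(n_samples=1000, n_features=6, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

# Feature importance via permutation: how much does shuffling each
# feature's values hurt the model's accuracy?
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
print("Permutation importances:", np.round(result.importances_mean, 3))

# Partial dependence of the prediction on the first two features.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.show()
```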

DSS has a large collection of plug-ins and connectors. For example, time series prediction models come as a plug-in; so do interfaces to the AI and machine learning services of AWS and Google Cloud, such as Amazon Rekognition APIs for Computer Vision, Amazon SageMaker machine learning, Google Cloud Translation, and Google Cloud Vision. Not all plug-ins and connectors are available to all plans.

Dataiku targets data scientists, data engineers, business analysts, and AI consumers. I went through the Dataiku Data Scientist tutorial, which seems to be the closest match to my skills, and took screenshots as I went.

[Screenshot: dataiku 03. Credit: IDG]

Dataiku currently offers quick start tutorials for four personas: business analysts, data scientists, data engineers, and AI consumers.

Dataiku data preparation and visualization

The initial state of the flows in this tutorial reflects having some of the setup, data finding, data cleaning, and joining done by someone else, presumably a data analyst or data engineer. In a team effort, that’s likely. For a solo practitioner, it’s not. Dataiku may support both use cases, but has made a considerable effort to support teams in enterprises.

[Screenshot: dataiku 04. Credit: IDG]

The Dataiku DSS Data Scientist Quick Start tutorial has two flows, one for data preparation and one for model assessment.

Clicking into a dataset’s icon in a flow brings it up in a sheet.

[Screenshot: dataiku 05. Credit: IDG]

Dataiku DSS displays tabular data in a spreadsheet-like table. Note the shading on missing values.

Showing the data is useful, but exploratory data analysis is even more useful. Here we are generating a Jupyter notebook for a single dataset, which was in turn created by joining two prepared datasets.

I have to complain a little at this point. All of the prebuilt or generated notebooks I used were written in Python 2, but that's no longer a valid DSS environment, since the Python Software Foundation (at long last) retired Python 2 at the start of 2020. I had to edit many notebook cells for Python 3, which was annoying and time-consuming. Fortunately, the fixes were fairly simple: The most frequent was adding parentheses around the arguments of print, which Python 3 requires because print is a function rather than a statement. Dataiku should really update its notebook templates for Python 3.
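For example, the typical fix looked like this (an illustrative snippet, not Dataiku's actual template code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Python 2 syntax, as in the generated notebooks (a SyntaxError in Python 3):
#   print "Shape of dataset:", df.shape
# Python 3 fix: print is a function, so wrap the arguments in parentheses.
print("Shape of dataset:", df.shape)
```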

[Screenshot: dataiku 06. Credit: IDG]

Dataiku DSS has a number of pre-defined templates for notebooks that can visualize datasets.

The generated notebook uses standard Python libraries such as Pandas, Matplotlib, Seaborn, and SciPy to handle data, generate plots, and compute descriptive statistics.

[Screenshot: dataiku 07. Credit: IDG]

A couple of clicks and a few seconds of computation generated this notebook that does exploratory data analysis on a single dataset. The notebook goes on to display more interesting graphics and descriptive statistics, such as box plots and Shapiro-Wilk tests.
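To give a flavor of what those generated cells do, here's a minimal sketch in the same vein, using Pandas, Seaborn, and SciPy on synthetic stand-in data (the column names are invented, not the tutorial's actual schema):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for the joined customer dataset in the tutorial.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 12, 500),
    "first_item_price": rng.lognormal(3, 0.5, 500),
})

# Box plot of a numeric column, as in the generated EDA notebook.
sns.boxplot(x=df["age"])
plt.show()

# Shapiro-Wilk test for normality: a small p-value rejects normality.
stat, p = stats.shapiro(df["age"])
print(f"Shapiro-Wilk statistic={stat:.3f}, p-value={p:.3f}")
```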

Dataiku machine learning and model assessment

Before I could do anything with the Model Assessment flow zone, I had to add a recipe to check whether a customer's revenue is over or under a specific threshold variable, which is defined globally. The recipe created the high_value dataset, which has an additional column for the classification. In general, recipes in a flow (other than data preparation steps that remove rows or columns) add a column with the new computed values. Then I had to build all the flow outputs reachable from the split step.
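A Python recipe in DSS reads its input datasets and writes its output datasets through the dataiku package. The following is a rough sketch of what such a classification recipe might look like, reconstructed from memory rather than copied from the tutorial; the dataset, column, and variable names are hypothetical, and the code only runs inside a DSS code environment:

```python
import dataiku

# Read the input dataset from the flow (name is hypothetical).
customers = dataiku.Dataset("customers_prepared").get_dataframe()

# Project variables are defined globally; values come back as strings.
variables = dataiku.get_custom_variables()
threshold = float(variables["revenue_threshold"])  # hypothetical variable

# Add a classification column: 1 if the customer is high-value, else 0.
customers["high_value"] = (customers["total_revenue"] >= threshold).astype(int)

# Write the result as a new dataset in the flow.
dataiku.Dataset("high_value").write_with_schema(customers)
```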

[Screenshot: dataiku 08. Credit: IDG]

The split step looks at the data_source column and uses it to split the output into test and train datasets. The right-click context menu gives access to, among other options, “Build Flow outputs reachable from here.”
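The split itself is a visual recipe, but its logic is trivial to express in Pandas. A sketch, assuming the data_source column carries labels like "train" and "test" (the actual label values in the tutorial may differ):

```python
import pandas as pd

# Hypothetical joined dataset with a data_source column marking row origin.
df = pd.DataFrame({
    "data_source": ["train", "test", "train", "train"],
    "age": [34, 51, 29, 42],
})

# Split into train and test datasets based on the data_source column.
train = df[df["data_source"] == "train"]
test = df[df["data_source"] == "test"]
print(len(train), "train rows;", len(test), "test rows")
```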

Dataiku AutoML, interpretable models, and high-performance models

This tutorial moves on to creating and running an AutoML session with interpretable models, such as Random Forest, rather than high-performance models (just a different initial selection of model choices) or deep learning models (Keras/TensorFlow, using Python code). As it turns out, my Booster Plan Dataiku cloud instance didn’t have a Python environment that could support deep learning, and didn’t have GPUs. Both could be added using a more expensive Orbit plan, which also adds distributed Spark support.

I was restricted to in-memory training with Scikit-learn and custom models on two CPUs, which was fine for exploratory purposes. Most of the feature engineering options in the DSS AutoML model were turned off for the purposes of the tutorial. That was fine for learning purposes, but I would have used them for a real data science project.

[Screenshot: dataiku 09. Credit: IDG]

This session of AutoML using interpretable models, including custom models, showed that Random Forest gave the highest area under the ROC (receiver operating characteristic) curve. The price of the first item purchased and the customer's age were the most important variables contributing to the prediction of high-value customers.
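Since the tutorial's models train in-memory on Scikit-learn, the winning configuration is easy to approximate outside DSS. Here's a sketch on synthetic data showing how the ROC AUC metric and feature importances are computed; Dataiku's AutoML does the equivalent, plus hyperparameter search and feature handling:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the high-value customer training data.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = RandomForestClassifier(n_estimators=100, random_state=7)
model.fit(X_train, y_train)

# Area under the ROC curve on held-out data, the metric AutoML optimized here.
probs = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", round(roc_auc_score(y_test, probs), 3))

# Impurity-based importances, analogous to the variable importance report.
print("Feature importances:", np.round(model.feature_importances_, 3))
```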

Dataiku deployment and MLOps

After finding a winning model in the AutoML session, I deployed it and explored some of the MLOps features of DSS, using Scenarios. The scenario supplied with the flow for this tutorial uses a Python script to rebuild the model, and replace the deployed model if the new model has a higher ROC AUC value. The exercise to test this capability uses an external variable to change the definition of a high-value customer, which isn’t all that interesting, but does make the point about MLOps automation.
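The heart of that scenario step is a simple champion/challenger rule. I won't try to reproduce Dataiku's scenario API from memory, but the gist of the logic looks like this in plain Python, with the three model-handling functions left as clearly labeled placeholders for the real API calls:

```python
import random

def retrain_model():
    # Placeholder for the scenario step that rebuilds the model on current
    # data; returns a model handle and its ROC AUC on the test set.
    return "model_v2", random.uniform(0.7, 0.9)

def get_deployed_auc():
    # Placeholder returning the ROC AUC of the currently deployed model.
    return 0.82

def deploy(model):
    # Placeholder that swaps the deployed model for the new challenger.
    print("Deploying", model)

# Champion/challenger rule: promote the retrained model only if it wins.
new_model, new_auc = retrain_model()
if new_auc > get_deployed_auc():
    deploy(new_model)
    print(f"Promoted new model (ROC AUC {new_auc:.3f})")
else:
    print(f"Kept deployed model; challenger ROC AUC {new_auc:.3f} was not better")
```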

Overall, Dataiku DSS is a very good, end-to-end platform for data analysis, data engineering, data science, MLOps, and AI consumption. Its self-service cloud pricing is reasonable, if not cheap; the basis for enterprise pricing also seems sensible, although I have no concrete information about actual enterprise prices.

Dataiku tries hard to support non-programmers in DSS with a graphical UI and visual machine learning. The visual aspects of the product do generate notebooks with code a programmer can customize, which saves a lot of time.

I’m not totally convinced, however, that non-programming “citizen data scientists” can perform data engineering and data science effectively, even with all of the tools and training that Dataiku supplies. Data science teams need at least one member who can program and at least one member with an intuition for feature engineering and model building, not necessarily the same person. In the worst case, you might have to rely on Dataiku’s consultants for guidance.

It’s certainly worth doing a free evaluation of Dataiku DSS. You can use either the downloaded Community Edition (free forever, three users, files or open source databases) or the 14-day hosted cloud trial (five users, two CPUs, 16 GB RAM, 100 GB plus BYO cloud storage).

Cost

Hosted self-service cloud plans:
Ignition plan: $348/month, 1 CPU, 8 GB RAM, 100 GB cloud storage, file uploads, DSS plus Python, one user.
Booster plan: $1,128/month, 2 CPUs, 16 GB RAM, 100 GB plus BYO cloud storage, files plus databases plus apps, DSS plus Python plus Snowflake, five users.
Orbit plan: $1,700/month and up, adds Spark, scalable resources, 10 users.

On-premises/own cloud plans:
Community Edition: free, up to three users.
Discover Edition (up to five users), Business Edition (up to 20 users), and Enterprise Edition: subscription-based pricing that depends on the license type, the number of users, and the type of users (designers vs. explorers).

Platform

Dataiku Cloud; Linux x86-64, 16 GB RAM; macOS 10.12+ (evaluation only); Amazon EC2, Google Cloud, Microsoft Azure, VirtualBox, VMware. 64-bit JDK or JRE, Python, R. Supported browsers: latest Chrome, Firefox, and Edge.

Copyright © 2021 IDG Communications, Inc.
