In my August 2020 article, “How to decide on a cloud machine studying platform,” my first guideline for selecting a platform was, “Be near your knowledge.” Preserving the code close to the information is important to maintain the latency low, because the velocity of sunshine limits transmission speeds. In spite of everything, machine studying — particularly deep studying — tends to undergo all of your knowledge a number of instances (every time by way of is named an epoch).
I mentioned on the time that the perfect case for very massive knowledge units is to construct the mannequin the place the information already resides, in order that no mass knowledge transmission is required. A number of databases help that to a restricted extent. The pure subsequent query is, which databases help inside machine studying, and the way do they do it? I’ll focus on these databases in alphabetical order.
Amazon Redshift is a managed, petabyte-scale knowledge warehouse service designed to make it easy and cost-effective to investigate your whole knowledge utilizing your current enterprise intelligence instruments. It’s optimized for datasets starting from just a few hundred gigabytes to a petabyte or extra and prices lower than $1,000 per terabyte per 12 months.
Amazon Redshift ML is designed to make it simple for SQL customers to create, practice, and deploy machine studying fashions utilizing SQL instructions. The CREATE MODEL command in Redshift SQL defines the information to make use of for coaching and the goal column, then passes the information to Amazon SageMaker Autopilot for coaching by way of an encrypted Amazon S3 bucket in the identical zone.
After AutoML coaching, Redshift ML compiles the perfect mannequin and registers it as a prediction SQL perform in your Redshift cluster. You may then invoke the mannequin for inference by calling the prediction perform inside a SELECT assertion.
Abstract: Redshift ML makes use of SageMaker Autopilot to mechanically create prediction fashions from the information you specify by way of a SQL assertion, which is extracted to an S3 bucket. The very best prediction perform discovered is registered within the Redshift cluster.
BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem; it exists as an open-source project and a paid service. RAPIDS is a suite of open source software libraries and APIs, incubated by Nvidia, that uses CUDA and is based on the Apache Arrow columnar memory format. CuDF, part of RAPIDS, is a Pandas-like GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
Dask is an open-source tool that can scale Python packages to multiple machines. Dask can distribute data and computation over multiple GPUs, either in the same system or in a multi-node cluster. Dask integrates with RAPIDS cuDF, XGBoost, and RAPIDS cuML for GPU-accelerated data analytics and machine learning.
Summary: BlazingSQL can run GPU-accelerated queries on data lakes in Amazon S3, pass the resulting DataFrames to cuDF for data manipulation, and finally perform machine learning with RAPIDS XGBoost and cuML, and deep learning with PyTorch and TensorFlow.
Google Cloud BigQuery
BigQuery is Google Cloud’s managed, petabyte-scale data warehouse that lets you run analytics over vast amounts of data in near real time. BigQuery ML lets you create and execute machine learning models in BigQuery using SQL queries.
BigQuery ML supports linear regression for forecasting; binary and multi-class logistic regression for classification; K-means clustering for data segmentation; matrix factorization for creating product recommendation systems; time series for performing time-series forecasts, including anomalies, seasonality, and holidays; XGBoost classification and regression models; TensorFlow-based deep neural networks for classification and regression models; AutoML Tables; and TensorFlow model importing. You can use a model with data from multiple BigQuery datasets for training and for prediction. BigQuery ML does not extract the data from the data warehouse. You can perform feature engineering with BigQuery ML by using the TRANSFORM clause in your CREATE MODEL statement.
Summary: BigQuery ML brings much of the power of Google Cloud Machine Learning into the BigQuery data warehouse with SQL syntax, without extracting the data from the data warehouse.
IBM Db2 Warehouse
IBM Db2 Warehouse on Cloud is a managed public cloud service. You can also set up IBM Db2 Warehouse on premises with your own hardware or in a private cloud. As a data warehouse, it includes features such as in-memory data processing and columnar tables for online analytical processing. Its Netezza technology provides a robust set of analytics that are designed to efficiently bring the query to the data. A range of libraries and functions help you get to the precise insight you need.
Db2 Warehouse supports in-database machine learning in Python, R, and SQL. The IDAX module contains analytical stored procedures, including analysis of variance, association rules, data transformation, decision trees, diagnostic measures, discretization and moments, K-means clustering, k-nearest neighbors, linear regression, metadata management, naïve Bayes classification, principal component analysis, probability distributions, random sampling, regression trees, sequential patterns and rules, and both parametric and non-parametric statistics.
Summary: IBM Db2 Warehouse includes a wide set of in-database SQL analytics that includes some basic machine learning functionality, plus in-database support for R and Python.
Kinetica Streaming Data Warehouse combines historical and streaming data analysis with location intelligence and AI in a single platform, all accessible via API and SQL. Kinetica is a very fast, distributed, columnar, memory-first, GPU-accelerated database with filtering, visualization, and aggregation functionality.
Kinetica integrates machine learning models and algorithms with your data for real-time predictive analytics at scale. It allows you to streamline your data pipelines and the lifecycle of your analytics, machine learning models, and data engineering, and calculate features with streaming. Kinetica provides a full lifecycle solution for machine learning accelerated by GPUs: managed Jupyter notebooks, model training via RAPIDS, and automated model deployment and inferencing in the Kinetica platform.
Summary: Kinetica provides a full in-database lifecycle solution for machine learning accelerated by GPUs, and can calculate features from streaming data.
Microsoft SQL Server
Microsoft SQL Server Machine Learning Services supports R, Python, Java, the PREDICT T-SQL command, and the rx_Predict stored procedure in the SQL Server RDBMS, and SparkML in SQL Server Big Data Clusters. In the R and Python languages, Microsoft includes several packages and libraries for machine learning. You can store your trained models in the database or externally. Azure SQL Managed Instance supports Machine Learning Services for Python and R as a preview.
Microsoft R has extensions that allow it to process data from disk as well as in memory. SQL Server provides an extension framework so that R, Python, and Java code can use SQL Server data and functions. SQL Server Big Data Clusters run SQL Server, Spark, and HDFS in Kubernetes. When SQL Server calls Python code, it can in turn invoke Azure Machine Learning, and save the resulting model in the database for use in predictions.
Summary: Current versions of SQL Server can train and infer machine learning models in multiple programming languages.
Oracle Cloud Infrastructure (OCI) Data Science is a managed and serverless platform for data science teams to build, train, and manage machine learning models using Oracle Cloud Infrastructure. It includes Python-centric tools, libraries, and packages developed by the open source community and the Oracle Accelerated Data Science (ADS) Library, which supports the end-to-end lifecycle of predictive models:
- Data acquisition, profiling, preparation, and visualization
- Feature engineering
- Model training (including Oracle AutoML)
- Model evaluation, explanation, and interpretation (including Oracle MLX)
- Model deployment to Oracle Functions
OCI Data Science integrates with the rest of the Oracle Cloud Infrastructure stack, including Functions, Data Flow, Autonomous Data Warehouse, and Object Storage.
Models currently supported include:
ADS also supports machine learning explainability (MLX).
Summary: Oracle Cloud Infrastructure can host data science resources integrated with its data warehouse, object store, and functions, allowing for a full model development lifecycle.
Vertica Analytics Platform is a scalable columnar storage data warehouse. It runs in two modes: Enterprise, which stores data locally in the file system of nodes that make up the database, and EON, which stores data communally for all compute nodes.
Vertica uses massively parallel processing to handle petabytes of data, and does its internal machine learning with data parallelism. It has eight built-in algorithms for data preparation, three regression algorithms, four classification algorithms, two clustering algorithms, several model management functions, and the ability to import TensorFlow and PMML models trained elsewhere. Once you have fit or imported a model, you can use it for prediction. Vertica also allows user-defined extensions programmed in C++, Java, Python, or R. You use SQL syntax for both training and inference.
Summary: Vertica has a nice set of machine learning algorithms built-in, and can import TensorFlow and PMML models. It can do prediction from imported models as well as its own models.
All eight of these databases support doing machine learning internally. The exact mechanism varies, and some are more capable than others. If you have so much data that you might otherwise have to fit models on a sampled subset, however, then any of these databases might help you to build models from the full dataset without incurring serious overhead for data export.
Copyright © 2021 IDG Communications, Inc.