Review: AWS AI and Machine Learning stacks up, and up

Amazon Web Services claims to have the broadest and most complete set of machine learning capabilities. I really don’t know how the company can make these superlative claims with a straight face: Yes, the AWS machine learning offerings are broad and fairly complete and rather impressive, but so are those of Google Cloud and Microsoft Azure.

Amazon SageMaker Clarify is the new add-on to the Amazon SageMaker machine learning ecosystem for Responsible AI. SageMaker Clarify integrates with SageMaker at three points: in the new Data Wrangler to detect data biases at import time, such as imbalanced classes in the training set; in the Experiments tab of SageMaker Studio to detect biases in the model after training and to explain the importance of features; and in SageMaker Model Monitor, to detect bias shifts in a deployed model over time.

Historically, AWS has presented its services as cloud-only. That’s starting to change, at least for large enterprises that can afford to buy racks of proprietary appliances such as AWS Outposts. It’s also changing in AWS’s industrial offerings, such as Amazon Monitron and AWS Panorama, which include some edge devices.


This diagram summarizes the AWS Machine Learning stack as of December 2020. It appeared frequently during talks at AWS re:Invent.

AWS Machine Learning Services

When I reviewed Amazon SageMaker in 2018, I thought it was pretty good and that it had “significantly improved the utility of AWS for data scientists.” Little did I know then how much traction it would gain and how much it would expand in scope.

When I looked at SageMaker again in April 2020, it was in a preview phase with seven major enhancements and expansions, and I said that it was “good enough to use for end-to-end machine learning and deep learning: data preparation, model training, model deployment, and model monitoring.” I also said that the user experience still needed a little work.

There are now twelve parts in Amazon SageMaker: Studio, Autopilot, Ground Truth, JumpStart, Data Wrangler, Feature Store, Clarify, Debugger, Model Monitor, Distributed Training, Pipelines, and Edge Manager. Several of the new SageMaker features, such as Data Wrangler, are major enhancements.

Amazon SageMaker Studio

Amazon SageMaker Studio is an integrated machine learning environment where you can build, train, deploy, and analyze your models all in the same application. The IDE is based on JupyterLab, and now supports both Python and R natively in notebook kernels. It has specific support for seven frameworks: Apache MXNet, Apache Spark, Chainer, PyTorch, Scikit-learn, SparkML Serving, and TensorFlow.

SageMaker Studio seems to be a wrapper around SageMaker Notebooks with a few additional features, including SageMaker JumpStart and a different launcher. Both take you to JupyterLab notebooks for actual calculations.

I showed lots of notebook examples in my April 2020 review, but only Python notebooks. Since then, more samples have been added to the repository, the sample repository is easier to reach from a notebook, and there is now support for R kernels in the notebooks, as shown in the screenshot below. Unlike Microsoft Azure Machine Learning notebooks, SageMaker does not support RStudio.


Amazon SageMaker now supports R kernels as well as Python kernels in its notebooks. This example is a simple “Hello, World” that does some pre-analysis of an abalone measurement dataset. I also tried an end-to-end R sample.


Amazon SageMaker Notebook samples and launcher.


Amazon SageMaker Studio with the JumpStart Launched Assets panel and the Get Started screen.

Amazon SageMaker Autopilot

In the April review I showed a SageMaker Autopilot sample, which took four hours to run. Looking at another Autopilot sample in the repository, for customer churn prediction, I see that it has been improved by adding a model explainability section. This is a welcome addition, as explainability is one facet of Responsible AI, although not the whole story. (See Amazon SageMaker Clarify, below.)

According to the notes in the sample, the enabling improvement for this in the SageMaker Python SDK, introduced in June 2020, was to allow Autopilot-generated models to be configured to return probabilities of each inference. Unfortunately, that means you need to retrain any Autopilot models produced on previous versions of the SDK if you want to add explainability to them.
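In the SageMaker Python SDK, that configuration looks roughly like the sketch below: attach to an existing Autopilot job and deploy its best candidate with inference_response_keys so the endpoint returns class probabilities alongside labels. This is a minimal sketch, not the sample notebook’s exact code; the job name and instance type are placeholders.

```python
# Minimal sketch, assuming an existing Autopilot job; the job name and
# instance type are placeholders, not values from the churn sample.
from sagemaker.automl.automl import AutoML

automl = AutoML.attach(auto_ml_job_name="churn-autopilot-2020-12-01")  # hypothetical job name

predictor = automl.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    # Ask the generated inference pipeline to return probabilities as well
    # as labels -- the capability the sample's notes refer to.
    inference_response_keys=["predicted_label", "probability"],
)
```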

Amazon SageMaker Ground Truth

As I discussed in April 2020, SageMaker Ground Truth is a semi-supervised learning process for data labeling that combines human annotations with automatic annotations. I don’t see any notable changes in the service since then.

Amazon SageMaker JumpStart

SageMaker JumpStart is a new “Getting Started” feature of SageMaker Studio, which should help newcomers to SageMaker. As you can see in the screenshot below, there are two new colored icons at the bottom of the left sidebar. The upper one brings up a list of solutions, model endpoints, or training jobs created with SageMaker JumpStart; the lower one, SageMaker Components and Registries, brings up a list of projects, Data Wrangler flows, pipelines, experiments, trials, models, or endpoints, or access to the Feature Store.

The Browse JumpStart button in the SageMaker JumpStart Launched Assets panel brings up the browser tab at the right. The browser lets you look through end-to-end solutions tied to other AWS services, text models, vision models, built-in SageMaker algorithms, example notebooks, blogs, and video tutorials.

When you click on a solution square in the browser, you bring up a documentation screen for the solution, which includes a button to launch the actual solution. When you click on a model square that has a fine-tuning option, you should see both Deploy and Train buttons on the documentation screen for the model. However, when I brought up the BERT Large Cased text model, the Train button was disabled and carried a note that said “Unfortunately, fine-tuning is not yet available for this model.”


Amazon SageMaker Studio with the SageMaker JumpStart browser.

Amazon SageMaker Data Wrangler

Amazon claims that the new SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning from weeks to minutes. It essentially gives you an interactive workspace where you can import data and try data transformations; on export you can generate a processing notebook.

Supported data transformations include joining and concatenating datasets; custom transforms and formulas; encoding categorical variables; featurizing text and date/time variables; formatting strings; handling outliers and missing values; managing rows, columns, and vectors; processing numeric variables; search and edit; parse value as type; and validate strings. Custom transformations support PySpark, Pandas, and PySpark (SQL) directly, and other Python libraries with import statements.

I’m not sure I buy the “weeks to minutes” claim, unless you already know what you’re doing and could write the code snippets yourself off the top of your head. I’d believe that most people could handle data preparation with SageMaker Data Wrangler in a few hours, given some knowledge of Pandas, PySpark, and machine learning data basics, as well as a feel for statistics.

I went through a demo of SageMaker Data Wrangler using a Titanic passenger survival dataset. It took me most of an afternoon, but cost under $2 despite using some sizable VMs for processing.


You can import data into SageMaker Data Wrangler from Amazon S3 buckets, either directly or using Athena (an implementation of Presto SQL). The interface says it supports data uploads directly, but I couldn’t get that to work.


Here I’m using a custom transformation with Pandas code to drop unnecessary columns. Each transformation creates a new DataFrame.
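A sketch of the kind of snippet I used, assuming Data Wrangler’s convention of exposing the working dataset in a Pandas custom transform as a DataFrame named df; whatever df holds at the end of the snippet becomes the output of the step. The column names are typical Titanic columns, shown for illustration only.

```python
# Pandas custom transform sketch: Data Wrangler supplies the current
# dataset as `df`, and the transformed `df` becomes the step's output.
# Column names are illustrative (common Titanic dataset columns).
df = df.drop(columns=["name", "ticket", "cabin", "boat", "body", "home.dest"])
```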


Data imports and preparation steps appear on a data flow diagram. When you export the data flow, you can create a data wrangler job as a Jupyter Notebook, a notebook that creates a Pipeline, a Python code file, or a notebook that creates a Feature Store feature group.


This SageMaker Data Wrangler Job Notebook is the result of an export of a Data Wrangler flow. You can see how the export created a lot of Python code for you that would have been time-consuming to write from scratch.


At the end of the Data Wrangler Job Notebook there’s an optional SageMaker training step using XGBoost. I ran it and saw reasonably good results. Note the instances, apps, and sessions listed at the left.
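The training step the notebook generates boils down to a standard SageMaker Estimator run against the built-in XGBoost container. A hand-written equivalent might look roughly like this; the bucket paths and hyperparameters are placeholders, not what the generated notebook actually uses.

```python
import sagemaker
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Built-in XGBoost container image for the current region.
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")

xgb = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/xgb-output/",   # placeholder bucket
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

# Train on the CSV that the Data Wrangler job wrote to S3 (placeholder path).
xgb.fit({"train": TrainingInput("s3://my-bucket/prepared/train.csv", content_type="text/csv")})
```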

Amazon SageMaker Feature Store

Metadata and data sharing are two of the missing links in most machine learning workflows. SageMaker Feature Store allows you to fix that: It is a fully managed, purpose-built repository to store, update, retrieve, and share machine learning features. As mentioned in the previous section, one way to generate a feature group in Feature Store is to save the output from a SageMaker Data Wrangler flow. Another way is to use a streaming data source such as Amazon Kinesis Data Firehose. Feature Store allows you to standardize your features (for example by converting them all to the same units of measure) and to use them consistently (for example by using the same data for training and inference).
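For a sense of what Feature Store looks like from code rather than from a Data Wrangler export, here is a minimal sketch using the FeatureGroup class in the SageMaker Python SDK. The group name, S3 path, and columns are made up for illustration.

```python
import time
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()

# Every feature group needs a record identifier and an event-time column.
# Tiny illustrative table; not a real dataset.
df = pd.DataFrame({
    "passenger_id": [1, 2, 3],
    "pclass": [3, 1, 3],
    "fare": [7.25, 71.28, 7.92],
    "event_time": [time.time()] * 3,
})

fg = FeatureGroup(name="titanic-features", sagemaker_session=session)  # hypothetical name
fg.load_feature_definitions(data_frame=df)        # infer feature types from the DataFrame dtypes
fg.create(
    s3_uri="s3://my-bucket/feature-store/",       # placeholder offline-store location
    record_identifier_name="passenger_id",
    event_time_feature_name="event_time",
    role_arn=sagemaker.get_execution_role(),
    enable_online_store=True,                     # keep an online copy for low-latency reads
)
# In practice, wait for the group to reach Created status before ingesting;
# the wait is omitted here for brevity.
fg.ingest(data_frame=df, max_workers=1, wait=True)
```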


Amazon SageMaker Feature Store makes it easy to find and reuse features for machine learning.

Amazon SageMaker Clarify

SageMaker Clarify is Amazon’s Responsible AI offering. It integrates with SageMaker at three points: in SageMaker Data Wrangler to detect data biases, such as imbalanced classes in the training set; in the Experiments tab of SageMaker Studio to detect biases in the model and to explain the importance of features; and in the SageMaker Model Monitor, to detect bias shifts over time.
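The same checks can also be run outside the Studio UI with the sagemaker.clarify module. Below is a minimal sketch of a pre-training bias analysis; the S3 paths, column names, and facet values are placeholders, not the ones from my test dataset.

```python
from sagemaker import Session, get_execution_role, clarify

session = Session()

processor = clarify.SageMakerClarifyProcessor(
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# Where the training data lives and where the bias report should go (placeholders).
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",
    s3_output_path="s3://my-bucket/clarify-output/",
    label="credit_risk",
    headers=["credit_risk", "ForeignWorker", "age", "amount"],
    dataset_type="text/csv",
)

# Which outcome counts as positive, and which facet to check for bias.
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="ForeignWorker",
    facet_values_or_threshold=[1],
)

processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods="all",   # class imbalance, DPL, JS divergence, and the rest
)
```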


The Analyze step of SageMaker Data Wrangler includes a data bias report, which includes four standard tests: class imbalance (an issue in this dataset); difference in positive proportions in labels; Jensen-Shannon divergence; and conditional demographic disparity in labels, which isn’t checked for this particular report. There are four more tests, classified as “additional.”
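For the first of those metrics, the arithmetic is simple. Assuming Clarify’s documented definition, class imbalance for a facet is (n_a − n_d) / (n_a + n_d), where n_a and n_d are the member counts of the two groups, so values near +1 or −1 mean one group barely appears in the data. A quick illustrative calculation (the counts are made up, not taken from the report above):

```python
# Class imbalance, assuming Clarify's documented formula:
# CI = (n_a - n_d) / (n_a + n_d), ranging from -1 to +1.
def class_imbalance(n_advantaged: int, n_disadvantaged: int) -> float:
    return (n_advantaged - n_disadvantaged) / (n_advantaged + n_disadvantaged)

# Illustrative split: 960 records in one group, 40 in the other.
print(class_imbalance(960, 40))   # 0.92 -- a heavily imbalanced facet
```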


The Bias report is one of the tabs in the Experiments pane. It lists metrics that might indicate biases for the chosen feature, in this case ForeignWorker. The class imbalance value of -0.92 is the same as it was in the original data; in other words, foreign workers are under-represented both in the data and in the model.


The SageMaker Model Monitor can detect biases in inferences in real time. Bias metrics above the orange threshold line indicate possible drifts in the population and may require you to retrain the model.

Amazon SageMaker Debugger

The SageMaker Debugger is a misnomer, but it’s a useful facility for monitoring and profiling training metrics and system resources during machine learning and deep learning training. It allows you to detect common training errors such as inadequate RAM or GPU memory, gradient values exploding or going to zero, over-utilized CPU or GPU, and error metrics starting to rise during training. When it detects specific conditions, it can stop the training or notify you, depending on how you have set up your rules.
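Those rules are declared up front when you configure the training job. A minimal sketch of attaching a few of the built-in Debugger rules to an estimator; the particular rule choices here are illustrative, and the estimator itself (for example the XGBoost estimator sketched earlier) is omitted.

```python
from sagemaker.debugger import Rule, rule_configs

# Built-in rules that watch for the failure modes described above.
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),   # gradients going to zero
    Rule.sagemaker(rule_configs.exploding_tensor()),     # gradients blowing up
    Rule.sagemaker(rule_configs.loss_not_decreasing()),  # error metric stalls or rises
    Rule.sagemaker(rule_configs.overtraining()),         # validation error starts climbing
]

# Pass `rules=rules` to a SageMaker Estimator; each rule runs alongside the
# training job, and its evaluation status can be used to stop the job or
# send a notification, depending on how you wire it up.
```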
