Datagrom | AI & Data Science Consulting

Released in 2020: Meet AWS SageMaker Studio, Azure Machine Learning Studio, & GCP AI Platform

Cloud Machine Learning Platform 2020 Timeline

April 29, 2020 Amazon announces Notebooks in AWS SageMaker Studio (GA)

Aug 24, 2020 Microsoft announces Notebooks in Azure Machine Learning Studio (GA)

Sep 21, 2020 Google announces Notebooks in Google AI Platform (GA)

What these releases mean is that in 2020, for the first time, each of the three top cloud providers offers embedded Jupyter Notebooks within its machine learning platform.

This article will provide:

  1. Background: A brief background of machine learning platforms and why they are needed

  2. Description: A quick look at each of these latest cloud offerings as described by the cloud providers

  3. Commentary: Platform fit, according to the needs of both cross-functional and expert data science teams

Machine Learning Platform Background

It’s no secret that enterprises have been struggling for some time to build and deploy machine learning models into production. Forbes recently reported,

One of the reasons for this is that it can be extremely complex and difficult to deploy machine learning models into production. This task often requires piecing together numerous different technologies in order to build out a coherent machine learning pipeline that can reliably deploy, monitor, and manage models in a repeatable fashion.

Making matters worse, there is a shortage of experts who can complete this task, and the vast majority of them are concentrated within the technology and financial services sectors. Businesses in other sectors may struggle.

AI leaders in the technology sector have long understood how to deploy machine learning models reliably, systematically, and repeatably at massive scale. The solution is to implement a standardized machine learning platform that simplifies the data science process.

Because technology companies attract the overwhelming majority of data science talent, they often enjoy the ability to build out their own machine learning platforms. Several well known examples include:

  • FBLearner Flow: Facebook’s AI backbone and machine learning platform

  • Metaflow: Netflix’s data science & machine learning platform which it has open-sourced

  • Michelangelo: Uber’s machine learning platform as-a-service

Businesses that lack the expertise or resources to build out their own machine learning platforms have looked to data science technology vendors for pre-built solutions. A couple of examples:

  • Dataiku: An end-to-end data science & machine learning platform which caters to both no-code analysts and R/Python code-centric data scientists alike. Jupyter Notebooks have been embedded for years.

  • Domino: A data science & machine learning platform which caters more exclusively to data scientists. Jupyter Notebooks have also been embedded for years.

The top cloud vendors, Amazon (AWS), Microsoft (Azure), and Google (GCP), have been relatively slow to develop and deliver machine learning platforms of their own to meet this need.

As the popularity of cloud platforms and data science continues to increase, however, the cloud vendors have mobilized to build cloud-native machine learning platforms to better serve their growing list of customers. They also provide yet another option for customers to consume and pay for additional cloud resources.

Releases in 2020 provide, for the first time, cloud machine learning platforms from AWS, Azure, and GCP that enable and manage most of the tasks within the CRISP-DM data science process from a single web interface. We’ll take a look at each offering in turn.

CRISP-DM Analytics Process


Amazon SageMaker Studio

Amazon claims SageMaker Studio is “The first fully integrated development environment (IDE) for machine learning.” To Amazon’s credit, they do seem to be the first cloud company to announce general availability of a machine learning platform with embedded notebooks.

Amazon further explains that, “Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models. Traditional ML development is a complex, expensive, iterative process made even harder because there are no integrated tools for the entire machine learning workflow. You need to stitch together tools and workflows, which is time-consuming and error-prone. SageMaker solves this challenge by providing all of the components used for machine learning in a single toolset so models get to production faster with much less effort and at lower cost.”

Well articulated by Amazon. AWS seems to have come to understand, perhaps only recently, the problem that data science & machine learning platform companies like Dataiku and Domino have understood well and have been solving for years.

Amazon SageMaker Studio Ecosystem

Here’s how Amazon visually describes the new SageMaker ecosystem.

Amazon SageMaker Key Capabilities

SageMaker Studio: Notebook and model quality split view

SageMaker Studio: AutoML UI

SageMaker Studio includes:

  • Web-based visual interface to build machine learning models

  • SageMaker Notebooks: To collaborate with shareable embedded Jupyter notebooks

  • SageMaker Autopilot: For AutoML capabilities

  • SageMaker Ground Truth: For labeling

  • SageMaker Experiments: Helps you organize and track iterations to machine learning models

  • SageMaker Debugger: Captures real-time metrics during training such as training and validation, confusion matrices, and learning gradients to help improve model accuracy

  • One-click deployment to API endpoint

  • SageMaker Model Monitor: Allows developers to detect and remediate concept drift

  • Kubeflow Pipelines for job orchestration and scheduling

  • Kubernetes integration for orchestration and management
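The drift detection mentioned for Model Monitor in the list above can be illustrated with a minimal sketch. This is not how SageMaker Model Monitor is implemented and it does not use any AWS API; it is one simple, generic approach, assuming labeled outcomes eventually become available so that a recent error rate can be compared against a baseline. The function names, the window size, and the threshold are all hypothetical choices made for illustration.

```python
import math

def error_rate(predictions, actuals):
    """Fraction of predictions that disagree with the observed labels."""
    wrong = sum(p != a for p, a in zip(predictions, actuals))
    return wrong / len(actuals)

def check_drift(baseline_rate, predictions, actuals, z_threshold=3.0):
    """Flag drift when the recent error rate sits well above the baseline.

    One-sided z-test on a binomial proportion. Assumes 0 < baseline_rate < 1.
    The threshold of 3.0 is an arbitrary illustrative choice.
    """
    recent_rate = error_rate(predictions, actuals)
    std_err = math.sqrt(baseline_rate * (1.0 - baseline_rate) / len(actuals))
    z_score = (recent_rate - baseline_rate) / std_err
    return z_score > z_threshold, recent_rate, z_score

# Hypothetical monitoring windows of 100 labeled predictions each.
actuals = [1] * 100
stable_predictions = [0] * 10 + [1] * 90    # 10% wrong, same as the baseline
drifted_predictions = [0] * 30 + [1] * 70   # 30% wrong

stable_alert, _, _ = check_drift(0.10, stable_predictions, actuals)
drifted_alert, drifted_rate, drifted_z = check_drift(0.10, drifted_predictions, actuals)
print(stable_alert, drifted_alert)
```

A rising error rate is only one possible signal. A managed monitoring service may also track shifts in the input data itself, which this sketch does not attempt.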

Amazon SageMaker Studio Pricing

Amazon SageMaker Studio itself is free; you pay only for the AWS services that you use within Studio.

Microsoft Azure Machine Learning Studio

The branding of Microsoft’s machine learning platform offering can be confusing. Microsoft launched Azure ML Studio in 2015 as their first drag-and-drop machine learning builder. It is now referred to as ML Studio (classic). It lacked an integrated notebook for Python / R development, an ML pipeline for automated workflows, autoML, or capabilities for model monitoring / management (among other key requirements of a modern ML platform).

With Azure Machine Learning Studio released in 2020, Microsoft seems to have addressed many of these deficiencies.

Microsoft explains that “Azure Machine Learning studio is a web portal in Azure Machine Learning that contains low-code and no-code options for project authoring and asset management.”

Microsoft Azure Machine Learning Studio

Microsoft provides this visual representation of the Azure Machine Learning Workspace.

Microsoft Azure Machine Learning Studio Key Capabilities

Azure Machine Learning Studio: Designer drag and drop UI

Azure Machine Learning Studio: AutoML UI

Azure Machine Learning Studio manages:

  • Datasets

  • Datastores

  • Compute resources

  • Notebooks: Write and run code in managed Jupyter Notebook servers

  • Azure Machine Learning designer: Drag and drop interface to train and deploy machine learning models without writing any code, and drag and drop datasets and modules to create ML pipelines

  • AutoML

  • Data labeling: To efficiently coordinate data labeling projects

  • Experiments

  • Run logs

  • Pipelines

  • Pipeline endpoints

Microsoft Azure Machine Learning Studio Pricing

Azure Machine Learning, which includes Azure Machine Learning Studio, is free. Customers incur the costs associated with the Azure resources consumed (for example, compute and storage costs).

Google AI Platform

Google claims AI Platform makes it easy for developers, data scientists, and data engineers to streamline their ML workflows. Whether it is point-and-click data science using AutoML or advanced model optimization, AI Platform helps all users take their projects from ideation to deployment seamlessly.

Google AI Platform Ecosystem

Here’s how Google visually describes the new AI Platform ecosystem, which supports an “End-to-end machine learning life cycle”.

If we dig a bit deeper, however, Google defines the specific stages of the machine learning workflow which are managed by AI Platform. Data preparation capabilities are notably excluded. Blue boxes indicate services handled by Google AI Platform.

Google AI Platform Key Capabilities

AI Platform: Dashboard UI

AI Platform manages:

  • AI Platform Notebooks: A managed Jupyter Notebook service that provides fully configured environments for model development. You can then train your models in the cloud with

  • AI Platform Training: For machine learning model training

  • AI Explanations: Helps you understand your model's outputs, verify the model behavior, recognize bias in your models, and get ideas for ways to improve your model and your training data

  • AI Platform Vizier: A black-box optimization service, to tune hyperparameters and optimize your model’s output

  • AI Platform Prediction: Deploys your models at scale to serve predictions in the cloud, and manages the infrastructure needed to run your model and make it available for online and batch prediction requests

  • AutoML Vision Edge: To deploy your models at the edge and trigger real-time actions based on local data

  • TensorFlow Enterprise: Offers enterprise-grade support for your TensorFlow instance

  • MLOps: Manage your models, experiments, and end-to-end workflows

  • AI Platform Pipelines: Deploy robust, repeatable pipelines. Continuous evaluation helps you monitor the performance of your models and provides continual feedback on how your models are performing over time.
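To make the "black-box optimization" idea behind the Vizier entry above more concrete, here is a minimal sketch of hyperparameter tuning by random search. It is not Vizier's algorithm and it does not call any Google Cloud API; random search is swapped in as the simplest black-box method. The objective function, the parameter names, and the search ranges are all made up for illustration, with the objective standing in for an expensive train-and-evaluate run.

```python
import random

def validation_loss(learning_rate, depth):
    """Stand-in for an expensive train-and-evaluate run (made up for this sketch)."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2

def random_search(objective, space, trials, seed=0):
    """Black-box search: only calls the objective, never inspects it."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(trials):
        params = {
            "learning_rate": rng.uniform(*space["learning_rate"]),
            "depth": rng.randint(*space["depth"]),
        }
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

# Hypothetical search space: (low, high) bounds for each hyperparameter.
space = {"learning_rate": (0.001, 1.0), "depth": (2, 12)}
best_params, best_loss = random_search(validation_loss, space, trials=200)
print(best_params, best_loss)
```

The search only ever sees parameter values going in and a loss coming out, which is what makes it "black-box". A managed service would typically choose the next trial more cleverly than uniform sampling does.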

Google AI Platform Pricing

Google explains that Explainable AI, Notebooks, Vizier, TensorFlow Enterprise, and Pipelines (Marketplace) can be used at no charge, but you do pay for any Google Cloud resources, such as Compute and Storage, that you use with them.

Cloud Machine Learning Platform Fit

Each of the newly released cloud machine learning platforms can provide good options for cloud customers depending on the composition of their data science teams and unique requirements. We’ll consider:

  1. Cross-Functional Data Science Teams — Data science project teams which include contributors and stakeholders who do not code

  2. Expert Data Science Teams — Code-first data science project teams which include mostly experts who code

Cross-Functional Data Science Teams

Of the top three cloud machine learning platforms, Microsoft Azure Machine Learning Studio arguably caters to a wider range of skill-sets and user personas. This is because in addition to Notebooks, it seems to be the only machine learning cloud platform which also includes visual data preparation capabilities within the platform itself. This is a feature which Azure has included since its previous machine learning platform iteration, Azure ML Studio (classic), as this tutorial clearly demonstrates.

Visual data preparation (in addition to Notebooks) can be useful for several main reasons:

  1. Citizen data scientists: Visual data preparation features enable collaboration between data scientists who code in R / Python, and analysts and data engineers who may not. For small scale data science projects, data scientists may build out entire projects themselves using code in Notebooks.

    For large scale projects, however, data scientists may need to collaborate with analysts during the data wrangling process. Analysts often do not code, and are accustomed to working with Excel-like visual interfaces for data manipulation.


  2. Project explainability: Model explainability is certainly a critical component of “Business Understanding” within the CRISP-DM analytics model, and each of the new machine learning platforms from the leading cloud companies now provides capabilities in this area.

    Explainability of entire projects, including the Data Preparation phase, can be equally important. It may be difficult to explain model behavior to business leaders without also explaining the input data used to train these models. And business leaders are more likely to understand visual interpretations of data manipulation than lines of code in a Notebook.


  3. Time to value: Even for expert data scientists who are highly proficient in R / Python code, visual data preparation features may complement their efforts, and allow them to deliver business-ready projects in less time. Because data preparation can be the most time consuming phase of project development, additional capabilities which simplify this process can be helpful.


To further illustrate the importance of simplifying the Data Preparation step in the CRISP-DM process: in a 2020 Anaconda survey of data scientists with 2,360 responses, respondents reported spending on average 45% of their time getting data ready (loading and cleansing) before they can use it to develop models and visualizations.

Anaconda 2020 survey: Data scientist time allocation

The success and growth of data science platforms like Alteryx which provide robust visual data preparation functionality also indicate substantial market demand for visual data preparation capabilities.

Expert Data Science Teams

From purely a process perspective, Amazon SageMaker Studio and Google AI Platform are better suited for teams of expert, highly skilled data scientists. This is because they currently do not seem to provide embedded visual data preparation features within their platforms.

AWS and GCP both offer supplementary visual data prep solutions: Trifacta Wrangler Pro and Cloud Dataprep by Trifacta, respectively. These solutions are not, however, embedded within SageMaker Studio or AI Platform. This makes for a less seamless experience, which could increase time to value compared with projects built within Azure Machine Learning Studio.

As the CRISP-DM analytics process diagram shows, the data preparation and modeling phases are closely linked. Data scientists often need to switch quickly back and forth between these tasks because input training data directly affects model behavior and quality.

For example, a data scientist may want to test how removing or including a feature column in a training data set affects model prediction quality. In that case, analysts or data scientists who wish to manipulate data visually will likely find it more cumbersome to switch back and forth between separate data prep and modeling platforms on AWS or GCP.
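For readers who do code, the feature-column experiment described above might look something like the following in a Notebook. This is a minimal sketch under stated assumptions: the data is synthetic, the column names ("signal" and "noise") are invented, and a simple nearest-centroid classifier stands in for whatever model a team would actually use. It relies only on the Python standard library rather than any cloud platform SDK.

```python
import random

def make_rows(n, rng):
    """Synthetic rows of (features, label). Column names are made up."""
    rows = []
    for i in range(n):
        label = i % 2
        features = {
            "signal": rng.gauss(4.0 * label, 1.0),  # depends on the label
            "noise": rng.gauss(0.0, 1.0),           # unrelated to the label
        }
        rows.append((features, label))
    return rows

def fit_centroids(rows, columns):
    """Nearest-centroid 'model': the mean of each kept column per class."""
    sums, counts = {}, {}
    for features, label in rows:
        totals = sums.setdefault(label, [0.0] * len(columns))
        for j, col in enumerate(columns):
            totals[j] += features[col]
        counts[label] = counts.get(label, 0) + 1
    return {label: [t / counts[label] for t in totals]
            for label, totals in sums.items()}

def predict(centroids, features, columns):
    """Return the label whose centroid is closest to this row."""
    point = [features[col] for col in columns]
    def sq_dist(center):
        return sum((p - c) ** 2 for p, c in zip(point, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(train, test, columns):
    """Train on the kept columns only, then score on held-out rows."""
    centroids = fit_centroids(train, columns)
    hits = sum(predict(centroids, f, columns) == y for f, y in test)
    return hits / len(test)

rng = random.Random(0)
train, test = make_rows(200, rng), make_rows(200, rng)
all_columns = ["signal", "noise"]

# Score the model with every column, then again with each column removed.
results = {"all columns": accuracy(train, test, all_columns)}
for dropped in all_columns:
    kept = [c for c in all_columns if c != dropped]
    results["without " + dropped] = accuracy(train, test, kept)

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Each pass through the loop changes the prepared data and retrains, which is the back-and-forth between data preparation and modeling that the CRISP-DM diagram describes.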

Cloud Machine Learning Platform Summary

In summary, 2020 is the first year in which each of the top three cloud providers offers a machine learning platform which includes managed, embedded Jupyter notebooks for data scientists to code in R & Python.

With these machine learning platforms, the leading cloud companies seem to have made great strides in simplifying the machine learning process for their customers, any of whom, with little more than a credit card, may immediately begin running experiments at low cost.

Azure Machine Learning Studio may be the best choice for less technical customers and for projects where cross-functional collaboration is required. AWS SageMaker Studio and Google AI Platform may be better suited for expert data science teams.

Each of these platforms may provide suitable options for customers who have already migrated much of their data and workloads to the cloud. As they are in their infancy, however, they may not yet provide a compelling reason to do so on their own.
