Datagrom | AI & Data Science Consulting

My Take on the Gartner MQ for Data Science and Machine Learning Platforms: “Why I buy” Data Science and Machine Learning Platform Edition

Update: Gartner has postponed its Data and Analytics Conference. Hopefully, this article helps to inform in that conference’s absence. If you have any questions about my comments, please comment below or send me a note.

A few weeks ago, Gartner released its 2020 Magic Quadrant for Data Science and Machine Learning Platforms. Each year, Gartner uses its “Completeness of Vision / Ability to Execute” methodology to categorize and rank each vendor that participates in the annual Gartner dance. The Gartner report claims to provide clarity on a complicated market. I believe it creates total confusion. (Yes, I know that’s their model.)

As a veteran of the machine learning and data science technology market, I have my own thoughts on these products, their value propositions, the personas they target and the personas they actually reach. I have paid close attention to the MQ for the last 5 years and have worked closely with and for many of the companies and technologies in the quadrant. I wanted to share some of my thoughts in a series of posts. This is the first.

This first post is called “Why I buy,” and is intended to level-set these participants going forward.

Rules for reading this Post

Caveat: I am writing this to dispel BS. So this is going to be simple and opinionated. Unlike Gartner, I can be brutally honest.

  • The “why I buy” format: Borrowing the lexicon of Clayton Christensen here, “Why I buy” refers to the primary value that the primary persona gains from the purchase and use of this product.

  • Each company does a little cataloging, data wrangling, modeling, and deployment. When I refer to a company, I am talking about its core product and what its primary persona actually uses. For example, for Alteryx I talk about the desktop tool for data wrangling; for DataRobot, the modeling tool and not Paxata (recently added to the portfolio through acquisition); and so on.

  • The legacy companies in the space (SAS, MathWorks, IBM and TIBCO) are not given primary attention here. I wanted to discuss products based on what their customers actually buy and use rather than the product line they currently market. For example, SAS Enterprise Miner and SAS Base clients make up nearly all SAS use in data science. The Viya platform, while at the front of the SAS marketing effort for tools (and quite advanced), is barely used in the market. For IBM, the main use is with SPSS products. And with MathWorks, the MATLAB package is the workhorse.

  • There are two stages to all data science, a prototyping phase and some notion of a production phase. I don’t tackle the differentiation of these two phases here and will leave it for a future post. (If you want to interpret this as me de-emphasizing MLOps, feel free to. Too much hype about this right now and everyone does it.)

  • Synchronized collaboration in data science has little value. All real data scientists build their own data extracts, do their transformations and build their own models. Reproducibility is the only real concern.

  • The vendors are presented in the order of my preference within each subsection.

  • Each section has a “vendor speak” part (let’s call it ‘VS’) in case you want to use this post as ammunition. Listen for ‘VS’ at the Gartner conference.

The Data Wranglers / DAG Builders

  1. Alteryx

  2. Dataiku

  3. RapidMiner

  4. KNIME

These tools are most often used to do one-off data wrangling for reporting and visualization by individuals who cannot write Python/R. Point-and-click persona. This group might click on some “automodel” features, but these models are not put into production.

Alteryx

“Why I Buy”

“I’m in the business and need my analysts who are preparing data for purposes of visualization (Tableau, PowerBI, Excel) to use something more robust than Excel to manipulate said data.”

Big Pro: Easiest user experience and huge user community

Big Con: Processing is done in the Alteryx compute environment

No BS: Buy Alteryx if… you have teams of 10 or fewer analysts needing wrangling tools and they can’t write code.

“VS”

“We support coders.” ⇥ There is no value add here for coders. Better to use notebook/IDEs. Buying Alteryx for coders is wasted dollars.

“We have automated modeling and deployment.” ⇥ Analysts don’t build production ML models. Ever. Period.

“We have pushdown native processing.” ⇥ This requires the expensive Alteryx Server. Few companies buy this.

Ask Alteryx about Dataiku’s superior architecture.

Dataiku

“Why I Buy”

“I’m in IT and need analysts (team > 10 analysts) who are preparing data for purposes of reporting and visualization to use something blessed by IT (good security, native processing).”

Big Pro: Super modern architecture. Very little vendor lock-in.

Big Con: Less user friendly for wrangling than Alteryx, very small community of users and super expensive.

Buy Dataiku if… you are willing to trade usability for better governance and to pay a higher price.

“VS”

“We have auto modeling so we should be in both categories.” ⇥ It is not your core focus, it’s not very good, and your white papers and pushed use cases back this up.

“We have deployment of flows and models.” ⇥ Customers use the tool primarily for prototyping. Also, analysts will never and should never build production ML models. Ever.

“We are cloud agnostic” ⇥ That’s too bad, we are Azure/AWS/GCP and want opinionated software.

Ask Dataiku about Alteryx’s ease of use and 100x larger user base.

RapidMiner

“Why I Buy”

“I’m in IT and I want to govern data wrangling for purposes of visualization and reporting with a tool that is a bit worse than Alteryx, and I’m very price conscious.”

Big Pro: Connects to all the data sources and cheap. Much larger user base than Dataiku.

Big Con: Less easy to use than Alteryx. Less IT compliant than Dataiku. (Heavy client)

Buy RapidMiner if… you want to standardize your wrangling tasks cheaply.

“VS”

“We have native pushdown to Hadoop” ⇥ What’s Hadoop?

“We have machine learning” ⇥ That’s cute. I can use Google’s AutoML for free.

Ask RapidMiner for any customers switching from Dataiku or Alteryx to RapidMiner. Also, like KNIME, RapidMiner has been around a while (2006) and has fewer than 120 employees. Why?

KNIME

“Why I Buy”

“I’m in IT and I want to govern data wrangling for purposes of visualization and reporting with a tool that is much worse than Alteryx, and I’m very, very price conscious.”

Big Pro: Connects to many data sources and very cheap/free options

Big Con: Heavy client. Less easy to use than Alteryx.

Buy KNIME if….You want to govern your data wrangling tasks cheaply.

“VS”

The company has been around for 15-plus years. It lists approximately 70 employees. By comparison, Dataiku has been around 7 years and has more than 500 employees. Alteryx has more than 1,200. Why hasn’t KNIME grown comparably to these other companies?

Not in Gartner but I’d consider for data wrangling…

Trifacta (white-labeled as Google Cloud Dataprep)

Paxata (Acquired by Datarobot)

Tableau Prep

Modeling Tools

  1. DataRobot

  2. H2O.ai Driverless

These tools are primarily used to take prepared data and make a bunch of models. (Truth: auto-modeling tools beat hand-created models because modeling is boring once the data are prepped.)

DataRobot

“Why I Buy”

“I want my stats-first data scientists and analysts to have a productivity enhancing tool to find the best models (cross sectional and time series) in the fastest way.”

Pros: The best interface to build models fast.

Cons: Can be on the expensive side as a DS prototyping tool

Buy DataRobot if… you want your AI project to succeed in implementation and you are slightly less budget constrained.

“VS”

“We have Paxata for wrangling now” ⇥ Oh yeah, cool. We use Alteryx or write code.

“We do deployment now that we purchased ParallelM” ⇥ Great. Can someone at DataRobot please explain how this technology works?

Ask Datarobot to explain how an analyst with no stats training would posit a model and prep data accordingly.

H2O

“Why I Buy”

“I want my computer-first data scientists to have a productivity enhancing tool to discover good models (cross sectional and time series) in a super fast and automated way.”

Pros: Solid model interpretation

Cons: Company business model is fragmented and customer successes limited

Buy H2O.ai Driverless if…You need an inexpensive and quick prototyping tool to do exploration and build some models in the prototyping stage.

“VS”

“We have our open-source library for coders” ⇥ It’s awesome, it’s Apache 2.0 licensed, and thanks.

“We are building AI for BI” ⇥ Tableau has a thousand developers on this and they stink at it. How many are you devoting to it?

Ask H2O.ai to compare their free models with their paid Driverless product.

Additional vendors to also look at…

Google AutoML

dotData

SAS Enterprise Miner

BigML

Coding Only / Compute Environments

  1. Databricks

  2. Domino

  3. Cloud Providers

  4. Anaconda

Look at this group of vendors if you have Python and R coders to do data science prototyping and production.

Databricks

“Why I Buy”

“I want my Python-first coders to have a fully managed environment to work with data of unlimited size.”

Pros: Easy to buy. Very high user satisfaction. Low starting costs.

Cons: Best for skilled Python coders. Only on AWS and Azure. Costs can escalate without internal controls.

Buy Databricks if...You want happy Python-first coding data scientists (computer scientists) and data engineers.

“VS”

“We have something for data analysts with our SQL integration” ⇥ Databricks is a power user tool for Python coders only. The only value for Analysts is well engineered data pipelines.

“PySpark, SparkR and SparkSQL are the only tools needed for data science and all data are ready in cloud data storage programs” ⇥ Actually, that is untrue. A multitude of tools are used in data science including R, SAS, Alteryx and others as well as data storage technologies such as Teradata, Oracle and other SQL databases.

Ask Databricks about workflows that utilize stats models not available in Python (spoiler, they just built Matlab integration. Damn these guys are good).

Domino

“Why I Buy”

“I want my R and Python coding data scientists to have a simple environment to do their work and a way for IT to easily manage dependencies in these environments.”

Pros: Simple environment for coders to improve reproducibility. Complete flexibility.

Cons: No real features for auditability or collaboration. Little value add beyond solid processes and procedures for coding. Strictly a power coder tool.

Buy Domino if…Your R and Python coders want maximum flexibility in their coding environments, your IT struggles in keeping open source packages up to date, cloud cost containment is a real concern and your data scientists are constantly crashing shared machines.

“VS”

“Our platform is collaborative by nature.” ⇥ Interactive notebooks in cloud environments are just as collaborative. The collaborative features actually rely on the package management and good coding practices.

“We are agnostic to cloud vendor.” ⇥ As was the case with other vendors, companies care about use cases and not cloud lock-in. They will diversify by use cases on different clouds.

Ask Domino why the cloud notebooks are not a better option.

Azure ML (Data Science Machines) / AWS SageMaker / Google Colaboratory Notebooks

Who knows what these offerings are right now?

They are constantly changing. They are all badly in need of product management. Give these to your data scientists if you want to confuse them and you don’t care if they build anything of value quickly. In fact, just give them access to your cloud account and watch them spend money.

Pros: Sort of one click startup on the respective clouds

Cons: Offering is constantly changing. Very confusing to start and use. Basically these are all Jupyter notebooks on the respective clouds. Customers also complain of minimal support.

“VS”

“We have all the services needed to prototype and deploy a machine learning workflow.” ⇥ True. All you need is a team of unicorn data scientists with maximum patience and you can create something of value.

Ask the cloud vendors to show you how multiple personas use their products.

Anaconda

“Why I Buy”

“I want my Python and R coding data scientists to have someone to support their environment and package management.”

Pros: Support for Python installs.

Cons: Unclear business model. High turnover.

Buy Anaconda if: You have a really big team of Python coding data scientists who don’t know much about deployment and cost isn’t an issue. (This is rare.)

“VS”

“We are major contributors to the open-source community” ⇥ Thank you.

“We help Python-first coders deploy data science models and pipelines.” ⇥ Really good Python coders don’t need help with this.

Ask Anaconda why you shouldn’t just use Domino.

Other vendors to look at…

Cubonacci

Gigantum

Not elsewhere classified but in Gartner (evaluated based on why companies pay money to these vendors, in no specific order):

Matlab (Mathworks): This is great software to do statistics and write your own matrix algebra. There is a cool Python library. Not sure what Gartner sees here. Must have been a great demo gluing pieces together.

TIBCO (Statistica): Anyone buying TIBCO for data science is actually buying and using Statistica. (They purchased Alpine years ago but seem to be reworking it, and few people used Alpine back then.) Statistica is a decent statistics package, but I’d buy Stata, SAS, Matlab, Minitab, JMP or SPSS before this. So, I’m confused. How can Gartner put TIBCO in the Leaders quadrant? They must have sponsored a few dinners at last year’s summit.

SAS (Base SAS, SAS/STAT plus Enterprise Miner/Forecast Server): Good software. Super pricey. There is an implied insurance policy against complete failure when you buy SAS. Excellent for analytics teams that are just starting out and need structure. If you are buying SAS, buy the tried and true software, not Viya or any other “open source” SAS. It is not “open” at all. They will likely force Viya on you in the sale. All I have heard are issues installing it.

IBM: Not really any reason to buy IBM for anything data science but there is some implied insurance policy in buying IBM. SPSS is a fine tool but is pricey compared to other proprietary options such as Stata, Minitab, JMP, etc.

Where I’d put my money…

Alation | PowerBI | Alteryx | Dataiku | Datarobot

So that’s my take on Gartner. If you have any questions or comments, please leave them below!
