Data Science Is A Team Sport: Oracle’s New Cloud Platform Provides The Playing Field

When you think about data science, you might picture a PhD mathematician magically swirling data on a laptop until it reveals its secrets. But really, data science is a team effort.

For data science to happen, someone’s got to find and prepare datasets—which can include any piece of information, such as a location, a name, an item in a warehouse, a person’s age, a social media comment, a timestamp, or an attribute of a picture. Then, someone has to bring the data into a computer, using open source tools to apply statistical techniques to tease out relationships—and hopefully arrive at some new understanding about the world.

And finally, when the process yields a valuable insight, someone has to publish the model as a governable, repeatable process to run on future datasets.

At least, that’s how it’s supposed to work.

In reality, “most organizations are seeing only a fraction of the enormous potential of their data,” says Greg Pavlik, Oracle’s senior vice president of product development for data and AI services. That’s because, with all the people, computer power, and work processes involved in data science, too often the right handoffs don’t happen, systems and libraries aren’t shared, data isn’t secured, or there’s so much data that it’s hard to move it to the systems on which the algorithms run.

Fixing that problem is why Oracle built the Oracle Cloud Data Science Platform. The new services make it easier for data science teams to collaboratively build, train, and deploy machine learning models. “Our goal is to increase the success of data science projects,” Pavlik says.

Pavlik brings long experience in the world of open source big data projects, and saw firsthand how powerful, cloud-based platforms replaced the use of one-off, custom systems to run big data projects, thus transforming that part of the industry. Now, he says, Oracle is combining its second-generation cloud infrastructure and its industry-leading data management to do the same thing for data science.

Unlike other data science products that focus on helping individual data scientists, Oracle Cloud Infrastructure Data Science helps improve the effectiveness of data science teams with capabilities like shared projects, model catalogs, team security policies, and reproducibility and auditability features.

“Data scientists are experimenters. They want to try stuff and see how it works,” says Pavlik. “They grab sample datasets, they pull in all kinds of open source tools, and they're doing great stuff. What we want to do is let them keep doing that, but improve their productivity by automating their entire workflow and adding strong team support for collaboration to help ensure that data science projects deliver real value to businesses.”

The starting point for data science to deliver value is doing more with machine learning, and being more efficient with the data and algorithms involved.

“Effective machine learning models are the foundation of successful data science projects,” Pavlik says, but the volume and variety of data facing data science teams “can stall these initiatives before they ever get off the ground.” So Oracle Cloud Infrastructure Data Science gives the team a powerful platform to develop, train, and share machine learning algorithms, including:

AutoML algorithm selection and tuning automates the process of running tests against multiple algorithms and hyperparameter configurations. It checks results for accuracy and confirms that data scientists are picking the best model and configuration. This helps data scientists achieve the same results as the most experienced practitioners.
Automated predictive feature selection simplifies feature engineering by automatically identifying key predictive features from larger datasets.
Model evaluation generates a comprehensive suite of evaluation metrics and suitable visualizations to measure model performance against new data and can rank models over time. Model evaluation goes beyond raw performance to take into account normal behavior and uses a cost model that considers the different impacts of false positives and false negatives.
Model explanation shows the relative weighting and importance of the factors that go into generating a prediction, and offers the first commercial implementation of model-agnostic explanation. For example, with a fraud detection model, a data scientist can explain which factors are the biggest drivers of fraud so the business can modify processes or implement safeguards, or explain the factors leading to a specific prediction.
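
To make the cost-model idea concrete, here is a minimal sketch in plain Python. It is not Oracle's API or how the service works internally. The names `CostModel`, `total_cost`, and `pick_threshold`, the sample labels and scores, and the 1-to-10 cost ratio are all hypothetical, chosen only to show how weighting false positives and false negatives differently can change which configuration looks best.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CostModel:
    # Hypothetical per-error costs; real values would come from the business.
    false_positive: float
    false_negative: float


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


def total_cost(y_true: Sequence[int], y_pred: Sequence[int], costs: CostModel) -> float:
    """Weight each kind of mistake by its assumed business cost."""
    counts = confusion_counts(y_true, y_pred)
    return counts["fp"] * costs.false_positive + counts["fn"] * costs.false_negative


def pick_threshold(y_true, scores, thresholds, costs):
    """Try each candidate decision threshold; return (threshold, cost) with the lowest cost."""
    best = None
    for threshold in thresholds:
        preds = [1 if score >= threshold else 0 for score in scores]
        cost = total_cost(y_true, preds, costs)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Made-up fraud labels and model scores, for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.6, 0.2, 0.4, 0.1, 0.7, 0.55, 0.3]

# Assumption: a missed fraud case costs ten times as much as a false alarm.
costs = CostModel(false_positive=1.0, false_negative=10.0)

best_threshold, best_cost = pick_threshold(y_true, scores, [0.3, 0.5, 0.8], costs)
print(best_threshold, best_cost)
```

With these made-up numbers, the 0.8 threshold gets the most predictions right (six of eight) but misses two fraud cases, while the 0.3 threshold gets only five right yet carries the lowest total cost because it misses none. That gap between raw accuracy and business cost is what a cost model is meant to surface.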

Because Oracle Cloud Infrastructure Data Science is built on Oracle’s powerful cloud infrastructure, “we make it easy for you to get access to not just the languages and libraries and tools, but also the computer resources that are required,” Pavlik says, including integrated cloud services for big data management and access to an array of open source data stores and virtual machines for data science.

“We're all about productivity—from data exploration and model training, all the way through to the production delivery and maintenance of models,” says Pavlik. “We have made it a really productive and enterprise-ready platform experience.”

The ease of getting started is a big reason more data science work will move to the cloud, Pavlik predicts. For this new service, just sign in to Oracle Cloud and go to the data science service option on the console, “and just start creating a project and doing your work,” he says.