Train Machine Learning Models for Healthcare Use Cases

Use the Oracle Cloud Infrastructure Data Science service to explore and train machine learning models for healthcare use cases.

Architecture

This architecture shows a typical Oracle Cloud Infrastructure (OCI) Data Science deployment.

The following diagram shows the core services and some of the optional services you can incorporate, as needed.

Figure: healthcare-ml-design-pattern.png (architecture diagram; also provided as healthcare-ml-design-pattern-oracle.zip)

The following are the key components of the architecture:

  • Object Storage or Oracle Autonomous Database as the storage location.
  • A Data Science notebook session for exploring data and developing models.
  • Model deployment to productize models and make them available through a REST API (a sketch of invoking such an endpoint follows this list).
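
As an illustration of the model deployment item above, the following minimal sketch invokes a deployed model's REST endpoint from Python. The endpoint URL and feature payload are placeholders, and the snippet assumes an OCI configuration file with an API signing key at the default location.

    # Minimal sketch: invoke a Data Science model deployment endpoint.
    # The endpoint URL and payload below are placeholders -- substitute the
    # values shown on your model deployment's details page.
    import oci
    import requests

    # Load credentials from the default OCI config file (~/.oci/config).
    config = oci.config.from_file()

    # Sign requests with the same API key used by the OCI CLI/SDK.
    signer = oci.signer.Signer(
        tenancy=config["tenancy"],
        user=config["user"],
        fingerprint=config["fingerprint"],
        private_key_file_location=config["key_file"],
    )

    # Hypothetical deployment endpoint; the real URL ends in /predict.
    endpoint = "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/ocid1.datasciencemodeldeployment.oc1..example/predict"

    # Example feature payload; the schema depends on how the model was trained.
    payload = {"age": 54, "systolic_bp": 132, "cholesterol": 210}

    response = requests.post(endpoint, json=payload, auth=signer)
    response.raise_for_status()
    print(response.json())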

The architecture has the following components:

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Virtual cloud network (VCN) and subnet

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Internet gateway

    The internet gateway allows traffic between the public subnets in a VCN and the public internet.

  • API Gateway

    Oracle API Gateway enables you to publish APIs with private endpoints that are accessible from within your network, and which you can expose to the public internet if required. The endpoints support API validation, request and response transformation, CORS, authentication and authorization, and request limiting.

  • Data Integration

    Oracle Cloud Infrastructure Data Integration is a fully managed, serverless, cloud-native service that extracts, loads, transforms, cleanses, and reshapes data from a variety of data sources into target Oracle Cloud Infrastructure services, such as Autonomous Data Warehouse and Oracle Cloud Infrastructure Object Storage. ETL (extract, transform, load) leverages fully managed scale-out processing on Spark, and ELT (extract, load, transform) leverages the full SQL push-down capabilities of Autonomous Data Warehouse to minimize data movement and improve the time to value for newly ingested data. Users design data integration processes using an intuitive, codeless user interface that optimizes integration flows to generate the most efficient engine and orchestration, automatically allocating and scaling the execution environment. Oracle Cloud Infrastructure Data Integration provides interactive exploration and data preparation and helps data engineers protect against schema drift by defining rules to handle schema changes.

  • Data catalog

    Oracle Cloud Infrastructure Data Catalog is a fully managed, self-service data discovery and governance solution for your enterprise data. It provides data engineers, data scientists, data stewards, and chief data officers a single collaborative environment to manage the organization's technical, business, and operational metadata.

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" storage that you need to access quickly, immediately, and frequently. Use archive storage for "cold" storage that you retain for long periods of time and rarely access. A sketch of reading a dataset from Object Storage inside a notebook session follows this list.

  • Autonomous Database

    Oracle Cloud Infrastructure Autonomous Database is a fully managed, preconfigured database environment that you can use for transaction processing and data warehousing workloads. You do not need to configure or manage any hardware or install any software. Oracle Cloud Infrastructure handles creating the database, as well as backing up, patching, upgrading, and tuning the database.

  • Data Science

    Oracle Cloud Infrastructure Data Science is an end-to-end machine learning (ML) service that offers JupyterLab Notebook environments and access to hundreds of popular open source tools and frameworks. Build and train ML models with NVIDIA GPUs, AutoML features, and automated hyperparameter tuning. Deploy models as HTTP endpoints or use Oracle Functions. Manage models through version control, repeatable jobs, and model catalogs.
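
To tie the storage and notebook components together, the following sketch loads a CSV object from Object Storage into a pandas DataFrame inside a notebook session. The bucket and object names are hypothetical, and the snippet assumes the notebook session authenticates with a resource principal; with API keys, pass oci.config.from_file() to the client instead.

    # Minimal sketch: load a training dataset from Object Storage inside a
    # Data Science notebook session. Bucket and object names are placeholders.
    import io

    import oci
    import pandas as pd

    # Authenticate as the notebook session's resource principal.
    signer = oci.auth.signers.get_resource_principals_signer()
    object_storage = oci.object_storage.ObjectStorageClient(config={}, signer=signer)

    namespace = object_storage.get_namespace().data
    bucket = "healthcare-ml-data"          # hypothetical bucket
    object_name = "patients/train.csv"     # hypothetical object

    # Download the object and read it into pandas for exploration.
    obj = object_storage.get_object(namespace, bucket, object_name)
    df = pd.read_csv(io.BytesIO(obj.data.content))
    print(df.shape)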

Considerations for Machine Learning

When getting started with machine learning on the Oracle Cloud Infrastructure Data Science service, consider the following:

  • Understand the Data

    Data is the primary and most critical component of any machine learning project. Published datasets have typically been curated, and features may even have been extracted for you already, making them a good choice for learning about the service.

    Working with new data requires more work to clean up artifacts, impute missing values, and transform, encode, or augment the dataset with additional features.

    This part of a data scientist's workflow is typically the most time-consuming and can easily account for 80% to 90% of the time spent on a machine learning project. A minimal data-preparation sketch appears at the end of this section.

  • Learn Jupyter Notebook syntax

    The Oracle Cloud Infrastructure Data Science service builds on top of the widely adopted Jupyter Notebook framework. It provides a rich visual environment for experimenting with data in Python. Python is one of the most popular languages for data science, and Jupyter augments it with specific syntax (called magics) that cuts down on some cumbersome operations and enhances the visual rendering of data. Take the time to learn the syntax specific to Jupyter notebooks so that you can take advantage of these features; a short example of common magics appears at the end of this section.

  • Use Jobs for expensive operations

    While exploration is a highly interactive activity that is well suited to the Jupyter Notebook interface, expensive operations like model training and hyperparameter tuning can take an extended period of time and are better off-loaded to the Jobs feature, which lets you run long-running scripts on dedicated machines. A sample training script suited to a Job appears at the end of this section.
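
The following sketch illustrates the data-preparation work described under "Understand the Data": imputing missing values, scaling numeric features, and one-hot encoding categorical ones with scikit-learn. The input file and column names are hypothetical placeholders.

    # Minimal data-preparation sketch for the "Understand the Data" step.
    # The input file and column names are hypothetical placeholders.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("train.csv")  # e.g., downloaded from Object Storage

    numeric_cols = ["age", "systolic_bp", "cholesterol"]
    categorical_cols = ["sex", "smoker"]

    # Impute missing values, scale numeric features, and one-hot encode
    # categorical features in a single reusable transformer.
    preprocess = ColumnTransformer([
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ])

    X = preprocess.fit_transform(df[numeric_cols + categorical_cols])
    print(X.shape)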
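
For the notebook-syntax consideration, the cells below show a few standard IPython magics that are commonly useful; they are part of Jupyter/IPython itself rather than anything specific to the Data Science service, and the file name is a placeholder.

    # A few IPython "magic" commands that are handy in notebook sessions.
    # Each group below represents a separate notebook cell.

    # Render matplotlib plots inline and auto-reload edited local modules.
    %matplotlib inline
    %load_ext autoreload
    %autoreload 2

    # In a separate cell, %%time (which must be the first line of its cell)
    # reports how long the cell took to run.
    %%time
    import pandas as pd
    df = pd.read_csv("train.csv")   # hypothetical file
    df.describe()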
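
Finally, for the Jobs consideration, the following standalone script sketches the kind of long-running hyperparameter search you might move out of a notebook and into a Job. The TRAIN_DATA_PATH environment variable, the target column, and the assumption that features are already numeric are all placeholders.

    # train.py -- a standalone training script of the kind you might run as a
    # Data Science Job rather than inside a notebook. The TRAIN_DATA_PATH
    # environment variable and the column names are hypothetical placeholders.
    import os

    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV


    def main():
        data_path = os.environ.get("TRAIN_DATA_PATH", "train.csv")
        df = pd.read_csv(data_path)

        # Assumes features have already been prepared as numeric columns.
        X = df.drop(columns=["readmitted"])   # hypothetical target column
        y = df["readmitted"]

        # Grid search over a small hyperparameter space; this is the kind of
        # long-running work that benefits from a dedicated Job rather than a
        # notebook session.
        search = GridSearchCV(
            RandomForestClassifier(random_state=42),
            param_grid={"n_estimators": [100, 300], "max_depth": [5, 10, None]},
            cv=5,
            scoring="roc_auc",
            n_jobs=-1,
        )
        search.fit(X, y)

        print("Best params:", search.best_params_)
        print("Best CV AUC:", search.best_score_)

        # Persist the best model so it can be saved to the model catalog later.
        joblib.dump(search.best_estimator_, "model.joblib")


    if __name__ == "__main__":
        main()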