Punch Torino: High-Performance Computing (HPC) Cluster Deployment on Oracle Cloud

To shorten the run time of its computational fluid dynamics (CFD) simulations, the Italian tier-1 engine manufacturer Punch Torino moved its CFD platform to Oracle Cloud Infrastructure (OCI).

By using Oracle Cloud Infrastructure High-Performance Computing (HPC), Punch Torino's engineers can now run CPU-, memory-, and I/O-intensive simulation and testing workloads up to 24% faster with 33% fewer compute cores.
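
As a rough illustration of what those two figures imply together, the sketch below estimates core-hours per job relative to the previous platform. It assumes that "24% faster" means wall-clock time fell by 24%, which the source does not state; the function name is hypothetical.

```python
def relative_core_hours(runtime_reduction: float, core_reduction: float) -> float:
    """Return core-hours per job relative to the baseline (1.0 = unchanged).

    Assumption: "faster" is read as a reduction in wall-clock time.
    """
    relative_runtime = 1.0 - runtime_reduction
    relative_cores = 1.0 - core_reduction
    return relative_runtime * relative_cores


if __name__ == "__main__":
    ratio = relative_core_hours(runtime_reduction=0.24, core_reduction=0.33)
    print(f"Core-hours per job relative to baseline: {ratio:.2f}")
```

Under that reading, a job consumes roughly half the core-hours it did before. If "faster" instead refers to throughput, the ratio would be somewhat higher.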

Partnering with the high-performance computing consulting company Doit Systems, Punch Torino took its production environment live just ten weeks after completing its proof of concept.

In its Oracle Cloud Infrastructure tenancy, Punch Torino runs the Abaqus, Converge, StarCCM+, and Optistruct applications.

Features unique to Punch Torino's deployment on Oracle Cloud Infrastructure include:

  • HPC bare metal servers coupled with Oracle’s cluster networking, which provide ultra-low latency RDMA over Converged Ethernet (RoCE) v2 (< 2 μs latency across clusters of tens of thousands of cores)
  • Easy-to-use HPC automation tools that scale bare metal servers up and down in minutes
  • Oracle's flat, two-tier network topology, which provides uniform bandwidth and latency across all nodes and allows HPC clusters to scale up linearly
  • High I/O throughput storage from the 6.4 TB NVMe SSD locally attached to the bare metal instance

For future deployments, Punch Torino is also considering:

  • New types of compute instances, such as Optimized X9
  • FastConnect to transfer more data and reduce latency in remote sessions on the GPU nodes

Architecture

Punch Torino's users access the applications over a virtual private network (VPN) from the on-premises access and control center, which is an Altair Access web application. The on-premises Active Directory system authenticates users through Oracle Cloud Infrastructure Identity and Access Management, so users do not have direct access to the high-performance computing (HPC) cluster.

The control node brings up the HPC cluster nodes on demand. After the nodes are ready, the control node separates the job into several parts and submits them to process concurrently. The Control Scheduler autoscales the compute nodes via REST APIs. The HPC cluster provisions bare metal instances on demand. The simulations are typically optimized to complete in five to six hours.
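
The scale-then-split behavior described above can be sketched as follows. This is a minimal illustration, not Punch Torino's scheduler: the function names, the per-node core count passed in by the caller, and the use of a thread pool as a stand-in for submitting parts to cluster nodes are all assumptions.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def nodes_needed(total_cores: int, cores_per_node: int, max_nodes: int) -> int:
    """Number of bare metal nodes to request, capped at max_nodes."""
    if total_cores <= 0:
        return 0
    return min(max_nodes, math.ceil(total_cores / cores_per_node))


def split_job(work_items: int, parts: int) -> list[range]:
    """Split a job into `parts` contiguous, near-equal ranges of work items."""
    base, extra = divmod(work_items, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def run_part(part: range) -> int:
    """Stand-in for a solver process; it only counts its work items."""
    return len(part)


def run_job(work_items: int, total_cores: int, cores_per_node: int, max_nodes: int) -> int:
    """Size the cluster, split the job, and process all parts concurrently."""
    nodes = nodes_needed(total_cores, cores_per_node, max_nodes)
    if nodes == 0:
        raise ValueError("job requests no cores")
    parts = split_job(work_items, nodes)
    # A thread pool stands in for submitting each part to its own node.
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        return sum(pool.map(run_part, parts))


if __name__ == "__main__":
    print(run_job(work_items=1000, total_cores=144, cores_per_node=36, max_nodes=8))
```

In a real deployment the scaling step would call the cloud provider's API and the parts would be dispatched by the workload scheduler; both are replaced here by local placeholders.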

The files Punch Torino processes can be as large as 50 GB. Three types of storage are used to optimize storage costs:

  • Simulations, which require high I/O throughput, run on hot storage provided by the 6.4 TB NVMe SSD locally attached to the bare metal instance.
  • Results are stored in warm (file) storage for analysis.
  • The remote graphic analysis session copies the files to hot (block) storage attached to the VM instance for fast rendering.

After users launch the remote graphic sessions, they can analyze the results on the Oracle Cloud Infrastructure NVIDIA VM instances. After the datasets are analyzed, the compute instances and associated hot storage are shut down and deleted. The analyzed data is stored in cold object storage, where it can be accessed for up to eight years.
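
The tiering described above can be summarized as a small lookup table. The sketch below is illustrative only: the stage names, the `Tier` class, and the treatment of "up to eight years" as a fixed retention window are assumptions, not part of the deployment.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Tier:
    name: str        # "hot", "warm", or "cold", as used in the text
    storage: str     # storage type named in the text
    ephemeral: bool  # True if the data goes away with the instance


# Hypothetical stage names mapped to the tiers described in the text.
TIER_BY_STAGE = {
    "simulate": Tier("hot", "local NVMe SSD", ephemeral=True),
    "results": Tier("warm", "file storage", ephemeral=False),
    "visualize": Tier("hot", "block storage", ephemeral=True),
    "archive": Tier("cold", "object storage", ephemeral=False),
}


def retention_end(stored_on: date, years: int = 8) -> date:
    """Date on which an assumed retention window of `years` ends."""
    try:
        return stored_on.replace(year=stored_on.year + years)
    except ValueError:
        # February 29 stored, target year is not a leap year.
        return stored_on.replace(year=stored_on.year + years, day=28)


if __name__ == "__main__":
    for stage, tier in TIER_BY_STAGE.items():
        print(f"{stage:10s} {tier.name:5s} {tier.storage}")
    print(retention_end(date(2023, 5, 1)))
```

Only the warm and cold tiers hold data after a job finishes, which is what keeps the cost of the hot tiers proportional to actual simulation and rendering time.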

The following diagram illustrates this reference architecture.

[Diagram: punch-torino-oci-arch-oracle.zip]

The following diagram shows how data flows through the architecture:

[Diagram: punch-torino-oci-flow-oracle.zip]

  1. Users initiate access to the applications from the on-premises access and control center.
  2. On-premises Active Directory authenticates the user.
  3. On-premises licensing server supplies available licenses.
  4. On-premises access and control center brings up the HPC cluster nodes on demand.
  5. Users upload the simulation file (up to 50 GB) to file ("warm") storage.
  6. The simulation file is copied to local SSD ("hot") storage and results are saved back to file storage.
  7. On-premises access and control center brings up the visual nodes on demand.
  8. The simulation file is copied from file storage to block ("hot") storage for processing by the visual node.
  9. The results are saved to object ("cold") storage for long-term storage.
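
Steps 5 through 9 above can be mimicked locally to make the file movement concrete. In this sketch, plain directories stand in for the four storage locations, a byte transformation stands in for the solver, and the file copied in step 8 is taken to be the result file; all names are hypothetical and no cloud service is called.

```python
import shutil
import tempfile
from pathlib import Path


def run_flow(root: Path, payload: bytes) -> dict[str, list[str]]:
    """Move one simulation file through the tiers in steps 5 to 9."""
    tiers = {name: root / name
             for name in ("file_warm", "local_ssd_hot", "block_hot", "object_cold")}
    for directory in tiers.values():
        directory.mkdir(parents=True, exist_ok=True)

    # Step 5: the user uploads the simulation file to file ("warm") storage.
    sim = tiers["file_warm"] / "case.sim"
    sim.write_bytes(payload)

    # Step 6: copy to local SSD ("hot"), solve, save results back to file storage.
    scratch = Path(shutil.copy2(sim, tiers["local_ssd_hot"]))
    result = tiers["file_warm"] / "case.result"
    result.write_bytes(scratch.read_bytes().upper())  # placeholder for the solver

    # Step 8: copy from file storage to block ("hot") storage for the visual node.
    view = Path(shutil.copy2(result, tiers["block_hot"]))

    # Step 9: save to object ("cold") storage for long-term retention.
    shutil.copy2(view, tiers["object_cold"])

    # Hot storage is deleted once the compute and visual nodes are torn down.
    shutil.rmtree(tiers["local_ssd_hot"])
    shutil.rmtree(tiers["block_hot"])

    # Report which files remain in the tiers that still exist.
    return {name: sorted(p.name for p in directory.iterdir())
            for name, directory in tiers.items() if directory.exists()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_flow(Path(tmp), b"mesh"))
```

After the run, only the warm and cold locations still hold files, mirroring the way the hot storage is discarded with its instances.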

The architecture has the following components:

  • Region

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Identity and access management (IAM)

    Oracle Cloud Infrastructure Identity and Access Management (IAM) enables you to control who can access your resources in Oracle Cloud Infrastructure and the operations that they can perform on those resources.

  • Audit

    The Oracle Cloud Infrastructure Audit service automatically records calls to all supported Oracle Cloud Infrastructure public application programming interface (API) endpoints as log events. Currently, all services support logging by Oracle Cloud Infrastructure Audit.

  • Availability domain

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.

  • Virtual cloud network (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

  • Security list

    For each subnet, you can create security rules that specify the source, destination, and type of traffic that must be allowed in and out of the subnet.

  • Route table

    Virtual route tables contain rules to route traffic from subnets to destinations outside a VCN, typically through gateways.

  • Dynamic routing gateway (DRG)

    The DRG is a virtual router that provides a path for private network traffic between a VCN and a network outside the region, such as a VCN in another Oracle Cloud Infrastructure region, an on-premises network, or a network in another cloud provider.

  • High-performance computing

    Oracle Cloud Infrastructure HPC is designed for massively parallel workloads that require high-frequency processor cores and cluster networking.

    Oracle Cloud Infrastructure bare metal servers coupled with Oracle’s cluster networking provide access to ultra-low latency RDMA (< 2 μs latency across clusters of tens of thousands of cores) over converged Ethernet (RoCE) v2.

  • Virtual Machine

    The Oracle Cloud Infrastructure Compute service enables you to provision and manage compute hosts in the cloud. You can launch compute instances with shapes that meet your resource requirements for CPU, memory, network bandwidth, and storage. After creating a compute instance, you can access it securely, restart it, attach and detach volumes, and terminate it when you no longer need it.

    Oracle’s bare metal servers provide customers with isolation, visibility, and control by using dedicated compute instances. The servers support applications that require high core counts, large amounts of memory, and high bandwidth. They can scale up to 160 cores (the largest in the industry), 2 TB of RAM, and up to 1 PB of block storage. Customers can build cloud environments on Oracle’s bare metal servers with significant performance improvements over other public clouds and on-premises data centers.

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" storage that you need to access quickly and frequently. Use archive storage for "cold" storage that you retain for long periods of time and rarely access.

  • File storage

    The Oracle Cloud Infrastructure File Storage service provides a durable, scalable, secure, enterprise-grade network file system. You can connect to a File Storage service file system from any bare metal, virtual machine, or container instance in a VCN. You can also access a file system from outside the VCN by using Oracle Cloud Infrastructure FastConnect or IPSec VPN.

  • Block volume

    With block storage volumes, you can create, attach, connect, and move storage volumes, and change volume performance to meet your storage, performance, and application requirements. After you attach and connect a volume to an instance, you can use the volume like a regular hard drive. You can also disconnect a volume and attach it to another instance without losing data.
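
The networking components listed above (VCN and subnets, security list, route table) follow rules that can be illustrated with Python's standard `ipaddress` module. The sketch below is a simplified model under stated assumptions, not the OCI API: the CIDR ranges, rule fields, and target names are hypothetical, and route selection is modeled as the most specific matching rule.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


def validate_subnets(vcn_cidr: str, subnet_cidrs: list[str]) -> list[str]:
    """Report subnets that fall outside the VCN or overlap one another."""
    vcn = ipaddress.ip_network(vcn_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    for i, net in enumerate(subnets):
        if not net.subnet_of(vcn):
            problems.append(f"{net} is outside {vcn}")
        for other in subnets[i + 1:]:
            if net.overlaps(other):
                problems.append(f"{net} overlaps {other}")
    return problems


@dataclass(frozen=True)
class IngressRule:
    source: str    # source CIDR block
    protocol: str  # for example "tcp"
    port_min: int
    port_max: int


def ingress_allowed(rules: list[IngressRule], src_ip: str, protocol: str, port: int) -> bool:
    """Traffic is allowed only if at least one rule matches it."""
    addr = ipaddress.ip_address(src_ip)
    return any(
        addr in ipaddress.ip_network(rule.source)
        and rule.protocol == protocol
        and rule.port_min <= port <= rule.port_max
        for rule in rules
    )


def next_hop(routes: dict[str, str], dest_ip: str) -> Optional[str]:
    """Pick the target of the most specific route that matches dest_ip."""
    addr = ipaddress.ip_address(dest_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


if __name__ == "__main__":
    print(validate_subnets("10.0.0.0/16", ["10.0.1.0/24", "10.0.1.128/25"]))
    rules = [IngressRule("192.168.0.0/16", "tcp", 22, 22)]
    print(ingress_allowed(rules, "192.168.5.9", "tcp", 22))
    print(next_hop({"0.0.0.0/0": "default-gateway", "192.168.0.0/16": "drg"}, "192.168.1.1"))
```

In this model, traffic bound for an assumed on-premises range of 192.168.0.0/16 is sent to the DRG, which matches the role the DRG plays in connecting the VCN to the on-premises network.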

Acknowledgements

  • Authors: Sasha Banks-Louie, Wei Han, Dimitri Manca
  • Contributor: Robert Lies