octopus

octopus is a high-density hybrid GPU supercomputer designed for high-performance, distributed-memory parallel computing. It is composed of 34 Supermicro nodes, each hosting 2 CPUs and 4 Nvidia GeForce GTX Titan X GPUs.

Hardware

The supercomputer is composed of 34 Supermicro high-density compute nodes. Each node features 4 Nvidia GeForce GTX Titan X GPUs (12 GB each), 2 Intel Haswell processors (Intel Xeon E5-2620 v3, 2.4 GHz) and 128 GB of DDR4 RAM; node34 is a fat node with 1 TB of DDR4 RAM. The nodes are designed to optimise intra-node memory bandwidth. Each node also hosts 2 Mellanox single-port InfiniBand (IB) FDR cards (ConnectX-3, PCIe x8, 8 GT/s) that work in parallel (dual rail).
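As a rough illustration, the per-node figures above can be multiplied out to obtain cluster-wide totals for the 34 base nodes. This is only a sketch: the function name `cluster_totals` is hypothetical, and the 1 TB of node34 is assumed to mean 1024 GB.

```python
# Per-node figures quoted in the hardware description (nodes 1 to 34 only).
NODES = 34
GPUS_PER_NODE = 4
GPU_MEM_GB = 12
CPUS_PER_NODE = 2
RAM_GB_STANDARD = 128
RAM_GB_FAT = 1024  # node34; assumption: 1 TB counted as 1024 GB


def cluster_totals():
    """Return aggregate resource counts for the 34 base compute nodes."""
    total_gpus = NODES * GPUS_PER_NODE
    return {
        "gpus": total_gpus,
        "gpu_mem_gb": total_gpus * GPU_MEM_GB,
        "cpus": NODES * CPUS_PER_NODE,
        # 33 standard nodes plus the single fat node
        "ram_gb": (NODES - 1) * RAM_GB_STANDARD + RAM_GB_FAT,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

The totals exclude node35 and node40, which have a different configuration.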

The 34 nodes are connected by a dual-rail InfiniBand FDR interconnect from Mellanox. Because the nodes are so dense (4 GPU accelerators per node), inter-node communication needs maximal bandwidth so that processes can exchange data at the highest possible rate.

In addition to the 34 compute nodes, the supercomputer comprises a login node, 2 managed Mellanox IB FDR 36-port switches and one Quanta 48-port Ethernet switch.

The focus on memory bandwidth optimisation, together with the dual-rail configuration, allows us to reach measured bandwidths of up to 18 GB/s (~9 GB/s per direction).
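To put the measured figure in context, the sketch below estimates the theoretical per-direction limit of the dual-rail setup from standard link parameters. Several inputs are assumptions rather than statements from this page: the IB links are taken to be 4x FDR (14.0625 Gb/s per lane, 64b/66b encoding), the PCIe slots are taken to be PCIe 3.0 (128b/130b encoding), 1 GB is 10^9 bytes, and protocol overhead is ignored. All function names are hypothetical.

```python
# Theoretical per-direction limits, ignoring protocol overhead.
FDR_LANE_GBPS = 14.0625        # InfiniBand FDR signalling rate per lane (Gb/s)
FDR_LANES = 4                  # assumption: standard 4x links
ENCODING_64B66B = 64 / 66      # FDR line-encoding efficiency

PCIE_GTPS = 8.0                # per-lane rate quoted in the text (8 GT/s)
PCIE_LANES = 8                 # x8 slot, as quoted in the text
ENCODING_128B130B = 128 / 130  # assumption: PCIe 3.0 encoding

RAILS = 2
MEASURED_PER_DIRECTION_GBS = 9.0  # measured value quoted in the text


def fdr_rail_gbs():
    """Payload bandwidth of one FDR rail, in GB/s per direction."""
    return FDR_LANE_GBPS * FDR_LANES * ENCODING_64B66B / 8


def pcie_slot_gbs():
    """Payload bandwidth of one PCIe x8 slot, in GB/s per direction."""
    return PCIE_GTPS * PCIE_LANES * ENCODING_128B130B / 8


def dual_rail_limit_gbs():
    """Each rail is limited by the slower of its IB link and its PCIe slot."""
    return RAILS * min(fdr_rail_gbs(), pcie_slot_gbs())


if __name__ == "__main__":
    limit = dual_rail_limit_gbs()
    print(f"FDR rail:        {fdr_rail_gbs():.2f} GB/s")
    print(f"PCIe x8 slot:    {pcie_slot_gbs():.2f} GB/s")
    print(f"Dual-rail limit: {limit:.2f} GB/s per direction")
    print(f"Measured share:  {MEASURED_PER_DIRECTION_GBS / limit:.0%}")
```

Under these assumptions the IB link, not the PCIe slot, is the narrower stage of each rail, and the measured ~9 GB/s per direction is roughly two thirds of the theoretical dual-rail limit.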

Latest additions

– node35 (same CPU and RAM specs as nodes 1-34) hosts one Tesla V100 PCIe and two Xeon Phi coprocessors.
– node40 “Volta” is a Supermicro DGX-1-like node hosting 8 Tesla V100 NVLink GPUs, 2 Intel Xeon Silver 4112 processors (2.6 GHz), 768 GB of DDR4 RAM and 3 SSDs of 768 GB each.

Filesystem

BeeGFS (Fraunhofer) is a leading parallel cluster file system, developed with a strong focus on performance. Two instances are shared on the cluster with native IB RDMA support:

1) /scratch: 34 nodes x 2 HDDs x ~3 TB = ~204 TB

Uses the 2 local HDDs preferentially (with striping).

2) /project: 34 nodes x 1 HDD x ~3 TB = ~102 TB

Uses 10 pairs (buddy groups in BeeGFS) for data mirroring.

Additionally, 4 metadata servers are running, each with two 512 GB SSDs in RAID 1. Metadata mirroring is also enabled.
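The capacity figures quoted for /scratch and /project follow directly from the node and disk counts. A minimal sketch of that arithmetic is given here; the helper `raw_capacity_tb` is a hypothetical name, and the ~3 TB disk size is the approximate value from the text.

```python
NODES = 34
HDD_TB = 3  # approximate size of one HDD, as quoted in the text


def raw_capacity_tb(hdds_per_node, nodes=NODES, hdd_tb=HDD_TB):
    """Raw capacity in TB: nodes x disks per node x disk size."""
    return nodes * hdds_per_node * hdd_tb


scratch_tb = raw_capacity_tb(hdds_per_node=2)  # /scratch: 2 HDDs per node
project_tb = raw_capacity_tb(hdds_per_node=1)  # /project: 1 HDD per node

if __name__ == "__main__":
    print(f"/scratch raw capacity: ~{scratch_tb} TB")
    print(f"/project raw capacity: ~{project_tb} TB")
```

These are raw figures only. The calculation does not account for file system overhead or for the extra space that mirrored data occupies.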

Software

CentOS 6.9 is installed, with the following software:

  • CUDA 7.0, 8.0, 9.0, 10.0
  • Open MPI | CUDA-aware, parallel (dual-rail)
  • MVAPICH2 | CUDA-aware, parallel (dual-rail)
  • MATLAB
  • ParaView

Development partner | Colfax Intl, Sunnyvale CA (USA)
