University of Heidelberg Institute of Computer Engineering


Description

The goal of the EXTOLL project is to develop an interconnection architecture specifically designed to satisfy the needs of low-latency inter-process communication in parallel machines. The project aims at a system architecture that includes advances in the network layer, in the attachment of the individual hosts to the interconnection network (the network interface controller) and in the software layer, so that applications can exploit the EXTOLL hardware and reach higher performance and efficiency.

EXTOLL is an ongoing project; this website will be updated as new results become available.

Hardware Architecture Overview

The complete architecture is formed by three major blocks: the host interface block, the network interface controller (NIC) block and the network block (green, blue and red in the block diagram to the left). The host interface in turn consists of the HyperTransport IP core and the on-chip network (HyperTransport Advanced Crossbar - HTAX). The NIC block consists of the two communication engines, VELO and RMA, and the supporting units, the ATU and the register file. Finally, the network block consists of network ports, the EXTOLL crossbar and link ports. Network ports convert between the communication engines and the EXTOLL network protocol. Link ports implement the EXTOLL link-layer protocol; in particular, retransmission is handled here. The EXTOLL crossbar is the network switching element and forwards incoming packets from any port, either link port or network port, to any outgoing port, again either network port or link port.

Both crossbar components of EXTOLL are parametrizable in width and in number of ports. A scripted environment allows easy adaptation to a different number of ports. This feature allows for relatively simple exchange or addition of different engines in the NIC block.
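
The parametrization described above can be pictured with a small software model. The following is a minimal sketch, not the actual hardware description or the scripted environment: the class `Crossbar`, its fields and the port count of 8 are hypothetical names and values chosen for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Crossbar:
    """Toy model of a crossbar parametrized by port count and data width.

    All names here are hypothetical; they do not mirror the EXTOLL RTL.
    """
    num_ports: int
    width_bits: int
    output_queues: list = field(init=False)

    def __post_init__(self):
        # One output queue per port; any input may target any output.
        self.output_queues = [deque() for _ in range(self.num_ports)]

    def forward(self, in_port, out_port, packet):
        """Move a packet from an input port to the queue of an output port."""
        for port in (in_port, out_port):
            if not 0 <= port < self.num_ports:
                raise ValueError(f"port {port} outside 0..{self.num_ports - 1}")
        self.output_queues[out_port].append((in_port, packet))


# Changing the port count is a one-parameter change in this model.
xbar = Crossbar(num_ports=8, width_bits=32)  # 8 ports is an assumed value
xbar.forward(in_port=0, out_port=5, packet=b"header+payload")
```

In this model, adding an engine to the NIC block would correspond to instantiating the crossbar with one more port; the real design achieves this through its scripted environment.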

A non-comprehensive list of features is given below:

  • HyperTransport is used as the host interface for lowest latency of data transport between CPU and device
    • The host interface of EXTOLL is implemented by the HT-Core, also developed at the CAG, which enables communication between the host and the network with very low latency and high bandwidth (see also the paper "An open-source HyperTransport core" for more information).

  • A modified HyperTransport protocol is used as the on-chip communication protocol

  • EXTOLL implements a lean and optimized network interface controller
    • Minimize state information on NIC
    • Provide user-level, virtualized access (kernel bypass)
    • Minimize CPU-device and memory-device transactions
    • Currently two primary communication engines are implemented
      • VELO - for extremely low latency send/recv style communication. Supports messages of up to 64 bytes on the hardware level.
      • RMA - direct access to remote memory using put and get operations. Enables CPU offloading, zero-copy and true one-sided protocols.
    • To support these engines, a Control & Status Registerfile and a unit to translate virtual addresses (Address Translation Unit - ATU) have been developed.

  • EXTOLL also provides its own network implementation, including link layer, routing and switching:
    • Designed for low latency
    • Cut-through switching to allow for low latency
    • Reliable transmission of messages
    • Hardware based retransmission protocol
    • In-order delivery of messages enabling latency savings in upper layers
    • Implements an optimized source-path routing
    • The network supports virtual channels for deadlock avoidance
    • Direct network: each node includes its own switch with 6 external links, enabling, for example, a 3D torus topology
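
To illustrate how source-path routing and a 3D torus with six links per node fit together, here is a minimal sketch in which the sender computes the complete list of hops before the packet leaves. It assumes dimension-order routing with the shorter wrap-around direction per axis; that policy, the function names and the `+x`/`-x` hop labels are assumptions for illustration and do not describe the actual EXTOLL routing format.

```python
def torus_route(src, dst, dims):
    """Compute a source route (list of link directions) in a 3D torus.

    Assumed policy: resolve x, then y, then z, taking the shorter
    wrap-around direction on each axis.
    """
    route = []
    for axis, name in enumerate("xyz"):
        size = dims[axis]
        forward_hops = (dst[axis] - src[axis]) % size
        backward_hops = size - forward_hops
        if forward_hops <= backward_hops:
            route += ["+" + name] * forward_hops
        else:
            route += ["-" + name] * backward_hops
    return route


def follow(src, route, dims):
    """Apply a route hop by hop and return the node that is reached."""
    pos = list(src)
    for hop in route:
        axis = "xyz".index(hop[1])
        step = 1 if hop[0] == "+" else -1
        pos[axis] = (pos[axis] + step) % dims[axis]
    return tuple(pos)


# In a 4x4x4 torus, going from x=0 to x=3 is one hop over the wrap-around link.
route = torus_route((0, 0, 0), (3, 1, 0), (4, 4, 4))
```

Each of the six hop labels corresponds to one of the six external links of a node, so intermediate switches only need to read the next entry of the route instead of looking up the destination.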

The block diagram on the right illustrates an incarnation of the EXTOLL architecture, as it was implemented for an FPGA-based prototype.

EXTOLL consists of a number of building-block components, so the architecture can be adapted to the requirements of a given system environment. The prototype implementation chooses components and a configuration that optimize communication latency, especially for small communication operations in AMD64 host systems. At the same time, the implementation supports a general set of communication operations and fits into an FPGA-based hardware platform.


FPGA Prototype Implementation

The EXTOLL prototype is based on the HTX-Board:

  • Virtex-4 FX100 FPGA, speed-grade 11 or 12
  • 6 SFP optical transceivers
  • 32 bit data width on network layer at up to 180 MHz
  • 64 bit data width on NIC layer at up to 180 MHz
  • > 90% of all slices of the FPGA are in use for the design
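
The datapath figures above translate into raw internal bandwidths by simple arithmetic. The short sketch below works this out; the helper name `raw_bandwidth_mb_per_s` is hypothetical, 1 MB is taken as 10^6 bytes, and the results are upper bounds of the datapaths at the maximum clock, not measured throughput.

```python
def raw_bandwidth_mb_per_s(width_bits, clock_mhz):
    """Raw datapath bandwidth in MB/s (1 MB = 10**6 bytes).

    width_bits / 8 bytes are moved per cycle, clock_mhz * 10**6 cycles per second.
    """
    return width_bits / 8 * clock_mhz


network_layer = raw_bandwidth_mb_per_s(32, 180)  # 32 bit at 180 MHz
nic_layer = raw_bandwidth_mb_per_s(64, 180)      # 64 bit at 180 MHz
print(f"network layer: {network_layer:.0f} MB/s, NIC layer: {nic_layer:.0f} MB/s")
```

At 180 MHz this gives 720 MB/s for the 32-bit network layer and 1440 MB/s for the 64-bit NIC layer.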

Below is a floorplan of the Virtex-4 FX100 device with an EXTOLL design after place and route. The HyperTransport interface is colored in green, the NIC layer is blue and the network layer is red. The grey areas are used for SerDes management, the I2C interface, etc.


Software Stack

A complete software stack is being developed with the following goals:

  • OS bypass
  • Layered approach
  • MPI support for communication with extremely low latency
  • Efficient usage of the available bandwidth and CPU cycles
  • Support for other middleware, for example GASNet, which in turn is the base for Unified Parallel C (UPC).

The diagram shows the different software components for EXTOLL. These are:

  • Low-level API library for VELO (libVELO)
  • Low-level API library for RMA (libRMA)
  • Linux kernel modules for the different HW modules including
    • EXTOLL base driver
    • VELO driver
    • RMA driver
    • ATU driver
    • Registerfile driver
  • MPI support through Open MPI components
  • PGAS support through GASNet
  • Network management and configuration support
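
A layered stack on top of two engines has to decide which engine carries a given message. The sketch below shows one plausible policy based on the 64-byte hardware limit of VELO stated earlier. It is an assumption for illustration: `choose_engine` and the returned labels are hypothetical and are not part of libVELO, libRMA or the MPI components, whose actual protocol selection is not described here.

```python
VELO_MAX_PAYLOAD = 64  # bytes; hardware-level VELO message limit from the feature list


def choose_engine(message_size):
    """Pick an engine for a message of message_size bytes (hypothetical policy).

    Small messages go to VELO for low latency; anything larger is handed
    to RMA, which moves data with put/get and offloads the CPU.
    """
    if message_size < 0:
        raise ValueError("message size must be non-negative")
    return "VELO" if message_size <= VELO_MAX_PAYLOAD else "RMA"


for size in (8, 64, 65, 4096):
    print(size, "bytes ->", choose_engine(size))
```

A real middleware could use a different threshold or split larger messages into several VELO messages; the point is only that the choice can be made per message in user space, without involving the kernel.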


Development Cluster

To develop the EXTOLL software, a (small) development cluster was set up at the CAG lab. The cluster consists of 10 machines, each equipped as follows:

  • Two dual-core Opteron 870 (2.0 GHz) processors
  • 4 GB RAM
  • Iwill DK8-HTX Mainboard
  • SATA HD, GE network
  • HTX-Board, revision 1.3, equipped with speed-grade 12 Xilinx Virtex-4 FX100 devices
  • proudly running Coreboot

There will soon be a larger cluster available which is also able to run the EXTOLL firmware.


Performance Results

The following performance numbers have all been measured on a two-node configuration, each node equipped with two Opteron 870 processors (2 GHz, dual-core, K8 generation) and an HTX-Board using HT400. EXTOLL is running at 180 MHz, the serial links between the nodes at 3.6 Gb/s. The achieved unidirectional payload bandwidth is 316 MB/s in this setup, while the half-round-trip latency starts at about 1 µs. Open MPI adds about 300 ns of latency. Note that these numbers have been measured on FPGA hardware and include all hardware and software latencies.

The hardware latency includes the passing of two switch stages (one EXTOLL crossbar in each node) and an optical serial link, which contributes significantly to the overall latency because of the sub-optimal latency behavior of the FPGA SerDes. As for the software latency, the CPUs used are relatively old and slow, and lower latencies can be reached with higher-clocked devices.
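
The measured figures can be put in relation to the link rate with a short calculation. The sketch below uses only the numbers quoted above, with one labeled assumption: the line coding of the serial link is not stated here, so the 8b/10b case is shown purely as a hypothetical reference point. All variable names are illustrative.

```python
LINK_RATE_GBIT_S = 3.6    # serial link rate from the text
PAYLOAD_MB_S = 316.0      # measured unidirectional payload bandwidth
BASE_LATENCY_NS = 1000    # "about 1 µs" half-round-trip latency
MPI_OVERHEAD_NS = 300     # "about 300 ns" added by the MPI layer

# Raw line rate of one link in MB/s (1 MB = 10**6 bytes).
raw_link_mb_s = LINK_RATE_GBIT_S * 1000 / 8
payload_share_of_line_rate = PAYLOAD_MB_S / raw_link_mb_s

# ASSUMPTION: 8b/10b line coding (not stated in the text) would leave
# 8 of every 10 transmitted bits for packet data.
usable_mb_s_8b10b = raw_link_mb_s * 8 / 10
payload_share_after_coding = PAYLOAD_MB_S / usable_mb_s_8b10b

# Approximate starting latency seen by an MPI application.
mpi_latency_ns = BASE_LATENCY_NS + MPI_OVERHEAD_NS

print(f"payload / raw line rate: {payload_share_of_line_rate:.1%}")
print(f"payload / 8b/10b rate:   {payload_share_after_coding:.1%}")
print(f"MPI half-round-trip:     about {mpi_latency_ns} ns")
```

With these inputs the payload bandwidth is roughly 70% of the raw 450 MB/s line rate (roughly 88% of 360 MB/s if 8b/10b coding is assumed), and the MPI-level half-round-trip latency starts at about 1.3 µs.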



Outlook

The results so far are very promising. The latency measured on the prototype is in the range of the best available commercial networking hardware. It is important to remember that EXTOLL was benchmarked on an FPGA, which puts it at a technology disadvantage. The plot on the right side illustrates the estimated latency and bandwidth that could be reached by an ASIC implementation of the EXTOLL system.

Publications


Contact

Mondrian Nuessle, email: Mondrian.Nuessle {at} ziti.uni-heidelberg.de

Last modified: 02.05.2011

Contact

Universität Heidelberg
LS Rechnerarchitektur
Prof. Dr. U. Brüning
B6, 26, Building B (3rd floor)
68131 Mannheim
Fon: +49 (0) 621 - 181 2723
Fax: +49 (0) 621 - 181 2713
Email: ulrich.bruening {at} ziti.uni-heidelberg.de

Delivery Address

Universität Heidelberg
LS Rechnerarchitektur
Prof. Dr. U. Brüning
B6, 26, Building B (3rd floor)
68159 Mannheim
