

# Introduction to FPGA computing for the HPC ecosystem



### FPGA architecture



#### **FPGA Architecture Overview**

- A field-programmable gate array (FPGA)
- Reconfigurable semiconductor integrated circuit (IC)
- FPGAs are a cheaper off-the-shelf alternative

- Grid of configurable logic composed of:
  - Logic Element (LEs) or Adaptive Logic Modules (ALMs)
  - Programmable switches
  - Digital Signal Processing blocks
  - RAM blocks
  - o Etc...





## **Logic Element**

- Main building block
- Can built any arbitrary logic circuit
- Lookup table (LUT)
- Register
- Multiplexer part

## LE High-Level Block Diagram for Intel® Cyclone® 10 LP Devices (source: Intel)





## Look-up Table (LUTs)

- Built out of:
  - EEPROM or SRAM holding the configuration, i.e., LUT-mask
  - o Set of multiplexers for bits selection drove to the output
- A k-LUT can implement any function of k inputs
- $2^k$ SRAM bits and  $2^k$  multiplexers
- Ex: 4-LUT with (A,B,C,D) as inputs







## Adaptive Logic Modules (ALM)

- Extend LEs
- Improved performance
- More complex but also more flexible





#### **DSP Block**

- High-performance multiply/add/accumulate operations
- Arria 10 DSP Block can do 32-bit IEEE-compliant floating-point multiply-add



(source: alteral)



## Random Access Memory (RAM) Blocks

- on-chip memory structures to support design
- Typical sizes:
  - Single memory block is 20 Kilobits
  - MLABs are general-purpose dual-port memory SRAM array (640 bits)





#### **FPGA Interconnect**

- Type:
  - SRAM: reprogrammable and volatile
  - FLASH: reprogrammable and non-volatile
  - Antifuse: One-time programmable and non-volatile
- Most FPGAs use SRAM cell technology to program interconnect and LUT function levels





## Phase-Locked Loop (PLL)

- Many FPGAs use a phase-locked loop (PLL) to increase the internal clock speed.
- Ex: The iCE40 on the IceStick allows you to run up to 275 MHz by setting the internal PLL with the onboard 12 MHz reference clock.



(source:Altera)



## **FPGA** programming

- Active mode: FPGA controls programming sequence automatically at power on
- <u>Passive mode</u>: CPU controls programming
- Program stored using either EEPROM, CPLD, SRAM, etc ...



(source:Intel -- Cyclone 10 GX FPGA



## FPGA systems (CPU-FPGA)

- Modern FPGA cards combine CPU and FPGA
- <u>Internal bus</u>: low-power embedded devices (System on Chip)
- External bus: PCIe for high-performance computing



(source:Design of FPGA-Based Computing Systems with OpenCL)



**FPGA** on the market



#### **FPGA vendors**

- Two major FPGA vendors:
  - o Intel [Altera]
  - o Xillinx [AMD]
- Intel acquired Altera in 2015
- Xillinx is solely focusing on the FPGA market
- While Intel is a sum of many parts
- Both profiles are very interesting for heterogeneous computing
- Among the others:
  - Lattice Semiconductor
  - QuickLogic
  - Microchip Technology
  - Achronix
  - Efinix

#### Programmable Logic Devices' Vendors by Revenue in Calendar 2015



Source: IHS



## Intel® FPGA family

- Intel® FPGAs are ideal for a wide variety of applications, from high-volume applications to state-of-the-art products.
- Each FPGA series with different features:
  - Embedded memory,
  - Digital Signal Processing (DSP) blocks,
  - High-speed transceivers,
  - High-speed I/O pins to cover a broad range of applications
- Intel® has four classes of FPGAs to meet market needs from the industry's highest density and performance to the most cost effective:
  - Agilex FPGA and SoC devices accelerate your delivery of the most advanced bandwith-intensive applications
  - Stratix 10 FPGA and SoC family enables you to deliver high-performance, state-of-the-art products to market faster with lower risk and higher productivity
  - Arria family delivers optimal performance and power efficiency in the midrange
  - Cyclone 10GX FPGA series is built to meet your low-power, cost-sensitive design needs, enabling you to get to market faster



## Intel® FPGA family

• Fabric + Tiles: built using heterogeneous 3D system-in-package (SiP) technology

| Intel® FPGA Family                                           | Technology     | Architecture   |
|--------------------------------------------------------------|----------------|----------------|
| Intel <sup>®</sup> Agilex <sup>™</sup> F-Series and I-Series | 10 nm SuperFin | Fabric + Tiles |
| Intel <sup>®</sup> Stratix® 10                               | 14 nm FinFet   | Fabric + Tiles |
| Intel <sup>®</sup> Arria® 10                                 | 20 nm Planar   | Monolithic     |
| Intel <sup>®</sup> Cyclone® 10 GX                            | 20 nm Planar   | Monolithic     |



## Intel® - Xilinx Device Comparison

| Application         | Xilinx Devices                                            | Intel Devices                                                                                                                                            |
|---------------------|-----------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Highest performance | Versal Prime                                              | Intel Agilex F-Series                                                                                                                                    |
|                     | Versal Premium                                            | Intel Agilex I-Series                                                                                                                                    |
| High<br>performance | Virtex UltraScale+ Kintex<br>UltraScale+ Zynq UltraScale+ | Intel Agilex F-Series Intel Agilex I-Series Intel<br>Stratix 10 GX Intel Stratix 10 SX<br>Intel Stratix 10 TX Intel Stratix 10 MX Intel<br>Stratix 10 DX |
| Mid-range           | Virtex UltraScale Kintex<br>UltraScale Zynq-7000          | Intel Stratix 10 GX Intel Stratix 10 SX Intel<br>Stratix 10 TX Intel Stratix 10 MX<br>Intel Stratix 10 DX Intel Arria 10 GX Intel Arria<br>10 SX         |
| Low cost            | Artix -7                                                  | Intel Cyclone 10 GX                                                                                                                                      |



#### HLS v.s. HDL

• **HLS:** High-Level Synthesis allows designers to describe hardware using high-level programming languages like C, C++, or SystemC. This means that HLS works at a higher level of abstraction, where developers can describe algorithms or logic without specifying the exact hardware details.

• SystemVerilog/VHDL: These HDLs require a more detailed specification of the hardware, providing a gate-level or Register Transfer Level (RTL) description. They require knowledge of the specific hardware constructs, like registers, flip-flops, etc.



## **Productivity**

- **HLS:** Usually, HLS offers faster development time since engineers can write code using familiar programming paradigms. Automated synthesis tools then translate the high-level code into RTL, allowing quicker prototyping.
- SystemVerilog/VHDL: Writing in HDLs typically takes more time as developers have to manually describe the low-level hardware details. This can result in more control and optimization but is generally more time-consuming.



## Flexibility & Optimization

- HLS: While HLS can accelerate development, it often provides less control over the final hardware implementation. This might result in less efficient utilization of FPGA resources or higher latency compared to hand-crafted RTL code.
- SystemVerilog/VHDL: Since these languages allow developers to describe hardware at a more granular level, there is usually greater opportunity for manual optimization of the design.



## **Learning Curve**

- **HLS**: Generally, HLS has a lower learning curve for software engineers or those familiar with C/C++. This makes it more accessible to developers who might not have a hardware background.
- SystemVerilog/VHDL: Learning these languages typically requires a deeper understanding of hardware concepts. Thus, there's a steeper learning curve, but it can provide more expertise in hardware design.



#### Verification

- **HLS:** Verification might be less comprehensive compared to what can be achieved with traditional HDLs, although tools are evolving to bridge this gap.
- **SystemVerilog/VHDL:** These languages offer robust verification methodologies and frameworks that are often used in industry for rigorous verification of complex designs.



#### Use cases

- **HLS:** Often preferred for algorithm development, data flow designs, and when a software prototype exists that needs to be converted into hardware. => hardware acceleration
- SystemVerilog/VHDL: Used for more traditional hardware design, where control over implementation details and optimizations is critical. => chip design



## **Applications**

- Historically present in embedded devices
- Small devices with dedicated functions
- Applications:
  - Consumer Automotive
  - Measurement
  - Communications
  - Military & Industrial
  - Computer & Storage
  - o Etc...





## **FPGA-based HPC accelerators**







## **Mapping program to FPGA**

Example provided by Altera (Intel)



#### **CPU** instructions

#### High-level code

Mem[100] += 42 \* Mem[101]



#### **CPU** instructions

- R0 ← Load Mem[100]
- R1 ← Load Mem[101]
- R2 ← Load #42
- R2 ← Mul R1, R2
- R0 ← Add R2, R0
- Store R0 → Mem[100]



## **Example of simple CPU**





## Load constant value into register





## CPU activity, step by step





## **Remove Fetch operation**





## **Remove unused ALUs**





## Remove unused Load / Store





## Remove unused Load / Store



- Removed fetch
- Removed unused ALUs
- Removed unused Load/Store
- Connect elements and propagate state



## Simplify workflow



- Removed fetch
- Removed unused ALUs
- Removed unused Load/Store
- Connect elements and propagate state
- Schedule and simplify workflow



## Custom data-path for the program

#### High-level code

Mem[100] += 42 \* Mem[101]

#### Custom data-path



#### Build exactly what you need:

Operations
Data widths
Memory size & configuration

#### Efficiency:

Throughput / Latency / Power



**Models for Heterogeneous Computing** 





**OpenCL: Low-level Heterogeneous Parallel Programing** 



#### What is OpenCL

- OpenCL (Open Computing Language)
- Framework dedicated to heterogeneous computing:
  - Host API and Kernel language
  - Low-level programming style
  - Use for Hardware Acceleration (not only FPGA)
- Open and royalty-free
  - o Initially developed by Apple and technical teams (AMD, IBM
  - First release in August, 2009
  - Maintained by Khronos
  - Khronos also maintain other standard
  - o (e.g., OpenGL, SYCL, EGL, etc...)
  - Official website for the OpenCL standard
  - o OpenCL guide: a github repository





#### **Adopters**

- OpenCL is widely used throughout the industry
- Many silicon <u>vendors</u> ship OpenCL with their processors, including GPUs, DSPs and FPGAs
- Intel is a strong contributor





## FPGA architecture for OpenCL implementation





#### **Platform Model**

- Abstract Hardware Model
- The platform consists of a single host and multiple devices
- Multiple platforms can coexist on a system but they are generally isolated
- There is a one-to-one mapping between the platform and a vendor provided SDK



|      | SDK | 520 FPGA | (source:Bittware) |
|------|-----|----------|-------------------|
| host |     |          | device            |

| OpenCL platform              | SDK version                       |
|------------------------------|-----------------------------------|
| CPU (host) and CPU (device)  | Intel SDK for OpenCL applications |
| CPU (host) and GPU (device)  | Nvidia SDK for OpenCL             |
| CPU (host) and GPU (device)  | AMD APP SDK 3.0 for 64-bit Linux  |
| CPU (host) and FPGA (device) | Intel FPGA SDK for OpenCL         |



#### **Execution Model**

- Execution Model:
  - Define how host and devices communicate
  - An OpenCL context is created with one or more devices
  - o Provides an environment for host-device interaction



(source:Design of FPGA-Based Computing Systems with OpenCL)



## **Kernel Programming Model**

- Kernels are functions executed on an OpenCL device
- The execution unit is the work-item
- Work-items are organized into work-groups
- Collection of work-items is called a **NDRange**, i.e., a multidimensional grid (max N=3)
- Sizes of NDRange and work-groups are specified by the host program
- Work-item identified by its global work-item ID and local work-item ID

(source:Khronos) Work-Group (executed on a compute unit) (executed on a

Work-Item

processing element)

**NDRange** 



#### **Memory Model**

- **Host memory** is accessible only to the host
- Data should be transferred from the host to the global memory of the device
- Constant memory is a read-only memory for the device
- Local memory belongs to a particular work-group
- **Private memory** belongs to a work-item





## Why OpenCL?

#### Advantages:

- C-like programming
- $\circ$  Explicit parallelism  $\rightarrow$  you code parallel kernels
- Support for I/O controllers (e.g., memory PCIe, DMA)
  - HDL designers need to design everything from scratch
- Require less hardware knowledge as OpenCL is more abstract
- o Compatible and Re-usable on different type of FPGA
- Design methodologies using C languages are more efficient than HDL-based ones

#### Drawbacks:

- Synthesis is time-consuming
- No control on the hardware architecture
- Cannot design a specific clock frequency
- Difficult to control resource utilization



#### **OpenCL** paradigm

- Two programming sides:
  - Kernel code (\*.cl) translated by Intel Offline Compiler
  - Host code (\*.c, \*.cpp) compiled with host compiler:
    - Intel
    - GCC
    - Etc...
- Board support package (BSP) contains:
  - o logic and memory information, and also
  - I/O controllers such as DDR3 controller
  - o PCI controller, etc
- BSP provided by vendors
  - Can provide your own BSP
  - Need a high-level of expertise
  - o Intel provides <u>documentation</u> for it



(source:OpenCL-ready High Speed FPGA Network for Reconfigurable High Performance Computing)



#### OpenCL parallelism

- Parallelism depends on the platform
  - FPGA ≠GPU
  - GPU starts multiple threads in //
  - FPGA intensively use pipelining
- Parallelism is set explicitly by the developer:
  - Task parallelism is obtained using the queues and event coordination
  - Data parallelism (a.k.a SIMD) is the simultaneous execution of parallel work-items (threads) on the same function across the elements of a dataset
  - Loop pipeline parallelism is achieved when the offline compiler analyzes dependencies between iterations of a loop and is able to pipeline each iteration for acceleration
- Data management
  - Explicit
  - Managed by the programmer
  - Up to the programmer to check memory and bandwidth efficiency



## Compilation flow -- Intel FPGA SDK for OpenCL

- Kernels are compiled using an offline compilers (AOC for Intel)
- AOC executes the following tasks:
  - Parsing the OpenCL kernel source code
  - o Circuit performance analysis
  - Synthesis or hardware compilation

- Kernels are first translated to a aoco object file
  - representing the hardware system

- A aocx executable file is finally created
  - use to program the FPGA



(source:<u>Intel</u>)



## FPGA OpenCL kernels -- configuration time

- Hardware synthesis can be very long
- Emulation is a practical way of testing your OpenCL kernels

| Design properties                                                                         | Estimated time | Estimated memory |  |
|-------------------------------------------------------------------------------------------|----------------|------------------|--|
| Low resource utilization (<10% in Kernel System)                                          |                |                  |  |
| Simple memory interface (Global interconnect for < 10 global loads + stores)              | 2-4h           | 45 GB            |  |
| Loops with low to medium latency (<500 cycles)                                            |                |                  |  |
| Medium resource utilization (<40% ALUTs and FFs, and <60% RAMs and DSPs in Kernel System) |                |                  |  |
| Simple to medium memory interface (Global interconnect for < 20 global loads + stores)    | 8-12h          | 60-90 GB         |  |
| Loops with low to medium latency (<500 cycles)                                            |                |                  |  |
| High resource utilization (>50% ALUTs and FFs, or >70% RAMs and DSPs in Kernel System)    |                |                  |  |
| Simple to medium memory interface (Global interconnect for < 20 global loads + stores)    | 12-20h         | 90-120 GB        |  |
| Loops with low to medium latency (<500 cycles)                                            |                |                  |  |
| Any resource utilization                                                                  |                |                  |  |
| Simple to medium memory interface (Global interconnect for > 100 global loads + stores)   | 30-60h         | 120+ GB          |  |
| or Loops with high to very high latency (>2000 cycles)                                    |                |                  |  |



(source: <u>bPC++ book</u>) (source: <u>DPC++ book</u>)



## Why using FPGA OpenCL kernels

- While the long compilation time of FPGA designs is a genuine concern and can be a barrier in some contexts
- FPGAs offer in terms of customization, power efficiency, flexibility, and more may justify this trade-off in many situations.
- Once FPGA image has been compiled:
  - High Throughput for Fixed Functions that can be specialized and doesn't change frequently
  - Low Latency as they can process data in parallel without the overhead of a general-purpose processor.
  - Customization for hardware design tailored to specific tasks
  - Power efficiency due to less generated hardware

#### **Compilation kernels**

- aoc -list-boards
  - List available boards within the current package

```
[u100057@mel3009 first_code]$ aoc -list-boards
Board list:
    p520_hpc_m210h_g3x16 (default)
    Board Package: /apps/USE/easybuild/staging/2022.1/software/520nmx/20.4
    Memories:    HBM0, HBM1, HBM2, HBM3, HBM4, HBM5, HBM6, HBM7, HBM8, HBM9, HBM10, HBM11, HBM12, HBM1
3, HBM14, HBM15, HBM16, HBM17, HBM18, HBM19, HBM20, HBM21, HBM22, HBM23, HBM24, HBM25, HBM26, HBM27, HBM2
8, HBM29, HBM30, HBM31

p520_max_m210h_g3x16
    Board Package: /apps/USE/easybuild/staging/2022.1/software/520nmx/20.4
    Memories:    HBM0, HBM1, HBM1, HBM2, HBM4, HBM5, HBM6, HBM7, HBM8, HBM9, HBM10, HBM11, HBM12, HBM1
3, HBM14, HBM15, HBM16, HBM17, HBM18, HBM19, HBM20, HBM21, HBM22, HBM23, HBM24, HBM25, HBM26, HBM27, HBM2
8, HBM29, HBM30, HBM31
    Channels:    kernel_input_ch0, kernel_output_ch0, kernel_input_ch1, kernel_output_ch1, kernel_output_ch2, kernel_output_ch3, kernel_output_ch3, kernel_output_ch3, kernel_output_ch3,
```

- aoc -board=<board> <kernel file>
  - Compile the kernel for a specific board
  - Generate the kernel hardware system
  - Call Intel Quartus Prime software to create the aocx file

**AOC** 



kernel.aocx





[u100057@mel3009 first\_code]\$ aoc -board=p520\_hpc\_m210h\_g3x16 first\_kernel.cl

(source: Intel)



#### Compilation outputs files

- <kernel file>.aoco
  - List available boards within the current package
- <kernel file>.aocx
  - Kernel executable file to program FPGA
- <kernel file> folder
  - o <kernel file> folder/reports/report.html
    - Interactive HTML report
    - Static report showing optimization and architectural information
  - < <kernel file>.log
    - Kernel compilation log
  - < <kernel file> folder/{\*.tcl,\*.v,\*.qsf,\*.qsys, etc ...}
    - Numerous intermediate files
    - Generated by Intel Quartus Prime

```
[u100057@mel3009 first_code]$ ls first_kernel
acds_version_rom.hex
                         first_kernel_sys.v
                                                      quartuserr.tmp
acds version rom.mif
                         first kernel.v
                                                      quartus.ini
adjust_floorplan.py
                         flat.gsf
                                                      quartus_sh_compile.log
automigration.rpt
                         hbm bottom
                                                      quartus version.id
base.aocx
                         hbm bottom.gsys
                         hbm_logic_lock.qsf
                                                      root_partition.garlog
base_bak.gar
base_bak.qarlog
                         hbm top
                                                      root partition.qdb
base.kernel.pmsf
                         hbm top.qsys
                                                      scripts
base.pof
                         hw iface.iipx
                                                      sw iface.iipx
base.gar
                         iface.ipx
                                                      sys description.hex
base.garlog
                                                      sys_description.legend.txt
base. adb
                         ip_include.tcl
                                                      sys_description.txt
base.qsf
                         kernel hdl
                                                      tmp-clearbox
base.sdc
                         kernel_pll_refclk_freq.txt top.fit.finalize.rpt
base.sof
                         kernel_report.tcl
                                                      top.fit.place.rpt
base.static.msf
                         kernel system.gip
                                                      top.fit.plan.rpt
                         kernel_system.v
 board
                                                      top.fit.retime.rpt
board.asvs
                         llc.err
                                                      top.fit.route.rpt
board spec.xml
                         opencl bsp ip.qsf
                                                      top.flow.rpt
compiler_metrics.out
                         opencl.ipx
                                                      top.pin
compile script.tcl
                                                     top_post.sdc
                         out_directory_tmp.txt.tmp
control_aer.sh
                         pr_base.id
                                                      top.qpf
cra ring rom.params
                         pr_region_logic_lock.qsf
                                                      top.asf
device opn.tcl
                         pwrmat.asf
                                                      top.sdc
device.tcl
                         gar_info.json
                                                      top.syn.rpt
first_kernel.bc.xml
                         adb
                                                      top.syn.summary
first kernel.log
                         gdb.gar
                                                      top.v
first kernel sys hw.tcl gdb.garlog
```

(source: Intel)



#### **Kernel as Custom Hardware**

- Each kernel become a Compute Unit (CU)
- Some element have been precompiled
  - Provided by the BSP
  - o Ex: Memory Controller, I/O controller, etc ...
- C operation in the kernel are converted to circuits:
  - Use HDL existing library and ip cores
  - Create LOAD/STORE units for read/write operations
  - Connections of all elements to follow the dataflow
  - Elements or full circuit can be replicated multiple times
- Memory:
  - $\circ$  Global  $\rightarrow$  DDR, QDR
  - $\circ$  Local  $\rightarrow$  On-chip memory
  - $\circ$  Private  $\rightarrow$  BRAM, register

```
__kernel void vectadd (...){
    int xid = get_global_id(0);
    c[xid] = a[xid] + b[xid];
}
```





#### **Pipelining Parallelism**

- FPGA Parallelism is clearly very different from GPU
- GPU are better for data parallelism
  - Independent tasks with almost no dependencies
- FPGA takes advantage of pipelining parallelism
  - Create a deep pipeline of the kernel
  - Stages can be executed concurrently by work-items
- Two main approaches for FPGA:
  - NDRange kernel is executed by multiple work-items
  - Single work-item kernel where loop-iterations are computed in different pipeline stages
- Use **NDRange**:
  - when there are no data sharing between work-items
  - when kernel will be used on GPU and FPGA platforms
- Use **single-work item**:
  - when you have data dependencies
  - When you want to port CPU code to FPGA



(source:OpenCL-Based Design of an FPGA Accelerator for Phase-Based Correspondence Matching)



#### NDRange to FPGA

- Replicating hardware for each work-item is suboptimal
- Need to take into account that FPGA are efficient for pipeline
- Why:
  - FPGA are different from GPU (lots of thread started at the same time)
  - o Impossible to replicate a million time of kernel for a FPGA card
  - This is wasteful as you can be sure that all stages of all pipelines won't be busy
  - How many work-items do you finally need?

|               |               |     | FPGA | <b>Operati</b> | on Execu | ution Par | allelism |     |    |
|---------------|---------------|-----|------|----------------|----------|-----------|----------|-----|----|
| Op1           | Op2           | Op3 | Op4  | Op5            | Op6      | Op7       | Op8      | Op9 | Ор |
|               | Op1           | Op2 | Op3  | Op4            | Op5      | Op6       | Op7      | Op8 | Ор |
|               |               | Op1 | Op2  | Op3            | Op4      | Op5       | Op6      | Op7 | Ор |
|               |               |     | Op1  | Op2            | Op3      | Op4       | Op5      | Op6 | Op |
| Deep Pipeline |               |     | Op1  | Op2            | Op3      | Op4       | Op5      | Ор  |    |
| Parallelism   |               |     |      |                | Op1      | Op2       | Op3      | Op4 | Ор |
| Op = Opera    | ation Executi | ion |      |                |          | Op1       | Op2      | Ор3 | Ор |
| ₽ Data        |               |     |      |                |          |           | Op1      | Op2 | Ор |
| →Time         |               |     |      |                |          |           |          | Op1 | Ор |
|               |               |     |      | (cource:Inte   | 21)      |           |          |     |    |

## Spatial Implementation of Operations

Operation 1

Data Input

Operation 2

Operation 3

Operation 4

Operation 5

Operation...

Data Output

(source:Intel)



- Vector addition
- 8 work-items
- Each clock cycle, all parts of the pipeline process different items







- Vector addition
- 8 work-items
- Each clock cycle, all parts of the pipeline process different items







- Vector addition
- 8 work-items
- Each clock cycle, all parts of the pipeline process different items







- Vector addition
- 8 work-items
- Each clock cycle, all parts of the pipeline process different items







- Vector addition
- 8 work-items
- Each clock cycle, all parts of the pipeline process different items







#### **NDRange**

- Useful when you can create a deep pipeline representation
- Each clock you send an new input data which is processed by the pipeline
- Fine-grained parallelism
- A kernel with a thousand of stages will concurrently execute a thousand of work-items
- Best suited for applications with independent loops (no data dependencies)
- Barriers should be used to avoid race conditions and have an additional hardware cost

```
int gid = get_global_id(0);
C[gid] = A[gid] + B[gid];
barrier();
C[N-gid] = C[M-xid] + A[gid]
```



#### Single-Work Item

- Atomic element of a NDrange ⇔ task
- Kernel executing on a compute unit by exactly one work-item



63



#### Single-Work Item

- Implementation of a single-work item is very close to classical C program
- A single-work contains loops which have <u>multiple loop-iterations</u>
- Loop pipelining execute the multiple loop-iterations in different pipeline stages in parallel

```
#define SIZE 1024
__kernel void vectoradd_single_work_item ( __global const int *A, __global const int *B, __global int *C)

{
for(int i=1, i<SIZE; i++)
    C[i] = A[i-1] + B[i];
}
```





## Single-Work Item (trivial example)



Time



#### Single-Work Item

- The offline compiler analyze each iteration of the loop
- It detects any dependencies
- Schedule and launch operations a.s.a.p

```
float array[M]; for (unsigned int i = 0; i < N; i++) {

for (unsigned int j = 0; j < M-1; j++) array[j = array[j+1]; array[j = a[i];

for (unsigned int j = 0; j < M; j++){
answer[j = array[j] + ar
```

(source:Intel)



#### Single-Work Item

#### No Loop Pipelining



No Overlap of Iterations!

#### With Loop Pipelining



Finishes Faster because Iterations
Are Overlapped

(source:Intel) 67





**SYCL: High-level Heterogeneous Parallel Programing** 



## Why SYCL?



OpenCL 1.2

**Becomes** industry baseline for heterogeneous parallel computing

# OpenCL

2013

OpenCL 2.0

**Enables new class** of hardware SVM Generic Addresses On-device dispatch

## SPIR.



2015

OpenCL 2.1 SPIR-V 1.0

SPIR-V in Core Kernel Language Flexibility

#### Industry Open Standard Intermediate Language



OpenCL 2.2 SPIR-V 1.2

OpenCL C++ Kernel Language Static subset of C++14 Templates and Lambdas

> **SPIR-V 1.2** OpenCL C++ support

**Pipes** Efficient device-scope communication between kernels

**Code Generation Optimizations** - Specialization constants at SPIR-V compilation time - Constructors and destructors of program scope global objects
- User callbacks can be set at program release time



(source:khonos.org)



#### What is SYCL?

- High-level C++ abstraction layer for OpenCL
- Full coverage for all OpenCL features
- Interop to enable existing OpenCL code with SYCL
- Single-source compilation
- Automatic scheduling of data movement

#### **Developer Choice**

The development of the two specifications are aligned so code can be easily shared between the two approaches

C++ Kernel Language **Low Level Control** 'GPGPU'-style separation of device-side kernel source code and host code



Single-source C++ **Programmer Familiarity** Approach also taken by C++ AMP and OpenMP



70



#### What is SYCL?



71



#### SYCL code example

```
// Copyright (C) 2020 Intel Corporation
// SPDX-License-Identifier: MIT
#include <sycl.hpp>
#include <iostream>
using namespace sycl;
const std::string secret {"Ifmmp-!xpsme\"\012J(n!tpssz-!Ebwf/!" "J(n!bgsbje!J!dbo(u!ep!uibu/!.!IBM \01"
};
const auto sz = secret.size();
int main() {
 queue Q;
  char *result = malloc shared <char>(sz, Q);
 std::memcpy(result, secret.data(),sz);
 Q.parallel for (sz, [=] (auto& i) {
   result[i] -= 1;
 }).wait();
 std::cout << result << "\n";</pre>
  return 0;
```

 $(\text{source:} \underline{\text{DPC++ book}})$  72



#### **SYCL** resources

- <u>Data Parallel C++</u>: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL
- SYCL academy
- ENCCS <u>Heterogeneous programming with SYCL</u>
- More resources & tutorials on the <u>Khronos website</u>







#### Libraries/Frameworks with SYCL

- SYCL-BLAS An open source implementation of BLAS using the SYCL open standard for acceleration on OpenCL devices
- SYCL-DNN An open source neural network operations library written using the SYCL API
- SYCL-ML An open source C++ library implementing classical machine learning algorithms in SYCL
- SYCL-ParallelSTL An open source Parallel STL implementation
- Tensorflow An implementation of TensorFlow using SYCL



## **SYCL** implementation

- ComputeCpp SYCL v1.2.1 conformant implementation by Codeplay Software
- <u>Intel LLVM SYCL oneAPI DPC++</u> an open source implementation of SYCL that is being contributed to the LLVM project
- hipSYCL an open source implementation of SYCL over NVIDIA CUDA and AMD HIP
- triSYCL an open-source implementation led by Xilinx



#### Intel LLVM SYCL oneAPI DPC++

Let's go back to <a href="https://luxprovide.github.io/oneAPI-FPGA/">https://luxprovide.github.io/oneAPI-FPGA/</a>