Developing SYCL programs for Intel® FPGA cards¶
Anatomy of a SYCL program¶
Data Management¶
In the context of SYCL, Unified Shared Memory (USM) and buffers represent two different ways to handle memory and data management. They offer different levels of abstraction and ease of use, and the choice between them may depend on the specific needs of an application. Here's a breakdown of the differences:
Unified Shared Memory (USM)¶
Unified Shared Memory is a feature that simplifies memory management by providing a shared memory space across the host and various devices, like CPUs, GPUs, and FPGAs. USM provides three different types of allocations:
- Device Allocations: Allocated memory is accessible only by the device.
- Host Allocations: Allocated memory is accessible by the host and can be accessed by devices. However, the allocated memory is stored in the host's global memory.
- Shared Allocations: Allocated memory is accessible by both the host and devices. The allocated memory is present in both global memories and is synchronized between host and device.
USM allows for more straightforward coding, akin to standard C++ memory management, and may lead to code that is easier to write and maintain.
FPGA support
SYCL USM host allocations are only supported by some BSPs, such as the Intel® FPGA Programmable Acceleration Card (PAC) D5005 (previously known as Intel® FPGA Programmable Acceleration Card (PAC) with Intel® Stratix® 10 SX FPGA).
Using SYCL, you can verify if you have access to the different features:
Verify USM capabilities
if (!device.has(sycl::aspect::usm_shared_allocations)) {
  // Try to default to host allocations only
  if (!device.has(sycl::aspect::usm_host_allocations)) {
    // Default to device allocations and explicit data movement
    std::array<int, N> host_array;
    int *my_array = malloc_device<int>(N, Q);
  } else {
    // my_array is located in host memory but transferred to the device as needed
    int *my_array = malloc_host<int>(N, Q);
  }
} else {
  // shared_array is located in both global memories and synchronized automatically
  int *shared_array = malloc_shared<int>(N, Q);
}
That's not all
- Concurrent accesses and atomic modifications are not necessarily available even if you have host and shared capabilities.
- You need to verify aspect::usm_atomic_shared_allocations and aspect::usm_atomic_host_allocations.
Bittware 520N-MX
USM host allocations are not supported by this BSP. We will therefore use explicit data movement.
Explicit USM
- Go to the GettingStarted/fpga_compile/part4_dpcpp_lambda_buffers/src folder
- Replace the original code with explicit USM code
- Verify your code using emulation
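For reference, a minimal sketch of explicit USM data movement, assuming a simple vector addition (the kernel name VectorAdd, the size N, and the emulator queue are illustrative assumptions, not the exercise solution):

#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include <vector>

class VectorAdd;  // illustrative kernel name

int main() {
  constexpr size_t N = 256;
  sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};

  std::vector<int> host_a(N, 1), host_b(N, 2), host_c(N, 0);

  // Explicit USM: allocate device memory and copy data by hand.
  int *dev_a = sycl::malloc_device<int>(N, q);
  int *dev_b = sycl::malloc_device<int>(N, q);
  int *dev_c = sycl::malloc_device<int>(N, q);

  q.memcpy(dev_a, host_a.data(), N * sizeof(int)).wait();
  q.memcpy(dev_b, host_b.data(), N * sizeof(int)).wait();

  q.single_task<VectorAdd>([=]() {
     for (size_t i = 0; i < N; i++) dev_c[i] = dev_a[i] + dev_b[i];
   }).wait();

  // Copy the result back to the host explicitly.
  q.memcpy(host_c.data(), dev_c, N * sizeof(int)).wait();

  sycl::free(dev_a, q);
  sycl::free(dev_b, q);
  sycl::free(dev_c, q);
  return (host_c[0] == 3) ? 0 : 1;
}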
Buffer & accessors¶
Buffers and accessors are key abstractions that enable memory management and data access across various types of devices like CPUs, GPUs, DSPs, etc.
- Buffers: Buffers in SYCL are objects that represent a region of memory accessible by the runtime. They act as containers for data and provide a way to abstract memory management across host and device memories. This allows for efficient data movement and optimization by the runtime, as it can manage data movement between host and device memory transparently.
- Accessors: Accessors provide a way to access the data inside buffers. They define the type of access (read, write, read-write) and can be used within kernels to read from or write to buffers.
Advantage
Through these accessors, the SYCL runtime tracks interactions with the buffers and constructs a dependency graph mapping the relationships between host and device functions. This enables the runtime to automatically orchestrate data transfers and the sequencing of kernel execution.
Using Buffers and Accessors
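A minimal sketch of the buffer/accessor pattern, assuming a simple vector addition (the kernel name VectorAddBuffers and the sizes are illustrative assumptions):

#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include <vector>

class VectorAddBuffers;  // illustrative kernel name

int main() {
  constexpr size_t N = 256;
  sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};

  std::vector<int> a(N, 1), b(N, 2), c(N, 0);

  {
    // Buffers take ownership of the host data for their lifetime.
    sycl::buffer buf_a{a};
    sycl::buffer buf_b{b};
    sycl::buffer buf_c{c};

    q.submit([&](sycl::handler &h) {
      // Accessors declare how the kernel uses each buffer.
      sycl::accessor acc_a{buf_a, h, sycl::read_only};
      sycl::accessor acc_b{buf_b, h, sycl::read_only};
      sycl::accessor acc_c{buf_c, h, sycl::write_only, sycl::no_init};

      h.single_task<VectorAddBuffers>([=]() {
        for (size_t i = 0; i < N; i++) acc_c[i] = acc_a[i] + acc_b[i];
      });
    });
  }  // buf_c goes out of scope: the result is copied back into c

  return (c[0] == 3) ? 0 : 1;
}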
What about memory accesses on FPGAs?
- For FPGAs, the access pattern, access width, and coalescing of memory accesses can significantly affect performance. You might want to make use of various attributes and pragmas specific to your compiler and FPGA to guide the compiler in optimizing memory accesses.
- In order to use Direct Memory Access (DMA), you will need to set up proper data alignment; otherwise the offline compiler will output the following warnings:
Running on device: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
add two vectors of size 256
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1024 bytes from host to device because of lack of alignment
**          host ptr (0xb60b350) and/or dev offset (0x400) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1024 bytes from host to device because of lack of alignment
**          host ptr (0xb611910) and/or dev offset (0x800) is not aligned to 64 bytes
** WARNING: [aclbitt_s10mx_pcie0] NOT using DMA to transfer 1024 bytes from device to host because of lack of alignment
**          host ptr (0xb611d20) and/or dev offset (0xc00) is not aligned to 64 bytes
- For example, you may need to replace standard (unaligned) host allocations with 64-byte-aligned ones, as sketched below:
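A minimal sketch of such a change, assuming std::vector host containers and the Boost aligned allocator that also appears in the matrix-multiplication code further down this page:

#include <vector>
#include <boost/align/aligned_allocator.hpp>

// Unaligned host storage: DMA may be disabled, as in the warnings above.
std::vector<float> vec_unaligned(1024);

// 64-byte-aligned host storage: lets the runtime use DMA for host/device transfers.
template <typename T>
using aligned_vector = std::vector<T, boost::alignment::aligned_allocator<T, 64>>;

aligned_vector<float> vec_aligned(1024);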
Queue¶
Contrary to OpenCL, queues in SYCL are out-of-order by default. Nonetheless, you can change this behavior by declaring it in your code.
In-order-queue
...
queue device_queue{sycl::ext::intel::fpga_selector_v,{property::queue::in_order()}};
// Task A
device_queue.submit([&](handler& h) {
h.single_task<TaskA>([=]() { });
});
// Task B
device_queue.submit([&](handler& h) {
h.single_task<TaskB>([=]() { });
});
// Task C
device_queue.submit([&](handler& h) {
h.single_task<TaskC>([=]() { });
});
...
graph TD
A[TaskA] --> B[TaskB];
B[TaskB] --> C[TaskC];
This behavior is neither very useful nor flexible. Queue objects, by default, are out-of-order queues, except when they're constructed with the in-order queue property. Because of this, they must include mechanisms to order the tasks that are sent to them. Queues organize tasks by allowing the user to notify the runtime about the dependencies that exist between them. These dependencies can be described in two ways: either explicitly or implicitly, through the use of command groups.
A command group is a specific object that outlines a task and its dependencies. These groups are generally expressed as C++ lambdas and are handed over as arguments to the submit() method within a queue object. The single parameter within this lambda is a reference to a handler object, utilized inside the command group to define actions, generate accessors, and outline dependencies.
Explicit dependencies¶
As with OpenCL, you can manage dependencies explicitly using events.
Using events
...
queue device_queue{sycl::ext::intel::fpga_selector_v};
// Task A
auto event_A = device_queue.submit([&](handler &h) {
h.single_task<TaskA>([=]() { });
});
event_A.wait();
// Task B
auto event_B = device_queue.submit([&](handler &h) {
h.single_task<TaskB>([=]() { });
});
// Task C
auto event_C = device_queue.submit([&](handler &h) {
h.single_task<TaskC>([=]() { });
});
// Task D
device_queue.submit([&](handler &h) {
h.depends_on({event_B, event_C});
h.parallel_for(N, [=](id<1> i) { /*...*/ });
}).wait();
...
graph TD
A[TaskA] --> B[TaskB];
A[TaskA] --> C[TaskC];
B[TaskB] --> D[TaskD];
C[TaskC] --> D[TaskD];
- Explicit dependencies using events are relevant when you use USM, since buffers make use of accessors to model data dependencies.
- There are three ways to declare a dependency explicitly:
  - Calling the wait() method on the queue itself
  - Calling the wait() method on the event returned by the queue after submitting a command
  - Calling the depends_on() method of the handler object
Implicit dependencies¶
- Implicit dependencies occur when you are using buffers & accessors.
- Accessors have different access modes:
  - read_only: the content of the buffer can only be accessed for reading, so it will be copied to the device only once.
  - write_only: the content of the buffer can only be accessed for writing. The content of the buffer is still copied from host to device before the kernel starts.
  - read_write: the content of the buffer can be accessed for reading and writing.

You can add the no_init property to an accessor in write_only mode. This tells the runtime that the original data contained in the buffer can be ignored and does not need to be copied from host to device.

Implicit dependencies follow three main patterns (see the DPC++ book):

- Read-after-Write (RAW): occurs when data modified by a kernel must be read by another kernel.
- Write-after-Read (WAR): occurs when data read by a kernel will be modified by another one.
- Write-after-Write (WAW): occurs when two kernels modify the same data.
Implicit dependencies
- By default, without an access mode, each accessor will be read_write, inducing unnecessary copies.
- Note also the first use of host_accessor. Why do we use it here?
- Modify the following code to take implicit dependencies into account.

A host_accessor obtains access to a buffer on the host side and waits for the device kernels to finish executing in order to produce the data.
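A minimal sketch of implicit dependencies between two kernels followed by a host_accessor, assuming illustrative kernel names ProducerKernel and ConsumerKernel:

#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include <vector>

class ProducerKernel;  // illustrative names
class ConsumerKernel;

int main() {
  constexpr size_t N = 256;
  sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};

  std::vector<int> data(N, 0);
  sycl::buffer buf{data};

  // Kernel 1 writes the buffer: write_only + no_init avoids the initial copy.
  q.submit([&](sycl::handler &h) {
    sycl::accessor out{buf, h, sycl::write_only, sycl::no_init};
    h.single_task<ProducerKernel>([=]() {
      for (size_t i = 0; i < N; i++) out[i] = static_cast<int>(i);
    });
  });

  // Kernel 2 reads and updates the buffer: a Read-after-Write dependency is
  // deduced automatically from the accessors, so it runs after Kernel 1.
  q.submit([&](sycl::handler &h) {
    sycl::accessor inout{buf, h, sycl::read_write};
    h.single_task<ConsumerKernel>([=]() {
      for (size_t i = 0; i < N; i++) inout[i] *= 2;
    });
  });

  // The host_accessor waits for both kernels before exposing the data on the host.
  sycl::host_accessor result{buf, sycl::read_only};
  return (result[N - 1] == 2 * (static_cast<int>(N) - 1)) ? 0 : 1;
}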
Parallelism model for FPGA¶
Vectorization
Vectorization is not the main source of parallelism on FPGA, but it helps in designing efficient pipelines. Since the hardware can be reconfigured at will, the offline compiler can build N-bit adders and multipliers, which greatly simplifies vectorization. In fact, the offline compiler vectorizes your design automatically whenever possible.
Pipelining with ND-range kernels¶
- ND-range kernels are based on a hierarchical grouping of work-items
- A work-item represents a single unit of work
- Independent, simple units of work don't communicate or share data very often
- Useful when porting a GPU kernel to FPGA
- FPGAs are different from GPUs (where lots of threads are started at the same time)
- It is impossible to replicate the hardware for a million work-items
- Instead, work-items are injected into the pipeline
- A deep pipeline means lots of work-items executing different tasks in parallel
- In order to write a basic data-parallel kernel, you will need to use the parallel_for method. Below is an example of a simple data-parallel kernel; as you can notice, there is no notion of work-groups or sub-groups.
Matrix addition
Vector addition
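A minimal sketch of such a data-parallel vector addition, assuming buffers and an illustrative kernel name VectorAddParallel:

#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include <vector>

class VectorAddParallel;  // illustrative kernel name

int main() {
  constexpr size_t N = 256;
  sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};

  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

  {
    sycl::buffer buf_a{a};
    sycl::buffer buf_b{b};
    sycl::buffer buf_c{c};

    q.submit([&](sycl::handler &h) {
      sycl::accessor acc_a{buf_a, h, sycl::read_only};
      sycl::accessor acc_b{buf_b, h, sycl::read_only};
      sycl::accessor acc_c{buf_c, h, sycl::write_only, sycl::no_init};

      // Simple data-parallel kernel: one work-item per element,
      // no work-groups or sub-groups involved.
      h.parallel_for<VectorAddParallel>(sycl::range<1>{N}, [=](sycl::id<1> i) {
        acc_c[i] = acc_a[i] + acc_b[i];
      });
    });
  }

  return (c[0] == 3.0f) ? 0 : 1;
}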
- Go to the GettingStarted/fpga_compile/part4_dpcpp_lambda_buffers/src folder
- Adapt the vector_add.cpp single-task kernel to a basic data-parallel kernel
- Emulate to verify your design
- If you want fine-grained control of your data-parallel kernel, ND-range data-parallel kernels are the equivalent of ND-range kernels in OpenCL.
ND-range kernel in SYCL
nd_range(range<dimensions> globalSize, range<dimensions> localSize);
- ND-range kernels are defined with two range objects
- global representing the total size of work-items
- local representing the size of work-groups
Tiled Matrix Multiplication
- Fill in the blanks and complete the code below
#include <iostream>
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// oneAPI headers
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include <boost/align/aligned_allocator.hpp>

// Forward declare the kernel name in the global scope. This is an FPGA best
// practice that reduces name mangling in the optimization reports.
class MatMultKernel;

int main() {
  bool passed = true;
  try {
    // Use compile-time macros to select either:
    //  - the FPGA emulator device (CPU emulation of the FPGA)
    //  - the FPGA device (a real FPGA)
    //  - the simulator device
#if FPGA_SIMULATOR
    auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
    auto selector = sycl::ext::intel::fpga_selector_v;
#else  // #if FPGA_EMULATOR
    auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#endif

    // create the device queue
    sycl::queue q(selector);

    // make sure the device supports USM host allocations
    auto device = q.get_device();

    std::cout << "Running on device: "
              << device.get_info<sycl::info::device::name>().c_str()
              << std::endl;

    // initialize input and output memory on the host
    constexpr size_t N = 512;
    constexpr size_t B = 16;

    std::vector<float, boost::alignment::aligned_allocator<float, 64>> mat_a(N * N);
    std::vector<float, boost::alignment::aligned_allocator<float, 64>> mat_b(N * N);
    std::vector<float, boost::alignment::aligned_allocator<float, 64>> mat_c(N * N);

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<float> dist(0.0, 1.0);

    // Generate random values
    std::generate(mat_a.begin(), mat_a.end(), [&dist, &mt]() { return dist(mt); });
    // Generate random values
    std::generate(mat_b.begin(), mat_b.end(), [&dist, &mt]() { return dist(mt); });
    // fill with zero
    std::fill(mat_c.begin(), mat_c.end(), 0.0);

    std::cout << "Matrix multiplication A X B = C " << std::endl;

    {
      // copy the input arrays to buffers to share with kernel
      // We can access the buffer using mat[i][j]
      sycl::buffer<float, 2> buffer_a{mat_a.data(), sycl::range<2>(N, N)};
      sycl::buffer<float, 2> buffer_b{mat_b.data(), sycl::range<2>(N, N)};
      sycl::buffer<float, 2> buffer_c{mat_c.data(), sycl::range<2>(N, N)};

      /* DEFINE HERE the global size and local size ranges */

      q.submit([&](sycl::handler &h) {
        // use accessors to interact with buffers from device code
        sycl::accessor accessor_a{buffer_a, h, sycl::read_only};
        sycl::accessor accessor_b{buffer_b, h, sycl::read_only};
        sycl::accessor accessor_c{buffer_c, h, sycl::read_write, sycl::no_init};

        sycl::local_accessor<float, 2> tileA{{B, B}, h};
        sycl::local_accessor<float, 2> tileB{{B, B}, h};

        h.parallel_for<MatMultKernel>(
            sycl::nd_range{global, local},
            [=](sycl::nd_item<2> item) [[intel::max_work_group_size(1, B, B)]] {
          // Indices in the global index space:
          int m = item.get_global_id()[0];
          int n = item.get_global_id()[1];

          // Index in the local index space:
          // Provide local indexes i and j -- fill here

          float sum = 0;
          for (int p = 0; p < N / B; p++) {
            // Load the matrix tile from matrix A, and synchronize
            // to ensure all work-items have a consistent view
            // of the matrix tile in local memory.
            tileA[i][j] = accessor_a[m][p * B + j];
            // Do the same for tileB
            // fill here
            item.barrier();

            // Perform computation using the local memory tile, and
            // matrix B in global memory.
            for (int kk = 0; kk < B; kk++) {
              sum += tileA[i][kk] * tileB[kk][j];
            }

            // After computation, synchronize again, to ensure all
            // Fill here
          }
          // Write the final result to global memory.
          accessor_c[m][n] = sum;
        });
      });
    }
    // result is copied back to host automatically when accessors go out of
    // scope.

    // verify that Matrix multiplication is correct
    for (int i = 0; i < N; i++) {
      for (int j = 0; j < N; j++) {
        float true_val = 0.0;
        for (int k = 0; k < N; k++) {
          true_val += mat_a[i * N + k] * mat_b[k * N + j];
        }
        if (std::abs(true_val - mat_c[i * N + j]) / true_val > 1.0e-4) {
          std::cout << "C[" << i << ";" << j << "] = " << mat_c[i * N + j]
                    << " expected = " << true_val << std::endl;
          passed = false;
        }
      }
    }
    std::cout << (passed ? "PASSED" : "FAILED") << std::endl;
  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code.
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }
  return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
Warning on work-items group size
- If the attribute [[intel::max_work_group_size(Z, Y, X)]] is not specified in your kernel, the work-group size assumes a default value depending on compile-time and runtime constraints
- If your kernel contains a barrier, the Intel® oneAPI DPC++/C++ Compiler sets a default maximum scalarized work-group size of 128 work-items. Without this attribute, the previous ND-range kernel would therefore have failed, since we use a local work-group size of B x B = 256 work-items
Pipelining with single-work item (loop)¶
- When your code can't be decomposed into independent units of work, you can rely on loop parallelism on FPGA
- In such a situation, the pipeline inputs are not work-items but loop iterations
- For single-work-item kernels, the developer does not need to do anything special to preserve data dependencies
- Communication between kernels is also much easier
- FPGAs can efficiently handle loop execution, often maintaining a fully occupied pipeline or providing reports on what changes are necessary to enhance occupancy
- It's evident that if loop iterations were substituted with work-items, where the value created by one work-item would have to be transferred to another for incremental computation, the algorithm's description would become far more complex
Single-work item creation
- Replace the parallel_for method by the single_task method defined in the handler class to create a single-work-item kernel
- The source file vector_add.cpp from GettingStarted/fpga_compile/part4_dpcpp_lambda_buffers/src uses loop pipelining; a minimal sketch of the same pattern is given below.
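A minimal sketch of a single-work-item (loop-pipelined) vector addition, assuming buffers and an illustrative kernel name VectorAddSingleTask:

#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include <vector>

class VectorAddSingleTask;  // illustrative kernel name

int main() {
  constexpr size_t N = 256;
  sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};

  std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

  {
    sycl::buffer buf_a{a};
    sycl::buffer buf_b{b};
    sycl::buffer buf_c{c};

    q.submit([&](sycl::handler &h) {
      sycl::accessor acc_a{buf_a, h, sycl::read_only};
      sycl::accessor acc_b{buf_b, h, sycl::read_only};
      sycl::accessor acc_c{buf_c, h, sycl::write_only, sycl::no_init};

      // Single-work-item kernel: the loop iterations are pipelined by the
      // offline compiler, with a new iteration entering the pipeline whenever
      // no dependency prevents it.
      h.single_task<VectorAddSingleTask>([=]() {
        for (size_t i = 0; i < N; i++) {
          acc_c[i] = acc_a[i] + acc_b[i];
        }
      });
    });
  }

  return (c[N - 1] == 3.0f) ? 0 : 1;
}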
Task parallelism¶
Pipes function as a first-come, first-served buffer system, linking different parts of a design. The Intel® oneAPI DPC++/C++ Compiler offers various pipe types:
- Host Pipes: These establish a connection between a host and a device.
- Inter-Kernel Pipes: These facilitate efficient and low-latency data transfer and synchronization between kernels. They enable kernels to interact directly using on-device FIFO buffers, which utilize FPGA memory. The Intel® oneAPI DPC++/C++ Compiler promotes simultaneous kernel operation. By employing inter-kernel pipes for data movement among concurrently running kernels, data can be transferred without waiting for a kernel to finish, enhancing your design's throughput.
- I/O Pipes: This is a one-way connection to the hardware, either as a source or sink, which can be linked to an FPGA board's input or output functionalities. Such functionalities could encompass network interfaces, PCIe®, cameras, or other data acquisition or processing tools and protocols.
Inter-Kernel Pipes¶
- We will only focus on Inter-Kernel Pipes to leverage task parallelism
- As in OpenCL programming, pipes can be blocking or non-blocking
- For Intel® oneAPI with FPGA, you need to include the FPGA extensions header (sycl/ext/intel/fpga_extensions.hpp):
Pipe creation and usage
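A minimal sketch of pipe creation and usage with a blocking inter-kernel pipe connecting a producer and a consumer kernel (the pipe identifier, capacity, and kernel names are illustrative assumptions):

#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include <vector>

class ProducerTask;  // illustrative names
class ConsumerTask;
class PipeID;

// Inter-kernel pipe: element type int, minimum capacity of 8 elements.
using MyPipe = sycl::ext::intel::pipe<PipeID, int, 8>;

int main() {
  constexpr int N = 256;
  sycl::queue q{sycl::ext::intel::fpga_emulator_selector_v};

  std::vector<int> out(N, 0);
  {
    sycl::buffer buf_out{out};

    // Producer: writes values into the pipe (blocking write).
    q.submit([&](sycl::handler &h) {
      h.single_task<ProducerTask>([=]() {
        for (int i = 0; i < N; i++) MyPipe::write(i);
      });
    });

    // Consumer: reads values from the pipe (blocking read).
    q.submit([&](sycl::handler &h) {
      sycl::accessor acc_out{buf_out, h, sycl::write_only, sycl::no_init};
      h.single_task<ConsumerTask>([=]() {
        for (int i = 0; i < N; i++) acc_out[i] = MyPipe::read() * 2;
      });
    });
  }

  return (out[1] == 2) ? 0 : 1;
}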
Stalling pipes
- Care should be taken when implementing pipes, especially when there is a strong imbalance between the consumer kernel reading from the pipe and the producer kernel that feeds it.
- Stalling pipes can be disastrous when using blocking pipes
Multiple Homogeneous FPGA Devices¶
- Each of Meluxina's FPGA nodes has two FPGA cards
- You can verify their presence using the aocl list-devices or sycl-ls commands
- Different kernels or the same kernel can be offloaded to these devices
- Each device should have its own sycl::queue; the queues may or may not share the same context
- Intel recommends using a single context for performance reasons, as shown below:
Running on the two FPGA cards
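A minimal sketch of creating one queue per FPGA card within a single shared context (the device enumeration and the unnamed kernel are illustrative assumptions):

#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  // Enumerate all devices exposed by the FPGA platform.
  sycl::platform platform(sycl::ext::intel::fpga_selector_v);
  std::vector<sycl::device> devices = platform.get_devices();

  // A single context shared by all devices (recommended by Intel).
  sycl::context ctx(devices);

  // One queue per FPGA card.
  std::vector<sycl::queue> queues;
  for (const auto &dev : devices) {
    std::cout << "Creating queue on: "
              << dev.get_info<sycl::info::device::name>() << std::endl;
    queues.emplace_back(ctx, dev);
  }

  // The same kernel (or different kernels) can now be submitted to each queue.
  for (auto &q : queues) {
    q.single_task([=]() { /* ... */ });
  }
  for (auto &q : queues) q.wait();

  return 0;
}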
Multiple nodes¶
- Meluxina's FPGA partition contains 20 nodes
- Combining MPI with SYCL allows developers to scale applications across diverse platforms within a distributed computing environment
- Note that MPI cannot be called inside a kernel
- FPGA communication path:
graph LR
A[FPGA 1] -->|PCIe| B[NODE 1];
B[NODE 1] -->|Infiniband| C[NODE 2];
C[NODE 2] -->|PCIe| D[FPGA 2];
MPI Programs Using C++ with SYCL running on multiple FPGAs
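A minimal sketch of the MPI + SYCL pattern behind the output below, where each rank computes a partial sum of the π integral on its local FPGA and the results are reduced on rank 0 (the decomposition, step count, and kernel name are illustrative assumptions):

#include <mpi.h>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>
#include <iostream>

class PartialPiKernel;  // illustrative kernel name

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each rank uses the FPGA attached to its node.
  sycl::queue q{sycl::ext::intel::fpga_selector_v};
  std::cout << "Rank #" << rank << " uses device: "
            << q.get_device().get_info<sycl::info::device::name>() << std::endl;

  // Midpoint-rule integration of 4/(1+x^2) over [0,1]; each rank handles
  // a contiguous chunk of the intervals.
  constexpr long num_steps = 1 << 24;
  const double step = 1.0 / static_cast<double>(num_steps);
  const long chunk = num_steps / size;
  const long begin = rank * chunk;
  const long end = (rank == size - 1) ? num_steps : begin + chunk;

  double partial = 0.0;
  {
    sycl::buffer<double> buf_partial{&partial, sycl::range<1>(1)};
    q.submit([&](sycl::handler &h) {
      sycl::accessor acc{buf_partial, h, sycl::write_only, sycl::no_init};
      h.single_task<PartialPiKernel>([=]() {
        double sum = 0.0;
        for (long i = begin; i < end; i++) {
          double x = (i + 0.5) * step;
          sum += 4.0 / (1.0 + x * x);
        }
        acc[0] = sum * step;
      });
    });
  }  // buffer destruction copies the partial result back to the host

  // MPI is only called on the host, never inside a kernel.
  double pi = 0.0;
  MPI_Reduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) std::cout << "mpi native: PI = " << pi << std::endl;

  MPI_Finalize();
  return 0;
}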
Rank #3 runs on: mel3014, uses device: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
Rank #0 runs on: mel3001, uses device: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
Rank #4 runs on: mel3017, uses device: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
Rank #2 runs on: mel3013, uses device: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
Rank #1 runs on: mel3010, uses device: p520_hpc_m210h_g3x16 : BittWare Stratix 10 MX OpenCL platform (aclbitt_s10mx_pcie0)
mpi native: PI =3.141593654
Elapsed time is 9.703053059