# PRUDA: Real-time programming interface on top of CUDA

PRUDA is a set of programming tools and mechanisms to control scheduling within the GPU. It also provides implementations of the following scheduling policies:

- Fixed Priority (FP), preemptive and non-preemptive
- Earliest Deadline First (EDF), preemptive and non-preemptive
- Gang scheduling, where the GPU is considered as a multiprocessor architecture

Details about each scheduling policy are given in a dedicated section. First, we describe the PRUDA functionalities and data structures, and show how a scheduling policy can easily be implemented with PRUDA.

## Prerequisites

- Programming with CUDA
- Basic knowledge of real-time systems

PRUDA is a platform built on top of CUDA for real-time systems; basic familiarity with both is therefore assumed.

## The GPU in the eyes of PRUDA

A GPU is composed of one or several streaming multiprocessors (SMs) and one or several copy engines (CEs). Streaming multiprocessors carry out computations (kernels), whereas copy engines execute memory copy operations between different memory spaces. Programming the GPU requires dividing parallel computations into grids, and each grid into several blocks; a block is a set of multiple threads. A GPU can be programmed using generic platforms such as OpenCL, or using proprietary APIs. We use CUDA, the NVIDIA proprietary platform, to have tight control over SMs and CEs, in the C/C++ programming language and with the NVIDIA compiler nvcc.
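As an illustration, here is a minimal CUDA sketch (ours, not part of PRUDA) of the grid/block/thread decomposition:

```cuda
#include <cuda_runtime.h>

// Each thread handles one element: the computation is divided into a grid
// of blocks, and each block into 256 threads.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // <<<number of blocks, threads per block>>> selects the decomposition.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```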

From the PRUDA perspective, the GPU is a set of copy engines and one or more processors: each SM can be considered as a processor, or all SMs together as a single processor. PRUDA manages memory copies between CPU and GPU, and pools of pending kernels, to make scheduling decisions.

When CPU code invokes a kernel, it submits commands to the GPU. How and when commands are consumed is hidden by the manufacturers for intellectual-property reasons. PRUDA has been tested on the Jetson TX2, which is composed of 6 ARM-based CPU cores along with an integrated NVIDIA Pascal-based GPU. The GPU of the TX2 contains 256 CUDA cores, divided into two SMs, and one copy engine. The CPUs and the GPU share the same memory module. From a programming perspective, one may either allocate two separate memory spaces for the CPU and the GPU, using the `malloc` and `cudaMalloc` primitives respectively, or use a memory space logically visible to both the CPU and the GPU, called CUDA unified memory (available even for discrete GPUs); in the latter case, no memory copies are needed between CPU and GPU tasks for buffers allocated with the `cudaMallocManaged` primitive. PRUDA supports both approaches by enabling or disabling automatic memory copy operations.
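A minimal sketch of the unified-memory path (buffer and kernel names are ours, not PRUDA's):

```cuda
#include <cuda_runtime.h>

__global__ void inc(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main(void) {
    const int n = 1024;
    float *buf;
    // One buffer, logically visible to both CPU and GPU: no cudaMemcpy needed.
    cudaMallocManaged(&buf, n * sizeof(float));
    for (int i = 0; i < n; i++) buf[i] = (float)i;  // CPU writes directly
    inc<<<(n + 255) / 256, 256>>>(buf, n);          // GPU reads/writes the same buffer
    cudaDeviceSynchronize();                        // wait before the CPU reuses buf
    cudaFree(buf);
    return 0;
}
```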

Typical CUDA programs are organized in the same way: first, memory allocation is performed on both the CPU and the GPU; then, input data is copied from the CPU to the GPU; next, the GPU kernel is launched; finally, results are copied back to the CPU. `cudaMalloc` is a costly operation; therefore, in PRUDA, it must be performed by the programmer outside of the real-time task processing, e.g., at initialization time.
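A sketch of a task following this organization, with the costly allocation hoisted to an initialization step (all names are illustrative, not PRUDA API):

```cuda
#include <cuda_runtime.h>

__global__ void job(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

static float *d_buf;                         // device buffer, allocated once

void task_init(int n) {
    cudaMalloc(&d_buf, n * sizeof(float));   // costly: keep out of the real-time path
}

void task_job(float *h_buf, int n) {
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
    job<<<(n + 255) / 256, 256>>>(d_buf, n);
    // The device-to-host copy implicitly waits for the kernel on the default stream.
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
}
```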

All threads of a block are executed by a single SM; however, different blocks of the same kernel may be executed on different SMs. In Figure \ref{fig:sched_jetson}, the green kernel is executed on both SM0 and SM1, while the red kernel is executed only on SM0. The kernel execution order and mechanisms are driven by internal, closed-source NVIDIA drivers (in our case study). A PRUDA user may obtain the SM where a given block/thread is executing using the `pruda_get_sm()` primitive. PRUDA also allows enforcing the allocation of a given kernel to a specific SM using the `pruda_allocate_to_sm(int sm_id)` primitive, where `sm_id` is the id of the target streaming multiprocessor. Implementation details about these primitives can be found in the PRUDA description section.
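As a sketch of what `pruda_get_sm()` can look like, assuming it is built on the `%smid` special register (an assumption on our part; the actual PRUDA implementation is given in its description section):

```cuda
// Assumed sketch: returns the id of the SM executing the calling thread,
// read from the %smid special register (not necessarily PRUDA's actual code).
__device__ unsigned int pruda_get_sm(void) {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
```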

To enforce an execution order between different kernels, we use a specific data structure called a CUDA stream. A CUDA stream has a FIFO behavior: kernels submitted to it are executed one after the other, in a **sequential** fashion, so synchronization between two consecutive kernels is achieved implicitly. This property will be used later to implement the non-preemptive EDF and fixed-priority real-time scheduling policies.
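For example, with the standard CUDA runtime API (kernel names are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void k1(void) { /* first job */ }
__global__ void k2(void) { /* second job */ }

void submit_in_order(void) {
    cudaStream_t s;
    cudaStreamCreate(&s);
    // FIFO behavior: k2 starts only after k1 has completed.
    k1<<<1, 32, 0, s>>>();
    k2<<<1, 32, 0, s>>>();
    cudaStreamSynchronize(s);   // wait for both kernels to finish
    cudaStreamDestroy(s);
}
```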

In CUDA, the user may define several streams, and priorities may be set between them. If a stream `A` has a higher priority than a stream `B`, all kernels of `A` are meant to execute before the kernels submitted to `B`. If a kernel of `B` is executing while a kernel is activated on `A`, the GPU may preempt the kernel of `B` to execute the kernel of `A`, according to our benchmarking, depending on the GPU preemption level. We highlight that fine-grained preemption capabilities are available in NVIDIA GPUs starting from the Pascal architecture. For example, if preemption is set at the block level, it takes effect once all the blocks already executing have finished. Recent Volta GPUs allow even finer preemption levels. Even though it is possible to create more than two streams, only two priority levels are available on the Jetson TX2 platform. These properties will be used further on to implement the preemptive EDF and fixed-priority scheduling policies.
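A sketch of how the two priority levels can be obtained with the standard CUDA runtime API:

```cuda
#include <cuda_runtime.h>

// Create one high-priority and one low-priority stream. On the Jetson TX2
// only two levels are exposed; numerically lower values mean higher priority.
void create_priority_streams(cudaStream_t *high, cudaStream_t *low) {
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);
    cudaStreamCreateWithPriority(high, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(low,  cudaStreamNonBlocking, least);
}
```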

    Other PRUDA primitives will be detailed later.

## CUDA functionalities

PRUDA allows a kernel to execute within a single SM.

## Single core strategy