# PRUDA : A real-time programming interface on top of CUDA
PRUDA is a set of programming tools and mechanisms to control
scheduling within the GPU. It also provides implementations of the
following scheduling policies:
- Fixed-priority scheduling, both non-preemptive and preemptive
- Earliest Deadline First (EDF) scheduling, both preemptive and non-preemptive
- Gang scheduling, where the GPU is considered as a multiprocessor architecture

Details about each scheduling policy are given in a dedicated
section. First, we describe the PRUDA functionalities and data
structures. We also show how a scheduling policy can be easily
implemented with PRUDA.
## Prerequisites
- Programming with CUDA
- Basic knowledge about real-time systems

PRUDA is a platform built on top of CUDA for real-time systems;
the rest of this document therefore assumes basic familiarity with both.
## The GPU in the eyes of PRUDA
A GPU is composed of one or several streaming multiprocessors (SMs)
and one or several copy engines (CEs). Streaming multiprocessors
perform the computations (kernels), whereas copy engines execute
memory copy operations between the different memory spaces.
Programming the GPU requires dividing parallel computations into
several grids, and each grid into several blocks. A block is a set of
multiple threads. A GPU can be programmed using generic platforms
such as OpenCL or proprietary APIs. We use CUDA, an NVIDIA
proprietary platform, to have tight control over SMs and CEs in the
C/C++ programming language, using the NVIDIA compiler *nvcc*.
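
As a brief illustration of this hierarchy (a hypothetical `scale`
kernel, not part of PRUDA), each thread derives a global index from
its block and thread coordinates:

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel (not part of PRUDA): each thread scales
// one array element; its global index comes from the grid/block/thread
// hierarchy described above.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void) {
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    // Launch a grid of 4 blocks, each holding 256 threads (4 * 256 = n).
    scale<<<4, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```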
From the PRUDA perspective, the GPU is a set of copy engines and one
or more processors: each SM can be considered as a processor, or both
SMs as a single processor. PRUDA manages memory copies between the
CPU and the GPU, as well as the pool of submitted kernels, to make
its scheduling decisions.
When a kernel is invoked by CPU code, it submits commands to the
GPU. How and when these commands are consumed is hidden by the GPU
manufacturers for intellectual-property reasons. PRUDA has been
tested on the NVIDIA Jetson TX2, which is composed of 6 ARM-based CPU
cores along with an integrated NVIDIA Pascal-based GPU. The GPU in
the TX2 is composed of 256 CUDA cores, divided into two SMs, and one
copy engine. The CPUs and the GPU share the same memory module. From
a programming perspective, one may allocate two separate memory
spaces for the CPU and the GPU, using the `malloc` and `cudaMalloc`
primitives respectively. Alternatively, the programmer may use a
memory space logically visible to both the CPU and the GPU, called
CUDA unified memory (available even for discrete GPUs); no memory
copies are then needed between CPU and GPU tasks for memory spaces
(buffers) allocated with the `cudaMallocManaged` primitive. PRUDA
supports both approaches by enabling and disabling automatic memory
copy operations.
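
As a minimal sketch of the unified-memory style (plain CUDA,
independent of PRUDA's automatic copy handling), a buffer allocated
with `cudaMallocManaged` is accessed by both sides without explicit
copies:

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel: increments each element in place.
__global__ void inc(int *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i]++;
}

int main(void) {
    const int n = 256;
    // Unified memory: one buffer logically visible to CPU and GPU,
    // so no explicit cudaMemcpy is needed.
    int *v;
    cudaMallocManaged(&v, n * sizeof(int));
    for (int i = 0; i < n; i++) v[i] = i;   // CPU writes directly
    inc<<<1, n>>>(v, n);                    // GPU updates the same buffer
    cudaDeviceSynchronize();                // wait before the CPU reads results
    cudaFree(v);
    return 0;
}
```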
Typical CUDA programs are organized in the same way. First, memory
allocation operations are performed on both the CPU and the GPU.
Then, memory copies are issued from the CPU to the GPU. Next, the GPU
kernel is launched, and finally the results are copied back to the
CPU by memory copy operations. `cudaMalloc` is a costly operation;
therefore, in PRUDA, it must be performed by the programmer outside
of the real-time task processing.
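
The sketch below illustrates this organization, under the assumption
of a hypothetical periodic job (`setup()` and `job()` are
illustrative names, not PRUDA primitives): the costly `cudaMalloc`
calls happen once at setup time, outside the real-time path:

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel standing in for the task's computation.
__global__ void work(float *in, float *out, int n) { /* ... */ }

static float *d_in, *d_out;            // device buffers, allocated once
static float h_in[1024], h_out[1024];  // host buffers

void setup(void) {
    // Costly allocations done once, outside real-time processing.
    cudaMalloc(&d_in,  sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
}

void job(void) {
    // Real-time job body: copy in, launch the kernel, copy back.
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    work<<<4, 256>>>(d_in, d_out, 1024);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
}
```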
All threads of a given block are executed by one and only one SM;
however, different blocks of the same kernel may be executed on
different SMs. For example, one kernel may execute on both SM0 and
SM1, while another executes only on SM0. The kernel execution order
and dispatching mechanisms are driven by internal, closed-source
NVIDIA drivers (in our case of study). A PRUDA user may get the SM
where a given block/thread is executing using the `pruda_get_sm()`
primitive. PRUDA also allows enforcing the allocation of a given
kernel to a specific SM by using the PRUDA primitive
`pruda_allocate_to_sm(int sm_id)`, where `sm_id` is the id of the
target streaming multiprocessor. Implementation details about these
primitives can be found in the PRUDA description section.
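
This README does not specify how these primitives are implemented; a
plausible sketch, on NVIDIA GPUs, reads the executing SM id from the
`%smid` PTX special register via inline assembly (an assumption about
the mechanism, not a confirmed excerpt of PRUDA):

```cuda
// Sketch (assumption): reading the current SM id from device code
// through the %smid PTX special register; a pruda_get_sm()-like
// primitive could be built on top of this.
__device__ unsigned int get_smid(void) {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Hypothetical use: only blocks running on the target SM do the work,
// which is one way to pin a kernel's execution to a single SM.
__global__ void kernel_on_sm(int target_sm) {
    if (get_smid() != (unsigned int)target_sm)
        return;   // blocks placed on other SMs exit immediately
    // ... actual computation ...
}
```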
To enforce an execution order between different kernels, we use a
specific data structure called a CUDA stream. A CUDA stream has a
FIFO behavior: kernels submitted to the same CUDA stream are executed
one after the other, in a **sequential** fashion, so synchronization
between two consecutive kernels is implicitly achieved. This property
will be used later to implement the non-preemptive EDF and
fixed-priority real-time scheduling policies.
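
As a brief illustration (plain CUDA, with hypothetical
`producer`/`consumer` kernels), two kernels pushed into the same
stream run back to back without any explicit synchronization between
the launches:

```cuda
#include <cuda_runtime.h>

__global__ void producer(float *buf) { /* ... */ }
__global__ void consumer(float *buf) { /* ... */ }

int main(void) {
    float *buf;
    cudaMalloc(&buf, 1024 * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // FIFO behavior: consumer starts only after producer completes,
    // with no explicit synchronization between the two launches.
    producer<<<1, 256, 0, s>>>(buf);
    consumer<<<1, 256, 0, s>>>(buf);

    cudaStreamSynchronize(s);   // wait for both kernels to finish
    cudaStreamDestroy(s);
    cudaFree(buf);
    return 0;
}
```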
In CUDA, the user may define several streams, and priorities may be
set between different streams. Therefore, if a stream `A` has a
higher priority than a stream `B`, all kernels of `A` are meant to
execute before the kernels submitted to `B`. If a kernel in `B` is
executing while a kernel is activated on `A`, the GPU might preempt
the kernel of `B` to execute the kernel of `A`, according to our
benchmarking and depending on the GPU preemption level. We highlight
that fine-grain preemption capabilities are available in NVIDIA GPUs
starting from the Pascal architecture. For example, if preemption
operates at the block level, the preemption takes effect once all
already-executing blocks have finished. Recent Volta GPUs allow even
finer preemption levels. Even if it is possible to create more than
two streams, only two priority levels are available on the Jetson TX2
platform. These properties will be used further on to achieve the
preemptive EDF and fixed-priority scheduling policies.
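
A minimal sketch of priority streams in plain CUDA (kernel names are
hypothetical; how PRUDA wraps this is described later):
`cudaDeviceGetStreamPriorityRange` reports the available range, which
spans only two levels on the Jetson TX2 as noted above:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void low_prio_work(void)  { /* ... */ }
__global__ void high_prio_work(void) { /* ... */ }

int main(void) {
    int least, greatest;   // numerically lower value = higher priority
    cudaDeviceGetStreamPriorityRange(&least, &greatest);
    printf("priorities: least=%d greatest=%d\n", least, greatest);

    cudaStream_t sA, sB;
    cudaStreamCreateWithPriority(&sA, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&sB, cudaStreamNonBlocking, least);

    // A kernel activated on the high-priority stream sA may preempt
    // (at the supported preemption granularity) a kernel running on sB.
    low_prio_work<<<32, 256, 0, sB>>>();
    high_prio_work<<<32, 256, 0, sA>>>();

    cudaDeviceSynchronize();
    cudaStreamDestroy(sA);
    cudaStreamDestroy(sB);
    return 0;
}
```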
Other PRUDA primitives will be detailed later.
## CUDA functionalities
PRUDA allows a kernel to execute within a single SM.
## Single core strategy