PRUDA is a set of programming tools and mechanisms to control
scheduling within the GPU. It also provides the implementation of the
following real-time scheduling policies:
- Fixed Priority (FP): preemptive and non-preemptive, single core
- Earliest Deadline First (EDF): preemptive and non-preemptive
- Gang scheduling: preemptive and non-preemptive GANG, using the GPU as a multiprocessor architecture
Additionally, PRUDA aims not to modify the CUDA programming
style. Therefore, a PRUDA user can use already developed CUDA code.
A GPU is composed of one or several streaming multiprocessors (SMs)
and one or several copy engines (CEs). Streaming multiprocessors are
able to achieve computations (kernels), whereas copy engines execute
memory copy operations between different memory spaces. Programming
the GPU requires dividing parallel computations into several grids,
and each grid into several blocks. A block is a set of multiple
threads. A GPU can be programmed using generic platforms such as
OpenCL or proprietary APIs. We use CUDA, an NVIDIA proprietary
platform, to have tight control over SMs and CEs in the C/C++
programming language, using the NVIDIA compiler *nvcc*.

From the PRUDA perspective, the GPU is a set of copy engines and one
or more processors. Each SM can be considered as a processor, or both
SMs as a single processor, according to the scheduling policy. PRUDA
manages memory copies between CPU and GPU and kernel pools to make
scheduling decisions.
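As a concrete illustration of the grid/block/thread decomposition, here
is a minimal, PRUDA-independent CUDA sketch (the `add_one` kernel and
the launch sizes are arbitrary examples, not part of PRUDA), compiled
with *nvcc*:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8

// One thread per element; the driver decides which SM runs each block.
__global__ void add_one(int *v) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) v[i] += 1;
}

int main() {
    int *h = (int*)malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) h[i] = i;

    int *d;
    cudaMalloc((void**)&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<2, 4>>>(d);   // one grid: 2 blocks of 4 threads each

    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[7] = %d\n", h[7]);  // prints 8

    cudaFree(d);
    free(h);
    return 0;
}
```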
When a kernel is invoked by CPU code, it submits commands to the
GPU. How and when commands are consumed is hidden by the manufacturer
for intellectual-property reasons. PRUDA has been tested on the Jetson
TX2. It is composed of 6 ARM-based CPU cores, along with an integrated
NVIDIA Pascal-based GPU. The GPU in the TX2 is composed of 256 CUDA
cores, divided into two SMs, and one copy engine. CPUs and GPU share
the same memory module. From a programming perspective, one may either
allocate two separate memory spaces for CPU and GPU, using the `malloc`
and `cudaMalloc` primitives respectively, or use a memory space
logically visible to both the CPU and the GPU, called CUDA unified
memory (available even for discrete GPUs). No memory copies are needed
between CPU and GPU tasks for memory spaces (buffers) allocated using
the `cudaMallocManaged` primitive. PRUDA allows handling both memory
copy schemes by enabling and disabling automatic memory copy
operations.
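For comparison with the explicit-copy sketch above, the same
computation with unified memory needs no explicit copies. A minimal
sketch, reusing the same hypothetical `add_one` kernel:

```c
#include <cuda_runtime.h>
#include <stdio.h>

#define N 8

__global__ void add_one(int *v) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) v[i] += 1;
}

int main() {
    int *v;
    cudaMallocManaged(&v, N * sizeof(int));  // one buffer, visible to CPU and GPU
    for (int i = 0; i < N; i++) v[i] = i;    // CPU writes it directly

    add_one<<<2, 4>>>(v);                    // GPU reads/writes the same buffer
    cudaDeviceSynchronize();                 // wait before the CPU reads results

    printf("v[7] = %d\n", v[7]);             // prints 8
    cudaFree(v);
    return 0;
}
```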
Typical CUDA programs are organized in the same way: first, memory
allocation is achieved both on the CPU and the GPU; then, memory
copies are operated from CPU to GPU; later, the GPU kernel is
launched; and finally, results are copied back to the CPU by memory
copy operations. `cudaMalloc` is a costly operation. Therefore, in
PRUDA, it must be achieved by the programmer outside of real-time
task processing, as sketched below.
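A minimal sketch of that structure, with allocation hoisted out of the
job loop; `my_kernel` and the fixed-count loop are placeholders for a
periodic real-time task body (PRUDA's own task creation is shown in
the usage example below):

```c
#include <cuda_runtime.h>

#define N 8

__global__ void my_kernel(int *v, int n) {       // placeholder workload
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) v[i] += 1;
}

int main() {
    int host_buf[N] = {0};
    int *dev_buf;
    cudaMalloc((void**)&dev_buf, N * sizeof(int)); // costly: done once, at init time

    for (int job = 0; job < 3; job++) {            // stands in for the periodic job body
        cudaMemcpy(dev_buf, host_buf, N * sizeof(int), cudaMemcpyHostToDevice);
        my_kernel<<<1, N>>>(dev_buf, N);
        cudaMemcpy(host_buf, dev_buf, N * sizeof(int), cudaMemcpyDeviceToHost);
    }

    cudaFree(dev_buf);                             // freed once, at shutdown
    return 0;
}
```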
All threads of any block are executed by one and only one SM; however,
different blocks of the same kernel may be executed on different
SMs. For example, on the Jetson TX2, one kernel may have blocks
running on both SM0 and SM1 while another kernel runs only on SM0. The
kernel execution order and mechanisms are driven by internal
closed-source NVIDIA drivers (in our case study). A PRUDA user may get
the SM where a given block/thread is executing using the
`pruda_get_sm()` primitive. PRUDA also allows enforcing the allocation
of a given kernel to a specific SM by using the PRUDA primitive
`pruda_allocate_to_sm(int sm_id)`, where `sm_id` is the id of the
target streaming multiprocessor. Implementation details about how
these primitives work can be found in the PRUDA description section.
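A sketch of how these two primitives might be used together. Two
assumptions here are not confirmed by this README: that
`pruda_get_sm()` is callable from device code and returns an `int` SM
id (the text above says it reports where a block/thread is executing),
and that the PRUDA headers are the same ones included in the usage
example below.

```c
#include "../inc/user.h"   // PRUDA headers, as in the usage example below
#include "../inc/tools.h"
#include <stdio.h>

// Assumption: pruda_get_sm() is device-callable and returns the id of
// the SM the current block/thread is running on.
__global__ void where_am_i(void) {
    printf("block %d runs on SM %d\n", blockIdx.x, pruda_get_sm());
}

int main() {
    // Pin the next kernel to streaming multiprocessor 0.
    pruda_allocate_to_sm(0);
    where_am_i<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}
```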
## CUDA operations

PRUDA allows a kernel to execute within a single SM.
# PRUDA usage by example
```c
#include "../inc/user.h"
#include "../inc/tools.h"
#define N 8
__global__ void add( int *a, int *b, int *c ) {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  while (tid < N) {
    c[tid] = a[tid] + b[tid];
    tid += blockDim.x * gridDim.x; // grid-stride step over all launched threads
  }
}
__global__ void mul( int *a, int *b, int *c, int h ) {
  int tid = blockDim.x * blockIdx.x + threadIdx.x;
  while (tid < N) {
    c[tid] = a[tid] * b[tid];
    tid += blockDim.x * gridDim.x; // grid-stride step, as in add
  }
}
int main(){
// initializing pointers
int *a, *b, *c;
  int *dev_a, *dev_b, *dev_c, ac;
// allocating CPU memory
a = (int*)malloc( N * sizeof(int) );
b = (int*)malloc( N * sizeof(int) );
c = (int*)malloc( N * sizeof(int) );
  // initialize the input vectors and the scalar argument of mul
  for (int i = 0; i < N; i++) {
    a[i] = i;
    b[i] = i;
  }
  ac = 0;
// allocating GPU memory
cudaMalloc( (void**)&dev_a, N * sizeof(int) );
cudaMalloc( (void**)&dev_b, N * sizeof(int) );
cudaMalloc( (void**)&dev_c, N * sizeof(int) );
// initializing the list of kernels
init_kernel_listing();
create_kernel(std::get<1>(get_listing()),add,2,5,dev_a,dev_b,dev_c);
create_kernel(std::get<0>(get_listing()),mul,2,5,dev_a,dev_b,dev_c,ac);
  // GPU scheduling parameters for kernel add
  struct gpu_sched_param add_p;
  add_p.period_us   = 3000000;
  add_p.deadline_us = 3000000;
  add_p.priority    = 20;
  // GPU scheduling parameters for kernel mul
  struct gpu_sched_param mul_p;
  mul_p.period_us   = 6000000;
  mul_p.deadline_us = 6000000;
  mul_p.priority    = 15;
int gs_a = 10;
int bs_a = 5;
int gs_b = 10;
int bs_b = 5;
// declaring the gpu tasks (a container of params)
struct pruda_task_t * p_task_b = create_pruda_task(1, add_p, gs_a, bs_a);
struct pruda_task_t * p_task_a = create_pruda_task(0, mul_p, gs_b, bs_b);
  // initializing the scheduler with the desired strategy and scheduling policy
init_scheduler(SINGLE, FP);
// adding pruda tasks to the scheduling unit
add_pruda_task(p_task_a);
add_pruda_task(p_task_b);
  // launch the tasks on GPU and the corresponding memory copies
  // (locks and synchronization params are under development)
create_cpu_threads();
...
}
```
# PRUDA scheduling tools and policies
## Single core strategy for non-preemptive schedulers
## Single core strategy for preemptive schedulers
## Multicore strategy for GANG preemptive schedulers
# OPTIONS
PRUDA manages memory copies between CPU and GPU and kernel pools to
make scheduling decisions.