Commit 2632fbb2, authored 5 years ago by zahoussem (parent c18dc1de): README.md, 128 additions and 59 deletions.
...

PRUDA is a set of programming tools and mechanisms to control
scheduling within the GPU. It also provides implementations of the
following real-time scheduling policies:

- Fixed Priority (FP): preemptive and non-preemptive, single core
- Earliest Deadline First (EDF): preemptive and non-preemptive
- Gang scheduling: preemptive and non-preemptive GANG, using the GPU
  as a multiprocessor architecture

Additionally, PRUDA aims not to modify the CUDA user programming
style. Therefore, the PRUDA user can use already developed CUDA
...
A GPU is composed of one or several streaming multiprocessors (SMs)
and one or several copy engines (CEs). Streaming multiprocessors
perform computations (kernels), whereas copy engines execute memory
copy operations between different memory spaces. Each SM can be
considered as a processor, or both SMs as a single processor,
according to the scheduling policy.

To get the SM on which a given kernel executes, the kernel can invoke
`pruda_get_sm()`. PRUDA also allows enforcing the allocation of a
given kernel to a specific SM by using `pruda_allocate_to_sm(int sm_id)`,
where `sm_id` is the id of the target streaming multiprocessor.

## Implementation details
When a kernel is invoked by CPU code, it submits commands to the
GPU. How and when commands are consumed is hidden by the manufacturer
for intellectual-property reasons. PRUDA has been tested on the Jetson
TX2, which comprises 6 ARM-based CPU cores along with an integrated
NVIDIA Pascal-based GPU. The GPU in the TX2 contains 256 CUDA cores,
divided into two SMs, plus one copy engine. The CPUs and the GPU share
the same memory module. From a programming perspective, one may
allocate two separate memory spaces for the CPU and the GPU using the
`malloc` and `cudaMalloc` primitives respectively. Alternatively, the
programmer may use a memory space logically visible to both the CPU
and the GPU, called CUDA unified memory (available even for discrete
GPUs); no memory copies are then needed between CPU and GPU tasks for
memory spaces (buffers) allocated with the `cudaMallocManaged`
primitive. PRUDA handles both models by enabling and disabling
automatic memory copy operations.

Typical CUDA programs are organized in the same way: first, memory is
allocated on both the CPU and the GPU; then memory copies are
performed from CPU to GPU; later, the GPU kernel is launched; and
finally the results are copied back to the CPU by memory copy
operations. `cudaMalloc` is a costly operation. Therefore, in PRUDA,
it must be performed by the programmer outside of real-time task
processing.
All threads of any block are executed by only one SM; however,
different blocks of the same kernel may be executed on different
SMs. In Figure `fig:sched_jetson`, the green kernel is executed on
both SM0 and SM1, while the red kernel is executed only on SM0. The
kernel execution order and mechanisms are driven by internal
closed-source NVIDIA drivers (in our case study). A PRUDA user may get
the SM where a given block/thread is executing using the
`pruda_get_sm()` primitive. PRUDA also allows enforcing the allocation
of a given kernel to a specific SM by using the PRUDA primitive
`pruda_allocate_to_sm(int sm_id)`, where `sm_id` is the id of the
target streaming multiprocessor. Implementation details about these
primitives can be found in the PRUDA description section.
...
## CUDA operations:

PRUDA allows a kernel to execute within a single SM.

# PRUDA usage by example
```c
#include "../inc/user.h"
#include "../inc/tools.h"

#define N 8

__global__ void add(int *a, int *b, int *c){
    printf("here 2\n");
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (tid < N){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x;
    }
}

__global__ void mul(int *a, int *b, int *c, int h){
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (tid < N){
        c[tid] = a[tid] * b[tid];
        tid += blockDim.x;
    }
}

int main(){
    // initializing pointers
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c, ac;

    // allocating CPU memory
    a = (int *)malloc(N * sizeof(int));
    b = (int *)malloc(N * sizeof(int));
    c = (int *)malloc(N * sizeof(int));
    // ... init vars

    // allocating GPU memory
    cudaMalloc((void **)&dev_a, N * sizeof(int));
    cudaMalloc((void **)&dev_b, N * sizeof(int));
    cudaMalloc((void **)&dev_c, N * sizeof(int));

    // initializing the list of kernels
    init_kernel_listing();
    create_kernel(std::get<1>(get_listing()), add, 2, 5, dev_a, dev_b, dev_c);
    create_kernel(std::get<0>(get_listing()), mul, 2, 5, dev_a, dev_b, dev_c, ac);

    // GPU scheduling parameters for kernel add
    struct gpu_sched_param add_p;
    add_p.period_us   = 3000000;
    add_p.deadline_us = 3000000;
    add_p.priority    = 20;

    // GPU scheduling parameters for kernel mul
    struct gpu_sched_param mul_p;
    mul_p.period_us   = 6000000;
    mul_p.deadline_us = 6000000;
    mul_p.priority    = 15;

    // grid and block sizes
    int gs_a = 10;
    int bs_a = 5;
    int gs_b = 10;
    int bs_b = 5;

    // declaring the GPU tasks (a container of params)
    struct pruda_task_t *p_task_b = create_pruda_task(1, add_p, gs_a, bs_a);
    struct pruda_task_t *p_task_a = create_pruda_task(0, mul_p, gs_b, bs_b);

    // initializing the scheduler with the desired strategy and scheduling policy
    init_scheduler(SINGLE, FP);

    // adding PRUDA tasks to the scheduling unit
    add_pruda_task(p_task_a);
    add_pruda_task(p_task_b);

    // launch the tasks on the GPU and the corresponding memory copies
    // (locks and synchronization params are under development)
    create_cpu_threads();
    // ...
}
```
# PRUDA scheduling tools and policies

## Single core strategy for non preemptive schedulers

## Single core strategy for preemptive schedulers

## Multicore strategy for GANG preemptive schedulers
# OPTIONS

PRUDA manages memory copies between the CPU and the GPU, and kernel
pools, in order to make scheduling decisions.