// Example of a sporadic task using the PRUDA API.
void sporadicTask(void *arg) {
    /* <initialization> */
    // Register the kernel with its launch geometry, priority and target SM.
    addkernel(kernelA, gridSize, blockSize, param, priority, CHOSENSM, SMid);
    while (cond) {
        launchOnlyChosenSm();   // launch the kernel on the chosen SM only
        /* <wait event> */      // block until the next sporadic activation
    }
    destroyOnlyChosenSm();      // release the SM-specific resources
}
@inproceedings{amert2017gpu,
  title={GPU scheduling on the NVIDIA TX2: Hidden details revealed},
  author={Amert, Tanya and Otterness, Nathan and Yang, Ming and Anderson, James H and Smith, F Donelson},
  booktitle={IEEE Real-Time Systems Symposium (RTSS)},
  pages={104--115},
  year={2017},
  organization={IEEE}
}
@inproceedings{capodieci2018deadline,
  title={Deadline-based scheduling for GPU with preemption support},
  author={Capodieci, Nicola and Cavicchioli, Roberto and Bertogna, Marko and Paramakuru, Aingara},
  booktitle={IEEE Real-Time Systems Symposium (RTSS)},
  pages={119--130},
  year={2018},
  organization={IEEE}
}
@inproceedings{elliott2013gpusync,
  title={{GPUSync}: A framework for real-time GPU management},
  author={Elliott, Glenn A and Ward, Bryan C and Anderson, James H},
  booktitle={IEEE Real-Time Systems Symposium (RTSS)},
  pages={33--44},
  year={2013},
  organization={IEEE}
}
@inproceedings{kato2012gdev,
  title={Gdev: GPU resource management in the operating system},
  author={Kato, Shinpei and McThrow, Michael and Maltzahn, Carlos and Brandt, Scott},
  booktitle={USENIX Annual Technical Conference (ATC)},
  pages={401--412},
  year={2012}
}
@inproceedings{kato2011rgem,
  title={{RGEM}: A responsive GPGPU execution model for runtime engines},
  author={Kato, Shinpei and Lakshmanan, Karthik and Kumar, Aman and Kelkar, Mihir and Ishikawa, Yutaka and Rajkumar, Ragunathan},
  booktitle={IEEE Real-Time Systems Symposium (RTSS)},
  pages={57--66},
  year={2011},
  organization={IEEE}
}
@inproceedings{kato2011timegraph,
  title={{TimeGraph}: GPU scheduling for real-time multi-tasking environments},
  author={Kato, Shinpei and Lakshmanan, Karthik and Rajkumar, Raj and Ishikawa, Yutaka},
  booktitle={USENIX Annual Technical Conference (ATC)},
  pages={17--30},
  year={2011}
}
%\documentclass{article}
\documentclass{sig-alternate}
\usepackage{algpseudocode}
\usepackage{algorithm}
\usepackage{mathtools}
\usepackage{listings}
\usepackage{soul}
\usepackage{color}
%\def\isreport{1}
%\def\isfinal{1}
\ifdefined\isreport
\newcommand{\reportonly}[1]{#1}
\newcommand{\paperonly}[1]{}
\else
\newcommand{\reportonly}[1]{}
\newcommand{\paperonly}[1]{#1}
\fi
\ifdefined\isfinal
\newcommand{\added}[1]{#1}
\newcommand{\deleted}[1]{}
\newcommand{\gl}[1]{}
\newcommand{\hz}[1]{}
\newcommand{\rt}[1]{}
\else
\newcommand{\added}[1]{\textcolor{blue}{#1}}
\newcommand{\deleted}[1]{\st{#1}}
\newcommand{\gl}[1]{\textcolor{red}{[\textbf{Giuseppe :} #1]}}
\newcommand{\hz}[1]{\textcolor{yellow}{[\textbf{Houssam :} #1]}}
\newcommand{\rt}[1]{\textcolor{green}{[\textbf{Reyyan :} #1]}}
\fi
\usepackage{minted}
\usepackage{tikz}
%\title{PRUDA: An API for Time and Space Predictible Programming in NVDIA GPUS using CUDA}
%\author{R. Tekin, H-E. Zahaf, G. Lipari (Order to be defined)}
\begin{document}
%
% --- Author Metadata here ---
\conferenceinfo{}{}
%\CopyrightYear{2007} % Allows default copyright year (20XX) to be over-ridden - IF NEED BE.
%\crdata{0-12345-67-8/90/01} % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.
% --- End of Author Metadata ---
\title{PRUDA: An API for Time and Space Predictable Programming in NVIDIA GPUs using CUDA}
%
% You need the command \numberofauthors to handle the 'placement
% and alignment' of the authors beneath the title.
%
% For aesthetic reasons, we recommend 'three authors at a time'
% i.e. three 'name/affiliation blocks' be placed beneath the title.
%
% NOTE: You are NOT restricted in how many 'rows' of
% "name/affiliations" may appear. We just ask that you restrict
% the number of 'columns' to three.
%
% Because of the available 'opening page real-estate'
% we ask you to refrain from putting more than six authors
% (two rows with three columns) beneath the article title.
% More than six makes the first-page appear very cluttered indeed.
%
% Use the \alignauthor commands to handle the names
% and affiliations for an 'aesthetic maximum' of six authors.
% Add names, affiliations, addresses for
% the seventh etc. author(s) as the argument for the
% \additionalauthors command.
% These 'additional authors' will be output/set for you
% without further effort on your part as the last section in
% the body of your article BEFORE References or any Appendices.
\numberofauthors{1} % in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
%extended
\author{
Reyyan Tekin, Houssam-Eddine ZAHAF, Giuseppe Lipari\\
Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, Lille, France\\
\email{\{firstname.familyname\}@univ-lille.fr}
}
\maketitle
\input{texs/abstract.tex}
\input{texs/intro.tex}
%\input{texs/stat.tex}
\input{texs/model.tex}
\input{texs/strategy.tex}
\input{texs/analysis.tex}
\input{texs/api.tex}
\input{texs/conclu.tex}
\input{texs/annex.tex}
\bibliography{biblio}
\bibliographystyle{plain}
\end{document}
This diff is collapsed.
\begin{abstract}
Recent computing platforms combine CPUs with different types of
accelerators, such as Graphical Processing Units ({\it GPUs}), to cope
with the increasing computational power needed by complex real-time
applications. NVIDIA GPUs are composed of hundreds of computing
elements, called {\it CUDA cores}, to achieve fast computations for
parallel applications. However, GPUs are not designed to support
real-time execution, as their main goal is to achieve maximum
throughput for their resources. Supporting real-time execution on
NVIDIA GPUs involves not only achieving timely predictable
computations but also optimizing the usage of the CUDA cores. In this
work, we present the design and the implementation of {\it PRUDA}
(Predictable Real-time CUDA), a programming platform to manage GPU
resources and therefore decide when and where a real-time task is
executed. PRUDA is written in {\sf C} and provides different
mechanisms to manage task priorities and allocation on the GPU. It
provides tools to help a designer properly implement real-time
schedulers on top of CUDA.
\end{abstract}
\vspace{-1em}
\section{PRUDA API}
All three strategies are integrated into the PRUDA C++ API. For the
single-stream strategy, we have also implemented both the EDF and
fixed-priority algorithms.
First of all, the API (Figure \ref{fig:pruda_show}) requires the user
to implement his/her kernels using CUDA. The first step is then to
initialize the PRUDA scheduler by invoking the function:

{\sf pruda\_init\_sched(method\_t method, policy p);}

\noindent where {\sf method} is either SINGLESTREAM for the first strategy,
MULTIPLESTREAMS for the second, or MULTIPROC for the third. The
parameter {\sf p} is the scheduling policy; the current version
supports EDF and FP.
Once the scheduler has been initialized, kernels are added to the task
queue {\sf tq} by invoking the function:

{\sf pruda\_add\_kernel(p\_kernel\_t kern, int gs, int bs, int p);}

\noindent where {\sf kern} is a pointer to the kernel function, {\sf gs} is
the grid size, {\sf bs} is the block size, and {\sf p} is the task
priority when the fixed-priority policy is selected.
Once all PRUDA kernels have been added, the function {\sf
pruda\_start} is invoked to start all periodic threads. Memory
operations are implicitly achieved by means of CUDA unified memory;
explicit memory copies are under development and will be supported
soon.
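For illustration, the listing below sketches how a user program could
put these calls together; the kernel name and launch parameters are
hypothetical, and only the function and enum names given above are
taken from the API.

\begin{minted}[frame=lines]{c}
__global__ void kernelA(void);   /* user CUDA kernel */

int main(void) {
    /* single-stream strategy with the EDF policy */
    pruda_init_sched(SINGLESTREAM, EDF);
    /* grid size 64, block size 128, priority 1 (unused under EDF) */
    pruda_add_kernel(kernelA, 64, 128, 1);
    /* start all periodic PRUDA threads */
    pruda_start();
    return 0;
}
\end{minted}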
\vspace{-1em}
%\lstinputlisting[caption={Example of API using in a
%sporadic task},language=C++]{api_example.cpp}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End:
\section{Conclusion}
\label{sec:conclu}
In this paper, we have presented PRUDA, a programming interface to develop real-time schedulers on top of CUDA. PRUDA provides different strategies to control the temporal and spatial behavior of real-time tasks on the GPU. In future work, we plan to provide tools to analyze the real-time behavior of PRUDA tasks. In fact, the GPU does not allow unrestricted preemption and offers only a very limited number of priorities. These limitations have to be taken into account in the analysis of PRUDA task behavior to ensure that timing constraints are respected. We are also planning to develop a tool for tracing PRUDA tasks along with the NVIDIA nvprof profiling tool.
\section{Introduction}
Many real-time applications, such as computer vision and surveillance
systems, demand complex processing on large amounts of data. Classical
multiprocessor platforms combining only CPUs are not able to satisfy
the real-time requirements of such systems, as they require computing
capabilities in the order of teraflops.
Recently, NVIDIA has provided computing platforms combining CPUs with
different types of specialized computing units, such as GPUs, Deep
Learning Accelerators (DLAs), etc., on the same chip. These platforms
can offer suitable solutions to meet deadlines for emerging complex
real-time applications. However, the complexity of the software,
combined with the complexity of the hardware architecture, makes it
difficult to analyse the temporal behavior of such systems. Moreover,
these accelerators are not fundamentally designed to execute real-time
tasks; therefore, they do not provide proper hardware and software
mechanisms to schedule real-time tasks.
Several works \cite{amert2017gpu,elliott2013gpusync,kato2011timegraph}
have attacked the problem of providing real-time support on GPUs from
different perspectives. Kato et al. have proposed platforms (TimeGraph
and RGEM) for non-preemptive scheduling of graphical tasks on the GPU
\cite{kato2011timegraph,kato2011rgem}. The authors of
\cite{amert2017gpu} studied how a GPU makes scheduling decisions by
benchmarking the Jetson TX2 platform. Capodieci et al.
\cite{capodieci2018deadline} modified the proprietary NVIDIA driver to
implement an event-driven scheduler that uses the fine-grain
preemption levels provided by recent GPUs under different policies
such as EDF and fixed priority. GPUSync \cite{elliott2013gpusync} is a
platform able to control scheduling within the GPU using locks. The
work in \cite{capodieci2018deadline} is closed source, whereas the
GPUSync \cite{elliott2013gpusync} platform does not provide tools to
freely implement real-time schedulers. In both works, the GPU is used
as a single-core platform.
\paragraph{Contributions.}
In this work, we develop {\it PRUDA}, a platform that implements
different strategies to control real-time execution within a GPU using
CUDA. PRUDA provides control over priorities, task allocation, and
parallel execution within the same GPU. The different primitives of
PRUDA allow implementing several real-time scheduling policies using
different strategies. The platform is currently under active
development: we are working on implementing a special version of EDF
(GRUB) and fixed-priority scheduling policies with PRUDA.
The remainder of this paper is structured as follows. The GPU
architecture and its {\it known} scheduling mechanisms are detailed in
Section \ref{sec:gpu}. Section \ref{sec:models} presents the task and
architecture models. We reserve Section \ref{sec:strategies} to define
how priorities and allocations are controlled within a GPU using our
platform. In Section \ref{sec:policies}, we overview the
implementation of real-time schedulers using PRUDA. We draw our
conclusions in Section \ref{sec:conclu}.
\section{GPU programming and PRUDA primitives }
\label{sec:gpu}
A GPU is composed of one or more \emph{streaming multiprocessors} (SMs)
and one or more \emph{copy engines} (CEs). Streaming multiprocessors
execute computations (\emph{kernels}), whereas copy engines execute
memory copy operations between different memory spaces. Programming
the GPU requires dividing parallel computations into several grids,
and each grid into several blocks. A block is a set of multiple
threads. A GPU can be programmed using generic platforms such as
OpenCL, or using proprietary APIs. We use CUDA, an NVIDIA proprietary
platform, to have tight control over SMs and CEs from the {\sf C}
programming language using the NVIDIA compiler.
\begin{figure}[t]
\centering
\resizebox{200px}{!}{
\newcommand\squaring[4]{
\draw[rounded corners](0+#1,0+#2)rectangle(3+#1,2+#2+#3);
\node at(#1+1.5,#2+1+#3/2){#4};
}
\begin{tikzpicture}
\squaring{0}{0}{0}{SM0 (128 cores)};
\squaring{3.2}{0}{0}{SM1 (128 cores)};
\squaring{1.8}{-1.5}{-1}{Cpy. Engine};
\squaring{-4.5}{0}{0}{Denver CPUs};
\squaring{-4.5}{-2.2}{0}{A57 CPUs};
\draw[](-0.3,-3) rectangle (6.4,2.2);
\node at(3.5,-2.5){NVIDIA PASCAL GPU };
\draw[](-0.3-5,-3) rectangle (6.4-7,2.2);
\node at(3.5-6.5,-2.5){CPU Islands};
\draw[](-5.8,-4) rectangle (7,-3.2);
\node at (0.6,-3.6){Shared Main Memory };
\end{tikzpicture}
}
\caption{Jetson TX2 Architecture}
\label{fig:jetson}
\end{figure}
When a kernel is invoked by CPU code, commands are submitted to the
GPU. How and when these commands are executed is hidden by the
manufacturer for intellectual-property reasons. The authors of
\cite{amert2017gpu} have tried to reveal some \emph{GPU scheduling
secrets} by benchmarking a Jetson TX2 (abbreviated TX2 in the rest of
this paper). It is composed of 6 ARM-based CPU cores, along with an
integrated NVIDIA Pascal-based GPU as shown in Figure
\ref{fig:jetson}, all running Ubuntu. The GPU in the TX2 is composed
of 256 CUDA cores, divided into two SMs, and one copy engine. The CPUs
and the GPU share the same memory module. From a programming
perspective, one may either allocate two separate memory spaces for
the CPU and the GPU using the {\sf malloc} and {\sf cudaMalloc}
primitives respectively, or use a memory space logically visible to
both the CPU and the GPU, called CUDA unified memory (available even
for discrete GPUs). In the latter case, no memory copies are needed
between CPU and GPU tasks for buffers allocated using the {\sf
cudaMallocManaged} primitive. The current version of PRUDA supports
CUDA unified memory to avoid dealing with memory copy operations, as
will be shown in the PRUDA architecture. An extension to separate
memory spaces is under development and will be available soon.
Typical CUDA programs are organized in the same way: first, memory is
allocated on both the CPU and the GPU; then, data is copied from CPU
memory to GPU memory; next, the GPU kernel is launched; finally, the
results are copied back to the CPU by memory copy operations.
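As an illustration of this structure, the following minimal CUDA
sketch uses unified memory (as PRUDA does), so the explicit copy steps
collapse into allocation and synchronization; the kernel and the
buffer size are illustrative only.

\begin{minted}[frame=lines]{cuda}
#include <cuda_runtime.h>

__global__ void scale(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;                  /* one element per thread */
}

int main(void) {
    const int n = 1 << 20;
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float)); /* visible to CPU and GPU */
    for (int i = 0; i < n; i++) buf[i] = (float)i;
    scale<<<(n + 255) / 256, 256>>>(buf, n);    /* launch the kernel */
    cudaDeviceSynchronize();                    /* wait for completion */
    cudaFree(buf);
    return 0;
}
\end{minted}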
Regarding kernel execution within the GPU, the authors of
\cite{amert2017gpu} affirm that all threads of a given block are
executed by only one SM; however, different blocks of the same kernel
may be executed on different SMs. In Figure \ref{fig:sched_jetson},
the green kernel is executed on both SM0 and SM1, while the red kernel
is executed only on SM0. The kernel execution order and mechanisms are
driven by the internal closed-source NVIDIA driver (in our case
study). A PRUDA user may obtain the SM where a given block/thread is
executing by using the {\sf pruda\_get\_sm()} function. PRUDA also
allows enforcing the allocation of a given kernel to a specific SM by
using the PRUDA function {\sf pruda\_allocate\_to\_sm(int sm\_id)},
where {\sf sm\_id} is the id of the target streaming multiprocessor.
Implementation details about how these functions work can be found in
the PRUDA description section.
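One possible way to implement such a query on NVIDIA GPUs is to read
the PTX special register {\sf \%smid} from device code; the sketch
below is an assumption about how {\sf pruda\_get\_sm} could work, not
a description of the actual PRUDA internals.

\begin{minted}[frame=lines]{cuda}
/* read the id of the SM executing the calling thread */
__device__ unsigned int pruda_get_sm(void) {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

/* blocks that wake up on the wrong SM return without doing any work */
__global__ void kernelA(int sm_id /*, kernel arguments */) {
    if (pruda_get_sm() != sm_id) return;
    /* ... kernel body ... */
}
\end{minted}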
\begin{figure}[b]
\centering
\resizebox{0.6\columnwidth}{!}{
\begin{tikzpicture}
\draw (0,0)--(5,0);
\draw (0,2)--(5,2);
\draw (0,1)--(5,1);
\draw [black, fill=green](0,0)rectangle (1,0.3);
\node at (-0.5,0.5){\scriptsize SM0};
\node at (-0.5,1.5){\scriptsize SM1};
\filldraw [color=black, fill=green!20](0,0)rectangle (1,0.3);
\filldraw [color=black, fill=green!20](0,1)rectangle (1,1.3);
\filldraw [color=black, fill=green!20](2.8,3)rectangle (3,3.2);
\filldraw [color=black, fill=red!20](2.8,2.7)rectangle (3,2.9);
\node at(3.7,3.1){\tiny Kernel 1};
\node at(3.7,2.8){\tiny Kernel 2};
\filldraw [color=black, fill=red!20](2,0)rectangle (4,0.3);
\end{tikzpicture}
}
\caption{Example of Kernel scheduling in GPU}
\label{fig:sched_jetson}
\end{figure}
To enforce an execution order between different kernels, we use a
specific data structure called a \emph{CUDA stream}. A CUDA stream has
a FIFO behavior: kernels submitted to a stream are executed one after
the other in a {\bf sequential} fashion. Synchronization between two
consecutive kernels is therefore implicitly achieved. This property
will be used later to implement non-preemptive EDF and fixed-priority
real-time scheduling policies.
In CUDA, the user may define several streams, and priorities may be
set between different streams. Therefore, if a stream {\sf A} has a
higher priority than stream {\sf B}, all kernels of {\sf A} are meant
to execute before kernels that are submitted to {\sf B}. If a kernel
in {\sf B} is executing and a kernel is activated on {\sf A}, the GPU
may preempt the kernel of {\sf B} to execute the kernel of {\sf A},
according to the GPU preemption level (we will show this behaviour in
our benchmarks). We highlight that fine-grain preemption capabilities
are available in NVIDIA GPUs starting from the Pascal architecture.
For example, if preemption is set at the block level, preemption is
achieved when all already-executing blocks finish their execution.
Recent Volta GPUs allow even finer preemption levels.
Even if it is possible to create more than two streams, only two
levels of priority are available on the Jetson TX2 platform. These
properties will be used later to approximate EDF and fixed-priority
preemptive scheduling policies.
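The following minimal sketch shows how two such priority streams can
be created with the standard CUDA runtime API; how PRUDA wraps these
streams internally is an assumption of this sketch, only the
two-priority limitation is taken from the discussion above.

\begin{minted}[frame=lines]{cuda}
#include <cuda_runtime.h>

int main(void) {
    int low, high;   /* least and greatest stream priorities on this device */
    cudaDeviceGetStreamPriorityRange(&low, &high);

    cudaStream_t h_sq, l_sq;   /* high- and low-priority streams */
    cudaStreamCreateWithPriority(&h_sq, cudaStreamNonBlocking, high);
    cudaStreamCreateWithPriority(&l_sq, cudaStreamNonBlocking, low);

    /* kernels submitted to h_sq may preempt, at block boundaries,
       kernels already running in l_sq */

    cudaStreamDestroy(h_sq);
    cudaStreamDestroy(l_sq);
    return 0;
}
\end{minted}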
Other PRUDA functions will be detailed later.
\section{System model}
\label{sec:models}
In this paper, we are only interested in GPU programming and
scheduling. While this paper provides real-time support for GPUs, we
do not provide any schedulability analysis yet; the analysis is work
in progress.
We assume that all tasks in the system are programmed using PRUDA;
therefore, only PRUDA tasks compete for the GPU. Each task $\tau_i$ is
characterized by its deadline $\mathsf{D}_i$ and its period
$\mathsf{T}_i$. Tasks are strictly periodic, therefore the exact time
between two successive activations of task $\tau_{i}$ is equal to
$\mathsf{T}_i$. The $j^{th}$ instance of task $\tau_i$ must finish its
execution no later than $\mathsf{T}_i \times j + \mathsf{D}_i$,
otherwise it {\it misses} its deadline. A task may be scheduled using
fixed priority; in that case it is also characterized by a priority
parameter $\mathsf{P}_i$. From the implementation perspective, each
PRUDA task is an instance of a periodic CPU thread, as shown in the
algorithm of Figure \ref{algo:pruda}.
\begin{figure}
\centering
\begin{minted}[frame=lines]{c}
void *pruda_task(void *arg) {
    struct timespec next;
    p_kernel_t *pk = (p_kernel_t *)(arg);
    while (1) {
        /* memory copy operations (handled by unified memory) */
        clock_gettime(CLOCK_REALTIME, &next);
        /* register the GPU job in the proper run-queue */
        pruda_subscribe(pk->kernel, pk->priority);
        /* sleep until the next periodic activation */
        timespec_addto(&next, pk->T);
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &next, NULL);
    }
}
\end{minted}
\caption{Pseudo-code of PRUDA task}
\label{algo:pruda}
\end{figure}
The PRUDA task starts by parsing the kernel parameters, namely the
kernel code, priority, deadline and period. It then starts the
periodic task behavior. The task gets the current time and computes
the next activation time {\sf next}. Then, the GPU job is registered
in the correct GPU run-queue, according to the desired scheduling
policy (see the PRUDA architecture in Figure \ref{fig:pruda_show}).
Once the PRUDA CPU thread has launched the kernel, it sleeps until the
next activation. Another scheduling entity checks the run-queue state
and schedules the highest-priority tasks first, according to (i) one
of the strategies detailed in the next section and (ii) the desired
scheduling policy.
The memory copy operation line performs memory copies. This operation
may need to copy several buffers from CPU to GPU and vice-versa. The
current version of the platform uses CUDA unified memory, therefore
memory coherency is achieved automatically by the NVIDIA driver.
The GPU may be scheduled as a single-core platform, or as a parallel
platform where each streaming multiprocessor is an independent core,
by means of the PRUDA {\sf pruda\_allocate\_to\_sm($\cdots$)}
function. The allocation to a given SM is achieved by testing whether
the task is running on the correct SM: if so, the computation is
carried out; otherwise, the threads on the {\it wrong} SM are killed.
\begin{figure*}[ht]
\centering
\newcommand\unefile[3]{
\foreach \x in {1,1.5,...,#3}
{
\draw (#1+\x,#2) rectangle (#1+0.5+\x,#2+0.5);
}
}
\newcommand\arrowtext[5]{
\draw [->,line width=0.5mm, #5](#1,#2)--(#1+#3,#2);
\node at (#1+#3/2,#2+0.2){#4};
}
\resizebox{0.8\textwidth}{!}{
\begin{tikzpicture}
\unefile{0}{0}{3};
\edef\mya{7}
\foreach \z in {2,1.25,...,-2}
{
\unefile{5}{\z}{4};
\pgfmathparse{int(\mya-1)}
\xdef\mya{\pgfmathresult}
\node at(9.9,\z+0.2){P$_{\mya}$};
}
\draw (5.8,2.75) rectangle (10.5,-3.5);
\node[rotate=90] at (8,-2.5){\Huge ...};
\arrowtext{-2}{0.25}{3}{\sf pruda\_add\_task}{red}
\node at (2.5,-0.25){task queue {\sf (tq)}};
\node at (8.25,-3.25){Active task queue {\sf (rq)}};
\foreach \xs in {2.5,1.75,...,-1.5}
\draw [->,dashed] (3.5,0.25)--(6,\xs-0.25);
\draw [dashed](5,-5.8) circle (1.5);
\draw [fill=white, color=white](3.5,-4.85) rectangle (7,-4);
\draw [->](3.73,-5)--(3.89,-4.77);
\draw [color=red, line width=0.5mm, ->] (5,-.75)--(5,-4.25);
\node at (5.25,-4.5){{\sf pruda\_subscribe}};
\unefile{14}{0.5}{4};
\unefile{14}{-0.5}{4};
\draw (11.5+3,1.5) rectangle (17+3,-1.5);
\node at (17.25,-1.25){{\sf stream queues }};
\node at (19.25,0.75){{\sf h-sq }};
\node at (19.25,-.25){{\sf l-sq }};
\arrowtext{10.5}{0.25}{4}{\sf pruda\_resched}{red}
\def\xx{-3.5}
\def\yy{-4.5}
\unefile{14+\xx}{0.5+\yy}{3};
\unefile{14+\xx}{-0.5+\yy}{3};
\node at (18.25+\xx,0.75+\yy){{\sf SM0-q }};
\node at (18.25+\xx,-.25+\yy){{\sf SM1-q }};
\draw [color=red, line width=0.5mm,->](17-6.5,0)--(11,0)--(11,-4.25)--(11.5,-4.25);
\node[rotate=90] at (11.25,-2){pruda\_alloc};
\draw [color=red, line width=0.5mm,->](14,-4.25)--(14.25,-4.25)--(14.25,0)--(14.5,0);
\node[rotate=90] at (14.,-2){pruda\_resched};
\def\xxx{0.75}
\draw (22+\xxx,3.5) rectangle (30+\xxx,-3.5);
\draw (23+\xxx,2.5) rectangle (25.5+\xxx,-1);
\draw (26+\xxx,2.5) rectangle (28.5+\xxx,-1);
\arrowtext{20}{0.25}{2.75}{\sf GPU internals}{red}
\node at (25,-0.75){SM0};
\node at (28,-0.75){SM1};
\node at (26.5,-1.75){\sf pruda\_check};
\node at (26.5,-2.25){\sf pruda\_abort};
\end{tikzpicture}
}
\caption{PRUDA global overview}
\label{fig:pruda_show}
\end{figure*}
\section{Related work}
The GPU scheduling problem has been addressed in several works from
different perspectives. Kato et al. have proposed platforms (TimeGraph
and RGEM) for non-preemptive scheduling of graphical tasks on the GPU
\cite{kato2011timegraph,kato2011rgem}. Another platform, called
GPUSync, has been proposed by Elliott et al.
\cite{elliott2013gpusync}: it is a set of lock mechanisms for the GPU
engines (compute and copy). GPU engines are seen as mutually-exclusive
resources that can be accessed only by using real-time locking
protocols, and GPU kernels are scheduled with a non-preemptable FIFO
algorithm. All the works described above consider the GPU as a
non-preemptable accelerator. Capodieci et al.
\cite{capodieci2018deadline} modified the proprietary NVIDIA driver to
implement an event-driven scheduler that uses the fine-grain
preemption levels provided by recent GPUs under different policies
such as EDF and fixed priority.
\section{Temporal and spatial control of PRUDA tasks on the GPU}
\label{sec:strategies}
Our platform integrates several strategies to implement scheduling
decisions. These strategies have different performance characteristics
and overheads.
\subsection{Single-stream strategy}
The first strategy, called {\it single-stream}, uses one CUDA stream
to enforce kernel scheduling decisions. The scheduler uses three
queues: a task queue ({\sf tq}), which contains the list of all PRUDA
tasks; an active-kernel queue ({\sf rq}), which contains the active
PRUDA jobs; and the stream queue ({\sf sq}), which contains the
kernels that will be submitted to the GPU. When a kernel is activated,
it is added to the {\it correct} active-kernel queue {\sf rq} via the
{\sf pruda\_subscribe}($\cdots$) function. Further, if the CUDA stream
queue {\sf sq} is empty, the highest-priority job according to the
given scheduling policy is moved from {\sf rq} to {\sf sq} by the {\sf
pruda\_resched} function.
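A minimal sketch of this flow is given below; the signatures are
simplified and the queue helpers ({\sf rq\_insert}, {\sf
rq\_pop\_highest}, {\sf sq\_is\_empty}, {\sf launch\_on\_stream}) are
illustrative names, not part of the actual PRUDA implementation.

\begin{minted}[frame=lines]{c}
/* called when a PRUDA job is activated */
void pruda_subscribe(p_kernel_t *kern, int prio) {
    rq_insert(kern, prio);    /* add the job to the ready queue rq */
    pruda_resched();          /* try to push work to the CUDA stream */
}

/* move the highest-priority ready job to the stream queue sq */
void pruda_resched(void) {
    if (sq_is_empty()) {                    /* the single stream is idle */
        p_kernel_t *pk = rq_pop_highest();  /* EDF or FP ordering */
        if (pk)
            launch_on_stream(pk);           /* submit the kernel to sq */
    }
}
\end{minted}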
As only one CUDA stream is used, once a PRUDA task is executing, it
cannot be preempted by another higher-priority task; therefore, only
non-preemptive scheduling algorithms can be implemented using this
strategy. However, we would like to highlight that the PRUDA user may
abort the kernel currently under execution by calling the {\sf
pruda\_abort()} function.
This strategy is simple and easy to implement. It provides an implicit
synchronization between active tasks, i.e., if task {\sf B} is in the
stream queue while {\sf A} is running, {\sf B} will wait until {\sf A}
finishes its execution before starting, without overlapping. However,
the use of this strategy involves reserving all the GPU resources
(both SMs) for a single PRUDA task at a time, even if this task does
not use all the GPU cores; therefore, resources are wasted. In the
next strategies, we will show how to overcome these limitations.
\subsection{Multiple streams: enabling preemption}
In the second strategy, called \emph{multiple streams}, PRUDA creates
multiple streams to take scheduling decisions, allowing concurrent
kernel execution on the GPU and preemption.
First, we recall that the TX2 allows only two priority
levels. Therefore, we create only two streams: one with high priority
and the other with low priority. The queue of the high-priority stream
is denoted by {\sf h-sq}; the second stream queue is denoted by {\sf
l-sq}. We recall that using several streams allows asynchronous and
concurrent execution between the two streams, while within the same
stream the execution is always FIFO.
When a task is activated, it is added to the correct ready-task queue
{\sf rq}. Further, the scheduler checks one of the following
situations (a sketch of this decision logic follows the list):
\begin{enumerate}
\item {\sf h-sq}~$= \emptyset \wedge $~{\sf l-sq} $= \emptyset $ : the
scheduler will allocate the task to the {\sf l-sq} queue, therefore
the task will be submitted {\it immediately} to the GPU.
\item {\sf h-sq}~$= \emptyset \wedge $~{\sf l-sq} $\neq \emptyset $ :
  the scheduler checks whether the activated task has a higher priority
  than the task in {\sf l-sq}. If so, the task is inserted into the
  high-priority queue {\sf h-sq}, and it therefore preempts the task in
  {\sf l-sq} if possible. Otherwise, no scheduling decision is taken.
\end{enumerate}
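The sketch below illustrates these two cases; the helper names ({\sf
queue\_empty}, {\sf head\_priority}, {\sf insert}) are assumptions
used for illustration only.

\begin{minted}[frame=lines]{c}
void pruda_resched_multistream(p_kernel_t *new_task) {
    if (queue_empty(h_sq) && queue_empty(l_sq)) {
        /* case 1: GPU idle, submit immediately via the low-priority stream */
        insert(l_sq, new_task);
    } else if (queue_empty(h_sq) && !queue_empty(l_sq)) {
        /* case 2: preempt the running task only if the newly activated
           task has a higher priority */
        if (new_task->priority > head_priority(l_sq))
            insert(h_sq, new_task);
        /* otherwise, no scheduling decision is taken */
    }
}
\end{minted}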
According to the scheduling mechanism described above, only one
preemption is allowed while a task is already in execution. For
example, if a task {\sf C} arrives after {\sf B} has preempted {\sf
A}, task {\sf C} must wait until {\sf B} finishes, even if it is the
highest-priority active job. We are currently developing a
schedulability analysis for such a limited-preemption system. We also
highlight that preempted tasks will continue to use GPU resources if
the high-priority task does not use {\it all} of the GPU resources.
Even if this strategy overcomes the preemption limitation of the
previous one, it is more complex, and it still uses the GPU as a
single core. In the next section, we use each SM of the GPU as a
single processor, allowing parallel execution within the GPU. We also
highlight that preemption at the instruction level cannot be
guaranteed, as it is decided by the closed NVIDIA internals. However,
we ensure that preemption can be achieved at block boundaries;
therefore, the worst-case preemption cost is in the order of one block
execution.
\subsection{SMs-as-cores strategy}
The third strategy uses the GPU in a similar way to the previous one:
two streams are created, with the same queue configuration. However,
we allow tasks to call the function {\sf
pruda\_allocate\_to\_sm}($\cdots$), thus using the GPU as a
multiprocessor rather than as a single core. We consider two types of
PRUDA tasks: those that are allocated to a given SM, and those that
are not (we consider the PRUDA tasks that do not call the allocation
function as tasks requiring the GPU exclusively).
In addition to the scheduling structures described for the previous
strategy, this strategy uses one queue per SM: {\sf sm0-q} and {\sf
sm1-q}. When a task is activated, if it uses both SMs, no other task
will be scheduled at the same time; therefore, it is added to {\sf
l-sq} or {\sf h-sq} exactly as in the previous strategy. Otherwise, it
uses a single SM and it is assigned to the correct SM queue. Later,
the two jobs having the highest priority in {\sf sm0-q} and {\sf
sm1-q} are scheduled first by being inserted into {\sf l-sq} and {\sf
h-sq}. This allows parallel execution on both streaming
multiprocessors. This strategy allows using the GPU of the TX2 as a
2-core platform.
In fact, the allocation function tests whether a given block/thread is
on the correct SM: if so, it continues its execution; otherwise, it
exits. Therefore, the user either has to take this into account when
using the block and thread indexes, or he/she must use the new
functions we provide to calculate indexes. The thread and block
indexing mechanism we provide is simple but effective. The user is
free to use the CUDA indexes or our platform indexes, as long as there
is no conflict (a possible indexing scheme is sketched below). We
highlight that both of the previous strategies do not require any
modification of the kernel code nor of the programming style
(indexing). Although this method is more complex to implement than the
two previous ones, it provides both temporal and spatial control of
task execution on the GPU. Analyzing the behavior of this final
strategy is a challenging theoretical question that is left for future
work.
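As an illustration of such an indexing scheme, the sketch below
assigns a contiguous ``logical'' block index to the blocks that
survive on the chosen SM, using a global atomic counter; the function
name and the mechanism are assumptions, not PRUDA's actual
implementation.

\begin{minted}[frame=lines]{cuda}
/* must be reset (e.g., with cudaMemset) before each kernel launch */
__device__ unsigned int pruda_block_counter = 0;

/* give each surviving block a contiguous logical index */
__device__ unsigned int pruda_logical_block_id(void) {
    __shared__ unsigned int lid;
    if (threadIdx.x == 0)
        lid = atomicAdd(&pruda_block_counter, 1);  /* one ticket per block */
    __syncthreads();
    return lid;
}
\end{minted}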
\section{Real-time policies using PRUDA}
\label{sec:policies}
Implementing real-time schedulers using PRUDA is simple. In fact, it
requires implementing the {\sf pruda\_subscribe} function and the {\sf
pruda\_resched} function. The goal of the first is to put the active
task in the correct queue according to its priority. If the scheduling
algorithm is fixed priority, it puts the task directly in the
corresponding priority queue. If the algorithm is EDF, it computes the
priority and then inserts the task into the correct queue. The goal of
the second function is to select which active task to run and into
which CUDA stream queue it should be inserted, and therefore submitted
to the GPU. The user is also able to call {\sf pruda\_abort} to exit
the execution of a given kernel, for example to mix real-time and
non-real-time tasks. The PRUDA design described in this and the
previous section is summarized in Figure \ref{fig:pruda_show}. We
highlight that PRUDA functions (except subscribe and resched) can be
used even for non-PRUDA tasks.
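For example, an EDF flavour of {\sf pruda\_subscribe} could use the
absolute deadline as the dynamic priority that orders the ready queue;
the helper names ({\sf now\_ns}, {\sf rq\_insert\_ordered}), the field
name {\sf D}, and the nanosecond representation are assumptions of
this sketch.

\begin{minted}[frame=lines]{c}
void pruda_subscribe_edf(p_kernel_t *pk) {
    /* absolute deadline = activation time + relative deadline D */
    unsigned long long abs_deadline = now_ns() + pk->D;
    /* earliest absolute deadline = highest priority */
    rq_insert_ordered(pk, abs_deadline);
}
\end{minted}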