// Example of a sporadic task using the PRUDA API.
void sporadicTask(void *arg) {
    /* <initialization> */
    // Register the kernel with its launch geometry, priority and target SM.
    addkernel(kernelA, gridSize, blockSize, param, priority, CHOSENSM, SMid);
    while (cond) {
        launchOnlyChosenSm();   // launch the kernel on the chosen SM only
        /* <wait event> */      // block until the next sporadic activation
    }
    destroyOnlyChosenSm();      // release the SM-specific resources
}
@inproceedings{amert2017gpu,
  title={GPU scheduling on the NVIDIA TX2: Hidden details revealed},
  author={Amert, Tanya and Otterness, Nathan and Yang, Ming and Anderson, James H and Smith, F Donelson},
  booktitle={IEEE Real-Time Systems Symposium (RTSS)},
  pages={104--115},
  year={2017},
  organization={IEEE}
}
@inproceedings{capodieci2018deadline,
  title={Deadline-based scheduling for GPU with preemption support},
  author={Capodieci, Nicola and Cavicchioli, Roberto and Bertogna, Marko and Paramakuru, Aingara},
  booktitle={IEEE Real-Time Systems Symposium (RTSS)},
  pages={119--130},
  year={2018},
  organization={IEEE}
}
@inproceedings{elliott2013gpusync,
  title={{GPUSync}: A framework for real-time GPU management},
  author={Elliott, Glenn A and Ward, Bryan C and Anderson, James H},
  booktitle={IEEE Real-Time Systems Symposium (RTSS)},
  pages={33--44},
  year={2013},
  organization={IEEE}
}
@inproceedings{kato2012gdev,
  title={Gdev: GPU resource management in the operating system},
  author={Kato, Shinpei and McThrow, Michael and Maltzahn, Carlos and Brandt, Scott},
  booktitle={USENIX Annual Technical Conference (ATC)},
  pages={401--412},
  year={2012}
}
@inproceedings{kato2011rgem,
  title={{RGEM}: A responsive GPGPU execution model for runtime engines},
  author={Kato, Shinpei and Lakshmanan, Karthik and Kumar, Aman and Kelkar, Mihir and Ishikawa, Yutaka and Rajkumar, Ragunathan},
  booktitle={IEEE Real-Time Systems Symposium (RTSS)},
  pages={57--66},
  year={2011},
  organization={IEEE}
}
@inproceedings{kato2011timegraph,
  title={{TimeGraph}: GPU scheduling for real-time multi-tasking environments},
  author={Kato, Shinpei and Lakshmanan, Karthik and Rajkumar, Raj and Ishikawa, Yutaka},
  booktitle={USENIX Annual Technical Conference (ATC)},
  pages={17--30},
  year={2011}
}
%\documentclass{article}
\documentclass{sig-alternate}
\usepackage{algpseudocode}
\usepackage{algorithm}
\usepackage{mathtools}
\usepackage{listings}
\usepackage{soul}
\usepackage{color}
%\def\isreport{1}
%\def\isfinal{1}
\ifdefined\isreport
\newcommand{\reportonly}[1]{#1}
\newcommand{\paperonly}[1]{}
\else
\newcommand{\reportonly}[1]{}
\newcommand{\paperonly}[1]{#1}
\fi
\ifdefined\isfinal
\newcommand{\added}[1]{#1}
\newcommand{\deleted}[1]{}
\newcommand{\gl}[1]{}
\newcommand{\hz}[1]{}
\newcommand{\rt}[1]{}
\else
\newcommand{\added}[1]{\textcolor{blue}{#1}}
\newcommand{\deleted}[1]{\st{#1}}
\newcommand{\gl}[1]{\textcolor{red}{[\textbf{Giuseppe :} #1]}}
\newcommand{\hz}[1]{\textcolor{yellow}{[\textbf{Houssam :} #1]}}
\newcommand{\rt}[1]{\textcolor{green}{[\textbf{Reyyan :} #1]}}
\fi
\usepackage{minted}
\usepackage{tikz}
%\title{PRUDA: An API for Time and Space Predictible Programming in NVDIA GPUS using CUDA}
%\author{R. Tekin, H-E. Zahaf, G. Lipari (Order to be defined)}
\begin{document}
%
% --- Author Metadata here ---
\conferenceinfo{}{}
%\CopyrightYear{2007} % Allows default copyright year (20XX) to be over-ridden - IF NEED BE.
%\crdata{0-12345-67-8/90/01} % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.
% --- End of Author Metadata ---
\title{PRUDA: An API for Time and Space Predictable Programming in NVIDIA GPUs using CUDA}
%
% You need the command \numberofauthors to handle the 'placement
% and alignment' of the authors beneath the title.
%
% For aesthetic reasons, we recommend 'three authors at a time'
% i.e. three 'name/affiliation blocks' be placed beneath the title.
%
% NOTE: You are NOT restricted in how many 'rows' of
% "name/affiliations" may appear. We just ask that you restrict
% the number of 'columns' to three.
%
% Because of the available 'opening page real-estate'
% we ask you to refrain from putting more than six authors
% (two rows with three columns) beneath the article title.
% More than six makes the first-page appear very cluttered indeed.
%
% Use the \alignauthor commands to handle the names
% and affiliations for an 'aesthetic maximum' of six authors.
% Add names, affiliations, addresses for
% the seventh etc. author(s) as the argument for the
% \additionalauthors command.
% These 'additional authors' will be output/set for you
% without further effort on your part as the last section in
% the body of your article BEFORE References or any Appendices.
\numberofauthors{1} % in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
%extended
\author{
Reyyan Tekin, Houssam-Eddine ZAHAF, Giuseppe Lipari\\
Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, Lille, France\\
\email{\{firstname.familyname\}@univ-lille.fr}
}
\maketitle
\input{texs/abstract.tex}
\input{texs/intro.tex}
%\input{texs/stat.tex}
\input{texs/model.tex}
\input{texs/strategy.tex}
\input{texs/analysis.tex}
\input{texs/api.tex}
\input{texs/conclu.tex}
\input{texs/annex.tex}
\bibliography{biblio}
\bibliographystyle{plain}
\end{document}
This diff is collapsed.
\begin{abstract}
Recent computing platforms combine CPUs with different types of
accelerators, such as Graphical Processing Units ({\it GPUs}), to cope
with the increasing computational power needed by complex real-time
applications. NVIDIA GPUs are composed of hundreds of computing
elements, called {\it CUDA cores}, to achieve fast computations for
parallel applications. However, GPUs are not designed to support
real-time execution, as their main goal is to achieve maximum
throughput for their resources. Supporting real-time execution on
NVIDIA GPUs involves not only achieving timely predictable
computations but also optimizing the usage of the CUDA cores. In this
work, we present the design and the implementation of {\it PRUDA}
(Predictable Real-time CUDA), a programming platform to manage GPU
resources and therefore decide when and where a real-time task is
executed. PRUDA is written in {\sf C} and provides different
mechanisms to manage task priorities and allocation on the GPU. It
provides tools to help a designer properly implement real-time
schedulers on top of CUDA.
\end{abstract}
\vspace{-1em}
\section{PRUDA API}
All three strategies are integrated into the PRUDA C++ API. For the
single-stream strategy, we have also implemented both the EDF and
fixed-priority algorithms.
First of all, the API (Figure \ref{fig:pruda_show}) requires the user
to implement his/her kernels using CUDA. The first step is then to
initialize the PRUDA scheduler by invoking the function:

{\sf pruda\_init\_sched(method\_t method, policy p);}

\noindent where {\sf method} is either SINGLESTREAM for the first strategy,
MULTIPLESTREAMS for the second, or MULTIPROC for the third. The
parameter {\sf p} is the scheduling policy; the current version
supports EDF and FP.
Once the scheduler has been initialized, kernels are added to the task
queue {\sf tq} by invoking the function:

{\sf pruda\_add\_kernel(p\_kernel\_t kern, int gs, int bs, int p);}

\noindent where {\sf kern} is a pointer to the kernel function, {\sf gs} is
the grid size, {\sf bs} is the block size, and {\sf p} is the task
priority when the fixed-priority policy is selected.
Once all PRUDA kernels have been added, the function {\sf
pruda\_start} is invoked to start all periodic threads. Memory
operations are implicitly achieved by means of CUDA unified memory;
explicit memory copies are under development and will be supported
soon.
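For illustration, the listing below sketches how a user program could
put these calls together; the kernel name and launch parameters are
hypothetical, and only the function and enum names given above are
taken from the API.

\begin{minted}[frame=lines]{c}
__global__ void kernelA(void);   /* user CUDA kernel */

int main(void) {
    /* single-stream strategy with the EDF policy */
    pruda_init_sched(SINGLESTREAM, EDF);
    /* grid size 64, block size 128, priority 1 (unused under EDF) */
    pruda_add_kernel(kernelA, 64, 128, 1);
    /* start all periodic PRUDA threads */
    pruda_start();
    return 0;
}
\end{minted}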
\vspace{-1em}
%\lstinputlisting[caption={Example of API using in a
%sporadic task},language=C++]{api_example.cpp}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../main"
%%% End:
\section{Conclusion}
\label{sec:conclu}
In this paper, we have presented PRUDA, a programming interface to develop real-time schedulers on top of CUDA. PRUDA provides different strategies to control the temporal and spatial behavior of real-time tasks on the GPU. In future work, we plan to provide tools to analyze the real-time behavior of PRUDA tasks. In fact, the GPU does not allow unrestricted preemption and offers only a very limited number of priorities. These limitations have to be taken into account in the analysis of PRUDA task behavior to ensure that timing constraints are respected. We are also planning to develop a tool for tracing PRUDA tasks along with the NVIDIA nvprof profiling tool.
\section{Introduction}
Many real-time applications, such as computer vision and surveillance
systems, demand complex processing on large amounts of data. Classical
multiprocessor platforms combining only CPUs are not able to satisfy
the real-time requirements of such systems, as they require computing
capabilities in the order of teraflops.
Recently, NVIDIA has provided computing platforms combining CPUs with
different types of specialized computing units, such as GPUs, Deep
Learning Accelerators (DLAs), etc., on the same chip. These platforms
can offer suitable solutions to meet deadlines for emerging complex
real-time applications. However, the complexity of the software,
combined with the complexity of the hardware architecture, makes it
difficult to analyse the temporal behavior of such systems. Moreover,
these accelerators are not fundamentally designed to execute real-time
tasks; therefore, they do not provide proper hardware and software
mechanisms to schedule real-time tasks.
Several works \cite{amert2017gpu,elliott2013gpusync,kato2011timegraph}
have attacked the problem of providing real-time support on GPUs from
different perspectives. Kato et al. have proposed platforms (TimeGraph
and RGEM) for non-preemptive scheduling of graphical tasks on the GPU
\cite{kato2011timegraph,kato2011rgem}. The authors of
\cite{amert2017gpu} studied how a GPU makes scheduling decisions by
benchmarking the Jetson TX2 platform. Capodieci et al.
\cite{capodieci2018deadline} modified the proprietary NVIDIA driver to
implement an event-driven scheduler that uses the fine-grain
preemption levels provided by recent GPUs under different policies
such as EDF and fixed priority. GPUSync \cite{elliott2013gpusync} is a
platform able to control scheduling within the GPU using locks. The
work in \cite{capodieci2018deadline} is closed source, whereas the
GPUSync \cite{elliott2013gpusync} platform does not provide tools to
freely implement real-time schedulers. In both works, the GPU is used
as a single-core platform.
\paragraph{Contributions.}
In this work, we develop {\it PRUDA}, a platform that implements
different strategies to control real-time execution within a GPU using
CUDA. PRUDA provides control over priorities, task allocation, and
parallel execution within the same GPU. The different primitives of
PRUDA allow implementing several real-time scheduling policies using
different strategies. The platform is currently under active
development: we are working on implementing a special version of EDF
(GRUB) and fixed-priority scheduling policies with PRUDA.
The remainder of this paper is structured as follows. The GPU
architecture and its {\it known} scheduling mechanisms are detailed in
Section \ref{sec:gpu}. Section \ref{sec:models} presents the task and
architecture models. We reserve Section \ref{sec:strategies} to define
how priorities and allocations are controlled within a GPU using our
platform. In Section \ref{sec:policies}, we overview the
implementation of real-time schedulers using PRUDA. We draw our
conclusions in Section \ref{sec:conclu}.
\section{GPU programming and PRUDA primitives }
\label{sec:gpu}
A GPU is composed of one or more \emph{streaming multiprocessors} (SMs)
and one or more \emph{copy engines} (CEs). Streaming multiprocessors
execute computations (\emph{kernels}), whereas copy engines execute
memory copy operations between different memory spaces. Programming
the GPU requires dividing parallel computations into several grids,
and each grid into several blocks. A block is a set of multiple
threads. A GPU can be programmed using generic platforms such as
OpenCL, or using proprietary APIs. We use CUDA, an NVIDIA proprietary
platform, to have tight control over SMs and CEs from the {\sf C}
programming language using the NVIDIA compiler.
\begin{figure}[t]
\centering
\resizebox{200px}{!}{
\newcommand\squaring[4]{
\draw[rounded corners](0+#1,0+#2)rectangle(3+#1,2+#2+#3);
\node at(#1+1.5,#2+1+#3/2){#4};
}
\begin{tikzpicture}
\squaring{0}{0}{0}{SM0 (128 cores)};
\squaring{3.2}{0}{0}{SM1 (128 cores)};
\squaring{1.8}{-1.5}{-1}{Cpy. Engine};
\squaring{-4.5}{0}{0}{Denver CPUs};
\squaring{-4.5}{-2.2}{0}{A57 CPUs};
\draw[](-0.3,-3) rectangle (6.4,2.2);
\node at(3.5,-2.5){NVIDIA PASCAL GPU };
\draw[](-0.3-5,-3) rectangle (6.4-7,2.2);
\node at(3.5-6.5,-2.5){CPU Islands};
\draw[](-5.8,-4) rectangle (7,-3.2);
\node at (0.6,-3.6){Shared Main Memory };
\end{tikzpicture}
}
\caption{Jetson TX2 Architecture}
\label{fig:jetson}
\end{figure}
When a kernel is invoked by CPU code, commands are submitted to the
GPU. How and when these commands are executed is hidden by the
manufacturer for intellectual-property reasons. The authors of
\cite{amert2017gpu} have tried to reveal some \emph{GPU scheduling
secrets} by benchmarking a Jetson TX2 (abbreviated TX2 in the rest of
this paper). It is composed of 6 ARM-based CPU cores, along with an
integrated NVIDIA Pascal-based GPU as shown in Figure
\ref{fig:jetson}, all running Ubuntu. The GPU in the TX2 is composed
of 256 CUDA cores, divided into two SMs, and one copy engine. The CPUs
and the GPU share the same memory module. From a programming
perspective, one may either allocate two separate memory spaces for
the CPU and the GPU using the {\sf malloc} and {\sf cudaMalloc}
primitives respectively, or use a memory space logically visible to
both the CPU and the GPU, called CUDA unified memory (available even
for discrete GPUs). In the latter case, no memory copies are needed
between CPU and GPU tasks for buffers allocated using the {\sf
cudaMallocManaged} primitive. The current version of PRUDA supports
CUDA unified memory to avoid dealing with memory copy operations, as
will be shown in the PRUDA architecture. An extension to separate
memory spaces is under development and will be available soon.
Typical CUDA programs are organized in the same way: first, memory is
allocated on both the CPU and the GPU; then, data is copied from CPU
memory to GPU memory; next, the GPU kernel is launched; finally, the
results are copied back to the CPU by memory copy operations.
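As an illustration of this structure, the following minimal CUDA
sketch uses unified memory (as PRUDA does), so the explicit copy steps
collapse into allocation and synchronization; the kernel and the
buffer size are illustrative only.

\begin{minted}[frame=lines]{cuda}
#include <cuda_runtime.h>

__global__ void scale(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;                  /* one element per thread */
}

int main(void) {
    const int n = 1 << 20;
    float *buf;
    cudaMallocManaged(&buf, n * sizeof(float)); /* visible to CPU and GPU */
    for (int i = 0; i < n; i++) buf[i] = (float)i;
    scale<<<(n + 255) / 256, 256>>>(buf, n);    /* launch the kernel */
    cudaDeviceSynchronize();                    /* wait for completion */
    cudaFree(buf);
    return 0;
}
\end{minted}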
Regarding kernel execution within the GPU, the authors of
\cite{amert2017gpu} affirm that all threads of a given block are
executed by only one SM; however, different blocks of the same kernel
may be executed on different SMs. In Figure \ref{fig:sched_jetson},
the green kernel is executed on both SM0 and SM1, while the red kernel
is executed only on SM0. The kernel execution order and mechanisms are
driven by the internal closed-source NVIDIA driver (in our case
study). A PRUDA user may obtain the SM where a given block/thread is
executing by using the {\sf pruda\_get\_sm()} function. PRUDA also
allows enforcing the allocation of a given kernel to a specific SM by
using the PRUDA function {\sf pruda\_allocate\_to\_sm(int sm\_id)},
where {\sf sm\_id} is the id of the target streaming multiprocessor.
Implementation details about how these functions work can be found in
the PRUDA description section.
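One possible way to implement such a query on NVIDIA GPUs is to read
the PTX special register {\sf \%smid} from device code; the sketch
below is an assumption about how {\sf pruda\_get\_sm} could work, not
a description of the actual PRUDA internals.

\begin{minted}[frame=lines]{cuda}
/* read the id of the SM executing the calling thread */
__device__ unsigned int pruda_get_sm(void) {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

/* blocks that wake up on the wrong SM return without doing any work */
__global__ void kernelA(int sm_id /*, kernel arguments */) {
    if (pruda_get_sm() != sm_id) return;
    /* ... kernel body ... */
}
\end{minted}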
\begin{figure}[b]
\centering
\resizebox{0.6\columnwidth}{!}{
\begin{tikzpicture}
\draw (0,0)--(5,0);
\draw (0,2)--(5,2);
\draw (0,1)--(5,1);
\draw [black, fill=green](0,0)rectangle (1,0.3);
\node at (-0.5,0.5){\scriptsize SM0};
\node at (-0.5,1.5){\scriptsize SM1};
\filldraw [color=black, fill=green!20](0,0)rectangle (1,0.3);
\filldraw [color=black, fill=green!20](0,1)rectangle (1,1.3);
\filldraw [color=black, fill=green!20](2.8,3)rectangle (3,3.2);
\filldraw [color=black, fill=red!20](2.8,2.7)rectangle (3,2.9);
\node at(3.7,3.1){\tiny Kernel 1};
\node at(3.7,2.8){\tiny Kernel 2};
\filldraw [color=black, fill=red!20](2,0)rectangle (4,0.3);
\end{tikzpicture}
}
\caption{Example of Kernel scheduling in GPU}
\label{fig:sched_jetson}
\end{figure}
To enforce an execution order between different kernels, we use a
specific data structure called a \emph{CUDA stream}. A CUDA stream has
a FIFO behavior: kernels submitted to a stream are executed one after
the other in a {\bf sequential} fashion. Synchronization between two
consecutive kernels is therefore implicitly achieved. This property
will be used later to implement non-preemptive EDF and fixed-priority
real-time scheduling policies.
In CUDA, the user may define several streams, and priorities may be
set between different streams. Therefore, if a stream {\sf A} has a
higher priority than stream {\sf B}, all kernels of {\sf A} are meant
to execute before kernels that are submitted to {\sf B}. If a kernel
in {\sf B} is executing and a kernel is activated on {\sf A}, the GPU
may preempt the kernel of {\sf B} to execute the kernel of {\sf A},
according to the GPU preemption level (we will show this behaviour in
our benchmarks). We highlight that fine-grain preemption capabilities
are available in NVIDIA GPUs starting from the Pascal architecture.
For example, if preemption is set at the block level, preemption is
achieved when all already-executing blocks finish their execution.
Recent Volta GPUs allow even finer preemption levels.
Even if it is possible to create more than two streams, only two
levels of priority are available on the Jetson TX2 platform. These
properties will be used later to approximate EDF and fixed-priority
preemptive scheduling policies.
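The following minimal sketch shows how two such priority streams can
be created with the standard CUDA runtime API; how PRUDA wraps these
streams internally is an assumption of this sketch, only the
two-priority limitation is taken from the discussion above.

\begin{minted}[frame=lines]{cuda}
#include <cuda_runtime.h>

int main(void) {
    int low, high;   /* least and greatest stream priorities on this device */
    cudaDeviceGetStreamPriorityRange(&low, &high);

    cudaStream_t h_sq, l_sq;   /* high- and low-priority streams */
    cudaStreamCreateWithPriority(&h_sq, cudaStreamNonBlocking, high);
    cudaStreamCreateWithPriority(&l_sq, cudaStreamNonBlocking, low);

    /* kernels submitted to h_sq may preempt, at block boundaries,
       kernels already running in l_sq */

    cudaStreamDestroy(h_sq);
    cudaStreamDestroy(l_sq);
    return 0;
}
\end{minted}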
Other PRUDA functions will be detailed later.
\section{System model}
\label{sec:models}
In this paper, we are only interested in GPU programming and
scheduling. While this paper provides real-time support for GPUs, we
do not provide any schedulability analysis yet; the analysis is work
in progress.
We assume that all tasks in the system are programmed using PRUDA;
therefore, only PRUDA tasks compete for the GPU. Each task $\tau_i$ is
characterized by its deadline $\mathsf{D}_i$ and its period
$\mathsf{T}_i$. Tasks are strictly periodic, therefore the exact time
between two successive activations of task $\tau_{i}$ is equal to
$\mathsf{T}_i$. The $j^{th}$ instance of task $\tau_i$ must finish its
execution no later than $\mathsf{T}_i \times j + \mathsf{D}_i$,
otherwise it {\it misses} its deadline. A task may be scheduled using
fixed priority; in that case it is also characterized by a priority
parameter $\mathsf{P}_i$. From the implementation perspective, each
PRUDA task is an instance of a periodic CPU thread, as shown in the
algorithm of Figure \ref{algo:pruda}.
\begin{figure}
\centering
\begin{minted}[frame=lines]{c}
void *pruda_task(void *arg) {
    struct timespec next;
    p_kernel_t *pk = (p_kernel_t *)(arg);
    while (1) {
        /* memory copy operations (handled by unified memory) */
        clock_gettime(CLOCK_REALTIME, &next);
        /* register the GPU job in the proper run-queue */
        pruda_subscribe(pk->kernel, pk->priority);
        /* sleep until the next periodic activation */
        timespec_addto(&next, pk->T);
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &next, NULL);
    }
}
\end{minted}
\caption{Pseudo-code of PRUDA task}
\label{algo:pruda}
\end{figure}
The PRUDA task starts by parsing the kernel parameters, namely the
kernel code, priority, deadline and period. It then starts the
periodic task behavior. The task gets the current time and computes
the next activation time {\sf next}. Then, the GPU job is registered
in the correct GPU run-queue, according to the desired scheduling
policy (see the PRUDA architecture in Figure \ref{fig:pruda_show}).
Once the PRUDA CPU thread has launched the kernel, it sleeps until the
next activation. Another scheduling entity checks the run-queue state
and schedules the highest-priority tasks first, according to (i) one
of the strategies detailed in the next section and (ii) the desired
scheduling policy.
The memory copy operation line performs memory copies. This operation
may need to copy several buffers from CPU to GPU and vice-versa. The
current version of the platform uses CUDA unified memory, therefore
memory coherency is achieved automatically by the NVIDIA driver.
The GPU may be scheduled as a single-core platform, or as a parallel
platform where each streaming multiprocessor is an independent core,
by means of the PRUDA {\sf pruda\_allocate\_to\_sm($\cdots$)}
function. The allocation to a given SM is achieved by testing whether
the task is running on the correct SM: if so, the computation is
carried out; otherwise, the threads on the {\it wrong} SM are killed.
\begin{figure*}[ht]
\centering
\newcommand\unefile[3]{
\foreach \x in {1,1.5,...,#3}
{
\draw (#1+\x,#2) rectangle (#1+0.5+\x,#2+0.5);
}
}
\newcommand\arrowtext[5]{
\draw [->,line width=0.5mm, #5](#1,#2)--(#1+#3,#2);
\node at (#1+#3/2,#2+0.2){#4};
}
\resizebox{0.8\textwidth}{!}{
\begin{tikzpicture}
\unefile{0}{0}{3};
\edef\mya{7}
\foreach \z in {2,1.25,...,-2}
{
\unefile{5}{\z}{4};
\pgfmathparse{int(\mya-1)}
\xdef\mya{\pgfmathresult}
\node at(9.9,\z+0.2){P$_{\mya}$};
}
\draw (5.8,2.75) rectangle (10.5,-3.5);
\node[rotate=90] at (8,-2.5){\Huge ...};
\arrowtext{-2}{0.25}{3}{\sf pruda\_add\_task}{red}
\node at (2.5,-0.25){task queue {\sf (tq)}};
\node at (8.25,-3.25){Active task queue {\sf (rq)}};
\foreach \xs in {2.5,1.75,...,-1.5}
\draw [->,dashed] (3.5,0.25)--(6,\xs-0.25);
\draw [dashed](5,-5.8) circle (1.5);
\draw [fill=white, color=white](3.5,-4.85) rectangle (7,-4);
\draw [->](3.73,-5)--(3.89,-4.77);
\draw [color=red, line width=0.5mm, ->] (5,-.75)--(5,-4.25);
\node at (5.25,-4.5){{\sf pruda\_subscribe}};
\unefile{14}{0.5}{4};
\unefile{14}{-0.5}{4};
\draw (11.5+3,1.5) rectangle (17+3,-1.5);
\node at (17.25,-1.25){{\sf stream queues }};
\node at (19.25,0.75){{\sf h-sq }};
\node at (19.25,-.25){{\sf l-sq }};
\arrowtext{10.5}{0.25}{4}{\sf pruda\_resched}{red}
\def\xx{-3.5}
\def\yy{-4.5}
\unefile{14+\xx}{0.5+\yy}{3};
\unefile{14+\xx}{-0.5+\yy}{3};
\node at (18.25+\xx,0.75+\yy){{\sf SM0-q }};
\node at (18.25+\xx,-.25+\yy){{\sf SM1-q }};
\draw [color=red, line width=0.5mm,->](17-6.5,0)--(11,0)--(11,-4.25)--(11.5,-4.25);
\node[rotate=90] at (11.25,-2){pruda\_alloc};
\draw [color=red, line width=0.5mm,->](14,-4.25)--(14.25,-4.25)--(14.25,0)--(14.5,0);
\node[rotate=90] at (14.,-2){pruda\_resched};
\def\xxx{0.75}
\draw (22+\xxx,3.5) rectangle (30+\xxx,-3.5);
\draw (23+\xxx,2.5) rectangle (25.5+\xxx,-1);
\draw (26+\xxx,2.5) rectangle (28.5+\xxx,-1);
\arrowtext{20}{0.25}{2.75}{\sf GPU internals}{red}
\node at (25,-0.75){SM0};
\node at (28,-0.75){SM1};
\node at (26.5,-1.75){\sf pruda\_check};
\node at (26.5,-2.25){\sf pruda\_abort};
\end{tikzpicture}
}
\caption{PRUDA global overview}
\label{fig:pruda_show}
\end{figure*}
\section{Related work}
The GPU scheduling problem has been addressed in several works from
different perspectives. Kato et al. have proposed platforms (TimeGraph
and RGEM) for non-preemptive scheduling of graphical tasks on the GPU
\cite{kato2011timegraph,kato2011rgem}. Another platform, called
GPUSync, has been proposed by Elliott et al.
\cite{elliott2013gpusync}: it is a set of lock mechanisms for the GPU
engines (compute and copy). GPU engines are seen as mutually-exclusive
resources that can be accessed only by using real-time locking
protocols, and GPU kernels are scheduled with a non-preemptable FIFO
algorithm. All the works described above consider the GPU as a
non-preemptable accelerator. Capodieci et al.
\cite{capodieci2018deadline} modified the proprietary NVIDIA driver to
implement an event-driven scheduler that uses the fine-grain
preemption levels provided by recent GPUs under different policies
such as EDF and fixed priority.
\section{Temporal and spatial control of PRUDA tasks on the GPU}
\label{sec:strategies}
Our platform integrates several strategies to implement scheduling
decisions. These strategies have different performance characteristics
and overheads.
\subsection{Single-stream strategy}
The first strategy, called {\it single-stream}, uses one CUDA stream
to enforce kernel scheduling decisions. The scheduler uses three
queues: a task queue ({\sf tq}), which contains the list of all PRUDA
tasks; an active-kernel queue ({\sf rq}), which contains the active
PRUDA jobs; and the stream queue ({\sf sq}), which contains the
kernels that will be submitted to the GPU. When a kernel is activated,
it is added to the {\it correct} active-kernel queue {\sf rq} via the
{\sf pruda\_subscribe}($\cdots$) function. Further, if the CUDA stream
queue {\sf sq} is empty, the highest-priority job according to the
given scheduling policy is moved from {\sf rq} to {\sf sq} by the {\sf
pruda\_resched} function.
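A minimal sketch of this flow is given below; the signatures are
simplified and the queue helpers ({\sf rq\_insert}, {\sf
rq\_pop\_highest}, {\sf sq\_is\_empty}, {\sf launch\_on\_stream}) are
illustrative names, not part of the actual PRUDA implementation.

\begin{minted}[frame=lines]{c}
/* called when a PRUDA job is activated */
void pruda_subscribe(p_kernel_t *kern, int prio) {
    rq_insert(kern, prio);    /* add the job to the ready queue rq */
    pruda_resched();          /* try to push work to the CUDA stream */
}

/* move the highest-priority ready job to the stream queue sq */
void pruda_resched(void) {
    if (sq_is_empty()) {                    /* the single stream is idle */
        p_kernel_t *pk = rq_pop_highest();  /* EDF or FP ordering */
        if (pk)
            launch_on_stream(pk);           /* submit the kernel to sq */
    }
}
\end{minted}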
As only one CUDA stream is used, once a PRUDA task is executing, it
cannot be preempted by another higher-priority task; therefore, only
non-preemptive scheduling algorithms can be implemented using this
strategy. However, we would like to highlight that the PRUDA user may
abort the kernel currently under execution by calling the {\sf
pruda\_abort()} function.
This strategy is simple and easy to implement. It provides an implicit
synchronization between active tasks, i.e., if task {\sf B} is in the
stream queue while {\sf A} is running, {\sf B} will wait until {\sf A}
finishes its execution before starting, without overlapping. However,
the use of this strategy involves reserving all the GPU resources
(both SMs) for a single PRUDA task at a time, even if this task does
not use all the GPU cores; therefore, resources are wasted. In the
next strategies, we will show how to overcome these limitations.
\subsection{Multiple streams: enabling preemption}
In the second strategy, called \emph{multiple streams}, PRUDA creates
multiple streams to take scheduling decisions, allowing concurrent
kernel execution on the GPU and preemption.
First, we recall that the TX2 allows only two priority
levels. Therefore, we create only two streams: one with high priority
and the other with low priority. The queue of the high-priority stream
is denoted by {\sf h-sq}; the second stream queue is denoted by {\sf
l-sq}. We recall that using several streams allows asynchronous and
concurrent execution between the two streams, while within the same
stream the execution is always FIFO.
When a task is activated, it is added to the correct ready-task queue
{\sf rq}. Further, the scheduler checks one of the following
situations (a sketch of this decision logic follows the list):
\begin{enumerate}
\item {\sf h-sq}~$= \emptyset \wedge $~{\sf l-sq} $= \emptyset $ : the
scheduler will allocate the task to the {\sf l-sq} queue, therefore
the task will be submitted {\it immediately} to the GPU.
\item {\sf h-sq}~$= \emptyset \wedge $~{\sf l-sq} $\neq \emptyset $ :
  the scheduler checks whether the activated task has a higher priority
  than the task in {\sf l-sq}. If so, the task is inserted into the
  high-priority queue {\sf h-sq}, and it therefore preempts the task in
  {\sf l-sq} if possible. Otherwise, no scheduling decision is taken.
\end{enumerate}
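The sketch below illustrates these two cases; the helper names ({\sf
queue\_empty}, {\sf head\_priority}, {\sf insert}) are assumptions
used for illustration only.

\begin{minted}[frame=lines]{c}
void pruda_resched_multistream(p_kernel_t *new_task) {
    if (queue_empty(h_sq) && queue_empty(l_sq)) {
        /* case 1: GPU idle, submit immediately via the low-priority stream */
        insert(l_sq, new_task);
    } else if (queue_empty(h_sq) && !queue_empty(l_sq)) {
        /* case 2: preempt the running task only if the newly activated
           task has a higher priority */
        if (new_task->priority > head_priority(l_sq))
            insert(h_sq, new_task);
        /* otherwise, no scheduling decision is taken */
    }
}
\end{minted}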
According to the scheduling mechanism described above, only one
preemption is allowed while a task is already in execution. For
example, if a task {\sf C} arrives after {\sf B} has preempted {\sf
A}, task {\sf C} must wait until {\sf B} finishes, even if it is the
highest-priority active job. We are currently developing a
schedulability analysis for such a limited-preemption system. We also
highlight that preempted tasks will continue to use GPU resources if
the high-priority task does not use {\it all} of the GPU resources.
Even if this strategy overcomes the preemption limitation of the
previous one, it is more complex, and it still uses the GPU as a
single core. In the next section, we use each SM of the GPU as a
single processor, allowing parallel execution within the GPU. We also
highlight that preemption at the instruction level cannot be
guaranteed, as it is decided by the closed NVIDIA internals. However,
we ensure that preemption can be achieved at block boundaries;
therefore, the worst-case preemption cost is in the order of one block
execution.
\subsection{SMs-as-cores strategy}
The third strategy uses the GPU in a similar way to the previous one:
two streams are created, with the same queue configuration. However,
we allow tasks to call the function {\sf
pruda\_allocate\_to\_sm}($\cdots$), thus using the GPU as a
multiprocessor rather than as a single core. We consider two types of
PRUDA tasks: those that are allocated to a given SM, and those that
are not (we consider the PRUDA tasks that do not call the allocation
function as tasks requiring the GPU exclusively).
In addition to the scheduling structures described for the previous
strategy, this strategy uses one queue per SM: {\sf sm0-q} and {\sf
sm1-q}. When a task is activated, if it uses both SMs, no other task
will be scheduled at the same time; therefore, it is added to {\sf
l-sq} or {\sf h-sq} exactly as in the previous strategy. Otherwise, it
uses a single SM and it is assigned to the correct SM queue. Later,
the two jobs having the highest priority in {\sf sm0-q} and {\sf
sm1-q} are scheduled first by being inserted into {\sf l-sq} and {\sf
h-sq}. This allows parallel execution on both streaming
multiprocessors. This strategy allows using the GPU of the TX2 as a
2-core platform.
In fact, the allocation function tests whether a given block/thread is
on the correct SM: if so, it continues its execution; otherwise, it
exits. Therefore, the user either has to take this into account when
using the block and thread indexes, or he/she must use the new
functions we provide to calculate indexes. The thread and block
indexing mechanism we provide is simple but effective. The user is
free to use the CUDA indexes or our platform indexes, as long as there
is no conflict (a possible indexing scheme is sketched below). We
highlight that both of the previous strategies do not require any
modification of the kernel code nor of the programming style
(indexing). Although this method is more complex to implement than the
two previous ones, it provides both temporal and spatial control of
task execution on the GPU. Analyzing the behavior of this final
strategy is a challenging theoretical question that is left for future
work.
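As an illustration of such an indexing scheme, the sketch below
assigns a contiguous ``logical'' block index to the blocks that
survive on the chosen SM, using a global atomic counter; the function
name and the mechanism are assumptions, not PRUDA's actual
implementation.

\begin{minted}[frame=lines]{cuda}
/* must be reset (e.g., with cudaMemset) before each kernel launch */
__device__ unsigned int pruda_block_counter = 0;

/* give each surviving block a contiguous logical index */
__device__ unsigned int pruda_logical_block_id(void) {
    __shared__ unsigned int lid;
    if (threadIdx.x == 0)
        lid = atomicAdd(&pruda_block_counter, 1);  /* one ticket per block */
    __syncthreads();
    return lid;
}
\end{minted}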
\section{Real-time policies using PRUDA}
\label{sec:policies}
Implementing real-time schedulers using PRUDA is simple. In fact, it
requires implementing the {\sf pruda\_subscribe} function and the {\sf
pruda\_resched} function. The goal of the first is to put the active
task in the correct queue according to its priority. If the scheduling
algorithm is fixed priority, it puts the task directly in the
corresponding priority queue. If the algorithm is EDF, it computes the
priority and then inserts the task into the correct queue. The goal of
the second function is to select which active task to run and into
which CUDA stream queue it should be inserted, and therefore submitted
to the GPU. The user is also able to call {\sf pruda\_abort} to exit
the execution of a given kernel, for example to mix real-time and
non-real-time tasks. The PRUDA design described in this and the
previous section is summarized in Figure \ref{fig:pruda_show}. We
highlight that PRUDA functions (except subscribe and resched) can be
used even for non-PRUDA tasks.
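For example, an EDF flavour of {\sf pruda\_subscribe} could use the
absolute deadline as the dynamic priority that orders the ready queue;
the helper names ({\sf now\_ns}, {\sf rq\_insert\_ordered}), the field
name {\sf D}, and the nanosecond representation are assumptions of
this sketch.

\begin{minted}[frame=lines]{c}
void pruda_subscribe_edf(p_kernel_t *pk) {
    /* absolute deadline = activation time + relative deadline D */
    unsigned long long abs_deadline = now_ns() + pk->D;
    /* earliest absolute deadline = highest priority */
    rq_insert_ordered(pk, abs_deadline);
}
\end{minted}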