PRUDA: A real-time programming interface on top of CUDA

PRUDA is a set of programming tools and mechanisms to control scheduling within the GPU. It also provides implementations of the following real-time scheduling policies:

• Fixed Priority (FP): preemptive and non-preemptive, single core
• Earliest Deadline First (EDF): preemptive and non-preemptive
• Gang scheduling: preemptive and non-preemptive GANG, using the GPU as a multiprocessor architecture.

Additionally, PRUDA aims not to modify the CUDA programming style, so the PRUDA user can reuse already developed CUDA kernels. To keep the user free from kernel-signature constraints, PRUDA must be compiled together with the user's kernel source code. Nevertheless, PRUDA also provides dynamic configuration possibilities for a fixed kernel signature (int, int *, int *, int *, int *).

PRUDA can implicitly handle memory copy operations, as well as CUDA unified memory.

    Prerequisites

    • C++ compiler
    • Nvidia NVCC compiler

    PRUDA and GPU handling:

A GPU is composed of one or several streaming multiprocessors (SMs) and one or several copy engines (CEs). Streaming multiprocessors carry out computations (kernels), whereas copy engines execute memory copy operations between different memory spaces. Depending on the scheduling policy, each SM can be considered as a processor, or all SMs together as a single processor.
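For reference, the number of SMs and copy engines of a device can be queried with the standard CUDA runtime API (plain CUDA, not a PRUDA primitive):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("SMs: %d, copy (async) engines: %d\n",
               prop.multiProcessorCount, prop.asyncEngineCount);
        return 0;
    }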

To get the SM on which a given kernel executes, the user can invoke `pruda_get_sm()`. PRUDA also allows enforcing the allocation of a given kernel to a specific SM by using `pruda_allocate_to_sm(int sm_id)`, where `sm_id` is the id of the target streaming multiprocessor.
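As a minimal usage sketch (assuming both primitives are host-side calls, which the text above suggests but does not state explicitly):

    // hypothetical usage of the SM-affinity primitives
    pruda_allocate_to_sm(0);   // pin the kernel to SM 0
    int sm = pruda_get_sm();   // query the SM the kernel is allocated to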

Implementation details about these primitives can be found in the PRUDA description section.

To enforce an execution order between different kernels, we use a specific data structure called a CUDA stream. A CUDA stream has a FIFO behavior: kernels submitted to the same stream are executed one after the other, in a **sequential** fashion, so synchronization between two consecutive kernels is achieved implicitly. This property will be used later to implement the non-preemptive EDF and fixed-priority real-time scheduling policies.

In CUDA, the user may define several streams, and priorities may be set between streams. If a stream `A` has a higher priority than a stream `B`, all kernels of `A` are meant to execute before the kernels submitted to `B`. If a kernel of `B` is executing while a kernel is activated on `A`, the GPU might preempt the kernel of `B` to execute the kernel of `A`, according to our benchmarking, depending on the GPU preemption level. We highlight that fine-grain preemption capabilities are available in NVIDIA GPUs starting from the Pascal architecture. For example, if preemption is set at the block level, preemption takes effect once all currently executing blocks finish their execution. Recent Volta GPUs allow even finer preemption levels. Even if it is possible to create more than two streams, only two priority levels are available on the Jetson TX2 platform. These properties will be used later to implement the preemptive EDF and fixed-priority scheduling policies.
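The underlying CUDA mechanism can be illustrated with the standard stream-priority API (plain CUDA; the helper function is ours for illustration):

    #include <cuda_runtime.h>

    // Create one low-priority and one high-priority stream.
    // In CUDA, a numerically lower value means a higher priority.
    void make_priority_streams(cudaStream_t &low, cudaStream_t &high) {
        int least, greatest;
        cudaDeviceGetStreamPriorityRange(&least, &greatest);
        cudaStreamCreateWithPriority(&low,  cudaStreamNonBlocking, least);
        cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, greatest);
        // Kernels launched into 'high', e.g. k<<<blocks, threads, 0, high>>>(...),
        // may preempt kernels running in 'low', as described above.
    }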

    Other PRUDA primitives will be detailed later.

    CUDA operations:

PRUDA allows a kernel to execute within a single SM.

    PRUDA usage by example

In this section, we show how PRUDA can be used by running two periodic real-time tasks on the GPU: the first computes an array sum and the other an array multiplication.

To use PRUDA, you must first include the following header files:

    #include "../inc/user.h"
    #include "../inc/tools.h"

The user then defines kernels in the classical way, as follows:

    __global__ void add( int *a, int *b, int *c )  ... ; 
    __global__ void mul( int *a, int *b, int *c, int h ) ... ;
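For concreteness, minimal bodies for these kernels could look as follows (a sketch; N is assumed to match the total number of launched threads, and the use of the scalar h is an assumption):

    #define N 10  // matches the 2 blocks x 5 threads used below

    __global__ void add( int *a, int *b, int *c ) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < N) c[i] = a[i] + b[i];
    }

    __global__ void mul( int *a, int *b, int *c, int h ) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) c[i] = a[i] * b[i] * h;  // h as a scalar factor (assumption)
    }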

Then the user writes the main function, first allocating memory on both the CPU and the GPU by means of malloc, cudaMalloc, or cudaMallocManaged, as follows:

int *a, *b, *c;
int *dev_a, *dev_b, *dev_c, ac;
    
    // allocating CPU memory
    a = (int*)malloc( N * sizeof(int) );
    b = (int*)malloc( N * sizeof(int) );
    c = (int*)malloc( N * sizeof(int) );
    
    // allocating GPU memory
    cudaMalloc( (void**)&dev_a, N * sizeof(int) );
    cudaMalloc( (void**)&dev_b, N * sizeof(int) );
    cudaMalloc( (void**)&dev_c, N * sizeof(int) );
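Alternatively, as mentioned above, CUDA unified memory can be used so that a single allocation is visible from both the CPU and the GPU:

    // allocating unified (managed) memory instead of separate CPU/GPU buffers
    int *u_a, *u_b, *u_c;
    cudaMallocManaged( (void**)&u_a, N * sizeof(int) );
    cudaMallocManaged( (void**)&u_b, N * sizeof(int) );
    cudaMallocManaged( (void**)&u_c, N * sizeof(int) );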
    

The user must now build the kernel list. A predefined list is available in the file user.cu. The user calls get_listing() and then initializes each kernel with the kernel code, the number of blocks, the number of threads per block, and finally the kernel parameters, as follows:

// initializing the list of kernels
init_kernel_listing();
create_kernel(std::get<1>(get_listing()),add,2,5,dev_a,dev_b,dev_c);
create_kernel(std::get<0>(get_listing()),mul,2,5,dev_a,dev_b,dev_c,ac);

Now the whole GPU part has been initialized, and we configure the scheduler itself. We start by initializing the task scheduling parameters (deadline, period, and priority) and encapsulating them in a PRUDA task, as in the following example:

// GPU scheduling parameters for kernel add
struct gpu_sched_param add_p;
add_p.period_us   = 3000000;
add_p.deadline_us = 3000000;
add_p.priority    = 20;

// GPU scheduling parameters for kernel mul
struct gpu_sched_param mul_p;
mul_p.period_us   = 6000000;
mul_p.deadline_us = 6000000;
mul_p.priority    = 15;

// declaring the GPU tasks (a container of params)
struct pruda_task_t * p_task_b = create_pruda_task(1, add_p);
struct pruda_task_t * p_task_a = create_pruda_task(0, mul_p);

We highlight that in create_pruda_task, the first parameter is the task id and must correspond to the kernel's position in the list returned by get_listing(); otherwise, compilation will fail due to type-inference problems.

Finally, the user must initialize the scheduler and add the PRUDA tasks as follows:

// initializing the scheduler with the desired strategy and scheduling policy
    init_scheduler(SINGLE, FP);
    
    // adding pruda tasks to the scheduling unit
    add_pruda_task(p_task_a);
    add_pruda_task(p_task_b); 
    
    
// launch the tasks on the GPU and the corresponding memory copies
// (locks and synchronization params are under development)
    create_cpu_threads();

The user must also define the task parameters so they can be initialized at compile time in user.h and user.cu, as follows:

    // user.h 
    
    std::tuple<struct kernel_t<int *,int*,int *, int> * , struct kernel_t<int *,int*,int *> * >   get_listing();

where the return type of get_listing() is a tuple holding pointers to the kernels.

// user.cu
// the list of the kernels must be given here
struct kernel_t<int *,int*,int *, int> m_1;
struct kernel_t<int *,int*,int *> m_2;

// the tuple returned by get_listing() (definition completed for clarity)
std::tuple<struct kernel_t<int *,int*,int *, int> *, struct kernel_t<int *,int*,int *> * > tasks;

std::tuple<struct kernel_t<int *,int*,int *, int> *, struct kernel_t<int *,int*,int *> * > get_listing(){
  return tasks;
}

void init_kernel_listing(){
  // user must add its kernels here.
  tasks = std::make_tuple(&m_1,&m_2);
}

Once everything is in place, the user calls make to compile PRUDA along with their own code.

    PRUDA scheduling tools and policies

Single-core strategy for non-preemptive schedulers

The first strategy, called *single-stream*, uses one CUDA stream to enforce kernel scheduling decisions. The scheduler uses three queues: the task queue (`tq`), which contains all PRUDA tasks; the active-kernels queue (`rq`), which contains the active PRUDA jobs; and the stream queue (`sq`), which contains the kernels that will be submitted to the GPU. When a kernel is activated, it is added to the *correct* active-kernels queue `rq` via the `pruda_subscribe(...)` primitive. Then, if the CUDA stream queue `sq` is empty, the highest-priority job according to the given scheduling policy is moved from `rq` to `sq` using the `pruda_resched` primitive.
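A minimal sketch of this logic is given below; the queue objects and helper names are assumptions for illustration, not the actual PRUDA internals:

    // hypothetical sketch of the single-stream rescheduling step
    void pruda_resched() {
        // submit only when the single CUDA stream queue is empty:
        // a kernel already submitted to the stream cannot be preempted
        if (sq.empty() && !rq.empty()) {
            pruda_task_t *job = rq.pop_highest_priority();  // per FP or EDF
            sq.push(job);         // enqueue into the single CUDA stream
            launch_on_gpu(job);   // kernels in one stream execute in FIFO order
        }
    }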

As only one CUDA stream is used, once a PRUDA task is executing it cannot be preempted by a higher-priority task; therefore only non-preemptive scheduling algorithms can be implemented with this strategy. However, we highlight that the PRUDA user can abort the kernel currently under execution by calling the `pruda_abort()` primitive.

This strategy is simple and easy to implement. It provides implicit synchronization between active tasks: if task `B` is in the stream queue while `A` is running, `B` waits until `A` finishes its execution before starting, without overlapping. However, this strategy reserves all GPU resources (both SMs) for a single PRUDA task at a time, even if that task is light and does not use all GPU cores, so resources may be wasted. The next strategies show how to overcome these limitations.

Single-core strategy for preemptive schedulers

In the second strategy, called *multiple-streams*, PRUDA creates multiple streams to make scheduling decisions, allowing concurrent kernel execution and preemption on the GPU.

First, we recall that the TX2 allows only two priority levels. Therefore, we create only two streams: one with high priority and the other with low priority. The queue of the high-priority stream is denoted `h-sq`; the queue of the low-priority stream is denoted `l-sq`. We recall that using several streams allows asynchronous and concurrent execution between the two streams, whereas within the same stream the execution is always FIFO.

When a task is activated, it is added to the correct ready-task queue `rq`. Then the scheduler checks for one of the following situations (a decision sketch follows the list):

1. `h-sq` = ∅ and `l-sq` = ∅: the scheduler allocates the task to the `l-sq` queue, so the task is submitted *immediately* to the GPU.

2. `h-sq` = ∅ and `l-sq` ≠ ∅: the scheduler checks whether the activated task has a higher priority than the task in `l-sq`. If so, the task is inserted into the high-priority queue `h-sq`, thereby preempting the task in `l-sq` if possible. Otherwise, no scheduling decision is taken.
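The sketch below illustrates this decision logic (again with hypothetical queue and helper names):

    // hypothetical sketch of the two-stream scheduling decision
    void on_task_activation(pruda_task_t *t) {
        rq.insert(t);                                      // ready-task queue
        if (h_sq.empty() && l_sq.empty()) {
            l_sq.push(t);                                  // case 1: submit immediately
        } else if (h_sq.empty() && has_higher_priority(t, l_sq.current())) {
            h_sq.push(t);                                  // case 2: preempt l-sq
        }
        // otherwise: no scheduling decision is taken
    }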

According to the scheduling mechanism described above, only one preemption is allowed while a task is already in execution. For example, if a task `C` arrives after `B` has preempted `A`, task `C` must wait until `B` finishes, even if `C` is the highest-priority active job. We are currently developing a schedulability analysis for such a limited-preemption priority system. We also highlight that a preempted task will continue to use GPU resources if the high-priority task does not use *all* of the GPU resources.

Even if this strategy overcomes the preemption limitations of the previous one, it is more complex, and it still uses the GPU as a single core. In the next section, we use each SM of the GPU as a separate processor, allowing parallel execution within the GPU.

    Multicore strategy for GANG preemptive schedulers

The third strategy uses the GPU in a similar way to the previous one: two streams are created with the same queue configuration. However, we allow tasks to call the primitive `pruda_allocate_to_sm(...)`, thus using the GPU as a multiprocessor rather than as a single core. We consider two types of PRUDA tasks: those allocated to a given SM and those that are not (PRUDA tasks that do not call the allocation primitive are considered to require the GPU exclusively).

In addition to the scheduling structures described for the previous strategy, this strategy uses one queue per SM: `sm0-q` and `sm1-q`. When a task is activated, if it uses both SMs, no other task will be scheduled at the same time, so it is added to `l-sq` or `h-sq` exactly as in the previous strategy. Otherwise, it uses a single SM and is assigned to the corresponding SM queue. Then the two jobs with the highest priority in `sm0-q` and `sm1-q` are scheduled first, by being inserted into `l-sq` and `h-sq`. This allows parallel execution on both streaming multiprocessors, i.e., using the GPU of the TX2 as a 2-core platform.

The allocation primitive in fact tests whether a given block/thread is running on the correct SM: if so, it continues its execution; otherwise, it exits. Therefore, the user must either take this into account when using the block and thread indexes, or use the new primitives we provide to compute indexes. The thread- and block-indexing mechanism we provide is simple but effective: the user is free to use the CUDA indexes, but **carefully**, or to use our platform indexes. We highlight that neither of the two previous strategies requires any modification of the kernel code or of the programming style (indexing). Although this method is more complex to implement than the two previous ones, it provides control over both the temporal and spatial execution of tasks on the GPU. Analyzing the behavior of this final strategy is a challenging theoretical question that we leave for future work.
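One plausible device-side implementation of this test reads the SM id from the PTX special register %smid, a common technique on NVIDIA GPUs; whether PRUDA does exactly this is an assumption:

    // return the id of the SM the calling thread runs on (PTX special register)
    __device__ unsigned int get_smid() {
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));
        return smid;
    }

    // inside a kernel: blocks landing on the wrong SM exit immediately
    __global__ void sm_restricted_kernel(int target_sm /* , ... */) {
        if (get_smid() != (unsigned int)target_sm) return;  // wrong SM: give up
        // ... actual work, using carefully adjusted block/thread indexes ...
    }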

Real-time policies using PRUDA

Implementing real-time schedulers using PRUDA is simple. In fact, it requires implementing the `pruda_subscribe` primitive and the `pruda_resched` primitive. The goal of the first is to put the activated task in the correct queue according to its priority: under fixed-priority scheduling, the task goes directly into the corresponding priority queue; under EDF, the priority must first be computed before the task is inserted into the correct queue. The goal of the second primitive is to select which active task to run and into which CUDA stream queue it should be inserted, so as to be submitted to the GPU. The user can also call `pruda_abort` to exit the execution of a given kernel, e.g., to mix real-time with non-real-time tasks if desired. The PRUDA design described in this and the previous section is illustrated in the PRUDA overview figure. We highlight that PRUDA primitives (except subscribe and resched) can be used even for non-PRUDA tasks.
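For example, an EDF version of pruda_subscribe only needs to derive the job's priority from its absolute deadline before queueing it (hypothetical names again; the deadline field mirrors the gpu_sched_param shown earlier):

    // hypothetical EDF subscribe: earlier absolute deadline = higher priority
    void pruda_subscribe_edf(pruda_task_t *t, uint64_t now_us) {
        t->abs_deadline_us = now_us + t->params.deadline_us;
        rq.insert_sorted_by_deadline(t);  // head of rq = earliest deadline
        pruda_resched();                  // may submit or preempt via the streams
    }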

    OPTIONS

PRUDA manages memory copies between the CPU and the GPU, as well as kernel pools, to make scheduling decisions.