Analysis Methods
We consider two methods for analyzing mixed task sets of CPU-only and GPU-using tasks on a multiprocessor system with a single GPU: the Shared Resource Method and the Container Method. Fundamental differences between these methods stem from how GPU execution time is modeled and how potential graphics hardware driver behaviors are managed.
Shared Resource Method
It is natural to view a GPU as a computational resource shared by the CPUs of a multiprocessor system. This is the approach taken by the Shared Resource Method (SRM), which treats the GPU as a globally-shared resource protected by a real-time semaphore. The execution of a GPU-using job goes through several phases. In the first phase, the job executes purely on the CPU. In the next phase, the job sends data to the GPU for use by the GPU kernel. Next, the job suspends from the
CPU when the kernel is invoked on a GPU. The GPU executes the kernel using many parallel threads, but kernel execution does not complete until after the last GPU-thread has completed. Finally, the job resumes execution on the CPU and receives kernel execution results when signaled by the GPU. Optionally, the job may continue executing on the CPU without using the GPU. Thus, a GPU-using job has five execution phases as depicted in Fig. 1.
We can remove the GPU driver from resource-arbitration decisions and create a suitable model for real-time analysis through the use of a real-time semaphore protocol. Contention for a GPU may occur when a job attempts to communicate with it. We resolve this contention with a synchronization point between the first and second phases to provide mutual exclusion through the end of the fourth phase; this interval is called a critical section and is denoted csi for each task Ti. This approach ensures that the driver only services one job at a time, which eliminates the need to know how the driver (which, again, is closed-source) manages concurrent GPU requests.
We may consider several real-time multiprocessor locking protocols to protect the GPU critical section. Such a protocol should have several properties. First, it must allow blocked jobs to suspend since critical-section lengths may be very long. A spin lock would consume far too much CPU time. Second, the protocol must support
priority inheritance so that blocking times can be bounded. Finally, the protocol need not support critical-section nesting or deadlock prevention, since GPU-using tasks only access one GPU. Both the “long” variant of the Flexible Multiprocessor Locking Protocol (FMLP-Long) [13] and the more recent global O(m) Locking Protocol (OMLP) fit these requirements. Neither protocol is strictly better than the other for all task sets, since priority-inversion-based blocking (per lock access), denoted bi, is O(n) under FMLP-Long and O(m) under the OMLP, where n is the number of tasks and m is the number of CPUs. Thus, we allow the SRM to use whichever protocol yields a schedulable task set.
The FMLP-Long uses a single FIFO job queue for each semaphore, and GPU requests are serviced in first-come, first-served order. The job at the head of the FIFO queue is the lock holder. A job, Ji, of task Ti ∈ G(T) may be blocked by one job from each of the remaining GPU-using tasks. Formally,

    bi = Σ_{Tk ∈ G(T) \ {Ti}} csk.    (1)
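As a concrete sketch of this FIFO-based blocking bound (Eq. (1)) — reducing each GPU-using task to its critical-section length, an illustrative simplification — the bound is just the sum of the other tasks' critical sections:

```python
def fmlp_long_blocking(cs, i):
    """Per-access blocking bound bi under FMLP-Long: a job of task i
    waits behind at most one critical section of each other GPU-using
    task in the FIFO queue, so bi sums the other tasks' cs lengths."""
    return sum(cs_k for k, cs_k in enumerate(cs) if k != i)

# Five GPU-using tasks, each with csi = 4 (the task set of the later example):
print(fmlp_long_blocking([4, 4, 4, 4, 4], i=0))  # prints 16
```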
The global OMLP uses two job queues for each semaphore: FQ, a FIFO queue of length at most m, and PQ, a priority queue (ordered by job priority). The lock holder is at the head of FQ. Blocked jobs enqueue on FQ if FQ is not full, and on PQ otherwise. Jobs are dequeued from PQ onto FQ as jobs leave FQ. Any job acquiring an OMLP lock may be blocked by at most 2(m−1) lower-priority jobs. Let A be the set of jobs generated by any Tk ∈ G(T) \ {Ti} that may contend with Ji for the GPU. Let Amax be the 2(m−1) jobs in A with the longest critical sections. The blocking time for task Ti ∈ G(T) is given by

    bi = Σ_{Jk ∈ Amax} csk,    (2)

where csk denotes the critical-section length of the task that generated Jk.
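A sketch of the OMLP bound follows, under the simplifying assumption (for illustration only) that each competing task contributes a single contending job to A:

```python
def omlp_blocking(cs, i, m):
    """Per-access blocking bound bi under the global OMLP: at most
    2(m-1) competing jobs can block Ji, and the worst case takes the
    jobs with the longest critical sections.  Simplifying assumption:
    one contending job per competing GPU-using task."""
    others = sorted((cs_k for k, cs_k in enumerate(cs) if k != i),
                    reverse=True)           # longest critical sections first
    return sum(others[:2 * (m - 1)])        # the 2(m-1) worst contenders

# With m = 4 CPUs and five GPU-using tasks (csi = 4), 2(m-1) = 6 exceeds
# the four competitors, so the bound coincides with the FMLP-Long bound:
print(omlp_blocking([4, 4, 4, 4, 4], i=0, m=4))  # prints 16
```

This also shows why neither protocol dominates: with many tasks (large n) and few CPUs, truncating to 2(m−1) contenders can make the OMLP bound smaller than the FIFO sum over all n−1 competitors.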
Soft schedulability under the SRM is determined by the following two conditions. First,

    ei + si + bi ≤ pi    (3)

is required for every task Ti by the tardiness analysis for G-EDF [16]. Second, the condition

    Σ_{Ti ∈ T} (ei + si + bi)/pi ≤ m    (4)

must hold. This is the soft G-EDF schedulability condition required to ensure bounded tardiness. Like all suspension-oblivious tests, we must analytically treat suspensions due to both blocking and GPU execution as execution on the CPU.
Example. Consider a mixed task set with two CPU-only tasks with task parameters (pi = 30, ei = 5, si = 0) and five GPU-using tasks with parameters (pi = 30, ei = 3, si = 2, csi = 4) to be scheduled on a four-CPU system with a single GPU. The CPU-only tasks trivially satisfy Ineq. (3). The FMLP-Long is best suited for this task set, and the blocking term for every GPU-using task is Σcsk = 16 as computed by Eq. (1). Tasks in G(T) satisfy Ineq. (3) since 3 + 2 + 16 = 21 ≤ 30. Ineq. (4) also holds since U = 2(5/30) + 5((3 + 2 + 16)/30) ≈ 3.83 ≤ 4. Therefore, the task set is schedulable under the SRM.
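The arithmetic in this example can be checked mechanically. The sketch below (the task-tuple encoding and function name are our own, not from the analysis) applies the suspension-oblivious test with execution inflated by si and bi:

```python
def srm_soft_schedulable(tasks, m):
    """Soft schedulability under the SRM.  Each task is (pi, ei, si, bi);
    suspension-oblivious analysis charges suspension (si) and blocking
    (bi) as CPU execution.  Checks the per-task condition and the total
    utilization bound against m CPUs."""
    if any(e + s + b > p for (p, e, s, b) in tasks):
        return False                              # per-task condition violated
    util = sum((e + s + b) / p for (p, e, s, b) in tasks)
    return util <= m                              # utilization bound

# Two CPU-only tasks and five GPU-using tasks with blocking term bi = 16:
tasks = [(30, 5, 0, 0)] * 2 + [(30, 3, 2, 16)] * 5
print(srm_soft_schedulable(tasks, m=4))  # prints True (U ~ 3.83 <= 4)
```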
A schedule for this task set is depicted in Fig. 2. T1 and T2 are the CPU-only tasks. Observe that the final job completes at time t = 21, well before its deadline. The computed system utilization of approximately 3.83 is quite close to the upper bound of 4.0 used in Ineq. (4), which suggests a heavily-utilized system. However, the schedule in Fig. 2 shows that the suspension-oblivious analysis is quite pessimistic (mostly due to blocking-term accounting), given that the system is idle for much of the time. In fact, only one CPU is utilized after t = 5. The performance of the GPU must overcome these suspension-oblivious penalties
if it is to be a worthwhile addition to a real-time multiprocessor system.