A Performance Model For GPU Architectures: Analysis And Design Of Fundamental Algorithms

dc.contributor.author Karsin, Ben
dc.contributor.department Computer Science
dc.date.accessioned 2019-05-28T19:45:22Z
dc.date.available 2019-05-28T19:45:22Z
dc.date.issued 2018-05
dc.identifier.uri http://hdl.handle.net/10125/62290
dc.subject parallel
dc.subject algorithms
dc.subject many-core
dc.subject GPU
dc.subject sorting
dc.subject model
dc.title A Performance Model For GPU Architectures: Analysis And Design Of Fundamental Algorithms
dc.type Thesis
dcterms.abstract Over the past decade, "many-core" architectures have become a crucial resource for solving computationally challenging problems. These systems rely on hundreds or thousands of simple compute cores to achieve high computational throughput. However, their divergence from traditional CPUs makes existing algorithms and models inapplicable. Thus, developers rely on a set of heuristics and hardware-dependent rules of thumb to develop algorithms for many-core systems, such as GPUs. This dissertation attempts to remedy this by presenting the Synchronous Parallel Throughput (SPT) model, a general performance model that aims to capture the factors that most impact algorithm performance on many-core architectures. The model focuses on two factors that often create performance bottlenecks: memory latency and synchronization overhead. We instantiate the SPT model on three separate modern GPU platforms using a series of microbenchmarks that measure hardware parameters such as memory access latency and peak bandwidth. We further show how multiplicity affects performance by hiding latencies and increasing overall throughput. We consider three fundamental problems as case studies in this dissertation: general matrix-matrix multiplication, searching, and sorting. Using our SPT model, we analyze a state-of-the-art library implementation of a matrix multiplication algorithm and show that our model can generate runtime estimates with an average error of 5% across our GPU platforms. We then consider the problem of batched predecessor search in the context of two levels of the GPU memory hierarchy. In slow, global memory, we demonstrate the accuracy of the SPT model, while in fast, shared memory, we determine that the memory access patterns create a bottleneck that degrades performance. We develop a new searching algorithm that improves the access pattern and increases performance by up to 293% on our GPUs.
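The batched predecessor search problem mentioned above can be illustrated with a minimal sequential sketch. This is not the thesis's GPU algorithm (which concerns memory access patterns across the GPU hierarchy); the function name and interface here are assumptions for illustration only:

```python
from bisect import bisect_right

def batched_predecessor_search(sorted_keys, queries):
    """For each query, return the index of its predecessor in sorted_keys
    (the largest element <= the query), or -1 if no such element exists.

    Illustrative sequential version; the dissertation studies how batches
    of such searches map onto GPU global and shared memory.
    """
    return [bisect_right(sorted_keys, q) - 1 for q in queries]
```

For example, searching the keys [1, 3, 5] for the queries [0, 3, 4, 6] yields the predecessor indices [-1, 1, 1, 2].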
Finally, we look at comparison-based sorting on GPUs by analyzing two state-of-the-art algorithms and, using our SPT model, determine that they each suffer from bottlenecks that stifle performance. With these bottlenecks in mind, we develop GPU-MMS, our GPU-efficient multiway mergesort algorithm, and demonstrate that it outperforms highly optimized library implementations of existing algorithms by an average of 21% when sorting random inputs and up to 67% on worst-case input permutations. These case studies demonstrate both the accuracy and applicability of the SPT model for analyzing and developing GPU-efficient algorithms.
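The multiway (k-way) mergesort underlying GPU-MMS can be sketched sequentially as follows. This is a minimal illustration of the general technique only, not GPU-MMS itself; the function name and the choice of k are assumptions, and the heap-based merge stands in for the GPU-resident merging network the thesis develops:

```python
import heapq

def multiway_mergesort(data, k=4):
    """Sequential k-way mergesort: split the input into k runs,
    sort each run recursively, then merge all k runs in one pass
    using a min-heap over the run heads (heapq.merge).

    Illustrative only; GPU-MMS maps the merge work onto GPU threads
    to avoid the bottlenecks identified by the SPT model.
    """
    if len(data) <= 1:
        return list(data)
    # Partition into k roughly equal runs (ceiling division for run size).
    step = max(1, -(-len(data) // k))
    runs = [multiway_mergesort(data[i:i + step], k)
            for i in range(0, len(data), step)]
    # Merge all sorted runs in a single k-way pass.
    return list(heapq.merge(*runs))
```

A k-way merge performs fewer passes over the data than binary merging (log_k n versus log_2 n levels), which is one reason multiway merging suits bandwidth- and latency-sensitive architectures.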
dcterms.description Ph.D. Thesis. University of Hawaiʻi at Mānoa 2018.
dcterms.language eng
dcterms.publisher University of Hawaiʻi at Mānoa
dcterms.rights All UHM dissertations and theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission from the copyright owner.
dcterms.type Text
Original bundle
1.96 MB
Adobe Portable Document Format