Optimization techniques for OpenCL-based linear algebra routines

Stephen Kozacik; Paul Fox; John Humphrey; Aryeh Kuller; Eric Kelmelis; Dennis W. Prather

doi:10.1117/12.2050673

13 June 2014

Proceedings Volume 9095, Modeling and Simulation for Defense Systems and Applications IX; 90950D (2014) https://doi.org/10.1117/12.2050673
Event: SPIE Defense + Security, 2014, Baltimore, MD, United States

Abstract

The OpenCL standard for general-purpose parallel programming allows a developer to target highly parallel computations at graphics processing units (GPUs), CPUs, co-processing devices, and field-programmable gate arrays (FPGAs). The computationally intensive domains of linear algebra and image processing have shown significant speedups when implemented in the OpenCL environment. A major benefit of OpenCL is that a routine written for one device can be run across many different devices and architectures; however, a kernel optimized for one device may not exhibit high performance when executed on a different device. For this reason, kernels must typically be hand-optimized for every target device family. Because a large number of parameters can affect performance, hand tuning for every possible device is impractical and often produces suboptimal results. For this work, we focused on optimizing the general matrix multiplication routine. General matrix multiplication is used as a building block for many linear algebra routines and often accounts for a large portion of the run-time. Prior work has shown this routine to be a good candidate for high-performance implementation in OpenCL. We selected several candidate algorithms from the literature that are suitable for parameterization. We then developed parameterized kernels implementing these algorithms using only portable OpenCL features. Our implementation queries device information supplied by the OpenCL runtime and uses it, together with user input, to generate a search space that satisfies device and algorithmic constraints. Preliminary results from our work confirm that optimizations are not portable from one device to the next, and show the benefits of automatic tuning. Using a standard set of tuning parameters seen in the literature for the NVIDIA Fermi architecture achieves a performance of 1.6 TFLOPS on an AMD 7970 device, while automatic tuning achieves a peak of 2.7 TFLOPS.
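The search-space generation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the parameter set (tile sizes and work-group dimensions), the candidate values, the `DeviceLimits` container, and the `run_kernel` timing callback are all assumptions made for illustration. In a real host program the two device limits would be read from the OpenCL runtime (for example via `clGetDeviceInfo` with `CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_LOCAL_MEM_SIZE`); here they are passed in by hand so the example runs without an OpenCL installation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class DeviceLimits:
    # Hypothetical container for limits an OpenCL runtime would report.
    max_work_group_size: int  # max work-items per work-group
    local_mem_bytes: int      # local (on-chip) memory per work-group


def satisfies_constraints(cfg, dev, m, n, k, elem_bytes=4):
    """Check one candidate against device and algorithmic constraints.

    cfg = (tile_m, tile_n, tile_k, wg_x, wg_y): each work-group computes a
    tile_m x tile_n block of C, staging tile_k-deep slices of A and B in
    local memory. This tiling scheme is an assumed, commonly used one.
    """
    tile_m, tile_n, tile_k, wg_x, wg_y = cfg
    # Device constraint: the work-group must fit on the device.
    if wg_x * wg_y > dev.max_work_group_size:
        return False
    # Algorithmic constraint: work-items must evenly share the C tile.
    if tile_m % wg_y or tile_n % wg_x:
        return False
    # Algorithmic constraint: tiles must evenly cover the matrices.
    if m % tile_m or n % tile_n or k % tile_k:
        return False
    # Device constraint: staged A and B slices must fit in local memory.
    local_bytes = (tile_m * tile_k + tile_k * tile_n) * elem_bytes
    return local_bytes <= dev.local_mem_bytes


def gemm_search_space(dev, m, n, k, candidates=(4, 8, 16, 32, 64)):
    """Enumerate every candidate configuration that passes all checks."""
    return [cfg for cfg in product(candidates, repeat=5)
            if satisfies_constraints(cfg, dev, m, n, k)]


def gemm_gflops(m, n, k, seconds):
    """GEMM performs 2*m*n*k floating-point operations (multiply + add)."""
    return 2.0 * m * n * k / seconds / 1e9


def autotune(space, m, n, k, run_kernel):
    """Time every configuration and return the fastest with its GFLOPS.

    run_kernel is a hypothetical callback that builds and runs the kernel
    for one configuration and returns the elapsed time in seconds.
    """
    best_cfg, best_rate = None, 0.0
    for cfg in space:
        rate = gemm_gflops(m, n, k, run_kernel(cfg))
        if rate > best_rate:
            best_cfg, best_rate = cfg, rate
    return best_cfg, best_rate
```

Calling `gemm_search_space(DeviceLimits(256, 32768), 1024, 1024, 1024)` with a different `DeviceLimits` yields a different set of legal configurations, which is one reason a parameter set tuned for one device family may not even be valid, let alone fastest, on another.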

Citation

Stephen Kozacik, Paul Fox, John Humphrey, Aryeh Kuller, Eric Kelmelis, and Dennis W. Prather "Optimization techniques for OpenCL-based linear algebra routines", Proc. SPIE 9095, Modeling and Simulation for Defense Systems and Applications IX, 90950D (13 June 2014); https://doi.org/10.1117/12.2050673


PROCEEDINGS
6 PAGES


KEYWORDS

Linear algebra

Field programmable gate arrays

Matrices

Matrix multiplication

Algorithm development

Standards development

Graphics processing units
