A GCC Vectorizer-based Instruction Translation Method for an Array Accelerator

Hao Wang (1151209)


To achieve high processing performance while hiding differences in hardware, general-purpose graphics processing units (GPGPUs), which comprise many functional units (FUs), rely on programming languages such as CUDA\cite{1} that specify parallel processing explicitly. However, obtaining the desired performance requires considerable tuning effort and an understanding of the hardware structure. In contrast, we have proposed the linear array pipeline processor (LAPP)\cite{3}, which is characterized by a structure containing many combinations of local memory and FUs, balancing reduced power consumption with improved processing performance. However, although LAPP\cite{9} achieves high performance on loops with no dependencies between iterations simply by inserting prefetch information into the existing VLIW instruction sequence, it places constraints on which loops can be executed at high speed. Moreover, it cannot be used with a general-purpose processor that has a different instruction set without redesigning the accelerator portion of LAPP. In this paper, we describe a new accelerator structure that alleviates these constraints while following the approach of LAPP, and a method of generating instructions for this new structure by exploiting the GCC vectorizer. We are currently implementing control-flow, data-flow, and memory-access-pattern analysis based on the information provided by Uncprop. Compared with LAPP, we obtain an average reduction of 65\% in FU array stages for several simple programs. With the array stages saved by this optimization, it is possible to boost performance by 2x to 8x relative to LAPP by mapping multiple loop instances.
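
As an illustrative sketch (not taken from the paper), the class of loops targeted here is one whose iterations are independent, so that the GCC vectorizer can analyze it and an array accelerator such as LAPP can map it onto its FU array stages. The kernel name and signature below are hypothetical:

\begin{verbatim}
/* Hypothetical example: every iteration reads and writes only index i,
 * so there are no dependencies between iterations.  Loops of this shape
 * are what the GCC vectorizer can analyze and what an FU-array
 * accelerator can pipeline across its stages. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
\end{verbatim}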