The work in [8] presents a memory management scheme for the CGRA. However, these works have not dealt with the management of the data memory. In the proposed methodology, reused values are stored locally in a PE RAM and are transferred to the PEs when needed. Also, by placing data-dependent operations close to one another, the data bandwidth requirements are reduced. Finally, by employing a realistic memory interface, which is not the case in previous works [5], [6], our methodology produces realistic performance estimates.

In this section, a generic reconfigurable architecture template is presented, which is based on characteristics found in the majority of 2D coarse-grain reconfigurable architectures [1], [2], [3]. It is suitable for mapping applications, since it can represent a wide variety of 2D CGRAs. The proposed architecture template is shown in Fig. 1. There are cases, like in [3], where there are also direct connections among all the PEs across a column and a row. An overview of the types of PE interconnect networks is given in [6].

The input is the application description in C language. We have utilized the frontend of the SUIF2 compiler infrastructure [9] to create the intermediate representation (IR), which is the first input to the mapping phase.
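The idea of serving reused values from a PE-local RAM instead of re-fetching them over the memory buses can be sketched as follows. This is a minimal illustration, not the paper's actual implementation; the function name, the FIFO replacement policy, and the access trace are all invented for the example.

```python
# Sketch of the data-reuse idea: values already present in a PE-local RAM
# are served locally instead of being re-fetched from main memory.
# All names and the replacement policy are illustrative.

def count_main_memory_accesses(trace, local_ram_size):
    """Count main-memory fetches when a PE-local RAM of the given size
    holds recently used values (simple FIFO replacement)."""
    local_ram = []           # variable names currently stored in the PE RAM
    accesses = 0
    for var in trace:
        if var in local_ram:
            continue          # reused value: served locally, no bus traffic
        accesses += 1         # fetch over the memory bus
        local_ram.append(var)
        if len(local_ram) > local_ram_size:
            local_ram.pop(0)  # evict the oldest entry
    return accesses

trace = ["a", "b", "a", "c", "a", "b"]
print(count_main_memory_accesses(trace, 0))   # no local RAM: 6 accesses
print(count_main_memory_accesses(trace, 2))   # with a 2-entry RAM: 5 accesses
```

Even this tiny trace shows how local storage of reused values cuts the number of transfers over the shared memory buses, which is the bandwidth bottleneck the methodology targets.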
The ER edges are further annotated with the names of the variables that are common to the operations they connect. It is noted that control dependencies are also represented. We call the subset of operations in the DDRG that have E edges sinking into a specific node v the data-dependent predecessors of v.

Each PE contains a functional unit (FU), which can be configured to perform a specific word-level operation each time, such as ALU operations, multiplication, and shifts. Also, the PE's context word determines where the output of the FU is routed, thus defining the interconnections among the PEs. The architecture of the memory buses is assumed to be the same for all rows or columns. In existing CGRA architectures [2], [3], shared buses are used for transferring data to the PEs. The description of the CGRA architecture is the second input to the mapping phase.
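A dependence graph whose edges carry the shared variable names, as described above, can be represented minimally like this. The structure is illustrative only and is not the paper's DDRG implementation; the operation ids, names, and the `predecessors` helper are invented for the example.

```python
# Minimal illustration of a dependence graph whose edges are annotated with
# the variable shared by the operations they connect (names are illustrative).

ops = {1: "load", 2: "load", 3: "mul", 4: "add"}

# (source op, sink op, shared variable)
edges = [(1, 3, "x"), (2, 3, "y"), (3, 4, "t")]

def predecessors(v):
    """Operations with edges sinking into node v, with the shared variables."""
    return [(src, var) for (src, dst, var) in edges if dst == v]

print(predecessors(3))   # [(1, 'x'), (2, 'y')]
```

The annotation makes it possible, for any node v, to recover both which operations feed it and which variables must be routed to the PE that will execute it.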
Thus, PEs on the same row share them. The scratch-pad memory is located between the array of the PEs and the main memory, and provides the array with the required data bandwidth. The buses to which each PE is connected, the bus bandwidth, and the memory access times are included in the CGRA architecture description. The configuration memory is equivalent to the instruction memory in a processor; configuration caches are distributed in the CGRA.

Mapping an application onto such architectures is a combination of scheduling operations in time for execution [11], mapping these operation executions to specific PEs, and routing data by mapping and scheduling communications to specific interconnects in the CGRA. The assignment of an operation to a specific PE will be referred to hereafter as a Place Decision (PD) for that specific operation. Higher interconnection overhead causes future scheduled operations to have larger execution start times due to conflicts.
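An architecture description of the kind used as the second input to the mapping phase might be captured in a structure like the following. This is a hypothetical sketch; the class, field names, and the example values (a 4x4 array with one shared bus per row) are assumptions, not the paper's actual format.

```python
# Hypothetical sketch of a CGRA architecture description: per-PE bus
# connectivity, bus bandwidth, and memory access times (field names invented).
from dataclasses import dataclass

@dataclass
class CgraDescription:
    rows: int
    cols: int
    buses_per_pe: dict           # PE coordinate -> list of bus ids it connects to
    bus_bandwidth_words: int     # words transferable per cycle on each shared bus
    spm_access_cycles: int       # scratch-pad memory access latency
    main_mem_access_cycles: int  # main-memory access latency

cgra = CgraDescription(
    rows=4, cols=4,
    # one shared bus per row: every PE in row r connects to bus r
    buses_per_pe={(r, c): [r] for r in range(4) for c in range(4)},
    bus_bandwidth_words=2,
    spm_access_cycles=1,
    main_mem_access_cycles=10,
)
print(len(cgra.buses_per_pe))   # 16 PEs described
```

Keeping the bus connectivity and access latencies explicit is what lets the mapper account for a realistic, finite memory bandwidth rather than assuming an arbitrarily large one.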
Each PD has a different impact on the overall execution time. Hence, a set of costs is assigned to each PD to incorporate this impact; the aforementioned costs compose the overall cost for each PD. The aim of the proposed mapping algorithm is to find a cost-effective PD for all operations of the application. A greedy approach was adopted for calculating the cost of each PD. For each choice, the shortest paths which connect the source operands with the candidate PE are found; from this set of paths, the one with the minimum delay is selected. The pseudocode of the mapping algorithm is shown in Fig. 3.
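The routing part of the cost calculation, finding the minimum-delay path from the PEs holding the source operands to a candidate PE, can be sketched with a standard shortest-path search. The interconnect graph, link delays, and PE names below are invented for illustration; the paper's actual cost model is not shown.

```python
# Sketch of the routing-cost step: among paths from the PEs holding the
# source operands to a candidate PE, pick the one with minimum delay.
# Dijkstra over an interconnect graph with per-link delays (illustrative).
import heapq

def min_delay(graph, src, dst):
    """Minimum total link delay from PE `src` to PE `dst` (Dijkstra)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

# interconnect: PE -> [(neighbour, link delay)]
graph = {
    "PE00": [("PE01", 1), ("PE10", 1)],
    "PE01": [("PE11", 1)],
    "PE10": [("PE11", 2)],
    "PE11": [],
}
sources = ["PE01", "PE10"]   # PEs holding the source operands
candidate = "PE11"           # candidate placement for the operation
best = min(min_delay(graph, s, candidate) for s in sources)
print(best)                  # 1 (via PE01 -> PE11)
```

A greedy mapper would evaluate such a delay for every candidate PE and fold it into that PD's cost, so placements needing long routes become less attractive.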
The priority of an operation is calculated as the difference between its As Late As Possible (ALAP) and As Soon As Possible (ASAP) schedule times; this result is called mobility. In this way, operations residing in the critical path are considered first in the scheduling phase. After all operations are scheduled, the time, the place, and the resource reservations for each operation at every clock cycle are recorded.

A software tool realizing the presented mapping methodology has been developed. The experimental results are presented in section V.
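The mobility-based priority described above can be computed as follows, assuming unit operation delays. The small dependence DAG is invented for illustration; operations with zero mobility lie on the critical path and are considered first.

```python
# Sketch of the priority computation: mobility = ALAP - ASAP schedule time.
# Unit operation delays are assumed; the DAG below is illustrative.

deps = {  # operation -> operations it depends on
    "a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["b"],
}
succs = {op: [] for op in deps}
for op, preds in deps.items():
    for p in preds:
        succs[p].append(op)

def asap(op):
    """Earliest step at which op can execute."""
    return 0 if not deps[op] else 1 + max(asap(p) for p in deps[op])

total = max(asap(op) for op in deps)  # schedule length with unit delays

def alap(op):
    """Latest step at which op can execute without stretching the schedule."""
    return total if not succs[op] else min(alap(s) for s in succs[op]) - 1

mobility = {op: alap(op) - asap(op) for op in deps}
order = sorted(deps, key=lambda op: mobility[op])  # critical path first
print(mobility)   # {'a': 0, 'b': 0, 'c': 0, 'd': 0, 'e': 1}
```

Here the chain a/b -> c -> d has zero mobility and is scheduled first, while e can slide by one cycle, exactly the behaviour a mobility-ordered list scheduler exploits.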
Finally, conclusions are outlined in section VI.

This type of reconfigurable architecture trades some flexibility for efficiency: the coarse granularity greatly reduces the delay, power, and configuration time relative to an FPGA device [1]. The aim of mapping algorithms for CGRAs is to exploit the inherent parallelism of the considered applications so as to increase performance. An increase in the operation parallelism results in a respective increase in the rate at which data are fetched from memory. Thus, there is an imperative need for a mapping methodology for CGRAs that reduces the memory bandwidth requirements. Furthermore, performance is not the only important factor in embedded systems design; power consumption is equally important in most cases, e.g. in portable devices. As it was shown in [4], memory contributes the most to the power consumption. Generally, power consumption can be reduced by minimizing the number of memory accesses.

Although several CGRA architectures have been proposed in the past few years [1], few automated mapping methodologies have been developed for them. In [5], a modulo-scheduling algorithm was proposed for mapping loops on a generic coarse-grain reconfigurable architecture. The work in [6] proposes an interconnect-aware list-based scheduling algorithm that accounts for various interconnection structures. However, neither of these works considers data memory management issues. Furthermore, in [5], [6] an arbitrarily large memory bandwidth was assumed, which results in unpredictable performance overestimates. Performance figures of existing works, like [5], [6], would certainly have been worse if a realistic memory interface, and thus the memory bandwidth offered by it, had been taken into account. An approach for taking into account the data memory bandwidth in the pipelined execution of an application was given in [7].