The three-dimensional structure tensor algorithm (3D-STA) is often used in image processing applications to compute the optical flow or to detect local 3D structures and their directions. This algorithm is computationally expensive due to the many computations required to calculate the gradient and the tensor and to smooth every pixel of the image frames. Therefore, it is important to parallelize the implementation to achieve high performance. In this paper we present two parallel implementations of 3D-STA, namely a moderately parallelized and a highly parallelized implementation, on a massively parallel reconfigurable array. Finally, we evaluate the performance of the generated code and compare the results with another optical flow implementation. The throughput achieved by the moderately parallelized implementation is approximately half of the throughput of the optical flow implementation, whereas the highly parallelized implementation results in a 2x gain in throughput compared to the optical flow implementation. © 2012 IEEE.
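To make the per-pixel workload concrete, the following C sketch computes the gradients, the six distinct tensor products, and a small box smoothing for one interior pixel of an image sequence. It is only a minimal illustration under an assumed data layout; the function names and the 3x3x3 smoothing window are assumptions, not taken from the paper.

```c
/* Minimal sketch of the per-pixel work in a 3D structure tensor:
 * central-difference gradients in x, y and t, the six distinct
 * tensor products, and a small box smoothing. Data layout and
 * window size are illustrative assumptions, not from the paper. */
#include <stddef.h>

typedef struct { float xx, xy, xt, yy, yt, tt; } Tensor3;

static inline float px(const float *frames, size_t W, size_t H,
                       size_t x, size_t y, size_t t)
{
    return frames[t * W * H + y * W + x];
}

/* Tensor products at one interior pixel (x, y, t). */
static Tensor3 tensor_at(const float *frames, size_t W, size_t H,
                         size_t x, size_t y, size_t t)
{
    float gx = 0.5f * (px(frames, W, H, x + 1, y, t) - px(frames, W, H, x - 1, y, t));
    float gy = 0.5f * (px(frames, W, H, x, y + 1, t) - px(frames, W, H, x, y - 1, t));
    float gt = 0.5f * (px(frames, W, H, x, y, t + 1) - px(frames, W, H, x, y, t - 1));
    Tensor3 T = { gx * gx, gx * gy, gx * gt, gy * gy, gy * gt, gt * gt };
    return T;
}

/* Box smoothing of the tensor over a 3x3x3 neighbourhood. */
static Tensor3 smoothed_tensor(const float *frames, size_t W, size_t H,
                               size_t x, size_t y, size_t t)
{
    Tensor3 S = { 0 };
    for (int dt = -1; dt <= 1; dt++)
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                Tensor3 T = tensor_at(frames, W, H, x + dx, y + dy, t + dt);
                S.xx += T.xx; S.xy += T.xy; S.xt += T.xt;
                S.yy += T.yy; S.yt += T.yt; S.tt += T.tt;
            }
    const float n = 27.0f;
    S.xx /= n; S.xy /= n; S.xt /= n; S.yy /= n; S.yt /= n; S.tt /= n;
    return S;
}
```

Every output pixel repeats this work, which is what makes parallelizing the algorithm across a reconfigurable array attractive.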
Embedded DSP computing is currently shifting towards manycore architectures in order to cope with the ever growing computational demands. Actor-based dataflow languages are being considered as a programming model. In this paper we present a code generator for CAL, one such dataflow language. We propose to use a compilation tool with two intermediate representations. We start from a machine model of the actors that provides an ordering for the testing of conditions and the firing of actions. We then generate an Action Execution Intermediate Representation that is closer to sequential imperative languages such as C and Java. We describe our two intermediate representations and show the feasibility and portability of our approach by compiling a CAL implementation of the Two-Dimensional Inverse Discrete Cosine Transform on a general-purpose processor, on the Epiphany manycore architecture and on the Ambric massively parallel processor array. © 2014 IEEE.
With the arrival of heterogeneous manycores comprising various features to support task-, data- and instruction-level parallelism, developing applications that take full advantage of the hardware parallel features has become a major challenge. In this paper, we present an extension to our CAL compilation framework (CAL2Many) that supports data parallelism in the CAL Actor Language. Our compilation framework makes it possible to program architectures with SIMD support using a high-level language and provides efficient code generation. We support general SIMD instructions, but the code generation backend is currently implemented for two custom architectures, namely ePUMA and EIT. Our experiments were carried out for these two custom SIMD processor architectures using two applications.
The experiments show the possibility of achieving performance comparable to hand-written machine code with much less programming effort.
Manycore architectures are dominating the development of advanced embedded computing due to the computational and power demands of high performance applications. This has introduced an additional complexity with regard to the efficient exploitation of the underlying hardware and the development of efficient parallel implementations. To tackle this, we model applications using a dataflow programming language, perform high-level transformations of dataflow actors, and generate native code by using our compilation framework. This paper presents the actor fission transformations of our Cal2Many compilation framework. The transformations have facilitated the mapping of large dataflow actors on memory restricted embedded manycores, increased the utilization of the hardware, and enabled support for task- and data-level parallelism. We have applied the actor transformations to two blocks of an MPEG-4 decoder and executed them on the Epiphany manycore architecture. The results show the practicality and feasibility of our approach.
The arrival of manycore platforms has imposed programming challenges for mainstream embedded system developers. In this paper, we discuss the significance of actor-oriented dataflow languages and present our compilation framework for the CAL Actor Language that leads to increased portability and retargetability. We demonstrate the applicability of our approach with streaming applications targeting the Epiphany manycore architecture. We have performed an in-depth analysis of MPEG-4 SP implemented on Epiphany using our framework and studied the effects of actor composition. We have identified hardware aspects such as increased off-chip memory bandwidth and larger local memories that could result in further performance improvements.
Efficient utilization of available resources is a key concern in embedded systems. This paper focuses on providing support for managing dynamic reconfiguration of computing resources in the programming model. We present an approach to map occam-pi programs to a manycore architecture, Platform 2012 (P2012). We describe the techniques used to translate the salient features of the occam-pi language to the native programming model of the P2012 architecture. We present the initial results from a case study of matrix multiplication. Our results show the simplicity of the occam-pi program, with a 6x reduction in lines of code.
Manycore architectures are gaining attention as a means to meet the performance and power demands of high-performance embedded systems. However, their widespread adoption is sometimes constrained by the need for mastering proprietary programming languages that are low-level and hinder portability. We propose the use of the concurrent programming language occam-pi as a high-level language for programming an emerging class of manycore architectures. We show how to map occam-pi programs to the manycore architecture Platform 2012 (P2012). We describe the techniques used to translate the salient features of the language to the native programming model of the P2012. We present the results from a case study on a representative algorithm in the domain of real-time image processing: a complex algorithm for corner detection called Features from Accelerated Segment Test (FAST). Our results show that the occam-pi program is much shorter, is easier to adapt and has competitive performance when compared to versions programmed in the native programming model of P2012 and in OpenCL.
When implementing computation-intensive algorithms on fine-grained parallel architectures, adjusting the resource-to-performance trade-off is a big challenge. This paper proposes a methodology for dealing with some of these performance trade-offs by adjusting parallelism at different levels. In a case study, interpolation kernels are implemented on a fine-grained architecture (FPGA) using a high-level language (Mitrion-C). For both cubic and bicubic interpolation, one single-kernel, one cross-kernel and two multi-kernel parallel implementations are designed and evaluated. Our results demonstrate that no single level of parallelism can be used for trade-off adjustment. Instead, the appropriate degree of parallelism on each level needs to be found, according to the available resources and the performance requirements of the application. Basing the design on high-level programming simplifies the trade-off process. This research is a step towards automation of the choice of parallelization based on a combination of parallelism levels.
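For reference, the arithmetic behind such interpolation kernels can be sketched as below. This uses Keys' cubic convolution kernel (a = -0.5) and its separable bicubic extension as a generic example; the exact kernels and fixed-point details of the Mitrion-C implementations are not specified here, so treat the names and constants as assumptions.

```c
/* Sketch of 1D cubic convolution interpolation (Keys' kernel, a = -0.5)
 * and its separable bicubic extension. This only illustrates the arithmetic
 * that gets parallelized; it is not the paper's Mitrion-C code. */
#include <math.h>

static float cubic_weight(float s)
{
    const float a = -0.5f;
    s = fabsf(s);
    if (s <= 1.0f) return (a + 2.0f) * s * s * s - (a + 3.0f) * s * s + 1.0f;
    if (s <  2.0f) return a * s * s * s - 5.0f * a * s * s + 8.0f * a * s - 4.0f * a;
    return 0.0f;
}

/* Interpolate between samples p[0..3] at fractional position t in [0,1),
 * where p[1] and p[2] are the neighbouring samples. */
static float cubic_interp(const float p[4], float t)
{
    float v = 0.0f;
    for (int i = 0; i < 4; i++)
        v += p[i] * cubic_weight(t - (float)(i - 1));
    return v;
}

/* Separable bicubic interpolation on a 4x4 patch. */
static float bicubic_interp(const float p[4][4], float tx, float ty)
{
    float col[4];
    for (int i = 0; i < 4; i++)
        col[i] = cubic_interp(p[i], tx);
    return cubic_interp(col, ty);
}
```

The independent per-output-pixel and per-row computations are what expose the kernel-level and data-level parallelism discussed in the study.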
In this paper we introduce Epiphany as a high-performance energy-efficient manycore architecture suitable for real-time embedded systems. This scalable architecture supports floating point operations in hardware and achieves 50 GFLOPS/W in 28 nm technology, making it suitable for high performance streaming applications like radio base stations and radar signal processing. Through an efficient 2D mesh Network-on-Chip and a distributed shared memory model, the architecture is scalable to thousands of cores on a single chip. An Epiphany-based open source computer named Parallella was launched in 2012 through Kickstarter crowd funding and has now shipped to thousands of customers around the world. ©2014 IEEE.
Recurrent neural networks (RNNs) are neural networks (NN) designed for time-series applications. There is a growing interest in running RNNs to support these applications on edge devices. However, RNNs have large memory and computational demands that make them challenging to implement on edge devices. Quantization is used to shrink the size and the computational needs of such models by decreasing weight and activation precision. Further, the delta networks method increases the sparsity in activation vectors by relying on the temporal relationship between successive input sequences to eliminate repeated computations and memory accesses. In this paper, we study the effect of quantization on LSTM-, GRU-, LiGRU-, and SRU-based RNN models for speech recognition on the TIMIT dataset. We show how to apply post-training quantization on these models with a minimal increase in the error by skipping quantization of selected paths. In addition, we show that the quantization of activation vectors in RNNs to integer precision leads to considerable sparsity if the delta networks method is applied. Then, we propose a method for increasing the sparsity in the activation vectors while minimizing the error and maximizing the percentage of eliminated computations. The proposed quantization method managed to compress the four models by more than 85%, with an error increase of 0.6, 0, 2.1, and 0.2 percentage points, respectively. By applying the delta networks method to the quantized models, more than 50% of the operations can be eliminated, in most cases with only a minor increase in the error. Comparing the four models to each other under the quantization and delta networks methods, we found that compressed LSTM-based models are the best solutions under low-error-rate constraints. The compressed SRU-based models are the smallest in size, suitable when higher error rates are acceptable, and the compressed LiGRU-based models have the highest number of eliminated operations. © 2022 by the authors. Licensee MDPI, Basel, Switzerland.
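A minimal sketch of the delta networks idea for one quantized matrix-vector product is shown below. Integer widths, the threshold, and the function name are illustrative assumptions, not the models or code from the paper; it only shows why thresholded activation changes let whole weight columns be skipped.

```c
/* Sketch of the delta-network idea for one dense layer of an RNN:
 * only input elements whose change since the previous time step exceeds a
 * threshold contribute, so the corresponding weight columns are skipped.
 * The accumulator y persists across time steps and holds the running
 * W * x_prev term. Types and threshold are illustrative assumptions. */
#include <stdlib.h>
#include <stdint.h>

/* y += W * delta, where delta is the thresholded change of x.
 * W is stored row-major with dimensions rows x cols. */
static void delta_matvec(const int8_t *W, const int8_t *x, int8_t *x_prev,
                         int32_t *y, int rows, int cols, int threshold)
{
    for (int j = 0; j < cols; j++) {
        int32_t delta = (int32_t)x[j] - (int32_t)x_prev[j];
        if (abs(delta) <= threshold)
            continue;                 /* skip column: no MACs, no weight fetch */
        x_prev[j] = x[j];             /* update stored state only when used */
        for (int i = 0; i < rows; i++)
            y[i] += (int32_t)W[i * cols + j] * delta;
    }
}
```

The fraction of skipped columns corresponds directly to the percentage of eliminated operations and memory accesses reported in the abstract.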
The compression of deep learning models is of fundamental importance in deploying such models to edge devices. The selection of compression parameters can be automated to meet changes in the hardware platform and application. This article introduces a Multi-Objective Hardware-Aware Quantization (MOHAQ) method, which considers hardware performance and inference error as objectives for mixed-precision quantization. The proposed method feasibly evaluates candidate solutions in a large search space by relying on two steps. First, post-training quantization is applied for fast solution evaluation (inference-only search). Second, we propose a "beacon-based search" that retrains selected solutions only and uses them as beacons to estimate the effect of retraining on other solutions. We use speech recognition models on the TIMIT dataset. Experimental evaluations show that Simple Recurrent Unit (SRU)-based models can be compressed up to 8x by post-training quantization without any significant error increase. On SiLago, we found solutions that achieve 97% and 86% of the maximum possible speedup and energy saving, respectively, with a minor increase in error on an SRU-based model. On Bitfusion, the beacon-based search reduced the error increase of the inference-only search on SRU-based models and a Light Gated Recurrent Unit (LiGRU)-based model by up to 4.9 and 3.9 percentage points, respectively.
Recurrent Neural Networks (RNNs) are a class of machine learning algorithms used for applications with time-series and sequential data. Recently, there has been a strong interest in executing RNNs on embedded devices. However, difficulties have arisen because RNNs require high computational capability and a large memory space. In this paper, we review existing implementations of RNN models on embedded platforms and discuss the methods adopted to overcome the limitations of embedded systems. We define the objectives of mapping RNN algorithms onto embedded platforms and the challenges facing their realization. Then, we explain the components of RNN models from an implementation perspective. We also discuss the optimizations applied to RNNs to run efficiently on embedded platforms. Finally, we compare the defined objectives with the implementations and highlight some open research questions and aspects currently not addressed for embedded RNNs. Overall, applying algorithmic optimizations to RNN models and decreasing the memory access overhead is vital to obtain high efficiency. To further increase the implementation efficiency, we point out the more promising optimizations that could be applied in future research. Additionally, this article observes that high performance has been targeted by many implementations, while flexibility has, as yet, been attempted less often. Thus, the article provides some guidelines for RNN hardware designers to support flexibility in a better manner. © 2020 IEEE.
The convolution module of convolutional neural networks is highly computationally demanding. In order to execute a neural network inference on embedded platforms, an efficient implementation of the convolution is required. Low precision parameters can provide an implementation that requires less memory, less computation time, and less power consumption. Furthermore, streaming the convolution computation over parallelized processing units saves a lot of memory, which is a key concern in memory constrained embedded platforms. In this paper, we show how the convolution module can be implemented on the Epiphany manycore architecture. Low precision parameters are used with ternary weights of +1, 0, and -1 values. The computation is done through a pipeline by streaming data through processing units. The proposed approach decreases the memory requirements for the CNN implementation and can reach up to 282 GOPS and up to 5.6 GOPS/W.
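The effect of ternary weights on the inner loop can be illustrated with the following C sketch, in which every multiplication collapses into an add, a subtract, or a skip. The kernel size, data types and layout are assumptions for illustration only, not the streamed Epiphany implementation.

```c
/* Sketch of a convolution inner loop with ternary weights {-1, 0, +1}:
 * each multiplication collapses into an add, a subtract, or a skip.
 * Kernel size, data types and layout are illustrative assumptions. */
#include <stdint.h>

#define K 3  /* assumed kernel size */

static int32_t ternary_conv_pixel(const int8_t in[K][K], const int8_t w[K][K])
{
    int32_t acc = 0;
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++) {
            if (w[i][j] > 0)      acc += in[i][j];   /* weight +1 */
            else if (w[i][j] < 0) acc -= in[i][j];   /* weight -1 */
            /* weight 0: skip the term entirely */
        }
    return acc;
}
```

Because no multipliers are needed and each weight fits in two bits, both the arithmetic and the per-core storage shrink, which is what makes the streamed mapping onto small local memories feasible.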
Convolutional neural networks (CNNs) are extensively used for deep learning applications such as image recognition and computer vision. The convolution module of these networks is highly compute-intensive. Having an efficient implementation of the convolution module enables realizing the inference part of the neural network on embedded platforms. Low precision parameters require less memory, less computation time, and less power consumption while achieving high classification accuracy. Furthermore, streaming the data over parallelized processing units saves a considerable amount of memory, which is a key concern in memory constrained embedded platforms. In this paper, we explore the design space for streamed CNNs on the Epiphany manycore architecture using varying precisions for weights (ranging from binary to 32-bit). Both AlexNet and GoogleNet are explored for two different memory sizes of Epiphany cores. We are able to achieve competitive performance for both AlexNet and GoogleNet with respect to emerging manycores. Furthermore, the effects of different design choices in terms of precision, memory size, and the number of cores are evaluated by applying the proposed method.
Today, computer architectures are shifting from single core to manycores due to several reasons such as performance demands, power and heat limitations. However, shifting to manycores results in additional complexities, especially with regard to the efficient development of applications. Hence there is a need to raise the abstraction level of development techniques for the manycores while exposing the inherent parallelism in the applications. One promising class of programming languages is dataflow languages, and in this paper we evaluate and optimize the code generation for one such language, CAL. We have also developed a communication library to support the inter-core communication. The code generation can target multiple architectures, but the results presented in this paper are focused on Adapteva's manycore architecture Epiphany. We use the two-dimensional inverse discrete cosine transform (2D-IDCT) as our benchmark and compare our code generation from CAL with a hand-written implementation developed in C. Several optimizations in the code generation as well as in the communication library are described, and we have observed that the most critical optimization is reducing the number of external memory accesses. Combining all optimizations, we have been able to reduce the difference in execution time between auto-generated and hand-written implementations from a factor of 4.3x down to a factor of only 1.3x. ©2014 IEEE.
This paper proposes a novel method for performing division on floating-point numbers represented in the IEEE-754 single-precision (binary32) format. The method is based on an inverter, implemented as a combination of Parabolic Synthesis and second-degree interpolation, followed by a multiplier. It is implemented both with and without pipeline stages and synthesized while targeting a Xilinx Ultrascale FPGA.
The implementations show better resource usage and latency results when compared to other implementations based on different methods. In terms of throughput, the proposed method outperforms most of the other works; however, some Altera FPGAs achieve a higher clock rate due to differences in the DSP slice multiplier design.
Due to its small size, low latency and high throughput, the presented floating-point division unit is suitable for high performance embedded systems and can be integrated into accelerators or be used as a stand-alone accelerator.
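As a rough behavioural model of the reciprocal-then-multiply scheme, the C sketch below inverts the divisor's mantissa with a per-segment second-degree polynomial and multiplies the result by the dividend. The coefficient generation used in the paper (Parabolic Synthesis) and the hardware error analysis are not reproduced; the segment count, table layout and names are assumptions, and sign/special-case handling is omitted.

```c
/* Behavioural sketch of division as a * (1/b): the divisor's mantissa is
 * inverted with a piecewise second-degree polynomial, then multiplied by the
 * dividend. Coefficient tables c0, c1, c2 are assumed to be precomputed;
 * assumes b > 0, no NaN/Inf/denormal handling. Not the paper's hardware. */
#include <math.h>

#define SEGMENTS 64  /* assumed number of interpolation segments */

/* Approximate 1/m for m in [1, 2) with per-segment quadratic coefficients. */
static double approx_recip(double m, const double c0[SEGMENTS],
                           const double c1[SEGMENTS], const double c2[SEGMENTS])
{
    double u = m - 1.0;                       /* u in [0, 1) */
    int    s = (int)(u * SEGMENTS);           /* segment index */
    double x = u * SEGMENTS - s;              /* local coordinate in [0, 1) */
    return c0[s] + c1[s] * x + c2[s] * x * x; /* second-degree interpolation */
}

static double approx_div(double a, double b, const double c0[SEGMENTS],
                         const double c1[SEGMENTS], const double c2[SEGMENTS])
{
    int eb;
    double mb  = frexp(b, &eb);                             /* b = mb * 2^eb, mb in [0.5, 1) */
    double inv = 2.0 * approx_recip(2.0 * mb, c0, c1, c2);  /* approx 1/mb */
    return a * ldexp(inv, -eb);                             /* a * (1/mb) * 2^-eb = a / b */
}
```

In the hardware unit only the mantissa inversion and one multiplication are on the critical path, which is where the small size and low latency come from.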
While parallel computer architectures have become mainstream, application development on them is still challenging. There is a need for new tools, languages and programming models. Additionally, there is a lack of knowledge about the performance of parallel approaches of basic but important operations, such as the QR decomposition of a matrix, on current commercial manycore architectures.
This paper evaluates a high-level dataflow language (CAL), a source-to-source compiler (Cal2Many) and three QR decomposition algorithms (Givens Rotations, Householder and Gram-Schmidt). The algorithms are implemented both in CAL and in hand-optimized C, executed on Adapteva's Epiphany manycore architecture and evaluated with respect to performance, scalability and development effort.
The performance of the CAL (generated C) implementations comes within 2% of the hand-written versions. They require on average 25% fewer lines of source code without significantly increasing the binary size. Development effort is reduced and debugging is significantly simplified. The implementations executed on the Epiphany cores outperform the GNU scientific library on the host ARM processor of the Parallella board by up to 30x. © 2016 Copyright held by the owner/author(s).
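Givens Rotations, one of the three evaluated algorithms, can be summarized by the single-core C sketch below; the CAL and hand-optimized Epiphany versions distribute this work over cores, which the sketch does not attempt to show.

```c
/* Minimal single-core sketch of QR decomposition by Givens rotations.
 * A (n x n, row-major) is overwritten with R; Qt accumulates Q^T and must be
 * initialized to the identity by the caller. Only illustrates the arithmetic,
 * not the parallel CAL or hand-written Epiphany implementations. */
#include <math.h>

static void givens_qr(double *A, double *Qt, int n)
{
    for (int j = 0; j < n; j++)
        for (int i = n - 1; i > j; i--) {        /* zero A[i][j] using row i-1 */
            double a = A[(i - 1) * n + j];
            double b = A[i * n + j];
            double r = hypot(a, b);
            if (r == 0.0) continue;
            double c = a / r, s = b / r;
            for (int k = 0; k < n; k++) {        /* rotate rows i-1 and i */
                double t1 = A[(i - 1) * n + k], t2 = A[i * n + k];
                A[(i - 1) * n + k] = c * t1 + s * t2;
                A[i * n + k]       = -s * t1 + c * t2;
                t1 = Qt[(i - 1) * n + k]; t2 = Qt[i * n + k];
                Qt[(i - 1) * n + k] = c * t1 + s * t2;
                Qt[i * n + k]       = -s * t1 + c * t2;
            }
        }
}
```

Because each rotation touches only two rows, independent rotations can be scheduled on different cores, which is what the dataflow formulations exploit.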
On-chip communication plays a significant role in the performance of manycore architectures. They therefore require a proper on-chip communication infrastructure that can scale with the number of cores. As a solution, network-on-chip structures have emerged and are being used.
This paper presents a two-dimensional mesh network-on-chip router and a network interface, implemented in Chisel to be integrated into the Rocket Chip generator that generates RISC-V (Rocket) cores. The router is implemented in VHDL as well, and the two implementations are verified and compared.
Hardware resource usage and performance of different sized networks are analyzed. The implementations are synthesized for a Xilinx Ultrascale FPGA via Xilinx tools to obtain the hardware resource usage and clock frequency results. The performance results, including latency and throughput measurements with different traffic patterns, are collected with cycle-accurate emulations.
The implementations in Chisel and VHDL do not show a significant difference. Chisel requires around 10% fewer lines of code; however, the difference in the synthesis results is negligible. Our latency results are better than those of the majority of the other studies. The other results, such as hardware usage, clock frequency, and throughput, are competitive when compared to the related works.
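The abstract does not state which routing algorithm the router implements, so the following C sketch only shows the textbook dimension-order (XY) routing decision commonly used in 2D mesh networks-on-chip, as a point of reference for what the per-hop routing logic has to decide.

```c
/* Sketch of a dimension-order (XY) routing decision for a 2D mesh router.
 * This is the textbook baseline for a 2D mesh; it is not necessarily the
 * algorithm used by the Chisel/VHDL router described above. The convention
 * that "north" means increasing y is an assumption. */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } Port;

static Port route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return PORT_EAST;   /* resolve the X dimension first */
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;  /* then the Y dimension */
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL;                     /* arrived: eject to the core */
}
```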
In the last 15 years we have seen, as a response to power and thermal limits of current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. But now, to further improve performance and energy efficiency when there are potentially hundreds of computing cores on a chip, we see a need for specialization of individual cores and the development of heterogeneous manycore computer architectures.
However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain specific manycore architectures based on the RISC-V instruction set architecture and automate the main steps of this method with software tools. The design method allows the generation of manycore architectures with different configurations, including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends by generating synthesizable Verilog code and a cycle-accurate emulator for the generated architecture.
We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measure their performance and hardware resource usage. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than their general purpose counterparts. In certain cases, replacing general purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates the design space exploration of manycore architectures. © 2019 The Authors. Published by Elsevier B.V.
Performance and power requirements have pushed computer architectures from single core to manycores. These requirements now continue pushing manycores with identical cores (homogeneous) towards manycores with specialized cores (heterogeneous). However, designing heterogeneous manycores is a challenging task due to the complexity of the architectures. We propose an approach for designing domain specific heterogeneous manycore architectures based on building blocks. These blocks are defined as the common computations of the applications within a domain. The objective is to generate heterogeneous architectures by integrating many of these blocks into many simple cores and connecting the cores with a network-on-chip. The proposed approach aims to ease the design of heterogeneous manycore architectures and facilitate the usage of the dark silicon concept. As a case study, we develop an accelerator based on several building blocks, integrate it into a RISC core and synthesize it on a Xilinx Ultrascale FPGA. The results show that executing a hot-spot of an application on an accelerator based on building blocks increases the performance by 15x, with room for further improvement. The area usage increases as well; however, there are potential optimizations to reduce the area usage. © 2018 by the authors
The last ten years have seen performance and power requirements pushing computer architectures using only a single core towards so-called manycore systems with hundreds of cores on a single chip. To further increase performance and energy efficiency, we are now seeing the development of heterogeneous architectures with specialized and accelerated cores. However, designing these heterogeneous systems is a challenging task due to their inherent complexity. We proposed an approach for designing domain-specific heterogeneous architectures based on instruction augmentation through the integration of hardware accelerators into simple cores. These hardware accelerators were determined based on their common use among applications within a certain domain. The objective was to generate heterogeneous architectures by integrating many of these accelerated cores and connecting them with a network-on-chip. The proposed approach aimed to ease the design of heterogeneous manycore architectures, and consequently the exploration of the design space, by automating the design steps. To evaluate our approach, we enhanced our software tool chain with a tool that can generate accelerated cores from dataflow programs. This new tool chain was evaluated with the aid of two use cases: radar signal processing and mobile baseband processing. We could achieve an approximately 4x improvement in performance while executing complete applications on the augmented cores, with a small impact (2.5–13%) on area usage. The generated accelerators are competitive, achieving more than 90% of the performance of hand-written implementations.
This paper proposes a novel method for performing the square root operation on floating-point numbers represented in the IEEE-754 single-precision (binary32) format. The method is implemented using Harmonized Parabolic Synthesis. It is implemented both with and without pipeline stages and synthesized for two different Xilinx FPGA boards.
The implementations show better resource usage and latency results when compared to other similar works, including the Xilinx intellectual property (IP) core that uses the CORDIC method. Any method calculating the square root will make approximation errors. Unless these errors are distributed evenly around zero, they can accumulate and give a biased result. An attractive feature of the proposed method is that it distributes the errors evenly around zero, in contrast to CORDIC, for instance.
Due to its small size, low latency, high throughput, and good error properties, the presented floating-point square root unit is suitable for high performance embedded systems. It can be integrated into a processor's floating point unit or be used as a stand-alone accelerator. © 2019 IEEE.
The growing trend towards the adoption of flexible and heterogeneous parallel computing architectures has increased the challenges faced by the programming community. We propose a method to program an emerging class of reconfigurable processor arrays by using the CSP-based programming model of occam-pi. The paper describes the extension of an existing compiler platform to target such architectures. To evaluate the performance of the generated code, we present three implementations of the DCT algorithm. It is concluded that CSP appears to be a suitable computation model for programming a wide variety of reconfigurable architectures.
This paper highlights the collaboration between industry and academia in research. It describes more than two decades of intensive development and research of new hardware and software platforms to support innovative, high-performance sensor systems with extremely high demands on embedded signal processing capability. The joint research can be seen as the run before a necessary jump to a new kind of computational platform based on parallelism. The collaboration has had several phases, starting with a focus on hardware, then on efficiency, later on software development, and finally on taking the jump and understanding the expected future. In the first part of the paper, these phases and their respective challenges and results are described. Then, in the second part, we reflect upon the motivation for collaboration between company and university, the roles of the partners, the experiences gained and the long-term effects on both sides. Copyright © 2014 ACM.
Video analyzers are employed to perform an in-depth analysis of the coding decisions undertaken during the execution of a video codec. Medical diagnostic videos, which are typically dealt with in telemedicine scenarios, need careful examination to incorporate the optimum coding decisions. This paper deals with the enhancement of an open-source video stream analyzer to facilitate codec development tailored for medical diagnostic videos. The proposed extensions include visual representation of quantitative information for the bit count used at the CTU level, as well as displaying the different mode decisions adopted in the case of merge mode, prediction mode, and intra mode. We have incorporated the said extensions in an HEVC analyzer and validated the approach by using test video sequences for ultrasound, eye, and skin examination. © 2015 IEEE.
We propose that, in order to meet high computational demands, application development has to be based on suitable models of computation that will lead to scalable and reusable implementations. The models should enhance the understanding of the application and at the same time enable the developer to organize the computations so that they can be efficiently mapped to the target reconfigurable architecture. The goal of the thesis is to propose methods to program future coarse-grained reconfigurable architectures in a productive manner, in such a way as to achieve energy efficient mapping with improved performance.
Coarse-grained reconfigurable architectures, which offer massive parallelism coupled with the capability of undergoing run-time reconfiguration, are gaining attention in order to meet not only the increased computational demands of high-performance embedded systems, but also to fulfill the need for adaptability to the functional requirements of the application. This thesis focuses on the programming aspects of such coarse-grained reconfigurable computing devices, including the relevant computation models that are capable of exposing different kinds of parallelism inherent in the application and the ability of these models to capture the adaptability requirements of the application. The thesis suggests the occam-pi language for programming a broad class of coarse-grained reconfigurable architectures as an intermediate language; we call it intermediate, since we believe that the application programming is best done in a high-level domain-specific language. The salient properties of the occam-pi language are explicit concurrency with built-in mechanisms for interprocessor communication, provision for expressing dynamic parallelism, support for the expression of dynamic reconfigurations, and placement attributes. To evaluate the programming approach, a compiler framework was extended to support the language extensions in the occam-pi language, and backends were developed to target two different coarse-grained reconfigurable architectures: XPP and Ambric. The results on XPP reveal that the occam-pi based implementations produce comparable throughput to those of NML programs, while programming at a much higher level of abstraction than that of NML. Similarly, the two occam-pi implementations of the autofocus criterion calculation targeting the Ambric platform outperform the CPU implementation by factors of 11-23. Thus, the results of the implemented case studies suggest that the occam-pi language based approach simplifies the development of applications employing run-time reconfigurable devices without compromising the performance benefits.
With the advent of manycore architectures comprising hundreds of processing elements, fault management has become a major challenge. We present an approach that uses the occam-pi language to manage the fault recovery mechanism on a new manycore architecture, the Platform 2012 (P2012). The approach is made possible by extending our previously developed compiler framework to compile occam-pi implementations to the P2012 architecture. We describe the techniques used to translate the salient features of the occam-pi language to the native programming model of the P2012 architecture. We demonstrate the applicability of the approach by an experimental case study, in which the DCT algorithm is implemented on a set of four processing elements. During run-time, some of the tasks are then relocated from assumed faulty processing elements to the faultless ones by means of dynamic reconfiguration of the hardware. The working of the demonstrator and the simulation results illustrate not only the feasibility of the approach but also how the use of higher-level abstractions simplifies the fault handling. © 2012 IEEE.
The successful realization of next generation radar systems places high performance demands on the signal processing chain. Among these systems are advanced Active Electronically Scanned Array (AESA) radars, in which complex calculations are to be performed on huge sets of data in real time. Manycore architectures are designed to provide the flexibility and high performance essential for such streaming applications. This paper deals with the implementation of compute-intensive parts of an AESA radar signal processing chain in a high-level dataflow language, CAL. We evaluate the approach by targeting a commercial manycore architecture, Epiphany, and present our findings in terms of performance and productivity gains achieved in this case study. The comparison of the performance results with the reference sequential implementations executing on a state-of-the-art embedded processor shows that we are able to achieve a speedup of 1.6x to 4.4x by using only 10 cores of Epiphany. © Springer Science+Business Media New York 2015
Telemedicine is drawing increasing attention as a means to improve health care delivery. Video coding, being an integral part of any real-time telemedicine system, is used to deliver the diagnostic video stream to a remote physician. Realizing a video coding system customized for telemedicine using the available technologies poses several challenges. In this paper, we have analyzed state-of-the-art video codecs for adoption in a telemedicine-customized video conferencing system under low-bandwidth and low-computation scenarios. The experimental results for the selected video codec implementations on medical videos reveal that the HEVC encoder achieves equivalent objective video quality while using approximately 60% of the bit rate on average. However, the gain in coding efficiency comes at the expense of increased computational complexity, which could be dealt with by incorporating adaptive interpolation and selective quality enhancement techniques to achieve real-time performance. © 2015 IEEE.
The future trend in microprocessors for the more advanced embedded systems is focusing on massively parallel reconfigurable architectures, consisting of heterogeneous ensembles of hundreds of processing elements communicating over a reconfigurable interconnection network. However, mastering the low-level micro-architectural details involved in programming such massively parallel platforms becomes too cumbersome, which limits their adoption in many applications. Thus there is a dire need for an approach to produce high-performance scalable implementations that harness the computational resources of the emerging reconfigurable platforms. This paper addresses the grand challenge of accessibility of these diverse reconfigurable platforms by suggesting the use of a high-level language, occam-pi, and developing a complete design flow for building, compiling, and generating machine code for heterogeneous coarse-grained hardware. We have evaluated the approach by implementing complex industrial case studies and three common signal processing algorithms. The results of the implemented case studies suggest that the occam-pi language based approach, because of its well-defined semantics for expressing concurrency and reconfigurability, simplifies the development of applications employing run-time reconfigurable devices. The associated compiler framework ensures portability as well as performance benefits across heterogeneous platforms.
Over the years, reconfigurable computing devices such as FPGAs have evolved from gate-level glue logic to complex reprogrammable processing architectures. However, the tools used for mapping computations to such architectures still require knowledge about the architectural details of the target device to extract efficiency. A study of the Mobius language and tools is presented in this paper, with a focus on generated hardware performance. A number of streaming and memory-intensive applications have been developed and the results have been compared with the corresponding implementations in VHDL and a behavioral hardware description language. Based upon experimental evidence, it is concluded that Mobius, a minimal parallel processing language targeted at reconfigurable architectures, enhances productivity in terms of design time and code maintainability without considerably compromising performance and resources.
Embedded signal processing is facing the challenges of increased performance demands as well as the need for energy efficiency. Massively parallel processor arrays (MPPAs) consisting of tens or hundreds of processing cores offer the possibility of meeting the growing performance demand in an energy efficient way by exploiting parallelism instead of scaling the clock frequency of a single powerful processor.
In this paper, we evaluate two variants of MPPAs by implementing a significantly large case study, namely an autofocus criterion calculation, which is a key component in modern synthetic aperture radar systems. The implementation results from the two target architectures are compared on the basis of utilized resources, performance, and energy efficiency. The Ambric implementations demonstrate the usefulness of the occam-pi based high-level language approach in utilizing hundreds of processors, whereas the Epiphany implementation reveals that energy efficiency can be improved even further, by a factor of 2-3 with respect to the Ambric implementations, and can be achieved at high clock speeds. © 2013 IEEE.
In order to meet the increased computational demands of, e.g., multimedia applications, such as video processing in HDTV, and communication applications, such as baseband processing in telecommunication systems, the architectures of reconfigurable devices have evolved to coarse-grained compositions of functional units or program controlled processors, which are operated in a coordinated manner to improve performance and energy efficiency. In this survey we explore the field of coarse-grained reconfigurable computing on the basis of the hardware aspects of granularity, reconfigurability, and interconnection networks, and discuss the effects of these on energy related properties and scalability. We also consider the computation models that are being adopted for programming such machines, models that expose the parallelism inherent in the application in order to achieve better performance. We classify the coarse-grained reconfigurable architectures into four categories and present some existing examples of these categories. Finally, we identify the emerging trends of the introduction of asynchronous techniques at the architectural level and the use of nano-electronics from a technological perspective in the reconfigurable computing discipline.
Recently we proposed occam-pi as a high-level language for programming coarse grained reconfigurable architectures. The constructs of occam-pi combine ideas from CSP and pi-calculus to facilitate expressing parallelism, communication, and reconfigurability. The feasibility of this approach was illustrated by developing a compiler framework to compile occam-pi implementations to the Ambric architecture. In this paper, we demonstrate the applicability of occam-pi for programming an array of functional units, the eXtreme Processing Platform (XPP). This is made possible by extending the compiler framework to target the XPP architecture, including automatic floating- to fixed-point conversion. Different implementations of a FIR filter and a DCT algorithm were developed and evaluated on the basis of performance and resource consumption. The reported results reveal that the approach of using occam-pi to program this category of coarse grained reconfigurable architectures appears to be promising. The resulting implementations are generally much superior to those programmed in C and comparable to those hand-coded in the low-level native language NML.
Massively parallel reconfigurable architectures, which offer massive parallelism coupled with the capability of undergoing run-time reconfiguration, are gaining attention in order to meet the increased computational demands of high-performance embedded systems. We propose that the occam-pi language be used for programming this category of massively parallel reconfigurable architectures. The salient properties of the occam-pi language are explicit concurrency with built-in mechanisms for interprocessor communication, provision for expressing dynamic parallelism, support for the expression of dynamic reconfigurations, and placement attributes. To evaluate the programming approach, a compiler framework was extended to support the language extensions in the occam-pi language and a backend was developed to target the Ambric array of processors. We present two case studies: a DCT implementation exploiting the reconfigurability feature of occam-pi, and a significantly large autofocus criterion calculation based on the dynamic parallelism capability of the occam-pi language. The results of the implemented case studies suggest that the occam-pi language based approach simplifies the development of applications employing run-time reconfigurable devices without compromising the performance benefits. Copyright © 2012 Zain-ul-Abdin and Bertil Svensson.
Synthetic-Aperture Radar (SAR) systems that are used to create high-resolution radar images from low-resolution aperture data require high computational performance. Manycore architectures are emerging to overcome the computational requirements of the complex radar signal processing.
In this paper, we evaluate a manycore architecture, namely Epiphany, by implementing two significantly large case studies, fast factorized back-projection and autofocus criterion calculation, which are key components in modern synthetic aperture radar systems. The implementation results from the two case studies are compared on the basis of utilized resources and performance. The Epiphany implementations demonstrate the usefulness of the architecture for the streaming algorithm (autofocus criterion calculation) by achieving a speedup of 8.9x with respect to the sequential implementation on an Intel Core i7 processor while operating at a lower clock speed. On the other hand, for the memory-intensive algorithm (fast factorized back-projection), the Epiphany architecture shows a moderate speedup of about 4.25x. The Epiphany implementations of the two algorithms are, respectively, 38x and 78x more energy-efficient.
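For orientation, the C sketch below shows direct (global) back-projection, the baseline of which fast factorized back-projection is a hierarchical refinement. The array names, the nearest-neighbour range lookup and the phase convention are illustrative simplifications, not the Epiphany implementation evaluated in the paper.

```c
/* Sketch of direct (global) back-projection; fast factorized back-projection
 * is a hierarchical refinement of this baseline. For every image pixel the
 * range to the platform is computed for each pulse, the range-compressed echo
 * is sampled (nearest bin here) and accumulated with a phase compensation.
 * Names, the sampling and the phase sign convention are illustrative. */
#include <math.h>
#include <complex.h>

static void backproject(float complex *image, int nx, int ny,
                        const float complex *echo, /* [npulses][nr], range-compressed */
                        int npulses, int nr, float r0, float dr, float wavelength,
                        const float *px, const float *py, const float *pz, /* platform track */
                        const float *gx, const float *gy)                  /* pixel grid */
{
    const float pi = 3.14159265358979f;
    for (int iy = 0; iy < ny; iy++)
        for (int ix = 0; ix < nx; ix++) {
            float complex acc = 0.0f;
            for (int p = 0; p < npulses; p++) {
                float dx = gx[ix] - px[p];
                float dy = gy[iy] - py[p];
                float r  = sqrtf(dx * dx + dy * dy + pz[p] * pz[p]);
                int bin  = (int)((r - r0) / dr + 0.5f);   /* nearest range bin */
                if (bin < 0 || bin >= nr) continue;
                acc += echo[p * nr + bin] * cexpf(I * (4.0f * pi * r / wavelength));
            }
            image[iy * nx + ix] = acc;
        }
}
```

The irregular, data-dependent memory accesses into the echo array illustrate why this algorithm is memory-intensive and scales less well on Epiphany than the streaming autofocus computation.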
Embedded electronic systems are finding increased application in our daily life. In order to meet the application demands in embedded systems, parallel computing is used. This paper emphasizes the teaching of the specific issues of parallel computing that are critical to embedded systems. We propose an analytical approach to deliver declarative and functioning knowledge for learning in the field of computer science and engineering, with a special focus on Embedded Parallel Computing (EPC). We describe the teaching of a course with a focus on how parallel computing can be used to enhance performance and improve the energy efficiency of embedded systems. The teaching methods include interactive lectures with web-based course literature, seminars, lab exercises and home-assigned practical tasks. Further, the course is intended to give a general insight into current research and development with regard to parallel architectures and computation models. Since the course is an advanced level course, the students are expected to have basic knowledge about the fundamentals of computer architecture and common programming methodologies. The course puts emphasis on hands-on experience with embedded parallel computing. Therefore it includes an extensive laboratory and project part, in which a state-of-the-art manycore embedded computing system is used. We believe that undertaking these methods in succession will prepare the students for both research and a professional career. © 2015 ACM.
Real-time performance is critical for the successful realization of next generation radar systems. Among these are advanced Active Electronically Scanned Array (AESA) radars in which complex calculations are to be performed on huge sets of data in real time. Manycore architectures are designed to provide flexibility and high performance essential for such streaming applications.
This paper deals with the implementation of compute-intensive parts of an AESA radar signal processing chain in a high-level dataflow language, CAL. We evaluate the approach by targeting a commercial manycore architecture, Epiphany, and present our preliminary findings in terms of performance and productivity gains achieved in this case study. © 2014 IEEE.
The next generation radar systems have high performance demands on the signal processing chain. Examples include advanced image creating sensor systems in which complex calculations are to be performed on huge sets of data in real time. Manycore architectures are gaining attention as a means to overcome the computational requirements of complex radar signal processing by exploiting the massive parallelism inherent in the algorithms in an energy efficient manner.
In this paper, we evaluate a manycore architecture, namely a 16-core Epiphany processor, by implementing two significantly large case studies, viz. an autofocus criterion calculation and the fast factorized back-projection algorithm, both key components in modern synthetic aperture radar systems. The implementation results from the two case studies are compared on the basis of achieved performance and programmability. One of the Epiphany implementations demonstrates the usefulness of the architecture for the streaming based algorithm (the autofocus criterion calculation) by achieving a speedup of 8.9x over a sequential implementation on a state-of-the-art general-purpose processor of a later silicon technology generation operating at a 2.7x higher clock speed. On the other case study, a highly memory-intensive algorithm (fast factorized back-projection), the Epiphany architecture shows a speedup of 4.25x. For embedded signal processing, low power dissipation is equally as important as computational performance. In our case studies, the Epiphany implementations of the two algorithms are, respectively, 78x and 38x more energy efficient. © 2013 IEEE
The next generation radar systems have high performance demands on the signal processing chain. Among these are advanced image creating sensor systems in which complex calculations are to be performed on huge sets of data in real time. Massively Parallel Processor Arrays (MPPAs) are gaining attention as a means to cope with the computational requirements of complex radar signal processing by exploiting the massive parallelism inherent in the algorithms in an energy efficient manner.
In this paper, we evaluate two such massively parallel architectures, namely Ambric and Epiphany, by implementing a significantly large case study of autofocus criterion calculation, which is a key component in future synthetic aperture radar systems. The implementation results from the two architectures are compared on the basis of achieved performance, energy efficiency, and programmability. ©2013 IEEE.
Today, many of the high performance embedded processors already contain multiple processor cores, and we see heterogeneous manycore architectures being proposed. Therefore it is very desirable to have a fast way to explore various heterogeneous architectures through the use of an architectural design space exploration tool, giving the designer the option to explore design alternatives before the physical implementation. In this paper, we have extended Heracles, a design space exploration tool for (homogeneous) manycore architectures, to incorporate different types of processing cores, and thus allow us to model heterogeneity. Our tool, called the Heterogeneous Heracles System (HHS), can, besides the already supported MIPS core, also include OpenRISC cores. The new tool retains the possibility available in Heracles to perform register transfer level (RTL) simulations of each explored architecture in Verilog as well as synthesizing it to field-programmable gate arrays (FPGAs). To facilitate the exploration of heterogeneous architectures, we have also extended the graphical user interface (GUI) to support heterogeneity. This GUI provides options to configure the types of cores, core settings, memory system and network topology. Some initial results on FPGA utilization are presented from synthesizing both homogeneous and heterogeneous manycore architectures, as well as some benchmark results from both simulated and synthesized architectures.
Dataflow programming is a promising paradigm for high performance embedded parallel computing. When mapping a dataflow program onto a manycore architecture, a key component is the library used to express the communication between the actors. In this paper we present a dataflow communication library supporting the CAL actor language. A first implementation of the communication library is created for Adapteva's manycore architecture Epiphany, which contains an on-chip 2-D mesh network. Three different buffering methods, with and without direct memory access (DMA) transfer, have been implemented and evaluated. We have also made a preliminary study of the effect of the mapping strategies of the actors onto the cores. The assessment of the library is based on a CAL implementation of a two-dimensional inverse discrete cosine transform (2D-IDCT) and our own CAL-to-C compilation framework. As expected, the results show that the most efficient actor-to-core mapping strategy is to keep the communication to a nearest-neighbor communication pattern as much as possible. Thus, the best way to place a pipelined sequence of computations like our 2D-IDCT is to place the actors into cores in a serpentine fashion. For this application we found that the simple receiver-side buffer outperforms the more complicated buffering strategies that used DMA transfer.
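The kind of receiver-side buffering the library compares can be pictured with the following single-producer/single-consumer ring buffer sketch in C. Names, sizes and the omission of memory-ordering details are simplifications, and this is not the actual library API; it only shows the basic mechanism of a sender writing tokens into a buffer in the receiver's local memory.

```c
/* Sketch of a single-producer/single-consumer FIFO of the kind a
 * receiver-side buffering scheme can be built on: the sender writes tokens
 * into a ring buffer located in the receiver's local memory and only the
 * indices are shared. Memory-ordering/barrier details of a real manycore
 * implementation are intentionally omitted. */
#include <stdint.h>

#define FIFO_CAPACITY 16  /* assumed number of tokens per channel */

typedef struct {
    volatile uint32_t head;             /* advanced by the consumer */
    volatile uint32_t tail;             /* advanced by the producer */
    int32_t           data[FIFO_CAPACITY];
} Fifo;

static int fifo_push(Fifo *f, int32_t token)    /* returns 0 if full */
{
    uint32_t next = (f->tail + 1) % FIFO_CAPACITY;
    if (next == f->head) return 0;              /* channel full, caller retries */
    f->data[f->tail] = token;
    f->tail = next;
    return 1;
}

static int fifo_pop(Fifo *f, int32_t *token)    /* returns 0 if empty */
{
    if (f->head == f->tail) return 0;           /* no tokens available */
    *token = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    return 1;
}
```

With nearest-neighbor placement, each push is a short on-chip write, which is consistent with the observation that the simple receiver-side buffer can outperform DMA-based schemes for this application.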
The adoption of run-time reconfigurable parallel architectures for high-performance embedded systems is constrained by the lack of a unified programming model that can express both parallelism and reconfigurability. We propose to program an emerging class of reconfigurable processor arrays by using the programming model of occam-pi and describe how the extensions of channel direction specifiers, mobile data, dynamic process invocation, and process placement attributes can be used to express run-time reconfiguration in occam-pi. We present implementations of the DCT algorithm to demonstrate the applicability of occam-pi for expressing reconfigurability. We conclude that occam-pi appears to be a suitable programming model for programming run-time reconfigurable processor arrays.
Recently we proposed occam-pi as a high-level language for programming massively parallel reconfigurable architectures. The design of occam-pi incorporates ideas from CSP and pi-calculus to facilitate expressing parallelism and reconfigurability. The feasibility of this approach was illustrated by building three occam-pi implementations of DCT executing on an Ambric. However, because DCT is a simple and well-studied algorithm, it remained uncertain whether occam-pi would also be effective for programming novel, more complex algorithms.
In this paper, we demonstrate the applicability of occam-pi for expressing various degrees of parallelism by implementing a significantly large case study of the focus criterion calculation in an autofocus algorithm on the Ambric architecture. Autofocus is a key component of synthetic aperture radar systems. Two implementations of the focus criterion calculation were developed and evaluated on the basis of performance. The comparison of the performance results with a single threaded software implementation of the same algorithm shows that the throughput of the two implementations is 11x and 23x higher than that of the sequential implementation, despite a much lower (9x) clock frequency. The two designs are, respectively, 29x and 40x more energy efficient.