The three-dimensional structure tensor algorithm (3D-STA) is often used in image processing applications to compute the optical flow or to detect local 3D structures and their directions. The algorithm is computationally expensive because the gradient, the tensor, and the smoothing must be computed for every pixel of the image frames. Therefore, it is important to parallelize the implementation to achieve high performance. In this paper we present two parallel implementations of 3D-STA, a moderately parallelized and a highly parallelized implementation, on a massively parallel reconfigurable array. Finally, we evaluate the performance of the generated code and compare the results with another optical flow implementation. The throughput achieved by the moderately parallelized implementation is approximately half that of the optical flow implementation, whereas the highly parallelized implementation yields a 2x gain in throughput compared to the optical flow implementation. © 2012 IEEE.
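The per-pixel work the abstract refers to, i.e. gradient, tensor, and smoothing, can be sketched in a few lines. The following is an illustrative pure-Python sketch of the 3D structure tensor at one voxel (central differences, outer products, 3x3x3 box smoothing); the function names, neighbourhood size and smoothing kernel are assumptions for illustration, not the paper's implementation:

```python
def structure_tensor_3d(vol):
    """Smoothed 3D structure tensor at the central voxel of a 5x5x5
    volume vol[t][y][x]: central-difference gradients, per-voxel outer
    products, and averaging over a 3x3x3 box (illustrative sketch)."""
    def grad(t, y, x):
        gx = (vol[t][y][x + 1] - vol[t][y][x - 1]) / 2.0
        gy = (vol[t][y + 1][x] - vol[t][y - 1][x]) / 2.0
        gt = (vol[t + 1][y][x] - vol[t - 1][y][x]) / 2.0
        return gx, gy, gt

    keys = ("Jxx", "Jxy", "Jxt", "Jyy", "Jyt", "Jtt")
    acc = dict.fromkeys(keys, 0.0)
    c = 2  # centre of the 5x5x5 volume
    for t in range(c - 1, c + 2):
        for y in range(c - 1, c + 2):
            for x in range(c - 1, c + 2):
                gx, gy, gt = grad(t, y, x)
                for k, p in zip(keys, (gx*gx, gx*gy, gx*gt,
                                       gy*gy, gy*gt, gt*gt)):
                    acc[k] += p
    return {k: v / 27.0 for k, v in acc.items()}

# A linear ramp f(t, y, x) = 2x + 3y + t has constant gradient (2, 3, 1),
# so the smoothed tensor entries are Jxx = 4, Jxy = 6, Jyy = 9, Jtt = 1.
vol = [[[2*x + 3*y + t for x in range(5)] for y in range(5)] for t in range(5)]
J = structure_tensor_3d(vol)
```

The two implementations in the paper differ in how aggressively such per-voxel work is spread over the array's processing elements.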
Embedded Intelligent Systems (EIS) is the joint research field of the four collaborating laboratories at the School of Information Science, Computer and Electrical Engineering (IDE) at Halmstad University. The research of the four labs is integrated into a strong concerted research environment within embedded systems (EIS), with a perspective reaching from the enabling technology via new system solutions and intelligent applications to end-user aspects and business models. It is an expanding research area with many applications, not least in everyday life. EIS is an important research environment contributing to the regional Triple Helix innovation system Healthcare Technology, which the region has pointed out as a prioritised development sector. With its strong connections to both established firms and new, expanding firms hived off from the university, the research environment is active in the Healthcare Technology Alliance, a network of around sixty companies, counties and health care providers in south-western Sweden with the aim of developing the region into a leading arena for the development of health technology products and services. Several projects together with these participants concern both research and technology transfer. An integrated gender and gender-equality perspective in innovations within the health technology area is necessary in order to meet the needs of an ageing population with quality innovations. The relevance of a gender perspective is clear given that about 70% of all those older than 75 years are women. Older women are on average cared for in hospital twice as long as men, partly due to differing disease panoramas, but also because men are more often cared for at home by a woman, while the women, who live longer, more often live alone.
With the expansion of home help and home nursing, new needs follow, and it is likely that a gender perspective will become necessary for the development of products and services that can make daily life easier for the elderly. The gender perspective is also relevant from the point of view of care staff. New technology is developed for application within the health and care sector, where the larger professional groups consist mainly of women. The technology, most often designed by men, is used by women. With this in mind, it is clear that an important aspect of good innovations is that the end users are involved in the innovation process. Based on an awareness of the need for a more articulated gender perspective within the research environment, in order to meet the needs expressed above, an application for a gender-inclusive R&D project was submitted to the VINNOVA programme Applied Gender Research in Strong Research and Innovation Environments. The G-EIS project (Gender Perspective on Embedded Intelligent Systems - Application in Healthcare Technology) was approved and started in 2009. The project involves researchers from the EIS research environment as well as representatives from companies and the public sector. The project participants on the whole agree on the need for a gender perspective in the R&I environment, but struggle with the meeting of two epistemologically opposed theories of science. The understanding within gender studies that research and production both create reality and are informed by it is not always accepted within the areas of natural science. Engineering and other technological sciences not only consider aspects of science to be separate from reality, but also seek positivistic proof in research, something not always possible in the more qualitative research of the social sciences. Researching how these two perspectives meet within this specific project is the topic of this paper.
This paper presents a configurable framework to be used for rapid prototyping of stream-based languages. The framework is based on a set of design patterns defining the elementary structure of a domain-specific language for high-performance signal processing. A stream language prototype for baseband processing has been implemented using the framework. We introduce language constructs to efficiently handle dynamic reconfiguration of distributed processing parameters. It is also demonstrated how new language-specific primitive data types and operators can be used to efficiently and machine-independently express computations on bitfields and data-parallel vectors. These types and operators yield code that is readable, compact and amenable to a stricter type checking than is common practice. They make it possible for a programmer to explicitly express parallelism to be exploited by a compiler. In short, they provide a programming style that is less error prone and has the potential to lead to more efficient implementations.
The programming complexity of increasingly parallel processors calls for new tools that assist programmers in utilising the parallel hardware resources. In this paper we present a set of models that we have developed as part of a tool for mapping dataflow graphs onto manycores. One of the models captures the essentials of manycores identified as suitable for signal processing, and which we use as target for our algorithms. As an intermediate representation we introduce timed configuration graphs, which describe the mapping of a model of an application onto a machine model. Moreover, we show how a timed configuration graph by very simple means can be evaluated using an abstract interpretation to obtain performance feedback. This information can be used by our tool and by the programmer in order to discover improved mappings.
The programming complexity of increasingly parallel processors calls for new tools to assist programmers in utilising the parallel hardware resources. In this paper we present a set of models that we have developed to form part of a tool intended for iteratively tuning the mapping of dataflow graphs onto manycores. One of the models is used for capturing the essentials of manycores that are identified as suitable for signal processing and which we use as target architectures. Another model is the intermediate representation in the form of a timed configuration graph, describing the mapping of a dataflow graph onto a machine model. Moreover, this IR can be used for performance evaluation by abstract interpretation. We demonstrate how the models can be configured and applied in order to map applications onto the Raw processor. Furthermore, we report promising results on the accuracy of the performance predictions produced by our tool. It is also demonstrated that the tool can be used to rank different mappings with respect to throughput and end-to-end latency optimisation.
The goal of the REMAP project was to gain new knowledge about the design and use of massively parallel computer architectures in embedded real-time systems. In order to support adaptive and learning behavior in such systems, the efficient execution of Artificial Neural Network (ANN) algorithms on regular processor arrays was in focus. The REMAP-β parallel computer built in the project was designed with ANN computations as the main target application area. This chapter gives an overview of the computational requirements found in ANN algorithms in general and motivates the use of regular processor arrays for the efficient execution of such algorithms. REMAP-β was implemented using the FPGA circuits that were available around 1990. The architecture, following the SIMD principle (Single Instruction stream, Multiple Data streams), is described, as well as the mapping of some important and representative ANN algorithms. Implemented in FPGA, the system served as an architecture laboratory. Variations of the architecture are discussed, as well as scalability of fully synchronous SIMD architectures. The design principles of a VLSI-implemented successor of REMAP-β are described, and the paper is concluded with a discussion of how the more powerful FPGA circuits of today could be used in a similar architecture. © 2006 Springer.
The REMAP project addresses questions related to the use of massively parallel, distributed computing in embedded systems. Of specific interest is the execution of artificial neural network algorithms on multiple, cooperating processor arrays. This paper concentrates on the recently finished, and currently used, processor array prototype, REMAP-β, of SIMD (Single Instruction stream, Multiple Data streams) type. The architecture and implementation of the computer is described, both its overall structure and its constituent parts. Following this comes a discussion of its use as an architecture laboratory, which stems from the fact that it is implemented using FPGA (Field Programmable Gate Array) circuits. As an architecture laboratory the prototype can be used to implement and evaluate, e.g., various Processing Element (PE) designs. A couple of examples of PE architectures, including one with floating-point support, are given. The mapping of important neural network algorithms on processor arrays of this kind is shown, and possible tuning of the architecture to meet specific processing demands is discussed. Performance figures are given as well as implications for future VLSI implementations of the array.
A need to apply the massively parallel computing paradigm in embedded real-time systems is foreseen. Such applications put new demands on massively parallel systems, different from those of general purpose computing. For example, time determinism is more important than maximal throughput, physical distribution is often required, size, power, and I/O are important, and interactive development tools are needed. The paper describes an architecture for high-performance, embedded, massively parallel processing, featuring a large number of nodes physically distributed over a large area. A typical node has thousands of processing elements (PEs) organized in SIMD mode and is the size of the palm of a hand. Intermodule communication over a scalable optical network is described. A combination of wavelength division multiplexing (WDM) and time division multiplexing (TDM) is used. © 1994 IEEE.
With the increased degree of miniaturization resulting from the use of modern VLSI technology and the high communication bandwidth available through optical connections, it is now possible to build massively parallel computers based on distributed modules which can be embedded in advanced industrial products. Examples of such future possibilities are "action-oriented systems", in which a network of highly parallel modules performs a multitude of tasks related to perception, cognition, and action. The paper discusses questions of architecture on the level of modules and inter-module communication and gives concrete architectural solutions which meet the demands of typical, advanced industrial real-time applications. The interface between the processor arrays and the all-optical communication network is described in some detail. Implementation issues specifically related to the demand for miniaturization are discussed.
This paper presents a hardware architecture and a software tool needed for future autonomous robots. Specific attention is given to the execution of artificial neural networks and to the need for a good inspection and visualization tool when developing this kind of system. Achievable performance using state-of-the-art technology is estimated and module miniaturization issues are discussed. © 1994 IEEE.
This paper suggests a cluster collision avoidance mechanism and a dual transceiver architecture to be used in a clustered wireless multihop network. These two contributions make the clustered wireless multihop network the preferred architecture for future industrial wireless networks. The wireless multihop cluster consists of one master and several slaves, where some of the slaves will act as gateways between different clusters. Frequency hopping spread spectrum is used on a cluster level and to avoid frequency collisions between clusters a "neighbor cluster collision avoidance mechanism" is proposed and evaluated through simulations. To break up the dependence between the clusters, introduced by the gateway nodes, each node is equipped with two transceivers. The paper is concluded with a suggestion to use a clustered wireless multihop network with orthogonal hopping sequences for an industrial setting.
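The closing suggestion, orthogonal hopping sequences per cluster, can be illustrated with cyclic shifts of one base permutation: clusters hopping over the same base sequence with distinct channel offsets never occupy the same channel in the same slot. The construction below is a textbook illustration assumed for this sketch, not the paper's scheme:

```python
N_CHANNELS = 79  # e.g. a Bluetooth-style channel count (assumption)

def channel(base, offset, slot):
    """Channel used in a given time slot by a cluster with a given offset."""
    return (base[slot % len(base)] + offset) % N_CHANNELS

# Base hopping sequence: a permutation of all channels (7 is coprime to 79).
base = [(7 * i) % N_CHANNELS for i in range(N_CHANNELS)]

# Two clusters with distinct offsets never collide, in any slot.
collisions = sum(1 for slot in range(1000)
                 if channel(base, 0, slot) == channel(base, 1, slot))
```

Because the offsets differ modulo the channel count, the per-slot channels of the two clusters can never coincide, which is exactly the orthogonality property the abstract proposes for neighbouring clusters.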
In this paper some well established wireless technologies are merged into a new concept solution for a future industrial wireless mesh network. The suggested clustered wireless mesh network can handle probabilistic quality of service guarantees and is based on a dual-radio node architecture using synchronized frequency hopping spread spectrum Bluetooth radios. The proposed architecture gives a heuristic solution to the inter-cluster scheduling problem of gateway nodes in clustered architectures and breaks up the dependence between the local medium access schedules of adjacent clusters. The dual-radio feature also enables higher network connectivity, implying, for example, that a higher link redundancy can be achieved.
An important trend in personal area networks is that time-critical applications are becoming more common, e.g., voice over IP, video phone calls and network games. This segment of applications demands quality of service (QoS) guarantees to provide the correct functionality. The Bluetooth standard provides an optional interface to support QoS guarantees, but the standard does not suggest any actual implementation. A wireless communication channel is stochastic by nature; providing QoS guarantees under this precondition makes traditional deterministic real-time theory obsolete. In this paper a probabilistic fault-tolerance test enabling quality of service guarantees in a Bluetooth piconet is given. The basic Bluetooth network architecture is based on a master-slave configuration, i.e., a point-to-point connection. More advanced network architectures are possible, where up to eight Bluetooth-equipped units can be active members of one network (piconet). Furthermore, several piconets can interconnect and form a so-called scatternet.
It is expected that wireless sensor networks will be used in home automation and industrial manufacturing in the future. The main driving forces for wireless sensor networks are fault tolerance, energy gain and spatial capacity gain. Unfortunately, an often forgotten issue is the capacity limit that the network topology of a wireless sensor network represents. In this paper we identify gains, losses and limitations in a wireless sensor network, using a simplified theoretical network model. In particular, we want to point out the stringent capacity limitations that this simplified network model reveals. The main contribution is a comparison between the locality of the performed information exchange and the average capacity available to each node.
In this paper we investigate the impact of node mobility in a wireless ad hoc network (WAHN). In particular, we investigate the possibility to provide guaranteed services in a WAHN, i.e., the predictability of the network topology. We combine link expiration time (LET) estimation with information propagation speed (IPS) in a time-space diagram, and as a result an operation area is revealed. The result shows that a WAHN with mobile nodes has a knowledge horizon (KH), the distance of which depends on the mobility of the nodes. Beyond the KH, knowledge about the network state is impossible to achieve. Thus, we cannot predict the long-distance network topology state when node mobility is high.
Embedded DSP computing is currently shifting towards manycore architectures in order to cope with the ever-growing computational demands. Actor-based dataflow languages are being considered as a programming model. In this paper we present a code generator for CAL, one such dataflow language. We propose to use a compilation tool with two intermediate representations. We start from a machine model of the actors that provides an ordering for testing of conditions and firing of actions. We then generate an Action Execution Intermediate Representation that is closer to a sequential imperative language like C or Java. We describe our two intermediate representations and show the feasibility and portability of our approach by compiling a CAL implementation of the Two-Dimensional Inverse Discrete Cosine Transform on a general-purpose processor, on the Epiphany manycore architecture and on the Ambric massively parallel processor array. © 2014 IEEE.
In this paper, we propose a system suitable for embedded signal processing with extreme performance demands. The system consists of several computational modules that work independently and send data simultaneously in order to achieve high throughput. Each computational module is composed of multiple processors connected in a hypercube topology to meet scalability and high bisection bandwidth requirements. Free-space optical interconnects and planar packaging technology make it possible to transform the hypercubes into planes and to take advantage of many optical properties. For instance, optical fan-out reduces hardware cost. This, altogether, makes the system capable of meeting high performance demands in, e.g., massively parallel signal processing. An example system with eleven computational modules and an overall peak performance greater than 2.8 TFLOPS is presented. The effective inter-module bandwidth in this configuration is 1,024 Gbit/s.
The speed and complexity of integrated circuits are increasing rapidly. For instance, today's mainstream processors have already surpassed gigahertz global clock frequencies on-chip. As a consequence, many algorithms proposed for applications in embedded signal-processing (ESP) systems, e.g. radar and sonar systems, can be implemented with a reasonable number (less than 1000) of processors, at least in terms of computational power. An extreme inter-processor network is required, however, to completely implement those algorithms. The demands are such that completely new interconnection architectures must be considered.
In the search for new architectures, developers of parallel computer systems can take advantage of optical interconnects. The main reason for introducing optics, from a system point of view, is the strength of combining benefits that enable new architectural concepts, e.g. free-space propagation and easy fan-out, with benefits that can be exploited by simply replacing the electrical links with optical ones without changing the architecture, e.g. high bandwidth and complete galvanic isolation.
In this paper, we propose a system suitable for embedded signal processing with extreme performance demands. The system consists of several computational modules that work independently and send data simultaneously in order to achieve high throughput. Each computational module is composed of multiple processors connected in a hypercube topology to meet scalability and high bisection bandwidth requirements. Free-space optical interconnects and planar packaging technology make it possible to arrange the hypercubes as planes with an associated three-dimensional communications space and to take advantage of many optical properties. For instance, optical fan-out reduces hardware cost. Altogether, this makes the system capable of meeting high performance demands in, for example, massively parallel signal processing. One 64-channel airborne radar system with nine computational modules and a sustained computational speed of more than 1.6 Tera floating point operations per second (TFLOPS) is presented. The effective inter-module bandwidth in this configuration is 1,024 Gbit/s.
In this paper, we deal with the key issues in implementing an optoelectronic architecture suitable for embedded signal processing. The architecture is based on a system concept where free-space optical interconnects and planar packaging technologies make it possible to merge complicated and new parallel computer architectures into planes and to take advantage of many properties of optics. For instance, optical fan-out reduces the hardware cost as well as the all-to-all broadcast time. It is also possible to meet scalability and high bisection bandwidth requirements. The main results show that it is possible to build a 6D hypercube using planar optical substrates.
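The hypercube claims in these abstracts rest on two standard topology facts: a d-dimensional hypercube has d·2^(d-1) links in total and a bisection width of 2^(d-1) links. A small sketch, illustrative rather than taken from the papers, that counts both for the 6D case:

```python
def hypercube_edges(d):
    """All links of a d-dimensional hypercube: nodes are 0..2^d - 1 and
    two nodes are linked iff their addresses differ in exactly one bit."""
    return [(node, node ^ (1 << bit))
            for node in range(1 << d)
            for bit in range(d)
            if node < node ^ (1 << bit)]  # count each link once

def bisection_width(d):
    """Links crossing the cut on the highest-order address bit."""
    top = 1 << (d - 1)
    return sum(1 for a, b in hypercube_edges(d) if a < top <= b)

# A 6D hypercube: 64 nodes, 6 * 2^5 = 192 links, bisection width 2^5 = 32.
```

The high bisection width relative to node count is what makes the topology attractive for the all-to-all communication patterns of the radar algorithms discussed below.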
To keep up with the explosive growth of world-wide network traffic, large-capacity switches, with switching capacities in excess of several terabits per second, are becoming an essential part of the future. To realize such switches, new architectural concepts must be considered. In this paper, we discuss a technology for terabit switches that combines the advantage of using optical communication in all three spatial dimensions with the benefits of using surface-mounted optoelectronic as well as electronic chips. We present three different types of packet-based switch fabrics, all based on the optical planar interconnection technology. We then discuss these in terms of capacity, scalability, and physical size. All three implementations have a single switch-plane cross-sectional bandwidth exceeding 5 Tbps.
In this paper, we consider the mapping of two radar algorithms on a new scalable hardware architecture. The architecture consists of several computational modules that work independently and send data simultaneously in order to achieve high throughput. Each computational module is composed of multiple processors connected in a hypercube topology to meet scalability and high bisection bandwidth requirements. Free-space optical interconnects and planar packaging technology make it possible to transform the hypercubes into planes. Optical fan-out reduces the number of optical transmitters and thus the hardware cost. Two example systems are analyzed and mapped onto the architecture. One 64-channel airborne radar system with a sustained computational load of more than 1.6 TFLOPS, and one ground-based 128-channel radar system with extreme inter-processor communication demands.
Efficient utilization of available resources is a key concept in embedded systems. This paper focuses on providing support for managing dynamic reconfiguration of computing resources in the programming model. We present an approach to map occam-pi programs to a manycore architecture, Platform 2012 (P2012). We describe the techniques used to translate the salient features of the occam-pi language to the native programming model of the P2012 architecture. We present initial results from a case study of matrix multiplication. Our results show the simplicity of the occam-pi program, with a 6x reduction in lines of code.
Manycore architectures are gaining attention as a means to meet the performance and power demands of high-performance embedded systems. However, their widespread adoption is sometimes constrained by the need for mastering proprietary programming languages that are low-level and hinder portability. We propose the use of the concurrent programming language occam-pi as a high-level language for programming an emerging class of manycore architectures. We show how to map occam-pi programs to the manycore architecture Platform 2012 (P2012). We describe the techniques used to translate the salient features of the language to the native programming model of the P2012. We present the results from a case study on a representative algorithm in the domain of real-time image processing: a complex algorithm for corner detection called Features from Accelerated Segment Test (FAST). Our results show that the occam-pi program is much shorter, is easier to adapt and has a competitive performance when compared to versions programmed in the native programming model of P2012 and in OpenCL.
The project Gender Perspective on Embedded Intelligent Systems – Application in Healthcare Technology (G-EIS) financed by Vinnova is integrated into the research environment Embedded Intelligent Systems (EIS) at Halmstad University. EIS is contributing to the regional Triple Helix innovation system Healthcare Technology by developing new technology for application within the health and care sector, and there is an outspoken need for a more articulated gender perspective within the research environment. The project is inspired by the Technoscientific gender research. It has a qualitative and action research approach and is oriented toward development. In the project process the difference between epistemological cultures has been obvious. In the interaction between the researchers we realize that engineering and other technological sciences not only consider aspects of science to be separate from reality, but also seek positivistic proof in research, something not always possible in the more qualitative research of the social sciences. In the paper we discuss how to bridge and create understanding between sciences and different epistemological cultures.
The Harmonized Parabolic Synthesis methodology is a further development of the Parabolic Synthesis methodology for approximation of unary functions such as trigonometric functions, logarithms and the square root, as well as binary functions such as division, in hardware. These functions are extensively used in computer graphics, digital signal processing, communication systems, robotics, astrophysics, fluid physics and many other application areas. For these high-speed applications, software solutions are in many cases not sufficient and a hardware implementation is therefore needed. The Harmonized Parabolic Synthesis methodology has two outstanding advantages: it is parallel, thus reducing the execution time, and it is based on low-complexity operations, thus being simple to implement in hardware. A notable difference in the Harmonized Parabolic Synthesis methodology compared to many other approximation methodologies is that it is a multiplicative and not an additive methodology. Without harming the favorable distribution of the approximation error presented in earlier described Parabolic Synthesis methodologies, it is possible to significantly enhance the performance of the Harmonized Parabolic Synthesis methodology in terms of reducing chip area, computation delay and power consumption. Furthermore, it increases the possibility to tailor the characteristics of the error, which improves the conditions for subsequent calculations. It also extends the set of unary functions that approximations can be performed upon, since the possibilities to elaborate with the characteristics and distribution of the error increase. To evaluate the proposed methodology, the fractional part of the logarithm has been implemented and its performance is compared to the Parabolic Synthesis methodology. The comparison is made with 15-bit resolution.
The design implemented using the Harmonized Parabolic Synthesis methodology performs 3x better than the Parabolic Synthesis implementation in terms of throughput. In terms of energy consumption, the new methodology consumes 90% less. The chip area is 70% smaller than for the Parabolic Synthesis methodology. In summary, the new technology presented in this paper further increases the advantages of Parabolic Synthesis.
The Harmonized Parabolic Synthesis methodology is a further development of the Parabolic Synthesis methodology for approximation of unary functions such as trigonometric functions, logarithms and the square root with moderate accuracy for ASIC implementation. These functions are extensively used in computer graphics, communication systems and many other application areas. For these high-speed applications, software solutions are in many cases not sufficient and a hardware implementation is therefore needed. The Harmonized Parabolic Synthesis methodology has two outstanding advantages: it is parallel, thus reducing the execution time, and it is based on low-complexity operations, thus being simple to implement in hardware. A difference compared to other approximation methodologies is that it is a multiplicative, and not an additive, methodology. Compared to the Parabolic Synthesis methodologies it is possible to significantly enhance the performance in terms of reducing chip area, computation delay and power consumption. Furthermore, it increases the possibility to tailor the characteristics of the error, improving conditions for subsequent calculations and the performance in design terms. To evaluate the proposed methodology, the fractional part of the logarithm has been implemented and its performance is compared to the Parabolic Synthesis methodology. The comparison is made with 15-bit resolution. The design implemented using the proposed methodology performs 3x better than the Parabolic Synthesis implementation in terms of throughput. In terms of energy consumption, the new methodology consumes 90% less. The chip area is 70% smaller than for the Parabolic Synthesis methodology. In summary, the new technology further increases the advantages of Parabolic Synthesis. © 2017 The Author(s)
The Parabolic Synthesis methodology is an approximation methodology for implementing unary functions, such as trigonometric functions, logarithms and square root, as well as binary functions, such as division, in hardware. Unary functions are extensively used in baseband for wireless/wireline communication, computer graphics, digital signal processing, robotics, astrophysics, fluid physics, games and many other areas. For high-speed applications as well as in low-power systems, software solutions are not sufficient and a hardware implementation is therefore needed. The Parabolic Synthesis methodology is a way to implement functions in hardware based on low complexity operations that are simple to implement in hardware. A difference in the Parabolic Synthesis methodology compared to many other approximation methodologies is that it is a multiplicative, in contrast to additive, methodology. To further improve the performance of Parabolic Synthesis based designs, the methodology is combined with Second-Degree Interpolation. The paper shows that the methodology provides a significant reduction in chip area, computation delay and power consumption with preserved characteristics of the error. To evaluate this, the logarithmic function was implemented, as an example, using the Parabolic Synthesis methodology in comparison to the Parabolic Synthesis methodology combined with Second-Degree Interpolation. To further demonstrate the feasibility of both methodologies, they have been compared with the CORDIC methodology. The comparison is made on the implementation of the fractional part of the logarithmic function with a 15-bit resolution. The designs implemented using the Parabolic Synthesis methodology – with and without the Second-Degree Interpolation – perform 4x and 8x better, respectively, than the CORDIC implementation in terms of throughput. In terms of energy consumption, the CORDIC implementation consumes 140% and 800% more energy, respectively. 
The chip area is also smaller in the case when the Parabolic Synthesis methodology combined with Second-Degree Interpolation is used. © 2016 Elsevier B.V. All rights reserved.
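The Parabolic Synthesis hardware architecture itself is not reproduced here, but the underlying idea of approximating the fractional part of the logarithm with low-degree interpolation over intervals can be illustrated with a minimal Python sketch. The function name, interval count and Lagrange formulation below are illustrative assumptions, not the paper's design:

```python
import math

def log2_frac_interp(x, intervals=32):
    """Piecewise second-degree interpolation of log2(1 + x), x in [0, 1).

    A generic illustration of second-degree interval interpolation (not
    the Parabolic Synthesis architecture itself): each interval fits a
    parabola through its two endpoints and its midpoint.
    """
    i = min(int(x * intervals), intervals - 1)  # interval index
    x0 = i / intervals
    h = 1.0 / intervals
    f0 = math.log2(1.0 + x0)           # value at interval start
    fm = math.log2(1.0 + x0 + h / 2)   # value at interval midpoint
    f1 = math.log2(1.0 + x0 + h)       # value at interval end
    t = (x - x0) / h                   # local coordinate in [0, 1]
    # Lagrange form of the parabola through t = 0, 0.5, 1
    return (f0 * (2 * t - 1) * (t - 1)
            + fm * 4 * t * (1 - t)
            + f1 * t * (2 * t - 1))
```

In a hardware realization the table values and multiplications would of course be fixed-point; the sketch only shows the interpolation structure.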
In applications such as future MIMO communication systems, massive computations of complex matrix operations, such as QR decomposition, are performed. In these matrix operations, the functions roots, inverse and inverse roots are computed in large quantities. Therefore, to obtain high enough performance in such applications, efficient algorithms are highly important. Since these algorithms need to be realized in hardware, it must also be ensured that they meet high requirements in terms of small chip area, low computation time and low power consumption. Power consumption is particularly important since many applications are battery powered. For most unary functions, directly applying an approximation methodology in a straightforward way will not lead to an efficient implementation. Instead, a dedicated algorithm often has to be developed. The functions roots, inverse and inverse roots are in this category. The developed approaches are founded on working in a floating-point format. For the root functions, a change of number base is also used. These procedures not only enable simpler solutions but also increased accuracy, since the approximation algorithm is performed on a mantissa of limited range. As a summarizing example, the inverse square root is chosen. For comparison, the inverse square root is implemented using two methodologies: Harmonized Parabolic Synthesis and the Newton-Raphson method. The novel methodology, Harmonized Parabolic Synthesis (HPS), is chosen since it has been demonstrated to provide very efficient approximations. The Newton-Raphson (NR) method is chosen since it is known for providing a very efficient implementation of the inverse square root. It is also commonly used in signal processing applications for computing approximations on fixed-point numbers of a limited range. Four implementations are made: HPS with 32 and 512 interpolation intervals, and NR with 1 and 2 iterations.
Summarizing the hardware-performance comparisons, the implementations HPS 32, HPS 512 and NR 1 are comparable, while NR 2 is much worse. However, HPS 32 stands out with a better distribution of the error.
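The Newton-Raphson side of the comparison follows the well-known iteration y_{n+1} = y_n(3 − x·y_n²)/2 for 1/√x, in which each iteration roughly doubles the number of correct bits. A minimal software sketch (the paper's fixed-point hardware details and the choice of initial guess are not shown):

```python
def inv_sqrt_nr(x, y0, iterations=2):
    """Newton-Raphson refinement of an initial guess y0 for 1/sqrt(x).

    Each iteration applies y = y * (3 - x * y**2) / 2, roughly doubling
    the number of correct bits, so accuracy is governed by the quality
    of y0 and the iteration count (cf. the NR 1 / NR 2 implementations).
    """
    y = y0
    for _ in range(iterations):
        y = y * (3.0 - x * y * y) / 2.0
    return y
```

For example, starting from the coarse guess y0 = 0.4 for x = 4, two iterations already bring the result close to the exact value 0.5.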
When implementing computation-intensive algorithms on fine-grained parallel architectures, adjusting the trade-off between resources and performance is a big challenge. This paper proposes a methodology for dealing with some of these performance trade-offs by adjusting parallelism at different levels. In a case study, interpolation kernels are implemented on a fine-grained architecture (FPGA) using a high-level language (Mitrion-C). For both cubic and bi-cubic interpolation, one single-kernel, one cross-kernel and two multi-kernel parallel implementations are designed and evaluated. Our results demonstrate that no single level of parallelism can be used for trade-off adjustment. Instead, the appropriate degree of parallelism on each level needs to be found, according to the available resources and the performance requirements of the application. Basing the design on high-level programming simplifies the trade-off process. This research is a step towards automating the choice of parallelization based on a combination of parallelism levels.
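As an illustration of the interpolation kernels involved, a common cubic kernel (Catmull-Rom form) over four samples can be sketched as below; bi-cubic interpolation applies the same kernel first along four rows and then once along the resulting column. This is a generic textbook formulation, not the paper's Mitrion-C implementation:

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic (Catmull-Rom) interpolation between p1 and p2, t in [0, 1].

    Passes through p1 at t = 0 and p2 at t = 1; the outer samples p0
    and p3 shape the tangents at the interval ends.
    """
    return 0.5 * (2.0 * p1
                  + (-p0 + p2) * t
                  + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
                  + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t * t * t)
```

The four multiplications per kernel evaluation, and the sixteen samples needed in the bi-cubic case, are what make the choice of parallelization level on the FPGA significant.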
High-speed signal processing is often performed as a pipeline of functions on streams or blocks of data. In order to obtain both flexibility and performance, parallel, reconfigurable array structures are suitable for such processing. The array topology can be used on both the micro and macro levels, i.e. both when mapping a function on a fine-grained array structure and when mapping a set of functions on different nodes in a coarse-grained array. We outline an architecture on the macro level as well as explore the use of an existing, commercial, word-level reconfigurable architecture on the micro level. We implement an FFT algorithm in order to determine how much of the available resources is needed for controlling the computations. With no program memory or instruction sequencing available, a large fraction of the used resources, 70%, goes to controlling the computations, but this is still more efficient than having statically dedicated resources for control. Data can stream through the array at the maximum I/O rate while computing FFTs. The paper also shows how pipelining of the FFT algorithm over a two-level reconfigurable array of arrays can be done in various ways, depending on the application demands.
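The FFT being mapped is built from radix-2 butterflies. The recursion below sketches the computation itself (in Python, for clarity); the array mapping and the control structure discussed in the paper are not shown:

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT sketch (len(x) a power of two).

    Each recursion level combines the FFTs of the even- and odd-indexed
    samples with butterfly operations; these butterflies are the units
    that get mapped onto the reconfigurable array.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```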
Configurable architectures have emerged as one of the most powerful programmable signal processing platforms commercially available, obtaining their performance through the use of spatial parallelism. By changing the functionality of these devices during run-time, flexible mapping of signal processing applications can be made. The run-time flexibility puts requirements on the reconfiguration time that depend both on the application and on the mapping strategy. In this paper we analyze one such application, Space Time Adaptive Processing for radar signal processing, and show three different mappings and their requirements. The allowed time for run-time reconfiguration in these three cases varies from 1 ms down to 1 µs. Each has its own advantages, such as data reuse and optimization of computational kernels. Architectures with reconfiguration times in the order of 10 µs provide the flexibility needed for mapping the example in an efficient way, allowing for on-chip data reuse between the different processing stages.
One of the most important features of interconnection networks for massively parallel computer systems is scalability. The fiber-optic network described in this paper uses both wavelength division multiplexing and a configurable ratio between optics and electronics to gain an architecture with good scalability. The network connects distributed modules into a large parallel system where each node itself typically consists of parallel processing elements. The paper describes two different implementations of the star topology: one uses an electronic star and fiber-optic connections, the other is purely optical with a passive optical star in the center. The medium access control of the communication concept is presented and some scalability properties are discussed, also involving a multiple-star topology.
Future real-time applications requiring massively parallel computer systems also put high demands on the interconnection network. By connecting several WDM star clusters by a backbone star, forming a star-of-stars network, we get a modular high-bandwidth network. In this paper we show how to achieve time-deterministic packet-switched communication in such networks, even for inter-cluster communication. An analysis of how the deterministic latency and node bandwidth vary with design parameters is presented. We also propose a general clock-synchronization scheme, improving the worst-case latency by up to 33 percent.
In this paper, we propose a high-bandwidth ring network built up with fiber-ribbon point-to-point links. The network has support for both packet-switched and circuit-switched traffic. Very high throughputs can be achieved in the network due to pipelining, i.e., several packets can be traveling through the network simultaneously but in different segments of the ring. The network can be built today using fiber-optic off-the-shelf components. The increasingly good price/performance ratio of fiber-ribbon links indicates great potential for the proposed kind of network. We also present a massively parallel radar signal processing system with exceptionally high demands on the communication network. An aggregated throughput of tens of Gb/s is needed in this application, and this is achieved with the proposed network.
In massively parallel computer systems for embedded real-time applications there are normally very high bandwidth demands on the interconnection network. Other important properties are time-deterministic latency and services to guarantee that deadlines are met. In this paper we analyze how these properties vary with the design parameters for a passive optical star network, specifically when used in a massively parallel radar signal processing system. The aggregated bandwidth and computational power of the radar system are approximately 45 Gb/s and 100 GOPS, respectively. The analysis is focused on the medium access control protocol, called TD-TWDMA, for the time- and wavelength-multiplexed network. It is concluded that the proposed network is very well suited to this kind of signal-processing application. We also present a new distributed slot-allocation algorithm with real-time properties.
This paper presents a self-organizing approach for mobile robot path planning problems in dynamic environments by using case-based reasoning together with a more conventional method of grid-map based path planning. The map-based path planner is used to suggest new innovative solutions for a particular path planning problem. The case-base is used to store the paths and evaluate their traversability. When planning a route, those paths are preferred that, according to former experience, are least risky. As the environment changes, the exploration as well as the evaluation of the paths allows the system to self-organize by forming a set of low-risk paths that are safest to follow. The experiments in a simulated environment show that the robot is able to adapt in a dynamic environment and learns to use the least risky paths. © Springer-Verlag Berlin Heidelberg 1998.
Transports can be made safer, more secure and more efficient with the help of telemetry and on-line tracking in real time. T4 is a system architecture aimed to support the development of telematic services for transparent tracking and surveillance monitoring of goods transported by different means on a global scale. The main idea is to focus on the transported pallets or parcels instead of the vehicles moving them. To enable rapid response to new customer requirements and to support remote management of field equipment, software-implemented services are designed, packaged, deployed and mediated using the XML, Java and OSGi software technology standards.
The use of radio frequency identification (RFID) systems is growing rapidly. Today, mostly "passive" RFID systems are used because no onboard energy source is needed on the transponders. However, "active" RFID with an onboard power source gives a new range of opportunities not possible with passive systems. To obtain energy efficiency in an active RFID system, a protocol should be designed that is optimized with energy in mind. This paper describes the on-going work of defining and evaluating such a protocol. The protocol's performance in terms of energy efficiency, aggregated throughput, delay, and number of air collisions is evaluated and compared to that of the medium-access layer in IEEE 802.15.4 Zigbee, and also to a commercially available protocol from Free2move.
In this paper we describe and evaluate an enhanced version of an active RFID wake-up and tag ID extraction radio communication protocol. The enhanced protocol further reduces the transponders' power consumption (prolonging their battery lifetime). The protocol uses a frequency binary tree method for extracting the identification number of each transponder. It is enhanced by extending it with a framed slotted medium access control method, which decreases the number of activations of each transponder during tag ID extraction. Using this medium access method, the average number of transponder activations is decreased by a factor of 2.5 compared to the original protocol. The resulting increase in ID read-out delay is 0.9%, on average.
Active Radio Frequency Identification (A-RFID) is a technology where the tags (transponders) carry an on-board energy source for powering the radio, processor circuits, and sensors. Besides offering a longer working distance between RFID reader and tag than passive RFID, this also enables the tags to do sensor measurements, calculations and storage even when no RFID reader is in the vicinity of the tags. In this paper we introduce a medium access data communication protocol which dynamically adjusts its back-off algorithm to best suit the actual active RFID application at hand. Based on a simulation study of the effect on tag energy cost, read-out delay, and message throughput incurred by some typical back-off algorithms in a CSMA/CA (Carrier Sense Multiple Access / Collision Avoidance) A-RFID protocol, we conclude that, by dynamic tuning of the initial contention window size and back-off interval coefficient, tag energy consumption and read-out delay can be significantly lowered. We also present specific guidelines on how parameters should be selected under various application constraints (viz., maximum read-out delay and the number of tags passing).
The communication protocol used is a key issue in order to make the most of the advantages of active RFID technologies. In this paper we introduce a carrier sense medium access data communication protocol that dynamically adjusts its back-off algorithm to best suit the actual application at hand. Based on a simulation study of the effect on tag energy cost, read-out delay, and message throughput incurred by some typical back-off algorithms in a CSMA/CA (Carrier Sense Multiple Access/Collision Avoidance) active RFID protocol, we conclude that by dynamic tuning of the initial contention window size and back-off interval coefficient, tag energy consumption and read-out delay can be significantly lowered. We show that it is possible to decrease the energy consumption per tag payload delivery by more than a factor of 10, resulting in a 50% increase in tag battery lifetime. We also discuss the advantage of being able to predict the number of tags present at the RFID reader as well as ways of doing it.
Active radio frequency identification (A-RFID) is a technology where the tags (transponders) carry an on-board energy source for powering the radio, processor circuits, and sensors. Besides offering a longer working distance between RFID reader and tag than passive RFID, this also enables the tags to do sensor measurements, calculations and storage even when no RFID reader is in the vicinity of the tags. In this paper we study the effect on tag energy cost and read-out delay incurred by some typical back-off algorithms (constant, linear, and exponential) used in a contention-based CSMA/CA (carrier sense multiple access/collision avoidance) protocol for A-RFID communication.
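The three back-off families can be sketched as contention-window schedules from which a random slot is drawn. The sketch below is a minimal model, with hypothetical parameter names (cw0, coeff) standing in for the protocol's actual constants:

```python
import random

def backoff_slots(attempt, cw0=8, scheme="exponential", coeff=2):
    """Pick a random back-off delay (in slots) for a given retry attempt.

    Hypothetical parameters, not the protocol's actual constants:
      constant:    window stays at cw0 for every attempt
      linear:      window grows as cw0 + coeff * attempt
      exponential: window grows as cw0 * coeff ** attempt
    A wider window lowers the collision probability but increases the
    expected read-out delay, which drives the energy/delay trade-off.
    """
    if scheme == "constant":
        cw = cw0
    elif scheme == "linear":
        cw = cw0 + coeff * attempt
    else:  # exponential
        cw = cw0 * coeff ** attempt
    return random.randrange(cw)
```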
In this paper we present a Radio Frequency Identification (RFID) protocol used to wake up and extract the ID of every tag (or a subset thereof) within reach of a reader in an active backscatter RFID system. We also study the effect on tag energy cost and read-out delay incurred when using the protocol, which is based on a frequency binary tree. Simulations show that, when using the 2.45 GHz ISM band, more than 1500 tags can be read per second. With a population of 1000 tags, the average read-out delay is 319 ms, and the expected lifetime of the RFID tags is estimated to be more than 2.5 years, even in a scenario where they are read out very often.
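The binary-tree read-out can be modeled as a prefix search over tag IDs: the reader queries an ID prefix, a collision (more than one responding tag) splits the prefix into its two one-bit extensions, and a unique response yields a tag ID. The sketch below models only the tree traversal; in the actual protocol the split is realized in frequency, and the function and its parameters are illustrative:

```python
def read_tags(tag_ids, id_bits=8):
    """Binary-tree singulation sketch (simplified, hypothetical).

    Queries ID prefixes depth-first: a prefix matched by exactly one
    tag reads that tag; a prefix matched by several tags (a collision)
    is split into its two one-bit extensions.
    """
    found = []
    stack = [""]            # start with the empty prefix (all tags)
    while stack:
        prefix = stack.pop()
        matching = [t for t in tag_ids
                    if format(t, f"0{id_bits}b").startswith(prefix)]
        if len(matching) == 1:
            found.append(matching[0])   # unique response: tag read
        elif len(matching) > 1:
            stack.append(prefix + "0")  # collision: descend one level
            stack.append(prefix + "1")
    return sorted(found)
```

Because each collision eliminates half of the remaining ID space, the number of queries grows roughly linearly with the number of tags rather than with the size of the ID space.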
The use of Radio Frequency Identification (RFID) systems is growing rapidly. Today, mostly "passive" RFID systems are used because no onboard energy source is needed on the transponders. However, "active" RFID technology, with onboard power sources in the transponders, gives a range of opportunities not possible with passive systems. To obtain energy efficiency in an active RFID system, the protocol to be used should be carefully designed with energy optimization in mind. This paper describes how energy consumption can be calculated, to be used in protocol definition, and how protocols can be evaluated in this respect. The performance of such a new protocol, in terms of energy efficiency, aggregated throughput, delay, and number of air collisions, is evaluated and compared to an existing, commercially available protocol for active RFID, as well as to the IEEE standard 802.15.4 (used e.g. in the Zigbee medium-access layer).