The goal of the REMAP project was to gain new knowledge about the design and use of massively parallel computer architectures in embedded real-time systems. In order to support adaptive and learning behavior in such systems, the focus was on the efficient execution of Artificial Neural Network (ANN) algorithms on regular processor arrays. The REMAP-β parallel computer built in the project was designed with ANN computations as the main target application area. This chapter gives an overview of the computational requirements found in ANN algorithms in general and motivates the use of regular processor arrays for the efficient execution of such algorithms. REMAP-β was implemented using the FPGA circuits that were available around 1990. The architecture, following the SIMD principle (Single Instruction stream, Multiple Data streams), is described, as well as the mapping of some important and representative ANN algorithms. Implemented in FPGA, the system served as an architecture laboratory. Variations of the architecture are discussed, as is the scalability of fully synchronous SIMD architectures. The design principles of a VLSI-implemented successor of REMAP-β are described, and the chapter concludes with a discussion of how the more powerful FPGA circuits of today could be used in a similar architecture. © 2006 Springer.
To better utilize the capacity of the twisted-pair access networks, operators deploying very high-speed digital subscriber line (VDSL) systems need accurate parameters for power back-off (PBO). However, VDSL standards give almost no guidance on how these parameters should be established for a particular network. In this paper we present a new technique for optimizing PBO parameters for a cable bundle, which is based on the Nelder-Mead simplex search algorithm. In this way each operator can easily calculate PBO parameters that match its actual access network down to the individual cable bundle. Using the properties of the PBO, as defined in the VDSL standard, we show how a normalized FEXT coupling can replace the knowledge of the individual couplings during the optimization of the PBO parameters. By simulations based on measured cable data we show that our approach using cable bundle unique PBO (CUPBO) achieves significant improvements compared to the performance achieved with the ordinary PBO. © 2007 EURASIP.
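As an illustration of the optimization machinery (not the paper's actual simulator), the following Python sketch tunes a standardized-shape upstream reference PSD, -(a + b·sqrt(f)), with SciPy's Nelder-Mead implementation against a hypothetical per-user rate model; rate_for_user() and all channel numbers are invented placeholders.

```python
# Sketch of the optimization loop (not the paper's simulator): a standardized-
# shape upstream reference PSD, -(a + b*sqrt(f)), is tuned by SciPy's
# Nelder-Mead implementation against a hypothetical per-user rate model.
# rate_for_user() and all channel numbers are invented placeholders.
import numpy as np
from scipy.optimize import minimize

freqs_mhz = np.linspace(1.0, 12.0, 256)       # upstream band (illustrative)

def reference_psd(params):
    a, b = params
    return -(a + b * np.sqrt(freqs_mhz))      # dBm/Hz, standardized UPBO shape

def rate_for_user(psd, user):                 # hypothetical rate model
    loss = 1.0 + 0.5 * user                   # toy per-user attenuation factor
    snr = np.maximum(psd + 140.0 - 10.0 * loss * np.log10(freqs_mhz + 1), 0.0)
    return np.sum(np.log2(1.0 + 10.0 ** (snr / 10.0)))

def negative_sum_rate(params):                # objective: maximize bundle rate
    psd = reference_psd(params)
    return -sum(rate_for_user(psd, u) for u in range(8))

result = minimize(negative_sum_rate, x0=[40.0, 20.0], method="Nelder-Mead")
print("optimized (a, b):", result.x)
```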
High-performance embedded systems running real-time applications demand communication solutions providing high data rates and low error probabilities, properties inherent to optical solutions. However, providing timing guarantees for deadline-bound applications in this context is far from trivial due to the parallelism inherent in multiwavelength networks, and analyses are often bound to include a large amount of pessimism. Assuming deterministic medium access, an admission control algorithm using a schedulability analysis can ensure deadline guarantees for real-time communication. The traffic dependency analysis presented in this paper specifically targets a multichannel context, taking into consideration the possibility of concurrent transmissions in these types of networks. Combining our analysis with a feasibility analysis in admission control, the amount of guaranteed hard real-time traffic was shown to increase by a factor of 7 in a network designed for a radar signal processing case. Using this combination of analysis methods makes it possible to increase the amount of hard real-time traffic over a given multichannel network, leading to more efficient bandwidth utilization by deadline-dependent applications without having to redesign the network or the medium access method.
Analysis, assessment, and design of advanced wireline transmission schemes over multipair copper cables require accurate knowledge of the channel properties. This paper investigates modeling of multiconductor cables based on interpair impedance measurements. A unified approach to the application of the "Cioffi" model is introduced. Directly measuring the underlying interpair impedances yields a good match with the alternative approach suggested in the recent study by Cioffi et al. Crosstalk coupling functions derived from the model exhibit a good match with the corresponding direct measurements when the modeled length is close to the length of the interpair impedance measurements. However, the prediction power of this model with respect to termination impedance is limited. © 2007 IEEE.
Future wireline communication systems aspire to boost the throughput in two ways: First, they exploit higher frequencies to gain more bandwidth on shorter lines in combination with vectoring. Second, they use non-differential transmission modes (such as phantom modes, common modes, split-pair modes) to exploit more dimensions. Performance predictions for systems exploiting these techniques are of great importance for upgrading copper networks to provide Internet access or deploying copper-based backhaul systems to connect mobile base stations. Good predictions require accurate channel models. However, channel modeling for higher frequencies and non-differential modes is still in its infancy. A mixed deterministic/stochastic channel model is proposed to remedy this problem. The outage rate is derived based on an asymptotic (in the number of participating transceivers) analysis. As application examples, performance predictions in access networks using phantom modes and frequencies up to 200 MHz are presented.
A multi-user signal coordination scheme known as Vectoring will play a crucial role in next generation digital subscriber lines (DSL). Previous studies have demonstrated that vectoring increases the bitrate of DSL systems due to its ability to mitigate interference. In this work we show how Vectoring improves the energy efficiency of DSL over state-of-the-art spectrum balancing methods. In addition, we investigate the impact of channel-state information errors and low-complexity implementations on the performance of Vectoring. We find that Vectoring yields large energy savings in terms of line-driver power consumption even under high channel estimation errors. © 2011 IEEE.
Recently energy saving has become an important issue also for wired communication. In this paper we investigate the potential of using power back-off (PBO) as a means to achieve higher energy efficiency. Based on a global energy optimization formulation we derive an energy efficient PBO (EEPBO) algorithm. Through simulation we compare EEPBO with continuous bit-loading to the near-optimal energy efficient spectrum balancing (EESB) algorithm, and an integer bit-loading version of EEPBO with energy efficient iterative spectrum balancing (EEISB). By restricting the search to practical levels of PBO parameters instead of optimizing the bit-loading on each and every carrier separately, we see a significant reduction in computational complexity. It also means that EEPBO is already supported by current VDSL2 systems. Still, even after restricting the spectrum to what the PBO in VDSL2 allows, we can show, through simulations, that EEPBO achieves the same level of energy efficiency as the near-optimal methods. This high performance and low complexity, together with standard compliance, make EEPBO a very attractive choice for future energy efficient transmission in VDSL2. © EURASIP, 2009.
In this paper we introduce Epiphany as a high-performance energy-efficient manycore architecture suitable for real-time embedded systems. This scalable architecture supports floating point operations in hardware and achieves 50 GFLOPS/W in 28 nm technology, making it suitable for high performance streaming applications like radio base stations and radar signal processing. Through an efficient 2D mesh Network-on-Chip and a distributed shared memory model, the architecture is scalable to thousands of cores on a single chip. An Epiphany-based open source computer named Parallella was launched in 2012 through Kickstarter crowdfunding and has now shipped to thousands of customers around the world. ©2014 IEEE.
The increased availability of modern embedded many-core architectures supporting floating-point operations in hardware makes them interesting targets in traditional high performance computing areas as well. In this paper, the Lattice Boltzmann Method (LBM) from the domain of Computational Fluid Dynamics (CFD) is evaluated on Adapteva’s Epiphany many-core architecture. Although the LBM implementation shows very good scalability and high floating-point efficiency in the lattice computations, current Epiphany hardware does not provide adequate amounts of either local memory or external memory bandwidth to provide a good foundation for simulation of the large problems commonly encountered in real CFD applications.
Recurrent neural networks (RNNs) are neural networks (NN) designed for time-series applications. There is a growing interest in running RNNs to support these applications on edge devices. However, RNNs have large memory and computational demands that make them challenging to implement on edge devices. Quantization is used to shrink the size and the computational needs of such models by decreasing weight and activation precision. Further, the delta networks method increases the sparsity in activation vectors by relying on the temporal relationship between successive input sequences to eliminate repeated computations and memory accesses. In this paper, we study the effect of quantization on LSTM-, GRU-, LiGRU-, and SRU-based RNN models for speech recognition on the TIMIT dataset. We show how to apply post-training quantization on these models with a minimal increase in the error by skipping quantization of selected paths. In addition, we show that the quantization of activation vectors in RNNs to integer precision leads to considerable sparsity if the delta networks method is applied. Then, we propose a method for increasing the sparsity in the activation vectors while minimizing the error and maximizing the percentage of eliminated computations. The proposed quantization method managed to compress the four models by more than 85%, with an error increase of 0.6, 0, 2.1, and 0.2 percentage points, respectively. By applying the delta networks method to the quantized models, more than 50% of the operations can be eliminated, in most cases with only a minor increase in the error. Comparing the four models to each other under the quantization and delta networks method, we found that compressed LSTM-based models are the best solutions under low-error-rate constraints. The compressed SRU-based models are the smallest in size, suitable when higher error rates are acceptable, and the compressed LiGRU-based models have the highest number of eliminated operations. © 2022 by the authors. Licensee MDPI, Basel, Switzerland.
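To make the delta networks idea concrete, here is a minimal NumPy sketch of one matrix-vector product where only inputs whose change since the last update exceeds a threshold are processed; the threshold, sizes, and input sequence are illustrative and not tied to the paper's models.

```python
# Sketch of the delta networks idea for one matrix-vector product in an RNN
# layer: reuse the previous result and only process inputs whose change since
# the last update exceeds a threshold. Names and the threshold are illustrative.
import numpy as np

def delta_matvec(W, x_t, x_ref, y_acc, theta=0.1):
    """Update y_acc = W @ x using only significantly changed inputs."""
    delta = x_t - x_ref
    active = np.abs(delta) > theta          # columns that must be recomputed
    y_acc = y_acc + W[:, active] @ delta[active]
    x_ref = x_ref.copy()
    x_ref[active] = x_t[active]             # remember values actually used
    return y_acc, x_ref, active.mean()      # sparsity statistic

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
x_ref = np.zeros(32)
y = np.zeros(64)
for t in range(5):
    x_t = np.sin(0.1 * t + np.arange(32))   # slowly varying input sequence
    y, x_ref, frac = delta_matvec(W, x_t, x_ref, y, theta=0.1)
    print(f"step {t}: {frac:.0%} of columns recomputed")
```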
Recurrent Neural Networks (RNNs) are a class of machine learning algorithms used for applications with time-series and sequential data. Recently, there has been a strong interest in executing RNNs on embedded devices. However, difficulties have arisen because RNNs require high computational capability and a large memory space. In this paper, we review existing implementations of RNN models on embedded platforms and discuss the methods adopted to overcome the limitations of embedded systems. We define the objectives of mapping RNN algorithms on embedded platforms and the challenges facing their realization. Then, we explain the components of RNN models from an implementation perspective. We also discuss the optimizations applied to RNNs to run efficiently on embedded platforms. Finally, we compare the defined objectives with the implementations and highlight some open research questions and aspects currently not addressed for embedded RNNs. Overall, applying algorithmic optimizations to RNN models and decreasing the memory access overhead is vital to obtain high efficiency. To further increase the implementation efficiency, we point out the more promising optimizations that could be applied in future research. Additionally, this article observes that high performance has been targeted by many implementations, while flexibility has, as yet, been attempted less often. Thus, the article provides some guidelines for RNN hardware designers to support flexibility in a better manner. © 2020 IEEE.
Today computer architectures are shifting from single core to manycores for several reasons, such as performance demands and power and heat limitations. However, shifting to manycores results in additional complexities, especially with regard to efficient development of applications. Hence there is a need to raise the abstraction level of development techniques for the manycores while exposing the inherent parallelism in the applications. One promising class of programming languages is dataflow languages, and in this paper we evaluate and optimize the code generation for one such language, CAL. We have also developed a communication library to support the inter-core communication. The code generation can target multiple architectures, but the results presented in this paper are focused on Adapteva's manycore architecture Epiphany. We use the two-dimensional inverse discrete cosine transform (2D-IDCT) as our benchmark and compare our code generation from CAL with a hand-written implementation developed in C. Several optimizations in the code generation as well as in the communication library are described, and we have observed that the most critical optimization is reducing the number of external memory accesses. Combining all optimizations we have been able to reduce the difference in execution time between auto-generated and hand-written implementations from a factor of 4.3x down to a factor of only 1.3x. ©2014 IEEE.
This paper proposes a novel method for performing division on floating-point numbers represented in IEEE-754 single-precision (binary32) format. The method is based on an inverter, implemented as a combination of Parabolic Synthesis and second-degree interpolation, followed by a multiplier. It is implemented both with and without pipeline stages and synthesized targeting a Xilinx Ultrascale FPGA.
The implementations show better resource usage and latency results when compared to other implementations based on different methods. In terms of throughput, the proposed method outperforms most of the other works; however, some Altera FPGAs achieve a higher clock rate due to differences in the DSP slice multiplier design.
Due to the small size, low latency and high throughput, the presented floating-point division unit is suitable for high performance embedded systems and can be integrated into accelerators or be used as a stand-alone accelerator.
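A behavioral sketch of the divide-by-inversion structure may help: the reciprocal of the divisor's mantissa is approximated by a piecewise second-degree polynomial (a software stand-in for the Parabolic Synthesis plus interpolation stage; the real design's coefficients, interval count, and rounding differ) and the result is multiplied by the dividend.

```python
# Behavioral sketch of the divide-by-inversion structure: approximate the
# reciprocal of the divisor's mantissa with a piecewise second-degree
# polynomial (a software stand-in for the Parabolic Synthesis + interpolation
# stage; the real design's coefficients, intervals, and rounding differ),
# then multiply by the dividend. Positive, finite operands assumed.
import math
import numpy as np

INTERVALS = 64
_coeffs = []                                   # per-interval quadratic fits of 1/x
for i in range(INTERVALS):
    xs = np.linspace(1 + i / INTERVALS, 1 + (i + 1) / INTERVALS, 16)
    _coeffs.append(np.polyfit(xs, 1.0 / xs, 2))

def approx_divide(a: float, b: float) -> float:
    m, e = math.frexp(b)                       # b = m * 2**e, m in [0.5, 1)
    m *= 2.0
    e -= 1                                     # normalize mantissa into [1, 2)
    idx = min(int((m - 1.0) * INTERVALS), INTERVALS - 1)
    recip_m = np.polyval(_coeffs[idx], m)      # ~ 1/m via quadratic interpolation
    return a * recip_m * 2.0 ** (-e)           # a * (1/b)

print(approx_divide(3.0, 7.0), "vs", 3.0 / 7.0)   # close agreement expected
```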
While parallel computer architectures have become mainstream, application development on them is still challenging. There is a need for new tools, languages and programming models. Additionally, there is a lack of knowledge about the performance of parallel approaches of basic but important operations, such as the QR decomposition of a matrix, on current commercial manycore architectures.
This paper evaluates a high level dataflow language (CAL), a source-to-source compiler (Cal2Many) and three QR decomposition algorithms (Givens Rotations, Householder and Gram-Schmidt). The algorithms are implemented both in CAL and in hand-optimized C, executed on Adapteva's Epiphany manycore architecture and evaluated with respect to performance, scalability and development effort.
The performance of the CAL (generated C) implementations comes within 2% of the hand-written versions. They require on average 25% fewer lines of source code without significantly increasing the binary size. Development effort is reduced and debugging is significantly simplified. The implementations executed on Epiphany cores outperform the GNU scientific library on the host ARM processor of the Parallella board by up to 30x. © 2016 Copyright held by the owner/author(s).
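For reference, a minimal NumPy sketch of one of the three evaluated algorithms, QR decomposition via Givens rotations, is shown below; it illustrates the numerics only, not the CAL or Epiphany implementations.

```python
# Minimal Givens-rotation QR sketch (one of the three evaluated algorithms);
# NumPy here only illustrates the numerics, not the Epiphany/CAL mapping.
import numpy as np

def givens_qr(A):
    m, n = A.shape
    R = A.astype(float).copy()
    Q = np.eye(m)
    for j in range(n):
        for i in range(m - 1, j, -1):        # zero out R[i, j] from below
            a, b = R[i - 1, j], R[i, j]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            G = np.array([[c, s], [-s, c]])  # 2x2 rotation on rows i-1, i
            R[[i - 1, i], :] = G @ R[[i - 1, i], :]
            Q[:, [i - 1, i]] = Q[:, [i - 1, i]] @ G.T
    return Q, R

A = np.random.default_rng(1).standard_normal((5, 3))
Q, R = givens_qr(A)
print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0))  # True True
```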
On-chip communication plays a significant role in the performance of manycore architectures, which therefore require an on-chip communication infrastructure that can scale with the number of cores. As a solution, network-on-chip structures have emerged and are being used.
This paper describes a two-dimensional mesh network-on-chip router and a network interface, implemented in Chisel for integration into the Rocket Chip generator that generates RISC-V (Rocket) cores. The router is implemented in VHDL as well, and the two implementations are verified and compared.
Hardware resource usage and performance of different-sized networks are analyzed. The implementations are synthesized for a Xilinx Ultrascale FPGA via Xilinx tools to obtain the hardware resource usage and clock frequency results. The performance results, including latency and throughput measurements with different traffic patterns, are collected with cycle-accurate emulations.
The implementations in Chisel and VHDL do not show a significant difference. Chisel requires around 10% fewer lines of code; however, the difference in the synthesis results is negligible. Our latency results are better than those of the majority of the other studies. The other results, such as hardware usage, clock frequency, and throughput, are competitive when compared to the related works.
In the last 15 years we have seen, as a response to power and thermal limits for current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. But now, when there are potentially hundreds of computing cores on a chip, we see a need for specialization of individual cores and the development of heterogeneous manycore computer architectures to further improve performance and energy efficiency.
However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain specific manycore architectures based on the RISC-V instruction set architecture, and automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations, including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends by generating synthesizable Verilog code and a cycle-accurate emulator for the generated architecture.
We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measure their performance and hardware resource usage. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than their general-purpose counterparts. In certain cases, replacing general purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates the design space exploration of manycore architectures. © 2019 The Authors. Published by Elsevier B.V.
Performance and power requirements have pushed computer architectures from single core to manycores. These requirements now continue pushing the manycores with identical cores (homogeneous) towards manycores with specialized cores (heterogeneous). However, designing heterogeneous manycores is a challenging task due to the complexity of the architectures. We propose an approach for designing domain specific heterogeneous manycore architectures based on building blocks. These blocks are defined as the common computations of the applications within a domain. The objective is to generate heterogeneous architectures by integrating many of these blocks into many simple cores and connecting the cores with a network-on-chip. The proposed approach aims to ease the design of heterogeneous manycore architectures and facilitate use of the dark silicon concept. As a case study, we develop an accelerator based on several building blocks, integrate it into a RISC core and synthesize it on a Xilinx Ultrascale FPGA. The results show that executing a hot-spot of an application on an accelerator based on building blocks increases the performance by 15x, with room for further improvement. The area usage increases as well; however, there are potential optimizations to reduce it. © 2018 by the authors
The last ten years have seen performance and power requirements pushing computer architectures using only a single core towards so-called manycore systems with hundreds of cores on a single chip. To further increase performance and energy efficiency, we are now seeing the development of heterogeneous architectures with specialized and accelerated cores. However, designing these heterogeneous systems is a challenging task due to their inherent complexity. We proposed an approach for designing domain-specific heterogeneous architectures based on instruction augmentation through the integration of hardware accelerators into simple cores. These hardware accelerators were determined based on their common use among applications within a certain domain. The objective was to generate heterogeneous architectures by integrating many of these accelerated cores and connecting them with a network-on-chip. The proposed approach aimed to ease the design of heterogeneous manycore architectures, and consequently the exploration of the design space, by automating the design steps. To evaluate our approach, we enhanced our software tool chain with a tool that can generate accelerated cores from dataflow programs. This new tool chain was evaluated with the aid of two use cases: radar signal processing and mobile baseband processing. We could achieve an approximately 4x improvement in performance, while executing complete applications on the augmented cores with a small impact (2.5–13%) on area usage. The generated accelerators are competitive, achieving more than 90% of the performance of hand-written implementations.
This paper proposes a novel method for performing the square root operation on floating-point numbers represented in IEEE-754 single-precision (binary32) format. The method is implemented using Harmonized Parabolic Synthesis. It is implemented both with and without pipeline stages and synthesized for two different Xilinx FPGA boards.
The implementations show better resource usage and latency results when compared to other similar works, including the Xilinx intellectual property (IP) core that uses the CORDIC method. Any method calculating the square root will make approximation errors. Unless these errors are distributed evenly around zero, they can accumulate and give a biased result. An attractive feature of the proposed method is the fact that it distributes the errors evenly around zero, in contrast to CORDIC for instance.
Due to the small size, low latency, high throughput, and good error properties, the presented floating-point square root unit is suitable for high performance embedded systems. It can be integrated into a processor’s floating point unit or be used as a stand-alone accelerator. © 2019 IEEE.
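The error-bias property can be checked numerically. The sketch below fits a generic quadratic approximation to the square root (a stand-in for the Harmonized Parabolic Synthesis design, whose actual error profile differs) and reports the signed mean error; a mean near zero indicates errors distributed evenly around zero.

```python
# Numerically checking the error-bias property highlighted above: fit a generic
# quadratic to sqrt(x) over one pair of binades (a stand-in for the Harmonized
# Parabolic Synthesis design, whose real error profile differs) and report the
# signed mean error. A mean near zero means errors centered around zero.
import numpy as np

xs = np.linspace(1.0, 4.0, 1 << 16)            # mantissa range; two binades
coeffs = np.polyfit(xs, np.sqrt(xs), 2)        # least-squares quadratic fit
err = np.polyval(coeffs, xs) - np.sqrt(xs)     # signed approximation error

print(f"max |error| = {np.abs(err).max():.3e}")
print(f"mean error  = {err.mean():+.3e}  (close to zero -> unbiased)")
```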
The latest digital subscriber line (DSL) technology, VDSL2, used for broadband access over twisted-pairs, promises up to 100 Mbit/s for both transmission directions on short loops. Since these systems are designed to operate in a far-end crosstalk (FEXT) limited environment, there is a severe performance degradation when deployed in distributed network scenarios. With power back-off (PBO) the network operators attempt to protect modems deployed on long loops by reducing the transmit power of the short ones. However, currently very little guidance has been given to operators on how to set and optimize the parameters for PBO. In this paper we explore one promising method, the cable bundle unique PBO (CUPBO), which optimizes these parameters according to the actual situation in the cable with regard to noise and network topology. Using real VDSL systems and cables we show that the CUPBO algorithm achieves a significant increase in performance compared to the case when one naively takes the PBO values given in the VDSL standard.
Dynamic spectrum management (DSM) improves the capacity utilization of twisted-pair cables by adapting the transmit power spectral density (PSD) of modems to the actual noise environment and channel conditions. Earlier proposed DSM algorithms do not take into account the standardized very high speed digital subscriber line (VDSL) constraints on the allowable transmit PSDs. However, VDSL modems support only restricted transmit PSD shapes resulting from the standardized power back-off (PBO) method, which is controlled by a small set of parameters. Furthermore, since all modems are currently using the same PBO parameters, their bit rate performance is severely limited. In this paper, we show how to effectively exploit the standardized PBO concept for DSM to significantly boost bit rates. We also present a low-complexity DSM algorithm, the user unique PBO (UUPBO) algorithm, for calculating PBO parameters that are uniquely optimized for each modem. © 2007 IEEE.
A method of optimizing the bit rate capacities of DSL user lines subject to far-end crosstalk by adjusting the transmit power spectral densities at the far ends of the user lines by means of parameterized power back-off functions is characterized by the steps of: a) selecting a desired bit rate share for each user line, b) measuring the noise and losses on each user line, and c) determining an individual set of power back-off parameters for each user line by calculating a global sum of the bit rates of all user lines and iterating the power back-off parameters until a maximum value of the global sum is found under the constraint that the desired bit rate shares are met.
The importance of a power spectral density (PSD) mask restriction is often overlooked when optimizing the spectrum usage for multiuser digital subscriber line (DSL) systems. However, by developing the optimization strategies based only on the PSD constraints (masks), we can tremendously reduce the computational complexity compared to methods based only on the total power restriction. In this paper we introduce a mask-based spectrum balancing (MSB) algorithm and demonstrate the near-optimum performance of this optimization approach. Furthermore, we show that besides standards compliance, the PSD restriction is also needed to ensure the convergence of iterative spectrum balancing methods, which use dual decomposition optimization.
We present a practical solution for dynamic spectrum management (DSM) in digital subscriber line systems: the normalized-rate iterative algorithm (NRIA). Supported by a novel optimization problem formulation, the NRIA is the only DSM algorithm that jointly addresses spectrum balancing for frequency division duplexing systems and power allocation for the users sharing a common cable bundle. With a focus on being implementable rather than obtaining the highest possible theoretical performance, the NRIA is designed to efficiently solve the DSM optimization problem with the operators' business models in mind. This is achieved with the help of two types of parameters: the desired network asymmetry and the desired user priorities. The NRIA is a centralized DSM algorithm based on the iterative water-filling algorithm (IWFA) for finding efficient power allocations, but extends the IWFA by finding the achievable bitrates and by optimizing the bandplan. It is compared with three other DSM proposals: the IWFA, the optimal spectrum balancing algorithm (OSBA), and the bidirectional IWFA (bi-IWFA). We show that the NRIA achieves better bitrate performance than the IWFA and the bi-IWFA. It can even achieve performance almost as good as the OSBA, but with dramatically lower requirements on complexity. Additionally, the NRIA can achieve bitrate combinations that cannot be supported by any other DSM algorithm.
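Since the NRIA builds on the IWFA, a toy two-user sketch of iterative water-filling is given below; channel and crosstalk gains are invented, and the NRIA's bandplan and priority mechanisms are not modeled.

```python
# Toy two-user sketch of the iterative water-filling algorithm (IWFA) that the
# NRIA builds on: each user repeatedly water-fills its power against the
# current crosstalk from the other users. All gains are invented placeholders.
import numpy as np

K, N = 48, 2                                   # subcarriers, users
rng = np.random.default_rng(2)
H = rng.uniform(0.3, 1.0, (N, K))              # direct channel gains
X = rng.uniform(0.01, 0.05, (N, N, K))         # crosstalk gains, X[m, n]: m -> n
NOISE, P_BUDGET = 1e-3, 1.0

def waterfill(inv_cnr, budget):
    """Exact water-filling: p_k = max(mu - inv_cnr_k, 0), sum(p) = budget."""
    order = np.sort(inv_cnr)
    for m in range(len(order), 0, -1):         # try the m best carriers
        mu = (budget + order[:m].sum()) / m    # candidate water level
        if mu > order[m - 1]:                  # level covers all m carriers
            return np.maximum(mu - inv_cnr, 0.0)
    return np.zeros_like(inv_cnr)

p = np.zeros((N, K))
for _ in range(20):                            # iterate to a (near) fixed point
    for n in range(N):
        interf = NOISE + sum(X[m, n] * p[m] for m in range(N) if m != n)
        p[n] = waterfill(interf / H[n], P_BUDGET)

for n in range(N):
    interf = NOISE + sum(X[m, n] * p[m] for m in range(N) if m != n)
    rate = np.sum(np.log2(1.0 + H[n] * p[n] / interf))
    print(f"user {n}: {rate:.1f} bits/symbol")
```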
Accurate upstream power back-off (PBO) parameters are needed by operators deploying very high-speed digital subscriber line (VDSL) modems. Although a standardized PBO method for VDSL exists, the standard gives little or no guidance to an operator on how to establish these optimized PBO parameters for its particular network and customers. In this paper, we present an efficient algorithm based on the Nelder-Mead simplex search which calculates optimized upstream PBO parameters. To make the PBO parameter calculation independent of the network scenario, we present a new method for establishing worst-case far-end crosstalk (FEXT) noise, which is based on virtual modems. © 2006 IEEE.
In recent years an increasing effort has been made to reduce the energy consumption of digital subscriber line equipment. Dynamic spectrum management (DSM) has been identified as one promising method to achieve energy efficiency in discrete multitone based systems. An open research question is how to ensure system robustness when applying highly optimized energy-efficient spectrum management. In this paper, we study the problem of uncertainty in crosstalk noise and parameters, the knowledge of which is indispensable for many DSM algorithms. We introduce robust optimization for spectrum balancing as a technique to achieve feasibility of the optimal power-allocation under a deterministic parameter uncertainty model. This can be seen as an extension of current schemes for spectrum balancing. As a special case we consider the simple strategy of scaling the crosstalk parameters to their worst-case values, which corresponds to a specific uncertainty model and entails no changes to current DSM algorithms. Finally, we quantify the benefit in worst-case performance and the price in terms of energy by simulations. © 2010 IEEE.
This paper highlights the collaboration between industry and academia in research. It describes more than two decades of intensive development and research of new hardware and software platforms to support innovative, high-performance sensor systems with extremely high demands on embedded signal processing capability. The joint research can be seen as the run before a necessary jump to a new kind of computational platform based on parallelism. The collaboration has had several phases, starting with a focus on hardware, then on efficiency, later on software development, and finally on taking the jump and understanding the expected future. In the first part of the paper, these phases and their respective challenges and results are described. Then, in the second part, we reflect upon the motivation for collaboration between company and university, the roles of the partners, the experiences gained and the long-term effects on both sides. Copyright © 2014 ACM.
This paper presents the high-level architecture (HLA) of the research project DEWI (dependable embedded wireless infrastructure). The objective of this HLA is to serve as a reference for the development of industrial wireless sensor and actuator networks (WSANs) based on the concept of the DEWI Bubble. The DEWI Bubble is defined here as a high-level abstraction of an industrial WSAN with enhanced interoperability (via standardized interfaces), technology reusability, and cross-domain development. This paper details the design criteria used to define the HLA and the organization of the infrastructure internal and external to the DEWI Bubble. The description includes the different perspectives, models or views of the architecture: the entity model, the layered model, and the functional view model (including an overview of interfaces). The HLA constitutes an extension of the ISO/IEC SNRA (sensor network reference architecture) towards the support of industrial applications. To improve interoperability with existing approaches the DEWI HLA also reuses some features from other standardized technologies and architectures. The HLA will allow networks with different industrial sensor technologies to exchange information between them or with external clients via standard interfaces, thus providing a consolidated access to sensor information of different domains. This is an important aspect for smart city applications, Big Data and internet-of-things (IoT). © Copyright 2016 IEEE
Spectrum balancing is an established optimization approach in multi-carrier digital subscriber line (DSL) systems. It has previously been applied to very different performance objectives such as sum-rate, min-rate, or fairness maximization and sum-power minimization. In this work we study the maximization of the service coverage, defined as the number of DSL lines that can be granted an operator-specified high-bandwidth service. The proposed algorithm is based on a previously described mathematical decomposition framework. We extend this framework for our new problem and enhance its scalability by various low-complexity heuristics. Simulations demonstrate the applicability of our algorithm for DSL networks of realistic sizes. More precisely, our results, obtained in a thousand 25-user near-far DSL scenarios, show an average gain in service coverage of more than 13% compared to state-of-the-art sum-rate maximizing spectrum balancing algorithms. © 2011 IEEE.
The reduction of energy consumption in digital subscriber line (DSL) networks has obtained considerable attention recently. Today's DSL is designed under an "always on" principle to keep the crosstalk noise as stable as possible. Departing from this restriction, one approach to achieve energy savings is "lazy scheduling", which exploits the tradeoff between energy consumption and transmission delay inherent in many communication systems. This work extends the scope of this idea to multi-user interference limited systems employing multi-carrier modulation. Mathematical decomposition appears to be a natural approach for cross-layer optimization when the physical-layer spectrum management algorithm is already based on dual relaxation. We identify Benders decomposition as the appropriate choice of an optimization scheme for rate and delay constrained energy-minimization. Based on this we propose a cross-layer scheduler for multi-user/multi-carrier systems. By simulations of a single-hop, multi-user DSL scenario this scheduler is shown to closely approximate the optimal solution to this nonconvex problem. Furthermore, by example we demonstrate that scheduling for interference avoidance in DSL yields negligible additional performance gains over sole physical layer spectrum balancing in practice. ©2009 IEEE.
We investigate a novel cross-layer optimization problem for jointly performing dynamic spectrum management (DSM) and periodic rate-scheduling in time. The large number of carriers used in digital subscriber lines (DSL) makes DSM a large-scale optimization problem for which dual optimization is a commonly used method. The duality-gap, which potentially accompanies the dual optimization for non-convex problems, is typically assumed to be small enough to be neglected. Also, previous theoretical results show a vanishing duality-gap as the number of subcarriers approaches infinity. We will bound the potential performance improvements that can be achieved by the additional rate-scheduling procedure. This bound is found to depend on the duality-gap in the physical layer DSM problem. Furthermore, we will derive bounds on the duality-gap of the two most important optimization problems in DSL, namely the maximization of the weighted sum-rate and the minimization of the weighted sum-power. These bounds are derived for a finite number of subcarriers and are also applicable to the respective problems in orthogonal frequency division multiplex (OFDM) systems. ©2010 IEEE.
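For orientation, the power-constrained weighted sum-rate problem, its Lagrange dual (which decouples per subcarrier, enabling the dual optimization mentioned above), and the duality gap that the bounds concern can be written as follows; the notation is generic rather than the paper's:

```latex
% Weighted sum-rate maximization over K subcarriers and N users with
% per-user power budgets P_n; r_n^k is user n's rate on subcarrier k,
% a nonconvex function of the power vector p^k of all users on k.
\begin{align*}
  R^\star &= \max_{\{p_n^k \ge 0\}} \sum_{n=1}^{N} w_n \sum_{k=1}^{K} r_n^k(p^k)
  \quad \text{s.t.} \quad \sum_{k=1}^{K} p_n^k \le P_n, \quad n = 1,\dots,N,\\
  g(\lambda) &= \sum_{k=1}^{K} \max_{p^k \ge 0} \Big( \sum_{n=1}^{N}
      \big( w_n r_n^k(p^k) - \lambda_n p_n^k \big) \Big)
      + \sum_{n=1}^{N} \lambda_n P_n
      \quad \text{(decouples over } k\text{)},\\
  \Gamma &= \min_{\lambda \ge 0} g(\lambda) - R^\star \;\ge\; 0
      \quad \text{(duality gap)}.
\end{align*}
```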
Dynamic spectrum management (DSM) is an important technique for mitigating crosstalk noise in multi-user digital subscriber line (DSL) environments. Until now, most of the proposed algorithms for DSM have been designed solely for the purpose of bitrate maximization. These algorithms assume a fixed maximum total power and neglect the energy consumption in DSL modems. However, recently there has been a strong interest in the DSL field in reducing energy consumption, as shown, e.g., by the European Commission's code of conduct on energy consumption of broadband equipment. In contrast to traditional DSM, this paper will show how DSM can be used for minimizing the energy consumption. We will formulate a global optimization problem for energy minimization and discuss several of its peculiarities compared to the current DSM problems. Furthermore, we derive an iterative, dual-based and semi-distributed algorithm for its local solution, which we call energy-efficient spectrum balancing (EESB). The performance of the algorithm is evaluated through simulations, which show similar results to optimal schemes. In addition, EESB achieves substantial energy savings that can be exploited by adapting the transmit powers to users' bitrate demand. © 2008 IEEE.
We consider a constrained multi-carrier power allocation problem in interference-limited multi-user systems with a finite set of transmission rates. The Lagrange relaxation is a common technique for decomposing such problems into independently solvable per-subcarrier problems. Deviating from this approach, our main contribution is the proposal of a novel spectrum management framework based on a Nonlinear Dantzig-Wolfe problem decomposition. It allows for suboptimal initialization and suboptimal power allocation methods with low complexity. While we show that the combinatorial per-subcarrier problems have polynomial complexity in the number of users, we find that such suboptimal methods are indispensable in large systems. Thus we give an overview of various basic dual heuristics and provide simulation results on a set of a thousand digital subscriber line (DSL) networks, which show the superior performance of our framework compared to previous power control algorithms. © 2012 IEEE.
Discrete-rate spectrum balancing in interference-limited multi-user and multi-carrier digital subscriber lines (DSL) is a large-scale, non-convex and combinatorial problem. Previously proposed algorithms for its (dual) optimal solution are only applicable for networks with few users, while the suboptimality of less complex bit-loading algorithms has not been adequately studied so far. We deploy constrained optimization techniques as well as problem-specific branch-and-bound and search-space reduction methods, which for the first time give a low-complexity guarantee of optimality in certain multi-user DSL networks of practical size. Simulation results precisely quantify the suboptimality of multi-user bit-loading schemes in a thousand ADSL2 scenarios under measured channel data.
The data-rate in currently deployed multi-carrier digital subscriber line (DSL) communication systems is limited by the interference among copper lines. This interference can be alleviated by multi-user transmit power allocation. Problem decomposition results in a large number of per-subcarrier problems. Our objective is to solve these nonconvex integer per-subcarrier power control problems at low complexity. For this purpose we develop ten combinatorial heuristics and test them by simulation under a small complexity budget in scenarios with tens of DSL users, where optimal solutions are currently intractable. Simulation results lead us to the conclusion that simple randomized greedy heuristics extended by a specific local search perform well despite the stringent complexity restriction. This has implications for multi-user discrete resource allocation algorithms, as these can be designed to jointly optimize transmit power among users even in large-scale scenarios.
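The flavor of such heuristics can be seen in the following sketch of a randomized greedy bit-filling pass followed by a one-bit local search for a single per-subcarrier problem; the interference model, gains, and dual prices are toy values, and the ten heuristics in the paper differ in their details.

```python
# Sketch of a randomized greedy bit-filling pass plus a one-bit local search
# for a single per-subcarrier problem: choose integer bit-loadings b that
# maximize the dual objective sum_n w_n*b_n - sum_n lam_n*p_n(b), where the
# powers p(b) solve the coupled interference equations. All values are toys.
import numpy as np

rng = np.random.default_rng(3)
N, BMAX, SIGMA = 4, 15, 1e-3                 # users, max bits, noise power
g = rng.uniform(0.5, 1.0, N)                 # direct gains
x = rng.uniform(0.001, 0.01, (N, N))
np.fill_diagonal(x, 0.0)                     # crosstalk gains
w = np.ones(N)                               # rate weights
lam = np.full(N, 0.05)                       # dual power prices

def powers(b):
    """Solve p_n = (2**b_n - 1) * (SIGMA + sum_m x[n, m] * p_m) / g_n."""
    f = (2.0 ** b - 1.0) / g
    try:
        p = np.linalg.solve(np.eye(N) - f[:, None] * x, f * SIGMA)
    except np.linalg.LinAlgError:
        return None
    return p if (p >= 0).all() else None     # None marks infeasible loadings

def objective(b):
    p = powers(b)
    return -np.inf if p is None else w @ b - lam @ p

b = np.zeros(N)
while True:                                  # randomized greedy bit-filling
    deltas = [(objective(b + np.eye(N)[n]) - objective(b), n)
              for n in range(N) if b[n] < BMAX]
    improvers = [n for d, n in deltas if d > 0]
    if not improvers:
        break
    b[rng.choice(improvers)] += 1            # random pick among improving moves
for n in range(N):                           # one-bit local search
    for step in (1, -1):
        cand = b.copy()
        cand[n] += step
        if 0 <= cand[n] <= BMAX and objective(cand) > objective(b):
            b = cand
print("bit-loading:", b.astype(int), " objective:", round(float(objective(b)), 3))
```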
Low-power modes (LPM) are a standardized means in asymmetric digital subscriber line 2 (ADSL2) for reducing the power consumption at the central office. However, the activation of LPMs is hampered by the operators' concern about instability introduced by frequent transmit power changes. The injection of artificial noise (AN) has been proposed as a standard-compliant stabilization technique. We develop an analytical solution for setting the AN power spectrum. Based on this solution we jointly optimize the AN power spectrum and the signal-to-noise ratio (SNR) margin. Simulation results show the performance gain in terms of rate and energy compared to heuristic rules for setting the AN power spectrum. We propose and demonstrate three approaches for evaluating the performance of AN-enabled DSL systems, including (a) joint spectrum balancing, AN, and margin optimization, (b) single-user worst-case-stable optimization, and (c) worst-case-stable optimization based on sequential initialization. Simulation results confirm a strong dependency of the performance under AN on the selected SNR margins, and highlight the total AN power consumption as well as the residual energy savings under low-power modes stabilized by AN. © 2012 The Author(s).
The large number of broadband users and its forecast growth have recently triggered research on energy efficiency in digital subscriber lines (DSLs). A promising technique is low-power modes (LPMs), as standardized in asymmetric DSL 2 (ADSL2), which let the DSL connection operate in the downstream direction with reduced transmit rate and power. We study the problem of optimizing the LPM rate-level for energy efficiency. A traffic-independent rate setting is proposed based on an analytical competitive framework. Also, a Markov chain based LPM model is derived which facilitates the fast numerical optimization of the LPM rate-level under realistic traffic models and system constraints. Simulation results under various traffic settings and DSL scenarios demonstrate energy savings by LPMs of around 30–40% of the ADSL2 transceiver's power consumption. Furthermore, they provide insights on how to set the LPM rate-levels in practice for energy-efficient DSL operation. © 2012 Elsevier B.V. All rights reserved.
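As a caricature of such a model (not the paper's Markov chain), the following sketch treats the line as an alternating renewal process between a full-rate and a low-power state and sweeps the LPM rate-level; every number and functional form is an illustrative placeholder.

```python
# Caricature of an LPM rate-level model (not the paper's Markov chain): the
# line alternates between a full-rate state and a low-power state, modeled as
# an alternating renewal process. All numbers and the exit-rate expression
# are illustrative placeholders.
import numpy as np

P_FULL, LAM, IDLE = 2.0, 0.2, 5.0      # full-rate power (W), burst rate (1/s),
                                       # mean time at full rate per burst (s)

def mean_power(r_lpm, p_base=0.6, p_per_mbps=0.04):
    """Average power (W) versus LPM rate-level r_lpm (Mbit/s)."""
    p_lpm = p_base + p_per_mbps * r_lpm          # LPM power grows with rate-level
    stay_lpm = 1.0 / (LAM * np.exp(-r_lpm / 10.0))  # higher rate-level absorbs
                                                    # more bursts without exiting
    frac_lpm = stay_lpm / (stay_lpm + IDLE)      # renewal-reward time fraction
    return frac_lpm * p_lpm + (1.0 - frac_lpm) * P_FULL

for r in (1, 4, 8, 16):                # sweep reveals a best rate-level
    print(f"rate-level {r:2d} Mbit/s -> mean power {mean_power(r):.2f} W")
```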
Optimization of the power spectrum alleviates the crosstalk noise in Digital Subscriber Lines (DSL) and thereby reduces their power consumption. In order to truly assess the DSL system power consumption, this paper presents realistic line driver (LD) power consumption models. These are applicable to any DSL system and extend previous models by parameterizing various circuit-level non-idealities. Based on the model of a class-AB LD we analyze the multi-user power spectrum optimization problem and propose novel algorithms for its global or approximate solution. The simulation results thereby obtained support our claim that this problem can be simplified with negligible performance loss by neglecting the LD model. This motivates the usage of established spectral optimization algorithms, which are shown to significantly reduce the LD power consumption compared to static spectrum management.
Today many of the high performance embedded processors already contain multiple processor cores, and we see heterogeneous manycore architectures being proposed. Therefore, it is very desirable to have a fast way to explore various heterogeneous architectures through the use of an architectural design space exploration tool, giving the designer the option to explore design alternatives before the physical implementation. In this paper, we have extended Heracles, a design space exploration tool for (homogeneous) manycore architectures, to incorporate different types of processing cores, and thus allow us to model heterogeneity. Our tool, called the Heterogeneous Heracles System (HHS), can, besides the already supported MIPS core, also include OpenRISC cores. The new tool retains the possibility available in Heracles to perform register transfer level (RTL) simulations of each explored architecture in Verilog as well as synthesizing it to field-programmable gate arrays (FPGAs). To facilitate the exploration of heterogeneous architectures, we have also extended the graphical user interface (GUI) to support heterogeneity. This GUI provides options to configure the types of core, core settings, memory system and network topology. Some initial results on FPGA utilization are presented from synthesizing both homogeneous and heterogeneous manycore architectures, as well as some benchmark results from both simulated and synthesized architectures.
Dataflow programming is a promising paradigm for high performance embedded parallel computing. When mapping a dataflow program onto a manycore architecture, a key component is the library to express the communication between the actors. In this paper we present a dataflow communication library supporting the CAL actor language. A first implementation of the communication library is created for Adapteva's manycore architecture Epiphany, which contains an on-chip 2-D mesh network. Three different buffering methods, with and without direct memory access (DMA) transfer, have been implemented and evaluated. We have also made a preliminary study on the effect of mapping strategies of the actors onto the cores. The assessment of the library is based on a CAL implementation of a two dimensional inverse discrete cosine transform (2D-IDCT) and our own CAL-to-C compilation framework. As expected, the results show that the most efficient actor-to-core mapping strategy is to keep communication to a nearest-neighbor pattern as much as possible. Thus, the best way to place a pipelined sequence of computations like our 2D-IDCT is to place the actors onto cores in a serpentine fashion, as sketched below. For this application we found that the simple receiver-side buffer outperforms the more complicated buffering strategies that used DMA transfer.
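The serpentine placement is easy to state in code; the sketch below maps a pipeline of actors onto a 2D mesh boustrophedon-style so that consecutive stages are always mesh neighbors (mesh dimensions are illustrative).

```python
# Sketch of the serpentine actor-to-core mapping described above: consecutive
# pipeline stages land on mesh neighbors by traversing the 2D core grid
# boustrophedon-style. Mesh dimensions are illustrative.
def serpentine_mapping(num_actors, rows, cols):
    """Map actor i of a pipeline to (row, col) on a rows x cols mesh."""
    coords = []
    for r in range(rows):
        line = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        coords.extend((r, c) for c in line)
    return coords[:num_actors]

# An 8-stage pipeline on a 4x4 mesh: every hop is distance 1.
mapping = serpentine_mapping(8, 4, 4)
hops = [abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(mapping, mapping[1:])]
print(mapping)
print("max hop distance:", max(hops))   # 1 -> nearest-neighbor communication
```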