Publications (10 of 46)
Rezk, N., Nordström, T., Stathis, D., Ul-Abdin, Z., Aksoy, E. & Hemani, A. (2022). MOHAQ: Multi-Objective Hardware-Aware Quantization of recurrent neural networks. Journal of systems architecture, 133, Article ID 102778.
MOHAQ: Multi-Objective Hardware-Aware Quantization of recurrent neural networks
2022 (English). In: Journal of systems architecture, ISSN 1383-7621, E-ISSN 1873-6165, Vol. 133, article id 102778. Article in journal (Refereed), Published
Abstract [en]

The compression of deep learning models is of fundamental importance in deploying such models to edge devices. The selection of compression parameters can be automated to meet changes in the hardware platform and application. This article introduces a Multi-Objective Hardware-Aware Quantization (MOHAQ) method, which considers hardware performance and inference error as objectives for mixed-precision quantization. The proposed method feasibly evaluates candidate solutions in a large search space by relying on two steps. First, post-training quantization is applied for fast solution evaluation (inference-only search). Second, we propose the "beacon-based search", which retrains selected solutions only and uses them as beacons to estimate the effect of retraining on other solutions. We use speech recognition models on the TIMIT dataset. Experimental evaluations show that Simple Recurrent Unit (SRU)-based models can be compressed up to 8x by post-training quantization without any significant increase in error. On SiLago, we found solutions that achieve 97% and 86% of the maximum possible speedup and energy saving, respectively, with a minor increase in error on an SRU-based model. On Bitfusion, the beacon-based search reduced the error gain of the inference-only search on SRU-based and Light Gated Recurrent Unit (LiGRU)-based models by up to 4.9 and 3.9 percentage points, respectively.
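The two-objective search at the core of this kind of method can be illustrated with a toy Pareto filter over per-layer bit-widths. The sensitivity and size numbers below are made up for illustration; the actual method uses genetic-algorithm search with measured recognition error and hardware models (SiLago, Bitfusion) rather than these proxies:

```python
from itertools import product

# Hypothetical per-layer error sensitivities and sizes; the real
# search evaluates error on TIMIT and speedup/energy on hardware
# models instead of these toy cost functions.
SENS = (3.0, 2.0, 1.0)   # error contribution weight per layer
SIZE = (1.0, 2.0, 4.0)   # relative amount of work per layer
BITS = (4, 8)            # candidate bit-widths per layer

def error(cfg):          # lower precision -> larger error proxy
    return sum(s / b for s, b in zip(SENS, cfg))

def latency(cfg):        # lower precision -> less work (toy model)
    return sum(sz * b for sz, b in zip(SIZE, cfg))

def dominates(a, b):     # a is at least as good, and better somewhere
    return (error(a) <= error(b) and latency(a) <= latency(b)
            and (error(a) < error(b) or latency(a) < latency(b)))

configs = list(product(BITS, repeat=3))   # 3 layers -> 8 configurations
front = [c for c in configs if not any(dominates(o, c) for o in configs)]
```

With these toy models the Pareto front keeps four of the eight mixed-precision configurations, trading inference error against latency.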

Place, publisher, year, edition, pages
Amsterdam: Elsevier, 2022
Keywords
Simple recurrent unit, Light gated recurrent unit, Quantization, Multi-objective optimization, Genetic algorithms
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-48679 (URN), 10.1016/j.sysarc.2022.102778 (DOI), 000892114100006, 2-s2.0-85141919627 (Scopus ID)
Available from: 2022-11-23 Created: 2022-11-23 Last updated: 2023-08-21. Bibliographically approved
Rezk, N. M., Nordström, T. & Ul-Abdin, Z. (2022). Shrink and Eliminate: A Study of Post-Training Quantization and Repeated Operations Elimination in RNN Models. Information, 13(4), Article ID 176.
Shrink and Eliminate: A Study of Post-Training Quantization and Repeated Operations Elimination in RNN Models
2022 (English). In: Information, E-ISSN 2078-2489, Vol. 13, no 4, article id 176. Article in journal (Refereed), Published
Abstract [en]

Recurrent neural networks (RNNs) are neural networks (NNs) designed for time-series applications. There is a growing interest in running RNNs to support these applications on edge devices. However, RNNs have large memory and computational demands that make them challenging to implement on edge devices. Quantization is used to shrink the size and the computational needs of such models by decreasing the precision of weights and activations. Further, the delta networks method increases the sparsity in activation vectors by relying on the temporal relationship between successive input sequences to eliminate repeated computations and memory accesses. In this paper, we study the effect of quantization on LSTM-, GRU-, LiGRU-, and SRU-based RNN models for speech recognition on the TIMIT dataset. We show how to apply post-training quantization to these models with a minimal increase in the error by skipping quantization of selected paths. In addition, we show that quantization of activation vectors in RNNs to integer precision leads to considerable sparsity if the delta networks method is applied. We then propose a method for increasing the sparsity in the activation vectors while minimizing the error and maximizing the percentage of eliminated computations. The proposed quantization method compressed the four models by more than 85%, with error increases of 0.6, 0, 2.1, and 0.2 percentage points, respectively. By applying the delta networks method to the quantized models, more than 50% of the operations can be eliminated, in most cases with only a minor increase in the error. Comparing the four models under the quantization and delta networks methods, we found that compressed LSTM-based models are the best solutions under low error-rate constraints, compressed SRU-based models are the smallest in size and suitable when higher error rates are acceptable, and compressed LiGRU-based models have the highest number of eliminated operations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland.
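The delta-networks idea from the abstract, recomputing an activation only when it has changed by at least a threshold since the previous timestep, can be sketched in a few lines. The threshold and activation values here are illustrative, not from the paper:

```python
def delta_updates(prev, curr, threshold=2):
    """One timestep of delta updating on integer (quantized)
    activations: return the new state and the indices whose
    downstream multiplications must actually be performed."""
    state = list(prev)
    updated = []
    for i, (p, c) in enumerate(zip(prev, curr)):
        if abs(c - p) >= threshold:
            state[i] = c            # changed enough: recompute
            updated.append(i)
        # else: keep the stale value and skip its column of work
    return state, updated

t0 = [0, 0, 0, 0]                   # previous quantized activations
t1 = [5, 0, 1, 0]                   # current quantized activations
state, updated = delta_updates(t0, t1, threshold=2)
# Only index 0 crosses the threshold: 3 of 4 columns are eliminated.
```

Coarser (integer) activation quantization makes exact repeats between timesteps more likely, which is why quantization and delta networks compose well in the study.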

Place, publisher, year, edition, pages
Basel: MDPI, 2022
Keywords
delta networks, edge devices, quantization, recurrent neural network
National Category
Computer Sciences
Identifiers
urn:nbn:se:hh:diva-46752 (URN), 10.3390/info13040176 (DOI), 000786262400001, 34789458 (PubMedID), 2-s2.0-85128393517 (Scopus ID)
Funder
ELLIIT - The Linköping‐Lund Initiative on IT and Mobile Communications
Available from: 2022-05-02 Created: 2022-05-02 Last updated: 2022-11-23. Bibliographically approved
Savas, S., Ul-Abdin, Z. & Nordström, T. (2020). A Framework to Generate Domain-Specific Manycore Architectures from Dataflow Programs. Microprocessors and microsystems, 72, Article ID 102908.
A Framework to Generate Domain-Specific Manycore Architectures from Dataflow Programs
2020 (English). In: Microprocessors and microsystems, ISSN 0141-9331, E-ISSN 1872-9436, Vol. 72, article id 102908. Article in journal (Refereed), Published
Abstract [en]

Over the last 15 years, as a response to the power and thermal limits of current chip technologies, we have seen an explosion in the use of multiple, and even many, computer cores on a single chip. But now, to further improve performance and energy efficiency when there are potentially hundreds of computing cores on a chip, we see a need for specialization of individual cores and the development of heterogeneous manycore computer architectures.

However, developing such heterogeneous architectures is a significant challenge. We therefore propose a design method to generate domain-specific manycore architectures based on the RISC-V instruction set architecture, and automate the main steps of this method with software tools. The design method allows the generation of manycore architectures with different configurations, including core augmentation through instruction extensions and custom accelerators. The method starts from applications developed in a high-level dataflow language and ends by generating synthesizable Verilog code and a cycle-accurate emulator for the generated architecture.

We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measure their performance and hardware resource usage. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than their general-purpose counterparts. In certain cases, replacing general-purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates design space exploration of manycore architectures. © 2019 The Authors. Published by Elsevier B.V.

Place, publisher, year, edition, pages
Amsterdam: Elsevier, 2020
Keywords
Domain-specific, multicore, manycore, accelerator, code generation, hardware/software co-design
National Category
Computer Systems Embedded Systems Signal Processing
Identifiers
urn:nbn:se:hh:diva-39323 (URN), 10.1016/j.micpro.2019.102908 (DOI), 000513294700002, 2-s2.0-85073496598 (Scopus ID)
Projects
HiPEC (High Performance Embedded Computing), NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability)
Funder
Vinnova, Swedish Foundation for Strategic Research
Available from: 2019-05-07 Created: 2019-05-07 Last updated: 2021-10-19. Bibliographically approved
Rezk, N., Purnaprajna, M., Nordström, T. & Ul-Abdin, Z. (2020). Recurrent Neural Networks: An Embedded Computing Perspective. IEEE Access, 8, 57967-57996
Recurrent Neural Networks: An Embedded Computing Perspective
2020 (English). In: IEEE Access, E-ISSN 2169-3536, Vol. 8, p. 57967-57996. Article in journal (Refereed), Published
Abstract [en]

Recurrent Neural Networks (RNNs) are a class of machine learning algorithms used for applications with time-series and sequential data. Recently, there has been a strong interest in executing RNNs on embedded devices. However, difficulties arise because RNNs require high computational capability and a large memory space. In this paper, we review existing implementations of RNN models on embedded platforms and discuss the methods adopted to overcome the limitations of embedded systems. We define the objectives of mapping RNN algorithms on embedded platforms and the challenges facing their realization. Then, we explain the components of RNN models from an implementation perspective. We also discuss the optimizations applied to RNNs to run efficiently on embedded platforms. Finally, we compare the defined objectives with the implementations and highlight some open research questions and aspects currently not addressed for embedded RNNs. Overall, applying algorithmic optimizations to RNN models and decreasing the memory access overhead are vital to obtaining high efficiency. To further increase implementation efficiency, we point out the more promising optimizations that could be applied in future research. Additionally, this article observes that many implementations have targeted high performance, while flexibility has, as yet, been attempted less often. Thus, the article provides some guidelines to help RNN hardware designers better support flexibility. © 2020 IEEE.

Place, publisher, year, edition, pages
Piscataway: IEEE, 2020
Keywords
Compression, flexibility, efficiency, embedded computing, long short term memory (LSTM), quantization, recurrent neural networks (RNNs)
National Category
Computer Systems
Identifiers
urn:nbn:se:hh:diva-41981 (URN), 10.1109/ACCESS.2020.2982416 (DOI), 000527411700168, 2-s2.0-85082939909 (Scopus ID)
Projects
NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability)
Funder
Vinnova, INT/SWD/VINN/p-10/2015
Note

As manuscript in thesis.

Other funding: Government of India

Available from: 2020-04-30 Created: 2020-04-30 Last updated: 2022-11-23. Bibliographically approved
Savas, S., Ul-Abdin, Z. & Nordström, T. (2019). A Configurable Two Dimensional Mesh Network-on-Chip Implementation in Chisel.
A Configurable Two Dimensional Mesh Network-on-Chip Implementation in Chisel
2019 (English). Other (Other academic)
Abstract [en]

On-chip communication plays a significant role in the performance of manycore architectures. They therefore require a proper on-chip communication infrastructure that can scale with the number of cores. As a solution, network-on-chip structures have emerged and are being used.

This paper describes a two-dimensional mesh network-on-chip router and a network interface, implemented in Chisel for integration into the Rocket Chip generator, which generates RISC-V (Rocket) cores. The router is implemented in VHDL as well, and the two implementations are verified and compared.

Hardware resource usage and performance of different-sized networks are analyzed. The implementations are synthesized for a Xilinx Ultrascale FPGA via Xilinx tools to obtain hardware resource usage and clock frequency results. The performance results, including latency and throughput measurements with different traffic patterns, are collected with cycle-accurate emulations.

The implementations in Chisel and VHDL do not show a significant difference. Chisel requires around 10% fewer lines of code; however, the difference in the synthesis results is negligible. Our latency results are better than those of the majority of other studies. The remaining results, such as hardware usage, clock frequency, and throughput, are competitive with related works.
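A common deterministic routing policy for such 2D meshes is dimension-ordered (XY) routing; whether this particular router uses it is not stated in the excerpt, so treat the sketch below as a generic illustration of mesh routing rather than the paper's implementation:

```python
def xy_route(src, dst):
    """Dimension-ordered (XY) routing on a 2D mesh: travel along
    the x dimension first, then along y.  Returns the sequence of
    visited node coordinates, including source and destination."""
    x, y = src
    path = [src]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path

# Hop count equals the Manhattan distance between the two nodes,
# which is what makes latency on a mesh easy to reason about.
```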

Keywords
network-on-chip, Chisel, mesh, scalable
National Category
Computer Systems
Identifiers
urn:nbn:se:hh:diva-39324 (URN)
Funder
Vinnova
Note

As manuscript in thesis

Available from: 2019-05-07 Created: 2019-05-07 Last updated: 2020-10-02. Bibliographically approved
Savas, S., Yassin, A., Nordström, T. & Ul-Abdin, Z. (2019). Using Harmonized Parabolic Synthesis to Implement a Single-Precision Floating-Point Square Root Unit. In: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). Paper presented at International Symposium on VLSI Design (ISVLSI), Miami, Florida, USA, July 15-17, 2019 (pp. 621-626). IEEE conference proceedings
Using Harmonized Parabolic Synthesis to Implement a Single-Precision Floating-Point Square Root Unit
2019 (English). In: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), IEEE conference proceedings, 2019, p. 621-626. Conference paper, Published paper (Refereed)
Abstract [en]

This paper proposes a novel method for performing the square root operation on floating-point numbers represented in the IEEE-754 single-precision (binary32) format. The method is implemented using Harmonized Parabolic Synthesis. It is implemented both with and without pipeline stages and synthesized for two different Xilinx FPGA boards.

The implementations show better resource usage and latency results when compared to other similar works, including the Xilinx intellectual property (IP) core that uses the CORDIC method. Any method calculating the square root will make approximation errors. Unless these errors are distributed evenly around zero, they can accumulate and give a biased result. An attractive feature of the proposed method is that it distributes the errors evenly around zero, in contrast to CORDIC, for instance.

Due to its small size, low latency, high throughput, and good error properties, the presented floating-point square root unit is suitable for high-performance embedded systems. It can be integrated into a processor's floating-point unit or be used as a stand-alone accelerator. © 2019 IEEE.
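The overall structure of a binary32 square root, splitting the number into exponent and mantissa so only the mantissa needs an approximation, can be sketched in software. Here a plain initial guess plus Newton refinement stands in for the Harmonized Parabolic Synthesis approximation of the paper:

```python
import struct

def sqrt_binary32(x: float) -> float:
    """Square root of a positive normal binary32 value via
    exponent/mantissa decomposition.  The mantissa approximation
    below is a generic stand-in, not the paper's parabolic
    synthesis coefficients."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    e = ((bits >> 23) & 0xFF) - 127           # unbiased exponent
    m = 1.0 + (bits & 0x7FFFFF) / 2.0 ** 23   # mantissa in [1, 2)
    if e % 2:                                 # make e even so that
        m *= 2.0                              # sqrt(2^e) is exact
        e -= 1
    y = (1.0 + m) / 2.0                       # initial guess, m in [1, 4)
    for _ in range(3):                        # Newton: y <- (y + m/y)/2
        y = 0.5 * (y + m / y)
    return y * 2.0 ** (e // 2)
```

Halving an even exponent is exact, so the whole approximation error lives in the bounded mantissa interval; in hardware this is what a small fixed-coefficient approximation exploits.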

Place, publisher, year, edition, pages
IEEE conference proceedings, 2019
Series
VLSI, IEEE Computer Society Annual Symposium on, ISSN 2159-3469, E-ISSN 2159-3477
Keywords
square root, floating-point, harmonized parabolic synthesis, fpga, hardware
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-39322 (URN), 10.1109/ISVLSI.2019.00116 (DOI), 000538332100107, 2-s2.0-85072977463 (Scopus ID), 978-1-7281-3391-1 (ISBN), 978-1-7281-3392-8 (ISBN)
Conference
International Symposium on VLSI Design (ISVLSI), Miami, Florida, USA, July 15-17, 2019
Funder
Vinnova
Available from: 2019-05-07 Created: 2019-05-07 Last updated: 2023-08-21. Bibliographically approved
Savas, S., Ul-Abdin, Z. & Nordström, T. (2018). Designing Domain Specific Heterogeneous Manycore Architectures Based on Building Blocks.
Designing Domain Specific Heterogeneous Manycore Architectures Based on Building Blocks
2018 (English). Manuscript (preprint) (Other academic)
Abstract [en]

Performance and power requirements have pushed computer architectures from single-core to manycore. These requirements now continue to push manycores with identical cores (homogeneous) towards manycores with specialized cores (heterogeneous). However, designing heterogeneous manycores is a challenging task due to the complexity of the architectures. We propose an approach for designing domain-specific heterogeneous manycore architectures based on building blocks. These blocks are defined as the common computations of the applications within a domain. The objective is to generate heterogeneous architectures by integrating many of these blocks into many simple cores and connecting the cores with a network-on-chip. The proposed approach aims to ease the design of heterogeneous manycore architectures and facilitate use of the dark silicon concept. As a case study, we develop an accelerator based on several building blocks, integrate it into a RISC core, and synthesize it on a Xilinx Ultrascale FPGA. The results show that executing a hot-spot of an application on an accelerator based on building blocks increases the performance by 15x, with room for further improvement. The area usage increases as well; however, there are potential optimizations to reduce it. © 2018 by the authors

Keywords
heterogeneous architecture design, risc-v, dataflow, QR decomposition, domain-specific processor, accelerator, Autofocus, hardware software co-design
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-33818 (URN)
Projects
HiPEC (High Performance Embedded Computing), NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability)
Funder
Swedish Foundation for Strategic Research, Vinnova
Available from: 2017-05-09 Created: 2017-05-09 Last updated: 2020-10-02. Bibliographically approved
Savas, S., Ul-Abdin, Z. & Nordström, T. (2018). Designing Domain-Specific Heterogeneous Architectures from Dataflow Programs. Computers, 7(2), Article ID 27.
Designing Domain-Specific Heterogeneous Architectures from Dataflow Programs
2018 (English). In: Computers, ISSN 2073-431X, Vol. 7, no 2, article id 27. Article in journal (Refereed), Published
Abstract [en]

The last ten years have seen performance and power requirements pushing computer architectures using only a single core towards so-called manycore systems with hundreds of cores on a single chip. To further increase performance and energy efficiency, we are now seeing the development of heterogeneous architectures with specialized and accelerated cores. However, designing these heterogeneous systems is a challenging task due to their inherent complexity. We proposed an approach for designing domain-specific heterogeneous architectures based on instruction augmentation through the integration of hardware accelerators into simple cores. These hardware accelerators were determined based on their common use among applications within a certain domain. The objective was to generate heterogeneous architectures by integrating many of these accelerated cores and connecting them with a network-on-chip. The proposed approach aimed to ease the design of heterogeneous manycore architectures, and consequently exploration of the design space, by automating the design steps. To evaluate our approach, we enhanced our software tool chain with a tool that can generate accelerated cores from dataflow programs. This new tool chain was evaluated with the aid of two use cases: radar signal processing and mobile baseband processing. We could achieve an approximately 4x improvement in performance while executing complete applications on the augmented cores, with a small impact (2.5-13%) on area usage. The generated accelerators are competitive, achieving more than 90% of the performance of hand-written implementations.

Place, publisher, year, edition, pages
Basel: MDPI AG, 2018
Keywords
heterogeneous architecture design, risc-v, dataflow, QR decomposition, domain-specific processor, accelerator, Autofocus, hardware software co-design
National Category
Computer Systems
Identifiers
urn:nbn:se:hh:diva-36669 (URN), 10.3390/computers7020027 (DOI), 000436492500008, 2-s2.0-85056771712 (Scopus ID)
Projects
Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability (NGES)
Funder
Swedish Foundation for Strategic Research, Vinnova
Available from: 2018-04-24 Created: 2018-04-24 Last updated: 2020-10-02. Bibliographically approved
Rezk, N., Purnaprajna, M. & Ul-Abdin, Z. (2018). Streaming Tiles: Flexible Implementation of Convolution Neural Networks Inference on Manycore Architectures. In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Paper presented at The 7th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, Vancouver, British Columbia, Canada, May 21, 2018 (pp. 867-876). Los Alamitos: IEEE Computer Society
Streaming Tiles: Flexible Implementation of Convolution Neural Networks Inference on Manycore Architectures
2018 (English). In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Los Alamitos: IEEE Computer Society, 2018, p. 867-876. Conference paper, Published paper (Refereed)
Abstract [en]

Convolutional neural networks (CNNs) are extensively used for deep learning applications such as image recognition and computer vision. The convolution module of these networks is highly compute-intensive. Having an efficient implementation of the convolution module enables realizing the inference part of the neural network on embedded platforms. Low-precision parameters require less memory, less computation time, and less power consumption while achieving high classification accuracy. Furthermore, streaming the data over parallelized processing units saves a considerable amount of memory, which is a key concern in memory-constrained embedded platforms. In this paper, we explore the design space for streamed CNNs on the Epiphany manycore architecture using varying precisions for weights (ranging from binary to 32-bit). Both AlexNet and GoogleNet are explored for two different memory sizes of Epiphany cores. We are able to achieve competitive performance for both AlexNet and GoogleNet with respect to emerging manycores. Furthermore, the effects of different design choices in terms of precision, memory size, and the number of cores are evaluated by applying the proposed method.
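At the binary end of the explored precision range, the convolution's multiplications degenerate into additions and subtractions, which is part of what makes low-precision weights attractive on memory- and compute-constrained cores. A minimal 1-D sketch (the paper works with full 2-D CNN layers, so this is only illustrative):

```python
def conv1d_binary(signal, weight_signs):
    """Valid-mode 1-D convolution with binary (+1/-1) weights:
    each tap is an add or a subtract, no multiplier required."""
    k = len(weight_signs)
    out = []
    for i in range(len(signal) - k + 1):
        acc = 0
        for x, w in zip(signal[i:i + k], weight_signs):
            acc = acc + x if w > 0 else acc - x
        out.append(acc)
    return out
```

Storing each weight as a single sign bit also shrinks the weight memory by 32x relative to 32-bit floats, which matters for the per-core memory sizes explored on Epiphany.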

Place, publisher, year, edition, pages
Los Alamitos: IEEE Computer Society, 2018
Keywords
manycores, CNN, stream processing, embedded systems
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-36887 (URN), 10.1109/IPDPSW.2018.00138 (DOI), 000541051600099, 2-s2.0-85052195969 (Scopus ID), 978-1-5386-5555-9 (ISBN), 978-1-5386-5556-6 (ISBN)
Conference
The 7th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, Vancouver, British Columbia, Canada, May 21, 2018
Projects
NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability)
Funder
Vinnova
Note

As manuscript in thesis.

Other funding: Department of Science and Technology, Government of India.

Available from: 2018-06-01 Created: 2018-06-01 Last updated: 2023-10-05. Bibliographically approved
Ul-Abdin, Z. & Mingkun, Y. (2017). A Radar Signal Processing Case Study for Dataflow Programming of Manycores. Journal of Signal Processing Systems, 87(1), 49-62
A Radar Signal Processing Case Study for Dataflow Programming of Manycores
2017 (English). In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 87, no 1, p. 49-62. Article in journal (Refereed), Published
Abstract [en]

The successful realization of next-generation radar systems places high performance demands on the signal processing chain. Among these systems are advanced Active Electronically Scanned Array (AESA) radars, in which complex calculations must be performed on huge sets of data in real time. Manycore architectures are designed to provide the flexibility and high performance essential for such streaming applications. This paper deals with the implementation of compute-intensive parts of the AESA radar signal processing chain in a high-level dataflow language, CAL. We evaluate the approach by targeting a commercial manycore architecture, Epiphany, and present our findings in terms of performance and productivity gains achieved in this case study. Comparison of the performance results with reference sequential implementations executing on a state-of-the-art embedded processor shows that we are able to achieve a speedup of 1.6x to 4.4x using only 10 cores of Epiphany. © Springer Science+Business Media New York 2015
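The dataflow style of CAL, actors that fire when enough tokens sit on their input FIFOs, can be mimicked in a few lines. The actor names and the two-stage pipeline below are illustrative, not the paper's AESA processing chain:

```python
from collections import deque

class Actor:
    """Fire the wrapped function when `arity` tokens are available
    on the input FIFO, pushing the result onto the output FIFO."""
    def __init__(self, fn, inp, out, arity=1):
        self.fn, self.inp, self.out, self.arity = fn, inp, out, arity

    def try_fire(self):
        if len(self.inp) >= self.arity:
            tokens = [self.inp.popleft() for _ in range(self.arity)]
            self.out.append(self.fn(*tokens))
            return True
        return False

src, mid, sink = deque([1, 2, 3, 4]), deque(), deque()
scale = Actor(lambda t: 2 * t, src, mid)            # e.g. a gain stage
pair_sum = Actor(lambda a, b: a + b, mid, sink, 2)  # e.g. integration

while scale.try_fire() or pair_sum.try_fire():      # run to quiescence
    pass
```

Because firing depends only on token availability, the same program can be mapped onto parallel cores by giving each actor its own core and turning the FIFOs into on-chip channels, which is the essence of dataflow programming of manycores.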

Place, publisher, year, edition, pages
New York: Springer-Verlag New York, 2017
Keywords
Dataflow language, Manycore architecture, Radar signal processing, Compiler
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:hh:diva-29826 (URN), 10.1007/s11265-015-1078-1 (DOI), 000396155700004, 2-s2.0-84948408530 (Scopus ID)
Projects
STAMP, HiPEC
Funder
Knowledge Foundation, ELLIIT - The Linköping-Lund Initiative on IT and Mobile Communications, Swedish Foundation for Strategic Research
Available from: 2015-11-26 Created: 2015-11-26 Last updated: 2017-11-29. Bibliographically approved
Projects
Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability [2015-04178_Vinnova]; Halmstad University

Publications
Savas, S., Ul-Abdin, Z. & Nordström, T. (2020). A Framework to Generate Domain-Specific Manycore Architectures from Dataflow Programs. Microprocessors and microsystems, 72, Article ID 102908.
Savas, S., Ul-Abdin, Z. & Nordström, T. (2019). A Configurable Two Dimensional Mesh Network-on-Chip Implementation in Chisel.
Savas, S. (2019). Hardware/Software Co-Design of Heterogeneous Manycore Architectures. (Doctoral dissertation). Halmstad: Halmstad University Press.
Savas, S., Yassin, A., Nordström, T. & Ul-Abdin, Z. (2019). Using Harmonized Parabolic Synthesis to Implement a Single-Precision Floating-Point Square Root Unit. In: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). Paper presented at International Symposium on VLSI Design (ISVLSI), Miami, Florida, USA, July 15-17, 2019 (pp. 621-626). IEEE conference proceedings.
Identifiers
ORCID iD: orcid.org/0000-0002-4932-4036
