hh.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Hardware/Software Co-Design of Heterogeneous Manycore Architectures
Halmstad University, School of Information Technology, Halmstad Embedded and Intelligent Systems Research (EIS), Centre for Research on Embedded Systems (CERES).ORCID iD: 0000-0001-8652-0098
2019 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In the era of big data, advanced sensing, and artificial intelligence, the required computation power is provided mostly by multicore and manycore architectures. However, the performance demand keeps growing. Thus the computer architectures need to continue evolving and provide higher performance. The applications, which are executed on the manycore architectures, are divided into several tasks to be mapped on separate cores and executed in parallel. Usually these tasks are not identical and may be executed more efficiently on different types of cores within a heterogeneous architecture. Therefore, we believe that the heterogeneous manycores are the next step for the computer architectures. However, there is a lack of knowledge on what form of heterogeneity is the best match for a given application or application domain. This knowledge can be acquired through designing these architectures and testing different design configurations. However, designing these architectures is a great challenge. Therefore, there is a need for an automated design method to facilitate the architecture design and design space exploration to gather knowledge on architectures with different configurations. Additionally, it is already difficult to program manycore architectures efficiently and this difficulty will only increase further with the introduction of heterogeneity due to the increase in the complexity of the architectures, unless this complexity is somehow hidden. There is a need for software development tools to facilitate the software development for these architectures and enable portability of the same software across different manycore platforms.

In this thesis, we first address the challenges of the software development for manycore architectures. We evaluate a dataflow language (CAL) and a source-to-source compilation framework (Cal2Many) with several case studies in order to reveal their impact on productivity and performance of the software. The language supports task level parallelism by adopting actor model and the framework takes CAL code and generates implementations in the native language of several different architectures.

In order to address the challenge of custom hardware development, we first evaluate a commercial manycore architecture namely Epiphany and identify its demerits. Then we study manycore architectures in order to reveal possible uses of heterogeneity in manycores and facilitate choice of architecture for software and hardware development. We define a taxonomy for manycore architectures that is based on the levels of heterogeneity they contain and discuss the benefits and drawbacks of these levels. We finally develop and evaluate a design method to design heterogeneous manycore architectures customized based on application requirements. The architectures designed with this method consist of cores with application specific accelerators. The majority of the design method is automated with software tools, which support different design configurations in order to increase the productivity of the hardware developer and enable design space exploration.

Our results show that the dataflow language, together with the software development tool, decreases software development efforts significantly (25-50%), while having a small impact (2-17%) on the performance. The evaluation of the design method reveal that the performance of automatically generated accelerators is between 96-100% of the performance of their manually developed counterparts. Additionally, it is possible to increase the performance of the architectures by increasing the number of cores and using application specific accelerators, usually with a cost on the area usage. However, under certain circumstances, using accelerator may lead to avoiding usage of large general purpose components such as the floating-point unit and therefore improves the area utilization. Eventually, the final impact on the performance and area usage depends on the configurations. When compared to the Epiphany architecture, which is a commercial homogeneous manycore, the generated manycores show competitive results. We can conclude that the automated design method simplifies heterogeneous manycore architecture design and facilitates design space exploration with the use of configurable parameters.

Place, publisher, year, edition, pages
Halmstad: Halmstad University Press, 2019. , p. 205
Series
Halmstad University Dissertations ; 57
Keywords [en]
hardware/software co-design, manycore architectures, heterogeneous manycores, processor design, parallel computing, high performance computing
National Category
Computer Systems Embedded Systems
Identifiers
URN: urn:nbn:se:hh:diva-39325ISBN: 978-91-88749-22-2 (print)ISBN: 978-91-88749-23-9 (electronic)OAI: oai:DiVA.org:hh-39325DiVA, id: diva2:1314192
Public defence
2019-05-28, Wigforssalen, Visionen, Kristian IV:s väg 3, Halmstad, 13:15 (English)
Opponent
Supervisors
Projects
HIPEC - High Performance Embedded ComputingESCHER
Part of project
Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability, Vinnova
Funder
VinnovaKnowledge FoundationSwedish Foundation for Strategic Research Available from: 2019-05-08 Created: 2019-05-07 Last updated: 2019-05-08Bibliographically approved
List of papers
1. An Evaluation of Code Generation of Dataflow Languages on Manycore Architectures
Open this publication in new window or tab >>An Evaluation of Code Generation of Dataflow Languages on Manycore Architectures
Show others...
2014 (English)In: RTCSA 2014: 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications, Piscataway, NJ: IEEE Press, 2014, article id 6910501Conference paper, Published paper (Refereed)
Abstract [en]

Today computer architectures are shifting from single core to manycores due to several reasons such as performance demands, power and heat limitations. However, shifting to manycores results in additional complexities, especially with regard to efficient development of applications. Hence there is a need to raise the abstraction level of development techniques for the manycores while exposing the inherent parallelism in the applications. One promising class of programming languages is dataflow languages and in this paper we evaluate and optimize the code generation for one such language, CAL. We have also developed a communication library to support the inter-core communication.The code generation can target multiple architectures, but the results presented in this paper is focused on Adapteva's many core architecture Epiphany.We use the two-dimensional inverse discrete cosine transform (2D-IDCT) as our benchmark and compare our code generation from CAL with a hand-written implementation developed in C. Several optimizations in the code generation as well as in the communication library are described, and we have observed that the most critical optimization is reducing the number of external memory accesses. Combining all optimizations we have been able to reduce the difference in execution time between auto-generated and hand-written implementations from a factor of 4.3x down to a factor of only 1.3x. ©2014 IEEE.

Place, publisher, year, edition, pages
Piscataway, NJ: IEEE Press, 2014
Keywords
Manycore, Dataflow Languages, code generation, Actor Machine, 2D-IDCT, Epiphany, evaluation
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-25649 (URN)10.1109/RTCSA.2014.6910501 (DOI)000352610400005 ()2-s2.0-84908637354 (Scopus ID)
Conference
RTCSA 2014, 20th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, Chongqing, China, August 20-22, 2014
Projects
HiPEC project
Funder
Knowledge FoundationSwedish Foundation for Strategic Research
Note

The authors would like to thank Adapteva Inc. for giving access to their software development suite and hardware board. This research is part of the CERES research program funded by the Knowledge Foundation and HiPEC project funded by Swedish Foundation for Strategic Research (SSF).

Available from: 2014-06-16 Created: 2014-06-16 Last updated: 2019-05-07Bibliographically approved
2. Dataflow Implementation of QR Decomposition on a Manycore
Open this publication in new window or tab >>Dataflow Implementation of QR Decomposition on a Manycore
Show others...
2016 (English)In: MES '16: Proceedings of the Third ACM International Workshop on Many-core Embedded Systems, New York, NY: ACM Press, 2016, p. 26-30Conference paper, Published paper (Refereed)
Abstract [en]

While parallel computer architectures have become mainstream, application development on them is still challenging. There is a need for new tools, languages and programming models. Additionally, there is a lack of knowledge about the performance of parallel approaches of basic but important operations, such as the QR decomposition of a matrix, on current commercial manycore architectures.

This paper evaluates a high level dataflow language (CAL), a source-to-source compiler (Cal2Many) and three QR decomposition algorithms (Givens Rotations, Householder and Gram-Schmidt). The algorithms are implemented both in CAL and hand-optimized C languages, executed on Adapteva's Epiphany manycore architecture and evaluated with respect to performance, scalability and development effort.

The performance of the CAL (generated C) implementations gets as good as 2\% slower than the hand-written versions. They require an average of 25\% fewer lines of source code without significantly increasing the binary size. Development effort is reduced and debugging is significantly simplified. The implementations executed on Epiphany cores outperform the GNU scientific library on the host ARM processor of the Parallella board by up to 30x. © 2016 Copyright held by the owner/author(s).

Place, publisher, year, edition, pages
New York, NY: ACM Press, 2016
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-32371 (URN)10.1145/2934495.2934499 (DOI)2-s2.0-84991106778 (Scopus ID)978-1-4503-4262-9 (ISBN)
Conference
MES '16, International Workshop on Many-core Embedded Systems, Seoul, Republic of Korea, June 19, 2016
Projects
ESCHERHiPEC
Funder
Knowledge FoundationSwedish Foundation for Strategic Research ELLIIT - The Linköping‐Lund Initiative on IT and Mobile Communications
Available from: 2016-11-04 Created: 2016-11-04 Last updated: 2019-05-07Bibliographically approved
3. Efficient Single-Precision Floating-Point Division Using Harmonized Parabolic Synthesis
Open this publication in new window or tab >>Efficient Single-Precision Floating-Point Division Using Harmonized Parabolic Synthesis
2017 (English)In: 2017 IEEE Computer Society Annual Symposium on VLSI: ISVLSI 2017 / [ed] Michael Hübner, Ricardo Reis, Mircea Stan & Nikolaos Voros, Los Alamitos: IEEE, 2017Conference paper, Published paper (Refereed)
Abstract [en]

This paper proposes a novel method for performing division on floating-point numbers represented in IEEE-754 single-precision (binary32) format. The method is based on an inverter, implemented as a combination of Parabolic Synthesis and second-degree interpolation, followed by a multiplier. It is implemented with and without pipeline stages individually and synthesized while targeting a Xilinx Ultrascale FPGA.

The implementations show better resource usage and latency results when compared to other implementations based on different methods. In case of throughput, the proposed method outperforms most of the other works, however, some Altera FPGAs achieve higher clock rate due to the differences in the DSP slice multiplier design.

Due to the small size, low latency and high throughput, the presented floating-point division unit is suitable for high performance embedded systems and can be integrated into accelerators or be used as a stand-alone accelerator.

Place, publisher, year, edition, pages
Los Alamitos: IEEE, 2017
Series
IEEE Computer Society Annual Symposium on VLSI, ISSN 2159-3477
Keywords
Floating-point, single precision, division, FPGA, Harmonized Parabolic Synthesis
National Category
Computer Systems
Identifiers
urn:nbn:se:hh:diva-33793 (URN)10.1109/ISVLSI.2017.28 (DOI)2-s2.0-85027258772 (Scopus ID)978-1-5090-6762-6 (ISBN)978-1-5090-6763-3 (ISBN)
Conference
IEEE Computer Society Annual Symposium on VLSI, July 3-5, 2017, Bochum, Germany
Projects
NGES
Funder
VINNOVA
Available from: 2017-05-05 Created: 2017-05-05 Last updated: 2019-05-07Bibliographically approved
4. Designing Domain-Specific Heterogeneous Architectures from Dataflow Programs
Open this publication in new window or tab >>Designing Domain-Specific Heterogeneous Architectures from Dataflow Programs
2018 (English)In: Computers, ISSN 2073-431X, Vol. 7, no 2, article id 27Article in journal (Refereed) Published
Abstract [en]

The last ten years have seen performance and power requirements pushing computer architectures using only a single core towards so-called manycore systems with hundreds of cores on a single chip. To further increase performance and energy efficiency, we are now seeing the development of heterogeneous architectures with specialized and accelerated cores. However, designing these heterogeneous systems is a challenging task due to their inherent complexity. We proposed an approach for designing domain-specific heterogeneous architectures based on instruction augmentation through the integration of hardware accelerators into simple cores. These hardware accelerators were determined based on their common use among applications within a certain domain.The objective was to generate heterogeneous architectures by integrating many of these accelerated cores and connecting them with a network-on-chip. The proposed approach aimed to ease the design of heterogeneous manycore architectures—and, consequently, exploration of the design space—by automating the design steps. To evaluate our approach, we enhanced our software tool chain with a tool that can generate accelerated cores from dataflow programs. This new tool chain was evaluated with the aid of two use cases: radar signal processing and mobile baseband processing. We could achieve an approximately 4x improvement in performance, while executing complete applications on the augmented cores with a small impact (2.5–13%) on area usage. The generated accelerators are competitive, achieving more than 90% of the performance of hand-written implementations.

Place, publisher, year, edition, pages
Basel: MDPI AG, 2018
Keywords
heterogeneous architecture design, risc-v, dataflow, QR decomposition, domain-specific processor, accelerator, Autofocus, hardware software co-design
National Category
Computer Systems
Identifiers
urn:nbn:se:hh:diva-36669 (URN)10.3390/computers7020027 (DOI)
Projects
Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability (NGES)
Funder
Swedish Foundation for Strategic Research VINNOVA
Available from: 2018-04-24 Created: 2018-04-24 Last updated: 2019-05-07Bibliographically approved
5. Using Harmonized Parabolic Synthesis to Implement a Single-Precision Floating-Point Square Root Unit
Open this publication in new window or tab >>Using Harmonized Parabolic Synthesis to Implement a Single-Precision Floating-Point Square Root Unit
2019 (English)Conference paper, Published paper (Refereed)
Abstract [en]

This paper proposes a novel method for performing square root operation on floating-point numbers represented in IEEE-754 single-precision (binary32) format. The method is implemented using Harmonized Parabolic Synthesis. It is implemented with and without pipeline stages individually and synthesized for two different Xilinx FPGA boards.

The implementations show better resource usage and latency results when compared to other similar works including Xilinx intellectual property (IP) that uses the CORDIC method. Any method calculating the square root will make approximation errors. Unless these errors are distributed evenly around zero, they can accumulate and give a biased result. An attractive feature of the proposed method is the fact that it distributes the errors evenly around zero, in contrast to CORDIC for instance.

Due to the small size, low latency, high throughput, and good error properties, the presented floating-point square root unit is suitable for high performance embedded systems. It can be integrated into a processor’s floating point unit or be used as astand-alone accelerator.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2019
Keywords
square root, floating-point, harmonized parabolic synthesis, fpga, hardware
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-39322 (URN)
Conference
International Symposium on VLSI Design (ISVLSI), Miami, Florida, USA, July 15-17, 2019
Funder
Vinnova
Available from: 2019-05-07 Created: 2019-05-07 Last updated: 2019-10-14
6. A Configurable Two Dimensional Mesh Network-on-Chip Implementation in Chisel
Open this publication in new window or tab >>A Configurable Two Dimensional Mesh Network-on-Chip Implementation in Chisel
2019 (English)Conference paper, Published paper (Refereed)
Abstract [en]

On-chip communication plays a significant role in the performance of manycore architectures. Therefore, they require a proper on-chip communication infrastructure that can scale with the number of the cores. As a solution, network-on-chip structures have emerged and are being used.

This paper presents description of a two dimensional mesh network-on-chip router and a network interface, which are implemented in Chisel to be integrated to the rocket chip generator that generates RISC-V (rocket) cores. The router is implemented in VHDL as well and the two implementations are verified and compared.

Hardware resource usage and performance of different sized networks are analyzed. The implementations are synthesized for a Xilinx Ultrascale FPGA via Xilinx tools for the hardware resource usage and clock frequency results. The performance results including latency and throughput measurements with different traffic patterns, are collected with cycle accurate emulations. 

The implementations in Chisel and VHDL do not show a significant difference. Chisel requires around 10% fewer lines of code, however, the difference in the synthesis results is negligible. Our latency result are better than the majority of the other studies. The other results such as hardware usage, clock frequency, and throughput are competitive when compared to the related works.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2019
Keywords
network-on-chip, Chisel, mesh, scalable
National Category
Computer Systems
Identifiers
urn:nbn:se:hh:diva-39324 (URN)
Conference
32nd IEEE International System-on-Chip Conference (SOCC), Singapore, September 3-6, 2019
Funder
Vinnova
Note

©2019 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Available from: 2019-05-07 Created: 2019-05-07 Last updated: 2019-10-10
7. A Framework to Generate Domain-Specific Manycore Architectures from Dataflow Programs
Open this publication in new window or tab >>A Framework to Generate Domain-Specific Manycore Architectures from Dataflow Programs
2019 (English)In: Microprocessors and microsystems, ISSN 0141-9331, E-ISSN 1872-9436Article in journal (Refereed) Submitted
Abstract [en]

In the last 15 years we have seen, as a response to power and thermal limits for current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. But now, to further improve performance and energy efficiency, when there are potentially hundreds of computing cores on a chip, we see a need for a specialization of individual cores and the development of heterogeneous manycore computer architectures.

However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain specific manycore architectures based on RISC-V instruction set architecture and automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends by generating synthesizable Verilog code and cycle accurate emulator for the generated architecture.

We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measure their performance and hardware resource usages. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than the general purpose counterparts. In certain cases, replacing general purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates the design space exploration of manycore architectures.

Place, publisher, year, edition, pages
Amsterdam: Elsevier, 2019
Keywords
Domain-specific, multicore, manycore, accelerator, code generation, hardware/software co-design
National Category
Computer Systems Embedded Systems Signal Processing
Identifiers
urn:nbn:se:hh:diva-39323 (URN)
Funder
Vinnova
Available from: 2019-05-07 Created: 2019-05-07 Last updated: 2019-05-08

Open Access in DiVA

SavasPhDThesis(8198 kB)89 downloads
File information
File name FULLTEXT01.pdfFile size 8198 kBChecksum SHA-512
1e3a3b08fb9910690682ad58a081c86116ada6b5a452aebceb3dc05ca37f2f5c53018713bb3e11c3dc565829ff5656011cdb58cff5c731bbf2e07dbabb4b0ce7
Type fulltextMimetype application/pdf

Authority records BETA

Savas, Süleyman

Search in DiVA

By author/editor
Savas, Süleyman
By organisation
Centre for Research on Embedded Systems (CERES)
Computer SystemsEmbedded Systems

Search outside of DiVA

GoogleGoogle Scholar
Total: 89 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

isbn
urn-nbn

Altmetric score

isbn
urn-nbn
Total: 551 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf