Utilizing Heterogeneity in Manycore Architectures for Streaming Applications
Halmstad University, School of Information Technology, Halmstad Embedded and Intelligent Systems Research (EIS), Centre for Research on Embedded Systems (CERES). (Computer Architectures and Languages)
ORCID iD: 0000-0001-8652-0098
2017 (English)
Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

In the last decade, computer architectures have moved from single-core to manycore designs, driven by performance requirements and by limits on power consumption and heat dissipation. The first manycores had homogeneous architectures consisting of a few identical cores. However, the applications executed on these architectures usually consist of several tasks that require different hardware resources to execute efficiently. Therefore, we believe that utilizing heterogeneity in manycores will increase the efficiency of the architectures in terms of performance and power consumption. However, heterogeneous architectures are more challenging to develop, and their increased complexity makes efficient software development harder. To make hardware and software development more efficient, new hardware design methods and software development tools are required. Additionally, there is a lack of knowledge about the performance of applications executed on manycore architectures.

The transition began with a shift from single-core architectures to homogeneous multicore architectures consisting of a few identical cores. It now continues with a shift from homogeneous architectures with identical cores to heterogeneous architectures with different types of cores specialized for different purposes. However, this transition has increased the complexity of architectures and hence the complexity of software development and execution. In order to decrease the complexity of software development, new software tools are required. Additionally, there is a lack of knowledge about which kind of heterogeneous manycore design is most efficient for different applications and about how these applications perform when executed on current commercial manycores.

This thesis studies manycore architectures in order to reveal possible uses of heterogeneity in manycores and to facilitate the choice of architecture for software and hardware developers. It defines a taxonomy for manycore architectures that is based on the levels of heterogeneity they contain and discusses the benefits and drawbacks of these levels. Additionally, it evaluates several applications, a dataflow language (CAL), a source-to-source compilation framework (Cal2Many), and a commercial manycore architecture (Epiphany). The compilation framework takes implementations written in the dataflow language as input and generates code targeting different manycore platforms. Based on these evaluations, the thesis identifies the bottlenecks of the architecture. Finally, it presents a methodology for developing heterogeneous manycore architectures that target specific application domains.

Our studies show that using different types of cores in manycore architectures has the potential to increase the performance of streaming applications. If we add specialized hardware blocks to a core, the performance of the target application increases by 15x while the core size increases by 40-50%, an overhead that can be optimized further. Other results show that dataflow languages, together with software development tools, decrease software development effort significantly (25-50%) while having a small impact (2-17%) on performance.

Place, publisher, year, edition, pages
Halmstad: Halmstad University Press, 2017, p. 78
Series
Halmstad University Dissertations ; 29
Keywords [en]
Manycores, parallel architectures, parallelism, streaming applications, dataflow, manycore design, heterogeneous manycores
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:hh:diva-33792
ISBN: 978-91-87045-60-8 (print)
ISBN: 978-91-87045-61-5 (electronic)
OAI: oai:DiVA.org:hh-33792
DiVA, id: diva2:1093334
Presentation
2017-06-02, Wigforss, Kristian IV:s väg 3, Halmstad, 13:15 (English)
Opponent
Supervisors
Projects
HiPEC (High Performance Embedded Computing)
NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability)
Funder
VINNOVA
Swedish Foundation for Strategic Research
Available from: 2017-05-09 Created: 2017-05-05 Last updated: 2020-10-02
Bibliographically approved
List of papers
1. An Evaluation of Code Generation of Dataflow Languages on Manycore Architectures
2014 (English)
In: RTCSA 2014: 2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications, Piscataway, NJ: IEEE Press, 2014, article id 6910501
Conference paper, Published paper (Refereed)
Abstract [en]

Today, computer architectures are shifting from single core to manycores for several reasons, such as performance demands and power and heat limitations. However, shifting to manycores results in additional complexities, especially with regard to efficient development of applications. Hence, there is a need to raise the abstraction level of development techniques for manycores while exposing the inherent parallelism in the applications. One promising class of programming languages is dataflow languages, and in this paper we evaluate and optimize the code generation for one such language, CAL. We have also developed a communication library to support inter-core communication. The code generation can target multiple architectures, but the results presented in this paper focus on Adapteva's manycore architecture Epiphany. We use the two-dimensional inverse discrete cosine transform (2D-IDCT) as our benchmark and compare our code generation from CAL with a hand-written implementation developed in C. Several optimizations in the code generation as well as in the communication library are described, and we have observed that the most critical optimization is reducing the number of external memory accesses. Combining all optimizations, we have been able to reduce the difference in execution time between auto-generated and hand-written implementations from a factor of 4.3x down to a factor of only 1.3x. ©2014 IEEE.
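
For readers unfamiliar with the benchmark, the following is a minimal sketch of a 2D-IDCT computed as a row pass followed by a column pass. It is an illustration only, not the CAL or C implementation evaluated in the paper: the function names `idct_1d` and `idct_2d`, the orthonormal scaling, and the use of Python are all assumptions made here.

```python
import math


def idct_1d(coeffs):
    """Inverse DCT (orthonormal DCT-III) of a 1-D list of coefficients."""
    n = len(coeffs)
    out = []
    for x in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            # Orthonormal scaling: the DC term uses sqrt(1/n), the others sqrt(2/n).
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos((2 * x + 1) * k * math.pi / (2 * n))
        out.append(total)
    return out


def idct_2d(block):
    """Separable 2D-IDCT: transform every row, then every column."""
    rows = [idct_1d(row) for row in block]
    # zip(*rows) yields the columns; transform them and transpose back.
    cols = [idct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

Because the transform is separable, the row and column passes are natural candidates for separate pipeline stages or actors, which is one reason the 2D-IDCT is a common streaming benchmark.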

Place, publisher, year, edition, pages
Piscataway, NJ: IEEE Press, 2014
Keywords
Manycore, Dataflow Languages, code generation, Actor Machine, 2D-IDCT, Epiphany, evaluation
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-25649 (URN)
10.1109/RTCSA.2014.6910501 (DOI)
000352610400005 ()
2-s2.0-84908637354 (Scopus ID)
Conference
RTCSA 2014, 20th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, Chongqing, China, August 20-22, 2014
Projects
HiPEC project
Funder
Knowledge Foundation
Swedish Foundation for Strategic Research
Note

The authors would like to thank Adapteva Inc. for giving access to their software development suite and hardware board. This research is part of the CERES research program funded by the Knowledge Foundation and HiPEC project funded by Swedish Foundation for Strategic Research (SSF).

Available from: 2014-06-16 Created: 2014-06-16 Last updated: 2020-10-02
Bibliographically approved
2. Dataflow Implementation of QR Decomposition on a Manycore
2016 (English)
In: MES '16: Proceedings of the Third ACM International Workshop on Many-core Embedded Systems, New York, NY: ACM Press, 2016, p. 26-30
Conference paper, Published paper (Refereed)
Abstract [en]

While parallel computer architectures have become mainstream, application development on them is still challenging. There is a need for new tools, languages and programming models. Additionally, there is a lack of knowledge about the performance of parallel approaches of basic but important operations, such as the QR decomposition of a matrix, on current commercial manycore architectures.

This paper evaluates a high-level dataflow language (CAL), a source-to-source compiler (Cal2Many) and three QR decomposition algorithms (Givens Rotations, Householder and Gram-Schmidt). The algorithms are implemented both in CAL and in hand-optimized C, executed on Adapteva's Epiphany manycore architecture and evaluated with respect to performance, scalability and development effort.

In the best case, the CAL (generated C) implementations are only 2% slower than the hand-written versions. They require on average 25% fewer lines of source code without significantly increasing the binary size. Development effort is reduced and debugging is significantly simplified. The implementations executed on Epiphany cores outperform the GNU Scientific Library on the host ARM processor of the Parallella board by up to 30x. © 2016 Copyright held by the owner/author(s).
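
As a reminder of what one of the three evaluated algorithms computes, here is a minimal sketch of QR decomposition using the modified Gram-Schmidt process. It is a sequential illustration under stated assumptions, not the paper's CAL or C code: the function name `qr_gram_schmidt` is hypothetical, matrices are plain lists of rows, and the input is assumed to have full column rank.

```python
import math


def qr_gram_schmidt(a):
    """Return (q, r) with a = q * r, using modified Gram-Schmidt.

    `a` is a list of rows (m x n, m >= n) assumed to have full column rank.
    """
    m, n = len(a), len(a[0])
    # v[j] holds column j of a and is orthogonalized in place.
    v = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            # Project the partially orthogonalized column onto q_i and subtract.
            r[i][j] = sum(q_cols[i][k] * v[j][k] for k in range(m))
            v[j] = [v[j][k] - r[i][j] * q_cols[i][k] for k in range(m)]
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        q_cols.append([x / r[j][j] for x in v[j]])
    # Convert the list of orthonormal columns back into a list of rows.
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r
```

Each column depends only on the columns orthogonalized before it, so the work can be laid out as a chain of stages, which is the kind of structure a dataflow description exposes.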

Place, publisher, year, edition, pages
New York, NY: ACM Press, 2016
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-32371 (URN)
10.1145/2934495.2934499 (DOI)
000469271000004 ()
2-s2.0-84991106778 (Scopus ID)
978-1-4503-4262-9 (ISBN)
Conference
MES '16, International Workshop on Many-core Embedded Systems, Seoul, Republic of Korea, June 19, 2016
Projects
ESCHER
HiPEC
Funder
Knowledge Foundation
Swedish Foundation for Strategic Research
ELLIIT - The Linköping‐Lund Initiative on IT and Mobile Communications
Available from: 2016-11-04 Created: 2016-11-04 Last updated: 2020-10-02
Bibliographically approved
3. Efficient Single-Precision Floating-Point Division Using Harmonized Parabolic Synthesis
2017 (English)
In: 2017 IEEE Computer Society Annual Symposium on VLSI: ISVLSI 2017 / [ed] Michael Hübner, Ricardo Reis, Mircea Stan & Nikolaos Voros, Los Alamitos: IEEE, 2017
Conference paper, Published paper (Refereed)
Abstract [en]

This paper proposes a novel method for performing division on floating-point numbers represented in IEEE-754 single-precision (binary32) format. The method is based on an inverter, implemented as a combination of Parabolic Synthesis and second-degree interpolation, followed by a multiplier. It is implemented with and without pipeline stages individually and synthesized while targeting a Xilinx Ultrascale FPGA.

The implementations show better resource usage and latency results than other implementations based on different methods. In terms of throughput, the proposed method outperforms most of the other works; however, some Altera FPGAs achieve a higher clock rate due to differences in the DSP slice multiplier design.

Due to the small size, low latency and high throughput, the presented floating-point division unit is suitable for high performance embedded systems and can be integrated into accelerators or be used as a stand-alone accelerator.
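
The "inverter followed by a multiplier" structure can be sketched in software as below. This is an illustration of the structure only: Newton-Raphson iteration stands in for the inverter and is not Harmonized Parabolic Synthesis, the function names `reciprocal` and `divide` are hypothetical, Python floats are binary64 rather than binary32, and inputs are assumed to be finite, normal numbers.

```python
import math


def reciprocal(b, iterations=4):
    """Approximate 1/b. Newton-Raphson is a stand-in for the paper's inverter."""
    if b == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    # abs(b) = mantissa * 2**exponent with mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(abs(b))
    # Standard linear initial estimate of 1/mantissa on [0.5, 1).
    x = 48.0 / 17.0 - (32.0 / 17.0) * mantissa
    for _ in range(iterations):
        # Each step roughly doubles the number of correct bits.
        x = x * (2.0 - mantissa * x)
    result = math.ldexp(x, -exponent)  # undo the exponent scaling
    return -result if b < 0 else result


def divide(a, b):
    """Division as an inversion followed by a single multiplication."""
    return a * reciprocal(b)
```

Splitting division this way confines the approximation to the mantissa range, so the inverter only has to be accurate on a fixed interval while the exponent is handled by a subtraction.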

Place, publisher, year, edition, pages
Los Alamitos: IEEE, 2017
Series
IEEE Computer Society Annual Symposium on VLSI, ISSN 2159-3477
Keywords
Floating-point, single precision, division, FPGA, Harmonized Parabolic Synthesis
National Category
Computer Systems
Identifiers
urn:nbn:se:hh:diva-33793 (URN)
10.1109/ISVLSI.2017.28 (DOI)
2-s2.0-85027258772 (Scopus ID)
978-1-5090-6762-6 (ISBN)
978-1-5090-6763-3 (ISBN)
Conference
IEEE Computer Society Annual Symposium on VLSI, July 3-5, 2017, Bochum, Germany
Projects
NGES
Funder
VINNOVA
Available from: 2017-05-05 Created: 2017-05-05 Last updated: 2020-10-02
Bibliographically approved
4. Designing Domain Specific Heterogeneous Manycore Architectures Based on Building Blocks
2018 (English)
Manuscript (preprint) (Other academic)
Abstract [en]

Performance and power requirements have pushed computer architectures from single core to manycores. These requirements now continue to push manycores with identical cores (homogeneous) towards manycores with specialized cores (heterogeneous). However, designing heterogeneous manycores is a challenging task due to the complexity of the architectures. We propose an approach for designing domain-specific heterogeneous manycore architectures based on building blocks. These blocks are defined as the common computations of the applications within a domain. The objective is to generate heterogeneous architectures by integrating many of these blocks into many simple cores and connecting the cores with a network-on-chip. The proposed approach aims to ease the design of heterogeneous manycore architectures and to facilitate use of the dark silicon concept. As a case study, we develop an accelerator based on several building blocks, integrate it into a RISC core and synthesize it on a Xilinx Ultrascale FPGA. The results show that executing a hot-spot of an application on an accelerator based on building blocks increases the performance by 15x, with room for further improvement. The area usage increases as well; however, there are potential optimizations to reduce it. © 2018 by the authors

Keywords
heterogeneous architecture design, risc-v, dataflow, QR decomposition, domain-specific processor, accelerator, Autofocus, hardware software co-design
National Category
Embedded Systems
Identifiers
urn:nbn:se:hh:diva-33818 (URN)
Projects
HiPEC (High Performance Embedded Computing)
NGES (Towards Next Generation Embedded Systems: Utilizing Parallelism and Reconfigurability)
Funder
Swedish Foundation for Strategic Research
VINNOVA
Available from: 2017-05-09 Created: 2017-05-09 Last updated: 2020-10-02
Bibliographically approved

Open Access in DiVA

fulltext (2047 kB), 2074 downloads
File information
File name: FULLTEXT02.pdf
File size: 2047 kB
Checksum (SHA-512): fe0d054339b387b4e7421981a10f1d7b411818ce42d18c3f6fbe58f558015f22dc6e6e22e50db7bc641852a25744f720de1498be21517b617f96a1f775bf62af
Type: fulltext
Mimetype: application/pdf

Authority records

Savas, Süleyman
