Applying machine learning algorithms to multi-label text classification on GitHub issues
Halmstad University.
2020 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
Abstract [en]

This report compares five machine learning algorithms in their ability to categorize code repositories. As software projects expand, their focus tends to shift from developing new software to maintaining the existing project. Maintainers can label code repositories to organize the project, but this requires manual labor and time. This report evaluates how well machine learning algorithms perform at automatically classifying code repositories. Automatic classification can aid the management process by reducing both manual labor and human error.

GitHub provides online hosting for both private and public code repositories. In these repositories, users can open issues and assign labels to them to keep track of bugs, enhancements, or requests. GitHub was used as the source for all data, as it contains millions of open-source repositories. The focus was on the most popular labels on GitHub, both default labels and those defined by users.
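As a minimal sketch of how issue labels can be turned into multi-label targets, the snippet below keeps the k most frequent labels and encodes each issue as a 0/1 vector over them. The issue data, the function names `top_labels` and `multi_hot`, and the choice of k are illustrative assumptions and are not taken from the thesis. Scikit-learn's `MultiLabelBinarizer` offers a comparable encoding.

```python
from collections import Counter


def top_labels(issue_labels, k):
    """Return the k most frequent labels, most frequent first."""
    counts = Counter(label for labels in issue_labels for label in labels)
    return [label for label, _ in counts.most_common(k)]


def multi_hot(labels, vocabulary):
    """Encode one issue's labels as a 0/1 vector over the vocabulary."""
    present = set(labels)
    return [1 if label in present else 0 for label in vocabulary]


# Hypothetical issues; each may carry zero, one, or several labels.
issues = [
    ["bug"],
    ["bug", "help wanted"],
    ["enhancement"],
    ["bug", "enhancement"],
    ["question"],
    [],
]

# Keep only the two most popular labels as output classes.
vocabulary = top_labels(issues, k=2)
targets = [multi_hot(labels, vocabulary) for labels in issues]

print(vocabulary)
print(targets)
```

Issues whose labels all fall outside the chosen vocabulary end up as all-zero rows, which is one reason the number of output labels can affect the measured scores.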

This report investigated five algorithms in multi-label text classification: linear regression (LR), convolutional neural network (CNN), recurrent neural network (RNN), random forest (RF), and k-nearest-neighbor (KNN). The algorithms were implemented, trained, and tested with the Keras and Scikit-learn libraries. The training sets contained around 38 thousand rows and the test set around 12 thousand rows. Cross-validation was used to measure the performance of each algorithm. The metrics used to obtain the results were precision, recall, and F1-score. The algorithms were empirically tested on different numbers of output labels. To maximize the F1-score, different neural network designs and different natural language processing (NLP) methods were evaluated. This was done to see whether the algorithms could be used to organize code repositories efficiently.
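The three metrics can be illustrated with a short sketch for multi-label predictions. It assumes micro-averaging (pooling true positives, false positives, and false negatives over all labels). The abstract does not state which averaging scheme the thesis used, and the function name `micro_prf` and the example vectors are hypothetical.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for 0/1 label matrices.

    Assumption: counts are pooled over all labels and all samples.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # label correctly predicted
            elif t == 0 and p == 1:
                fp += 1  # label predicted but not present
            elif t == 1 and p == 0:
                fn += 1  # label present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical targets and predictions for four issues and three labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

precision, recall, f1 = micro_prf(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because F1 is the harmonic mean of precision and recall, a classifier that predicts few labels can reach a high precision while its F1-score stays low. Scikit-learn's `precision_recall_fscore_support` computes the same quantities with a selectable averaging mode.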

CNN displayed the best scores in all experiments, but LR, RNN, and RF also showed some good results. LR, CNN, and RNN had the highest F1-scores, while RF could achieve a particularly high precision. KNN performed much worse than all other algorithms. The highest F1-score of 46.48% was achieved when using a non-sequential CNN model that used text input with stemmed words. The highest precision of 89.17% was achieved by RF.

It was concluded that LR, CNN, RNN, and RF were all viable for classifying labels in software-related texts such as those found in GitHub issues. KNN was not found to be a viable candidate for this purpose.

Place, publisher, year, edition, pages
2020. p. 44
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:hh:diva-43097
OAI: oai:DiVA.org:hh-43097
DiVA, id: diva2:1467163
Subject / course
Computer science and engineering
Educational program
Computer Science and Engineering, 300 credits
Presentation
2020-09-13, Halmstad, 20:17 (English)
Supervisors
Examiners
Available from: 2020-09-16. Created: 2020-09-14. Last updated: 2020-09-16. Bibliographically approved.

Open Access in DiVA

Full text (1276 kB), 1958 downloads
File information
File name: FULLTEXT02.pdf
File size: 1276 kB
Checksum SHA-512:
032259324bd9956d3c1d7d327072c7ccf7eb1131c29fcc38cb6f55bbfef0a589b759df4ea3ba43c7b536d09ea1864f60263c463cbaf84e7d626a2bdb99ec6bf1
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Artmann, Daniel
By organisation
Halmstad University
Computer Sciences
