Sustainability Data Extraction System
Halmstad University, School of Information Technology.
Halmstad University, School of Information Technology.
2024 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Sustainability reporting is essential for organizations to disclose their environmental, social, and governance (ESG) performance. However, extracting and structuring data from sustainability reports can be challenging, leading to inefficiencies and inconsistencies. This project aims to develop an integrated system for sustainability reporting that uses artificial intelligence (AI) techniques, particularly natural language processing (NLP), to extract and structure data from sustainability reports. Using the GPT-3 model by OpenAI, the system converts unstructured text from PDF reports into a structured format compliant with the European Sustainability Reporting Standards (ESRS). The system extracts key sustainability information efficiently and accurately through a pipeline consisting of PDF parsing, text sanitizing, batching, parallel API requests, and postprocessing. The system's effectiveness is evaluated with cosine similarity metrics that compare model outputs with manually extracted data. The results show high alignment between the model outputs and the manual extractions, validating the system's performance. This project contributes to advancing sustainability reporting practices by providing organizations with a robust tool for transparent and standardized disclosure of ESG impacts.
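The evaluation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the texts are turned into vectors, so the term-count representation below is an assumption made for illustration (the thesis may well use embeddings instead); the function names `tokenize` and `cosine_similarity` and the two sample strings are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the term-count vectors of two texts.

    Assumption: bag-of-words counts stand in for whatever vector
    representation the thesis actually uses.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the tokens the two texts share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare.
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical strings, not data from the thesis.
    model_output = "Scope 1 emissions fell compared with the previous year"
    manual_extract = "Scope 1 emissions decreased compared with the previous year"
    print(f"similarity: {cosine_similarity(model_output, manual_extract):.3f}")
```

A score near 1 indicates that the model output and the manual extraction use largely the same terms; a score near 0 indicates little overlap.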

Place, publisher, year, edition, pages
2024. p. 48
Keywords [en]
Data Extraction, Data processing, NLP, LLM, NER, BERT, GPT-3 model, Prompt engineering, Sustainability, Pipeline, Text sanitizing, ETL, Python
National Category
Computer Engineering; Computer and Information Sciences
Identifiers
URN: urn:nbn:se:hh:diva-54288
OAI: oai:DiVA.org:hh-54288
DiVA, id: diva2:1883540
External cooperation
ICONSOF; HighFive
Subject / course
Computer science and engineering
Educational program
Computer Science and Engineering, 300 credits
Supervisors
Examiners
Available from: 2024-07-11. Created: 2024-07-10. Last updated: 2024-07-11. Bibliographically approved.

Open Access in DiVA

fulltext (520 kB), 285 downloads
File information
File name: FULLTEXT02.pdf
File size: 520 kB
Checksum (SHA-512): f5a0fa96689f311a9caf4b9ad309c88e78056082a86fc7b381b494ff8f107d5f9700eafbce454d78e6b0ba124760f811d833ad84e71005a3c8aab12e43ccfe96
Type: fulltext
Mimetype: application/pdf
