Sustainability Data Extraction System
2024 (English) Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Abstract [en]
Sustainability reporting is essential for organizations to disclose their environmental, social, and governance (ESG) performance. However, extracting and structuring data from sustainability reports can be challenging, leading to inefficiencies and inconsistencies. This project aims to develop an integrated system for sustainability reporting that uses artificial intelligence (AI) techniques, particularly natural language processing (NLP), to extract and structure data from sustainability reports. Using the GPT-3 model by OpenAI, the system converts unstructured text from PDF reports into a structured format compliant with the European Sustainability Reporting Standards (ESRS). The system efficiently and accurately extracts key sustainability information through a pipeline consisting of PDF parsing, text sanitizing, batching, parallel API requests, and postprocessing. Its effectiveness is evaluated using cosine similarity, comparing model outputs with manually extracted data. The results show high alignment between the model outputs and the manual extractions, validating the system’s performance. This project contributes to advancing sustainability reporting practices by providing organizations with a robust tool for transparent and standardized disclosure of ESG impacts.
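The pipeline stages and the evaluation metric named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the whitespace-only sanitizing rule, the character-based batching, and the bag-of-words vectors used for cosine similarity are all assumptions made for illustration. The model request is replaced by a hypothetical stub (extract_stub) that makes no network call, and PDF parsing is omitted.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sanitize(text: str) -> str:
    """Collapse whitespace runs left by PDF extraction (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip()


def make_batches(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole words into chunks of about max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    batches, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            batches.append(current)
            current = word
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


def extract_stub(batch: str) -> str:
    """Hypothetical placeholder for one model API request (no network call)."""
    return batch.upper()


def run_pipeline(raw_text: str, max_chars: int = 2000, workers: int = 4) -> str:
    """Sanitize, batch, process batches in parallel, then merge the outputs."""
    batches = make_batches(sanitize(raw_text), max_chars)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        results = list(pool.map(extract_stub, batches))
    return " ".join(results)  # trivial postprocessing: join batch outputs


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of term-count vectors (assumed vectorization)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In this sketch, a call such as cosine_similarity(model_output, manual_extraction) returns a score between 0 and 1, where values near 1 indicate that the two texts use largely the same terms. An embedding-based vectorization would be a drop-in replacement for the term counts.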
Place, publisher, year, edition, pages
2024, p. 48
Keywords [en]
Data Extraction, Data processing, NLP, LLM, NER, BERT, GPT-3 model, Prompt engineering, Sustainability, Pipeline, Text sanitizing, ETL, Python
National Category
Computer Engineering; Computer and Information Sciences
Identifiers
URN: urn:nbn:se:hh:diva-54288 OAI: oai:DiVA.org:hh-54288 DiVA, id: diva2:1883540
External cooperation
ICONSOF; HighFive
Subject / course
Computer science and engineering
Educational program
Computer Science and Engineering, 300 credits
Supervisors
Examiners
2024-07-11 2024-07-10 2024-07-11 Bibliographically approved