Evaluating Bias and Toxicity in LLMs
Director:
Herrera García, Vicente OctavioDate:
2025-07Abstract:
This master´s thesis investigates bias and toxicity in Large Language Models (LLMs) as a central concern for AI Safety and AI Alignment. Guided by a series of different benchmarks and the 3H framework, it systematically shows how publicly available checkpoints behave when faced with reasoning, demographic and open-ended safety challenges. Three Jupyter notebooks integrate the harness evaluation, Hugging Face bias and customized safety prompts to deliver a reliable and standardized benchmarking framework. Beyond establishing that raw accuracy is no guarantee of ethical soundness, the thesis details how those gaps were uncovered. Each notebook covers different layers of the problem: one benchmarks factual and reasoning skills, another measures Toxicity and Bias , and a third runs multi‑turn dialogues that surface context‑dependent harms. This setup means new models or datasets can be swapped in with minimal code changes, giving future AI Safety a solid base for its tests. The study argues that current AI systems reflect the same offline social power dynamics. Addressing those issues calls for more than clever code modifications; it demands continuous processes including broader data curation or tighter model‑governance rules and humans firmly educated and in the loop. Together, these suggestions provide a clearer guide to using models effectively and responsibly. Overall, this work applies the 3H framework to a practical benchmarking process, highlighting where current models still have weaknesses and offering clear steps to develop AI that is safer and fairer. In the future, more people should be involved, and the tests used to check AI should be kept up to date so they stay useful as the technology keeps changing.
This master´s thesis investigates bias and toxicity in Large Language Models (LLMs) as a central concern for AI Safety and AI Alignment. Guided by a series of different benchmarks and the 3H framework, it systematically shows how publicly available checkpoints behave when faced with reasoning, demographic and open-ended safety challenges. Three Jupyter notebooks integrate the harness evaluation, Hugging Face bias and customized safety prompts to deliver a reliable and standardized benchmarking framework. Beyond establishing that raw accuracy is no guarantee of ethical soundness, the thesis details how those gaps were uncovered. Each notebook covers different layers of the problem: one benchmarks factual and reasoning skills, another measures Toxicity and Bias , and a third runs multi‑turn dialogues that surface context‑dependent harms. This setup means new models or datasets can be swapped in with minimal code changes, giving future AI Safety a solid base for its tests. The study argues that current AI systems reflect the same offline social power dynamics. Addressing those issues calls for more than clever code modifications; it demands continuous processes including broader data curation or tighter model‑governance rules and humans firmly educated and in the loop. Together, these suggestions provide a clearer guide to using models effectively and responsibly. Overall, this work applies the 3H framework to a practical benchmarking process, highlighting where current models still have weaknesses and offering clear steps to develop AI that is safer and fairer. In the future, more people should be involved, and the tests used to check AI should be kept up to date so they stay useful as the technology keeps changing.
Collections
Files in this item



