Guardrail Vulnerabilities in Open-Source Language Models: Implications for Democratic Discourse and Marginalized Communities

Loading...
Thumbnail Image

Contributor

Advisor

Editor

Performer

Department

Instructor

Depositor

Speaker

Researcher

Consultant

Interviewer

Interviewee

Narrator

Transcriber

Annotator

Journal Title

Journal ISSN

Volume Title

Publisher

Journal Name

Volume

Number/Issue

Starting Page

6792

Ending Page

Alternative Title

Abstract

The proliferation of open-source Large Language Models (LLMs) presents a complex technological phenomenon with significant societal implications. While these models democratize access to advanced Natural Language Processing (NLP) capabilities, they simultaneously amplify risks for marginalized communities who often bear the disproportionate burden of technological misuse. Our research examines systematic vulnerabilities in guardrail mechanisms across seven prominent open-source LLMs, revealing patterns of harmful content generation that threaten democratic discourse and social cohesion. Through empirical analysis using advanced NLP classification methods, we demonstrate that popular open-source models consistently generate content classified as hateful or offensive when subjected to adversarial prompting techniques. These findings directly contradict the safety assurances provided by model developers, particularly Meta AI's stated commitment that their systems should present balanced perspectives on debated policy issues rather than singular viewpoints.

Description

Citation

DOI

Extent

10 pages

Format

Type

Conference Paper

Geographic Location

Time Period

Related To

Proceedings of the 59th Hawaii International Conference on System Sciences

Related To (URI)

Table of Contents

Rights

Attribution-NonCommercial-NoDerivatives 4.0 International

Rights Holder

Catalog Record

Local Contexts

Email libraryada-l@lists.hawaii.edu if you need this content in ADA-compliant format.