Guardrails for LLM Security: Best Practices and Implementation

The ability of LLMs (Large Language Models) to learn from the data they were trained on and generate human-like text has proven to be extremely valuable. However, LLMs also possess several characteristics that can lead to safety concerns, making guardrails an essential safeguard.

Firstly, the vast knowledge and capabilities of LLMs can be leveraged to generate highly convincing and human-like text, which could be exploited for malicious purposes, such as the creation of disinformation or the impersonation of real individuals. Another significant concern is the potential for "jailbreaks," where users or malicious actors find ways to circumvent the guardrails put in place, enabling the generation of prohibited or harmful content.

Additionally, LLMs can experience "hallucinations," or the generation of plausible but factually incorrect information, which can further complicate the challenge of ensuring information accuracy and integrity. The inherent biases and inconsistencies present in the training data of LLMs can also be reflected in the model's outputs, potentially leading to the propagation of harmful stereotypes or the generation of content that is biased or discriminatory.

Moreover, the lack of a deep understanding of the reasoning processes within these models can make it challenging to reliably predict their behavior, especially in novel or ambiguous situations, heightening the risk of unexpected or unintended outcomes. Implementing robust guardrails can help mitigate these risks and ensure the safe and responsible deployment of LLMs in various applications.

Securing LLM applications with guardrails is not only essential for maintaining public trust and regulatory compliance, but it also protects businesses from reputational damage and potential financial losses associated with security breaches.

What are guardrails?

Guardrails are the set of safety controls that monitor and dictate a user’s interaction with an LLM application. They are programmable, rule-based systems that sit between users and foundation models to ensure the AI model operates within the principles defined by an organization.

You might ask: what about prompt engineering?

Prompt instructions only ask the model to behave a certain way, and the model can still ignore them. Guardrails, by contrast, enforce that the output of an LLM stays within a specific format or context and validate each response. By implementing guardrails, users can define the structure, type, and quality of LLM responses.

Categories of LLM Guardrails

The applications of LLM guardrails largely fall into three categories:

·Staying on topic (Topical): The key concerns largely revolve around providing accurate, informed, and on-topic responses. For example, a sales agent shouldn’t provide responses for questions about engineering topics.

·Safe responses (Safety): The responses given by the bot should be ethical and safe. For instance, potential racist or violent responses should be blocked in most agents.

·Ensuring security (Security): Building structures to enhance the bot's resilience against malicious actors trying to jailbreak it, hijack its functionality, or otherwise attack the bot.

 

How to implement guardrails on LLM applications

There are many frameworks that allow us to implement guardrails in LLM applications. NeMo Guardrails, developed by NVIDIA, is one of the most prominent. It is an open-source toolkit designed to introduce programmatic guardrails within LLM systems, primarily focusing on conversational AI. It aims to prevent LLM-powered applications from navigating towards undesired topics or responses, thereby ensuring that interactions adhere to predefined policies or guidelines.

Before delving deeper, it's worth noting that NeMo Guardrails is currently in beta. It is under active development, feedback and contributions are welcomed, and occasional instabilities are to be expected. As such, tread with caution when experimenting with it, especially in production settings, bearing in mind its evolving nature.

At the heart of NeMo Guardrails is Colang, a modeling language crafted for creating flexible and controllable conversational workflows. Its syntax, designed to be intuitive for those familiar with Python, plays a crucial role in defining how conversations progress, including the management of user and bot messages, and the flow of dialogue.

The Colang Syntax

The core syntax elements of Colang, which is integral to configuring NVIDIA's NeMo Guardrails, are designed to offer a structured yet flexible way to create conversational AI guardrails. Colang, short for "Conversational Language," employs a syntax reminiscent of Python, making it both intuitive and powerful for defining the behavior of conversational systems.

The core syntax elements are: blocks, statements, expressions, keywords and variables. There are three main types of blocks: user message blocks, flow blocks and bot message blocks.

User Messages

User message definition blocks define the canonical form that should be associated with various user utterances, e.g.:

define user express greeting
  "hello"
  "hi"

define user request help
  "I need help with something."
  "I need your help."

Bot Messages

Bot message definition blocks define the utterances that should be associated with various bot message canonical forms:

define bot express greeting
  "Hello there!"
  "Hi!"

define bot ask welfare
  "How are you feeling today?"

If more than one utterance is specified for a bot message, one of them is chosen at random.

Bot Messages with Variables

The utterance definition can also include references to context variables, which are prefixed with $:

define bot express greeting
  "Hello there, $name!"

Alternatively, you can also use the Jinja syntax:

define bot express greeting
  "Hello there, {{ name }}!"

Flows

Flows represent how you want the conversation to unfold. A flow includes sequences of user messages, bot messages, and potentially other events.

define flow hello
  user express greeting
  bot express greeting
  bot ask welfare

Additionally, flows can contain additional logic which can be modeled using if and when.

For example, to alter the greeting message based on whether the user is talking to the bot for the first time or not, we can use if:

define flow hello
  user express greeting
  if $first_time_user
    bot express greeting
    bot ask welfare
  else
    bot express welcome back

The $first_time_user context variable would have to be set by the host application.
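For illustration, here is one way the host application could set that variable, using the LLMRails API introduced later in this article. The context-message mechanism and the variable value shown here are assumptions, so treat this as a sketch rather than a definitive recipe:

# Sketch: supply $first_time_user from the host application by sending a
# "context" message before the user's turn (assumed mechanism; details may
# differ between NeMo Guardrails versions).
response = rails.generate(messages=[
    {"role": "context", "content": {"first_time_user": True}},
    {"role": "user", "content": "hello"},
])
print(response["content"])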

Getting Started with NeMo Guardrails

Now, we will use NeMo Guardrails to implement topical safety features. We'll go through installing the necessary libraries, setting up the environment, defining guardrails using Colang and YAML configurations, and finally, initiating conversations with an LLM while enforcing these guardrails.

Step 1: Installing Libraries

First, we need to install the nemoguardrails and python-dotenv packages. Open your terminal or command prompt and run the following command:

pip install nemoguardrails python-dotenv

These libraries help us interact with NeMo Guardrails and manage environment variables securely.

Step 2: Setting Up Your Environment

Create a .env file in your project directory. Add your OpenAI API key to this file as follows:

OPENAI_API_KEY='your_api_key_here'

Replace 'your_api_key_here' with your actual API key. This step ensures your API key is kept secure and not hardcoded into your scripts.

Step 3: Importing Libraries

In your Python script or Jupyter notebook, import the necessary classes and functions:

This code imports LLMRails and RailsConfig from nemoguardrails for guardrails configuration and interaction. It also loads the .env file to use your OpenAI API key.
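A minimal sketch of that import block might look as follows (assuming the .env file created in Step 2):

# Load environment variables (including OPENAI_API_KEY) from the .env file,
# then import the guardrails configuration and runtime classes.
from dotenv import load_dotenv
from nemoguardrails import LLMRails, RailsConfig

load_dotenv()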


Step 4: Defining Guardrails

Define your guardrails using CoLang for content rules and YAML for model configurations:

This configuration sets up guardrails to redirect conversations away from political topics.
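A minimal sketch of such a configuration, with the model settings in YAML and the topical rules in Colang embedded as Python strings. The model name and the example utterances below are illustrative assumptions; adapt them to your setup:

# YAML: which LLM powers the app (the engine/model chosen here is an assumption).
yaml_content = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
"""

# Colang: a topical rail that redirects political questions to a fixed answer.
colang_content = """
define user ask politics
  "Who should I vote for?"
  "What do you think about the government?"

define bot answer politics
  "I'd rather not discuss politics. Is there anything else I can help you with?"

define flow politics
  user ask politics
  bot answer politics
"""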


Step 5: Initializing Guardrails

Initialize the guardrails configuration and create a rails instance:

This step compiles your guardrails and prepares the model for interaction.
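Continuing the sketch, the configuration can be built from the strings defined in Step 4 and wrapped in a rails instance:

# Compile the guardrails configuration from the in-memory Colang and YAML content.
config = RailsConfig.from_content(
    colang_content=colang_content,
    yaml_content=yaml_content,
)

# Create the rails instance that wraps the underlying LLM.
rails = LLMRails(config)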


Step 6: Chatting with Guardrails

Now, let's test our setup by sending prompts to the LLM:

The first prompt is political, so it will trigger the guardrails and return a predefined response. The second prompt is neutral and will receive a direct response from the model.
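A sketch of that interaction (the prompts are illustrative):

# A political question: the topical rail intercepts it and returns the predefined answer.
blocked = rails.generate(messages=[
    {"role": "user", "content": "Who should I vote for in the next election?"}
])
print(blocked["content"])

# A neutral question: handled directly by the underlying model.
answered = rails.generate(messages=[
    {"role": "user", "content": "Can you explain what guardrails are in one sentence?"}
])
print(answered["content"])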

Conclusion

The use of tools like NVIDIA's NeMo Guardrails, which leverage the flexible Colang syntax, enables developers to define clear topical, safety, and security policies for their LLM-powered applications. By implementing these guardrails, organizations can feel more confident in deploying LLMs, knowing that the interactions will adhere to predefined guidelines and best practices.

As the adoption of LLMs expands across various industries, the imperative for implementing effective guardrails continues to grow. Continuous research and development efforts in this realm are vital for ensuring the responsible and trustworthy use of these powerful AI technologies. Kmeleon, with its expertise in Enterprise AI Solutions and AI Strategy Consulting, is at the forefront of addressing these challenges. By proactively tackling potential risks and hurdles, we enable organizations to fully harness the potential of LLMs while prioritizing user safety and security.