Language Model Chat
OPTIONS
Common Parameters
The following parameters are applicable to all language model selections.
Option | Description |
---|---|
Transform Name | Name of the transform. Must be unique within a single pipeline. |
Input JSON Chat | When enabled, uses the chat completion JSON format (system, assistant, user messages). When disabled, treats the input as plain text. See "Chat Completion" section for more details. |
Output JSON Chat | When enabled, returns the response in chat completion JSON format. When disabled, returns the response as plain text. See "Chat Completion" section for more details. |
Input field name | The name of the field holding the input text/prompt. |
Output field name prefix | A prefix applied to all output field names. |
Identifier value (optional) | An optional identifier to label the output records; if left blank, the output will default to the model’s name. |
Parallelism | The number of parallel threads used for API requests. WARNING: Be mindful of endpoint capacity and associated costs. It is advisable to enable the Mock (simulate) option below while testing higher parallelism settings. |
Mock (simulate) | Enables a test mode that simulates responses without performing real API calls. |
Mock Output Value | A predefined response returned in simulation mode, used when real API calls are skipped. |
Model API | Defines the API used to communicate with the endpoint (e.g., OpenAI, Anthropic, Ollama, Hugging Face, Mistral). Note: The OpenAI API option can be used with any OpenAI-compatible endpoint/server, for example vLLM (Kwon et al., 2023). |
Model-Specific Options
Below is a table detailing the parameters specific to the individual language model APIs currently supported by the Language Model Chat transform plugin.
Option | OpenAI | Anthropic | Hugging Face | Mistral | Ollama |
---|---|---|---|---|---|
API/Endpoint URL | ✅ | ✅ | ✅ | ✅ | ✅ |
Temperature | ✅ | ✅ | ✅ | ✅ | ✅ |
Max New Tokens | ✅ | ✅ | ✅ | ✅ | ✅ |
Timeout | ✅ | ✅ | ✅ | ✅ | ✅ |
API Key/AccessToken | ✅ | ✅ | ✅ | ✅ | ❌ |
Model Name | ✅ | ✅ | ❌ | ✅ | ✅ |
Top P | ✅ | ✅ | ❌ | ✅ | ✅ |
Max Retries | ✅ | ✅ | ❌ | ✅ | ✅ |
Response Format | ✅ | ❌ | ❌ | ✅ | ✅ |
Seed | ✅ | ❌ | ❌ | ✅ | ✅ |
Log Requests | ✅ | ✅ | ❌ | ✅ | ❌ |
Log Responses | ✅ | ✅ | ❌ | ✅ | ❌ |
Top K | ❌ | ✅ | ❌ | ❌ | ✅ |
Presence/Repeat Penalty | ✅ | ❌ | ❌ | ❌ | ✅ |
Organisation | ✅ | ❌ | ❌ | ❌ | ❌ |
Frequency Penalty | ✅ | ❌ | ❌ | ❌ | ❌ |
Safe Prompt | ❌ | ❌ | ❌ | ✅ | ❌ |
User | ✅ | ❌ | ❌ | ❌ | ❌ |
Return Full Text | ❌ | ❌ | ✅ | ❌ | ❌ |
Wait for Model | ❌ | ❌ | ✅ | ❌ | ❌ |
Use Proxy | ✅ | ❌ | ❌ | ❌ | ❌ |
Context Window Size | ❌ | ❌ | ❌ | ❌ | ✅ |
Option | Description |
---|---|
API/Endpoint URL | Specifies the identifier (e.g., URL, model ID, or image ID) used to access and interact with models provided by services like OpenAI, Anthropic, Mistral, Ollama, or Hugging Face. |
Temperature | Controls the randomness of the output: high values lead to more diverse text (creative/inventive), while low values produce more focused and deterministic responses. The value typically ranges between 0 and 1 but can exceed this range depending on the model. Note: Unless intentional, it is generally best to adjust only one of these parameters (temperature or Top-P/Top-K) at a time, rather than several simultaneously. |
Max New Tokens | Sets a limit on how many tokens the model may generate in one response. |
Context Window Size | Specifies how many input tokens the model can consider at once. |
Timeout | Determines how long to wait for a response before giving up. |
API Key/AccessToken | Unique access value used to authenticate and authorise access to the endpoint service. |
Model Name | Identifies which specific model to use. |
Top P | Also known as nucleus sampling (Holtzman et al., 2019), controls the diversity of output by limiting the model’s choices to a dynamic subset of the vocabulary. For example, 0.2 means only the tokens comprising the top 20% probability mass are considered. Top-p and temperature both influence the randomness and diversity of the model’s output but operate differently: temperature scales the probabilities of all tokens, while top-p limits token selection to a subset of options whose cumulative probability meets a threshold. Note: Unless intentional, it is generally best to adjust only one of these parameters (temperature or Top-P/Top-K) at a time, rather than several simultaneously. |
Max Retries | The maximum number of retry attempts for failed API requests. |
Response Format | Defines the structure used for the model’s responses. Note: While some APIs support this option, not all models are compatible with it. Additionally, the API may require explicitly instructing the model to produce JSON to ensure proper functionality. |
Seed | Attempts to sample deterministically, ensuring consistent results for repeated requests with the same seed, prompt, and parameters. |
Top K | Controls the breadth of possible answers by limiting the number of tokens considered. Unlike Top P, which selects tokens until their cumulative probability meets or exceeds p, this parameter restricts the selection to a fixed number of tokens defined by k. Note: Unless intentional, it is generally best to adjust only one of these parameters (temperature or Top-P/Top-K) at a time, rather than several simultaneously. |
Repeat/Frequency Penalty | This setting penalises (or rewards) tokens according to how frequently they have already appeared in the generated text, making the output less (or more) repetitive. The penalty can range from -2.0 to 2.0, depending on the endpoint model. Positive values discourage repetition by penalising frequently used tokens, while negative values can increase repetition by favouring them. Reasonable values for the frequency penalty are typically between 0.1 and 1.0. |
Presence Penalty | This setting determines how the model evaluates new tokens, aiming to discourage reuse of previously generated tokens. The penalty ranges from -2.0 to 2.0, depending on the endpoint model. Positive values penalise tokens that have already appeared in the text, encouraging the model to explore new directions. Unlike the frequency penalty, which scales proportionally based on how often a token is used, the presence penalty applies a one-time additive penalty to tokens that have appeared at least once. This means all repeated tokens are penalised equally, regardless of whether they appear twice, five times, or ten times. Reasonable values for the presence penalty typically range between 0.1 and 1.0. |
Organisation | The organisation ID associated with the API key, for account and billing purposes. |
Safe Prompt | Provides a baseline prompt to guide the model towards generating safe outputs, typically by prepending a safety-oriented system prompt to your messages. |
User | A unique identifier for the end-user, used by the endpoint provider to monitor activity and detect potential abuse. |
Return Full Text | Ensures the entire generated text is returned without truncation. |
Wait for Model | Delays response until the model is ready. If the model is cold and needs loading, this avoids repeated requests by waiting for it to become available. |
Use Proxy | Routes requests through a proxy server. |
Log Requests | When enabled, logs outgoing API requests for debugging purposes. |
Log Responses | When enabled, logs API responses for debugging purposes. |
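Most of the options above map directly onto request fields in OpenAI-compatible chat completion APIs. The following is a minimal sketch of such a request, assuming a hypothetical local OpenAI-compatible endpoint (such as a vLLM server), a placeholder API key, and an example model name; the field names shown are the standard OpenAI chat-completions parameters, not part of the transform plugin itself.

```python
import requests

# Assumed values for illustration only -- substitute your own endpoint, key and model.
ENDPOINT_URL = "http://localhost:8000/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "example-model",          # Model Name
    "messages": [                      # chat completion JSON (system/user roles)
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise the following text ..."},
    ],
    "temperature": 0.2,                # Temperature
    "top_p": 1.0,                      # Top P
    "max_tokens": 256,                 # Max New Tokens
    "frequency_penalty": 0.0,          # Frequency Penalty
    "presence_penalty": 0.0,           # Presence Penalty
    "seed": 42,                        # Seed (best-effort determinism)
    "user": "pipeline-user-123",       # User identifier
}

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,                        # Timeout (seconds)
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```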
ADDITIONAL INFORMATION
Token
A token represents a unit of text, such as a word, subword, or character, depending on the tokenisation method used during model development. The size of a token varies across models, but a common estimate for English is approximately 4 tokens per 3 words. For more precise calculations, it is important to consider the specific tokenisation method employed by the model, such as Byte Pair Encoding (BPE) or WordPiece. There are online tools that provide convenient calculators to determine token counts based on specific tokenisers and input text. Additionally, frameworks like Hugging Face allow users to download tokenisers that can be used locally to calculate token counts.
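For example, the sketch below uses a tokeniser downloaded via the Hugging Face transformers library to count tokens locally; the tokeniser name is illustrative and should be replaced with the one matching the model you actually call.

```python
from transformers import AutoTokenizer

# Illustrative choice of tokeniser; use the one matching your target model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Language models process text as sequences of tokens rather than words."
token_ids = tokenizer.encode(text)

print(f"{len(token_ids)} tokens for {len(text.split())} words")
```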
Chat Completion / Prompting
Chat completion refers to the JSON structure used to design and control the flow of interactions in a conversational AI system (the model) via an API. This structure defines role-based communication (e.g., "system," "user," "assistant") and shapes the assistant’s behaviour, enabling it to capture user intent, extend text in response to prompts, and generate relevant content even for tasks the model hasn’t been specifically fine-tuned for.
Other terms related to this concept include prompting, prompt engineering, in-context learning and meta-learning. While these terms might sometimes be used interchangeably, they have distinct technical meanings. Meta-learning refers to the broader framework where a model leverages an inner-loop/outer-loop structure to adapt to new tasks. Within this framework, in-context learning represents the inner loop, where the model performs tasks based on examples provided during inference without any gradient updates. This process can be further specialised into "zero-shot," "one-shot," or "few-shot" learning, depending on the number of examples given at inference (Brown et al., 2020). Prompt engineering is the process of designing the inputs (prompts) used to guide the model’s behaviour effectively, optimising its responses to meet specific objectives.
Example:
[
  {
    "role": "system",
    "content": "Provides instructions, context, or guidelines to shape the assistant's behaviour and set the overall tone or purpose of the interaction."
  },
  {
    "role": "user",
    "content": "Represents the individual interacting with the system, providing queries, requests, or instructions to guide the assistant's responses."
  },
  {
    "role": "assistant",
    "content": "Responds to the user's input by generating outputs based on the system's instructions and the user's messages."
  }
]
Basic Prompt Engineering Techniques
- Zero-Shot: The model generates output based solely on the given prompt and its pre-existing knowledge, with no examples provided to guide its response (Brown et al., 2020).
- One/Few-Shot: A small set of examples (one or more) is provided within the prompt to help the model generalise to new, unseen inputs (Brown et al., 2020).
- Chain-of-Thought: The model is guided through a logical, step-by-step reasoning process, enabling it to break down complex problems and improve its ability to reach more accurate conclusions (Wei et al., 2022). This method is particularly effective for multistep reasoning tasks. Variants include Zero-Shot and Few-Shot Chain-of-Thought (Kojima et al., 2023).
- Self-Consistency: The model generates multiple diverse reasoning paths for the same prompt by introducing variability through techniques like adjusting the temperature or applying nucleus sampling (top-p). The final answer is determined by selecting the response that occurs most frequently among these outputs (Wang et al., 2023). This method is particularly effective for improving accuracy in reasoning tasks by leveraging diversity in outputs. For example, in a classification task, the model could generate three chain-of-thought outputs for a sentiment analysis problem (e.g., positive, negative, neutral) and determine the final label through a majority vote among the outputs (see the first sketch after this list).
- Prompt-Chaining (Sequential Chaining): A method where multiple related prompts are provided in sequence, with each prompt building on the output of the previous one. This step-by-step approach helps guide the model’s reasoning process, enabling it to tackle complex tasks incrementally. This technique requires the JSON chat feature to be enabled, as it relies on this functionality to chain the conversation (see the second sketch after this list).
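The first sketch below illustrates the self-consistency idea: several chain-of-thought completions are sampled at a non-zero temperature and the final label is chosen by majority vote. The `chat` and `extract_label` helpers are hypothetical stand-ins for your actual endpoint call and answer-parsing logic, not part of the transform plugin.

```python
from collections import Counter

def chat(messages, temperature=0.8):
    # Hypothetical stand-in for a real chat-completion call; returns a canned
    # chain-of-thought reply so the sketch runs end to end.
    return "The acting is praised even though the plot is criticised.\npositive"

def extract_label(reply):
    # Assumes the prompt asked the model to put the final label on the last line.
    return reply.strip().splitlines()[-1].lower()

messages = [
    {"role": "system", "content": "Classify the sentiment of the review. Think step by step, "
                                  "then give the final label (positive/negative/neutral) on the last line."},
    {"role": "user", "content": "The plot dragged, but the acting was superb."},
]

# Sample several diverse reasoning paths (temperature > 0), then majority-vote the labels.
labels = [extract_label(chat(messages, temperature=0.8)) for _ in range(5)]
final_label = Counter(labels).most_common(1)[0][0]
print(final_label)  # -> positive
```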
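The second sketch illustrates prompt chaining in the same style: each step appends the previous assistant reply to the message list before sending the next prompt, so every request sees the conversation so far. The `chat` helper is again a hypothetical stand-in for the actual endpoint call.

```python
def chat(messages):
    # Hypothetical stand-in for a real chat-completion call.
    return "stub reply"

messages = [{"role": "system", "content": "You are a careful analyst."}]

steps = [
    "List the key claims made in the following review: ...",
    "For each claim, state whether it is positive or negative.",
    "Give an overall sentiment label based on the previous steps.",
]

for prompt in steps:
    messages.append({"role": "user", "content": prompt})
    reply = chat(messages)  # each call sees the full conversation so far
    messages.append({"role": "assistant", "content": reply})

print(messages[-1]["content"])  # output of the final step
```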