Job Summary
The role involves developing and refining complex prompts to evaluate and challenge the behavior of Large Language Models (LLMs) across dimensions such as bias, accuracy, coherence, and ethics. Responsibilities include designing multi-turn interactions, testing safety mechanisms, and probing for vulnerabilities by manipulating language, context, and phrasing to assess the effectiveness of built-in protections. The position requires crafting scenarios that evaluate the model’s compliance with ethical and legal standards, including testing guardrails across multiple languages. Detailed documentation and reporting of findings, along with collaboration with AI safety teams, are essential to ensure responsible AI development and enhancement.
Job Description
- Develop and draft diverse, complex prompts aimed at testing various aspects of LLM behavior, including but not limited to bias, accuracy, coherence, and ethical considerations (refine and iterate with advanced prompts based on the model's initial responses to identify how slight changes affect the output, with the goal of bypassing the model's filters)
- Identify and analyze potential vulnerabilities and flaws in LLM responses by executing a variety of prompt scenarios (document findings and provide detailed reports on discovered issues)
- Manipulate language, phrasing, and context to bypass built-in protections; this could include rephrasing sensitive requests in ways that avoid triggering content filters (e.g. asking for sensitive information in a polite or indirect manner) and prompt injection attacks, where a carefully crafted prompt is designed to override the model's instructions or system prompt (e.g. attempting to break internal instructions set by developers)
- Design multi-turn interactions in which earlier responses are used to gradually lead the model toward unsafe outputs, testing for context-aware vulnerabilities and ways of confusing the model
- Craft prompts and create testing templates that probe the effectiveness of safety mechanisms such as content moderation, censorship, and response limitations
- Craft prompts in multiple languages to test whether guardrails are equally effective across different linguistic contexts
- Design prompts that test the LLM's compliance with ethical and legal standards (e.g. testing for responses related to illegal activities, child protection, or privacy breaches)
- Evaluate the model, report vulnerabilities and testing outcomes using standardized benchmarks such as responsible AI benchmarks, and collaborate with the protection/guardrails/ML team
Key Skills