Hi! Custom Criteria is live in Open BETA, via the Add criteria selector. Before moving to GA, we're still looking to improve the prompt evaluators so that they give a signal when users try to evaluate criteria or actions that need data the evaluators cannot access beyond message content (such as conversation/message metadata, tags, status changes etc.).
Our BETA has taught us a lot about what teams are looking to measure, and we're sharing a few clear themes that emerged. Leave a comment to share what you’re looking to measure using Custom Criteria!
1. Quality & accuracy criterion
The most common type of custom criterion focuses on whether agents gave the *right* answer, not just a polite or well-structured one. Criteria like Solution Accuracy, Friction Reduction, and Process Adherence all reflect the same underlying concern: did the agent follow the correct process and efficiently provide the right solution? Response accuracy failures can be evaluated when a correction or escalation is evident in the message content. However, true evaluation against a knowledge base, or checking actions performed in external systems, is not currently possible.
2. Teams are using Smart QA for compliance, not just coaching
Teams, particularly those in regulated industries, have built criteria that function as compliance checklists: verifying PII/data handling protocols, escalation protocols, attachments, and security procedures. Some teams are also interested in scoring whether agents followed the right operational steps, like timely responses, email routing accuracy, case merging, and status updates. System action compliance is a use case we didn't fully anticipate. Smart QA cannot evaluate these actions today, but such a criterion can be added and then manually assessed as part of additional QA sampling activity in the scorecard.
---
Tips for writing better criteria
Based on what's working well across the beta, here are our top recommendations:
Be explicit about what the criterion does / does NOT cover
Some criteria only make sense in certain scenarios, like an escalation criterion that shouldn't apply to simple transactional requests. Saying so prevents the AI from flagging conversations as failures when they were fine. Use the Applicability field to narrow the scope, for example: "does not apply to simple transactional conversations."
Define the criteria as the positive behaviour being assessed
A criterion should describe the positive behaviour being assessed, so that a 'YES' or a '5' on a scale means the behaviour was done well and the analytics aggregate scores in the correct direction for that criterion. For example, instead of "Red flag behaviours: referencing discounts on XYZ product lines", phrase it as "No references to discounts on.." which would be marked positive when this behaviour is absent.
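To illustrate why the direction matters, here is a minimal Python sketch using made-up YES/NO results. The criterion names and data are hypothetical, and this is not how Smart QA computes its analytics; it only shows that a negatively phrased criterion drags down a combined YES rate even when the agent did nothing wrong.

```python
# Hypothetical YES/NO results for three conversations (True = YES).
# Names and values are illustrative only, not Smart QA output.
positive_phrasing = {
    "Solution accuracy": [True, True, False],
    # Positively phrased: YES means no discounts were mentioned.
    "No references to discounts": [True, True, True],
}
mixed_phrasing = {
    "Solution accuracy": [True, True, False],
    # Negatively phrased: YES would mean the agent DID mention discounts.
    "Red flag: references discounts": [False, False, False],
}


def yes_rate(results):
    """Share of YES answers across all criteria and conversations."""
    answers = [a for scores in results.values() for a in scores]
    return sum(answers) / len(answers)


print(f"{yes_rate(positive_phrasing):.0%}")  # 83%
print(f"{yes_rate(mixed_phrasing):.0%}")     # 33%
```

Both scorecards describe the same three conversations, yet the mixed one reports a much lower YES rate because "NO" on the red-flag criterion is actually the good outcome.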
Anchor criteria with observable behaviour
Generic descriptors like "good" or "poor" are harder for the AI to apply consistently than specific, observable examples of good and bad behaviour that show what to look for. Instead of "the agent communicates well", try adding examples as references: "the agent presents the key answer referencing links to external knowledge articles, and uses the customer's name at least once." The more your rubric reads the way an evaluator would think, the better.
Custom Criteria is live - sharing learnings from the BETA 🎯
