We may earn compensation from some providers below. Learn More.

How We Test & Score AI Characters

Updated on: December 27, 2024

Update Schedule Overview

Every 3 months: We check our information for accuracy, update pricing, and make small adjustments as needed.

Every 6 months: We conduct a comprehensive review of all content to ensure it remains relevant, making major updates where necessary.

This schedule ensures you receive the most current and accurate information. For more details, visit our update process.

Content Review Process

This piece of content is reviewed by one of our experts, ensuring the information you receive is accurate and trustworthy.

For more details, learn about our fact-checking process.

The Best Characters AI team has spent many hours interviewing AI roleplay users to uncover the most common issues they face with AI characters.

Our scoring guidelines are designed based on this feedback, helping you easily identify which AI characters excel and which ones might fall short. Use our scores to quickly find the AI character that’s the best fit for you.

Below, we go factor by factor so you can see exactly how we test and score AI characters.

Scoring Factors

Our overall score is based on a weighted average of 5 scoring factors.

  • 15% – Backstory
  • 30% – Conversation
  • 25% – Engagement
  • 15% – Special Mechanics
  • 15% – Platform

Understanding Our Scores

We weight each factor by how important it is, then add the weighted ratings to get the overall score. The more points an AI character scores on any given criterion, the higher its overall score will be.

Here’s a breakdown of what each overall score stands for:

  • 4.6 – 5.0: This is a perfect score, showing the AI character is top-notch with exceptional features.
  • 4.0 – 4.5: Good score, the AI character meets basic requirements but isn’t flawless.
  • 3.0 – 3.9: Bad score, the AI character significantly underperforms compared to similar characters in the niche.

We rate individual aspects using a point system from 3.0 to 5.0, in 0.5 steps. The total score may include decimals.
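The breakdown above can be read as a simple lookup. The sketch below is illustrative only: `score_band` is a hypothetical helper, and because an overall score can land between the listed ranges (for example 4.55), it assumes such a score falls into the lower band.

```python
def score_band(overall: float) -> str:
    """Map an overall score (3.0 to 5.0) to the band it falls in."""
    if not 3.0 <= overall <= 5.0:
        raise ValueError("overall score must be between 3.0 and 5.0")
    if overall >= 4.6:
        return "perfect"
    # Assumption: scores between 4.5 and 4.6 count as "good".
    if overall >= 4.0:
        return "good"
    # Assumption: scores between 3.9 and 4.0 count as "bad".
    return "bad"
```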

Scoring Methodology

Over time, we may find ways to improve our testing process or even add or remove criteria. These changes aim to make sure our ratings give you the best advice on AI characters.

Right now, our evaluation system is at version 1.0, designed from insights gathered through interviews with AI character creators and roleplay enthusiasts. In these interviews, we asked both users and AI character creators to identify the features they value most in an AI character and the biggest issues they encounter.

If we make any updates to how we rate AI characters, we’ll also go back and update our existing AI character reviews to match. That way, you’ll always know which version of our methods was used to rate an AI character, as we’ll note it at the start of every review.

Backstory – 15%

Our backstory score consists of two sub-ratings.

Character Depth – 7.50%

For character depth, we look at how much information is provided about the character’s history, motivations, and goals.

We also check if the description clearly outlines traits like shyness, boldness, or humor that define the character.

Casey (Runaway girl) character description

Profile Match – 7.50%

Many low-quality AI characters have profiles that don’t match their roleplay or description. We check:

  • Does the profile picture match the roleplay?
  • Does the profile picture match the description?

Here’s the scale we use to determine the backstory scores:

  • 4.6 – 5.0: Highly detailed and engaging backstory with clear personality and history.
  • 4.0 – 4.5: Good backstory with enough detail but missing some depth.
  • 3.0 – 3.9: Basic backstory with minimal details or lacks clarity.

Conversation – 30%

We evaluate conversation quality by looking at how well the AI character responds to prompts, which affects how realistic and engaging the roleplay feels. Our assessment includes several key sub-factors.

Opening Message – 10%

The first message can make or break an AI character. This is where the AI character needs to hook the user in with a detailed yet easy-to-understand introduction that aligns with its backstory and sets the tone for the roleplay.

  • Engagement: Does it grab the user’s attention and make them want to continue?
  • Clarity: Is it easy to understand and free from unnecessary complexity?
  • Roleplay Setup: Does it establish a clear setting or context for the roleplay?
  • Originality: Does it feel unique or does it come off as generic?
  • Message Formatting: Is the text well-structured with proper breaks, making it easy to read and follow?
Judith Thoreau’s opening message GirlfriendGPT

Gender Neutrality – 10%

Gender neutrality is important for AI characters. A good AI should respond in a way that works for both male and female users. For example, women don’t want to be called “dudes,” and men don’t want to be referred to as women. Without gender-neutral responses, the audience becomes much smaller.

An AI character can use gendered pronouns, but only if the user explicitly requests them or provides a profile card with pronouns.

Chat Stamina – 10%

A common problem with AI characters is that they often break down after a few messages. While the platform hosting the AI plays a role, the bot itself is also a key factor.

Madville Creations interview: "a better word for longevity - can you engage with a bot for 100 messages without it breaking down."
Madville Creations, character creator of GirlfriendGPT

To test this, we send 100 messages to each AI character to evaluate how well it maintains quality over time.

  • 40+ messages: Excellent stamina
  • 30 – 40 messages: Good stamina
  • 15 – 30 messages: Decent stamina
  • 0 – 15 messages: Poor stamina
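The stamina bands above amount to a threshold check on how many messages a character handles before quality breaks down. Here is a minimal sketch: `stamina_level` is a hypothetical name, and since the listed ranges share their boundary values (15, 30, and 40), the code assumes a boundary value counts toward the higher band.

```python
def stamina_level(messages_before_breakdown: int) -> str:
    """Classify chat stamina from the number of messages sent before breakdown."""
    if messages_before_breakdown < 0:
        raise ValueError("message count cannot be negative")
    # Assumption: boundary values (15, 30, 40) belong to the higher band.
    if messages_before_breakdown >= 40:
        return "Excellent"
    if messages_before_breakdown >= 30:
        return "Good"
    if messages_before_breakdown >= 15:
        return "Decent"
    return "Poor"
```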

Here’s the scale we use to determine conversation scores:

  • 4.6 – 5.0: Excellent conversation quality. The opening message is engaging and well-structured, setting the tone perfectly. Responses maintain clarity, gender neutrality, and consistency for 40+ messages without breaking down.
  • 4.0 – 4.5: Good conversation quality. The opening message grabs attention and is mostly clear and aligned with the roleplay. Gender neutrality is maintained, and responses stay engaging for 30–40 messages with only minor issues.
  • 3.0 – 3.9: Poor conversation quality. The opening message is average, with some lack of clarity or originality. Responses are consistent for 15–30 messages but may show occasional repetition or lose context.

Engagement – 25%

Engagement is a simple way we measure how good an AI character is. We calculate it by dividing the total number of messages by the total number of chats. This shows how much users interact with the character on average.

Engagement metrics for Judith Thoreau

The table below explains how we determine engagement scores and what each score represents.

Engagement Level | Engagement Score | Description
Amazing | 40+ | Exceptional engagement, very interactive.
Above Average | 30-40 | Strong engagement, users are highly engaged.
Bare Minimum | 15-30 | Acceptable, but indicates some issues.
Below Minimum | < 15 | Poor engagement, likely significant problems.
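To make the calculation concrete, here is a small sketch of the messages-per-chat ratio and the level lookup from the table above. The function names and the sample numbers are hypothetical, and boundary values (15, 30, 40) are assumed to fall into the higher level because the table's ranges share them.

```python
def engagement_score(total_messages: int, total_chats: int) -> float:
    """Average number of messages per chat."""
    if total_chats <= 0:
        raise ValueError("total_chats must be positive")
    return total_messages / total_chats


def engagement_level(score: float) -> str:
    """Look up the engagement level for a messages-per-chat score."""
    # Assumption: boundary values belong to the higher level.
    if score >= 40:
        return "Amazing"
    if score >= 30:
        return "Above Average"
    if score >= 15:
        return "Bare Minimum"
    return "Below Minimum"


# Made-up numbers for illustration: 12,000 messages across 300 chats.
score = engagement_score(12_000, 300)
print(score, engagement_level(score))
```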

For the final engagement score, we use a rating scale from 3.0 to 5.0 in 0.5 increments.

Here’s the scale we use to determine engagement scores:

  • 4.6 – 5.0: Excellent engagement score (40+). Users stay deeply engaged with the character.
  • 4.0 – 4.5: Good engagement score (30–40). Shows strong user interest.
  • 3.0 – 3.9: Decent engagement score (15–30). Needs improvement to keep users engaged.

Special Mechanics – 10%

Some AI characters include special mechanics, like displaying stats during roleplay, which makes the experience more immersive.

We reward AI characters for special mechanics if they add value to the user, enhance the roleplay, and fit the character.

Example of special mechanics (Gwen Stacy on GirlfriendGPT)

Here’s the scale we use to determine special mechanics scores:

  • Yes: The AI has unique, well-designed mechanics that clearly enhance roleplay.
  • No: The AI has no special mechanics.

Platform – 10%

Finally, we evaluate the AI roleplay platform where the AI character is hosted. Even the best and most well-designed AI character depends on the platform’s capabilities, such as:

  • Memory – 2%
  • Speed – 2%
  • Customization – 2%
  • Privacy – 2%
  • Extra Features – 2%

We use a simple pass or fail system for each sub-rating. For example, if a platform can remember 20 messages, it passes and earns the full 2% for that category.

Memory – 2%

Memory is super important because it makes the AI feel more personal. To test this, we tell the AI things like our name, hobbies, or preferences and see if it remembers them in future conversations. For example, does it bring up your favorite food or remember a roleplay detail from before? Platforms with strong memory make conversations feel more real and connected, while poor memory makes the AI feel forgetful and less engaging.

Passing grade: 20 messages

Speed – 2%

Speed significantly impacts the user experience, as delays can break immersion. We measure response times by sending multiple prompts and recording how quickly the AI replies. Ideally, responses should come within 1–3 seconds for smooth conversations. We also test during different times of day to account for potential server load issues and ensure consistency across various devices and internet connections.

Passing grade: 3 seconds max

Customization – 2%

Customization is about making the AI fit what you want. We check if you can adjust their personality, backstory, and even looks. For example, can you make the AI more romantic or shy, or change its appearance to match your preferences? We also see how easy it is to use these settings and whether the changes work smoothly in chats. The more flexible the platform is, the better your experience will be.

Passing grade: Customizable responses

Privacy – 2%

Privacy matters, especially when chatting about personal stuff. We look at how the platform protects your data: do they encrypt chats? Do they store your messages or share them with others? Good platforms let you delete your data and chat anonymously. We dig into the privacy policy to make sure they aren’t doing anything sketchy with your info.

Passing grade: No chat monitoring

Extra Features – 2%

Some platforms go beyond just chatting and offer things like images, voice messages, or even videos. We test how well these features work. For example, does the image match the conversation? Does the voice sound natural? These extras can make the AI experience way more immersive, so we score platforms higher if they have fun and useful add-ons that actually work well.

Passing grade: 2 bonus features minimum
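The five pass or fail checks can be sketched as follows. This is an illustration under stated assumptions rather than our actual tooling: `PlatformResults`, its field names, and `platform_points` are hypothetical, and the sketch only totals the 2% awarded per passed check. How that total translates to the 3.0 to 5.0 platform scale is not covered here.

```python
from dataclasses import dataclass


@dataclass
class PlatformResults:
    """Hypothetical container for the raw test observations."""
    remembered_messages: int         # how far back the AI recalls details
    slowest_response_seconds: float  # worst response time observed
    customizable_responses: bool     # can responses be customized?
    monitors_chats: bool             # does the platform monitor chats?
    bonus_features: int              # count of working extras (images, voice, ...)


def platform_checks(r: PlatformResults) -> dict[str, bool]:
    """Apply the passing grades listed above to each sub-rating."""
    return {
        "memory": r.remembered_messages >= 20,
        "speed": r.slowest_response_seconds <= 3.0,
        "customization": r.customizable_responses,
        "privacy": not r.monitors_chats,
        "extra_features": r.bonus_features >= 2,
    }


POINTS_PER_CHECK = 2  # each sub-rating is worth 2%


def platform_points(r: PlatformResults) -> int:
    """Total percentage points earned: 2% per passed check, 10% at most."""
    return POINTS_PER_CHECK * sum(platform_checks(r).values())
```

For example, a platform that passes every check except speed would earn 8 of the 10 available percentage points.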

To provide a complete picture, we review each platform used in our roundups with a dedicated review and YouTube video on @ai-girlfriend-expert.

Here’s the scale we use to determine platform scores:

  • 4.6 – 5.0: This is a perfect score, showing the AI platform is top-notch with exceptional features.
  • 4.0 – 4.5: Good score, the platform meets basic requirements but isn’t flawless.
  • 3.0 – 3.9: Bad score, the app significantly underperforms compared to similar competitors.

Mistakes and How to Fix Them

Our testing and scoring process is detailed and takes a lot of work. Because of this, we might sometimes make mistakes, like getting a test result wrong, miscalculating a score, or forgetting to update something.

We try really hard to avoid these errors by checking our work carefully. But mistakes can still happen.

If you notice any mistake or something that doesn’t seem right, please tell us here. We’re always ready to fix errors and clear up any confusion.

Got Ideas?

If you have any ideas on how we can improve our scoring, we’d love to hear from you. Please share your thoughts with us on our contact form.

Herman Carter