Frequently Asked Questions
Everything you need to know about PeerLM.
What is blind comparative ranking?
Instead of scoring each model's response independently (which can produce unreliable absolute scores), PeerLM anonymizes all responses, shuffles their order, and asks evaluator models to rank them from best to worst. The evaluator never knows which model produced which response, removing name bias, and because the presentation order is shuffled on every run, no model systematically benefits from positional bias. Rankings are then mapped back to source models and converted to normalized scores.
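A rough sketch of that flow in Python (the names blind_rank and evaluator.rank are illustrative, not PeerLM's actual API):

```
import random

def blind_rank(responses: dict[str, str], evaluator) -> dict[str, float]:
    """responses maps model_id -> response text; evaluator ranks anonymized texts."""
    items = list(responses.items())
    random.shuffle(items)                            # shuffle presentation order
    anonymous_texts = [text for _, text in items]    # identities stripped

    # Evaluator sees only the texts and returns indices, best first.
    order = evaluator.rank(anonymous_texts)

    # Map rankings back to source models and normalize: best -> 1.0, worst -> 0.0.
    n = len(items)
    return {
        items[idx][0]: (n - 1 - rank) / (n - 1) if n > 1 else 1.0
        for rank, idx in enumerate(order)
    }
```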
What models can I evaluate?
PeerLM supports 200+ models across major providers including OpenAI, Anthropic, Google, Meta, Mistral, and more — all through OpenRouter and Groq integrations. You can evaluate any combination of models against each other. We sync model capabilities automatically so you always have access to the latest releases.
How do eval credits work?
1 eval credit = 1 generated model response. The formula is simple: models × personas × topics = credits consumed. Evaluator model calls (the ranking step) are included free — you only pay for generation. Cache hits consume 0 credits, so re-running an evaluation with the same prompts costs nothing.
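As a back-of-the-envelope sketch (not actual PeerLM billing code):

```
def credits_needed(models: int, personas: int, topics: int, cache_hits: int = 0) -> int:
    """Each generated response costs 1 credit; evaluator calls and cache hits are free."""
    return models * personas * topics - cache_hits

print(credits_needed(4, 3, 10))   # 4 models x 3 personas x 10 topics -> 120 credits
```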
What are personas?
Personas are system prompt wrappers that define the context for evaluation. For example, you might test models as a 'customer support agent,' a 'code reviewer,' or a 'creative writer.' Each persona has different expectations and evaluation criteria. PeerLM breaks down results by persona so you can see which model excels for which use case.
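Conceptually, a persona wraps each topic in a system prompt before generation. A minimal illustration (the persona texts and field names are assumptions, not PeerLM's schema):

```
personas = {
    "customer_support_agent": "You are a patient customer support agent. Be concise and actionable.",
    "code_reviewer": "You are a senior engineer reviewing a pull request. Flag bugs and style issues.",
    "creative_writer": "You are a creative writer. Favor vivid, original phrasing.",
}

def build_messages(persona: str, topic: str) -> list[dict]:
    """Wrap the topic in the persona's system prompt (chat-completions style)."""
    return [
        {"role": "system", "content": personas[persona]},
        {"role": "user", "content": topic},
    ]
```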
What's the difference between JSON output and text output?
In JSON output mode, models return structured arrays; for example, a prompt like '3 one-liner jokes about airline food' yields an array of three separate jokes. Each item is scored independently, giving granular performance data. In text mode, models return free-form prose that is evaluated as a whole. JSON mode is recommended for comparative evaluations because it produces more reliable rankings.
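A sketch of what that structure looks like (contents invented purely for illustration):

```
import json

# Hypothetical JSON-mode output for "3 one-liner jokes about airline food".
raw = '["joke one...", "joke two...", "joke three..."]'

items = json.loads(raw)           # one array entry per requested item
for i, text in enumerate(items):  # each entry is ranked and scored independently
    print(f"item {i}: {text}")
```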
Can I share results with my team?
Yes. Pro and Enterprise plans include shareable reports — a public link that anyone can view without an account. Reports are branded with PeerLM and include overall rankings, per-persona breakdowns, and response samples. You can also export results as CSV or JSON for further analysis.
How does response caching work?
PeerLM generates a SHA-256 hash from the combination of model ID, model version, persona, and topic content. If a matching cached response exists, it's reused at zero credit cost. Editing any part of the prompt automatically invalidates the cache. This means re-running evaluations is free, and you only pay for genuinely new generations.
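A sketch of how such a cache key could be derived (the function and example values are assumptions, not PeerLM's internals):

```
import hashlib

def cache_key(model_id: str, model_version: str, persona: str, topic: str) -> str:
    """SHA-256 over the fields that define a unique generation."""
    payload = "\n".join([model_id, model_version, persona, topic])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = cache_key("example-model", "2024-06", "code_reviewer", "Explain tail-call optimization.")
# Editing the persona or topic changes the hash, so the cached response is no longer matched.
```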
What is deterministic mode?
When enabled, PeerLM sets temperature to 0 and uses a fixed random seed where the model supports one. Before sending these parameters, it checks each model's capability registry to ensure only supported parameters are sent — preventing silent failures. The actual parameters used are logged for audit purposes.
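A sketch of that capability gating (registry contents and parameter names are assumptions):

```
CAPABILITIES = {
    "model-a": {"temperature", "seed"},
    "model-b": {"temperature"},          # no seed support
}

def deterministic_params(model_id: str, seed: int = 42) -> dict:
    """Include temperature/seed only if the model's capability registry lists them."""
    supported = CAPABILITIES.get(model_id, set())
    params = {}
    if "temperature" in supported:
        params["temperature"] = 0
    if "seed" in supported:
        params["seed"] = seed
    return params   # the caller logs whatever was actually sent, for auditing

print(deterministic_params("model-b"))   # {'temperature': 0}: seed omitted rather than failing silently
```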
Do you store my API keys?
Provider API keys used for model calls are managed by PeerLM — you don't need to bring your own keys. For BYOK (Bring Your Own Key) setups on Enterprise plans, keys are encrypted with AES-256 at rest and never logged or exposed in responses.
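For illustration, encrypting a stored key with AES-256-GCM might look like the following (a sketch using the Python cryptography package, not PeerLM's actual implementation):

```
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

master_key = AESGCM.generate_key(bit_length=256)    # in practice held in a KMS, not in code
aesgcm = AESGCM(master_key)

def encrypt_api_key(plaintext_key: str) -> bytes:
    nonce = os.urandom(12)                           # unique nonce per encryption
    return nonce + aesgcm.encrypt(nonce, plaintext_key.encode(), None)

def decrypt_api_key(blob: bytes) -> str:
    nonce, ciphertext = blob[:12], blob[12:]
    return aesgcm.decrypt(nonce, ciphertext, None).decode()
```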
Can I cancel anytime?
Yes. All plans are month-to-month with no long-term commitment. You can downgrade or cancel at any time from your billing settings. When you cancel, you retain access until the end of your current billing period.
Still have questions?
Contact support