Changelog

New features, integrations, and research from Neurometric AI.

Mar 10, 2026

Product

Clawbake: Open-Source Multi-User Instance Management for OpenClaw

Released Clawbake, an open-source Kubernetes-based system that lets team members spin up isolated OpenClaw AI agent instances and manage them from a web dashboard or Slack.

Isolated environments with independent namespaces, API keys, and storage
Slack integration with /clawbake commands for instance management
Kubernetes operator pattern with automatic drift correction
Network, credential, and workload isolation across users
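
The drift correction mentioned above follows the general idea of an operator reconcile loop: compare the desired state with what is actually running and compute the actions that close the gap. The snippet below is a minimal sketch of that idea only, not Clawbake's implementation. `InstanceSpec`, its fields, and `reconcile` are hypothetical names chosen for illustration, and plain dictionaries stand in for the Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    """Hypothetical per-user instance description (illustrative fields only)."""
    image: str
    replicas: int


def reconcile(
    desired: dict[str, InstanceSpec],
    observed: dict[str, InstanceSpec],
) -> list[tuple[str, str]]:
    """Return the (action, instance_name) steps needed to make observed match desired."""
    actions: list[tuple[str, str]] = []
    for name, spec in desired.items():
        if name not in observed:
            # Declared but not running: create it.
            actions.append(("create", name))
        elif observed[name] != spec:
            # Running but drifted from its declared spec: correct it.
            actions.append(("update", name))
    for name in observed:
        if name not in desired:
            # Running but no longer declared: remove it.
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    desired = {
        "alice": InstanceSpec(image="openclaw:1", replicas=1),
        "bob": InstanceSpec(image="openclaw:1", replicas=1),
    }
    observed = {
        "alice": InstanceSpec(image="openclaw:1", replicas=2),  # drifted
        "carol": InstanceSpec(image="openclaw:1", replicas=1),  # orphaned
    }
    for action, name in reconcile(desired, observed):
        print(action, name)
```

A real operator would run a loop like this continuously against the cluster API and apply each action, so manual changes to an instance are reverted on the next pass.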

Mar 9, 2026

Product

Cosine Similarity Evaluation

Organizations without established eval datasets can evaluate model outputs using cosine similarity. Compare models against each other to optimize for speed and cost without needing ground-truth labels.

Cosine similarity scoring for text-based responses
Relative model comparison without eval datasets
Speed and cost optimization decisions from real outputs
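
As a rough illustration of the scoring idea, cosine similarity measures the angle between two vectors: the dot product divided by the product of their lengths. The sketch below is an assumption-laden example, not Neurometric's implementation. It uses simple word counts as the vectors, whereas a production system would more likely use embeddings, and `bag_of_words`, `cosine_similarity`, and `compare_outputs` are hypothetical helper names.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Turn text into word counts (a crude stand-in for an embedding vector)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors, in [0, 1] here."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty response has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def compare_outputs(reference_output: str, candidate_outputs: dict[str, str]) -> dict[str, float]:
    """Score each candidate model's output against a reference model's output."""
    ref = bag_of_words(reference_output)
    return {
        model: cosine_similarity(ref, bag_of_words(output))
        for model, output in candidate_outputs.items()
    }


if __name__ == "__main__":
    scores = compare_outputs(
        "the refund was issued to the customer",
        {
            "small-model": "the refund was issued to the customer today",
            "other-model": "please restart your router",
        },
    )
    for model, score in scores.items():
        print(model, round(score, 3))
```

The scores are relative: a cheaper or faster model whose outputs stay close to the reference model's is a candidate for substitution, with no labeled dataset required.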

Mar 6, 2026

Integration

Neurometric for Claude Skills

Turn Claude Code sessions into a testing lab. Automatically evaluate prompts across different models and reasoning configurations with zero code changes.

Automatic prompt capture from Claude Code sessions
Gateway routing through api.neurometric.ai
Built-in /neurometric-status and /neurometric-replay commands

Mar 5, 2026

Product

Agent Latency Optimization

Identify the fastest models for your agent workloads. Radar graphs compare models across cost, speed, and token efficiency, with latency improvements averaging 4-5x.

Radar graph model comparisons
Latency-focused optimization (avg 4-5x speed gains)
Free access at studio.neurometric.ai

Feb 27, 2026

Research

Agentic Inference Checklist

Published a comprehensive checklist for evaluating AI infrastructure and deploying Small Language Models for speed and cost reduction.

Step-by-step infrastructure evaluation framework
SLM deployment guide for production workloads
Cost-performance optimization strategies

Feb 26, 2026

Research

Stage 6: Autonomous AI Infrastructure

Explores the future of self-optimizing AI infrastructure where systems autonomously manage model selection, routing, and configuration without human intervention.

Intelligent gateway for real-time request routing
Automated model evaluation pipelines
Cost-quality optimizers with dynamic budget constraints

Feb 23, 2026

Integration

Langfuse Data Ingestion

Import your existing Langfuse traces directly into Neurometric for instant analysis. One-click connection with automatic trace sync.

One-click Langfuse connection
Automatic trace import and sync
Supports EU and US Langfuse regions

Feb 17, 2026

Product

Model Explorer: Chain of Thought

Explore 4-stage Chain of Thought reasoning across different models. Compare how models break down complex problems step by step.

4-stage CoT visualization
Side-by-side model comparison
Step-by-step reasoning analysis

Feb 13, 2026

Open Source GPT-4o Alternative

Announced a post-trained GPT-4o-style model based on Arcee 400B as a free and open-source alternative for production inference.

Feb 12, 2026

Research

SLM Fine-Tuning: 95% Accuracy on CRM-Arena

Fine-tuned a 4B parameter Small Language Model to achieve 95% accuracy on CRM-Arena, demonstrating that smaller models can outperform frontier models on domain-specific tasks.

4B parameter model fine-tuned for CRM tasks
95% accuracy — outperforming larger frontier models
Fraction of the inference cost

Feb 11, 2026

Research

AI Maturity: Stage 4 to Stage 5

Guide to transitioning from tactical model selection to systematic portfolio management across 10+ models with dedicated AI Model Ops infrastructure.

Portfolio management across frontier, mid-tier, SLM, and custom models
Multi-dimensional measurement at task-model-algorithm level
Continuous automated testing infrastructure

Feb 1, 2026

Product

Experiment Workflow

Full experiment pipeline for model evaluation: configure evaluations across multiple models, run automated comparisons, and review results with LLM-as-Judge scoring.

Multi-model experiment configuration
Automated prompt evaluation across 6+ models
LLM-as-Judge pass/fail scoring
Cost, latency, and token analysis per model

Jan 30, 2026

Product

CRM-Arena: NVIDIA Nemotron Support

Added NVIDIA Nemotron models to the CRM-Arena leaderboard for testing and benchmarking against other production models.

Jan 15, 2026

Integration

Langfuse Integration

Connect your Langfuse account to Neurometric and import existing LLM traces. Auto-sync prompts, costs, and latency data for immediate analysis.

API key-based connection
Automatic prompt and cost sync
Real-time monitoring support

Dec 15, 2025

Product

CRM-Arena Leaderboard

Launched the CRM-Arena leaderboard for benchmarking language models on real-world CRM tasks. Test models across customer service, sales, and support scenarios.

Real-world CRM task benchmarks
Multi-model comparison
Production-relevant evaluation metrics

Oct 23, 2025

Research

Inference Time Compute: CRM-Arena Results

Published research on how inference-time compute optimizations and task-specific algorithms improve model performance on production CRM workloads.
