LLM evaluation platform

GitHub - Arize-ai/phoenix: AI Observability & Evaluation

Arize Phoenix is an open-source LLM observability platform. As of late 2025, it offers tracing, interactive embeddings visualization, built-in evaluations, and drift detection. It is free to use and can be self-hosted.

DeepChecks

DeepChecks is a specialized LLM evaluation platform. As of late 2025, it provides auto-scoring, customizable judges, version comparison, and monitoring in both CI/CD pipelines and production. It supports complex agentic workflows and is aimed at reducing hallucinations, which makes it a fit for AI teams shipping generative apps.

DeepEval

Confident AI's DeepEval is an open-source LLM evaluation framework. As of late 2025, it supports testing for RAG, agents, and production apps with 50+ metrics, tracing, red teaming, and synthetic data generation. It is used by thousands of users, enterprises among them, for regression testing and production monitoring. The open-source core is free, and scalable cloud plans are available.