Alongside GPT-4, OpenAI has released a software framework for assessing the performance of its AI models. The framework, called Evals, will let anyone report shortcomings in OpenAI's models to help guide improvements, according to the company. In a blog post, OpenAI describes it as a form of crowdsourced model testing.
OpenAI built Evals to design and run benchmarks for evaluating models like GPT-4 and inspecting their performance. Developers can use the framework to compare performance across different datasets and models, analyze the quality of the completions an OpenAI model produces, and generate prompts from datasets.
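To make that workflow concrete, here is a minimal, hypothetical sketch of the kind of benchmark loop Evals automates: scoring a completion function against a small dataset of prompt/expected-answer pairs. The sample data, field names, and the stand-in model are all illustrative assumptions, not the actual Evals API.

```python
# Hypothetical benchmark samples, mirroring the prompt/expected-answer records
# a dataset file (e.g. JSONL) would supply. Field names are illustrative.
SAMPLES = [
    {"input": "What is 2 + 2?", "ideal": "4"},
    {"input": "What is the capital of France?", "ideal": "Paris"},
]

def exact_match_eval(samples, complete):
    """Score a completion function against (input, ideal) pairs by exact match."""
    correct = sum(complete(s["input"]).strip() == s["ideal"] for s in samples)
    return correct / len(samples)

def toy_model(prompt: str) -> str:
    # Stand-in for a real model call (e.g. to GPT-4 via the OpenAI API);
    # swapping in different models or datasets is how comparisons would be run.
    return {"What is 2 + 2?": "4"}.get(prompt, "unknown")

if __name__ == "__main__":
    print(f"accuracy: {exact_match_eval(SAMPLES, toy_model):.2%}")
```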
“We use Evals to guide the development of our models (both identifying shortcomings and preventing regressions), and our users can apply it for tracking performance across model versions and evolving product integrations,” OpenAI writes. “We are hoping Evals becomes a vehicle to share and crowdsource benchmarks, representing a maximally wide set of failure modes and difficult tasks.”
Evals ships with templates for several well-known AI benchmarks and also lets developers write new classes that implement custom evaluation logic. As an example to imitate, OpenAI built a logic-puzzles eval containing ten prompts where GPT-4 fails. Unfortunately, all of this labor is unpaid; however, to encourage use of the new platform, the company intends to grant GPT-4 access to those who contribute "high-quality" benchmarks.
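As a rough sketch of what such a custom class could look like, the hypothetical eval below counts a completion as correct if it merely contains the expected answer rather than matching it exactly. The class, method names, and stand-in model are assumptions for illustration and are not drawn from the framework's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Sample:
    prompt: str
    ideal: str

class IncludesEval:
    """Hypothetical custom eval: a completion passes if it contains the ideal answer."""

    def __init__(self, samples: Sequence[Sample], complete: Callable[[str], str]):
        self.samples = samples
        self.complete = complete

    def eval_sample(self, sample: Sample) -> bool:
        # Custom evaluation logic lives here; a logic-puzzle eval could instead
        # parse the completion and check its final answer.
        return sample.ideal.lower() in self.complete(sample.prompt).lower()

    def run(self) -> dict:
        results = [self.eval_sample(s) for s in self.samples]
        return {"accuracy": sum(results) / len(results)}

# Usage with a stand-in model; a real run would call GPT-4 or another model.
samples = [Sample("Name a primary color.", "red")]
report = IncludesEval(samples, lambda prompt: "Red is a primary color.").run()
print(report)  # {'accuracy': 1.0}
```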
“We believe that Evals will be an integral part of the process for using and building on top of our models, and we welcome direct contributions, questions, and feedback,” the company wrote.
With Evals, the AI research company isn't automatically drawing on user data to develop its models; instead, it is following other companies that have used voluntary crowdsourcing to strengthen their AI models.