Publications / LDRD Report

Predictive Indicators of the Performance of Large Language Models

Link, Hamilton E.; Berman, Brandon; Cooper, Ryan; Bays, Nathan R.; Vancleave, Robert

In several mission contexts, it is desirable to estimate the performance of large language models (LLMs) on tasks that we cannot run directly. In light of published “scaling laws,” our hypothesis is that some tasks should be consistently more challenging than others based on characteristics of the task. The goal of this project was to begin quantifying how much information about LLM performance can be gained from the features of a model and a task. Two of our statistical models struggled to converge. Pass/fail test results may provide limited information for inference beyond model quality and task difficulty, but we see no evidence at this time for significant feature interaction effect sizes, which argues for simple models. We suggest future work that extends the models to capitalize on the perplexity of ground-truth answers. This project also introduces “Depth of Knowledge Variant Testing” as a strategy for more finely assessing language models on open-domain question-and-answer tasks. We developed sets of questions that ask a language model to produce similar information while demonstrating increasing depth of knowledge, and we also relabeled existing Q&A test questions with their depth of knowledge. Our results suggest further consideration of Bloom’s taxonomy and further refinement of prompts to properly elicit information at varying depths. In the course of this work, we set up a basic infrastructure for standardizing tasks and testing many language models on these tasks. In addition to testing the predictive quality of model features and performance across test suites, we introduced two new task features to contextualize each test question: the Dewey Classification main category of information covered, and the Bloom’s taxonomy level that corresponds to the depth of knowledge probed by the question. Splits across these and other features produced over five hundred task subtypes with distinct feature vectors, which we tested on half a dozen models.
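To make the idea of a "simple model" concrete, the sketch below fits an additive pass/fail model in which the log-odds of a pass equal a per-model quality term minus a per-task difficulty term, with no interaction terms. This is an illustrative assumption, not the statistical models used in the report; the function names, the L2 penalty, the learning rate, and the toy results are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_additive_model(results, n_models, n_tasks, lr=0.1, epochs=2000, l2=0.01):
    """Fit P(pass) = sigmoid(quality[m] - difficulty[t]) by gradient ascent.

    results: list of (model_index, task_index, passed) with passed in {0, 1}.
    The small L2 penalty (an assumption of this sketch) keeps the estimates
    finite when a model passes or fails every item of some task.
    """
    quality = [0.0] * n_models
    difficulty = [0.0] * n_tasks
    for _ in range(epochs):
        grad_q = [0.0] * n_models
        grad_d = [0.0] * n_tasks
        for m, t, passed in results:
            err = passed - sigmoid(quality[m] - difficulty[t])
            grad_q[m] += err   # raising quality raises the pass probability
            grad_d[t] -= err   # raising difficulty lowers it
        for m in range(n_models):
            quality[m] += lr * (grad_q[m] - l2 * quality[m])
        for t in range(n_tasks):
            difficulty[t] += lr * (grad_d[t] - l2 * difficulty[t])
    return quality, difficulty


def predict_pass(quality, difficulty, m, t):
    """Predicted pass probability for model m on task subtype t."""
    return sigmoid(quality[m] - difficulty[t])


# Made-up pass/fail outcomes for 2 models on 2 task subtypes (illustration only).
toy_results = [
    (0, 0, 1), (0, 0, 1), (0, 1, 1), (0, 1, 0),
    (1, 0, 1), (1, 0, 0), (1, 1, 0), (1, 1, 0),
]
quality, difficulty = fit_additive_model(toy_results, n_models=2, n_tasks=2)
print(quality, difficulty, predict_pass(quality, difficulty, 0, 1))
```

In a model of this form, task features such as Dewey category or Bloom's taxonomy level could enter as predictors of the difficulty term; feature interactions would require additional terms, which the abstract reports no current evidence for.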