You Can't Debug a Judgment: Behavior(alism) vs Function(alism) in AI Evaluation
- Author: gianluca
In traditional software, we debug behavior; in AI, we evaluate function. This post explores the tension between behavioral transparency and functional performance in AI systems, drawing on both philosophy and software engineering. When the internal workings are opaque, as in neural networks, we shift from analyzing how a system works to judging what it achieves.
Behavior(alism) vs Function(alism)
One way to contrast traditional software engineering unit tests with AI evals is through the lens of behavior versus function.
Traditional systems are engineered with behavioral determinism: developers understand and control the computational steps, and unit tests verify whether the system behaves as specified.
In contrast, AI systems - especially those involving neural networks - are often treated as black boxes. Their internal behavior is opaque, but we evaluate them functionally, judging their usefulness based on how well the outputs align with our expectations. We may not know how the system behaves internally, but we know the function it serves.
Two examples illustrate the distinction: one from a philosophical perspective, the other from a software engineering viewpoint.
Consider the difference between a mechanical calculator and a human making moral judgments - a difference Leibniz might have disputed. With the calculator, we can trace each operation—each step in the behavior of the machine is transparent, predictable, and analyzable. This is akin to traditional software, where we understand the internal behavior that leads from input to output.
In contrast, when evaluating a person's moral judgment, we often can't fully explain the cognitive or neurological behavior that produced it - but we can assess whether the judgment aligns with certain ethical principles or expectations. Similarly, with neural networks and deep learning, we often lack insight into the "behavior" of the model, but we judge its function by how well the outputs fit our criteria or values. This mirrors a functionalist view of the mind: what matters is what it does, not how it's done.
Think of a simple bubble sort algorithm: every line of code can be stepped through in a debugger. You know exactly what happens, and unit tests verify that it behaves as expected in every condition. This is traditional software engineering: you test and understand the behavior of the system in precise terms.
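To make the contrast concrete, here is a minimal sketch in Python: a bubble sort whose every comparison and swap can be stepped through in a debugger, plus a unit test that pins down the expected behavior. Function and test names are illustrative, not taken from any particular codebase.

```python
import unittest


def bubble_sort(items):
    """Return a new list sorted in ascending order using bubble sort."""
    result = list(items)
    n = len(result)
    for i in range(n):
        for j in range(n - 1 - i):
            if result[j] > result[j + 1]:
                # Swap adjacent elements that are out of order.
                result[j], result[j + 1] = result[j + 1], result[j]
    return result


class TestBubbleSort(unittest.TestCase):
    def test_sorts_unordered_input(self):
        self.assertEqual(bubble_sort([3, 1, 2]), [1, 2, 3])

    def test_handles_empty_and_single_element(self):
        self.assertEqual(bubble_sort([]), [])
        self.assertEqual(bubble_sort([42]), [42])


if __name__ == "__main__":
    unittest.main()
```

The test asserts exactly what each branch of the algorithm is supposed to do; if it fails, you can step through the loop and find the offending line.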
Now take a neural network for image classification. You feed it a picture, it says "cat." You don’t know what computations happened inside - not in a way you can debug line-by-line. But you can evaluate the function it performs by running test sets, scoring accuracy, and checking for alignment with user expectations. The internal behavior is obscure, but the functional outcome is what matters. Testing becomes an exercise in evaluation, not in step-by-step verification.
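By way of contrast, a sketch of what functional evaluation looks like: the classifier is called as an opaque function and scored against a labelled test set. `evaluate_accuracy`, `toy_classifier`, and the sample labels below are illustrative stand-ins, not the API of any specific framework; a real model would be loaded from a library such as PyTorch or TensorFlow.

```python
from typing import Any, Callable, Iterable, Tuple


def evaluate_accuracy(
    classify: Callable[[Any], str],
    test_set: Iterable[Tuple[Any, str]],
) -> float:
    """Score a black-box classifier: the fraction of predictions matching the labels."""
    correct = 0
    total = 0
    for image, expected_label in test_set:
        prediction = classify(image)  # the model's internals are never inspected
        correct += int(prediction == expected_label)
        total += 1
    return correct / total if total else 0.0


# Illustrative stand-in for an opaque neural network.
def toy_classifier(image: Any) -> str:
    return "cat"


if __name__ == "__main__":
    labelled_images = [("img_001", "cat"), ("img_002", "dog"), ("img_003", "cat")]
    print(f"accuracy: {evaluate_accuracy(toy_classifier, labelled_images):.2f}")
```

Nothing in this evaluation depends on how the classifier computes its answer; only the functional outcome, the score on the labelled set, is measured.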
Reflections
In the generative AI paradigm, user feedback loops, evaluation processes, and pipelines for extracting and preparing contextual data from structured sources play a central role. These components are not only technical necessities and opportunities for innovation and value creation, particularly for the European market in systems built on Large Language Models. They also raise deeper philosophical and epistemological questions about the evolving relationship between biological, intentional intelligence and artificial, computational intelligence, and about how this relationship might be framed within a theory of knowledge creation for hybrid forms of intelligence.