The first thing is to sit down with the person who is interested in obtaining information about a speaker and have a discussion about what, exactly, we are interested in measuring.
The very first step is to define what we want to measure. That can seem very simple, but it’s not. Oftentimes what we try to measure is something we do not fully understand yet, something we are still trying to find out more about. That means the definition of that thing might change.
So oftentimes that very first step is very, very challenging.
Construct Irrelevant Variance
One of the main challenges is that language depends on cognitive processes that oftentimes we cannot observe. Measuring language is not like measuring something like a desk, where we can take a measuring tape and figure out how long that desk is.
So, if I’m interested in measuring something like the range of vocabulary that this person has, it’s unclear how we can do that. It’s not easy to go in and measure the distinct representations that a person has in their lexicon. So what we have to do is figure out different ways to get that information.
So we ask them to perform specific tasks, and based on those tasks we obtain scores. Based on those scores, we then make inferences about the cognitive skills that we care about.
One of the challenges there is that it’s not clear whether the tasks we design elicit information just about the thing we care about, or whether they also tap into other cognitive skills that we are not interested in. That is what we refer to as “construct irrelevant variance.”
That’s one of the challenges, obtaining a score that will give us information about the cognitive skills that we care about.
We also have noise. Individuals are not machines. Their performance will fluctuate as a function of a number of different random variables, such as whether or not they had coffee that morning.
When we design a task and we try to sample somebody’s behavior, we will capture what we care about, we will capture things we don’t care about that are systematic, and we will also capture some of that noise. It’s then up to the person using a test to figure out: how much does the score I’m looking at represent what I want to know about, versus all those other things?
If we’re making judgments about participants or clients, then we want information that will not change depending on whether we measure them on Wednesday, Thursday, or Friday.
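The three ingredients described above (what we care about, systematic things we do not care about, and noise) can be illustrated with a small simulation. This is only a sketch under stated assumptions: every number is made up, the three components are assumed to be independent and normally distributed, and the construct-irrelevant component is assumed to stay the same for a person from one day to the next. The names `simulate_scores` and `pearson` are hypothetical helpers, not part of any test or library.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_scores(n_people=2000, sd_construct=10.0, sd_irrelevant=5.0,
                    sd_noise=5.0, seed=0):
    """Simulate two administrations of the same task (all values assumed)."""
    rng = random.Random(seed)
    # The skill we care about: stable within a person.
    construct = [rng.gauss(50.0, sd_construct) for _ in range(n_people)]
    # Systematic but irrelevant skills the task also taps: also stable.
    irrelevant = [rng.gauss(0.0, sd_irrelevant) for _ in range(n_people)]

    def administer():
        # Fresh random noise is drawn on every administration.
        return [c + i + rng.gauss(0.0, sd_noise)
                for c, i in zip(construct, irrelevant)]

    return construct, administer(), administer()


construct, wednesday, friday = simulate_scores()

# How similar are the scores on two different days?
test_retest = pearson(wednesday, friday)
# What share of score variance reflects the construct itself?
construct_share = pearson(construct, wednesday) ** 2

print(f"test-retest correlation: {test_retest:.2f}")
print(f"share of variance due to the construct: {construct_share:.2f}")
```

With these assumed standard deviations, the expected test-retest correlation is (100 + 25) / 150, about 0.83, while the expected share of variance due to the construct is 100 / 150, about 0.67. Scores can therefore be quite stable from Wednesday to Friday and still partly reflect something other than the skill of interest, because the systematic irrelevant component is stable too.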
I realized that we often underestimate the amount of noise that the data contain. That’s the case even for measures that people think provide very valid and reliable information. That’s one of the things that’s really surprised me. In some cases it was unexpected to find that a specific test would contain that much noise. In other cases it would be surprising how much faith people would put in measurements sometimes without having the necessary psychometric evidence to do so.
An additional challenge is that language is very complex and very dynamic. It’s often considered an adaptive system, which means the way I produce language will vary as a function of who I talk to, the specific conditions, the setting, my goal, and the type of discourse.
When we design a task to sample language, oftentimes we have a specific constellation of all those different factors. We perform this task. A speaker produces language. And we measure some aspect of it. But what we’ve done is we’ve sampled that specific constellation of conditions.
So it’s uncertain whether that specific score, and the information it gives us about that specific combination of conditions, will generalize to other kinds of behaviors that we might be interested in.
For example, say I ask an individual to perform a task and tell me how they make a peanut butter and jelly sandwich, and then I measure some specific aspect of their language production. Does this give me information about things like how this individual will try to resolve an argument with their spouse? Or how that individual will try to resolve an issue when they are pulled over by a police officer?
It’s impossible to validate a test for “any given purpose,” that is, for all purposes at the same time. The idea is that you have to target specific purposes of the test and identify the pieces of information that would allow you to say, yes, I can interpret a given score in a specific way.
In our field we typically think of validity or reliability as being characteristics of specific assessment tools. In other fields validity is not thought of as something a test has or doesn’t have. Validity is often thought of as the extent to which certain interpretations are justified.
For example, say I ask an individual to go through a task, and I obtain a score based on that task. One of the critical questions is: what am I going to do with that score? If I’m interested, for example, in figuring out the severity of an individual’s impairment, then when I look in the test manual for specific psychometric evidence, I should find information that says that test is good at figuring out severity.
However, if I’m using a test in order to obtain a score before some treatment, and then after a treatment, and I’m interested in figuring out whether that treatment works or not, that means I have to go in the manual, first, to figure out whether or not that test is sensitive to change. So depending on the purpose I want to use a specific test for, I have to find specific evidence that supports the use of scores for a specific purpose.
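One common way to ask whether a pre-to-post difference is larger than measurement error alone would produce is the reliable change index, which divides the observed change by the standard error of the difference between two scores. The sketch below uses that index as an illustration only; it is not a method the speaker names, and the scores, standard deviation, and reliability value are all invented. In practice the standard deviation and reliability would have to come from the test manual.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def reliable_change_index(pre, post, sd, reliability):
    """Observed change divided by the standard error of a difference score."""
    sem = standard_error_of_measurement(sd, reliability)
    # Both the pre and the post score carry measurement error.
    se_difference = math.sqrt(2.0) * sem
    return (post - pre) / se_difference


# Hypothetical values: an 8-point gain on a test with SD 10 and reliability .85.
rci = reliable_change_index(pre=42.0, post=50.0, sd=10.0, reliability=0.85)
# 1.96 is the conventional two-sided 95% cutoff.
exceeds_measurement_error = abs(rci) > 1.96

print(f"RCI = {rci:.2f}, beyond measurement error: {exceeds_measurement_error}")
```

Under these assumed numbers the index is about 1.46, below the conventional 1.96 cutoff, so an 8-point gain could not be distinguished from measurement error. The same gain on a more reliable test could clear the cutoff, which is why evidence about sensitivity to change matters for this particular purpose.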
I don’t think that is always clear in our field. We talk about whether a test is valid or not, but perhaps we should be talking about whether a test is valid for specific purposes. A test might be valid for one purpose, but it might not be valid for a different purpose.
Individuals need to understand that the field is changing. So, the way we were designing tests 10, 20, 30 years ago is not the same as the way we are designing tests now.
The manuals will have very different information now, compared to the information they had 5 or 10 years ago. We also now have a lot of new quantitative statistical tools that allow us to design, develop, test, and evaluate our psychometric tools. They were not available before for a number of reasons. Either they were not available in our field, because we had not transferred these skills from other fields, or they were not available in general, because we did not have the computational power. Computers were not fast enough to do all the calculations needed to use those approaches and methodologies.
There are new frameworks like Item Response Theory, for example, that use very different quantitative tools to validate specific aspects of tests.
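To give a flavor of what such a framework looks like, here is a minimal sketch of one widely used Item Response Theory model, the two-parameter logistic (2PL) model, in which the probability of a correct response depends on the person’s ability and on the item’s difficulty and discrimination. The item names and parameter values are invented for illustration, and real applications estimate these parameters from response data with specialized software.

```python
import math


def p_correct(theta, difficulty, discrimination=1.0):
    """2PL item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# Hypothetical items, described only by their difficulty parameter.
items = {"easy": -1.0, "medium": 0.0, "hard": 1.5}

for ability in (-1.0, 0.0, 1.0):
    # Probability that a person at this ability level answers each item correctly.
    probs = {name: round(p_correct(ability, b), 2) for name, b in items.items()}
    print(f"ability {ability:+.1f}: {probs}")
```

When a person’s ability equals an item’s difficulty, the model gives a 50% chance of a correct response. Items and people are placed on the same scale, which is one of the ways this approach differs from simply summing raw scores.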
If somebody, for example, doesn’t have the background for using that specific tool, or approach, or framework, then it might be really hard to evaluate whether a specific test would be the appropriate test to use.
I think one of the things people should take into account now is: Do they have the skills to evaluate new psychometric tools as they are coming out? Or do they need to educate themselves about new approaches?
Markus, K. A., & Borsboom, D. (2013). Frontiers of test validity theory: Measurement, causation, and meaning. Routledge.