Who’s in Agreement?

So, let’s say you have invited everyone over for the big game on Sunday (Superbowl 49) but you don’t have a big-screen TV. Whoops! That sucks. Time to go shopping. Here’s the rub: which one to get? There are so many to choose from and only a little time to make the decision. Here is what you do:

2- make a list of all neighboring electronics stores
3- Go shopping!

OK, that sounds like a good plan, but doing all of this together will take an enormous amount of time and, more importantly, your Lada only seats 4 comfortably and there are 8 of you.

As you are a new research scientist (see here for your story) and you have already studied the challenges of assessing agreement (see here for a refresher) you know that it is best for all raters to assess the same items of interest. This is called a fully crossed design. So in this case you and all of your friends will assess all the TVs of interest. You will then make a decision based on the ratings. Often, it is of interest to know and to quantify the degree of agreement between the raters – your friends in this case. This assessment is the inter-rater reliability (IRR).

As a quick recap,

Observed Score = True Score + Measurement Error

And

Reliability = Var(True Score) / (Var(True Score) + Var(Measurement Error))
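To make the recap concrete, here is a minimal Python sketch of the reliability ratio. The function name `reliability` and the example variances are made up for illustration; they do not come from a real study.

```python
def reliability(var_true: float, var_error: float) -> float:
    """Share of the observed-score variance that comes from the true score."""
    total = var_true + var_error
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_true / total


# Hypothetical variances, chosen only for illustration.
print(reliability(var_true=4.0, var_error=1.0))  # 0.8
```

The closer the measurement error variance gets to zero, the closer the reliability gets to 1.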

Fully crossed designs allow you to assess and control for any systematic bias between raters at the cost of an increase in the number of assessments made.

The problem today is that you want to minimize the number of assessments made in order to save time and keep your buddies happy. What to do? Well, you will simply perform a study where different items will be rated by different subsets of raters. This is a “not fully crossed” design!
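The difference between the two designs can be sketched in a few lines of Python. Everything here is hypothetical: the 10 TVs, the helper names, and the simple rotation used to pick a subset of raters for each TV are assumptions for illustration, not a recommended allocation scheme.

```python
def fully_crossed(items, raters):
    """Every rater assesses every item."""
    return {item: list(raters) for item in items}


def not_fully_crossed(items, raters, raters_per_item):
    """Each item is assessed by a subset of raters, picked by simple rotation."""
    if not 1 <= raters_per_item <= len(raters):
        raise ValueError("raters_per_item must be between 1 and len(raters)")
    design = {}
    for i, item in enumerate(items):
        start = (i * raters_per_item) % len(raters)
        design[item] = [
            raters[(start + j) % len(raters)] for j in range(raters_per_item)
        ]
    return design


def count_assessments(design):
    """Total number of (rater, item) assessments in a design."""
    return sum(len(assigned) for assigned in design.values())


tvs = [f"TV {n}" for n in range(1, 11)]          # 10 hypothetical TVs
friends = [f"Friend {n}" for n in range(1, 9)]   # 8 raters, you included

crossed = fully_crossed(tvs, friends)
partial = not_fully_crossed(tvs, friends, raters_per_item=4)

print(count_assessments(crossed))  # 80
print(count_assessments(partial))  # 40
```

With 4 raters per TV, the rotation simply alternates between two carloads of friends, which halves the number of assessments.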

However, you must be aware that with this type of design you are at risk of underestimating the true reliability and must therefore use alternative statistics.

I will not go into statistical detail (today anyway!) but if you are interested have a peek here. The purpose of today’s post was simply to bring to your attention that you need to be very careful when assessing agreement between raters when NOT performing a fully crossed design. The good news is that there is a way to estimate reliability when you are not able to have all raters assess all the same subjects.
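For the curious, here is a rough sketch of one such estimate. The post does not say which statistic it has in mind; a commonly used option when different items are scored by different raters is the one-way random-effects intraclass correlation, often written ICC(1). The function name `icc_one_way`, the scores, and the assumption that every TV receives the same number of ratings are all illustrative choices.

```python
def icc_one_way(ratings):
    """One-way random-effects ICC for a single rating, ICC(1).

    ratings: one row per item, each row holding the k scores that item
    received. The raters behind the scores may differ from item to item.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 items, each with the same k >= 2 scores")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    item_means = [sum(row) / k for row in ratings]

    # Mean square between items and mean square within items.
    ms_between = k * sum((m - grand_mean) ** 2 for m in item_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, item_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("scores show no variation, ICC is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical scores out of 10: four TVs, each rated by two friends.
tv_scores = [[8, 7], [5, 6], [9, 9], [3, 4]]
print(icc_one_way(tv_scores))
```

A value near 1 suggests your friends largely agree on which TVs are good; a value near 0 suggests their scores tell you little about a TV you did not see yourself.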

Now you can have small groups of friends who can share the task of assessing TVs. This will result in fewer assessments, less time to complete the study, and – most importantly – less use of your precious Lada!

Your main concern, as you are the one buying the TV, is still: can you trust your friends’ assessment scores for TVs you did not see? But now you have a way to determine if you and your friends are on the same page!

Maybe this will save you and your friends from having to Agree to Disagree, as Will Ferrell did in Anchorman…

Listen to Agree to Disagree, an unreleased early song by Katy Perry, enjoy the Superbowl (and Katy Perry) on Sunday and…

…I’ll see you in the blogosphere!

Pascal Tyrrell

Face Validity: Whose Face Is It Anyway?

Yes, I was a big fan of the A-Team. Who wasn’t? Mr. T (I guess that makes me Prof. T…) was always entertaining to watch. Lieutenant Templeton Arthur “Faceman” Peck was suave, smooth-talking, and hugely successful with women. Peck served as the team’s con man and scrounger, able to get his hands on just about anything they needed. Need a refresher? Have a peek here.

Well, in a past post, 2 Legit 2 Quit, we talked about why we assess validity: because we want to know the nature of what is being measured and the relationship of that measure to its scientific aim or purpose. So what if we are uncertain that our measure (a scale, for example) looks reasonable? We would consider face validity and content validity. Essentially, face validity assesses whether or not the instrument we are using appears, “on the face of it”, to be measuring the desired qualities or attributes. Content validity – that was touched on in the previous post – is closely related and considers whether the instrument samples all of the relevant or important content of interest.

So, why the importance of face validity? Whenever you need to interact successfully with study participants there is often a need to:

– increase motivation and cooperation from participants for better responses.
– attract as many potential candidates as possible.
– reduce dissatisfaction among users.
– make your results more generalizable and appealing to stakeholders.

These are especially important points to consider when planning a study that involves human subjects as respondents, or where there is any level of subjectivity in how data are collected for the variables of interest.

However, you want to avoid a “Con Man” situation in your study where respondents’ answers are not what they appear to be. As a researcher you need to be aware that there may be situations where face validity may not be achievable. Let’s say, for instance, you are interested in discovering all factors related to bullying in high school. If you were to ask the question ‘Have you ever bullied a classmate into giving you his/her lunch money?’ you may have face validity but you may not get an honest response! In this case, you may consider a question that does not have face validity but will elicit the answer you are after. Ultimately, the decision on whether or not to have face validity – where the meaning and relevance are self-evident – depends on the nature and purpose of the instrument. Prepare to be flexible in your methodology!

Remember that face validity pertains to how your study participants perceive your test. They should be the yardstick by which you assess whether you have face validity or not.

Listen to Ed Sheeran – The A Team to decompress and…

… I’ll see you in the blogosphere.

Pascal Tyrrell