Interrater Reliability definition
So how do we determine whether two observers are being consistent in their observations? We have chosen to implement a test for inter-rater reliability. Interrater reliability is the extent to which two or more individuals agree.
Data Collection procedures
The data will be used to evaluate the quality of library reference services. The Reference Survey Card consists of two forms: Form A, which focuses on the student's experience during a reference desk interview, and Form B, which focuses on the librarian's experience during a reference desk interview. The student and the librarian will each complete their respective form immediately after a reference desk interview.
For the ratings of the students and the librarians to be considered “credible” and thus usable to draw conclusions, we must first test the degree to which the students and librarians agree about the student’s experience with the reference librarian. To test the reliability of ratings, we compute an inter-rater reliability coefficient. It is generally accepted that an inter-rater reliability coefficient of .75 or higher suggests that the ratings are reliable. The closer the coefficient is to 1.0 (a perfect match), the more reliable the ratings are. For example, suppose we have the following data:
Question: I am now better able to construct a successful search statement in order to find information.
Scale: 5 = Strongly Agree, 1 = Strongly Disagree

| Interaction Event | Student Rating | Librarian Rating |
|---|---|---|
| Interaction Event 1 | 5 | 5 |
| Interaction Event 2 | 4 | 4 |
| Interaction Event 3 | 4 | 4 |
| Interaction Event 4 | 3 | 5 |
| Interaction Event 5 | 5 | 3 |
| Interaction Event 6 | 3 | 4 |
| Interaction Event 7 | 5 | 5 |
| Interaction Event 8 | 5 | 5 |
| Interaction Event 9 | 3 | 2 |
| Interaction Event 10 | 2 | 2 |
| Total | 39 | 39 |
| Mean | 3.9 | 3.9 |

| Measure | Coefficient | Significance Level | Interpretation |
|---|---|---|---|
| Inter-Rater Reliability | .61 | .031 | Marginal agreement about statement (significant result) |

In this first case, the reliability of the ratings is considered low (.61) even though the student and librarian mean ratings are the same (3.9 on a 5-point scale).
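
To make the point about matching means concrete, here is a minimal sketch that lists the row-by-row differences for the first data set. The ratings are copied from the table above; the variable and function names are chosen here for illustration only and are not part of the survey procedure.

```python
# Ratings copied from the first table (Interaction Events 1-10).
student = [5, 4, 4, 3, 5, 3, 5, 5, 3, 2]
librarian = [5, 4, 4, 5, 3, 4, 5, 5, 2, 2]


def mean(values):
    """Arithmetic mean of a list of ratings."""
    return sum(values) / len(values)


# Signed difference for each interaction event (student minus librarian).
differences = [s - l for s, l in zip(student, librarian)]

# Number of events on which both raters gave exactly the same rating.
exact_matches = sum(1 for d in differences if d == 0)

print("student mean:  ", mean(student))
print("librarian mean:", mean(librarian))
print("differences:   ", differences)
print("exact matches: ", exact_matches, "of", len(differences))
```

The positive and negative differences cancel out, which is why the two means are identical even though the raters disagree on several individual events.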
Question: I am now better able to construct a successful search statement in order to find information.
Scale: 5 = Strongly Agree, 1 = Strongly Disagree

| Interaction Event | Student Rating | Librarian Rating |
|---|---|---|
| Interaction Event 1 | 5 | 5 |
| Interaction Event 2 | 4 | 4 |
| Interaction Event 3 | 4 | 4 |
| Interaction Event 4 | 5 | 5 |
| Interaction Event 5 | 3 | 3 |
| Interaction Event 6 | 3 | 4 |
| Interaction Event 7 | 5 | 5 |
| Interaction Event 8 | 5 | 5 |
| Interaction Event 9 | 3 | 2 |
| Interaction Event 10 | 2 | 2 |
| Total | 39 | 39 |
| Mean | 3.9 | 3.9 |

| Measure | Coefficient | Significance Level | Interpretation |
|---|---|---|---|
| Inter-Rater Reliability | .92 | Less than .001 | Excellent agreement about statement (highly significant result) |

In this second case, the reliability of the ratings is considered high (.92) even though the student and librarian’s mean rating (3.9) is the same as in the first case. The difference is in the level of agreement. You can see that in the second set of data the student and librarian ratings are the same in almost every row, and that is what we are looking for in your data. It doesn’t matter whether the student “Strongly Agrees” or “Strongly Disagrees”, only that the librarian and the student agree on the result of the interaction and thus provide similar ratings.
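
The contrast between the two cases can be sketched in a few lines of code. The text does not say which inter-rater reliability coefficient was computed, so the example below uses Pearson's correlation as a stand-in, and its results need not reproduce the reported figures exactly. The function name `pearson_r` and the constant `THRESHOLD` are illustrative assumptions; the .75 cut-off is the one quoted earlier in this section.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Cut-off quoted in the text for "reliable" ratings.
THRESHOLD = 0.75

# Ratings copied from the two tables; the librarian ratings are the
# same in both cases, only the student ratings differ.
librarian = [5, 4, 4, 5, 3, 4, 5, 5, 2, 2]
student_case_1 = [5, 4, 4, 3, 5, 3, 5, 5, 3, 2]
student_case_2 = [5, 4, 4, 5, 3, 3, 5, 5, 3, 2]

r_case_1 = pearson_r(student_case_1, librarian)
r_case_2 = pearson_r(student_case_2, librarian)

for label, r in (("case 1", r_case_1), ("case 2", r_case_2)):
    verdict = "reliable" if r >= THRESHOLD else "not reliable"
    print(f"{label}: r = {r:.2f} ({verdict} at the {THRESHOLD} cut-off)")
```

Both cases have the same means, yet only the second clears the cut-off, because the coefficient responds to how closely the paired ratings track each other and not to their averages.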
