Cohen's κ / Krippendorff's α — Measuring Whether Raters Agree


inter-rater-agreement-kappa-alpha.md · updated 2026-07-16 · 2597 words


Social Sciences · Psychology


When multiple raters label the same data, the raw agreement rate overstates reliability because some matches happen by chance. Cohen's κ and Krippendorff's α correct for chance agreement and so give a more defensible reliability figure.

Why raw agreement is not enough

Manual annotation results shift when the rater changes.
Raw agreement rate counts chance matches as real agreement.
A chance-corrected metric is required before trusting any accuracy score.
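
To make the chance-match problem concrete, here is a minimal sketch of how often two raters would agree if they labelled independently. The label names and the 90/10 split are illustrative assumptions, not data from this note.

```python
def chance_agreement(p_a, p_b):
    """Probability that two independent raters pick the same label.

    p_a, p_b: dicts mapping label -> probability that the rater chooses it.
    """
    labels = set(p_a) | set(p_b)
    return sum(p_a.get(label, 0.0) * p_b.get(label, 0.0) for label in labels)


# Hypothetical raters who each mark 90% of items "relevant".
rater_a = {"relevant": 0.9, "not_relevant": 0.1}
rater_b = {"relevant": 0.9, "not_relevant": 0.1}

print(round(chance_agreement(rater_a, rater_b), 2))  # 0.82
```

With skewed labels like these, a raw agreement rate of 0.82 is what pure chance would already produce, so it says nothing about whether the raters share a scoring logic.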

Cohen's κ — two raters, nominal scale

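Formula: κ = (P_o − P_e) ÷ (1 − P_e), where P_o = observed proportion of items on which the two raters agree, P_e = proportion of agreement expected by chance, computed from each rater's label frequencies.
Conventional benchmarks (Landis & Koch 1977): 0.41–0.60 moderate; 0.61–0.80 substantial; 0.81–1.00 almost perfect.
Limitation: two raters only; the unweighted form treats every disagreement as equally severe, so ordinal or interval data need weighted κ or a different statistic.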
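
The formula above can be sketched in a few lines of Python. This is an unweighted, two-rater version written for illustration. The function name `cohens_kappa` and the sample labels are assumptions, not part of any library or of the original note.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)

    # P_o: share of items on which the two raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # P_e: chance agreement from each rater's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Made-up labels: the raters agree on 8 of 10 items, each using "y" half the time.
rater_a = ["y", "y", "y", "y", "n", "n", "n", "n", "y", "n"]
rater_b = ["y", "y", "y", "n", "n", "n", "n", "y", "y", "n"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.6
```

Here P_o = 0.80 and P_e = 0.50, so κ = 0.60: raw agreement looks high, but the chance-corrected value falls in the "moderate" band.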

Krippendorff's α — three+ raters, any scale

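Handles any number of raters, missing data, and nominal, ordinal, interval, and ratio scales in one formula: α = 1 − D_o ÷ D_e, where D_o = observed disagreement and D_e = disagreement expected by chance, with the difference function chosen to match the scale.
Rule of thumb (Krippendorff 2004): α ≥ 0.80 = trustworthy conclusions; 0.667 ≤ α < 0.80 = tentative conclusions only.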
Preferred over κ when rater count or scale type varies across tasks.
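
As a rough illustration of how α copes with missing ratings, the sketch below computes the nominal-scale variant using the coincidence-matrix approach. The function name, the input layout (one list of labels per item, `None` for a missing rating), and the sample data are assumptions made for this example. Ordinal, interval, and ratio data would need a different difference function.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists, one inner list per item, holding each rater's
    label for that item; None marks a missing rating.
    """
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two ratings cannot be paired
        # Every ordered pair of ratings within an item counts 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Disagreement mass actually observed vs. expected by chance.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Made-up data: two raters, four items, one disagreement.
items = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
print(round(krippendorff_alpha_nominal(items), 3))  # 0.533
```

For real campaigns a maintained implementation is the safer choice; this sketch only shows where the missing-data handling comes from (items with fewer than two ratings are skipped, and the rest are weighted by how many ratings they have).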

Application in AI-search evaluation

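The Definition of Done (DoD), category E-1 (Reliability), sets a minimum of κ ≥ 0.70 or α ≥ 0.667.
Two or three raters score factual, brand-mention, and freshness judgements in parallel, and κ/α is reported each week.
Example: observed agreement 0.90, chance-expected agreement 0.50 → κ = (0.90 − 0.50) ÷ (1 − 0.50) = 0.80, which clears the 0.70 minimum and confirms stable scoring logic.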
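
The weekly check can be sketched as a small gate function. The thresholds come from the note. The function names are hypothetical, and reading "or" as "either metric is enough" is an assumption about how E-1 is applied.

```python
KAPPA_MIN = 0.70   # DoD E-1 minimum for Cohen's kappa (from the note)
ALPHA_MIN = 0.667  # DoD E-1 minimum for Krippendorff's alpha (from the note)


def kappa_from_rates(p_o, p_e):
    """kappa = (P_o - P_e) / (1 - P_e)."""
    if not 0.0 <= p_e < 1.0:
        raise ValueError("chance-expected agreement must be in [0, 1)")
    return (p_o - p_e) / (1.0 - p_e)


def passes_reliability_gate(kappa=None, alpha=None):
    """True if at least one supplied metric meets its minimum (assumed reading)."""
    if kappa is None and alpha is None:
        raise ValueError("supply kappa, alpha, or both")
    kappa_ok = kappa is not None and kappa >= KAPPA_MIN
    alpha_ok = alpha is not None and alpha >= ALPHA_MIN
    return kappa_ok or alpha_ok


kappa = kappa_from_rates(0.90, 0.50)
print(round(kappa, 2), passes_reliability_gate(kappa=kappa))  # 0.8 True
```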

→ Use κ for two-rater nominal tasks; switch to α when there are more raters, non-nominal scales, or missing data. Set the minimum threshold before the annotation campaign starts.
