🥈 Silver Medal (Top 5%), Rank 53 / 1858
Public LB: 0.945 · Private LB: 0.946

The task is multi-class classification: given a math multiple-choice question, the student's chosen answer, and their written explanation, predict the error type the student exhibits — one of 65 `Category:Misconception` labels.

Our solution combines LoRA fine-tuning of multiple LLMs with a weighted ensemble: four instruction-tuned models whose linguistic, logical, and mathematical strengths complement one another.
| Model | Size | Fine-tuning | Strength |
|---|---|---|---|
| Gemma2-9B-IT | 9B | LoRA-CV945 | English comprehension and semantic robustness |
| Qwen3-8B | 8B | LoRA-MAP | Multilingual adaptability |
| DeepSeek-Math-7B | 7B | LoRA-MAP | Mathematical reasoning |
| Hunyuan-7B-Instruct | 7B | LoRA-MAP | Instruction generalization and stability |
train["target"] = train["Category"] + ":" + train["Misconception"]
train["label"] = LabelEncoder().fit_transform(train["target"])
idx = train.apply(lambda row: row.Category.split("_")[0], axis=1) == "True"
correct = train.loc[idx].groupby(["QuestionId","MC_Answer"]).head(1)
```
Question: {QuestionText}
Answer: {MC_Answer}
Correct? {Yes/No}
Student Explanation: {StudentExplanation}
```
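The `Correct?` slot is filled from the `correct` lookup built above. Below is a minimal sketch of the prompt assembly; the `is_correct` flag and `build_prompt` helper are illustrative, not the authors' exact code (the column names come from the template and snippet above):

```python
# Flag each row's chosen answer as correct/incorrect via the lookup table.
correct_keys = set(zip(correct["QuestionId"], correct["MC_Answer"]))
train["is_correct"] = [
    (q, a) in correct_keys for q, a in zip(train["QuestionId"], train["MC_Answer"])
]

def build_prompt(row) -> str:
    """Fill the prompt template from one dataframe row."""
    return (
        f"Question: {row['QuestionText']}\n"
        f"Answer: {row['MC_Answer']}\n"
        f"Correct? {'Yes' if row['is_correct'] else 'No'}\n"
        f"Student Explanation: {row['StudentExplanation']}"
    )

train["prompt"] = train.apply(build_prompt, axis=1)
```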
| Model | LoRA Rank | LR | Batch | CV | Notes |
|---|---|---|---|---|---|
| Gemma2-9B-IT | 16 | 2e-4 | 8 | 0.945 | Main model |
| Qwen3-8B | 16 | 2e-4 | 8 | 0.944 | Semantic breadth |
| DeepSeek-Math-7B | 16 | 2e-4 | 8 | 0.944 | Mathematical logic |
| Hunyuan-7B | 16 | 2e-4 | 8 | 0.943 | Stability |
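A minimal sketch of the shared fine-tuning setup with Hugging Face `peft`, assuming a sequence-classification head over the 65 labels. Only the LoRA rank (16), learning rate (2e-4), and batch size (8) come from the table; `lora_alpha`, dropout, and target modules are guesses:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

lora_config = LoraConfig(
    r=16,                                  # LoRA rank (from the table)
    lora_alpha=32,                         # assumed; not reported
    lora_dropout=0.05,                     # assumed; not reported
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="SEQ_CLS",
)

# Repeated for each of the four base models; trained with lr=2e-4, batch 8.
model = AutoModelForSequenceClassification.from_pretrained(
    "google/gemma-2-9b-it",
    num_labels=65,                         # one logit per Category:Misconception
)
model = get_peft_model(model, lora_config)
```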
Each model's inference script writes its predictions as:

```
row_id, top_classes, prob_0, prob_1, ..., prob_24
```

The ensemble then scores each class as:

```python
final_score = 0.6 * weighted_mean_prob \
            + 0.3 * agreement_bonus \
            + 0.1 * confidence_bonus
```
The four models' outputs are merged, and the Top-3 classes form the final prediction, as sketched below.
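The writeup specifies only the 0.6/0.3/0.1 mix, so the sketch below fills in plausible definitions: `agreement_bonus` as the fraction of models voting for a class, `confidence_bonus` as the strongest single-model probability, and per-model weights that are likewise assumed:

```python
import numpy as np

def ensemble_scores(probs, weights):
    """Combine per-model class probabilities, each of shape [n_rows, n_classes]."""
    stacked = np.stack(probs)                      # [n_models, n_rows, n_classes]
    w = np.asarray(weights)[:, None, None]
    weighted_mean_prob = (w * stacked).sum(axis=0) / w.sum()

    # agreement_bonus: fraction of models whose top-1 prediction is each class.
    n_models, n_rows, n_classes = stacked.shape
    top1 = stacked.argmax(axis=2)                  # [n_models, n_rows]
    agreement_bonus = np.zeros((n_rows, n_classes))
    for m in range(n_models):
        agreement_bonus[np.arange(n_rows), top1[m]] += 1.0 / n_models

    # confidence_bonus: the strongest single-model probability per class.
    confidence_bonus = stacked.max(axis=0)

    return 0.6 * weighted_mean_prob + 0.3 * agreement_bonus + 0.1 * confidence_bonus

# Example with dummy probabilities (4 models, 5 rows, 65 classes); weights assumed.
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(65), size=5) for _ in range(4)]
final = ensemble_scores(probs, weights=[0.4, 0.2, 0.2, 0.2])
top3 = np.argsort(-final, axis=1)[:, :3]           # Top-3 classes per row
```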
| Model | CV | Public | Private |
|---|---|---|---|
| DeepSeek-Math-7B | 0.944 | 0.942 | 0.942 |
| Qwen3-8B | 0.945 | 0.944 | 0.945 |
| Gemma2-9B | 0.942 | 0.943 | 0.944 |
| Hunyuan-7B | 0.943 | 0.943 | 0.943 |
| Ensemble (final) | 0.948 | 0.945 | 0.946 🥈 |
```
├── gemma2_inference.py
├── qwen3_deepseek_inference.py
├── hunyuan_inference.py
├── ensemble.py
└── submission.csv
```