Updated evaluation results formatting
README.md
CHANGED
@@ -54,16 +54,30 @@ We compare Munin-7B-Open-pt at various training stages with its base model [Comm
The following tables show, for Danish and English respectively, the performance on each dataset. For each, we report the respective main metric from EuroEval and the confidence interval.

+| Model | scala-da (MCC) | dala (MCC) | angry-tweets (MCC) | no_misc_dansk (Micro F1) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
+| ------------------------ | -------------- | ------------ | ------------------ | ------------------------ | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
+| **comma-v0.1-2t** | 0.94 ± 0.76 | 0.15 ± 0.55 | 39.77 ± 1.36 | 31.97 ± 2.77 | 3.63 ± 2.31 | 10.72 ± 4.05 | 66.37 ± 0.81 | 3.84 ± 0.96 | 60.20 ± 1.69 |
+| **munin-7b-open-stage1** | 13.27 ± 2.92 | 12.70 ± 2.16 | 47.65 ± 1.70 | 40.01 ± 2.39 | 18.06 ± 0.92 | 32.84 ± 1.43 | 76.57 ± 0.55 | 12.85 ± 1.02 | 65.91 ± 0.85 |
+| **munin-7b-open-stage2** | 15.78 ± 3.05 | 14.43 ± 2.92 | 47.35 ± 2.30 | 40.42 ± 2.38 | 24.12 ± 1.79 | 36.07 ± 1.80 | 75.18 ± 0.71 | 13.09 ± 1.13 | 66.50 ± 0.69 |
+| **munin-7b-open-stage3** | 16.45 ± 1.36 | 15.68 ± 1.74 | 46.33 ± 2.09 | 41.08 ± 2.81 | 24.61 ± 1.98 | 36.22 ± 1.69 | 76.02 ± 0.68 | 13.15 ± 1.21 | 66.55 ± 0.63 |
+| **Pleias-350m-Preview** | -0.95 ± 1.46 | -1.84 ± 1.75 | 10.61 ± 2.87 | 12.86 ± 1.78 | 0.66 ± 2.63 | 4.59 ± 2.31 | 11.63 ± 0.88 | -0.26 ± 0.73 | 56.28 ± 1.47 |
+| **Pleias-1.2b-Preview** | 0.17 ± 1.13 | 0.66 ± 1.01 | 27.70 ± 2.89 | 27.30 ± 2.18 | -0.61 ± 1.89 | 8.60 ± 3.24 | 35.20 ± 1.25 | -0.04 ± 1.48 | 60.34 ± 0.86 |
+
+| Model | scala-en (MCC) | sst5 (MCC) | conll-en (Micro F1 no misc) | life-in-the-uk (MCC) | squad (F1) | hellaswag (MCC) | cnn-dailymail (BERTScore) |
+| ------------------------ | -------------- | ------------ | --------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
+| **comma-v0.1-2t** | 29.74 ± 1.94 | 61.75 ± 2.08 | 57.54 ± 2.76 | 41.60 ± 2.41 | 90.38 ± 0.35 | 16.83 ± 0.63 | 63.33 ± 0.94 |
+| **munin-7b-open-stage1** | 27.46 ± 2.13 | 60.01 ± 1.69 | 56.63 ± 2.14 | 40.45 ± 1.74 | 22.10 ± 0.67 | 13.66 ± 0.70 | 59.16 ± 1.40 |
+| **munin-7b-open-stage2** | 27.65 ± 2.04 | 59.49 ± 1.59 | 56.61 ± 2.31 | 41.16 ± 1.73 | 22.29 ± 1.49 | 15.95 ± 0.90 | 60.22 ± 1.58 |
+| **munin-7b-open-stage3** | 29.00 ± 2.41 | 60.30 ± 1.40 | 56.96 ± 2.49 | 41.71 ± 1.78 | 24.62 ± 2.29 | 13.76 ± 0.87 | 58.98 ± 1.70 |
+| **Pleias-350m-Preview** | 0.71 ± 1.75 | 15.41 ± 7.34 | 31.76 ± 3.48 | -0.70 ± 2.11 | 31.07 ± 2.31 | 0.22 ± 1.35 | 53.80 ± 1.04 |
+| **Pleias-1.2b-Preview** | 0.99 ± 2.37 | 48.23 ± 2.58 | 40.86 ± 3.28 | 2.55 ± 2.75 | 52.90 ± 2.48 | -0.06 ± 1.50 | 60.15 ± 1.59 |

The following plots show, for Danish and English respectively, model size on the x-axis and an aggregate performance score on the y-axis. Each metric is normalized across all evaluated models using min-max normalization to the range [0, 1], and the final score represents the average of all normalized metrics.

| Danish | English |
|:--------------------------:|:--------------------------:|
+| <img src="./images/performance_plot_da.png" width="600"/> | <img src="./images/performance_plot_en.png" width="600"/> |
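
The aggregate score described above (min-max normalization per metric, then an unweighted mean per model) can be sketched as follows. This is a minimal illustration, not the script used to produce the plots: the function names `min_max_normalize` and `aggregate_scores` are hypothetical, mapping a constant column to 0.0 is an assumption, and the example uses only three rows and two columns copied from the Danish table, so its output will not match the plotted values, which are normalized across all evaluated models.

```python
from statistics import mean


def min_max_normalize(values):
    """Scale a list of scores to [0, 1].

    Assumption: if every model has the same score, all values map to 0.0.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate_scores(scores):
    """Map {model: {metric: value}} to {model: mean of normalized metrics}."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    normalized = {model: [] for model in models}
    for metric in metrics:
        # Normalize each metric across all models before averaging.
        column = min_max_normalize([scores[model][metric] for model in models])
        for model, value in zip(models, column):
            normalized[model].append(value)
    return {model: mean(values) for model, values in normalized.items()}


# Illustrative subset of the Danish results (point estimates only).
example = {
    "comma-v0.1-2t": {"scala-da": 0.94, "angry-tweets": 39.77},
    "munin-7b-open-stage3": {"scala-da": 16.45, "angry-tweets": 46.33},
    "Pleias-350m-Preview": {"scala-da": -0.95, "angry-tweets": 10.61},
}
aggregate = aggregate_scores(example)
for model, score in aggregate.items():
    print(f"{model}: {score:.3f}")
```

Because the normalization is relative to the set of models included, adding or removing a model can change every model's aggregate score, so the scores are only comparable within a single plot.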
## Limitations