danish-foundation-models/dfm-decoder-open-v0-7b-pt

@@ -54,16 +54,30 @@ We compare Munin-7B-Open-pt at various training stages with its base model [Comm
 The following tables show, for Danish and English respectively, the performance on each dataset. For each dataset, we report EuroEval's main metric together with its confidence interval.
-| Danish | English |
-|:-----------------------:|:-----------------------:|
-| ![Heatmap EuroEval DA→TR](https://i.postimg.cc/65JD1Bdr/heatmap-euroeval-da-tr.png) | ![Heatmap EuroEval EN→TR](https://i.postimg.cc/9Fds6sp6/heatmap-euroeval-en-tr.png) |
 The following plots show, for Danish and English respectively, model size on the x-axis and an aggregate performance score on the y-axis. Each metric is normalized across all evaluated models using min-max normalization to the range [0, 1], and the final score represents the average of all normalized metrics.
 | Danish | English |
 |:--------------------------:|:--------------------------:|
-| ![Model Performance Plot DA](https://i.postimg.cc/Ls8kVQv3/model-performance-plot-da.png) | ![Model Performance Plot EN](https://i.postimg.cc/TPNVNj46/model-performance-plot-en.png) |
 ## Limitations

 The following tables show, for Danish and English respectively, the performance on each dataset. For each dataset, we report EuroEval's main metric together with its confidence interval.
+| Model                    | scala-da (MCC) | dala (MCC)   | angry-tweets (MCC) | no_misc_dansk (Micro F1) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
+| ------------------------ | -------------- | ------------ | ------------------ | ------------------------ | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
+| **comma-v0.1-2t**        | 0.94 ± 0.76    | 0.15 ± 0.55  | 39.77 ± 1.36       | 31.97 ± 2.77             | 3.63 ± 2.31             | 10.72 ± 4.05               | 66.37 ± 0.81          | 3.84 ± 0.96        | 60.20 ± 1.69                 |
+| **munin-7b-open-stage1** | 13.27 ± 2.92   | 12.70 ± 2.16 | 47.65 ± 1.70       | 40.01 ± 2.39             | 18.06 ± 0.92            | 32.84 ± 1.43               | 76.57 ± 0.55          | 12.85 ± 1.02       | 65.91 ± 0.85                 |
+| **munin-7b-open-stage2** | 15.78 ± 3.05   | 14.43 ± 2.92 | 47.35 ± 2.30       | 40.42 ± 2.38             | 24.12 ± 1.79            | 36.07 ± 1.80               | 75.18 ± 0.71          | 13.09 ± 1.13       | 66.50 ± 0.69                 |
+| **munin-7b-open-stage3** | 16.45 ± 1.36   | 15.68 ± 1.74 | 46.33 ± 2.09       | 41.08 ± 2.81             | 24.61 ± 1.98            | 36.22 ± 1.69               | 76.02 ± 0.68          | 13.15 ± 1.21       | 66.55 ± 0.63                 |
+| **Pleias-350m-Preview**  | -0.95 ± 1.46   | -1.84 ± 1.75 | 10.61 ± 2.87       | 12.86 ± 1.78             | 0.66 ± 2.63             | 4.59 ± 2.31                | 11.63 ± 0.88          | -0.26 ± 0.73       | 56.28 ± 1.47                 |
+| **Pleias-1.2b-Preview**  | 0.17 ± 1.13    | 0.66 ± 1.01  | 27.70 ± 2.89       | 27.30 ± 2.18             | -0.61 ± 1.89            | 8.60 ± 3.24                | 35.20 ± 1.25          | -0.04 ± 1.48       | 60.34 ± 0.86                 |
+
+| Model                    | scala-en (MCC) | sst5 (MCC)   | conll-en (Micro F1 no misc) | life-in-the-uk (MCC) | squad (F1)   | hellaswag (MCC) | cnn-dailymail (BERTScore) |
+| ------------------------ | -------------- | ------------ | --------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
+| **comma-v0.1-2t**        | 29.74 ± 1.94   | 61.75 ± 2.08 | 57.54 ± 2.76                | 41.60 ± 2.41         | 90.38 ± 0.35 | 16.83 ± 0.63    | 63.33 ± 0.94              |
+| **munin-7b-open-stage1** | 27.46 ± 2.13   | 60.01 ± 1.69 | 56.63 ± 2.14                | 40.45 ± 1.74         | 22.10 ± 0.67 | 13.66 ± 0.70    | 59.16 ± 1.40              |
+| **munin-7b-open-stage2** | 27.65 ± 2.04   | 59.49 ± 1.59 | 56.61 ± 2.31                | 41.16 ± 1.73         | 22.29 ± 1.49 | 15.95 ± 0.90    | 60.22 ± 1.58              |
+| **munin-7b-open-stage3** | 29.00 ± 2.41   | 60.30 ± 1.40 | 56.96 ± 2.49                | 41.71 ± 1.78         | 24.62 ± 2.29 | 13.76 ± 0.87    | 58.98 ± 1.70              |
+| **Pleias-350m-Preview**  | 0.71 ± 1.75    | 15.41 ± 7.34 | 31.76 ± 3.48                | -0.70 ± 2.11         | 31.07 ± 2.31 | 0.22 ± 1.35     | 53.80 ± 1.04              |
+| **Pleias-1.2b-Preview**  | 0.99 ± 2.37    | 48.23 ± 2.58 | 40.86 ± 3.28                | 2.55 ± 2.75          | 52.90 ± 2.48 | -0.06 ± 1.50    | 60.15 ± 1.59              |
 The following plots show, for Danish and English respectively, model size on the x-axis and an aggregate performance score on the y-axis. Each metric is normalized across all evaluated models using min-max normalization to the range [0, 1], and the final score represents the average of all normalized metrics.
 | Danish | English |
 |:--------------------------:|:--------------------------:|
+| <img src="./images/performance_plot_da.png" width="600"/> | <img src="./images/performance_plot_en.png" width="600"/> |
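
The aggregate score described above (min-max normalization of each metric across models, followed by an average) can be sketched as follows. This is a minimal illustration and not the script used to produce the plots: the function names are hypothetical, the handling of a metric on which all models tie is an assumption, and the input uses only two Danish columns for three models taken from the table, so the resulting numbers will not match the plotted scores.

```python
from statistics import mean


def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if every model ties on a metric, map all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate_scores(scores):
    """Map {model: {metric: value}} to {model: mean of normalized metrics}."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    normalized = {model: [] for model in models}
    for metric in metrics:
        # Normalize one metric at a time across all models.
        column = [scores[model][metric] for model in models]
        for model, value in zip(models, min_max_normalize(column)):
            normalized[model].append(value)
    return {model: mean(values) for model, values in normalized.items()}


# Subset of the Danish results (scala-da and dala only), for illustration.
example = {
    "comma-v0.1-2t": {"scala-da": 0.94, "dala": 0.15},
    "munin-7b-open-stage3": {"scala-da": 16.45, "dala": 15.68},
    "Pleias-350m-Preview": {"scala-da": -0.95, "dala": -1.84},
}

result = aggregate_scores(example)
for model, score in result.items():
    print(f"{model}: {score:.3f}")
```

Because the normalization is relative to the set of evaluated models, adding or removing a model can change every model's aggregate score, so the scores are only comparable within one plot.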
 ## Limitations