giannor committed · Commit 261cefc · 1 Parent(s): 01ad791

Updated evaluation results formatting

Files changed (1): README.md +18 -4
README.md CHANGED
@@ -54,16 +54,30 @@ We compare Munin-7B-Open-pt at various training stages with its base model [Comm
 
 The following tables show, for Danish and English respectively, the performance on each dataset. For each, we report the respective main metric from EuroEval and the confidence interval.
 
-| Danish | English |
-|:-----------------------:|:-----------------------:|
-| ![Heatmap EuroEval DA→TR](https://i.postimg.cc/65JD1Bdr/heatmap-euroeval-da-tr.png) | ![Heatmap EuroEval EN→TR](https://i.postimg.cc/9Fds6sp6/heatmap-euroeval-en-tr.png) |
+| Model | scala-da (MCC) | dala (MCC) | angry-tweets (MCC) | no_misc_dansk (Micro F1) | danske-talemaader (MCC) | danish-citizen-tests (MCC) | multi-wiki-qa-da (F1) | hellaswag-da (MCC) | nordjylland-news (BERTScore) |
+| ------------------------ | -------------- | ------------ | ------------------ | ------------------------ | ----------------------- | -------------------------- | --------------------- | ------------------ | ---------------------------- |
+| **comma-v0.1-2t** | 0.94 ± 0.76 | 0.15 ± 0.55 | 39.77 ± 1.36 | 31.97 ± 2.77 | 3.63 ± 2.31 | 10.72 ± 4.05 | 66.37 ± 0.81 | 3.84 ± 0.96 | 60.20 ± 1.69 |
+| **munin-7b-open-stage1** | 13.27 ± 2.92 | 12.70 ± 2.16 | 47.65 ± 1.70 | 40.01 ± 2.39 | 18.06 ± 0.92 | 32.84 ± 1.43 | 76.57 ± 0.55 | 12.85 ± 1.02 | 65.91 ± 0.85 |
+| **munin-7b-open-stage2** | 15.78 ± 3.05 | 14.43 ± 2.92 | 47.35 ± 2.30 | 40.42 ± 2.38 | 24.12 ± 1.79 | 36.07 ± 1.80 | 75.18 ± 0.71 | 13.09 ± 1.13 | 66.50 ± 0.69 |
+| **munin-7b-open-stage3** | 16.45 ± 1.36 | 15.68 ± 1.74 | 46.33 ± 2.09 | 41.08 ± 2.81 | 24.61 ± 1.98 | 36.22 ± 1.69 | 76.02 ± 0.68 | 13.15 ± 1.21 | 66.55 ± 0.63 |
+| **Pleias-350m-Preview** | -0.95 ± 1.46 | -1.84 ± 1.75 | 10.61 ± 2.87 | 12.86 ± 1.78 | 0.66 ± 2.63 | 4.59 ± 2.31 | 11.63 ± 0.88 | -0.26 ± 0.73 | 56.28 ± 1.47 |
+| **Pleias-1.2b-Preview** | 0.17 ± 1.13 | 0.66 ± 1.01 | 27.70 ± 2.89 | 27.30 ± 2.18 | -0.61 ± 1.89 | 8.60 ± 3.24 | 35.20 ± 1.25 | -0.04 ± 1.48 | 60.34 ± 0.86 |
+
+| Model | scala-en (MCC) | sst5 (MCC) | conll-en (Micro F1 no misc) | life-in-the-uk (MCC) | squad (F1) | hellaswag (MCC) | cnn-dailymail (BERTScore) |
+| ------------------------ | -------------- | ------------ | --------------------------- | -------------------- | ------------ | --------------- | ------------------------- |
+| **comma-v0.1-2t** | 29.74 ± 1.94 | 61.75 ± 2.08 | 57.54 ± 2.76 | 41.60 ± 2.41 | 90.38 ± 0.35 | 16.83 ± 0.63 | 63.33 ± 0.94 |
+| **munin-7b-open-stage1** | 27.46 ± 2.13 | 60.01 ± 1.69 | 56.63 ± 2.14 | 40.45 ± 1.74 | 22.10 ± 0.67 | 13.66 ± 0.70 | 59.16 ± 1.40 |
+| **munin-7b-open-stage2** | 27.65 ± 2.04 | 59.49 ± 1.59 | 56.61 ± 2.31 | 41.16 ± 1.73 | 22.29 ± 1.49 | 15.95 ± 0.90 | 60.22 ± 1.58 |
+| **munin-7b-open-stage3** | 29.00 ± 2.41 | 60.30 ± 1.40 | 56.96 ± 2.49 | 41.71 ± 1.78 | 24.62 ± 2.29 | 13.76 ± 0.87 | 58.98 ± 1.70 |
+| **Pleias-350m-Preview** | 0.71 ± 1.75 | 15.41 ± 7.34 | 31.76 ± 3.48 | -0.70 ± 2.11 | 31.07 ± 2.31 | 0.22 ± 1.35 | 53.80 ± 1.04 |
+| **Pleias-1.2b-Preview** | 0.99 ± 2.37 | 48.23 ± 2.58 | 40.86 ± 3.28 | 2.55 ± 2.75 | 52.90 ± 2.48 | -0.06 ± 1.50 | 60.15 ± 1.59 |
 
 
 The following plots show, for Danish and English respectively, model size on the x-axis and an aggregate performance score on the y-axis. Each metric is normalized across all evaluated models using min-max normalization to the range [0, 1], and the final score represents the average of all normalized metrics.
 
 | Danish | English |
 |:--------------------------:|:--------------------------:|
-| ![Model Performance Plot DA](https://i.postimg.cc/Ls8kVQv3/model-performance-plot-da.png) | ![Model Performance Plot EN](https://i.postimg.cc/TPNVNj46/model-performance-plot-en.png) |
+| <img src="./images/performance_plot_da.png" width="600"/> | <img src="./images/performance_plot_en.png" width="600"/> |
 
 
 ## Limitations
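
The aggregate score described in the diff (min-max normalization of each metric across models, then a per-model average of the normalized metrics) can be sketched as below. The function name and dict layout are illustrative assumptions, not taken from the repository:

```python
def aggregate_score(scores_by_metric):
    """Hypothetical sketch of the aggregation described in the README.

    scores_by_metric: {metric_name: {model_name: raw_score}}
    Returns {model_name: mean of min-max-normalized metric scores}.
    """
    models = set()
    normalized = {}
    for metric, by_model in scores_by_metric.items():
        lo, hi = min(by_model.values()), max(by_model.values())
        span = (hi - lo) or 1.0  # guard: all models tied on this metric
        # Rescale this metric to [0, 1] across all evaluated models.
        normalized[metric] = {m: (s - lo) / span for m, s in by_model.items()}
        models.update(by_model)
    # Final score per model: average of its normalized metric scores.
    return {
        m: sum(per_model[m] for per_model in normalized.values()) / len(normalized)
        for m in models
    }
```

Note that because the normalization range depends on which models are included, the aggregate score is relative to the evaluated model pool, not an absolute quality measure.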