The benchmark we created needs to be executed and its results compiled. This involves the following steps:
- Select a list of LLMs to evaluate
- Execute the benchmark
- Check the results to confirm that the automated evaluation scores answers correctly
- Document the results
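
The steps above could be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the task format (`prompt`/`expected`), the model callables, and the helper names `run_benchmark`, `sample_for_review`, and `results_table` are all hypothetical and would need to be adapted to the real benchmark.

```python
import random
from typing import Callable, Dict, List

# Hypothetical interfaces: a model is any callable mapping a prompt to an answer,
# and the automated evaluator maps (answer, expected) to pass/fail.
Model = Callable[[str], str]
Evaluator = Callable[[str, str], bool]


def run_benchmark(models: Dict[str, Model], tasks: List[dict],
                  auto_eval: Evaluator) -> List[dict]:
    """Run every task against every selected model and score it automatically."""
    records = []
    for name, ask in models.items():
        for index, task in enumerate(tasks):
            answer = ask(task["prompt"])
            records.append({
                "model": name,
                "task": index,
                "answer": answer,
                "passed": auto_eval(answer, task["expected"]),
            })
    return records


def sample_for_review(records: List[dict], k: int, seed: int = 0) -> List[dict]:
    """Draw a reproducible random sample to compare against manual judgement."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def results_table(records: List[dict]) -> str:
    """Summarise pass counts per model as a Markdown table."""
    totals: Dict[str, List[int]] = {}
    for record in records:
        passed_and_count = totals.setdefault(record["model"], [0, 0])
        passed_and_count[0] += int(record["passed"])
        passed_and_count[1] += 1
    lines = ["| Model | Passed | Total |", "| --- | --- | --- |"]
    for model, (passed, count) in totals.items():
        lines.append(f"| {model} | {passed} | {count} |")
    return "\n".join(lines)
```

The sample returned by `sample_for_review` is meant to be inspected by hand; if manual judgement often disagrees with the `passed` flag, the automated evaluation likely needs adjusting before the results are documented.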