We need an evaluation framework to test how well the system works for different types of tasks. We should either develop such a framework or adopt an existing one that supports the following (a rough data-model sketch follows the list):
- List(s) of tasks/prompts for different types of tasks
- Example content to index as context for these tasks
- Storing the LLM's answers
- Automated evaluation of the answers by another LLM
- Manual evaluation by a human
- Storing evaluation results
- Generating visualizations of how well the LLM performed on the different task types
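
As a starting point, the sketch below shows one possible data model covering these requirements: tasks with indexed context, stored answers, evaluation results from both an LLM judge and a human, and a grouping helper that could feed a per-task-type visualization. All class and field names here are assumptions for illustration, not a decided design.

```python
# Minimal sketch of a possible evaluation data model (names are illustrative).
from dataclasses import dataclass, field
from enum import Enum


class TaskType(Enum):
    SUMMARIZATION = "summarization"
    QA = "qa"
    CODE = "code"


@dataclass
class Task:
    task_id: str
    task_type: TaskType
    prompt: str
    # Example content that would be indexed as context for this task
    context_documents: list[str] = field(default_factory=list)


@dataclass
class Answer:
    task_id: str
    model_name: str
    text: str


@dataclass
class Evaluation:
    task_id: str
    evaluator: str   # e.g. "llm-judge" or "human:<name>"
    score: float     # normalized score in [0.0, 1.0]
    rationale: str = ""


@dataclass
class EvaluationRun:
    tasks: dict[str, Task] = field(default_factory=dict)
    answers: list[Answer] = field(default_factory=list)
    evaluations: list[Evaluation] = field(default_factory=list)

    def scores_by_task_type(self) -> dict[TaskType, list[float]]:
        """Group scores by task type, e.g. as input for a bar chart."""
        grouped: dict[TaskType, list[float]] = {}
        for ev in self.evaluations:
            task = self.tasks[ev.task_id]
            grouped.setdefault(task.task_type, []).append(ev.score)
        return grouped
```

Whether results are stored as plain files, a database, or via an existing eval framework is still open; the point of the sketch is only that answers and evaluations reference tasks by ID, so automated and manual evaluations can be compared per task type.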