Branch: refs/heads/master
Home: https://github.com/xwiki-contrib/ai-llm-benchmark
Commit: 3fe4cc6a5eed04aa29c7b6f31c5855a95e2e7df0
https://github.com/xwiki-contrib/ai-llm-benchmark/commit/3fe4cc6a5eed04aa29…
Author: Paul Pantiru <paul.pantiru(a)xwiki.com>
Date: 2024-05-17 (Fri, 17 May 2024)
Changed paths:
M .gitignore
M Snakefile
A config.json
M environment.yml
M input/input.json
R request.json
R scripts/calculate_scores.py
A scripts/calculate_scores_broken.py
R scripts/collect_data.py
A scripts/collect_model_responses.py
A scripts/deepeval_model.py
M scripts/split_input_to_files.py
A scripts/test_eval_summary.py
A scripts/waise_model.py
Log Message:
-----------
LLMAI-61: Implement an evaluation framework
* Re-organize input structure
* Add snakemake pipeline
* Update config file
* Add conda environment dependency file
* Modify input.json structure
* Add waise and ollama model connection for DeepEval
* Initial summary test script
Commit: 72fb0f3016a94bc53ceae0763507f886ca07e3e7
https://github.com/xwiki-contrib/ai-llm-benchmark/commit/72fb0f3016a94bc53c…
Author: Paul Pantiru <paul.pantiru(a)xwiki.com>
Date: 2024-05-17 (Fri, 17 May 2024)
Changed paths:
M context_data/documents/dev_Community.SupportStrategy.DatabaseSupportStrategy.json
M context_data/documents/extensions_Extension.Administration Application.json
A context_data/documents/extensions_Extension.Attachment.Validation.UI.WebHome.json
M context_data/documents/extensions_Extension.LDAP.Authenticator.WebHome.json
M context_data/documents/extensions_Extension.LLM Application.Authenticator.WebHome.json
M context_data/documents/extensions_Extension.LLM Application.Index for the LLM Application.WebHome.json
M context_data/documents/extensions_Extension.Model.Validation.UI.WebHome.json
M context_data/documents/extensions_Extension.Notifications Application.WebHome.json
M context_data/documents/extensions_Extension.Office Importer Application.json
M context_data/documents/extensions_Extension.OpenID Connect.OpenID Connect Authenticator.WebHome.json
M context_data/documents/urls.txt
M context_data/documents/xwiki_Documentation.AdminGuide.Access Rights.Permission types.WebHome.json
M context_data/documents/xwiki_Documentation.AdminGuide.Access Rights.WebHome.json
A context_data/documents/xwiki_Documentation.AdminGuide.Attachments.json
M context_data/documents/xwiki_Documentation.AdminGuide.Authentication.WebHome.json
M context_data/documents/xwiki_Documentation.AdminGuide.Configuration.WebHome.json
M context_data/documents/xwiki_Documentation.AdminGuide.Upgrade.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Applications.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Attachments.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Authentication.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.ContentOrganization.NestedPagesMigration.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.ContentOrganization.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.DatabaseSupport.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.DistributionWizard.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.DocumentLifecycle.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Exports.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Forms.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.I18N.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Imports.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.KeyboardShortcuts.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Navigate.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Notifications.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.PageEditing.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Programming.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.RSS.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.RightsManagement.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.ScalabilityPerformance.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.SecondGenerationWiki.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.Skins.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.UsersAndGroupsManagement.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.VersionControl.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.XWikiRESTfulAPI.json
M context_data/documents/xwiki_Documentation.UserGuide.Features.XWikiSyntax.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.ChangingTheLogoAndThePanels.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.CreatingABasicApp.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.CreatingAPage.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.CreatingNewUsers.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.EditingAPage.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.FirstStepsWithXWiki.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.GoingFurther.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.PageHistory.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.SettingUserRights.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.WhatIsAWiki.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.WhatsSpecialAboutXWiki.WebHome.json
M context_data/documents/xwiki_Documentation.UserGuide.GettingStarted.XWikiBasicConcepts.json
M input/input.json
M scripts/download_wiki_page.py
Log Message:
-----------
[misc] merge
Commit: 92bff7b0bcbcb03a75befb85ac8bc48914707f02
https://github.com/xwiki-contrib/ai-llm-benchmark/commit/92bff7b0bcbcb03a75…
Author: Paul Pantiru <paul.pantiru(a)xwiki.com>
Date: 2024-05-17 (Fri, 17 May 2024)
Changed paths:
M Snakefile
M config.json
M scripts/collect_model_responses.py
Log Message:
-----------
LLMAI-61: Implement an evaluation framework
* Capture output of different models separately
Commit: 108e3618e9b5d05c3e971b3632b8d78aa1072109
https://github.com/xwiki-contrib/ai-llm-benchmark/commit/108e3618e9b5d05c3e…
Author: Paul Pantiru <paul.pantiru(a)xwiki.com>
Date: 2024-05-18 (Sat, 18 May 2024)
Changed paths:
M .gitignore
M Snakefile
M config.json
M environment.yml
R scripts/calculate_scores_broken.py
R scripts/collect_model_responses.py
A scripts/context_gathering/download_wiki_page.py
A scripts/context_indexing/index_data.py
R scripts/create_plots.py
R scripts/deepeval_model.py
R scripts/download_wiki_page.py
A scripts/evaluation_scripts/calculate_scores_broken.py
A scripts/evaluation_scripts/test_eval_summary.py
R scripts/index_data.py
A scripts/input_data_preparation/split_input_to_files.py
A scripts/models_connections/deepeval_model.py
A scripts/models_connections/waise_model.py
A scripts/output_generation/collect_model_responses.py
A scripts/results_visualization/create_plots.py
R scripts/split_input_to_files.py
R scripts/test_eval_summary.py
R scripts/waise_model.py
Log Message:
-----------
LLMAI-61: Implement an evaluation framework
* organize scripts folder
* add summarization evaluation metric
* make evaluator model configurable
* update .gitignore
* update environment.yml with the required versions
Commit: d01706c95773e46ea59a1e2e3bc248e2c9d26a8e
https://github.com/xwiki-contrib/ai-llm-benchmark/commit/d01706c95773e46ea5…
Author: Paul Pantiru <paul.pantiru(a)xwiki.com>
Date: 2024-05-19 (Sun, 19 May 2024)
Changed paths:
M Snakefile
M environment.yml
A scripts/evaluation_scripts/eval_rag_qa.py
A scripts/evaluation_scripts/eval_summary.py
A scripts/evaluation_scripts/eval_text_generation.py
R scripts/evaluation_scripts/test_eval_summary.py
M scripts/output_generation/collect_model_responses.py
Log Message:
-----------
LLMAI-61: Implement an evaluation framework
* Add text generation evaluation script
* Add rag qa evaluation script
* Updated Snakefile
Commit: da507c607a96d2d48f739507c23c3b90cafc86a4
https://github.com/xwiki-contrib/ai-llm-benchmark/commit/da507c607a96d2d48f…
Author: Paul Pantiru <paul.pantiru(a)xwiki.com>
Date: 2024-05-20 (Mon, 20 May 2024)
Changed paths:
M Snakefile
M config.json
A dag.png
R drafts/eval_draft.py
M environment.yml
A evaluation_results/RAG-qa/.snakemake_timestamp
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_001_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_002_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_003_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_004_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_005_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_006_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_007_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_008_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_009_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_010_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_011_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_012_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_013_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_014_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_015_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_016_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_017_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_018_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_019_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_020_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_021_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_022_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_023_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_024_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_025_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_026_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_027_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_028_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_029_result.json
A evaluation_results/RAG-qa/AI.Models.waise-gpt-4o/qa_030_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_001_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_002_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_003_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_004_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_005_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_006_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_007_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_008_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_009_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_010_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_011_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_012_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_013_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_014_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_015_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_016_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_017_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_018_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_019_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_020_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_021_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_022_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_023_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_024_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_025_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_026_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_027_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_028_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_029_result.json
A evaluation_results/RAG-qa/AI.Models.waise-mixtral/qa_030_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_001_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_002_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_003_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_004_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_005_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_006_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_007_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_008_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_009_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_010_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_011_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_012_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_013_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_014_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_015_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_016_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_017_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_018_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_019_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_020_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_021_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_022_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_023_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_024_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_025_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_026_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_027_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_028_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_029_result.json
A evaluation_results/RAG-qa/AI.Models.waise_gpt3_5_turbo/qa_030_result.json
A evaluation_results/summarization/.snakemake_timestamp
A evaluation_results/summarization/AI.Models.GPT-4o/summ_001_result.json
A evaluation_results/summarization/AI.Models.GPT-4o/summ_002_result.json
A evaluation_results/summarization/AI.Models.GPT-4o/summ_003_result.json
A evaluation_results/summarization/AI.Models.GPT-4o/summ_004_result.json
A evaluation_results/summarization/AI.Models.GPT-4o/summ_005_result.json
A evaluation_results/summarization/AI.Models.GPT-4o/summ_006_result.json
A evaluation_results/summarization/AI.Models.GPT-4o/summ_007_result.json
A evaluation_results/summarization/AI.Models.GPT-4o/summ_008_result.json
A evaluation_results/summarization/AI.Models.GPT-4o/summ_009_result.json
A evaluation_results/summarization/AI.Models.GPT-4o/summ_010_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_001_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_002_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_003_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_004_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_005_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_006_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_007_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_008_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_009_result.json
A evaluation_results/summarization/AI.Models.gpt3_5_turbo/summ_010_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_001_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_002_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_003_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_004_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_005_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_006_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_007_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_008_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_009_result.json
A evaluation_results/summarization/AI.Models.mixtral/summ_010_result.json
A evaluation_results/text_generation/.snakemake_timestamp
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_001_result.json
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_002_result.json
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_003_result.json
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_004_result.json
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_005_result.json
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_006_result.json
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_007_result.json
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_008_result.json
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_009_result.json
A evaluation_results/text_generation/AI.Models.GPT-4o/text_gen_010_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_001_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_002_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_003_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_004_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_005_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_006_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_007_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_008_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_009_result.json
A evaluation_results/text_generation/AI.Models.gpt3_5_turbo/text_gen_010_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_001_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_002_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_003_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_004_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_005_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_006_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_007_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_008_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_009_result.json
A evaluation_results/text_generation/AI.Models.mixtral/text_gen_010_result.json
A evaluation_results_graphcs/.snakemake_timestamp
A evaluation_results_graphcs/RAG-qa_AnswerRelevancy_bar_chart.png
A evaluation_results_graphcs/RAG-qa_AnswerRelevancy_box_plot.png
A evaluation_results_graphcs/RAG-qa_ContextualPrecision_bar_chart.png
A evaluation_results_graphcs/RAG-qa_ContextualPrecision_box_plot.png
A evaluation_results_graphcs/RAG-qa_ContextualRecall_bar_chart.png
A evaluation_results_graphcs/RAG-qa_ContextualRecall_box_plot.png
A evaluation_results_graphcs/RAG-qa_Faithfulness_bar_chart.png
A evaluation_results_graphcs/RAG-qa_Faithfulness_box_plot.png
A evaluation_results_graphcs/RAG-qa_grouped_bar_chart.png
A evaluation_results_graphcs/summarization_grouped_bar_chart.png
A evaluation_results_graphcs/summarization_score_bar_chart.png
A evaluation_results_graphcs/summarization_score_box_plot.png
A evaluation_results_graphcs/text_generation_grouped_bar_chart.png
A evaluation_results_graphcs/text_generation_score_bar_chart.png
A evaluation_results_graphcs/text_generation_score_box_plot.png
M example.env
M input/input.json
A input/tasks/.snakemake_timestamp
A input/tasks/RAG-qa/qa_001.json
A input/tasks/RAG-qa/qa_002.json
A input/tasks/RAG-qa/qa_003.json
A input/tasks/RAG-qa/qa_004.json
A input/tasks/RAG-qa/qa_005.json
A input/tasks/RAG-qa/qa_006.json
A input/tasks/RAG-qa/qa_007.json
A input/tasks/RAG-qa/qa_008.json
A input/tasks/RAG-qa/qa_009.json
A input/tasks/RAG-qa/qa_010.json
A input/tasks/RAG-qa/qa_011.json
A input/tasks/RAG-qa/qa_012.json
A input/tasks/RAG-qa/qa_013.json
A input/tasks/RAG-qa/qa_014.json
A input/tasks/RAG-qa/qa_015.json
A input/tasks/RAG-qa/qa_016.json
A input/tasks/RAG-qa/qa_017.json
A input/tasks/RAG-qa/qa_018.json
A input/tasks/RAG-qa/qa_019.json
A input/tasks/RAG-qa/qa_020.json
A input/tasks/RAG-qa/qa_021.json
A input/tasks/RAG-qa/qa_022.json
A input/tasks/RAG-qa/qa_023.json
A input/tasks/RAG-qa/qa_024.json
A input/tasks/RAG-qa/qa_025.json
A input/tasks/RAG-qa/qa_026.json
A input/tasks/RAG-qa/qa_027.json
A input/tasks/RAG-qa/qa_028.json
A input/tasks/RAG-qa/qa_029.json
A input/tasks/RAG-qa/qa_030.json
A input/tasks/summarization/summ_001.json
A input/tasks/summarization/summ_002.json
A input/tasks/summarization/summ_003.json
A input/tasks/summarization/summ_004.json
A input/tasks/summarization/summ_005.json
A input/tasks/summarization/summ_006.json
A input/tasks/summarization/summ_007.json
A input/tasks/summarization/summ_008.json
A input/tasks/summarization/summ_009.json
A input/tasks/summarization/summ_010.json
A input/tasks/text_generation/text_gen_001.json
A input/tasks/text_generation/text_gen_002.json
A input/tasks/text_generation/text_gen_003.json
A input/tasks/text_generation/text_gen_004.json
A input/tasks/text_generation/text_gen_005.json
A input/tasks/text_generation/text_gen_006.json
A input/tasks/text_generation/text_gen_007.json
A input/tasks/text_generation/text_gen_008.json
A input/tasks/text_generation/text_gen_009.json
A input/tasks/text_generation/text_gen_010.json
A output/.snakemake_timestamp
A output/AI.Models.GPT-4o/tasks/summarization/summ_001.json
A output/AI.Models.GPT-4o/tasks/summarization/summ_002.json
A output/AI.Models.GPT-4o/tasks/summarization/summ_003.json
A output/AI.Models.GPT-4o/tasks/summarization/summ_004.json
A output/AI.Models.GPT-4o/tasks/summarization/summ_005.json
A output/AI.Models.GPT-4o/tasks/summarization/summ_006.json
A output/AI.Models.GPT-4o/tasks/summarization/summ_007.json
A output/AI.Models.GPT-4o/tasks/summarization/summ_008.json
A output/AI.Models.GPT-4o/tasks/summarization/summ_009.json
A output/AI.Models.GPT-4o/tasks/summarization/summ_010.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_001.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_002.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_003.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_004.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_005.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_006.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_007.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_008.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_009.json
A output/AI.Models.GPT-4o/tasks/text_generation/text_gen_010.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_001.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_002.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_003.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_004.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_005.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_006.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_007.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_008.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_009.json
A output/AI.Models.gpt3_5_turbo/tasks/summarization/summ_010.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_001.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_002.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_003.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_004.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_005.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_006.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_007.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_008.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_009.json
A output/AI.Models.gpt3_5_turbo/tasks/text_generation/text_gen_010.json
A output/AI.Models.mixtral/tasks/summarization/summ_001.json
A output/AI.Models.mixtral/tasks/summarization/summ_002.json
A output/AI.Models.mixtral/tasks/summarization/summ_003.json
A output/AI.Models.mixtral/tasks/summarization/summ_004.json
A output/AI.Models.mixtral/tasks/summarization/summ_005.json
A output/AI.Models.mixtral/tasks/summarization/summ_006.json
A output/AI.Models.mixtral/tasks/summarization/summ_007.json
A output/AI.Models.mixtral/tasks/summarization/summ_008.json
A output/AI.Models.mixtral/tasks/summarization/summ_009.json
A output/AI.Models.mixtral/tasks/summarization/summ_010.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_001.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_002.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_003.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_004.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_005.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_006.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_007.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_008.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_009.json
A output/AI.Models.mixtral/tasks/text_generation/text_gen_010.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_001.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_002.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_003.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_004.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_005.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_006.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_007.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_008.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_009.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_010.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_011.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_012.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_013.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_014.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_015.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_016.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_017.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_018.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_019.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_020.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_021.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_022.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_023.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_024.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_025.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_026.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_027.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_028.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_029.json
A output/AI.Models.waise-gpt-4o/tasks/RAG-qa/qa_030.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_001.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_002.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_003.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_004.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_005.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_006.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_007.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_008.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_009.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_010.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_011.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_012.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_013.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_014.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_015.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_016.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_017.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_018.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_019.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_020.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_021.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_022.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_023.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_024.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_025.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_026.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_027.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_028.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_029.json
A output/AI.Models.waise-mixtral/tasks/RAG-qa/qa_030.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_001.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_002.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_003.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_004.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_005.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_006.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_007.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_008.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_009.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_010.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_011.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_012.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_013.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_014.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_015.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_016.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_017.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_018.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_019.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_020.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_021.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_022.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_023.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_024.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_025.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_026.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_027.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_028.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_029.json
A output/AI.Models.waise_gpt3_5_turbo/tasks/RAG-qa/qa_030.json
M scripts/evaluation_scripts/eval_rag_qa.py
M scripts/evaluation_scripts/eval_summary.py
M scripts/evaluation_scripts/eval_text_generation.py
A scripts/evaluation_scripts/evaluation_utils.py
M scripts/output_generation/collect_model_responses.py
M scripts/results_visualization/create_plots.py
A snakeout/indexed/.snakemake_timestamp
Log Message:
-----------
LLMAI-61: Implement an evaluation framework
* Add script for creating plots from the evaluation results
* Remove draft scripts
* Updated input.json and config.json to evaluate 3 models: gpt-4o, gpt-3.5-turbo and mixtral-7b
* Ran the full pipeline for gpt-4o, gpt-3.5-turbo and mixtral-7b and added the results
* Generated DAG (directed acyclic graph) - dag.png
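
The changed paths in this commit suggest a naming convention: a model response stored at output/<model>/tasks/<task>/<item>.json appears to have its evaluation stored at evaluation_results/<task>/<model>/<item>_result.json. The sketch below illustrates that mapping only. The function name result_path_for is hypothetical and is not taken from the repository; the actual Snakefile rules may derive these paths differently.

```python
from pathlib import PurePosixPath


def result_path_for(output_path: str) -> str:
    """Map a model-response path to the evaluation-result path implied by
    the file listing above (hypothetical helper, not the repository's code).

    output/<model>/tasks/<task>/<item>.json
      -> evaluation_results/<task>/<model>/<item>_result.json
    """
    parts = PurePosixPath(output_path).parts
    # Expect exactly: "output", <model>, "tasks", <task>, <item>.json
    if len(parts) != 5 or parts[0] != "output" or parts[2] != "tasks":
        raise ValueError(f"unexpected output path layout: {output_path}")
    _, model, _, task, filename = parts
    # Strip only the final ".json" suffix; model names such as
    # "AI.Models.GPT-4o" contain dots but are directory parts, so they
    # are left untouched.
    item = PurePosixPath(filename).stem
    return str(PurePosixPath("evaluation_results", task, model, f"{item}_result.json"))


if __name__ == "__main__":
    print(result_path_for("output/AI.Models.GPT-4o/tasks/summarization/summ_001.json"))
```

Under this assumed layout, each task family (RAG-qa, summarization, text_generation) gets one results directory per model, which would make per-model comparisons in the plotting step a matter of grouping by the second path component.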
Compare: https://github.com/xwiki-contrib/ai-llm-benchmark/compare/b0005e4d217c...da…