This issue has been created


Open

Cache attachment context extracted by Tika


Issue created

Michael Hamann created this issue on 04/Jun/25 17:30

Summary:	Cache attachment context extracted by Tika
Issue Type:	Improvement
Affects Versions:	16.10.0
Assignee:	Unassigned
Components:	Search - Solr
Created:	04/Jun/25 17:30
Labels:	performance
Priority:	Major
Reporter:	Michael Hamann
Description:	The Solr indexer currently extracts the textual content of an attachment with Tika every time the attachment is indexed. The same attachment is usually indexed at least twice: once as part of the document entry and once as a separate attachment entry. If the document has translations, it is also indexed once for every translation. In addition, whenever a document changes, its attachments are re-parsed even if they did not change. Because parsing attachments with Tika can be slow and resource-intensive, we should introduce a cache for the extracted text. Ideally, this cache would be stored in the same store as the attachment itself. There should also be a way to clear the cache manually or automatically, e.g., on Tika upgrades.
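
To make the proposal concrete, here is a minimal, language-neutral sketch in Python of one way such a cache could behave. It is not the indexer's code and not a proposed design. Everything in it is an assumption made for illustration: the class name `ExtractedTextCache`, keying entries by a SHA-256 digest of the attachment bytes plus an extractor version string, the in-memory dict standing in for the attachment store, and `fake_extract` standing in for the Tika call.

```python
import hashlib
from typing import Callable, MutableMapping, Optional, Tuple

Key = Tuple[str, str]  # (content digest, extractor version)


class ExtractedTextCache:
    """Hypothetical cache for text extracted from attachment content.

    Entries are keyed by the content digest and the extractor version, so
    an unchanged attachment is parsed once, and entries written by an older
    extractor version are ignored after an upgrade.
    """

    def __init__(
        self,
        extract: Callable[[bytes], str],
        extractor_version: str,
        store: Optional[MutableMapping[Key, str]] = None,
    ) -> None:
        self._extract = extract
        self._version = extractor_version
        # Assumption: a plain dict stands in for a persistent store.
        self._store = store if store is not None else {}

    def get_text(self, content: bytes) -> str:
        key = (hashlib.sha256(content).hexdigest(), self._version)
        if key not in self._store:
            # Cache miss: run the (slow) extractor once and remember the result.
            self._store[key] = self._extract(content)
        return self._store[key]

    def clear(self) -> None:
        """Manual invalidation: drop every cached entry."""
        self._store.clear()


calls = []  # records every real extraction, to show how often parsing happens


def fake_extract(content: bytes) -> str:
    # Stand-in for the real parser; not a Tika API.
    calls.append(content)
    return content.decode("utf-8").upper()


shared_store = {}
cache = ExtractedTextCache(fake_extract, "1", shared_store)

# Document entry, attachment entry, and one translation index the same bytes.
texts = [cache.get_text(b"report body") for _ in range(3)]
calls_after_indexing = len(calls)

# Simulated extractor upgrade: same store, new version string.
upgraded = ExtractedTextCache(fake_extract, "2", shared_store)
upgraded.get_text(b"report body")
calls_after_upgrade = len(calls)
```

In this sketch, the three index passes trigger a single extraction, while the simulated upgrade triggers a second one because the version string is part of the key. Old entries then remain in the store until `clear()` is called, so a real implementation would still need to decide when to purge them.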
