This issue has been created
There are 2 updates.
 
 
LLM AI Integration / LLMAI-79 Open

Improve chunking by taking newlines and headings into account

 
 

Issue created

 
Michael Hamann created this issue on 27/May/24 15:17
 
Summary: Improve chunking by taking newlines and headings into account
Issue Type: Improvement
Affects Versions: 0.4
Assignee: Unassigned
Created: 27/May/24 15:17
Priority: Major
Reporter: Michael Hamann
Description:

Chunking should try to put whole sections and paragraphs of the document into a chunk instead of splitting the content in the middle of words. If this isn't possible, chunking should at least try to split on a space character or word boundary.

A possible chunking algorithm could be to read the input line by line and do the following:

  • If adding the next line makes the chunk too big, don't add the line. If the chunk is smaller than, e.g., half of the maximum, add a part of the line (ideally splitting at least at a word boundary or at end of sentence).
  • While the last line of a chunk is a heading, remove that heading (so the next chunk starts with it) unless it makes the chunk too small.
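The two rules above could be sketched roughly as follows in Python. This is only an illustration under several assumptions, not the extension's actual implementation: sizes are counted in characters rather than tokens, a heading is any line starting with "=" or "#" markers followed by whitespace, and the names chunk_lines, split_point, is_heading and min_fraction are hypothetical.

```python
import re

# Assumption: headings look like "= Title =" or "# Title".
HEADING = re.compile(r"^\s*(?:#{1,6}|={1,6})\s")


def is_heading(line: str) -> bool:
    return bool(HEADING.match(line))


def split_point(text: str, limit: int) -> int:
    """Return a cut index <= limit, preferring a sentence end, then a space."""
    if len(text) <= limit:
        return len(text)
    window = text[:limit]
    cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
    if cut != -1:
        return cut + 2  # keep the punctuation and the space in this chunk
    cut = window.rfind(" ")
    if cut > 0:
        return cut + 1
    return limit  # no boundary found: fall back to a hard cut


def chunk_lines(text: str, max_size: int, min_fraction: float = 0.5) -> list[str]:
    min_size = int(max_size * min_fraction)
    pending = text.splitlines(keepends=True)[::-1]  # stack, next line on top
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    while pending:
        line = pending.pop()
        if size + len(line) <= max_size:
            current.append(line)
            size += len(line)
            continue
        # The line does not fit. If the chunk is still small, add part of it.
        if size < min_size or size == 0:
            cut = split_point(line, max_size - size)
            current.append(line[:cut])
            size += cut
            line = line[cut:]
        if line:
            pending.append(line)
        # Move trailing headings to the next chunk unless that would make
        # the current chunk too small.
        while (len(current) > 1 and is_heading(current[-1])
               and size - len(current[-1]) >= min_size):
            heading = current.pop()
            size -= len(heading)
            pending.append(heading)
        chunks.append("".join(current))
        current, size = [], 0
    if current:
        chunks.append("".join(current))
    return chunks
```

For example, chunk_lines(text, 45) on a paragraph followed by a heading and a long body line keeps the paragraph in the first chunk and starts the second chunk with the heading. Concatenating the returned chunks always reproduces the input, since nothing is dropped or overlapped in this sketch.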

We should also check what other projects do.

 
 

2 updates

 
Changes by Michael Hamann on 27/May/24 15:17
 
Fix Version: 0.4
Assignee: Paul Pantiru