Some Sections are Too Long for LLMs #1558

npmccallum · 2023-11-28T21:01:42Z

Some sections of the texts are too long to fit in the context window of LLMs. Here's a list of the sections (the encodings of Plutarch are particularly bad offenders):

The worst, by far, is this one:

http://www.perseus.tufts.edu/hopper/text?doc=Perseus%3Atext%3A2008.01.0301%3Asection%3D22

Would it be possible to break up these sections into smaller sizes?

lcerrato · 2023-11-29T18:00:50Z

@npmccallum
The Isocrates appears incorrect to me (sections were not encoded), but this was very early conversion work and will be revisited as part of the workflow.

For the Plutarch, smaller sections (unless there are errors lurking in here) would require another level of CTS structure to be imposed on the texts. That would be an arbitrary imposition on the standard structure. (I also do not believe another layer is possible for works that already have 3 levels.)

The plain text versions of the texts might be a better option, depending on the type of work being done. I know others have chunked the text in post-processing as needed using those versions.
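The post-processing chunking mentioned above could look something like the following. This is only a minimal sketch, not the method anyone in this thread actually used: the function names `chunk_text` and `split_long` and the `max_chars` parameter are made up for illustration, and character count is assumed as a rough stand-in for an LLM's token limit (Greek text in particular may use many more tokens per character).

```python
import re


def split_long(unit: str, max_chars: int) -> list[str]:
    """Split one over-long sentence on whitespace into smaller pieces.

    A single word longer than max_chars is kept whole (hypothetical helper).
    """
    pieces, current = [], ""
    for word in unit.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack sentences into chunks of roughly max_chars characters.

    max_chars is an assumed proxy for a model's context limit, not a token count.
    """
    # Break after sentence-final punctuation, including the Greek
    # question mark (;) and ano teleia (·), when followed by whitespace.
    sentences = re.split(r"(?<=[.;·?!])\s+", text.strip())

    # Sentences that are themselves too long are split on word boundaries.
    units = []
    for sentence in sentences:
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            units.append(sentence)
        else:
            units.extend(split_long(sentence, max_chars))

    # Pack consecutive units together until the next one would overflow.
    chunks, current = [], ""
    for unit in units:
        candidate = f"{current} {unit}" if current else unit
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = unit
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta. Gamma delta; epsilon zeta. Eta theta."
chunks = chunk_text(sample, max_chars=25)
```

Chunks produced this way are arbitrary spans rather than citable CTS units, so anything that needs to cite the text would still have to map each chunk back to its enclosing section.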

Since CTS referencing permits designating any span of text, so that smaller portions or subsets of a work can be referenced, we haven't been adding more containers top-down as a general practice, except for some more obscure works here and there.

This is really beyond my role and would be something others should decide.

helmadik · 2024-06-27T21:27:16Z

Just a quick comment: the section singled out as the worst offender, 0007 -107 - 22, does seem to be missing section numbers after that point in Vernardakis, but the Loeb edition does have them, e.g. 23 (https://www.loebclassics.com/view/plutarch-delays_divine_vengeance/1959/pb_LCL405.273.xml?result=1&rskey=5RUAuX) and 24 (https://www.loebclassics.com/view/plutarch-delays_divine_vengeance/1959/pb_LCL405.277.xml?result=1&rskey=5RUAuX).

npmccallum changed the title ~~Section 22 of Plutarch's De sera numinis vindicta is too long~~ Some Sections are Too Long for LLMs Nov 29, 2023
