kimoji919 / Docx2KG Public

知识图谱2期.md

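Basic approach

1. Split the document into unit paragraphs by token count (the token count is still to be decided).
2. Parse each unit paragraph into text, then pass it to the LLM together with the entities and relations extracted from the preceding text.
3. Take the response and merge its entities and relations into the running entity-relation array.
4. Assemble the result and convert it to XMind.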
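Steps 1 to 3 can be sketched as follows. This is a minimal illustration and not the repository's code: `split_by_tokens`, `extract_incrementally`, and the `(head, relation, tail)` triple shape are hypothetical names, the token counter defaults to character count as a stand-in for a real tokenizer, and the LLM call is passed in as a plain callable.

```python
from typing import Callable, Iterable, List, Set, Tuple

# Hypothetical triple shape: (head entity, relation, tail entity).
Triple = Tuple[str, str, str]


def split_by_tokens(text: str, max_tokens: int,
                    count_tokens: Callable[[str], int] = len) -> List[str]:
    """Greedily pack non-empty lines into chunks of at most max_tokens.

    count_tokens defaults to len (characters), which is only an assumption;
    swap in a real tokenizer's counter if one is available. A single line
    longer than max_tokens is kept whole as its own oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for para in (line.strip() for line in text.split("\n")):
        if not para:
            continue
        cost = count_tokens(para)
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks


def extract_incrementally(
    chunks: Iterable[str],
    extract: Callable[[str, List[Triple]], Iterable[Triple]],
) -> List[Triple]:
    """Call extract(chunk, known_triples) per chunk and merge new triples.

    extract stands in for the LLM call; it receives a copy of everything
    found so far, which is the "preceding entities and relations" context.
    """
    known: List[Triple] = []
    seen: Set[Triple] = set()
    for chunk in chunks:
        for triple in extract(chunk, list(known)):
            if triple not in seen:
                seen.add(triple)
                known.append(triple)
    return known
```

Because the whole accumulated list is passed back on every call, the context grows with the document, which is one plausible reason long inputs become harder for the model to handle.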

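To do

Integrate the contextual entities into the prompt
Write prompts that can extract other relation types
Assemble the result and convert it to XMind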
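The first two to-do items could look roughly like the prompt builder below. It is a sketch under stated assumptions: the relation names in `DEFAULT_RELATIONS`, the prompt wording, and the `max_context` cap are all hypothetical, since the note does not specify any of them.

```python
import json
from typing import Iterable, Sequence, Tuple

# Hypothetical relation vocabulary; the note does not list relation types.
DEFAULT_RELATIONS = ("is_a", "part_of", "related_to")


def build_prompt(chunk: str,
                 known_triples: Iterable[Tuple[str, str, str]],
                 relation_types: Sequence[str] = DEFAULT_RELATIONS,
                 max_context: int = 50) -> str:
    """Build an extraction prompt that carries earlier triples as context.

    Only the most recent max_context triples are included so that the
    prompt does not grow without bound on long documents.
    """
    recent = list(known_triples)[-max_context:] if max_context > 0 else []
    context = json.dumps(
        [{"head": h, "relation": r, "tail": t} for h, r, t in recent],
        ensure_ascii=False,
    )
    return "\n".join([
        "Extract entities and relations from the text below.",
        f"Allowed relation types: {', '.join(relation_types)}.",
        "Reuse entity names from the known triples when they refer to the same thing.",
        'Answer with a JSON list of objects with keys "head", "relation", "tail".',
        f"Known triples: {context}",
        "Text:",
        chunk,
    ])
```

Capping the context trades completeness for prompt size; whether the most recent triples are the right ones to keep is an open design choice.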

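Conclusion

The results are not good enough: the model risks forgetting earlier context, and extraction from long text is incomplete and too coarse.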

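Basic approach 2

Design

1. Read the table of contents.
2. Split the document into unit tasks.
3. Pass each unit to the LLM (with an optimized prompt).
4. Collect the return value (JSON, appended to a JSON list).
5. Export the knowledge graph.
6. Select the parts whose output is unsatisfactory and repeat from step 2.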
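The collect, export, and retry steps of this design can be sketched as below. All names (`parse_response`, `run_units`, `export_graph`) and the `head`/`relation`/`tail` JSON keys are assumptions made for illustration. The sketch also only treats a reply as imperfect when it fails to parse into at least one triple, whereas the note's selection of imperfect parts may well be a manual judgment.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple


def parse_response(raw: str) -> Optional[List[dict]]:
    """Pull a JSON list of triples out of an LLM reply, or return None.

    The reply may wrap the list in extra prose, so only the span from the
    first '[' to the last ']' is parsed.
    """
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    keys = {"head", "relation", "tail"}
    triples = [d for d in data if isinstance(d, dict) and keys <= d.keys()]
    return triples or None


def run_units(units: Dict[str, str], ask_llm: Callable[[str], str],
              max_rounds: int = 2) -> Tuple[Dict[str, List[dict]], List[str]]:
    """Send each unit to the LLM; re-send units whose reply did not parse."""
    results: Dict[str, List[dict]] = {}
    pending = list(units)
    for _ in range(max_rounds):
        retry = []
        for name in pending:
            parsed = parse_response(ask_llm(units[name]))
            if parsed is None:
                retry.append(name)
            else:
                results[name] = parsed
        pending = retry
        if not pending:
            break
    return results, pending


def export_graph(results: Dict[str, List[dict]]) -> str:
    """Serialize collected triples as a JSON document of nodes and edges."""
    nodes = sorted({str(t[k]) for ts in results.values()
                    for t in ts for k in ("head", "tail")})
    edges = [{"source": str(t["head"]), "label": str(t["relation"]),
              "target": str(t["tail"]), "unit": name}
             for name, ts in results.items() for t in ts]
    return json.dumps({"nodes": nodes, "edges": edges},
                      ensure_ascii=False, indent=2)
```

`run_units` returns the names of units that still failed after `max_rounds`, so they can be inspected or re-split by hand.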

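Difficulties encountered

1. The table of contents in a docx file is specially formatted text that the existing paragraph reader does not pick up, so it cannot be passed to the LLM through the current pipeline.

2. docx has no ready-made API for pagination, nor one for accessing a single page, so splitting by table of contents cannot easily be applied to docx files.

3. PDF is a separate system with a uniform way of reading and easy pagination, but all of the existing PDF data are scanned copies, so without OCR a simple PDF loader cannot be used.

4. With the current prompt the results are mediocre: extraction is incomplete and too coarse.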
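For the first difficulty, one possible workaround is to read the docx package directly instead of relying on a paragraph list. The sketch below uses only the standard library and rests on two assumptions that should be checked against the actual files: that the missing table-of-contents paragraphs are nested inside a content control (`w:sdt`) rather than sitting directly under the body, and that their paragraph style IDs start with `TOC` (localized Word versions may use other IDs). The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import List, Tuple

# WordprocessingML main namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_all_paragraphs(docx_file) -> List[Tuple[str, str]]:
    """Return (style_id, text) for every w:p element, however deeply nested.

    docx_file may be a path or a binary file-like object. Walking the whole
    XML tree also reaches paragraphs wrapped in content controls, which a
    reader that only looks at direct children of the body would skip.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
        parts = []
        # Only look inside runs, so tab-stop definitions in w:pPr are ignored.
        for run in p.iter(f"{{{W}}}r"):
            for node in run:
                if node.tag == f"{{{W}}}t":
                    parts.append(node.text or "")
                elif node.tag == f"{{{W}}}tab":
                    parts.append("\t")
        paragraphs.append((style_id, "".join(parts)))
    return paragraphs


def toc_entries(paragraphs: List[Tuple[str, str]]) -> List[str]:
    """Keep paragraphs whose style ID starts with 'TOC' (an assumption)."""
    return [text for style_id, text in paragraphs
            if style_id.upper().startswith("TOC") and text.strip()]
```

Each returned entry typically holds the heading text, a tab, and the page number as last rendered by Word, so the page number is a hint only and does not solve the pagination problem in the second difficulty.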

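Current practice

I first split the text manually to compare the quality of entity extraction at different text granularities.

The comparison shows that when extraction is done per chapter, the model tends to stop at extracting subheadings and a few other items; when the text is split per subheading, the granularity is too fine and too many knowledge points are extracted. We can therefore process the document chapter by chapter, analyzing and extracting from each chapter, and then improve the prompt.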
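Chapter-level splitting could be sketched as below. The input format is an assumption: a list of `(heading_level, text)` pairs where level 0 means body text, which would have to be derived from the document's heading styles. Listing subheadings separately from the body is also only one possible way to help the prompt treat them as structure instead of entities; the note does not say how the prompt will be improved.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def split_into_chapters(paragraphs: Iterable[Tuple[int, str]],
                        chapter_level: int = 1) -> List[Dict[str, object]]:
    """Group (heading_level, text) pairs into one unit per chapter.

    heading_level 0 is body text. Headings at or above chapter_level start
    a new chapter; deeper headings are recorded as subheadings of the
    current chapter and kept out of the body text.
    """
    chapters: List[Dict[str, object]] = []
    current: Optional[Dict[str, object]] = None
    for level, text in paragraphs:
        if 0 < level <= chapter_level:
            current = {"title": text, "subheadings": [], "body": []}
            chapters.append(current)
            continue
        if current is None:
            # Content before the first chapter heading (cover page, preface).
            current = {"title": "", "subheadings": [], "body": []}
            chapters.append(current)
        if level > chapter_level:
            current["subheadings"].append(text)
        else:
            current["body"].append(text)
    return chapters
```

Raising `chapter_level` to 2 would reproduce the finer per-subheading split, which makes it easy to rerun the granularity comparison on the same input.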
