Some Telegram chats are way too active for me to keep up with. Let's pull the last N days of messages from these chats, summarize them, and send a digest instead.
Set your keys for anything that does not have a default value in `AppConfig` (`config.py`):
- TELEGRAM_BOT_TOKEN: str
- TELEGRAM_API_HASH: str
- TELEGRAM_API_ID: str
- TELEGRAM_SESSION_STRING: str
- POE_PB_TOKEN: str
- POE_CHAT_CODE: str
These can be placed in any of the following places:
- as environment variables (eg via `export`), including as secrets that then get exposed as environment variables
- in a `conf.env` file
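
For example, a `conf.env` file could look like this (all values below are placeholders):

```
TELEGRAM_BOT_TOKEN=123456:replace-with-your-bot-token
TELEGRAM_API_HASH=replace-with-your-api-hash
TELEGRAM_API_ID=1234567
TELEGRAM_SESSION_STRING=replace-with-your-session-string
POE_PB_TOKEN=replace-with-your-poe-token
POE_CHAT_CODE=replace-with-your-chat-code
```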
Then do:

```
$ pip install -r requirements.txt
$ python telegram_digest/main.py
```
V1 can take arbitrary-length input and uses a refine-summary strategy to summarize.
- Telegram setup: use individual credentials (not a bot), so we can get the full history
- llm: leverage Poe (so we can try different llms quickly)
- summarization: implemented a `refine` strategy (see the sketch after this list)
  - splits input into batches, each having at most `max_token` tokens
  - iteratively generates a summary (refine-style)
- config loading: use `pydantic_settings.BaseSettings` to import either from environment variables (eg github secrets) or from file
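
To make the refine flow concrete, here is a minimal sketch of the loop; `ask_poe` and the prompts are hypothetical stand-ins, not the project's actual prompts or Poe client:

```python
from typing import Callable, Iterable


def refine_summarize(batches: Iterable[str], ask_poe: Callable[[str], str]) -> str:
    """Fold each batch of chat history into a running summary (refine strategy)."""
    summary = ""
    for batch in batches:
        if not summary:
            # First batch: plain summarization prompt.
            prompt = f"Summarize the following chat messages:\n\n{batch}"
        else:
            # Later batches: ask the bot to refine the existing summary.
            prompt = (
                f"Current summary:\n{summary}\n\n"
                "Refine the summary so it also covers these additional messages:\n\n"
                f"{batch}"
            )
        summary = ask_poe(prompt)
    return summary
```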
- summary quality: 2 issues
  - no metric to measure the quality of one summary vs another
  - no stability: even the same input against the same Poe bot will give different summaries
- experiment with different bots
- experiment with different thread representations
  - add "reply to ..." markers to identify replies
  - represent the reply-chains in a more structured form (eg all replies in the same chain are collected and represented together, instead of interleaved in the main thread)
- interactive: host the bot on heroku / fly.io, so I can interact with it via Telegram
- `main.py` is the entry point.
- `telegram_bot.py` handles creation of a Telegram client (`TelegramBotBuilder`), pulling history and sending messages (`TelegramBot`), and message-data munging (`TelegramMessagesParsing`).
- `llm.py` handles interfacing with Poe (sending messages, defining prompts) and has helpers for splitting the text into batches that fit into the context (`TextBatcher`).
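
The batching helper can be as simple as greedily packing message texts until a token budget is reached. The following is a simplified sketch with a crude word-count "tokenizer", not the actual `TextBatcher` implementation:

```python
from typing import Iterable, List


def naive_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def batch_texts(pieces: Iterable[str], max_tokens: int) -> List[List[str]]:
    """Greedily pack pieces into batches of at most ~max_tokens each."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for piece in pieces:
        cost = naive_token_count(piece)
        if current and current_tokens + cost > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(piece)
        current_tokens += cost
    if current:
        batches.append(current)
    return batches
```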
- Telegram interface
  - `telethon` is what you want to use
  - you can interface as your own user or as a bot:
    - my account --> bot: I thought I wanted to do it as myself, then I discovered the bots, which have a simpler api
    - bot --> myself: then I discovered bots can only see the conversation once they are added to a thread, and even then they can see only the messages sent after they were added
    - [?] myself --> bot: having a bot is nice because you can interact with it (eg passing different config arguments) and it is clearer who is doing what, see
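
As a minimal example of the user-credentials route, this is roughly how `telethon` can pull the last few days of a chat using a string session (the identifiers and chat name below are placeholders):

```python
from datetime import datetime, timedelta, timezone

from telethon.sessions import StringSession
from telethon.sync import TelegramClient

API_ID = 1234567                  # placeholder
API_HASH = "your_api_hash"        # placeholder
SESSION = "your_session_string"   # placeholder

with TelegramClient(StringSession(SESSION), API_ID, API_HASH) as client:
    cutoff = datetime.now(timezone.utc) - timedelta(days=3)
    recent = []
    # iter_messages yields newest messages first, so stop once we pass the cutoff.
    for message in client.iter_messages("some_chat_name"):
        if message.date < cutoff:
            break
        recent.append(message)
```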
- Summarization
  - strategies: langchain details a few summarization strategies: stuff it all in the prompt, map-reduce, or refine.
  - metrics: it's unclear how to measure quality: if you have a reference summary you can measure similarity to the reference, but if you don't have one, automatic metrics might not be very reliable: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00417/107833/A-Statistical-Analysis-of-Summarization-Evaluation
`pydantic_settings.BaseSettings` is very useful for loading config from environment variables and files.
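
A minimal sketch of how such a settings class can be declared (the real `AppConfig` in `config.py` may differ):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class AppConfig(BaseSettings):
    # Values are read from the environment first, falling back to conf.env.
    model_config = SettingsConfigDict(env_file="conf.env", env_file_encoding="utf-8")

    TELEGRAM_BOT_TOKEN: str
    TELEGRAM_API_HASH: str
    TELEGRAM_API_ID: str
    TELEGRAM_SESSION_STRING: str
    POE_PB_TOKEN: str
    POE_CHAT_CODE: str


config = AppConfig()  # raises a validation error if a required value is missing
```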