Concept statistics #679

Draft
wants to merge 6 commits into
base: master

@efhosci
Contributor

efhosci commented Feb 6, 2025

Adds a new tab to the concept window that displays information about the concept's files: the total size of all files and the number of images, masks, and captions. It also checks how many images are paired with a caption or mask of the same name and, conversely, whether any captions or masks lack a corresponding image.

To do:

  • Currently only runs when the "refresh" button is clicked, since large datasets can take a long time to process and freeze the program. I will try to start it automatically when the window opens, with a timeout of about half a second, so smaller concepts are processed automatically but larger ones won't hold up the program.
  • Save/load stats to the concept config file so they can be retrieved more quickly, assuming no files have been added or deleted since the last run.
  • Better formatting on the tab to make labels and data clearer.
  • More statistics on image resolution, tag frequency, or other useful information. Some of this may take much longer to process, so parts may be enabled or disabled depending on user input.
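
The pairing check described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `basic_concept_stats`, the extension set, and the `-masklabel` mask naming are assumptions made for the example.

```python
from pathlib import Path

# Assumed extension set; the real project may recognise more formats.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}
CAPTION_EXT = ".txt"
# Assumption: a mask is an image file named "<image stem>-masklabel.<ext>".
MASK_SUFFIX = "-masklabel"


def basic_concept_stats(folder):
    """Count files and image/caption/mask pairings in a single directory."""
    images, masks, captions = set(), set(), set()
    total_bytes = 0
    for entry in Path(folder).iterdir():
        if not entry.is_file():
            continue
        total_bytes += entry.stat().st_size  # every file counts toward the size
        ext = entry.suffix.lower()
        if ext in IMAGE_EXTS and entry.stem.endswith(MASK_SUFFIX):
            # Store the stem of the image this mask belongs to.
            masks.add(entry.stem[: -len(MASK_SUFFIX)])
        elif ext in IMAGE_EXTS:
            images.add(entry.stem)
        elif ext == CAPTION_EXT:
            captions.add(entry.stem)
    return {
        "total_bytes": total_bytes,
        "images": len(images),
        "masks": len(masks),
        "captions": len(captions),
        "images_with_caption": len(images & captions),
        "images_with_mask": len(images & masks),
        "orphan_captions": len(captions - images),
        "orphan_masks": len(masks - images),
    }
```

Using sets of file stems keeps the pairing check to a few set operations, so this "basic" pass only needs one directory listing and one `stat` call per file.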

@Arcitec
Contributor

Arcitec commented Feb 6, 2025

I have only read your description but it sounds like you are doing heavy work in the GUI thread (which is why the app freezes).

And the proposed to-dos of making the statistics modular or auto-canceling scans after 2 seconds etc. sound like digging deeper into the wrong design.

You can convert the code to use a separate non-GUI worker thread to fix that:

https://docs.python.org/3.10/library/asyncio-task.html

https://docs.python.org/3.10/library/asyncio-task.html#running-in-threads

The worker can also communicate periodic results to the GUI to update the scan results and state in real time.
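
As a rough sketch of that idea (not code from this PR), a worker thread can push periodic results onto a `queue.Queue` that the GUI thread drains without blocking. The names `scan_worker` and `drain` and the progress interval of 100 items are made up for the example, and it uses plain `threading` rather than the asyncio helpers linked above.

```python
import queue
import threading


def scan_worker(items, results):
    """Runs off the GUI thread and posts a running count to `results`."""
    count = 0
    for _ in items:
        count += 1  # stand-in for the real per-file work
        if count % 100 == 0:
            results.put(("progress", count))
    results.put(("done", count))


def drain(results):
    """Called from the GUI thread, for example on a timer; never blocks."""
    messages = []
    while True:
        try:
            messages.append(results.get_nowait())
        except queue.Empty:
            return messages


results = queue.Queue()
worker = threading.Thread(target=scan_worker, args=(range(250), results), daemon=True)
worker.start()
```

In a Tk-style GUI the drain step would typically be scheduled repeatedly with the widget's `after` method, so that only the GUI thread ever touches the widgets.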

@efhosci
Contributor Author

efhosci commented Feb 6, 2025

Thanks! I did briefly look into ways to run this in a separate thread but it's something I have zero experience with currently. I definitely will try to implement that once I have a better understanding of that and have optimized the scanning loop and gui as much as possible.

I still will probably do something to stop the thread after a short time if it's started automatically on opening the window and not by the user, and probably add a manual "abort" button, just to make sure that it doesn't suck up all system resources without the user knowing why.

I think there is still a good reason to have separate "basic/advanced" scans for statistics. I was doing some initial testing of getting the resolution for each image, and adding that slows down the scanning process by 10-20x. It's useful info to have but it needs to be optional, and only happen when it's either going to take <1 second or when the user explicitly asks for it.

Moved the "scanning" code into a new file scripts/concept_stats.py that's called from ConceptWindow

Now writes concept_stats to concept config file, will later add a check to load any saved info to the UI if it already exists on startup

Added section to get the image width and height, and find the min/max/average pixel count of all images. This MASSIVELY slows down the process (around 10x slower) so the scanning has been separated into a "basic" and "advanced" section: "basic" only counts the number of files and their file size, "advanced" checks for image/mask and image/caption pairs and the new resolution info. Will use width/height info to check image bucketing in the future
Now loads from data saved in concept config file if it exists, otherwise will run a "basic" scan the first time. Scanning manually refreshes data, "advanced" scan will add additional info. May need to force it to refresh if some other concept parameters change, or at least indicate when loaded values may be "out of date".
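
One possible way to flag "out of date" values is to store a cheap fingerprint of the folder next to the saved stats and compare it on load. The sketch below is an assumption for illustration only: it writes a standalone JSON file rather than the real concept config, and `folder_fingerprint`, `save_stats`, `load_stats`, and the JSON layout are hypothetical names.

```python
import json
import os


def folder_fingerprint(folder):
    """Cheap staleness check: file count plus newest modification time."""
    count, newest = 0, 0.0
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                newest = max(newest, entry.stat().st_mtime)
    return {"file_count": count, "newest_mtime": newest}


def save_stats(config_path, folder, stats):
    """Store the stats together with the fingerprint they were computed from."""
    payload = {"fingerprint": folder_fingerprint(folder), "concept_stats": stats}
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(payload, f)


def load_stats(config_path, folder):
    """Return (stats, is_stale), or (None, True) if nothing is saved yet."""
    if not os.path.exists(config_path):
        return None, True
    with open(config_path, encoding="utf-8") as f:
        payload = json.load(f)
    stale = payload.get("fingerprint") != folder_fingerprint(folder)
    return payload.get("concept_stats"), stale
```

The fingerprint will not notice a file being swapped for another with an older timestamp, and the saved file should live outside the scanned folder so it does not change the fingerprint itself.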

Reorganized the layout on the GUI

Changed the way some values are initialized or reloaded
@mx
Collaborator

mx commented Feb 8, 2025

threading.Thread is pretty easy to use, especially if you want to abort/cancel things. Just make sure you stash the handle somewhere you can easily access from the abort button.
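
A minimal sketch of that pattern, assuming a hypothetical `ScanJob` class that keeps the thread handle and a stop flag together. Python threads cannot be killed from outside, so the worker has to check the flag itself; the `time.sleep` call only stands in for real per-file work.

```python
import threading
import time


class ScanJob:
    """Holds the thread handle and a stop flag so a GUI button can abort."""

    def __init__(self, items):
        self.items = items
        self.processed = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        for _ in self.items:
            if self._stop.is_set():  # cooperative cancellation point
                return
            self.processed += 1
            time.sleep(0.001)  # stand-in for per-file work

    def start(self):
        self._thread.start()

    def abort(self):
        """Ask the worker to stop and wait until it has exited."""
        self._stop.set()
        self._thread.join()

    def wait(self, timeout):
        """Wait up to `timeout` seconds; returns True if the scan finished."""
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

An abort button would call `job.abort()`, while an automatic scan started on window open could call `job.wait(0.5)` and abort if it returns False.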

@Arcitec
Contributor

Arcitec commented Feb 8, 2025

@efhosci Hmm, for the issue with slow scans when getting image size, the problem is that most ways to get the resolution involve decoding the whole image into a bitmap and then checking its size. That decoding is very slow.

But if all we want is the resolution, it is technically possible to read that data immediately from the binary image file data on disk. Which would be super fast. Usually it just involves reading two integers stored somewhere in the image file header.

The real question is if any Python libraries implement such efficient image resolution reading, and how many image formats they support. Worth investigating...

Edit: I have a super vague memory that PIL may have a mode that doesn't decode any image data until you try to read the data. If so, it may be capable of "opening" the image file and getting the resolution without decoding anything. It's a long shot and I may misremember.
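
To illustrate the header-reading idea for one format, here is a small sketch for PNG only. In a PNG file the 8-byte signature is followed by the IHDR chunk, whose first two fields are the width and height as big-endian 32-bit integers. The function name `png_size` is made up for this example; other formats such as JPEG need different parsing.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) by reading only the first 24 bytes of a PNG."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: chunk type "IHDR".
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"not a PNG file: {path}")
    # Bytes 16-23: width and height as big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

On the PIL question: Pillow documents `Image.open` as a lazy operation that identifies the file without reading the pixel data until it is needed, so reading `.size` right after opening should also avoid a full decode.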

@efhosci
Contributor Author

efhosci commented Feb 9, 2025

@Arcitec From what I could find on Google there are a few libraries that can get the image resolution a bit more quickly than PIL, but may have some issues with certain filetypes. I did test imagesize and it's a bit faster than PIL, but only like 20% faster. I think in both cases it's not fully opening the file, just getting info from the metadata, and I'd rather stick with the already installed library than install something separate. Plus if I eventually want to get some other info about the image PIL is going to be more useful.

Also part of the issue might be that I'm testing this with files stored on an HDD ZFS array, which is probably slowing down I/O speeds. I'll copy some pics to the system drive and see if it runs any faster.

Edit: Ok it runs a LOT faster from the SSD, though reading the resolution info is still much slower than reading just the filename and filesize. imagesize also seems to beat PIL a lot more there but will have to test it more extensively.

Matches images to nearest aspect ratio buckets and counts the number of each. GUI shows the smallest + second smallest nonzero buckets, and a bar chart/histogram of all buckets. Currently the aspect buckets are copied from AspectBucketing and hard-coded in the concept_stats function; will try to have it pull those values from AspectBucketing automatically to avoid issues if they change in the future
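
A rough sketch of the bucket matching described in this commit. The bucket values below are placeholders rather than the ones copied from AspectBucketing, and matching by absolute difference of height/width ratios is an assumption; the real code may compare ratios differently.

```python
from collections import Counter

# Hypothetical bucket list (height / width ratios), for illustration only.
ASPECT_BUCKETS = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0]


def nearest_bucket(width, height, buckets=ASPECT_BUCKETS):
    """Return the bucket whose ratio is closest to this image's ratio."""
    aspect = height / width
    return min(buckets, key=lambda b: abs(b - aspect))


def bucket_counts(sizes, buckets=ASPECT_BUCKETS):
    """Count images per bucket; empty buckets are kept with a count of 0."""
    counts = Counter({b: 0 for b in buckets})
    for width, height in sizes:
        counts[nearest_bucket(width, height, buckets)] += 1
    return counts


def smallest_nonzero(counts, n=2):
    """Return the n least-populated buckets that contain at least one image."""
    nonzero = [(b, c) for b, c in counts.items() if c > 0]
    return sorted(nonzero, key=lambda bc: bc[1])[:n]
```

Keeping zero-count buckets in the result makes it straightforward to draw a histogram over all buckets, while `smallest_nonzero` gives the two values shown in the GUI.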

Some additional tweaks to the GUI spacing to make values more readable and reduce blank space

Displays the exact resolution and file name for the largest/smallest images, average shows approx size of equivalent square image
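
The "equivalent square image" figure can be read as the square root of the average pixel count. A small sketch under that assumption, with a made-up function name and return layout:

```python
import math


def resolution_stats(sizes):
    """sizes: dict mapping file name -> (width, height)."""
    if not sizes:
        return None
    pixels = {name: w * h for name, (w, h) in sizes.items()}
    largest = max(pixels, key=pixels.get)
    smallest = min(pixels, key=pixels.get)
    avg_pixels = sum(pixels.values()) / len(pixels)
    # Side length of a square image with the same number of pixels.
    side = round(math.sqrt(avg_pixels))
    return {
        "largest": (largest, *sizes[largest]),
        "smallest": (smallest, *sizes[smallest]),
        "avg_pixels": avg_pixels,
        "avg_square": (side, side),
    }
```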

Add some checks to catch nonexistent/missing directory, other invalid inputs
Gets length of captions (each newline-separated section in any text file matched to an image) and records min/max/avg length based on characters, as well as word count and filenames for min/max. Displayed on GUI similar to resolution info.
Underlined labels on stats window, tweaked graph to waste less space and show bar values, reformatted "smallest aspect bucket" into more readable string
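
The caption-length statistics mentioned above (each newline-separated section of a caption file counted separately) could be computed along these lines. The function name and the returned dictionary are assumptions for illustration, not the PR's actual structure.

```python
def caption_stats(caption_files):
    """caption_files: dict mapping file name -> full text of that file."""
    entries = []  # (char_count, word_count, file_name) per caption line
    for name, text in caption_files.items():
        for line in text.splitlines():
            line = line.strip()
            if line:  # skip blank lines
                entries.append((len(line), len(line.split()), name))
    if not entries:
        return None
    longest = max(entries, key=lambda e: e[0])
    shortest = min(entries, key=lambda e: e[0])
    return {
        "count": len(entries),
        "max": longest,
        "min": shortest,
        "avg_chars": sum(e[0] for e in entries) / len(entries),
        "avg_words": sum(e[1] for e in entries) / len(entries),
    }
```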