Concept statistics #679

efhosci · 2025-02-06T04:47:28Z

Add new tab in concept window which will display information about the file contents of the concept. Will show the total size of all files and the number of images, masks, and captions. Will also check how many images are paired with a caption/mask of the same name, and conversely if there are any captions/masks which lack a corresponding image.
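As a rough illustration of the pairing check described above, here is a minimal sketch that works on a plain list of file names. The extension set and the `-masklabel` mask suffix are assumptions made for the example, not values confirmed anywhere in this PR.

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}  # assumed set of image extensions
MASK_SUFFIX = "-masklabel"  # assumed mask naming convention


def pair_stats(filenames):
    """Count images, masks, and captions, and how they pair up by base name.

    Names are collected in sets, so two images that differ only by
    extension count once in this sketch.
    """
    images, masks, captions = set(), set(), set()
    for name in filenames:
        p = Path(name)
        ext = p.suffix.lower()
        if ext == ".txt":
            captions.add(p.stem)
        elif ext in IMAGE_EXTS:
            if p.stem.endswith(MASK_SUFFIX):
                # Strip the suffix so the mask can be matched to its image.
                masks.add(p.stem[: -len(MASK_SUFFIX)])
            else:
                images.add(p.stem)
    return {
        "images": len(images),
        "masks": len(masks),
        "captions": len(captions),
        "images_with_caption": len(images & captions),
        "images_with_mask": len(images & masks),
        "orphan_captions": sorted(captions - images),
        "orphan_masks": sorted(masks - images),
    }
```

Set intersections give the paired counts and set differences give the captions/masks without a matching image.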

To do:

- currently only runs when "refresh" button is clicked, as large datasets can take a long time to process and freeze the program. Will try to have it start running on opening the window with a timeout of ~1/2 second, so smaller concepts can be processed automatically but larger ones won't hold the program
- save/load stats to the concept config file so they can be retrieved more quickly, assuming no files have been added/deleted since it was last run
- better formatting on the tab to make labels/data more clear
- more statistics on image resolution, tag frequency, or other useful information. Some of this may take a lot longer to process so parts may be enabled/disabled depending on user input
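The timeout idea in the first to-do item could look roughly like the sketch below: a flat directory scan that stops once a time budget is spent and reports whether it finished. The function name and the returned keys are hypothetical.

```python
import os
import time


def scan_with_budget(directory, budget_seconds=0.5):
    """Count files and total bytes, giving up once the time budget is spent."""
    deadline = time.monotonic() + budget_seconds
    count, total_bytes = 0, 0
    with os.scandir(directory) as entries:
        for entry in entries:
            if time.monotonic() > deadline:
                # Out of time: return the partial result and mark it incomplete.
                return {"files": count, "bytes": total_bytes, "complete": False}
            if entry.is_file():
                count += 1
                total_bytes += entry.stat().st_size
    return {"files": count, "bytes": total_bytes, "complete": True}
```

A caller could show the partial numbers with a "scan incomplete" hint when `complete` is false, and leave the full scan to the refresh button.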

Arcitec · 2025-02-06T16:05:00Z

I have only read your description but it sounds like you are doing heavy work in the GUI thread (which is why the app freezes).

And the proposed to-dos of making the statistics modular or auto-canceling scans after 2 seconds etc. sound like digging deeper into the wrong design.

You can convert the code to use a separate non-GUI worker thread to fix that:

https://docs.python.org/3.10/library/asyncio-task.html

https://docs.python.org/3.10/library/asyncio-task.html#running-in-threads

The worker can also communicate periodic results to the GUI to update the scan results and state in real time.
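One possible shape for the worker idea, sketched with `threading` and `queue.Queue` from the standard library rather than the asyncio helpers linked above. The message tuples and function names are made up for the example, and the per-file work is a stand-in.

```python
import queue
import threading


def scan_worker(items, results):
    """Runs off the GUI thread and posts progress to a thread-safe queue."""
    total = 0
    for i, size in enumerate(items, start=1):
        total += size  # stand-in for the real per-file work
        results.put(("progress", i, total))
    results.put(("done", len(items), total))


def drain(results):
    """What a GUI timer callback might do: take pending messages without blocking."""
    messages = []
    while True:
        try:
            messages.append(results.get_nowait())
        except queue.Empty:
            return messages
```

The GUI would start the worker with `threading.Thread(target=scan_worker, args=(items, results), daemon=True)` and call `drain` from a periodic timer (for example Tkinter's `after`), so widgets are only ever touched from the GUI thread.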

efhosci · 2025-02-06T17:51:14Z

Thanks! I did briefly look into ways to run this in a separate thread, but it's something I have zero experience with currently. I definitely will try to implement that once I have a better understanding of it and have optimized the scanning loop and GUI as much as possible.

I still will probably do something to stop the thread after a short time if it's started automatically on opening the window and not by the user, and probably add a manual "abort" button, just to make sure that it doesn't suck up all system resources without the user knowing why.

I think there still is a good reason to have separate "basic/advanced" scans for statistics. I was doing some initial testing of getting the resolution for each image, and adding that slows down the scanning process by 10-20x. It's useful info to have but it needs to be optional, and only happen when it's either going to take <1 second or when the user explicitly asks for it.

- Moved the "scanning" code into a new file scripts/concept_stats.py that's called from ConceptWindow
- Now writes concept_stats to concept config file, will later add a check to load any saved info to the UI if it already exists on startup
- Added section to get the image width and height, and find the min/max/average pixel count of all images. This MASSIVELY slows down the process (around 10x slower) so the scanning has been separated into a "basic" and "advanced" section: "basic" only counts the number of files and their file size, "advanced" checks for image/mask and image/caption pairs and the new resolution info.
- Will use width/height info to check image bucketing in the future

- Now loads from data saved in concept config file if it exists, otherwise will run a "basic" scan the first time. Scanning manually refreshes data, "advanced" scan will add additional info.
- May need to force it to refresh if some other concept parameters change, or at least indicate when loaded values may be "out of date".
- Reorganized the layout on the GUI
- Changed the way some values are initialized or reloaded
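One way the "out of date" indication might work is to store a cheap fingerprint of the folder next to the saved stats and compare it on load. This is only a heuristic sketch: the `fingerprint` key, the JSON layout, and the function names are assumptions, and a file swapped for an older one with the same count would go unnoticed.

```python
import json
import os


def directory_fingerprint(directory):
    """Cheap summary of a folder: file count and newest modification time."""
    count, newest = 0, 0.0
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                newest = max(newest, entry.stat().st_mtime)
    return {"file_count": count, "newest_mtime": newest}


def load_cached_stats(config_path, directory):
    """Return (stats, is_fresh); stats is None when nothing has been saved yet."""
    if not os.path.exists(config_path):
        return None, False
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    saved = config.get("concept_stats")
    if saved is None:
        return None, False
    # Fresh only if the folder still looks the same as when the stats were saved.
    return saved, saved.get("fingerprint") == directory_fingerprint(directory)
```

When `is_fresh` is false the UI could still show the saved values, marked as possibly stale, until the user runs a new scan.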

mx · 2025-02-08T15:33:48Z

threading.Thread is pretty easy to use especially if you want to abort/cancel things, just make sure you stash the handle somewhere you can easily access from the abort button.
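A minimal sketch of that suggestion, assuming a cooperative stop flag (`threading.Event`) checked once per item. The `ScanJob` class and its attributes are hypothetical names, and the `sleep` stands in for real per-file work.

```python
import threading
import time


class ScanJob:
    """Keeps the thread handle and a stop flag together so an abort button can reach both."""

    def __init__(self, items):
        self.items = items
        self.processed = 0
        self._stop_requested = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        for _ in self.items:
            if self._stop_requested.is_set():
                return  # leave early; partial results stay in self.processed
            time.sleep(0.001)  # stand-in for per-file work
            self.processed += 1

    def start(self):
        self.thread.start()

    def abort(self):
        """Ask the worker to stop, then wait for it to finish its current item."""
        self._stop_requested.set()
        self.thread.join()
```

Storing the `ScanJob` on the window object would let an "abort" button callback simply call `job.abort()`.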

Arcitec · 2025-02-08T20:58:43Z

@efhosci Hmm, for the issue with slow scans when getting image size, the problem is that most ways to get the resolution involve decoding the whole image into a bitmap and then checking its size. That decoding is very slow.

But if all we want is the resolution, it is technically possible to read that data immediately from the binary image file data on disk. Which would be super fast. Usually it just involves reading two integers stored somewhere in the image file header.

The real question is if any Python libraries implement such efficient image resolution reading, and how many image formats they support. Worth investigating...

Edit: I have a super vague memory that PIL may have a mode that doesn't decode any image data until you try to read the data. If so, it may be capable of "opening" the image file and getting the resolution without decoding anything. It's a long shot and I may misremember.
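To make the header-reading idea concrete, here is a sketch for a single format, PNG, where the width and height are two big-endian 32-bit integers at a fixed offset in the first chunk. It uses only the standard library; other formats (JPEG in particular) need more parsing and are not covered. On the PIL question, Pillow's documentation describes `Image.open` as a lazy operation that identifies the file without reading the pixel data, so `image.size` should be available without a full decode.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Return (width, height) by reading only the first 24 bytes of a PNG file."""
    with open(path, "rb") as f:
        header = f.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"not a PNG file: {path}")
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

Because only 24 bytes are read per file, the cost is dominated by opening the file rather than by its size.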

efhosci · 2025-02-09T00:40:43Z

@Arcitec From what I could find on Google there are a few libraries that can get the image resolution a bit more quickly than PIL, but may have some issues with certain filetypes. I did test imagesize and it's a bit faster than PIL, but only like 20% faster. I think in both cases it's not fully opening the file, just getting info from the metadata, and I'd rather stick with the already installed library than install something separate. Plus if I eventually want to get some other info about the image PIL is going to be more useful.

Also part of the issue might be that I'm testing this with files stored on an HDD ZFS array, which is probably slowing down I/O speeds. I'll copy some pics to the system drive and see if it runs any faster.

Edit: Ok it runs a LOT faster from the SSD, though reading the resolution info is still much slower than reading just the filename and filesize. imagesize also seems to beat PIL a lot more there but will have to test it more extensively.

- Matches images to nearest aspect ratio buckets and counts the number of each. GUI shows the smallest + second smallest nonzero buckets, and a bar chart/histogram of all buckets.
- Currently the aspect buckets are copied from AspectBucketing and hard-coded in the concept_stats function, will try to have it pull those values from AspectBucketing automatically to avoid issues if they change in the future
- Some additional tweaks to the GUI spacing to make values more readable and reduce blank space
- Displays the exact resolution and file name for the largest/smallest images, average shows approx size of equivalent square image
- Add some checks to catch nonexistent/missing directory, other invalid inputs
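The bucket matching described above might be sketched as follows. The bucket values here are placeholders rather than the ones in AspectBucketing, and "nearest" is taken as the smallest absolute difference in width/height ratio, which may not match the real implementation.

```python
from collections import Counter

# Placeholder aspect ratios (width / height); the real list lives in AspectBucketing.
ASPECT_BUCKETS = [0.5, 0.67, 0.8, 1.0, 1.25, 1.5, 2.0]


def nearest_bucket(width, height, buckets=ASPECT_BUCKETS):
    """Pick the bucket whose aspect ratio is closest to width / height."""
    aspect = width / height
    return min(buckets, key=lambda b: abs(b - aspect))


def bucket_counts(sizes, buckets=ASPECT_BUCKETS):
    """Count images per bucket, keeping empty buckets so a chart shows every bar."""
    counts = Counter(nearest_bucket(w, h, buckets) for w, h in sizes)
    return {b: counts.get(b, 0) for b in buckets}


def smallest_nonzero(counts, n=2):
    """Return the n least-populated buckets that contain at least one image."""
    nonzero = sorted((c, b) for b, c in counts.items() if c > 0)
    return [(b, c) for c, b in nonzero[:n]]
```

Passing the bucket list as a parameter keeps the sketch independent of where the values come from, which fits the plan to pull them from AspectBucketing later.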

Gets length of captions (each newline-separated section in any text file matched to an image) and records min/max/avg length based on characters, as well as word count and filenames for min/max. Displayed on GUI similar to resolution info.
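A small sketch of the caption statistics, assuming the caption text has already been read into a dict keyed by file name. The function name and result keys are hypothetical, and blank lines are skipped, which may differ from the PR's behavior.

```python
def caption_length_stats(captions_by_file):
    """Compute min/max/avg caption length from {filename: caption file text}.

    Each non-empty line counts as one caption.
    """
    entries = []
    for filename, text in captions_by_file.items():
        for line in text.splitlines():
            line = line.strip()
            if line:
                entries.append((len(line), len(line.split()), filename))
    if not entries:
        return None
    shortest = min(entries, key=lambda e: e[0])
    longest = max(entries, key=lambda e: e[0])
    return {
        "min_chars": shortest[0], "min_words": shortest[1], "min_file": shortest[2],
        "max_chars": longest[0], "max_words": longest[1], "max_file": longest[2],
        "avg_chars": sum(e[0] for e in entries) / len(entries),
    }
```

Returning `None` for a concept without captions lets the GUI show a placeholder instead of dividing by zero.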

Underlined labels on stats window, tweaked graph to waste less space and show bar values, reformatted "smallest aspect bucket" into more readable string

efhosci added 2 commits February 7, 2025 13:11

efhosci added 3 commits February 9, 2025 13:33

Caption length info

230bc79

Formatting

64a1205
