Semantic search added #47

ndrean · 2024-01-23T00:08:15Z

First trial. Still amazed to see what you can do.

It works, probably needs a serious read. I close the other PR.

The embedding model is started via a GenServer. Maybe we can or should include this in your Models.ex.

Reset your the DB, and run the server. Logs should display "New Index".
Trials:

load 2 different images, say a house and then a dog. Then record a short audio, say a house. Check the result.
stop & start the server. Logs should output "Existing Index". Record an audio for a dog. Check the result.

Note: if you erase the Index file without resetting the DB, there will be obviously a problem.

I experience that the semantic search is not perfect, especially if the audio transcription fails. It also depends a lot on the words you use. If you ask for a picture of a cat, you may have a dog or a house..... This is where a threshold is needed. TBC!

codecov · 2024-01-23T00:12:57Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (9641719) to head (a9c2d83).

❗ Current head a9c2d83 differs from pull request most recent head 7b5287d. Consider uploading reports for the commit 7b5287d to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##              main       #47    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            3         5     +2     
  Lines           94       198   +104     
==========================================
+ Hits            94       198   +104

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

LuchoTurtle · 2024-01-23T00:24:41Z

I've merged #43 to main :)
Thanks a lot for this PR, can't wait until it's finished :D

ndrean · 2024-01-23T00:47:46Z

@LuchoTurtle

More remarks:

the Index file is saved locally.!!!! This means...problems if - as it seems - Fly prunes the volume when the machine is killed ‼️. So maybe Postgres Blob trick/hack?
Dialyzer warning corrected ✌️. I forgot I raised an issue last year on this (the specs are wrong) so the answer is:

    {:hnswlib, git: "https://github.com/elixir-nx/hnswlib", override: true},

about unique uploads. Nothing prevents a user from uploading several times the same image. One way could be to calculate the SHA, save it in the DB, ahd check if the new image has an existing SHA in which case we do nothing. I am not sure if duplicated stored images could be a problem bit this means that the same image may have several HNSWLib indexes. I think this refinment shouldn't be difficult to implement : add a "sha" field in the %ImageInfo{} struct and to the App.Image table (index this column), use a :crypto.hash(:sha, file_binary) , and run a DB check before even displaying and uploading it. his is implemented in the last commit. Let me know if you want this.
the UX: I should remove the previous found picture & transcription once the semantic search runs. We should probably do the same for the caption.
second UX point: the caption is not editable, nor is the audio transcript. This could be a refinement that you may want so the user can modify them in case of an error. Indeed, the first upload must be correct simply because you can't erase an uploaded file (which comes with the caption). But this is a demo, isn't it?
Let me know if the last part is "pedantic". I felt that the documentation of HNSWLib could be improved so I adjoined something here, but it can be censored.
I tried several times to find a threshold by looking at the "distance" or angle. I output the distance on each search, so you can observe it as well. My best guess is not use the result when the distance is above 0.12-0.15, but we need more fine-grained examples.

ndrean · 2024-01-23T23:41:41Z

@LuchoTurtle
I added "save Index file to Postgres'" to overcome the Fly volume problem as it is crucial to keep the "images" table and the HNSWLIb Index file in sync.

To be doubled checked but the test is the following: I populate "normally" the "indexes.bin" Index file when I upload some images, then stop the app, erase the "indexes.bin" file and restart the app, so I expect to load the Index from the DB and be able to find a given image with an audio.
The logic is in 2 files: App.HnswlibIndex.ex and App.KnnIndex.ex". In the GenServer init (KnnIndex.ex), I firstly check if the file "indexes.bin" is present (typically a "warm" start). If yes, then load from it. If not (volume pruned), check if it exists in the DB. If not, this means it is the first start of the app and you create an Index, and if found in the DB, then read it and copy it on disk and load it.
I save any Index file update into the DB, so every time I upload an image (where the caption, url and the found indice from the Index file is saved to the DB). This done in the Liveview "handle_info({ref, result}" line 358.. Perhaps we should lock the table during transaction: my concern is multi-users access and modifying the Index file. It is not entire clear for me when an Ecto.Multi acquires a lock. I updated the code to use an Ecto.Multi transaction but I am not convinced. See this thread on elixirforum.
Let me know what you think

ndrean · 2024-01-24T18:45:39Z

At this point, I believe the code works, once again except the tests and Readme. I didn't write much since I don't know if you accept the code as such. Except for the models, it relies on Postgres for the images and Index so it should be deployable. I am very very curious to see this online!!

LuchoTurtle · 2024-01-24T22:35:28Z

Hey @ndrean !

Before anything, thank you for the changes! Let me go over some of the points you've raised.

about unique uploads. Nothing prevents a user from uploading several times the same image.

The images are being uploaded to https://github.com/dwyl. The name of each file stems from the contents of the image, which is then hashed. So if we upload two images to https://github.com/dwyl that are the same, we're just replacing one with the other :)

the UX: I should remove the previous found picture & transcription once the semantic search runs. We should probably do the same for the caption.

The UX can be taken care of on another PR. It's easier to review smaller PRs then big ones :D Plus, I have some ideas on how to tackle this later.

second UX point: the caption is not editable, nor is the audio transcript. This could be a refinement that you may want so the user can modify them in case of an error.

This is a neat idea. But this is just a small demo/project to learn how to get some ML running on Elixir apps, it's not meant to be a production-ready app. Those details are cool but they're not the focus of this small project :D

Related to saving the index file in Postgres, I believe you made the right call (since we're deploying this to fly.io. I'm aware that there's an extension to Postgres called https://github.com/pgvector/pgvector, which is basically a vector database (which can inclusively perform knn search). But I haven't found any links of people able to run this extension in fly.io machines.

Your strategy to load the index file seems to make sense. I honestly have no idea how much time it takes to save the file to the DB but it seems that Ecto,Multi does not lock the table (from the link you sent me). It seems that, if we want to have a pessimistic row-level locking procedure going on, we have to use the lock function instead.

However, do we actually really want pessimistic locking? I think if there's a race condition, we can either return the info to the user with feedback or let it resolve itself (https://stackoverflow.com/questions/54896663/what-is-a-real-world-example-where-you-would-choose-to-use-pessimistic-locking). We can try both options and see how it fares though, for sure (at least Logger it, so we know in case it happens).

ndrean · 2024-01-24T23:52:06Z

Thanks Lucho!

About the SHA:

we're just replacing one with the other :)

ok, but:

you still can create 2 indices for the same picture, and this is not good since there is some kind of clustering, so I don't know the consequences
and is it worth going through all the process, making another call to S3 ect whilst you can simply add the SHA (its a one line computation in Elixir) as an indexed field into the Image table and look-up of it and make an early stop? This is what I implement, very little changes.

The usage is pretty reasonable and short and stops the process quite early. I just add an indexed "sha1" field to the Image table, and add the SHA1 string as a field to the %ImageInfo{} struct, hat's pretty much it.

#page_live.ex
def handle_progress(:image, ...) when entry.done? do
  with {:magic, {:ok, %{mime_type: mime}}} <-
                    {:magic, magic_check(path)},
                  file_binary <- File.read!(path),
                  sha1 <- App.Image.calc_sha1(file_binary),
                  {:sha_check, :ok} <- {:sha_check, App.Image.check_sha1(sha1)},
                  {:image_info, {mimetype, width, height, _variant}} <-..... do
         image_info = %ImageInfo{
                            ...,
                            sha1: sha1
                         }
         ...

  else
    {:sha_check, nil} -> push_event(socket, "toast", %{message: "Already present"})

and the functions are:

def calc_sha1(file_binary) do
    :crypto.hash(:sha, file_binary)
    |> Base.encode16()
  end

  def check_sha1(sha1) do
    App.Repo.get_by(App.Image, %{sha1: sha1})
    |> case do
      nil ->
        :ok

      _ ->
        nil
    end
  end

The Index file is not a vector, and seems small ? (10kB for 5 entries). I erased the "indexes.bin" file and takes 4 ms as per my logs to return it as a binary buffer.

[debug] QUERY OK source="hnswlib_index" db=3.6ms decode=1.5ms queue=53.1ms idle=0.0ms
SELECT h0."id", h0."file" FROM "hnswlib_index" AS h0 WHERE (h0."id" = $1) [1]
↳ App.HnswlibIndex.maybe_load_index_from_db/3, at: lib/app/hnswlib_index.ex:29
[info] Loading Index from DB

It seems quite reasonable.

pg_Vector: But I haven't found any links of people able to run this extension in fly.io machines.

I guess you can use extensions, such as PostGIS. See this thread. But it is clear as mud, sorry.

As for the locks, I really don't know. I was more thinking that the tests are more drastic on race conditions. But I would leave it like this, it's already quite a gymnastic.

lib/app/image.ex

ndrean · 2024-01-25T23:22:39Z

@LuchoTurtle

Sorry for pushing even further all this but I discovered a pb.

In fact, I did not take into account the fact that 2 different users may upload the same image at almost the same time. In this case, the SHA won't work because the image save to the DB occurs after the caption is computed, which means a few seconds, which gives space for another user to upload the same image (thus with the same SHA) but the DB is not yet uplated with the SHA. I hope this makes sense.

So I looked at optimistic locks. It seemed pretty easy to use, so I added it, just a few changes, exactly as per the doc.
The idea is that the Hnswlib Index is to avoid several saving the Index file at the same, to avoid different versions, in which case I ask for a retry as you suggested.
Furthermore, I had to modify the Ecto.Multi failure treatment: within a transaction, you firstly update the image, then save the Idnex into the DB and handle the errors.

My main fear is how to test all this, but let's firstly focus on understanding the code!

The optimistic lock changes (just following the mentionned documentation above):

Hnwslib_index migration

defmodule App.Repo.Migrations.CreateTableHnswlibIndex do
  use Ecto.Migration

  def change do
    create_if_not_exists table("hnswlib_index") do
      add :lock_version, :integer, default: 1
     ^^^
      add :file, :binary
    end
  end
end

Hnswlib_schema

schema "hnswlib_index" do
    field(:file, :binary)
    field(:lock_version, :integer, default: 1)
   ^^^
end

def changeset(struct \\ %__MODULE__{}, params \\ %{}) do
    struct
    |> Ecto.Changeset.cast(params, [:id, :file])
    |> Ecto.Changeset.optimistic_lock(:lock_version)
    ^^^
    |> Ecto.Changeset.validate_required([:id])
end

The Ecot.Multi where a potential race condition when saving the Index into the DB is captured by a toast "Please retry"

Ecto.Multi.new()
  # save updated Image to DB
  |> Ecto.Multi.run(:update_image, fn _, _ ->
    Map.put(image, :idx, idx)
    |> App.Image.insert()
  end)
  # save Index file to DB
  |> Ecto.Multi.run(:save_index, fn _, _ ->
    App.HnswlibIndex.save()
  end)
  |> App.Repo.transaction()
  |> case do
    {:error, :update_image, _changeset, _} ->
      {:noreply,
       socket
       |> push_event("toast", %{message: "Invalid entry"})
       |> assign(running?: false, index: index, task_ref: nil, label: nil)}

    {:error, :save_index, _, _} ->
      {:noreply,
       socket
       |> push_event("toast", %{message: "Please retry"})
       |> assign(running?: false, index: index, task_ref: nil, label: nil)}

    {:ok, _} ->
      {:noreply,
        socket
        |> assign(running?: false, index: index, task_ref: nil, label: label)}
    end
else....

# Conflicts: # mix.exs # mix.lock

ndrean · 2024-02-24T16:27:15Z

@LuchoTurtle
Regarding the formatting of the Readme. I find it difficult to read the text because of the numerous newlines.
is there a reason to produce newlines the way you do? yMaybe ou don't want to have the automatic slider?

LuchoTurtle · 2024-02-24T19:45:59Z

There's a reason for this ☺️

Here's a link for a small explanation on this:
https://emmanuelbernard.com/blog/2013/08/08/one-line-per-idea/

Other than that, I think this PR is ready to be merged, don't you agree? Don't worry about the HTML, that can be worked on a different issue :)

# Conflicts: # mix.exs # mix.lock

ndrean · 2024-03-03T15:08:21Z

Some remarks:

Bandit. I saw that Phoenix uses it by default, so I used it to see if I could give any improvement. However, it maybe not be a good idea due to memory usage. We should revert back to Cowboy.
Just comment the line below in "config.exs" and in "mixs.exs" and uncomment :plug_comwboy.

#config/congi.exs
config :app, AppWeb.Endpoint,
  # adapter: Bandit.PhoenixAdapter,

#mix.exs
 [...,
  # {:bandit, "~> 1.0"},
  {:plug_cowboy, "~> 2.7.0"},
 ...]

ELXA:
I updated EXLA and hnswlib and nx

{:bumblebee, "~> 0.5.0"},
{:exla, "~> 0.7.0"},
{:nx, "~> 0.7.0 "},
{:hnswlib, "~> 0.1.5"},

and I could make it work due to the error below:

[warning] The on_load function for module Elixir.EXLA.NIF returned:
{:error, {:load_failed,
  ~c"Failed to load NIF library: 'dlopen(/Users/nevendrean/code/elixir/image-classifier/_build/dev/lib/exla/priv/libexla.so

I erased "_build" and the local the cache, and tried to mix compile or mix deps.compile exla --force and it still did not work.
I see that the Dockerfile does not declare any special env var.
However, I could not get it working until I set the env var XLA_CACHE_DIR on my system as below (pointing to my cache).

 export XLA_CACHE_DIR=/Users/nevendrean/Library/Caches/xla/exla

I don't understand what happens and lost quite some time on this.

Failling async model loading
I transformed the models loading into a generic Genserver that takes the model as an argument. It was supposed to off-load the processes loading in a handle_continue. These Genservers are set as childs to the Supervisor. You need to use a map %{id: :audio, start: {....} and the :via, Registry trick because you are using several times the same GenServer but with different arguments. Note that you reach the module attribute @audio_prod_model with a function instead, App.Models.audio_prod_model() returning this struct.

#Applcition.ex

defp via_tuple(name) do
  {:via, Registry, {MyRegistry, name}}
end


children = [
  ...,
  {Registry, keys: :unique, name: MyRegistry},
  %{
    id: "audio",
    start:
       {App.ModelLoader, :start_link,
        [
           [
             name: via_tuple("audio"),
             space: :cosine,
             model: App.Models.get_audio_prod_model()
          ]
      ]}
  },
  %{
    id: "caption",
    start: 
      {App/ModLoader, :start_link,
        [....]
      }
  },
  .....

However, this fails as It still see a sequential loading so it may not be the way to go.

LuchoTurtle · 2024-03-04T02:06:53Z

@ndrean Thank you again for the awesome insight (as always!).

1 - I've made this change, thank you for the warning. Reverted.
2 - It's weird that you got this happening to you 🤔 . Regardless, although upgrading the dependencies is always appreciated, it seems that with the new version of Bumblebee, some breaking changes have been introduced that may conflict with the work we've done here. They've been updated on #70.
3 - It's a nice idea to try and load the models async. But I feel like this PR is already massive and adds a ton of value. I don't think it's productive to delve into optimization in this PR, it's already bloated with many changes (great ones!), it's best to merge it and deal with each optimization separately. Again, we want to make this PR beginner-friendly, it's a tutorial meant to guide people :)

Thank you for the awesome work and the great value delivered on this PR!

We'll make some changes on the HTML/CSS on a different view to make it more pleasing to the eye :)

Thanks a lot!

ndrean · 2024-03-04T12:38:00Z

@LuchoTurtle
Note that I did test the async loading to try to (partly) improve the loading time of the models every time the Fly machine is pruned.

ndrean · 2024-03-05T23:23:16Z

@LuchoTurtle
If you want to use tabs, such as one with a file input, and one with audio input, I saved some notes on how one can do this by creating 2 components and some links to navigate between tabs. It is a simply JS.hide / JS.show.
If you think this could be interesting, I can propose something.

I am in the process of reading the Readme.

first trial

140c49c

LuchoTurtle added documentation Improvements or additions to documentation enhancement New feature or enhancement of existing functionality in-progress An issue or pull request that is being worked on by the assigned person labels Jan 23, 2024

ndrean added 6 commits January 22, 2024 20:25

corrected dialyzer error case

c78f111

corrected dialyzer error case

4bf1276

corrected dialyzer error case in Readme

76ac055

align transcription output with caption

41ea744

align transcription output & image with caption

d61913d

add index in db

bfca6d2

ndrean added 6 commits January 23, 2024 22:32

loading from DB if index file erased

85ef304

loading from DB if index file erased

67476f2

loading from DB if index file erased: cleaning

c88ef83

with Ecto.Multi & transaction

9c65846

added SHA to get unique files

ffc4739

created unique index on :sha1 for images & schema capture

b82df7d

changed SHA1 test to early failure

db26ee7

refactor SHA & doc with functions

7993b5d

LuchoTurtle reviewed Jan 25, 2024

View reviewed changes

lib/app/image.ex Show resolved Hide resolved

ndrean changed the title ~~Semantic sezarch added~~ Semantic search added Jan 25, 2024

ndrean added 2 commits January 25, 2024 09:11

added pg index on image_idx

6f99a58

added optimistic lock & refactored Multi

aebaa10

LuchoTurtle and others added 14 commits February 22, 2024 04:21

chore: Renaming socket assigns.

6f1fe4f

readme: Adding section for page_live

9993b98

readme: Adding view section.

90f0368

Merge branch 'main' into semantic

6fe7202

# Conflicts: # mix.exs # mix.lock

fix: Fixing mix.lock

9d42ca9

chore: Normalizing all loggers.

579c5fb

fix: Fixing models loading while testing and on prod.

e1675ad

readme: Updating README.

8b0dca5

minor changes on redundant & shorter code: :if and remove dir creationH

12204a5

Update README.md

2528176

Update README.md

a34f132

Update README.md

b2c7882

Update README.md

96f6ccd

padding to record button

72693bf

readme: Adding example gif.

9a11eab

LuchoTurtle approved these changes Feb 25, 2024

View reviewed changes

LuchoTurtle added in-review Issue or pull request that is currently being reviewed by the assigned person and removed in-progress An issue or pull request that is being worked on by the assigned person labels Feb 25, 2024

LuchoTurtle added 2 commits March 3, 2024 01:49

Merge branch 'main' into semantic

e679569

# Conflicts: # mix.exs # mix.lock

merge: Fixing conflicts.

a9c2d83

chore: Not using cowboy.

7b5287d

LuchoTurtle merged commit b8b6241 into dwyl:main Mar 4, 2024
1 check passed

This was referenced Mar 4, 2024

feat: Audio-to-text image caption search using embedding models [showcase] #18

Closed

chore: Making the HTML/CSS a bit prettier :) #68

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Semantic search added #47

Semantic search added #47

ndrean commented Jan 23, 2024 •

edited

Loading

codecov bot commented Jan 23, 2024 •

edited

Loading

LuchoTurtle commented Jan 23, 2024 •

edited

Loading

ndrean commented Jan 23, 2024 •

edited

Loading

ndrean commented Jan 23, 2024 •

edited

Loading

ndrean commented Jan 24, 2024

LuchoTurtle commented Jan 24, 2024

ndrean commented Jan 24, 2024 •

edited

Loading

ndrean commented Jan 25, 2024 •

edited

Loading

ndrean commented Feb 24, 2024

LuchoTurtle commented Feb 24, 2024 •

edited

Loading

ndrean commented Mar 3, 2024 •

edited

Loading

LuchoTurtle commented Mar 4, 2024 •

edited

Loading

ndrean commented Mar 4, 2024

ndrean commented Mar 5, 2024 •

edited

Loading

Semantic search added #47

Semantic search added #47

Conversation

ndrean commented Jan 23, 2024 • edited Loading

codecov bot commented Jan 23, 2024 • edited Loading

Codecov Report

LuchoTurtle commented Jan 23, 2024 • edited Loading

ndrean commented Jan 23, 2024 • edited Loading

ndrean commented Jan 23, 2024 • edited Loading

ndrean commented Jan 24, 2024

LuchoTurtle commented Jan 24, 2024

ndrean commented Jan 24, 2024 • edited Loading

ndrean commented Jan 25, 2024 • edited Loading

ndrean commented Feb 24, 2024

LuchoTurtle commented Feb 24, 2024 • edited Loading

ndrean commented Mar 3, 2024 • edited Loading

LuchoTurtle commented Mar 4, 2024 • edited Loading

ndrean commented Mar 4, 2024

ndrean commented Mar 5, 2024 • edited Loading

ndrean commented Jan 23, 2024 •

edited

Loading

codecov bot commented Jan 23, 2024 •

edited

Loading

LuchoTurtle commented Jan 23, 2024 •

edited

Loading

ndrean commented Jan 23, 2024 •

edited

Loading

ndrean commented Jan 23, 2024 •

edited

Loading

ndrean commented Jan 24, 2024 •

edited

Loading

ndrean commented Jan 25, 2024 •

edited

Loading

LuchoTurtle commented Feb 24, 2024 •

edited

Loading

ndrean commented Mar 3, 2024 •

edited

Loading

LuchoTurtle commented Mar 4, 2024 •

edited

Loading

ndrean commented Mar 5, 2024 •

edited

Loading