Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic search added #47

Merged
merged 105 commits into from
Mar 4, 2024
Merged

Semantic search added #47

merged 105 commits into from
Mar 4, 2024

Conversation

ndrean
Copy link
Collaborator

@ndrean ndrean commented Jan 23, 2024

@LuchoTurtle

First trial. Still amazed to see what you can do.

It works, probably needs a serious read. I close the other PR.

The embedding model is started via a GenServer. Maybe we can or should include this in your Models.ex.

  1. Reset your the DB, and run the server. Logs should display "New Index".

  2. Trials:

  • load 2 different images, say a house and then a dog. Then record a short audio, say a house. Check the result.
  • stop & start the server. Logs should output "Existing Index". Record an audio for a dog. Check the result.

Note: if you erase the Index file without resetting the DB, there will be obviously a problem.

I experience that the semantic search is not perfect, especially if the audio transcription fails. It also depends a lot on the words you use. If you ask for a picture of a cat, you may have a dog or a house..... This is where a threshold is needed. TBC!

Copy link

codecov bot commented Jan 23, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (9641719) to head (a9c2d83).

❗ Current head a9c2d83 differs from pull request most recent head 7b5287d. Consider uploading reports for the commit 7b5287d to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##              main       #47    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            3         5     +2     
  Lines           94       198   +104     
==========================================
+ Hits            94       198   +104     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@LuchoTurtle
Copy link
Member

LuchoTurtle commented Jan 23, 2024

I've merged #43 to main :)
Thanks a lot for this PR, can't wait until it's finished :D

@LuchoTurtle LuchoTurtle added documentation Improvements or additions to documentation enhancement New feature or enhancement of existing functionality in-progress An issue or pull request that is being worked on by the assigned person labels Jan 23, 2024
@ndrean
Copy link
Collaborator Author

ndrean commented Jan 23, 2024

@LuchoTurtle

More remarks:

  • the Index file is saved locally.!!!! This means...problems if - as it seems - Fly prunes the volume when the machine is killed ‼️. So maybe Postgres Blob trick/hack?

  • Dialyzer warning corrected ✌️. I forgot I raised an issue last year on this (the specs are wrong) so the answer is:

    {:hnswlib, git: "https://github.com/elixir-nx/hnswlib", override: true},
  • about unique uploads. Nothing prevents a user from uploading several times the same image. One way could be to calculate the SHA, save it in the DB, ahd check if the new image has an existing SHA in which case we do nothing. I am not sure if duplicated stored images could be a problem bit this means that the same image may have several HNSWLib indexes. I think this refinment shouldn't be difficult to implement : add a "sha" field in the %ImageInfo{} struct and to the App.Image table (index this column), use a :crypto.hash(:sha, file_binary) , and run a DB check before even displaying and uploading it. his is implemented in the last commit. Let me know if you want this.

  • the UX: I should remove the previous found picture & transcription once the semantic search runs. We should probably do the same for the caption.

  • second UX point: the caption is not editable, nor is the audio transcript. This could be a refinement that you may want so the user can modify them in case of an error. Indeed, the first upload must be correct simply because you can't erase an uploaded file (which comes with the caption). But this is a demo, isn't it?

  • Let me know if the last part is "pedantic". I felt that the documentation of HNSWLib could be improved so I adjoined something here, but it can be censored.

  • I tried several times to find a threshold by looking at the "distance" or angle. I output the distance on each search, so you can observe it as well. My best guess is not use the result when the distance is above 0.12-0.15, but we need more fine-grained examples.

@ndrean
Copy link
Collaborator Author

ndrean commented Jan 23, 2024

@LuchoTurtle
I added "save Index file to Postgres'" to overcome the Fly volume problem as it is crucial to keep the "images" table and the HNSWLIb Index file in sync.

To be doubled checked but the test is the following: I populate "normally" the "indexes.bin" Index file when I upload some images, then stop the app, erase the "indexes.bin" file and restart the app, so I expect to load the Index from the DB and be able to find a given image with an audio.
The logic is in 2 files: App.HnswlibIndex.ex and App.KnnIndex.ex". In the GenServer init (KnnIndex.ex), I firstly check if the file "indexes.bin" is present (typically a "warm" start). If yes, then load from it. If not (volume pruned), check if it exists in the DB. If not, this means it is the first start of the app and you create an Index, and if found in the DB, then read it and copy it on disk and load it.
I save any Index file update into the DB, so every time I upload an image (where the caption, url and the found indice from the Index file is saved to the DB). This done in the Liveview "handle_info({ref, result}" line 358.. Perhaps we should lock the table during transaction: my concern is multi-users access and modifying the Index file. It is not entire clear for me when an Ecto.Multi acquires a lock. I updated the code to use an Ecto.Multi transaction but I am not convinced. See this thread on elixirforum.
Let me know what you think

@ndrean
Copy link
Collaborator Author

ndrean commented Jan 24, 2024

At this point, I believe the code works, once again except the tests and Readme. I didn't write much since I don't know if you accept the code as such. Except for the models, it relies on Postgres for the images and Index so it should be deployable. I am very very curious to see this online!!

@LuchoTurtle
Copy link
Member

Hey @ndrean !

Before anything, thank you for the changes! Let me go over some of the points you've raised.


about unique uploads. Nothing prevents a user from uploading several times the same image.

The images are being uploaded to https://github.com/dwyl. The name of each file stems from the contents of the image, which is then hashed. So if we upload two images to https://github.com/dwyl that are the same, we're just replacing one with the other :)

the UX: I should remove the previous found picture & transcription once the semantic search runs. We should probably do the same for the caption.

The UX can be taken care of on another PR. It's easier to review smaller PRs then big ones :D Plus, I have some ideas on how to tackle this later.

second UX point: the caption is not editable, nor is the audio transcript. This could be a refinement that you may want so the user can modify them in case of an error.

This is a neat idea. But this is just a small demo/project to learn how to get some ML running on Elixir apps, it's not meant to be a production-ready app. Those details are cool but they're not the focus of this small project :D


Related to saving the index file in Postgres, I believe you made the right call (since we're deploying this to fly.io. I'm aware that there's an extension to Postgres called https://github.com/pgvector/pgvector, which is basically a vector database (which can inclusively perform knn search). But I haven't found any links of people able to run this extension in fly.io machines.

Your strategy to load the index file seems to make sense. I honestly have no idea how much time it takes to save the file to the DB but it seems that Ecto,Multi does not lock the table (from the link you sent me). It seems that, if we want to have a pessimistic row-level locking procedure going on, we have to use the lock function instead.

However, do we actually really want pessimistic locking? I think if there's a race condition, we can either return the info to the user with feedback or let it resolve itself (https://stackoverflow.com/questions/54896663/what-is-a-real-world-example-where-you-would-choose-to-use-pessimistic-locking). We can try both options and see how it fares though, for sure (at least Logger it, so we know in case it happens).

@ndrean
Copy link
Collaborator Author

ndrean commented Jan 24, 2024

Thanks Lucho!

About the SHA:

we're just replacing one with the other :)

ok, but:

  • you still can create 2 indices for the same picture, and this is not good since there is some kind of clustering, so I don't know the consequences
  • and is it worth going through all the process, making another call to S3 ect whilst you can simply add the SHA (its a one line computation in Elixir) as an indexed field into the Image table and look-up of it and make an early stop? This is what I implement, very little changes.

The usage is pretty reasonable and short and stops the process quite early. I just add an indexed "sha1" field to the Image table, and add the SHA1 string as a field to the %ImageInfo{} struct, hat's pretty much it.

#page_live.ex
def handle_progress(:image, ...) when entry.done? do
  with {:magic, {:ok, %{mime_type: mime}}} <-
                    {:magic, magic_check(path)},
                  file_binary <- File.read!(path),
                  sha1 <- App.Image.calc_sha1(file_binary),
                  {:sha_check, :ok} <- {:sha_check, App.Image.check_sha1(sha1)},
                  {:image_info, {mimetype, width, height, _variant}} <-..... do
         image_info = %ImageInfo{
                            ...,
                            sha1: sha1
                         }
         ...

  else
    {:sha_check, nil} -> push_event(socket, "toast", %{message: "Already present"})

and the functions are:

def calc_sha1(file_binary) do
    :crypto.hash(:sha, file_binary)
    |> Base.encode16()
  end

  def check_sha1(sha1) do
    App.Repo.get_by(App.Image, %{sha1: sha1})
    |> case do
      nil ->
        :ok

      _ ->
        nil
    end
  end

The Index file is not a vector, and seems small ? (10kB for 5 entries). I erased the "indexes.bin" file and takes 4 ms as per my logs to return it as a binary buffer.

[debug] QUERY OK source="hnswlib_index" db=3.6ms decode=1.5ms queue=53.1ms idle=0.0ms
SELECT h0."id", h0."file" FROM "hnswlib_index" AS h0 WHERE (h0."id" = $1) [1]
↳ App.HnswlibIndex.maybe_load_index_from_db/3, at: lib/app/hnswlib_index.ex:29
[info] Loading Index from DB

It seems quite reasonable.

pg_Vector: But I haven't found any links of people able to run this extension in fly.io machines.

I guess you can use extensions, such as PostGIS. See this thread. But it is clear as mud, sorry.

As for the locks, I really don't know. I was more thinking that the tests are more drastic on race conditions. But I would leave it like this, it's already quite a gymnastic.

@ndrean ndrean changed the title Semantic sezarch added Semantic search added Jan 25, 2024
@ndrean
Copy link
Collaborator Author

ndrean commented Jan 25, 2024

@LuchoTurtle

Sorry for pushing even further all this but I discovered a pb.

In fact, I did not take into account the fact that 2 different users may upload the same image at almost the same time. In this case, the SHA won't work because the image save to the DB occurs after the caption is computed, which means a few seconds, which gives space for another user to upload the same image (thus with the same SHA) but the DB is not yet uplated with the SHA. I hope this makes sense.

So I looked at optimistic locks. It seemed pretty easy to use, so I added it, just a few changes, exactly as per the doc.
The idea is that the Hnswlib Index is to avoid several saving the Index file at the same, to avoid different versions, in which case I ask for a retry as you suggested.
Furthermore, I had to modify the Ecto.Multi failure treatment: within a transaction, you firstly update the image, then save the Idnex into the DB and handle the errors.

My main fear is how to test all this, but let's firstly focus on understanding the code!

The optimistic lock changes (just following the mentionned documentation above):

  • Hnwslib_index migration
defmodule App.Repo.Migrations.CreateTableHnswlibIndex do
  use Ecto.Migration

  def change do
    create_if_not_exists table("hnswlib_index") do
      add :lock_version, :integer, default: 1
     ^^^
      add :file, :binary
    end
  end
end
  • Hnswlib_schema
schema "hnswlib_index" do
    field(:file, :binary)
    field(:lock_version, :integer, default: 1)
   ^^^
end

def changeset(struct \\ %__MODULE__{}, params \\ %{}) do
    struct
    |> Ecto.Changeset.cast(params, [:id, :file])
    |> Ecto.Changeset.optimistic_lock(:lock_version)
    ^^^
    |> Ecto.Changeset.validate_required([:id])
end

The Ecot.Multi where a potential race condition when saving the Index into the DB is captured by a toast "Please retry"

Ecto.Multi.new()
  # save updated Image to DB
  |> Ecto.Multi.run(:update_image, fn _, _ ->
    Map.put(image, :idx, idx)
    |> App.Image.insert()
  end)
  # save Index file to DB
  |> Ecto.Multi.run(:save_index, fn _, _ ->
    App.HnswlibIndex.save()
  end)
  |> App.Repo.transaction()
  |> case do
    {:error, :update_image, _changeset, _} ->
      {:noreply,
       socket
       |> push_event("toast", %{message: "Invalid entry"})
       |> assign(running?: false, index: index, task_ref: nil, label: nil)}

    {:error, :save_index, _, _} ->
      {:noreply,
       socket
       |> push_event("toast", %{message: "Please retry"})
       |> assign(running?: false, index: index, task_ref: nil, label: nil)}

    {:ok, _} ->
      {:noreply,
        socket
        |> assign(running?: false, index: index, task_ref: nil, label: label)}
    end
else....

@ndrean
Copy link
Collaborator Author

ndrean commented Feb 24, 2024

@LuchoTurtle
Regarding the formatting of the Readme. I find it difficult to read the text because of the numerous newlines.
is there a reason to produce newlines the way you do? yMaybe ou don't want to have the automatic slider?

@LuchoTurtle
Copy link
Member

LuchoTurtle commented Feb 24, 2024

There's a reason for this ☺️

Here's a link for a small explanation on this:
https://emmanuelbernard.com/blog/2013/08/08/one-line-per-idea/

Other than that, I think this PR is ready to be merged, don't you agree? Don't worry about the HTML, that can be worked on a different issue :)

@LuchoTurtle LuchoTurtle added in-review Issue or pull request that is currently being reviewed by the assigned person and removed in-progress An issue or pull request that is being worked on by the assigned person labels Feb 25, 2024
@ndrean
Copy link
Collaborator Author

ndrean commented Mar 3, 2024

Some remarks:

  1. Bandit. I saw that Phoenix uses it by default, so I used it to see if I could give any improvement. However, it maybe not be a good idea due to memory usage. We should revert back to Cowboy.
    Just comment the line below in "config.exs" and in "mixs.exs" and uncomment :plug_comwboy.
#config/congi.exs
config :app, AppWeb.Endpoint,
  # adapter: Bandit.PhoenixAdapter,

#mix.exs
 [...,
  # {:bandit, "~> 1.0"},
  {:plug_cowboy, "~> 2.7.0"},
 ...]
  1. ELXA:
    I updated EXLA and hnswlib and nx
{:bumblebee, "~> 0.5.0"},
{:exla, "~> 0.7.0"},
{:nx, "~> 0.7.0 "},
{:hnswlib, "~> 0.1.5"},

and I could make it work due to the error below:

[warning] The on_load function for module Elixir.EXLA.NIF returned:
{:error, {:load_failed,
  ~c"Failed to load NIF library: 'dlopen(/Users/nevendrean/code/elixir/image-classifier/_build/dev/lib/exla/priv/libexla.so

I erased "_build" and the local the cache, and tried to mix compile or mix deps.compile exla --force and it still did not work.
I see that the Dockerfile does not declare any special env var.
However, I could not get it working until I set the env var XLA_CACHE_DIR on my system as below (pointing to my cache).

 export XLA_CACHE_DIR=/Users/nevendrean/Library/Caches/xla/exla

I don't understand what happens and lost quite some time on this.

  1. Failling async model loading
    I transformed the models loading into a generic Genserver that takes the model as an argument. It was supposed to off-load the processes loading in a handle_continue. These Genservers are set as childs to the Supervisor. You need to use a map %{id: :audio, start: {....} and the :via, Registry trick because you are using several times the same GenServer but with different arguments. Note that you reach the module attribute @audio_prod_model with a function instead, App.Models.audio_prod_model() returning this struct.
#Applcition.ex

defp via_tuple(name) do
  {:via, Registry, {MyRegistry, name}}
end


children = [
  ...,
  {Registry, keys: :unique, name: MyRegistry},
  %{
    id: "audio",
    start:
       {App.ModelLoader, :start_link,
        [
           [
             name: via_tuple("audio"),
             space: :cosine,
             model: App.Models.get_audio_prod_model()
          ]
      ]}
  },
  %{
    id: "caption",
    start: 
      {App/ModLoader, :start_link,
        [....]
      }
  },
  .....

However, this fails as It still see a sequential loading so it may not be the way to go.

@LuchoTurtle
Copy link
Member

LuchoTurtle commented Mar 4, 2024

@ndrean Thank you again for the awesome insight (as always!).

1 - I've made this change, thank you for the warning. Reverted.
2 - It's weird that you got this happening to you 🤔 . Regardless, although upgrading the dependencies is always appreciated, it seems that with the new version of Bumblebee, some breaking changes have been introduced that may conflict with the work we've done here. They've been updated on #70.
3 - It's a nice idea to try and load the models async. But I feel like this PR is already massive and adds a ton of value. I don't think it's productive to delve into optimization in this PR, it's already bloated with many changes (great ones!), it's best to merge it and deal with each optimization separately. Again, we want to make this PR beginner-friendly, it's a tutorial meant to guide people :)

Thank you for the awesome work and the great value delivered on this PR!

We'll make some changes on the HTML/CSS on a different view to make it more pleasing to the eye :)

Thanks a lot!

@LuchoTurtle LuchoTurtle merged commit b8b6241 into dwyl:main Mar 4, 2024
1 check passed
@ndrean
Copy link
Collaborator Author

ndrean commented Mar 4, 2024

@LuchoTurtle
Note that I did test the async loading to try to (partly) improve the loading time of the models every time the Fly machine is pruned.

@ndrean
Copy link
Collaborator Author

ndrean commented Mar 5, 2024

@LuchoTurtle
If you want to use tabs, such as one with a file input, and one with audio input, I saved some notes on how one can do this by creating 2 components and some links to navigate between tabs. It is a simply JS.hide / JS.show.
If you think this could be interesting, I can propose something.

I am in the process of reading the Readme.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or enhancement of existing functionality in-review Issue or pull request that is currently being reviewed by the assigned person
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants