Semantic search added #47
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff            @@
##              main       #47   +/-   ##
==========================================
  Coverage   100.00%   100.00%
==========================================
  Files            3         5    +2
  Lines           94       198  +104
==========================================
+ Hits            94       198  +104 |
I've merged #43 to |
More remarks:
{:hnswlib, git: "https://github.com/elixir-nx/hnswlib", override: true},
|
@LuchoTurtle To be double-checked, but the test is the following: I populate the "indexes.bin" Index file normally by uploading some images, then stop the app, erase the "indexes.bin" file and restart the app. I then expect the Index to be loaded from the DB, so that I can still find a given image with an audio. |
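The restore flow described in this test can be sketched in isolation. This is a minimal sketch, not the PR's `App.HnswlibIndex` code: the module name `IndexRestoreSketch`, the function `ensure_index_file/2`, and the `fetch_from_db` callback (a stand-in for the Postgres query) are all hypothetical.

```elixir
defmodule IndexRestoreSketch do
  # `fetch_from_db` is a hypothetical zero-arity function standing in for the
  # Postgres query; it returns the saved binary, or nil when nothing is stored.
  def ensure_index_file(path, fetch_from_db) do
    if File.exists?(path) do
      # The local file is still there: nothing to restore.
      {:ok, :from_file}
    else
      case fetch_from_db.() do
        nil ->
          # No file and no DB row: the caller would build a fresh index.
          {:ok, :new_index}

        binary when is_binary(binary) ->
          # Recreate the local file from the copy saved in the DB.
          File.write!(path, binary)
          {:ok, :from_db}
      end
    end
  end
end

path =
  Path.join(System.tmp_dir!(), "indexes_sketch_#{System.unique_integer([:positive])}.bin")

# File erased, DB copy present: the file is rebuilt from the DB.
first = IndexRestoreSketch.ensure_index_file(path, fn -> <<1, 2, 3>> end)
# The file now exists, so the DB is not consulted.
second = IndexRestoreSketch.ensure_index_file(path, fn -> nil end)
File.rm(path)
```

The three return tags only make the branch visible; the real code would go on to load the file into the HNSWLib index.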
At this point, I believe the code works, once again except the tests and Readme. I didn't write much since I don't know if you accept the code as such. Except for the models, it relies on Postgres for the images and Index so it should be deployable. I am very very curious to see this online!! |
Hey @ndrean ! Before anything, thank you for the changes! Let me go over some of the points you've raised.
The images are being uploaded to https://github.com/dwyl. The name of each file stems from the contents of the image, which is then hashed. So if we upload two images to https://github.com/dwyl that are the same, we're just replacing one with the other :)
The UX can be taken care of on another PR. It's easier to review smaller PRs than big ones :D Plus, I have some ideas on how to tackle this later.
This is a neat idea. But this is just a small demo/project to learn how to get some ML running on Elixir apps, it's not meant to be a production-ready app. Those details are cool but they're not the focus of this small project :D Related to saving the index file in Postgres, I believe you made the right call (since we're deploying this to Your strategy to load the index file seems to make sense. I honestly have no idea how much time it takes to save the file to the DB but it seems that However, do we actually really want pessimistic locking? I think if there's a race condition, we can either return the info to the user with feedback or let it resolve itself (https://stackoverflow.com/questions/54896663/what-is-a-real-world-example-where-you-would-choose-to-use-pessimistic-locking). We can try both options and see how it fares though, for sure (at least |
Thanks Lucho! About the SHA:
ok, but:
The usage is pretty reasonable and short and stops the process quite early. I just add an indexed "sha1" field to the Image table, and add the SHA1 string as a field to the
#page_live.ex
def handle_progress(:image, ...) when entry.done? do
  with {:magic, {:ok, %{mime_type: mime}}} <-
         {:magic, magic_check(path)},
       file_binary <- File.read!(path),
       sha1 <- App.Image.calc_sha1(file_binary),
       {:sha_check, :ok} <- {:sha_check, App.Image.check_sha1(sha1)},
       {:image_info, {mimetype, width, height, _variant}} <-..... do
    image_info = %ImageInfo{
      ...,
      sha1: sha1
    }
    ...
  else
    {:sha_check, nil} -> push_event(socket, "toast", %{message: "Already present"})
and the functions are:
def calc_sha1(file_binary) do
  :crypto.hash(:sha, file_binary)
  |> Base.encode16()
end

# Returns :ok when no image with this SHA1 exists yet, nil otherwise.
def check_sha1(sha1) do
  App.Repo.get_by(App.Image, %{sha1: sha1})
  |> case do
    nil ->
      :ok
    _ ->
      nil
  end
end
The Index file is not a vector, and seems small (10 kB for 5 entries). I erased the "indexes.bin" file, and as per my logs it takes 4 ms to return it as a binary buffer.
[debug] QUERY OK source="hnswlib_index" db=3.6ms decode=1.5ms queue=53.1ms idle=0.0ms
SELECT h0."id", h0."file" FROM "hnswlib_index" AS h0 WHERE (h0."id" = $1) [1]
↳ App.HnswlibIndex.maybe_load_index_from_db/3, at: lib/app/hnswlib_index.ex:29
[info] Loading Index from DB
It seems quite reasonable.
I guess you can use extensions, such as PostGIS. See this thread. But it is clear as mud, sorry. As for the locks, I really don't know. I was thinking more that the tests are more demanding on race conditions. But I would leave it like this, it's already quite some gymnastics. |
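The SHA-1 duplicate check from this comment can be tried without a database. A minimal sketch, assuming an in-memory `MapSet` in place of the `sha1` column lookup; the module `DedupSketch` and the function `register/2` are hypothetical names, and only `calc_sha1/1` mirrors the function shown above.

```elixir
defmodule DedupSketch do
  # Same computation as calc_sha1/1 above: SHA-1 digest, hex-encoded (uppercase).
  def calc_sha1(file_binary) do
    :crypto.hash(:sha, file_binary)
    |> Base.encode16()
  end

  # `seen` is a MapSet of known hashes, standing in for the indexed DB column.
  def register(seen, file_binary) do
    sha1 = calc_sha1(file_binary)

    if MapSet.member?(seen, sha1) do
      {:error, :already_present}
    else
      {:ok, MapSet.put(seen, sha1)}
    end
  end
end

{:ok, seen} = DedupSketch.register(MapSet.new(), "abc")
# Uploading the same bytes again is rejected before any further processing.
duplicate = DedupSketch.register(seen, "abc")
{:ok, seen_after} = DedupSketch.register(seen, "another image")
```

Because the hash is computed from the file contents only, two uploads with different file names but identical bytes map to the same key.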
Sorry for pushing all this even further, but I discovered a problem. I did not take into account the fact that 2 different users may upload the same image at almost the same time. In this case, the SHA check won't work because the image is saved to the DB after the caption is computed, which takes a few seconds. This leaves room for another user to upload the same image (thus with the same SHA) while the DB is not yet updated with the SHA. I hope this makes sense. So I looked at optimistic locks. They seemed pretty easy to use, so I added one, with just a few changes, exactly as per the doc. My main fear is how to test all this, but let's first focus on understanding the code! The optimistic lock changes (just following the documentation mentioned above):
defmodule App.Repo.Migrations.CreateTableHnswlibIndex do
  use Ecto.Migration

  def change do
    create_if_not_exists table("hnswlib_index") do
      # <- added for the optimistic lock
      add :lock_version, :integer, default: 1
      add :file, :binary
    end
  end
end

schema "hnswlib_index" do
  field(:file, :binary)
  # <- added for the optimistic lock
  field(:lock_version, :integer, default: 1)
end

def changeset(struct \\ %__MODULE__{}, params \\ %{}) do
  struct
  |> Ecto.Changeset.cast(params, [:id, :file])
  # <- added for the optimistic lock
  |> Ecto.Changeset.optimistic_lock(:lock_version)
  |> Ecto.Changeset.validate_required([:id])
end
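To make the idea behind `optimistic_lock/3` concrete without a database, here is a minimal sketch that uses an `Agent` as a stand-in for the "hnswlib_index" row. It is not Ecto code: `VersionedStore` and its functions are hypothetical, and the version comparison only mimics an update guarded by `WHERE lock_version = ?`.

```elixir
defmodule VersionedStore do
  # Hypothetical in-memory stand-in for the single "hnswlib_index" row.
  def start_link(file) do
    Agent.start_link(fn -> %{file: file, lock_version: 1} end)
  end

  def read(store), do: Agent.get(store, & &1)

  # Writes only if the caller's copy still carries the current version,
  # then bumps the version so that older copies become stale.
  def update(store, %{lock_version: seen}, new_file) do
    Agent.get_and_update(store, fn
      %{lock_version: ^seen} = row ->
        updated = %{row | file: new_file, lock_version: seen + 1}
        {{:ok, updated}, updated}

      row ->
        {{:error, :stale}, row}
    end)
  end
end

{:ok, store} = VersionedStore.start_link("v0")

# Two users read the same version of the row...
copy_a = VersionedStore.read(store)
copy_b = VersionedStore.read(store)

# ...the first write wins, the second one is rejected as stale.
{:ok, _} = VersionedStore.update(store, copy_a, "from A")
result_b = VersionedStore.update(store, copy_b, "from B")
```

One thing worth checking in the real code: with Ecto, a failed version check on update raises `Ecto.StaleEntryError` unless the `:stale_error_field` option is passed to the Repo call, so `App.HnswlibIndex.save/0` presumably has to rescue the exception or use that option for the `{:error, :save_index, _, _}` branch to be reached. This is an observation about Ecto, not about the PR's code.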
The Ecto.Multi, where a potential race condition when saving the Index into the DB is caught and reported with a "Please retry" toast:
Ecto.Multi.new()
# save updated Image to DB
|> Ecto.Multi.run(:update_image, fn _, _ ->
  Map.put(image, :idx, idx)
  |> App.Image.insert()
end)
# save Index file to DB
|> Ecto.Multi.run(:save_index, fn _, _ ->
  App.HnswlibIndex.save()
end)
|> App.Repo.transaction()
|> case do
  {:error, :update_image, _changeset, _} ->
    {:noreply,
     socket
     |> push_event("toast", %{message: "Invalid entry"})
     |> assign(running?: false, index: index, task_ref: nil, label: nil)}

  {:error, :save_index, _, _} ->
    {:noreply,
     socket
     |> push_event("toast", %{message: "Please retry"})
     |> assign(running?: false, index: index, task_ref: nil, label: nil)}

  {:ok, _} ->
    {:noreply,
     socket
     |> assign(running?: false, index: index, task_ref: nil, label: label)}
end
else.... |
# Conflicts: # mix.exs # mix.lock
@LuchoTurtle |
There's a reason for this Here's a link for a small explanation on this: Other than that, I think this PR is ready to be merged, don't you agree? Don't worry about the HTML, that can be worked on a different issue :) |
# Conflicts: # mix.exs # mix.lock
Some remarks:
# config/config.exs
config :app, AppWeb.Endpoint,
  # adapter: Bandit.PhoenixAdapter,
# mix.exs
[...,
  # {:bandit, "~> 1.0"},
  {:plug_cowboy, "~> 2.7.0"},
...]
{:bumblebee, "~> 0.5.0"},
{:exla, "~> 0.7.0"},
{:nx, "~> 0.7.0"},
{:hnswlib, "~> 0.1.5"},
and I could not make it work due to the error below:
[warning] The on_load function for module Elixir.EXLA.NIF returned:
{:error, {:load_failed,
~c"Failed to load NIF library: 'dlopen(/Users/nevendrean/code/elixir/image-classifier/_build/dev/lib/exla/priv/libexla.so
I erased "_build" and the local cache, and tried to export XLA_CACHE_DIR=/Users/nevendrean/Library/Caches/xla/exla
I don't understand what happens and lost quite some time on this.
# application.ex
defp via_tuple(name) do
  {:via, Registry, {MyRegistry, name}}
end

children = [
  ...,
  {Registry, keys: :unique, name: MyRegistry},
  %{
    id: "audio",
    start:
      {App.ModelLoader, :start_link,
       [
         [
           name: via_tuple("audio"),
           space: :cosine,
           model: App.Models.get_audio_prod_model()
         ]
       ]}
  },
  %{
    id: "caption",
    start:
      {App.ModelLoader, :start_link,
       [....]
      }
  },
  .....
However, this fails as I still see a sequential loading, so it may not be the way to go. |
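A supervisor starts its children one after the other and waits for each `init/1` to return, so registering the loaders under different names does not by itself make them load in parallel. One way around this, sketched below under stated assumptions, is to return from `init/1` immediately and do the slow work in `handle_continue/2`. `DeferredLoader` and `DemoRegistry` are hypothetical names, and `Process.sleep/1` stands in for the actual model loading; this is not the PR's `App.ModelLoader`.

```elixir
defmodule DeferredLoader do
  use GenServer

  # Hypothetical stand-in for a model-loading server.
  def start_link(opts) do
    name = Keyword.fetch!(opts, :name)
    GenServer.start_link(__MODULE__, opts, name: via_tuple(name))
  end

  def status(name), do: GenServer.call(via_tuple(name), :status)

  defp via_tuple(name), do: {:via, Registry, {DemoRegistry, name}}

  @impl true
  def init(opts) do
    # Return right away so the supervisor can start the next child;
    # the slow part runs in handle_continue/2 inside this process.
    state = %{status: :loading, delay: Keyword.get(opts, :delay, 200)}
    {:ok, state, {:continue, :load}}
  end

  @impl true
  def handle_continue(:load, state) do
    # Simulates a slow model load.
    Process.sleep(state.delay)
    {:noreply, %{state | status: :ready}}
  end

  @impl true
  def handle_call(:status, _from, state), do: {:reply, state.status, state}
end

children = [
  {Registry, keys: :unique, name: DemoRegistry},
  Supervisor.child_spec({DeferredLoader, name: "audio"}, id: :audio),
  Supervisor.child_spec({DeferredLoader, name: "caption"}, id: :caption)
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Calls queue up behind handle_continue/2, so they answer once loading is done.
audio_status = DeferredLoader.status("audio")
caption_status = DeferredLoader.status("caption")
```

With this layout both simulated loads overlap, since each runs in its own process after the supervisor has moved on. The trade-off is that callers reaching a loader early wait until its load has finished, bounded by the `GenServer.call/2` timeout.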
@ndrean Thank you again for the awesome insight (as always!). 1 - I've made this change, thank you for the warning. Reverted. Thank you for the awesome work and the great value delivered on this PR! We'll make some changes on the HTML/CSS on a different view to make it more pleasing to the eye :) Thanks a lot! |
@LuchoTurtle |
@LuchoTurtle I am in the process of reading the Readme. |
@LuchoTurtle
First trial. Still amazed to see what you can do.
It works, probably needs a serious read. I close the other PR.
The embedding model is started via a GenServer. Maybe we can or should include this in your Models.ex.
Reset the DB, and run the server. Logs should display "New Index".
Trials:
In my experience the semantic search is not perfect, especially if the audio transcription fails. It also depends a lot on the words you use. If you ask for a picture of a cat, you may get a dog or a house. This is where a threshold is needed. TBC!
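The threshold idea can be sketched with plain lists, independently of HNSWLib. A minimal sketch, assuming cosine distance (matching the `space: :cosine` setting used above) and a non-empty candidate list; the module `ThresholdSketch`, the toy 2-dimensional vectors, and the cutoff value 0.2 are illustrative assumptions, not values from the PR.

```elixir
defmodule ThresholdSketch do
  # Cosine distance: 0.0 for the same direction, up to 2.0 for opposite ones.
  def cosine_distance(a, b) do
    dot = Enum.zip(a, b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm = fn v -> :math.sqrt(Enum.reduce(v, 0.0, fn x, acc -> acc + x * x end)) end
    1.0 - dot / (norm.(a) * norm.(b))
  end

  # Returns the closest label only if it is close enough; otherwise :no_match.
  # Assumes `candidates` is a non-empty list of {label, vector} tuples.
  def nearest_within(query, candidates, max_distance) do
    {label, distance} =
      candidates
      |> Enum.map(fn {label, vec} -> {label, cosine_distance(query, vec)} end)
      |> Enum.min_by(fn {_label, d} -> d end)

    if distance <= max_distance, do: {:ok, label}, else: :no_match
  end
end

candidates = [{"cat", [1.0, 0.1]}, {"house", [0.0, 1.0]}]

# A query close to the "cat" vector passes the cutoff.
hit = ThresholdSketch.nearest_within([1.0, 0.0], candidates, 0.2)
# A query far from every stored vector returns no result instead of a poor one.
miss = ThresholdSketch.nearest_within([-1.0, 0.0], candidates, 0.2)
```

Without the cutoff, the second query would still return its nearest neighbour ("house"), which is the "ask for a cat, get a house" behaviour described above.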