Semantic search added #47

Merged
merged 105 commits into from
Mar 4, 2024
140c49c
first trial
ndrean Jan 22, 2024
c78f111
corrected dialyzer error case
ndrean Jan 23, 2024
4bf1276
corrected dialyzer error case
ndrean Jan 23, 2024
76ac055
corrected dialyzer error case in Readme
ndrean Jan 23, 2024
41ea744
align transcription output with caption
ndrean Jan 23, 2024
d61913d
align transcription output & image with caption
ndrean Jan 23, 2024
bfca6d2
add index in db
ndrean Jan 23, 2024
85ef304
loading from DB if index file erased
ndrean Jan 24, 2024
67476f2
loading from DB if index file erased
ndrean Jan 24, 2024
c88ef83
loading from DB if index file erased: cleaning
ndrean Jan 24, 2024
9c65846
with Ecto.Multi & transaction
ndrean Jan 24, 2024
ffc4739
added SHA to get unique files
ndrean Jan 24, 2024
b82df7d
created unique index on :sha1 for images & schema capture
ndrean Jan 24, 2024
db26ee7
changed SHA1 test to early failure
ndrean Jan 24, 2024
7993b5d
refactor SHA & doc with functions
ndrean Jan 25, 2024
6f99a58
added pg index on image_idx
ndrean Jan 25, 2024
aebaa10
added optimistic lock & refactored Multi
ndrean Jan 25, 2024
631c5d7
modified image saving sequence & keep lock on index file
ndrean Jan 27, 2024
1bf22cf
refactored to remove credo warnings for depth
ndrean Jan 27, 2024
a1897a1
refactored to remove credo warnings for depth
ndrean Jan 27, 2024
78c6597
recreate index, uuid on tmp.wav, try/rescue, Embedding in Models
ndrean Feb 3, 2024
5529a16
added check_load to Models
ndrean Feb 3, 2024
00a03e4
starting to modify Readme
ndrean Feb 3, 2024
860eae1
starting to modify Readme
ndrean Feb 3, 2024
b5dee2c
continue Readme
ndrean Feb 4, 2024
e0a60d3
added plug_cowboy ~>2.7 since bug and check_index on start-up
ndrean Feb 5, 2024
e6c6041
removed index from socket state and put into KnnIndex GenServer
ndrean Feb 5, 2024
4d414da
added on_mount for Index integrity => 404 warning
ndrean Feb 6, 2024
69dff02
refactored to remove nested credo warnings
ndrean Feb 6, 2024
d9ad772
changed order in transaction
ndrean Feb 7, 2024
336efdb
tested release
ndrean Feb 8, 2024
28ba1e2
added starte tests
ndrean Feb 8, 2024
c4684d3
remove duplicated
ndrean Feb 8, 2024
f5a5fd8
remove duplicated
ndrean Feb 8, 2024
ef4fc75
remove duplicated
ndrean Feb 8, 2024
8fc5f7f
chore: Trying to get `mix c` to work.
LuchoTurtle Feb 9, 2024
5bff3c9
chore: Mix test now runs. Though the tests fail.
LuchoTurtle Feb 10, 2024
d1182a9
rm on_mount and add integrity in GS init
ndrean Feb 10, 2024
5dc9c90
forgot hnwslib 0.14; credo warning...
ndrean Feb 10, 2024
49d4bd0
changed HnwslibIndex test file
ndrean Feb 10, 2024
6f2c3ea
chore: Changing html in hopes to fix test.
LuchoTurtle Feb 11, 2024
7950425
chore: Tests now can be executed properly and reset after every run.
LuchoTurtle Feb 12, 2024
7f46e6c
chore: Simplifying test runs.
LuchoTurtle Feb 12, 2024
de7f544
fix: Ignoring index files (shouldn't be on git).
LuchoTurtle Feb 12, 2024
c8e4c5d
chore: Forcing tests to run sync.
LuchoTurtle Feb 12, 2024
905b2db
chore: Adding tests for image.
LuchoTurtle Feb 13, 2024
74669db
refactored check_integrity for GenServer testing.
ndrean Feb 13, 2024
368cf3a
refactored check_integrity for GenServer testing.
ndrean Feb 13, 2024
23cb3e0
chore: Adding testing to hnswlib_index schema.
LuchoTurtle Feb 13, 2024
0f1f904
Merge branch 'semantic' of https://github.com/ndrean/image-classifier…
LuchoTurtle Feb 13, 2024
e564de5
removed edge case and added indexes1,2.bin"
ndrean Feb 13, 2024
0ebdb38
removed edge case and added indexes_gen_test_1,2.bin"
ndrean Feb 13, 2024
282498b
Merge branch 'semantic' of https://github.com/ndrean/image-classifier…
LuchoTurtle Feb 13, 2024
fb3d122
chore: Remove unused test.
LuchoTurtle Feb 14, 2024
5548ea1
comments on geneserver init tests corrected
ndrean Feb 14, 2024
9af8a07
comments on genserver init tests corrected
ndrean Feb 14, 2024
1a2952c
chore: Removing unnecessary code. It won't ever be used because it's …
LuchoTurtle Feb 15, 2024
e8a2d4b
Merge branch 'semantic' of https://github.com/ndrean/image-classifier…
LuchoTurtle Feb 15, 2024
ccc582a
adding test on early stop Index empty
ndrean Feb 15, 2024
80247d3
chore: Changing test timeout, since it takes more than a minute.
LuchoTurtle Feb 15, 2024
2a7d127
Merge branch 'semantic' of https://github.com/ndrean/image-classifier…
LuchoTurtle Feb 15, 2024
d7ffec5
fix: Fixing failing test of notification when audio is uploaded on em…
LuchoTurtle Feb 15, 2024
1550ff4
add GenServer knn_search nil test
ndrean Feb 15, 2024
874e901
continue GS tests
ndrean Feb 16, 2024
685395a
end GS tests
ndrean Feb 16, 2024
19d6b2e
update bump
ndrean Feb 16, 2024
38d8d3d
tests doc
ndrean Feb 16, 2024
daf4a84
improved tests doc
ndrean Feb 16, 2024
8da6f40
tests on image operations
ndrean Feb 16, 2024
dddaae0
tests on image operations
ndrean Feb 16, 2024
dd49fb9
moved from Cowboy to Bandit & test correction
ndrean Feb 16, 2024
b9a4132
moved from Cowboy to Bandit & test correction
ndrean Feb 16, 2024
b85dbc1
chore: Adding resetting with empty indexes helper to tests and coveri…
LuchoTurtle Feb 17, 2024
666660a
chore: Adding failed index "please retry" test.
LuchoTurtle Feb 17, 2024
a9dffbd
fix: Fixing bucket error and testing it.
LuchoTurtle Feb 18, 2024
0aa1a3e
fix: Fixing upload error handling and partial image edge case tested.
LuchoTurtle Feb 18, 2024
77b3395
chore: Formatting hnswlib_index.ex
LuchoTurtle Feb 18, 2024
b5154ea
chore: Commenting and formatting knn_index.ex.
LuchoTurtle Feb 18, 2024
a8839f6
chore: Commenting and formatting models.
LuchoTurtle Feb 18, 2024
5fe4f7f
chore: Removing unused code and commenting.
LuchoTurtle Feb 18, 2024
7a68517
chore: Page_live.ex general formatting.
LuchoTurtle Feb 18, 2024
57d895a
chore: Formatting README (before image captioning).
LuchoTurtle Feb 19, 2024
540b08e
chore: Fixing some Image Captioning section errors.
LuchoTurtle Feb 19, 2024
58e915b
chore: Fixing typos and numbering.
LuchoTurtle Feb 19, 2024
da01dcd
chore: Formatting and fixing typos on the Semantic Search part.
LuchoTurtle Feb 19, 2024
777b412
chore: Formatting the README and adding `hnswlib_index` schema code.
LuchoTurtle Feb 19, 2024
40070c3
readme: Add section of image schema changes.
LuchoTurtle Feb 21, 2024
6f1fe4f
chore: Renaming socket assigns.
LuchoTurtle Feb 22, 2024
9993b98
readme: Adding section for page_live
LuchoTurtle Feb 22, 2024
90f0368
readme: Adding view section.
LuchoTurtle Feb 23, 2024
6fe7202
Merge branch 'main' into semantic
LuchoTurtle Feb 23, 2024
9d42ca9
fix: Fixing mix.lock
LuchoTurtle Feb 23, 2024
579c5fb
chore: Normalizing all loggers.
LuchoTurtle Feb 23, 2024
e1675ad
fix: Fixing models loading while testing and on prod.
LuchoTurtle Feb 23, 2024
8b0dca5
readme: Updating README.
LuchoTurtle Feb 23, 2024
12204a5
minor changes on redundant & shorter code: :if and remove dir creationH
ndrean Feb 24, 2024
2528176
Update README.md
ndrean Feb 24, 2024
a34f132
Update README.md
ndrean Feb 24, 2024
b2c7882
Update README.md
ndrean Feb 24, 2024
96f6ccd
Update README.md
ndrean Feb 24, 2024
72693bf
padding to record button
ndrean Feb 24, 2024
9a11eab
readme: Adding example gif.
LuchoTurtle Feb 24, 2024
e679569
Merge branch 'main' into semantic
LuchoTurtle Mar 3, 2024
a9c2d83
merge: Fixing conflicts.
LuchoTurtle Mar 3, 2024
7b5287d
chore: Not using cowboy.
LuchoTurtle Mar 4, 2024
9 changes: 8 additions & 1 deletion .gitignore
@@ -19,6 +19,9 @@ erl_crash.dump
# Also ignore archive artifacts (built via "mix archive.build").
*.ez

# Ignore DB dumps
*.db

# Temporary files, for example, from tests.
/tmp/

@@ -37,4 +40,8 @@ npm-debug.log

# Bumblebee model directory
.bumblebee/*
.elixir_ls
.elixir_ls

# KNN index directory
priv/static/uploads/indexes.bin

2,745 changes: 2,459 additions & 286 deletions README.md

Large diffs are not rendered by default.

11 changes: 7 additions & 4 deletions _comparison/manage_models.exs
@@ -12,13 +12,16 @@ defmodule Comparison.Models do
def verify_and_download_model(model, force_download? \\ false) do
case force_download? do
true ->
File.rm_rf!(model.cache_path) # Delete any cached pre-existing model
download_model(model) # Download model
# Delete any cached pre-existing model
File.rm_rf!(model.cache_path)
# Download model
download_model(model)

false ->
# Check whether the model cache directory is missing or empty.
# If so, we download the model.
model_location = Path.join(model.cache_path, "huggingface")

if not File.exists?(model_location) or File.ls!(model_location) == [] do
download_model(model)
end
@@ -50,7 +53,7 @@ defmodule Comparison.Models do
# It will load the model and the respective featurizer, tokenizer and generation config if needed,
# and return a map with all of these at the end.
defp load_offline_model_params(model) do
Logger.info("Loading #{model.name}...")
Logger.info("ℹ️ Loading #{model.name}...")

# Loading model
loading_settings = {:hf, model.name, cache_dir: model.cache_path, offline: true}
@@ -92,7 +95,7 @@ defmodule Comparison.Models do
# Downloads the models according to a given %ModelInfo struct.
# It will load the model and the respective featurizer, tokenizer and generation config if needed.
defp download_model(model) do
Logger.info("Downloading #{model.name}...")
Logger.info("ℹ️ Downloading #{model.name}...")

# Download model
downloading_settings = {:hf, model.name, cache_dir: model.cache_path}
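
The cache check in `verify_and_download_model/2` can be read in isolation. The following is a minimal sketch, assuming only the `"huggingface"` sub-directory convention visible in the diff; the module name `CacheCheckSketch`, the function `needs_download?/1`, and the temporary paths are hypothetical and not part of the PR.

```elixir
defmodule CacheCheckSketch do
  # Returns true when the cache sub-directory is missing or has no entries,
  # i.e. when a fresh download would be needed.
  def needs_download?(cache_path) do
    model_location = Path.join(cache_path, "huggingface")
    not File.exists?(model_location) or File.ls!(model_location) == []
  end
end

# Usage against a throwaway directory under the system tmp dir (hypothetical path).
tmp = Path.join(System.tmp_dir!(), "cache_check_sketch_#{System.unique_integer([:positive])}")
IO.inspect(CacheCheckSketch.needs_download?(tmp), label: "missing dir")

File.mkdir_p!(Path.join(tmp, "huggingface"))
IO.inspect(CacheCheckSketch.needs_download?(tmp), label: "empty dir")

File.write!(Path.join([tmp, "huggingface", "model.bin"]), "stub")
IO.inspect(CacheCheckSketch.needs_download?(tmp), label: "populated dir")

File.rm_rf!(tmp)
```

Only the populated directory should skip the download; a missing or empty directory both count as "no cached model".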
4 changes: 2 additions & 2 deletions _comparison/run.exs
@@ -80,7 +80,7 @@ defmodule Benchmark do
coco_dataset_images_path = File.cwd!() |> Path.join("coco_dataset") |> Path.join("*.jpg")
files = Path.wildcard(coco_dataset_images_path)

#coco_dataset_captions =
# coco_dataset_captions =
# File.stream!(File.cwd!() |> Path.join("coco_dataset") |> Path.join("captions.csv"))
# |> CSV.decode!()
# |> Enum.map(& &1)
@@ -120,7 +120,7 @@ defmodule Benchmark do

# Go over each image and make prediction
Enum.each(vips_images_with_captions, fn image ->
Logger.info("Benchmarking image #{image.id}...")
Logger.info("📊 Benchmarking image #{image.id}...")

# Run the prediction
{time_in_microseconds, prediction} =
10 changes: 4 additions & 6 deletions assets/js/micro.js
@@ -11,23 +11,21 @@ export default {
blue = ["bg-blue-500", "hover:bg-blue-700"],
pulseGreen = ["bg-green-500", "hover:bg-green-700", "animate-pulse"];


_this = this;

// Adding event listener for "click" event
recordButton.addEventListener("click", () => {

// Check if it's recording.
// If it is, we stop the record and update the elements.
if (mediaRecorder && mediaRecorder.state === "recording") {
mediaRecorder.stop();
// audioChunks.getAudioTracks()[0].stop();
text.textContent = "Record";
}
}

// Otherwise, it means the user wants to start recording.
else {
navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {

// Instantiate MediaRecorder
mediaRecorder = new MediaRecorder(stream);
mediaRecorder.start();
@@ -39,7 +37,7 @@ export default {

// Add "dataavailable" event handler
mediaRecorder.addEventListener("dataavailable", (event) => {
audioChunks.push(event.data);
event.data.size > 0 && audioChunks.push(event.data);
});

// Add "stop" event handler for when the recording stops.
@@ -57,4 +55,4 @@ export default {
}
});
},
};
};
2 changes: 1 addition & 1 deletion config/config.exs
@@ -13,7 +13,7 @@ config :app,
generators: [timestamp_type: :utc_datetime]

# Tells `NX` to use `EXLA` as backend
# config :nx, default_backend: EXLA.Backend
# config :nx, default_backend: EXLA.Backend
# needed to run on `Fly.io`
config :nx, :default_backend, {EXLA.Backend, client: :host}

1 change: 0 additions & 1 deletion config/dev.exs
@@ -10,7 +10,6 @@ config :app, App.Repo,
show_sensitive_data_on_connection_error: true,
pool_size: 10


# For development, we disable any cache and enable
# debugging and code reloading.
#
3 changes: 2 additions & 1 deletion config/test.exs
@@ -26,7 +26,8 @@ config :logger, level: :warning
# Initialize plugs at runtime for faster test compilation
config :phoenix, :plug_init_mode, :runtime


# App configuration
config :app,
start_genserver: false,
knnindex_indices_test: true,
use_test_models: true
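
These test-only flags are read elsewhere in the diff with `Application.get_env/3` and a default of `true`, so the index GenServer starts unless the flag is explicitly switched off. A minimal sketch of that default behaviour, using a hypothetical app name `:sketch_app` so that no real configuration is touched:

```elixir
# `:sketch_app` is a made-up application name used only for illustration.
Application.put_env(:sketch_app, :start_genserver, false)

# The key is set, so the configured value wins over the default.
IO.inspect(Application.get_env(:sketch_app, :start_genserver, true), label: "flag set")

Application.delete_env(:sketch_app, :start_genserver)

# The key is absent, so the third argument is returned.
IO.inspect(Application.get_env(:sketch_app, :start_genserver, true), label: "flag unset")
```

With this pattern only `config/test.exs` needs to mention the flag; dev and prod fall through to the default.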
4 changes: 2 additions & 2 deletions deployment.md
@@ -926,7 +926,7 @@ defmodule App.Models do
# It will load the model and the respective featurizer, tokenizer and generation config if needed,
# and return a map with all of these at the end.
defp load_offline_model(model) do
Logger.info("Loading #{model.name}...")
Logger.info("ℹ️ Loading #{model.name}...")

# Loading model
loading_settings = {:hf, model.name, cache_dir: model.cache_path, offline: true}
@@ -968,7 +968,7 @@ defmodule App.Models do
# Downloads the models according to a given %ModelInfo struct.
# It will load the model and the respective featurizer, tokenizer and generation config if needed.
defp download_model(model) do
Logger.info("Downloading #{model.name}...")
Logger.info("ℹ️ Downloading #{model.name}...")

# Download model
downloading_settings = {:hf, model.name, cache_dir: model.cache_path}
127 changes: 127 additions & 0 deletions example.txt
@@ -0,0 +1,127 @@
Mix.install([
{:bumblebee, "~> 0.4.2"},
{:exla, "~> 0.6.4"},
{:nx, "~> 0.6.4"},
{:hnswlib, "~> 0.1.4"}
])

Nx.global_default_backend(EXLA.Backend)

{:ok, index} = HNSWLib.Index.new(_space = :cosine, _dim = 384, _max_elements = 200)
transformer = "sentence-transformers/paraphrase-MiniLM-L6-v2"
{:ok, %{model: _model, params: _params} = model_info} =
Bumblebee.load_model({:hf, transformer})

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, transformer})
serving = Bumblebee.Text.TextEmbedding.text_embedding(
model_info,
tokenizer,
defn_options: [compiler: EXLA, lazy_transfers: :never]
#output_pool: :mean_pooling,
#output_attribute: :hidden_state,
#embedding_processor: :l2_norm,
)

%{embedding: data} = Nx.Serving.run(serving, "small") |> dbg()
HNSWLib.Index.add_items(index, data)
HNSWLib.Index.get_current_count(index) |> dbg()

%{embedding: data} = Nx.Serving.run(serving, "tall") |> dbg()
HNSWLib.Index.add_items(index, data)
HNSWLib.Index.get_current_count(index) |> dbg()

%{embedding: data} = Nx.Serving.run(serving, "high")
{:ok, labels, distances} = HNSWLib.Index.knn_query(index, data, k: 1) |> dbg()
idx = Nx.to_flat_list(labels[0])
{:ok, dt} = HNSWLib.Index.get_items(index, idx)
Nx.stack(Enum.map(dt, fn d -> Nx.from_binary(d, :f32) end))

defmodule Embedding do
use GenServer
@indexes "indexes.bin"

def start_link(norm) do
GenServer.start_link(__MODULE__, norm, name: __MODULE__)
end

# load an existing index file or create a new index
def init(norm) do
space = norm

{:ok, index} =
case File.exists?(@indexes) do
false ->
HNSWLib.Index.new(_space = space, _dim = 384, _max_elements = 200)

true ->
HNSWLib.Index.load_index(space, 384, @indexes)
end

model_info = nil
tokenizer = nil
{:ok, {model_info, tokenizer, index}, {:continue, :load}}
end

def handle_continue(:load, {_, _, index}) do
transformer = "sentence-transformers/paraphrase-MiniLM-L6-v2"

{:ok, %{model: _model, params: _params} = model_info} =
Bumblebee.load_model({:hf, transformer})

{:ok, tokenizer} =
Bumblebee.load_tokenizer({:hf, transformer})

{:noreply, {model_info, tokenizer, index}}
end

def serve() do
GenServer.call(__MODULE__, :serve)
end

def get_count do
GenServer.call(__MODULE__, :get_count)
end

def get_index do
GenServer.call(__MODULE__, :get_index)
end

def handle_call(:serve, _from, {model_info, tokenizer, index} = state) do
serving = Bumblebee.Text.TextEmbedding.text_embedding(
model_info,
tokenizer,
output_pool: :mean_pooling,
output_attribute: :hidden_state,
embedding_processor: :l2_norm,
defn_options: [compiler: EXLA, lazy_transfers: :never]
)
{:reply, {serving, index}, state}
end

def handle_call(:get_count, _, {_, _, index} = state) do
{:ok, count} = HNSWLib.Index.get_current_count(index)
{:reply, count, state}
end

def handle_call(:get_index, _, {_, _, index} = state) do
{:reply, index, state}
end
end

{:ok, pid} = GenServer.start_link(Embedding, :l2)

{serving, index} = GenServer.call(pid, :serve)

%{embedding: data} = Nx.Serving.run(serving, "small") |> dbg()
HNSWLib.Index.add_items(index, data)
GenServer.call(pid, :get_count) |> dbg()

%{embedding: data} = Nx.Serving.run(serving, "tall") |> dbg()
HNSWLib.Index.add_items(index, data)
GenServer.call(pid, :get_count) |> dbg()

%{embedding: data3} = Nx.Serving.run(serving, "high")
{:ok, labels, distances} = HNSWLib.Index.knn_query(index, data3, k: 1) |> dbg()
idx = Nx.to_flat_list(labels[0])
{:ok, dt} = HNSWLib.Index.get_items(index, idx)
Nx.stack(Enum.map(dt, fn d -> Nx.from_binary(d, :f32) end))
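
The script above delegates nearest-neighbour search to HNSWLib with the `:cosine` space. As a rough illustration of what such a query approximates, here is a brute-force sketch over plain lists. It is not the HNSW algorithm and does not use the HNSWLib API; the module `CosineSketch`, its functions, and the toy 3-dimensional vectors are all assumptions made for the example (the real embeddings have 384 dimensions).

```elixir
defmodule CosineSketch do
  # Cosine distance: 1 - (a . b) / (|a| * |b|)
  def distance(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    1.0 - dot / (norm(a) * norm(b))
  end

  # Returns the `k` closest {label, distance} pairs, nearest first.
  def knn(items, query, k) do
    items
    |> Enum.map(fn {label, vector} -> {label, distance(vector, query)} end)
    |> Enum.sort_by(fn {_label, dist} -> dist end)
    |> Enum.take(k)
  end

  # Euclidean norm of a vector given as a list of floats.
  defp norm(v), do: v |> Enum.reduce(0.0, fn x, acc -> acc + x * x end) |> :math.sqrt()
end

# Two stored vectors with integer labels, then a query close to the first one.
items = [{0, [1.0, 0.0, 0.0]}, {1, [0.0, 1.0, 0.0]}]
[{label, _dist}] = CosineSketch.knn(items, [0.9, 0.1, 0.0], 1)
IO.inspect(label, label: "nearest label")
```

A brute-force scan like this is linear in the number of stored vectors, which is the cost an approximate index is meant to avoid.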
52 changes: 40 additions & 12 deletions lib/app/application.ex
@@ -5,10 +5,28 @@ defmodule App.Application do
require Logger
use Application

@impl true
def start(_type, _args) do
@upload_dir Application.app_dir(:app, ["priv", "static", "uploads"])

@saved_index if Application.compile_env(:app, :knnindex_indices_test, false),
do: Path.join(@upload_dir, "indexes_test.bin"),
else: Path.join(@upload_dir, "indexes.bin")

def check_models_on_startup do
App.Models.verify_and_download_models()
|> case do
{:error, msg} ->
Logger.error("⚠️ #{msg}")
System.stop(0)

:ok ->
Logger.info("ℹ️ Models: ✅")
:ok
end
end

@impl true
def start(_type, _args) do
:ok = check_models_on_startup()

children = [
# Start the Telemetry supervisor
@@ -18,17 +36,16 @@ defmodule App.Application do
# Start the PubSub system
{Phoenix.PubSub, name: App.PubSub},
# Nx serving for the embedding
# App.TextEmbedding,

{Nx.Serving, serving: App.Models.embedding(), name: Embedding, batch_size: 1},
# Nx serving for Speech-to-Text
{Nx.Serving,
serving:
if Application.get_env(:app, :use_test_models) == true do
App.Models.audio_serving_test()
else
App.Models.audio_serving()
end,
name: Whisper},
serving:
if Application.get_env(:app, :use_test_models) == true do
App.Models.audio_serving_test()
else
App.Models.audio_serving()
end,
name: Whisper},
# Nx serving for image classifier
{Nx.Serving,
serving:
@@ -39,7 +56,7 @@ defmodule App.Application do
end,
name: ImageClassifier},
{GenMagic.Server, name: :gen_magic},

# Adding a supervisor
{Task.Supervisor, name: App.TaskSupervisor},
# Start the Endpoint (http/https)
@@ -48,6 +65,17 @@ defmodule App.Application do
# {App.Worker, arg}
]

# We start the HNSWLib Index GenServer only when not testing.
# Because this GenServer needs the database to be seeded first,
# it is left out of the supervision tree during tests.
# When testing, you need to spawn this process manually (it is done in the test_helper.exs file).
children =
if Application.get_env(:app, :start_genserver, true) == true do
Enum.concat(children, [{App.KnnIndex, [space: :cosine, index: @saved_index]}])
else
children
end

# See https://hexdocs.pm/elixir/Supervisor.html
# for other strategies and supported options
opts = [strategy: :one_for_one, name: App.Supervisor]
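
The two decisions made in this file, which index file to use and whether to supervise the index GenServer, can be sketched as pure functions. This is an illustrative sketch only: `ChildrenSketch`, `index_path/2`, and `children/3` are hypothetical names, and `App.KnnIndex` appears purely as a child-spec tuple that is never started here.

```elixir
defmodule ChildrenSketch do
  # Picks the index file name from a flag, mirroring the compile-time choice in the diff.
  def index_path(upload_dir, test_indices?) do
    file = if test_indices?, do: "indexes_test.bin", else: "indexes.bin"
    Path.join(upload_dir, file)
  end

  # Appends the index child spec only when `start_genserver?` is true.
  def children(base, start_genserver?, index_path) do
    if start_genserver? do
      base ++ [{App.KnnIndex, [space: :cosine, index: index_path]}]
    else
      base
    end
  end
end

path = ChildrenSketch.index_path("priv/static/uploads", false)
IO.inspect(ChildrenSketch.children([:placeholder_child], true, path), label: "normal start")
IO.inspect(ChildrenSketch.children([:placeholder_child], false, path), label: "test start")
```

Keeping the conditional outside the literal child list leaves the base list identical across environments, so only the last entry differs between a normal start and a test run.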