Merge pull request #116 from smritae01/dev
Add Proof of Concept work
trgardos authored Dec 7, 2023
2 parents 0e0758f + 49886d2 commit 0625948
Showing 125 changed files with 3,308 additions and 0 deletions.
Binary file added POC/02241014.jpg
610 changes: 610 additions & 0 deletions POC/AzureVision-resized.ipynb

Large diffs are not rendered by default.

280 changes: 280 additions & 0 deletions POC/AzureVision.ipynb
@@ -0,0 +1,280 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2f17ba8c",
"metadata": {},
"source": [
"# Azure Vision Implementation - Dima"
]
},
{
"cell_type": "markdown",
"id": "91f6a7e3",
"metadata": {},
"source": [
"This notebook utilizes Azure AI Document Intelligence Studio to extract text from a set of Herbarium specimens, obtained from: https://www.gbif.org/occurrence/gallery?basis_of_record=PRESERVED_SPECIMEN&media_ty%5B%E2%80%A6%5Daxon_key=6&year=1000,1941&advanced=1&occurrence_status=present\n",
"\n",
"A selection of 30 specimens was downloaded to the /projectnb/sparkgrp/ml-herbarium-grp/fall2023/LLM_Specimens folder. \n",
"\n",
"The folder is made up of:\n",
"1) 20 images that contain pure text, ranging from plain print to hard-to-read cursive\n",
"2) 10 images that contain both the visual plant specimen and the attached textual labels\n",
"\n",
"Special care was taken to select a diverse collection of specimens, ranging in text quality and type.\n",
"\n",
"Regarding the 10 mixed images: in images that contain both the plant specimen and the actual text, the text was generally too small and/or too blurry to be deciphered by any LLM. Next steps would include improving the quality of the text so that an LLM can analyze it. \n",
"\n",
"Currently, the notebook takes an input image from /projectnb/sparkgrp/ml-herbarium-grp/fall2023/LLM_Specimens, runs it through Azure Vision, and analyzes all text. It then creates a PDF with the original image and an annotated image that has boxes around identified words, with the predicted words written over the original text. Below the images, the entire identified text is printed along with the confidence score for each identified term. All of this is saved to /projectnb/sparkgrp/ml-herbarium-grp/fall2023/AzureVision-results\n",
"\n",
"Immediate next steps:\n",
"\n",
"1. Obtain a student Microsoft Azure account to finish the work (testing was done with a personal account, whose free credits ran out)\n",
"2. Improve the annotated images - currently the predicted text is hard to read; change it so that it sits above the original words. \n",
"3. Integrate GPT-4 to parse the extracted text into a format that clearly returns the species, collection date, and geography. \n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "bc1c7278",
"metadata": {},
"outputs": [],
"source": [
"#!pip install azure-ai-formrecognizer --pre\n",
"#!pip install opencv-python-headless matplotlib\n",
"#!pip install matplotlib pillow\n",
"#!pip install ipywidgets\n",
"#!pip install shapely\n",
"#!pip install openai\n",
"#!pip install reportlab"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c1566288",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Couldn't get a file descriptor referring to the console\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"An error occurred while processing /projectnb/sparkgrp/ml-herbarium-grp/fall2023/LLM_Specimens/Text_Sample_19.png: (403) Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n",
"Code: 403\n",
"Message: Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n",
"An error occurred while processing /projectnb/sparkgrp/ml-herbarium-grp/fall2023/LLM_Specimens/Text_Sample_10.png: (403) Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n",
"Code: 403\n",
"Message: Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n",
"An error occurred while processing /projectnb/sparkgrp/ml-herbarium-grp/fall2023/LLM_Specimens/Text_Sample_6.png: (403) Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n",
"Code: 403\n",
"Message: Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n",
"An error occurred while processing /projectnb/sparkgrp/ml-herbarium-grp/fall2023/LLM_Specimens/Mixed_Sample_7.png: (403) Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n",
"Code: 403\n",
"Message: Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n",
"An error occurred while processing /projectnb/sparkgrp/ml-herbarium-grp/fall2023/LLM_Specimens/Text_Sample_4.png: (403) Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n",
"Code: 403\n",
"Message: Out of call volume quota for FormRecognizer F0 pricing tier. Please retry after 6 days. To increase your call volume switch to a paid tier.\n"
]
}
],
"source": [
"from azure.core.credentials import AzureKeyCredential\n",
"from azure.ai.formrecognizer import DocumentAnalysisClient\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib.image as mpimg\n",
"from PIL import Image, ImageDraw, ImageFont\n",
"import openai\n",
"import re\n",
"import os\n",
"from reportlab.lib.pagesizes import letter\n",
"from reportlab.pdfgen import canvas\n",
"\n",
"\n",
"# Azure Cognitive Services endpoint and key\n",
"endpoint = \"https://herbariumsamplerecognition.cognitiveservices.azure.com/\"\n",
"# Read the key from the environment rather than hard-coding it (the original committed a literal key)\n",
"key = os.environ.get(\"AZURE_FORMRECOGNIZER_KEY\", \"<your-key>\")\n",
"\n",
"def sanitize_filename(filename):\n",
" # Keep only alphanumerics, underscores, whitespace, dots, and hyphens\n",
" return re.sub(r'[^\\w\\s\\.-]', '', filename)\n",
"\n",
"def format_bounding_box(bounding_box):\n",
" if not bounding_box:\n",
" return \"N/A\"\n",
" return \", \".join([\"[{}, {}]\".format(p.x, p.y) for p in bounding_box])\n",
"\n",
"def draw_boxes(image_path, words):\n",
" original_image = Image.open(image_path)\n",
" annotated_image = original_image.copy()\n",
" draw = ImageDraw.Draw(annotated_image)\n",
"\n",
" for word in words:\n",
" polygon = word['polygon']\n",
" if polygon:\n",
" bbox = [(point.x, point.y) for point in polygon]\n",
" try:\n",
" # Drop characters that cannot be encoded as ASCII\n",
" text_content = word['content'].encode('ascii', 'ignore').decode('ascii')\n",
" except Exception as e:\n",
" print(f\"Error processing text {word['content']}: {e}\")\n",
" text_content = \"Error\"\n",
" draw.polygon(bbox, outline=\"red\")\n",
" draw.text((bbox[0][0], bbox[0][1]), text_content, fill=\"green\")\n",
" \n",
" return annotated_image\n",
"\n",
"\n",
"def parse_document_content(content):\n",
" openai.api_key = 'your-api-key'\n",
"\n",
" try:\n",
" # GPT-4 is a chat model, so it must be called via the ChatCompletion API (openai < 1.0 style)\n",
" response = openai.ChatCompletion.create(\n",
" model=\"gpt-4\",\n",
" messages=[{\"role\": \"user\", \"content\": f\"Extract specific information from the following text: {content}\\n\\nSpecies Name: \"}],\n",
" max_tokens=100\n",
" # Add additional parameters as needed\n",
" )\n",
" parsed_data = response.choices[0].message.content.strip()\n",
" return parsed_data\n",
" except Exception as e:\n",
" print(\"An error occurred:\", e)\n",
" return None\n",
"\n",
"\n",
"def analyze_read(image_path, output_path, show_first_output=False):\n",
" try:\n",
" with open(image_path, \"rb\") as f:\n",
" image_stream = f.read()\n",
"\n",
" document_analysis_client = DocumentAnalysisClient(\n",
" endpoint=endpoint, credential=AzureKeyCredential(key)\n",
" )\n",
"\n",
" poller = document_analysis_client.begin_analyze_document(\n",
" \"prebuilt-read\", image_stream)\n",
" result = poller.result()\n",
"\n",
" # Collect words, their polygon data, and confidence\n",
" words = []\n",
" confidence_text = \"\"\n",
" for page in result.pages:\n",
" for word in page.words:\n",
" words.append({\n",
" 'content': word.content,\n",
" 'polygon': word.polygon\n",
" })\n",
" confidence_text += \"'{}' confidence {}\\n\".format(word.content, word.confidence)\n",
"\n",
" document_content = result.content + \"\\n\\nConfidence Metrics:\\n\" + confidence_text\n",
" #parsed_info = parse_document_content(document_content)\n",
"\n",
" original_image = Image.open(image_path)\n",
" annotated_img = draw_boxes(image_path, words)\n",
"\n",
" # Set up PDF\n",
" output_filename = os.path.join(output_path, sanitize_filename(os.path.basename(image_path).replace('.png', '.pdf')))\n",
" c = canvas.Canvas(output_filename, pagesize=letter)\n",
" width, height = letter # usually 612 x 792\n",
"\n",
" # Draw original image\n",
" if original_image.height <= height:\n",
" c.drawImage(image_path, 0, height - original_image.height, width=original_image.width, height=original_image.height, mask='auto')\n",
" y_position = height - original_image.height\n",
" else:\n",
" # Image taller than the page: fall back to the top of the page (scaling logic still to be added)\n",
" y_position = height\n",
"\n",
" # Draw annotated image\n",
" annotated_image_path = '/tmp/annotated_image.png' # Temporary path for the annotated image\n",
" annotated_img.save(annotated_image_path)\n",
" if y_position - annotated_img.height >= 0:\n",
" c.drawImage(annotated_image_path, 0, y_position - annotated_img.height, width=annotated_img.width, height=annotated_img.height, mask='auto')\n",
" y_position -= annotated_img.height\n",
" else:\n",
" c.showPage() # Start a new page if not enough space\n",
" c.drawImage(annotated_image_path, 0, height - annotated_img.height, width=annotated_img.width, height=annotated_img.height, mask='auto')\n",
" y_position = height - annotated_img.height\n",
"\n",
" # Add text\n",
" textobject = c.beginText()\n",
" textobject.setTextOrigin(10, y_position - 15)\n",
" textobject.setFont(\"Times-Roman\", 12)\n",
"\n",
" for line in document_content.split('\\n'):\n",
" if textobject.getY() - 15 < 0: # Check if new page is needed for more text\n",
" c.drawText(textobject)\n",
" c.showPage()\n",
" textobject = c.beginText()\n",
" textobject.setTextOrigin(10, height - 15)\n",
" textobject.setFont(\"Times-Roman\", 12)\n",
" textobject.textLine(line)\n",
" \n",
" c.drawText(textobject)\n",
" c.save()\n",
"\n",
" # Show the first output\n",
" if show_first_output:\n",
" os.system(f\"open {output_filename}\")\n",
"\n",
" except Exception as e:\n",
" print(f\"An error occurred while processing {image_path}: {e}\")\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" input_folder = '/projectnb/sparkgrp/ml-herbarium-grp/fall2023/LLM_Specimens'\n",
" output_folder = '/projectnb/sparkgrp/ml-herbarium-grp/fall2023/AzureVision-results'\n",
" first_output_shown = False\n",
"\n",
" # Create the output folder if it doesn't exist\n",
" if not os.path.exists(output_folder):\n",
" os.makedirs(output_folder)\n",
"\n",
" # Iterate over each image in the input folder\n",
" for image_file in os.listdir(input_folder):\n",
" image_path = os.path.join(input_folder, image_file)\n",
" \n",
" # Check if the file is an image\n",
" if image_path.lower().endswith(('.png', '.jpg', '.jpeg')):\n",
" analyze_read(image_path, output_folder, show_first_output=not first_output_shown)\n",
" first_output_shown = True # Ensure that only the first output is shown\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f27d0103",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
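The filename-sanitization and bounding-box helpers in AzureVision.ipynb are pure functions and can be sanity-checked without an Azure account. A minimal standalone sketch follows; the `Point` namedtuple is an illustrative stand-in for the point objects the Azure SDK returns (an assumption for testing only):

```python
import re
from collections import namedtuple

# Stand-in for the point objects returned by the Azure SDK (illustrative assumption)
Point = namedtuple("Point", ["x", "y"])

def sanitize_filename(filename):
    # Keep only alphanumerics, underscores, whitespace, dots, and hyphens
    return re.sub(r'[^\w\s\.-]', '', filename)

def format_bounding_box(bounding_box):
    # Render a polygon as "[x, y], [x, y], ..." or "N/A" when absent
    if not bounding_box:
        return "N/A"
    return ", ".join(["[{}, {}]".format(p.x, p.y) for p in bounding_box])

print(sanitize_filename("Text_Sample_19?.png"))          # stray '?' is stripped
print(format_bounding_box([Point(0, 0), Point(10, 0)]))  # "[0, 0], [10, 0]"
print(format_bounding_box(None))                          # "N/A"
```

Keeping these helpers free of SDK dependencies makes the PDF-generation path testable even when the Form Recognizer quota is exhausted, as in the 403 errors logged above.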
99 changes: 99 additions & 0 deletions POC/ChineseGPT4Vision.ipynb

Large diffs are not rendered by default.

