created count_vectorizer.md (#319)

learning-software-engineering · Nov 30, 2023 · 41109f6 · 41109f6
2 parents 76ecab1 + b8695a9
commit 41109f6
Showing 1 changed file with 50 additions and 0 deletions.
diff --git a/Topics/Tech_Stacks/count_vectorizer.md b/Topics/Tech_Stacks/count_vectorizer.md
@@ -0,0 +1,50 @@
+***Table of contents*** 
+
+1.[Introduction](#introduction)
+
+2.[How it works](#how-it-works)
+
+3.[Usage](#usage)
+
+4.[Additional resources](#additional-resources)
+
+***Introduction:***
+
+
+The CountVectorizer class in Python's Scikit-learn library helps machines understand text by turning it into numbers. Imagine it as a tool that transforms a bunch of words into a table of counts that a computer can analyze.
+
+***How it works:***
+
+1. **Tokenization:**
+   - It first breaks the text into smaller pieces called "tokens," which are usually words. For example, the phrase "I love CSC301" becomes tokens like ["I", "love", "CSC301"].
+
+2. **Counting:**
+   - Then, it counts how many times each token appears in each document. This creates a table where each row is a document, and each column is a unique token. The numbers in the table show how many times each token appears in each document.
+
+For example, if you have three documents:
+
+- Document 1: "I love programming."
+- Document 2: "Programming is fun."
+- Document 3: "I love to program in Python."
+
+After using CountVectorizer, you might get a table like this:
+
+```
+|   | I | love | programming | is | fun | to | program | in | Python |
+|---|---|------|-------------|----|-----|----|---------|----|--------|
+| 1 | 1 | 1    | 1           | 0  | 0   | 0  | 0       | 0  | 0      |
+| 2 | 0 | 0    | 1           | 1  | 1   | 0  | 0       | 0  | 0      |
+| 3 | 1 | 1    | 1           | 0  | 0   | 1  | 1       | 1  | 1      |
+```
+
+Each row represents a document, and each column is a unique word. The numbers show how many times each word appears in the respective document.
+
+***Usage:***
+
+
+CountVectorizer is commonly used in tasks like text classification,clustering, sentiment analysis, and text mining in machine learning. It's a crucial step in getting text ready for algorithms that work with numbers.
+
+***Additional resources:***
+
+- [Scikit-learn Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
+- [GeeksforGeeks Tutorial](https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/)