Skip to content

Commit

Permalink
created count_vectorizer.md (#319)
Browse files Browse the repository at this point in the history
nomiko0626 authored Nov 30, 2023
2 parents 76ecab1 + b8695a9 commit 41109f6
Showing 1 changed file with 50 additions and 0 deletions.
50 changes: 50 additions & 0 deletions Topics/Tech_Stacks/count_vectorizer.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
***Table of contents***

1.[Introduction](#introduction)

2.[How it works](#how-it-works)

3.[Usage](#usage)

4.[Additional resources](#additional-resources)

***Introduction:***


The CountVectorizer class in Python's Scikit-learn library helps machines understand text by turning it into numbers. Imagine it as a tool that transforms a bunch of words into a table of counts that a computer can analyze.

***How it works:***

1. **Tokenization:**
- It first breaks the text into smaller pieces called "tokens," which are usually words. For example, the phrase "I love CSC301" becomes tokens like ["I", "love", "CSC301"].

2. **Counting:**
- Then, it counts how many times each token appears in each document. This creates a table where each row is a document, and each column is a unique token. The numbers in the table show how many times each token appears in each document.

For example, if you have three documents:

- Document 1: "I love programming."
- Document 2: "Programming is fun."
- Document 3: "I love to program in Python."

After using CountVectorizer, you might get a table like this:

```
| | I | love | programming | is | fun | to | program | in | Python |
|---|---|------|-------------|----|-----|----|---------|----|--------|
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
```

Each row represents a document, and each column is a unique word. The numbers show how many times each word appears in the respective document.

***Usage:***


CountVectorizer is commonly used in tasks like text classification,clustering, sentiment analysis, and text mining in machine learning. It's a crucial step in getting text ready for algorithms that work with numbers.

***Additional resources:***

- [Scikit-learn Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [GeeksforGeeks Tutorial](https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/)

0 comments on commit 41109f6

Please sign in to comment.