-
Notifications
You must be signed in to change notification settings - Fork 199
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing
1 changed file
with
50 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
***Table of contents*** | ||
|
||
1.[Introduction](#introduction) | ||
|
||
2.[How it works](#how-it-works) | ||
|
||
3.[Usage](#usage) | ||
|
||
4.[Additional resources](#additional-resources) | ||
|
||
***Introduction:*** | ||
|
||
|
||
The CountVectorizer class in Python's Scikit-learn library helps machines understand text by turning it into numbers. Imagine it as a tool that transforms a bunch of words into a table of counts that a computer can analyze. | ||
|
||
***How it works:*** | ||
|
||
1. **Tokenization:** | ||
- It first breaks the text into smaller pieces called "tokens," which are usually words. For example, the phrase "I love CSC301" becomes tokens like ["I", "love", "CSC301"]. | ||
|
||
2. **Counting:** | ||
- Then, it counts how many times each token appears in each document. This creates a table where each row is a document, and each column is a unique token. The numbers in the table show how many times each token appears in each document. | ||
|
||
For example, if you have three documents: | ||
|
||
- Document 1: "I love programming." | ||
- Document 2: "Programming is fun." | ||
- Document 3: "I love to program in Python." | ||
|
||
After using CountVectorizer, you might get a table like this: | ||
|
||
``` | ||
| | I | love | programming | is | fun | to | program | in | Python | | ||
|---|---|------|-------------|----|-----|----|---------|----|--------| | ||
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | | ||
| 2 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | | ||
| 3 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | | ||
``` | ||
|
||
Each row represents a document, and each column is a unique word. The numbers show how many times each word appears in the respective document. | ||
|
||
***Usage:*** | ||
|
||
|
||
CountVectorizer is commonly used in tasks like text classification,clustering, sentiment analysis, and text mining in machine learning. It's a crucial step in getting text ready for algorithms that work with numbers. | ||
|
||
***Additional resources:*** | ||
|
||
- [Scikit-learn Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) | ||
- [GeeksforGeeks Tutorial](https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/) |