feat: Add PageRank
wesleey committed Jun 1, 2024
1 parent 25c52b4 commit 98ae76f
Showing 9 changed files with 286 additions and 0 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -13,3 +13,6 @@
## Knowledge
- **Propositional Logic**
- [Inference](./knowledge/propositional-logic/inference/)
## Uncertainty
- **Markov Chain**
- [PageRank](./uncertainty/markov-chain/pagerank/)
82 changes: 82 additions & 0 deletions uncertainty/markov-chain/pagerank/README.md
@@ -0,0 +1,82 @@
# PageRank
[PageRank](https://en.wikipedia.org/wiki/PageRank) is an algorithm used by Google Search to rank web pages in its search results. It measures the importance of a page by the number and quality of links pointing to it, estimating the likelihood that a user reaches that page by randomly surfing the web.

## Usage
```bash
python pagerank.py corpus
```
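
For the `corpus/` directory included in this commit, the iterative ranks settle near the values below (obtained by solving the formula from the Iterative Algorithm section for this corpus; the sampling estimates vary from run to run but should land close to them):

```text
Results from Iteration
 2.html: 0.4292
 1.html: 0.2199
 3.html: 0.2199
 4.html: 0.1310
```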

## Random Surfer Model
### Introduction
- PageRank can be understood through the random surfer model.
- This model considers a hypothetical surfer on the internet who clicks on links randomly.
- A corpus of web pages is used to illustrate this model, where arrows between pages represent links.

![Corpus](./images/corpus.png)

### Behavior of the Random Surfer
- The surfer starts at a random page and randomly chooses links to follow.
- For example, a surfer on page 2 randomly chooses between pages 1 and 3 to visit next (see the sketch after this list).
- Duplicate links on the same page are treated as a single link, and links from a page to itself are ignored.
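
As a minimal sketch, one step of this (not yet damped) surfer over the four-page `corpus/` shipped in this commit can be written as:

```python
import random

# The bundled corpus/ files, as a mapping from each page
# to the set of pages it links to.
corpus = {
    "1.html": {"2.html"},
    "2.html": {"1.html", "3.html"},
    "3.html": {"2.html", "4.html"},
    "4.html": {"2.html"},
}

page = "2.html"
# From page 2 the surfer picks 1.html or 3.html, each with probability 1/2.
next_page = random.choice(sorted(corpus[page]))
```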

### PageRank and Probability
- The PageRank of a page is the probability that the random surfer is on that page at any given time.
- Pages with more incoming links are therefore more likely to be visited by the surfer.
- Links from more important pages are more likely to be followed than links from less important ones.

### Interpretation as a Markov Chain
- The model can be interpreted as a [Markov Chain](https://en.wikipedia.org/wiki/Markov_chain), where each page is a state.
- Transitions between states are made randomly through links.
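
As an illustration, for the four-page `corpus/` shipped in this commit (pages 1 through 4, damping ignored for now), the chain's transition matrix is the following, with row $i$ giving the distribution over the surfer's next page when currently on page $i$:

$$P = \begin{pmatrix} 0 & 1 & 0 & 0 \\ \frac{1}{2} & 0 & \frac{1}{2} & 0 \\ 0 & \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & 1 & 0 & 0 \end{pmatrix}$$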

### Disconnected Corpus
- When estimating PageRank by random sampling in a disconnected corpus, some pages can end up with biased estimates.
- The bias arises when the random surfer gets stuck in a cycle because no links lead out of it.

![Network Disconnected](./images/network_disconnected.png)

- Suppose we start by sampling Page 5 randomly from the corpus.
- Since Page 5 only links to Page 6 and vice versa, the surfer alternates between these two pages indefinitely.
- This looping results in an estimate of 0.5 for the PageRank of Pages 5 and 6.
- All other pages, which were not visited, end up with an estimated PageRank of 0.
- To address this issue, a damping factor $d$ is introduced to the model, typically set to around $0.85$.
- The random surfer still starts by choosing a page at random from the corpus.
- For each subsequent sample, the surfer follows a random link on the current page with probability $d$, and jumps to a random page anywhere in the corpus with probability $1 - d$ (see the sketch after this list).
- By counting how many times each page appears among the samples, we obtain the proportion of samples spent on each page.
- This proportion serves as an estimate of each page's PageRank, even in a disconnected corpus.
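
Here is a minimal sketch of one damped sampling step; the helper `damped_step` is illustrative only, since the project's `transition_model` in `pagerank.py` below builds the full probability distribution instead of branching per step:

```python
import random

def damped_step(corpus: dict, page: str, d: float = 0.85) -> str:
    """Pick the next page for the damped random surfer."""
    links = corpus[page]
    if links and random.random() < d:
        # With probability d, follow a random link on the current page.
        return random.choice(sorted(links))
    # With probability 1 - d (or when the page has no links),
    # jump to a page chosen uniformly from the entire corpus.
    return random.choice(sorted(corpus))
```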

## Iterative Algorithm
The following formula is applied iteratively until the PageRank values converge to a stable set of values, which indicates the relative importance of each page within the network.

$$PR(p) = \frac{1 - d}{N} + d \sum_{i} \frac{PR(i)}{NumLinks(i)}$$

### PageRank $PR(p)$
- $PR(p)$ represents the probability that a random surfer ends up on a given page $p$.

### Two Ways the Surfer Can End Up on Page $p$
1. With probability $1 - d$, the surfer chose a page at random and ended up on page $p$.
2. With probability $d$, the surfer followed a link from a page $i$ to page $p$.

### Mathematical Expression for $PR(p)$
1. The first condition contributes

$$\frac{1 - d}{N}$$

where $N$ is the total number of pages in the corpus.

2. The second condition contributes

$$d \sum_{i} \frac{PR(i)}{NumLinks(i)}$$

where $i$ ranges over all pages that link to page $p$, and $NumLinks(i)$ is the number of links on page $i$. Adding the two contributions yields the full formula above.

### Calculating PageRank Values
1. Start by assuming the PageRank of every page is $1 / N$ (equally likely to be on any page).
2. Use the PageRank formula to calculate new PageRank values for each page, based on the previous values.
3. Repeat this process until the PageRank values converge (i.e., they no longer change significantly between iterations); a worked first update for the corpus above follows this list.
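
For example, in the four-page corpus above, pages 1, 3, and 4 all link to page 2, with 1, 2, and 1 outgoing links respectively. Starting from $PR = 1/4$ for every page with $d = 0.85$ and $N = 4$, the first update for page 2 is:

$$PR(2) = \frac{1 - 0.85}{4} + 0.85 \left( \frac{0.25}{1} + \frac{0.25}{2} + \frac{0.25}{1} \right) = 0.0375 + 0.85 \cdot 0.625 = 0.56875$$

Further iterations shrink this value back toward its fixed point as the other pages' ranks adjust.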

## Implementation in the Project
- This project implements both approaches to calculating PageRank: sampling pages from the random surfer's Markov chain, and iteratively applying the PageRank formula until convergence.

## References
- [CS50’s Introduction to Artificial Intelligence with Python](https://cs50.harvard.edu/ai/2024/)
14 changes: 14 additions & 0 deletions uncertainty/markov-chain/pagerank/corpus/1.html
@@ -0,0 +1,14 @@
<!DOCTYPE html>
<html lang="en">
    <head>
        <title>1</title>
    </head>
    <body>
        <h1>1</h1>

        <div>Links:</div>
        <ul>
            <li><a href="2.html">2</a></li>
        </ul>
    </body>
</html>
15 changes: 15 additions & 0 deletions uncertainty/markov-chain/pagerank/corpus/2.html
@@ -0,0 +1,15 @@
<!DOCTYPE html>
<html lang="en">
    <head>
        <title>2</title>
    </head>
    <body>
        <h1>2</h1>

        <div>Links:</div>
        <ul>
            <li><a href="1.html">1</a></li>
            <li><a href="3.html">3</a></li>
        </ul>
    </body>
</html>
15 changes: 15 additions & 0 deletions uncertainty/markov-chain/pagerank/corpus/3.html
@@ -0,0 +1,15 @@
<!DOCTYPE html>
<html lang="en">
    <head>
        <title>3</title>
    </head>
    <body>
        <h1>3</h1>

        <div>Links:</div>
        <ul>
            <li><a href="2.html">2</a></li>
            <li><a href="4.html">4</a></li>
        </ul>
    </body>
</html>
14 changes: 14 additions & 0 deletions uncertainty/markov-chain/pagerank/corpus/4.html
@@ -0,0 +1,14 @@
<!DOCTYPE html>
<html lang="en">
    <head>
        <title>4</title>
    </head>
    <body>
        <h1>4</h1>

        <div>Links:</div>
        <ul>
            <li><a href="2.html">2</a></li>
        </ul>
    </body>
</html>
Binary file added: uncertainty/markov-chain/pagerank/images/corpus.png (image not displayed)
Binary file added: uncertainty/markov-chain/pagerank/images/network_disconnected.png (image not displayed)
143 changes: 143 additions & 0 deletions uncertainty/markov-chain/pagerank/pagerank.py
@@ -0,0 +1,143 @@
import sys
import os
import re
import random


DAMPING = 0.85
SAMPLES = 10000
ACCURACY = 0.001


def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python pagerank.py corpus")

    corpus = crawl(sys.argv[1])

    print(f"Results from Sampling (n = {SAMPLES})")
    ranks = sample_pagerank(corpus, DAMPING, SAMPLES)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(f" {page}: {ranks[page]:.4f}")

    print("Results from Iteration")
    ranks = iterate_pagerank(corpus, DAMPING, ACCURACY)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(f" {page}: {ranks[page]:.4f}")


def crawl(directory: str) -> dict:
    """
    Parse a directory of HTML pages and check for links to other pages.
    Return a dictionary where each key is a page, and values are
    a set of all other pages in the corpus that are linked to by the page.
    """
    pages = dict()

    for filename in os.listdir(directory):
        if filename.endswith(".html"):
            with open(os.path.join(directory, filename)) as file:
                contents = file.read()
                # Collect the href targets of all anchor tags,
                # ignoring links from a page to itself.
                links = re.findall(r"<a\s+(?:[^>]*?)href=\"([^\"]*)\"", contents)
                pages[filename] = set(links) - {filename}

    # Keep only links that point to pages inside the corpus.
    for filename in pages:
        pages[filename] = set(link for link in pages[filename] if link in pages)

    return pages


def transition_model(corpus: dict, page: str, damping_factor: float) -> dict:
    """
    Return a probability distribution over which page to visit next,
    given a current page.
    With probability `damping_factor`, choose a link at random
    linked to by `page`. With probability `1 - damping_factor`, choose
    a page at random from all pages in the corpus.
    """
    distribution = dict()

    links = corpus[page]
    num_pages = len(corpus)

    if not links:
        # A page with no outgoing links is treated as linking to
        # every page in the corpus, uniformly.
        for link in corpus:
            distribution[link] = 1 / num_pages
    else:
        num_links = len(links)
        for link in corpus:
            # Every page can be reached by the random jump...
            distribution[link] = (1 - damping_factor) / num_pages
            if link in links:
                # ...and linked pages also by following a link.
                distribution[link] += damping_factor / num_links

    return distribution


def sample_pagerank(corpus: dict, damping_factor: float, n: int) -> dict:
    """
    Return PageRank values for each page by sampling `n` pages
    according to transition model, starting with a page at random.
    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """
    pagerank = {page: 0 for page in corpus}
    pages = list(pagerank.keys())

    # The first sample is a page chosen uniformly at random.
    sample = random.choice(pages)
    pagerank[sample] += 1

    # Draw the remaining n - 1 samples from the transition model,
    # so that exactly n samples are counted in total.
    for _ in range(n - 1):
        distribution = transition_model(corpus, sample, damping_factor)
        samples = list(distribution.keys())
        weights = list(distribution.values())
        sample = random.choices(samples, weights, k=1)[0]
        pagerank[sample] += 1

    # Convert counts into proportions, which sum to 1.
    for page in pagerank:
        pagerank[page] /= n

    return pagerank


def iterate_pagerank(corpus: dict, damping_factor: float, accuracy: float) -> dict:
    """
    Return PageRank values for each page by iteratively updating
    PageRank values until convergence.
    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """
    num_pages = len(corpus)
    # Start from the uniform distribution over all pages.
    old_dict = {page: 1 / num_pages for page in corpus}
    new_dict = dict()

    while True:
        for page in corpus:
            result = 0
            for i in corpus:
                links = corpus[i]
                if not links:
                    # A page with no links counts as linking to every page.
                    result += old_dict[i] / num_pages
                elif page in links:
                    num_links = len(links)
                    result += old_dict[i] / num_links
            result *= damping_factor
            result += (1 - damping_factor) / num_pages
            new_dict[page] = result

        # Stop once no page's rank changed by more than `accuracy`.
        difference = max(abs(old_dict[i] - new_dict[i]) for i in old_dict)
        if difference < accuracy:
            return new_dict

        old_dict = new_dict.copy()


if __name__ == "__main__":
    main()
