Commit

clean up crawl queue
freekmurze committed Dec 10, 2017
1 parent 94347da commit 4438cc5
Showing 6 changed files with 21 additions and 9 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,9 @@

All notable changes to `spatie/crawler` will be documented in this file.

+## 2.7.0 - 2017-12-10
+- added the ability to change the crawl queue
+
## 2.6.2 - 2017-12-10
- more performance improvements

9 changes: 6 additions & 3 deletions README.md
@@ -125,12 +125,15 @@ Crawler::create()

## Setting the crawl queue

-You can change the crawler queue handler. By default, the crawler uses a collection based crawl queue.
-That function expects an object that implements the `Spatie\Crawler\CrawlQueue`-interface:
+When crawling a site the crawler will put urls to be crawled in a queue. By default this queue is stored in memory using the built in `CollectionCrawlQueue`.
+
+When a site is very large you may want to store that queue elsewhere, maybe a database. In such cases you can write your own crawl queue.
+
+A valid crawl queue is any class that implements the `Spatie\Crawler\CrawlQueue\CrawlQueue`-interface. You can pass your custom crawl queue via the `setCrawlQueue` method on the crawler.

```php
Crawler::create()
-->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueue>)
+->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueue\CrawlQueue>)
```

## Changelog
@@ -1,9 +1,11 @@
<?php

-namespace Spatie\Crawler;
+namespace Spatie\Crawler\CrawlQueue;

use Illuminate\Support\Collection;
use Spatie\Crawler\Exception\UrlNotFoundByIndex;
+use Spatie\Crawler\CrawlUrl;
+use Spatie\Crawler\Url;

class CollectionCrawlQueue implements CrawlQueue
{
@@ -53,7 +55,7 @@ public function getUrlById(int $id): CrawlUrl
return $this->urls->values()[$id];
}

-public function hasAlreadyBeenProcessed(CrawlUrl $url)
+public function hasAlreadyBeenProcessed(CrawlUrl $url): bool
{
return ! $this->contains($this->pendingUrls, $url) && $this->contains($this->urls, $url);
}
6 changes: 4 additions & 2 deletions src/CrawlQueue.php → src/CrawlQueue/CrawlQueue.php
@@ -1,6 +1,8 @@
<?php

-namespace Spatie\Crawler;
+namespace Spatie\Crawler\CrawlQueue;
+
+use Spatie\Crawler\CrawlUrl;

interface CrawlQueue
{
@@ -15,7 +17,7 @@ public function getUrlById(int $id): CrawlUrl;
/** @return \Spatie\Crawler\CrawlUrl|null */
public function getFirstPendingUrl();

-public function hasAlreadyBeenProcessed(CrawlUrl $url);
+public function hasAlreadyBeenProcessed(CrawlUrl $url): bool;

public function markAsProcessed(CrawlUrl $crawlUrl);
}
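
To illustrate what a custom queue might look like, here is a minimal, self-contained sketch of an array-backed queue. It is not part of this commit. `CrawlUrl` and `CrawlQueue` below are simplified stand-ins (assumptions) so the snippet runs without the package; the stand-in interface declares only the four methods visible in this diff, and the real interface declares more. `ArrayCrawlQueue` and its `add` helper are hypothetical names.

```php
<?php

// Stand-in for Spatie\Crawler\CrawlUrl (assumption: reduced to a plain URL string).
class CrawlUrl
{
    /** @var string */
    public $url;

    public function __construct(string $url)
    {
        $this->url = $url;
    }
}

// Trimmed stand-in for Spatie\Crawler\CrawlQueue\CrawlQueue.
// Only the methods shown in the diff are declared here.
interface CrawlQueue
{
    public function getUrlById(int $id): CrawlUrl;

    /** @return CrawlUrl|null */
    public function getFirstPendingUrl();

    public function hasAlreadyBeenProcessed(CrawlUrl $url): bool;

    public function markAsProcessed(CrawlUrl $crawlUrl);
}

// Hypothetical queue; a database-backed version would swap the arrays for queries.
class ArrayCrawlQueue implements CrawlQueue
{
    /** @var CrawlUrl[] every URL ever added, indexed by id (insertion order) */
    private $urls = [];

    /** @var int[] id of each known URL, keyed by URL string */
    private $ids = [];

    /** @var bool[] ids that still have to be crawled */
    private $pending = [];

    // Hypothetical helper (not taken from the diff): enqueue a URL once.
    public function add(CrawlUrl $url): self
    {
        if (! isset($this->ids[$url->url])) {
            $id = count($this->urls);
            $this->urls[$id] = $url;
            $this->ids[$url->url] = $id;
            $this->pending[$id] = true;
        }

        return $this;
    }

    public function getUrlById(int $id): CrawlUrl
    {
        if (! isset($this->urls[$id])) {
            throw new OutOfBoundsException("No url with id {$id}");
        }

        return $this->urls[$id];
    }

    public function getFirstPendingUrl()
    {
        // The first key of $pending is the oldest URL that is still unprocessed.
        foreach ($this->pending as $id => $unused) {
            return $this->urls[$id];
        }

        return null;
    }

    public function hasAlreadyBeenProcessed(CrawlUrl $url): bool
    {
        // Processed means: known to the queue and no longer pending.
        return isset($this->ids[$url->url])
            && ! isset($this->pending[$this->ids[$url->url]]);
    }

    public function markAsProcessed(CrawlUrl $crawlUrl)
    {
        if (isset($this->ids[$crawlUrl->url])) {
            unset($this->pending[$this->ids[$crawlUrl->url]]);
        }
    }
}
```

In real use the class would implement the package's own `Spatie\Crawler\CrawlQueue\CrawlQueue` interface, including the methods not shown in this diff, and an instance would be passed to `setCrawlQueue` as the README describes.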
4 changes: 3 additions & 1 deletion src/Crawler.php
@@ -13,6 +13,8 @@
use Symfony\Component\DomCrawler\Link;
use Psr\Http\Message\ResponseInterface;
use GuzzleHttp\Exception\RequestException;
+use Spatie\Crawler\CrawlQueue\CrawlQueue;
+use Spatie\Crawler\CrawlQueue\CollectionCrawlQueue;
use Symfony\Component\DomCrawler\Crawler as DomCrawler;

class Crawler
@@ -32,7 +34,7 @@ class Crawler
/** @var int */
protected $concurrency;

-/** @var \Spatie\Crawler\CrawlQueue */
+/** @var \Spatie\Crawler\CrawlQueue\CrawlQueue */
protected $crawlQueue;

/** @var int */
2 changes: 1 addition & 1 deletion tests/CrawlQueueTest.php
@@ -4,7 +4,7 @@

use Spatie\Crawler\Url;
use Spatie\Crawler\CrawlUrl;
-use Spatie\Crawler\CollectionCrawlQueue;
+use Spatie\Crawler\CrawlQueue\CollectionCrawlQueue;

class CrawlQueueTest extends TestCase
{