Web crawler and search engine

  1. Report (in Russian).
  2. Slides (in Russian).

Ranking functions:

  • BM25 (for whole content and headers only)
  • PageRank
  • Reference rating
  • Query position
  • Length of document
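
The first two signals in the list above can be illustrated with a minimal Python sketch. It is not taken from this repository: the function names, the parameter defaults (`k1=1.5`, `b=0.75`, `damping=0.85`) and the data layout (tokenized documents as lists of strings, links as a dict) are assumptions chosen for illustration, and the input is assumed to contain at least one non-empty document. Scoring "headers only" could be approximated by calling the same BM25 function on header tokens instead of the full content.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank; `links` maps a page to its outgoing pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy data, for illustration only.
docs = [["web", "crawler"], ["search", "engine", "search"], ["cooking"]]
scores = bm25_scores(["search"], docs)
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
```

How the individual signals are weighted into a final score is not described here, so the sketch leaves them separate.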

Architecture of the crawler:

[Image: crawler architecture diagram]

Search page:

[Image: search page]

Performance (1.6 GHz i5, SSD, 4 Mbit/s connection):

  • Indexing: ~50'000 pages per hour.
  • Search: ~0.1 s per query on a database with 1'000'000 indexed pages.

License

The source code is licensed under the MIT license.

The report and slides are not licensed (no rights are given to reproduce or modify this work).