A progress tracker for the project: it helps us understand where we are now and what we are still missing.
To do sometime in the future:
- Start implementing the actual PageRank algorithm (mathematically and programmatically; see the PageRank sketch after this list)
- Start deploying this thing and exposing it to the outside world
- Think through the security mechanisms that will have to be in place for us to deploy this safely
- Clean everything up, add some documentation
- Research Elasticsearch and figure out the schema and the queries we will have to send, both from the crawling side and from the backend side (see the mapping sketch after this list).
- Look into whether we can move HTML parsing from the crawlers into Elasticsearch.
- Figure out a smart system for handing out URLs to the crawling workers: custom round-robin scheduling plus error re-attempts (see the dispatcher sketch after this list).
- See what we can do right now to improve our later cloud deployment (devopsy things).
- Implement proper pagination for the backend (see the pagination sketch after this list).
- Look into better full-text parsing for the Rust crawlers (see the extraction sketch after this list).
- Implement the main functionality of the website backend in Rust: the main page, error handling, and user query handling (see the backend sketch after this list).
- Move the Python crawler to a multiprocessing architecture.
- Start laying out the final architecture of the project: the way all these systems are going to interact with each other.
- Implement a repeated-attempts policy for sites that error out when we try to scrape them; the number of attempts and the time between them should be configurable (see the retry sketch after this list).
- Figure out whether we should put our database queries in transactions to maintain data integrity (see the transaction sketch after this list).
- Think of database schemas and implementation details for a functional PageRank algorithm.
- Develop a protocol for communication between the backend and the database: which endpoints it should query, whether it's going to be a GET or a POST request, etc. (see the search-request sketch after this list).
- Pick a library for language detection (for both the user's query and a webpage's content). We've decided to make a separate module that both the backend and the crawlers will use; since both are in Rust and we've found a Rust library for language detection, it should be fine! (See the detection sketch after this list.)
- Implement the database, together with the Python and Rust backends for it.
- Update crawlers to connect them to the database.
- Connect all of the containers into a single docker-compose project.
- Figure out how to handle queries in different languages, plus Ukrainian tokenization.
- Simple (but parallel) crawlers for structured data: take an input file with links and write the collected structured data to another file, driven by a simple configuration saying where to look for the files, how many threads to use, how many websites to crawl, etc. (see the crawler-driver sketch after this list).
- Docker images for crawlers, with minimal dependencies and sizes. Also a simple docker-compose file.
- Figure out how to work with the database: does it support our languages (currently English and Ukrainian), which tables and columns do we need, etc.
- A simple frontend prototype, plus a documented query format from the frontend to the database (see the query-format sketch after this list).
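
Rough sketches for the items above follow. All of them are illustrative starting points, not settled decisions.

PageRank sketch: a minimal power-iteration version of PR(p) = (1 - d)/N + d * Σ PR(q)/L(q) in plain Rust, with pages as dense indices 0..n. The damping factor 0.85 and the iteration count are conventional placeholders.

```rust
// Minimal PageRank power iteration over an adjacency list.
fn pagerank(out_links: &[Vec<usize>], damping: f64, iterations: usize) -> Vec<f64> {
    let n = out_links.len();
    let mut rank = vec![1.0 / n as f64; n];
    for _ in 0..iterations {
        let mut next = vec![(1.0 - damping) / n as f64; n];
        for (page, links) in out_links.iter().enumerate() {
            if links.is_empty() {
                // Dangling page: spread its rank over everyone.
                for r in next.iter_mut() {
                    *r += damping * rank[page] / n as f64;
                }
            } else {
                let share = damping * rank[page] / links.len() as f64;
                for &target in links {
                    next[target] += share;
                }
            }
        }
        rank = next;
    }
    rank
}

fn main() {
    // Tiny 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
    let graph = vec![vec![1], vec![2], vec![0, 1]];
    println!("{:?}", pagerank(&graph, 0.85, 20));
}
```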
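Mapping sketch: one possible Elasticsearch index for crawled pages, created over the REST API with reqwest (blocking + json features) and serde_json. The index name `pages` and every field here are assumptions meant to kick off the schema discussion.

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical mapping for crawled pages; fields are not settled.
    let mapping = json!({
        "mappings": {
            "properties": {
                "url":     { "type": "keyword" },
                "title":   { "type": "text" },
                "body":    { "type": "text" },
                "lang":    { "type": "keyword" },
                "fetched": { "type": "date" }
            }
        }
    });
    // Creating an index is a PUT to /<index> with the mapping as the body.
    let resp = reqwest::blocking::Client::new()
        .put("http://localhost:9200/pages")
        .json(&mapping)
        .send()?;
    println!("create index: {}", resp.status());
    Ok(())
}
```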
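Dispatcher sketch: a toy round-robin hand-out of URLs to per-worker queues, with failed URLs re-entering the rotation. The worker count and the re-queue rule are placeholders to experiment with.

```rust
use std::collections::VecDeque;

struct Dispatcher {
    queues: Vec<VecDeque<String>>, // one queue per crawling worker
    next: usize,                   // index of the next worker in the rotation
}

impl Dispatcher {
    fn new(workers: usize) -> Self {
        Self { queues: vec![VecDeque::new(); workers], next: 0 }
    }

    // Deal a URL to the next worker in the rotation.
    fn push(&mut self, url: String) {
        self.queues[self.next].push_back(url);
        self.next = (self.next + 1) % self.queues.len();
    }

    // A worker that errored out hands the URL back; it simply
    // re-enters the rotation for a later attempt.
    fn requeue_failed(&mut self, url: String) {
        self.push(url);
    }
}

fn main() {
    let mut d = Dispatcher::new(3);
    for path in ["a", "b", "c", "d"] {
        d.push(format!("https://example.com/{path}"));
    }
    d.requeue_failed("https://example.com/b".to_string());
    println!("{:?}", d.queues);
}
```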
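Pagination sketch: query-string parameters deserialized with serde and translated to offset/limit, with a clamp so a client cannot request an unbounded page. Field names and limits are assumptions.

```rust
use serde::Deserialize;

// Deserializes from a query string like `?q=rust&page=2&per_page=20`.
#[derive(Deserialize)]
struct PageParams {
    q: String,
    #[serde(default)]
    page: usize, // zero-based page index
    #[serde(default = "default_per_page")]
    per_page: usize,
}

fn default_per_page() -> usize { 20 }

impl PageParams {
    // Translate to (offset, limit), capping per_page at 100.
    fn offset_limit(&self) -> (usize, usize) {
        let per_page = self.per_page.clamp(1, 100);
        (self.page * per_page, per_page)
    }
}

fn main() {
    let p = PageParams { q: "rust".into(), page: 2, per_page: 20 };
    assert_eq!(p.offset_limit(), (40, 20));
}
```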
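Extraction sketch, assuming the `scraper` crate (one candidate; nothing is decided): pull visible text out of content tags instead of dumping the raw DOM.

```rust
use scraper::{Html, Selector};

// Collect trimmed text from content-bearing tags only.
fn extract_text(html: &str) -> String {
    let doc = Html::parse_document(html);
    let selector = Selector::parse("p, h1, h2, h3, li").unwrap();
    doc.select(&selector)
        .flat_map(|el| el.text())
        .map(str::trim)
        .filter(|t| !t.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
    println!("{}", extract_text(html)); // "Title Hello world"
}
```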
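Backend sketch, assuming axum 0.7 as the web framework and tokio as the runtime (the notes don't name either): a main page, a query handler with basic error handling, and a 404 fallback. Routes and the port are placeholders.

```rust
use axum::{extract::Query, http::StatusCode, routing::get, Router};
use serde::Deserialize;

#[derive(Deserialize)]
struct SearchParams {
    q: String,
}

// Main page.
async fn home() -> &'static str {
    "search engine prototype"
}

// User query handling; empty queries are rejected with a 400.
async fn search(Query(params): Query<SearchParams>) -> Result<String, StatusCode> {
    if params.q.trim().is_empty() {
        return Err(StatusCode::BAD_REQUEST);
    }
    // TODO: forward the query to the database and render results.
    Ok(format!("you searched for: {}", params.q))
}

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/", get(home))
        .route("/search", get(search))
        .fallback(|| async { (StatusCode::NOT_FOUND, "no such page") });
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```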
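Retry sketch: both knobs from the item above (number of attempts, time between them) live in a config struct, and the fetch closure stands in for the real scraper call. An exponential backoff would drop into the same sleep.

```rust
use std::{thread, time::Duration};

struct RetryPolicy {
    max_attempts: u32,
    delay: Duration,
}

// Run `fetch` until it succeeds or the attempt budget runs out.
fn fetch_with_retry<T, E>(
    policy: &RetryPolicy,
    mut fetch: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match fetch() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= policy.max_attempts => return Err(e),
            Err(_) => {
                attempt += 1;
                thread::sleep(policy.delay);
            }
        }
    }
}

fn main() {
    let policy = RetryPolicy { max_attempts: 3, delay: Duration::from_millis(200) };
    let mut calls = 0;
    // A fake fetch that fails twice, then succeeds.
    let result = fetch_with_retry(&policy, || {
        calls += 1;
        if calls < 3 { Err("timeout") } else { Ok("page body") }
    });
    assert_eq!(result, Ok("page body"));
}
```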
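Transaction sketch, assuming sqlx with Postgres (neither is settled): a page row and its outgoing links are written atomically, so a failed insert rolls the whole batch back. Table and column names are placeholders.

```rust
use sqlx::PgPool;

// Usage from async code: store_page(&pool, "https://example.com", &links).await?
async fn store_page(pool: &PgPool, url: &str, links: &[String]) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?;
    let page_id: i64 = sqlx::query_scalar("INSERT INTO pages (url) VALUES ($1) RETURNING id")
        .bind(url)
        .fetch_one(&mut *tx)
        .await?;
    for link in links {
        sqlx::query("INSERT INTO links (from_page, to_url) VALUES ($1, $2)")
            .bind(page_id)
            .bind(link)
            .execute(&mut *tx)
            .await?;
    }
    // Either every row lands or none do; an uncommitted transaction
    // rolls back when it is dropped.
    tx.commit().await
}
```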
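Search-request sketch for the backend-to-database leg of the protocol: a POST to `/pages/_search` with a JSON body (Elasticsearch accepts GET here too, which is exactly the kind of choice this item is about). Index and field names follow the mapping sketch and remain assumptions.

```rust
use serde_json::json;

// Full-text search across title and body, with from/size pagination.
fn search_pages(query: &str, from: usize, size: usize)
    -> Result<serde_json::Value, Box<dyn std::error::Error>>
{
    let body = json!({
        "from": from,
        "size": size,
        "query": { "multi_match": { "query": query, "fields": ["title", "body"] } }
    });
    let resp = reqwest::blocking::Client::new()
        .post("http://localhost:9200/pages/_search")
        .json(&body)
        .send()?
        .json::<serde_json::Value>()?;
    Ok(resp)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let hits = search_pages("rust crawler", 0, 10)?;
    println!("total hits: {}", hits["hits"]["total"]["value"]);
    Ok(())
}
```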
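Detection sketch, assuming the `whatlang` crate as the Rust library the item mentions (it is one candidate, not necessarily the one we found): a tiny shared module that both the backend and the crawlers could call.

```rust
use whatlang::{detect, Lang};

// `detect` returns None when the text is too short or too ambiguous.
pub fn detect_lang(text: &str) -> Option<Lang> {
    detect(text).map(|info| info.lang())
}

fn main() {
    println!("{:?}", detect_lang("Доброго вечора, ми з України")); // Some(Ukr) expected
    println!("{:?}", detect_lang("Good evening from the crawler")); // Some(Eng) expected
}
```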
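Crawler-driver sketch: the file-in/file-out, thread-count-and-site-limit configuration the item describes, in plain std Rust with the actual fetching stubbed out. File names and numbers are placeholders.

```rust
use std::{
    fs,
    sync::{mpsc, Arc, Mutex},
    thread,
};

struct Config {
    input: String,    // file with one link per line
    output: String,   // file to write collected results to
    threads: usize,
    max_sites: usize,
}

fn main() -> std::io::Result<()> {
    let cfg = Config {
        input: "links.txt".into(),
        output: "out.txt".into(),
        threads: 4,
        max_sites: 100,
    };
    let urls: Vec<String> = fs::read_to_string(&cfg.input)?
        .lines()
        .take(cfg.max_sites)
        .map(String::from)
        .collect();
    let queue = Arc::new(Mutex::new(urls));
    let (tx, rx) = mpsc::channel();
    for _ in 0..cfg.threads {
        let (queue, tx) = (Arc::clone(&queue), tx.clone());
        thread::spawn(move || loop {
            // Hold the lock only long enough to pop one URL.
            let url = match queue.lock().unwrap().pop() {
                Some(u) => u,
                None => break,
            };
            // Stub: a real crawler would fetch and parse `url` here.
            tx.send(format!("{url}\tok")).unwrap();
        });
    }
    drop(tx); // drop the original sender so `rx` ends once all workers finish
    let results: Vec<String> = rx.iter().collect();
    fs::write(&cfg.output, results.join("\n"))
}
```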
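Query-format sketch: the frontend-to-backend request expressed as a serde type, so the documented format and the code can't drift apart. The exact fields (query text, language, pagination) are assumptions.

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
struct FrontendQuery {
    q: String,            // raw user query
    lang: Option<String>, // "en" / "uk", or None to auto-detect
    page: u32,
    per_page: u32,
}

fn main() {
    let raw = r#"{ "q": "rust web crawler", "lang": "en", "page": 0, "per_page": 10 }"#;
    let parsed: FrontendQuery = serde_json::from_str(raw).unwrap();
    println!("{parsed:?}");
}
```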