Project description
The user interface is a key feature of modern applications. Recent developments (e.g. ChatGPT) show that chatbots are a usable and feasible user interface. Following this trend, I propose a chatbot application that helps users find offers on popular web portals such as OLX or Otodom. The chatbot works on the HTTP responses from these portals and requires the user to specify each request with a set of parameters (e.g. location, area, price range). It is implemented in Go, using the parser from the net/html package and the Telegram Bot API. The user creates a set of search terms through the chatbot's interface by providing these parameters, which are then processed by the parser.
The parser operates on user-defined search terms, from which the link to the page with offers can be built directly. The general structure of the link is:
olx.pl/nieruchomosci/mieszkania//
Example:
olx.pl/nieruchomosci/mieszkania/bydgoszcz/q-mieszkanie/?search%5Bfilter_float_price:from%5D=100&search%5Bfilter_float_price:to%5D=3000
Adding the ?search[order]=created_at:desc parameter sorts the offers by creation time in descending order, newest first.
Example:
olx.pl/nieruchomosci/mieszkania/bydgoszcz/q-mieszkanie/?search%5Border%5D=created_at:desc
First, the parser builds the URL from the user-defined parameters and always sets the order to descending, so that the newest offers are parsed first. A major advantage of OLX is that its listings already include offers from both OLX and Otodom. Next, the parser uses the FetchHTMLPage method to download the raw HTML of the page for later analysis.
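The URL-building step described above could look roughly like the sketch below. It is only an illustration: the SearchParams type, the BuildURL function, the https scheme and the www.olx.pl host are assumptions, not names taken from the project. Only the path layout and the query parameter names come from the examples above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// SearchParams is a hypothetical container for the user-defined search terms.
type SearchParams struct {
	City     string // e.g. "bydgoszcz"
	Query    string // e.g. "mieszkanie"
	PriceMin int    // 0 means "not set"
	PriceMax int    // 0 means "not set"
}

// BuildURL assembles the offers-page link and always requests
// newest-first ordering. Scheme and host are assumptions.
func BuildURL(p SearchParams) string {
	q := url.Values{}
	q.Set("search[order]", "created_at:desc")
	if p.PriceMin > 0 {
		q.Set("search[filter_float_price:from]", strconv.Itoa(p.PriceMin))
	}
	if p.PriceMax > 0 {
		q.Set("search[filter_float_price:to]", strconv.Itoa(p.PriceMax))
	}
	u := url.URL{
		Scheme:   "https",
		Host:     "www.olx.pl",
		Path:     "/nieruchomosci/mieszkania/" + p.City + "/q-" + p.Query + "/",
		RawQuery: q.Encode(), // percent-encodes brackets and colons, sorts keys
	}
	return u.String()
}

func main() {
	fmt.Println(BuildURL(SearchParams{
		City: "bydgoszcz", Query: "mieszkanie", PriceMin: 100, PriceMax: 3000,
	}))
}
```

Note that url.Values.Encode also percent-encodes the colon (as %3A), so the result differs cosmetically from the hand-written examples while carrying the same parameters.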
Offers on an OLX page are stored in divs with generated class names that change from time to time (e.g. class="css-1sw7q4x"). The parser looks for such divs and treats each one as an offer. Because offers contain nested divs, the parser tracks whether a closing div tag is the one that closes the offer and stores the result in the isOffer boolean variable. All tags related to the offer are then passed to the extractOffer method, which extracts specific features.
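The nesting problem can be illustrated with a small sketch. To keep it dependency-free, it uses the standard library's encoding/xml decoder as a stand-in for the HTML tokenizer, so it only works on well-formed markup. The collectOffers and hasClass functions and the depth counter are assumptions made for illustration, not the project's code. The idea is to count open divs inside an offer, so that only the matching closing tag ends it.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// offerClass is taken from the example above; the real value changes over time.
const offerClass = "css-1sw7q4x"

// hasClass reports whether a start tag has exactly the given class attribute.
func hasClass(t xml.StartElement, class string) bool {
	for _, a := range t.Attr {
		if a.Name.Local == "class" && a.Value == class {
			return true
		}
	}
	return false
}

// collectOffers returns the non-empty text fragments found inside each offer div.
func collectOffers(page string) ([][]string, error) {
	dec := xml.NewDecoder(strings.NewReader(page))
	var offers [][]string
	var current []string
	isOffer := false // true while the tokens belong to an offer
	depth := 0       // open divs inside the offer, including the offer div itself
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return offers, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local != "div" {
				continue
			}
			if isOffer {
				depth++ // nested div: its closing tag must not end the offer
			} else if hasClass(t, offerClass) {
				isOffer, depth, current = true, 1, nil
			}
		case xml.EndElement:
			if !isOffer || t.Name.Local != "div" {
				continue
			}
			depth--
			if depth == 0 { // this closing tag matches the offer div
				offers = append(offers, current)
				isOffer = false
			}
		case xml.CharData:
			if text := strings.TrimSpace(string(t)); isOffer && text != "" {
				current = append(current, text)
			}
		}
	}
}

func main() {
	page := `<body><div class="css-1sw7q4x"><div><span>Flat A</span></div><span>1000 PLN</span></div></body>`
	offers, err := collectOffers(page)
	fmt.Println(offers, err)
}
```

With a real HTML tokenizer the same counter applies. Only the token types and the tolerance for malformed markup differ.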
Here is the general structure of an offer, with the list of features the parser aims to extract:
// Offer of an apartment for rent.
//
// Attributes:
//
// Title: The title of the offer.
// Price: The price of the offer.
// Location: The location of the offer.
// Time: The time the offer was posted or updated.
// Url: The URL of the offer.
// AdditionalPayment: The additional payment for the offer.
// Description: The description of the offer.
// Rooms: The number of rooms on offer.
// Area: The area of the offer.
// Floor: The floor of the offer.
// Images: The URLs of the offer's images.
type Offer struct {
	Title             string
	Price             string
	Location          string
	Time              string
	Url               string
	AdditionalPayment string
	Description       string
	Rooms             string
	Area              string
	Floor             string
	Images            []string
}
From the page with offers, the following attributes can be extracted:
- Title of the offer
- Location
- Area of the apartment
- Time the offer was posted (relative to the parsing time)
- Base price
- URL to the offer
This is what the extractOffer method is responsible for. Since almost all of the data is stored in span tags, it is a simple task.
By default, the parser does not follow the offer's link. That step is performed inside the Telegram bot logic in order to reduce the number of requests to OLX (no rate limiting is implemented yet). However, the parser package offers the ParseOffer method, which accepts a partly filled offer as a parameter and completes it with the data parsed from the Offer.Url page.
The ParseOffer method determines whether the provided URL belongs to OLX or Otodom and invokes the appropriate method. For OLX offers, the task is similar to the offers page: all the data is stored in separate tags with generated classes that indicate which is which. The only difference is that much of the information sits in OLX tags that share the same structure (e.g. <p class="css-b5m1rv er34gjf0">Powierzchnia: 18 m²</p>). Because the label and the value are separated by a colon, extracting them presents no difficulty.
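Both ideas from the paragraph above, choosing a portal by URL and splitting a "label: value" text at the colon, can be sketched as follows. The portalOf and splitFeature functions and the "olx" / "otodom" return values are hypothetical names chosen for this illustration. Matching on the olx.pl and otodom.pl host names is likewise an assumption about how the distinction could be made.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// portalOf reports which portal a URL belongs to, based on its host name.
// The URL must include a scheme, otherwise url.Parse leaves the host empty.
func portalOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	host := strings.ToLower(u.Hostname())
	switch {
	case host == "olx.pl" || strings.HasSuffix(host, ".olx.pl"):
		return "olx", nil
	case host == "otodom.pl" || strings.HasSuffix(host, ".otodom.pl"):
		return "otodom", nil
	}
	return "", fmt.Errorf("unsupported host %q", host)
}

// splitFeature splits a "Label: value" text at the first colon.
// ok is false when the text contains no colon.
func splitFeature(text string) (label, value string, ok bool) {
	parts := strings.SplitN(text, ":", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	return strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1]), true
}

func main() {
	portal, err := portalOf("https://www.olx.pl/d/oferta/example")
	fmt.Println(portal, err)

	label, value, ok := splitFeature("Powierzchnia: 18 m²")
	fmt.Println(label, value, ok)
}
```

Splitting at the first colon only keeps values that themselves contain a colon intact.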
On the other hand, the problem with Otodom offers is that the data in the web page's tags is separated from its label (e.g. Powierzchnia and 18 are in separate HTML divs). In order to track which data corresponds to which label, I had to introduce a set of boolean variables indicating which label the data currently being parsed corresponds to, and operate on them as a single one-hot encoded vector. The logic is as follows:
- Find the div with tags
- Find the div with a label
- Set the appropriate label and toggle the others (e.g. IsArea = True)
- Find the next div with data
- Extract the data and assign it to the label
- Toggle the boolean flag (e.g. IsArea = False)
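The steps above can be sketched as a small state machine. To stay self-contained, the sketch takes the texts of successive divs as a plain slice instead of reading HTML tokens. The labelFlags type, the parseDetails function, the trimmed-down details struct and the label strings other than Powierzchnia are assumptions made for illustration.

```go
package main

import "fmt"

// details is a trimmed-down stand-in for the Offer struct.
type details struct {
	Area, Rooms, Floor string
}

// labelFlags holds one boolean per known label; at most one is true
// at a time, so together they behave like a one-hot vector.
type labelFlags struct {
	IsArea, IsRooms, IsFloor bool
}

// set lowers every flag and raises the one matching the label, if any.
// "Liczba pokoi" and "Piętro" are assumed label texts.
func (f *labelFlags) set(label string) {
	*f = labelFlags{}
	switch label {
	case "Powierzchnia":
		f.IsArea = true
	case "Liczba pokoi":
		f.IsRooms = true
	case "Piętro":
		f.IsFloor = true
	}
}

// parseDetails walks the div texts in document order. When a flag is
// raised, the current text is the data for that label; otherwise the
// text is checked as a possible label.
func parseDetails(cells []string) details {
	var d details
	var f labelFlags
	for _, text := range cells {
		switch {
		case f.IsArea:
			d.Area = text
		case f.IsRooms:
			d.Rooms = text
		case f.IsFloor:
			d.Floor = text
		default:
			f.set(text) // no label pending: this cell may be a label
			continue
		}
		f = labelFlags{} // data consumed: lower the flag again
	}
	return d
}

func main() {
	cells := []string{"Powierzchnia", "18 m²", "Liczba pokoi", "1"}
	fmt.Printf("%+v\n", parseDetails(cells))
}
```

Unknown labels leave all flags lowered, so the value that follows them is skipped rather than assigned to the wrong field.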
As on OLX offer pages, the offer description is stored in a specific div, so extracting it was straightforward.
Last but not least, images. OLX stores image URLs in regular self-closing <img> HTML tags, so extracting them was straightforward. Otodom, however, keeps them in a one-line JSON object at the end of the page. That is why the parseOtodomImages method was added: it extracts the images found under the props->pageProps->ad->images[] path of that JSON object.
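Reading that path with encoding/json could look roughly like this. The pageData type and the imageURLs function are hypothetical. In particular, the sketch assumes each entry of images[] is an object with a "large" field holding the URL, which the text above does not specify, and it starts from the JSON text already cut out of the page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pageData mirrors only the props -> pageProps -> ad -> images path;
// all other fields of the JSON object are ignored by the decoder.
type pageData struct {
	Props struct {
		PageProps struct {
			Ad struct {
				Images []struct {
					Large string `json:"large"` // assumed field name
				} `json:"images"`
			} `json:"ad"`
		} `json:"pageProps"`
	} `json:"props"`
}

// imageURLs decodes the JSON object and returns the non-empty image URLs.
func imageURLs(raw []byte) ([]string, error) {
	var d pageData
	if err := json.Unmarshal(raw, &d); err != nil {
		return nil, err
	}
	urls := make([]string, 0, len(d.Props.PageProps.Ad.Images))
	for _, img := range d.Props.PageProps.Ad.Images {
		if img.Large != "" {
			urls = append(urls, img.Large)
		}
	}
	return urls, nil
}

func main() {
	raw := []byte(`{"props":{"pageProps":{"ad":{"images":[{"large":"https://example.com/1.jpg"}]}}}}`)
	urls, err := imageURLs(raw)
	fmt.Println(urls, err)
}
```

Declaring only the needed path keeps the decoder tolerant of changes elsewhere in the page's JSON.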
The application provides a user-friendly interface for creating subscriptions to specific housing search queries. Thanks to the extracted data, the user doesn't have to spend time manually searching for offers across several portals and can have all the offers in one place, with a convenient system for tracking viewed offers. This saves users time and makes the search for housing more efficient.