Project description

Apartment parser

Description

The user interface is a key feature of modern applications. Recent developments (e.g. ChatGPT) show that chatbots are a usable and feasible user interface. Following this trend, I propose a chatbot application that helps users find offers on popular web portals such as OLX or Otodom. The chatbot works with the HTTP responses from these portals and requires the user to specify each request with a set of parameters (e.g. location, area, price range). It is implemented in Go using the parser from the net/html package (golang.org/x/net/html) and the Telegram Bot API. The user creates a set of search terms through the chatbot's interface by providing the aforementioned parameters, which are then processed by a parser.

Building the search query

The parser operates on user-defined search terms, from which the link to the page with offers is built. The general structure of the link is:

olx.pl/nieruchomosci/mieszkania//

Example:

olx.pl/nieruchomosci/mieszkania/bydgoszcz/q-mieszkanie/?search%5Bfilter_float_price:from%5D=100&search%5Bfilter_float_price:to%5D=3000

Adding the ?search[order]=created_at:desc parameter sorts the offers by creation time in descending order, i.e. newest first.

Example:

olx.pl/nieruchomosci/mieszkania/bydgoszcz/q-mieszkanie/?search%5Border%5D=created_at:desc

First, the parser builds the URL from the user-defined parameters and always sets the order to descending, so that the newest offers are parsed first. A big advantage of OLX is that its listings already include offers from both OLX and Otodom. The parser then uses the FetchHTMLPage method to download the raw HTML of the page for later analysis.
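The URL-building step can be sketched as follows. This is a minimal illustration, not the project's actual code: the SearchParams type and the BuildSearchURL function are hypothetical names, the https scheme is assumed, and the city/q-query path layout is inferred from the examples above. Note that url.Values.Encode percent-encodes the colon as %3A, which is an equivalent encoding of the same query.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// SearchParams holds the user-defined search terms (hypothetical type).
type SearchParams struct {
	City      string // e.g. "bydgoszcz"
	Query     string // free-text query, e.g. "mieszkanie"
	PriceFrom int    // 0 means "no lower bound"
	PriceTo   int    // 0 means "no upper bound"
}

// BuildSearchURL assembles an OLX search link from the parameters and
// always requests newest-first ordering.
func BuildSearchURL(p SearchParams) string {
	q := url.Values{}
	if p.PriceFrom > 0 {
		q.Set("search[filter_float_price:from]", strconv.Itoa(p.PriceFrom))
	}
	if p.PriceTo > 0 {
		q.Set("search[filter_float_price:to]", strconv.Itoa(p.PriceTo))
	}
	// The order is set unconditionally so the newest offers come first.
	q.Set("search[order]", "created_at:desc")

	u := url.URL{
		Scheme:   "https", // assumption: the examples above omit the scheme
		Host:     "olx.pl",
		Path:     "/nieruchomosci/mieszkania/" + p.City + "/q-" + p.Query + "/",
		RawQuery: q.Encode(), // keys are sorted and percent-encoded
	}
	return u.String()
}

func main() {
	fmt.Println(BuildSearchURL(SearchParams{
		City:      "bydgoszcz",
		Query:     "mieszkanie",
		PriceFrom: 100,
		PriceTo:   3000,
	}))
}
```

Building the query with url.Values rather than string concatenation keeps the bracketed parameter names correctly escaped without hand-writing %5B and %5D.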

Extracting the offers

Offers on an OLX page are stored in divs with generated class names that change from time to time (e.g. class="css-1sw7q4x"). The parser looks for such divs and treats each one as an offer. Because offers contain nested divs, the parser tracks whether a closing div tag is the one that closes the offer and stores the result in the isOffer boolean variable. All the tags related to the offer are then passed to the extractOffer method to extract specific features.
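One way to tell the offer's closing div apart from the closing tags of nested divs is a depth counter kept alongside the isOffer flag. The sketch below illustrates that idea only. To stay independent of any particular tokenizer it works on a simplified, hypothetical Token type instead of real HTML tokens, and CollectOffers is an illustrative name; the counter itself is an assumption about how such tracking could be done, not a description of the project's code.

```go
package main

import "fmt"

// TokenKind and Token are a simplified, hypothetical stand-in for the
// tokens an HTML tokenizer would produce.
type TokenKind int

const (
	StartTag TokenKind = iota
	EndTag
	Text
)

type Token struct {
	Kind  TokenKind
	Tag   string // tag name for StartTag and EndTag tokens
	Class string // class attribute for StartTag tokens (single class only)
	Data  string // text content for Text tokens
}

// CollectOffers groups the tokens that belong to each offer div.
// depth counts the divs currently open inside the offer, including the
// offer div itself, so the offer ends when depth returns to zero.
func CollectOffers(tokens []Token, offerClass string) [][]Token {
	var offers [][]Token
	var current []Token
	isOffer := false
	depth := 0

	for _, t := range tokens {
		if !isOffer {
			// Outside an offer: only an opening div with the offer class matters.
			if t.Kind == StartTag && t.Tag == "div" && t.Class == offerClass {
				isOffer = true
				depth = 1
				current = []Token{t}
			}
			continue
		}

		current = append(current, t)
		if t.Tag == "div" {
			switch t.Kind {
			case StartTag:
				depth++
			case EndTag:
				depth--
			}
		}
		if depth == 0 {
			// This closing div belongs to the offer itself.
			offers = append(offers, current)
			current = nil
			isOffer = false
		}
	}
	return offers
}

func main() {
	tokens := []Token{
		{Kind: StartTag, Tag: "div", Class: "css-1sw7q4x"},
		{Kind: StartTag, Tag: "div"},
		{Kind: Text, Data: "nested content"},
		{Kind: EndTag, Tag: "div"},
		{Kind: EndTag, Tag: "div"},
	}
	fmt.Println(len(CollectOffers(tokens, "css-1sw7q4x")))
}
```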

Here’s the general structure of an offer, with the list of features the parser aims to extract:

// Offer of an apartment for rent.
//
// Attributes:
//
//  Title: The title of the offer.
//  Price: The price of the offer.
//  Location: The location of the offer.
//  Time: The time offer was posted or updated.
//  Url: The URL of the offer.
//  AdditionalPayment: The additional payment for the offer.
//  Description: The description of the offer.
//  Rooms: The number of rooms on offer.
//  Area: The area of the offer.
//  Floor: The floor of the offer.
//  Images: The URLs of the offer's images.
type Offer struct {
   Title             string
   Price             string
   Location          string
   Time              string
   Url               string
   AdditionalPayment string
   Description       string
   Rooms             string
   Area              string
   Floor             string
   Images            []string
}

The following attributes can be extracted from the page with offers:

Title of the offer
Location
Area of the apartment
Time the offer was posted (relative to the parsing time)
Base price
URL to the offer

This is what the extractOffer method is responsible for. Since almost all the data is stored in span tags, it is a simple task.

Data extraction

By default, the parser doesn’t follow the offer’s link. That step is performed inside the Telegram bot logic in order to reduce the number of requests to OLX (there’s no rate-limiting logic implemented yet). However, the parser package offers the ParseOffer method, which accepts a partly filled offer as a parameter and completes it with the data parsed from the Offer.Url page.

The ParseOffer method determines whether the provided URL is an OLX or an Otodom one and invokes the appropriate method. For OLX offers, the task is similar to the offers page: all the data is stored in separate tags with custom classes indicating which is which. The only difference is that much of the information is held in tags that share the same structure (e.g. <p class="css-b5m1rv er34gjf0">Powierzchnia: 18 m²</p>). Because the label and the value are separated by a colon, extracting them presents no difficulty.
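Both steps described above, choosing the portal from the URL and splitting a "label: value" text, can be sketched in a few lines. The names DetectSource, Source and SplitLabelValue are hypothetical and not taken from the parser package. The sketch also assumes offer URLs are absolute (with a scheme), since url.Parse does not recognize a host otherwise.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// Source identifies the portal an offer URL belongs to (hypothetical type).
type Source int

const (
	SourceUnknown Source = iota
	SourceOLX
	SourceOtodom
)

// DetectSource decides by host name which portal-specific parser to use.
func DetectSource(rawURL string) Source {
	u, err := url.Parse(rawURL)
	if err != nil {
		return SourceUnknown
	}
	host := strings.ToLower(u.Hostname())
	switch {
	case host == "olx.pl" || strings.HasSuffix(host, ".olx.pl"):
		return SourceOLX
	case host == "otodom.pl" || strings.HasSuffix(host, ".otodom.pl"):
		return SourceOtodom
	}
	return SourceUnknown
}

// SplitLabelValue splits text such as "Powierzchnia: 18 m²" at the first
// colon and trims surrounding whitespace. ok is false if there is no colon.
func SplitLabelValue(text string) (label, value string, ok bool) {
	label, value, ok = strings.Cut(text, ":")
	if !ok {
		return "", "", false
	}
	return strings.TrimSpace(label), strings.TrimSpace(value), true
}

func main() {
	label, value, _ := SplitLabelValue("Powierzchnia: 18 m²")
	fmt.Printf("%s = %s\n", label, value)
}
```

Matching on the host rather than searching the whole URL string avoids misclassifying a link that merely mentions the other portal in its path or query.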

On the other hand, the problem with Otodom offers is that each value is separated from its label (e.g. Powierzchnia and 18 are in separate HTML divs). To track which label a value corresponds to, I had to introduce a set of boolean variables indicating which label the data being parsed at the moment belongs to, and operate on them as a single one-hot encoded vector. The logic is as follows:

Find div with tags
Find div with label
Set the flag for that label and clear the others (e.g. IsArea = True)
Find the next div with data
Extract the data and assign it to the label
Reset the boolean flag (e.g. IsArea = False)
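The steps above can be sketched as a small state machine over the texts of consecutive divs. This is an illustration under assumptions: the Details and labelFlags types and the PairLabels function are hypothetical, and apart from Powierzchnia the label strings ("Liczba pokoi", "Piętro") are assumed for the example rather than taken from the Otodom page.

```go
package main

import "fmt"

// Details holds the fields filled in from an offer page (hypothetical type).
type Details struct {
	Area, Rooms, Floor string
}

// labelFlags plays the role of the one-hot vector: at most one flag is
// true at any time, marking which label the next value belongs to.
type labelFlags struct {
	isArea, isRooms, isFloor bool
}

// PairLabels walks the div texts in document order. A label div sets
// exactly one flag; the div that follows is stored under that label and
// the flags are cleared again.
func PairLabels(divTexts []string) Details {
	var d Details
	var f labelFlags

	for _, text := range divTexts {
		switch {
		// A flag is set: this div is the value for the pending label.
		case f.isArea:
			d.Area = text
			f = labelFlags{}
		case f.isRooms:
			d.Rooms = text
			f = labelFlags{}
		case f.isFloor:
			d.Floor = text
			f = labelFlags{}
		// No flag is set: check whether this div is a known label.
		// Assigning a fresh struct sets one flag and clears the others.
		case text == "Powierzchnia":
			f = labelFlags{isArea: true}
		case text == "Liczba pokoi": // assumed label text
			f = labelFlags{isRooms: true}
		case text == "Piętro": // assumed label text
			f = labelFlags{isFloor: true}
		}
	}
	return d
}

func main() {
	d := PairLabels([]string{"Powierzchnia", "18 m²", "Piętro", "2"})
	fmt.Printf("%+v\n", d)
}
```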

As on OLX offer pages, the offer description is stored in a specific div, so extracting it was straightforward.

Last but not least, images. OLX stores image URLs in regular self-closing <img> HTML tags, so extracting them was straightforward. Otodom, however, keeps them in a one-line JSON object at the end of the page. That’s why the parseOtodomImages method was added: it extracts the images under the props->pageProps->ad->images[] path of that JSON object.
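Walking the props->pageProps->ad->images[] path can be sketched with encoding/json and nested structs. This is not the project's parseOtodomImages: the otodomPage type and ExtractImageEntries function are hypothetical, the JSON is assumed to have already been cut out of the page, and because the shape of each images[] entry isn't described here, the entries are returned undecoded as json.RawMessage.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// otodomPage mirrors only the props->pageProps->ad->images[] path;
// all other fields in the JSON are ignored by the decoder.
type otodomPage struct {
	Props struct {
		PageProps struct {
			Ad struct {
				// Entries are kept raw because their shape is not assumed.
				Images []json.RawMessage `json:"images"`
			} `json:"ad"`
		} `json:"pageProps"`
	} `json:"props"`
}

// ExtractImageEntries returns the raw entries found under
// props.pageProps.ad.images in the given JSON document.
func ExtractImageEntries(pageJSON []byte) ([]json.RawMessage, error) {
	var page otodomPage
	if err := json.Unmarshal(pageJSON, &page); err != nil {
		return nil, fmt.Errorf("decoding page JSON: %w", err)
	}
	return page.Props.PageProps.Ad.Images, nil
}

func main() {
	// Made-up sample data with placeholder entries.
	sample := []byte(`{"props":{"pageProps":{"ad":{"images":["first","second"]}}}}`)
	entries, err := ExtractImageEntries(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(entries), "image entries")
}
```

Turning each raw entry into an image URL then depends on how Otodom structures the entries, which would need to be checked against a real page.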

Summary

The application provides a user-friendly interface for creating subscriptions to specific housing search queries. Thanks to the extracted data, the user doesn't have to spend time manually searching for offers on various portals and can have all the offers in one place, with a convenient system for tracking viewed offers. This allows users to save time and search for housing more efficiently.
