Italian real estate
2. Detailed description

This page contains all the details on how this project was developed, from start to finish.
Raw data extraction and datalake population
The extraction of the raw data is achieved by scraping immobiliare.it, and is performed by the webscraping script.
There are 107 provinces in Italy, and the website contains three different types of listings: `sale`, `auction` and `rent`. The scraping is orchestrated with Apache Airflow as follows: for each province, the script scrapes data for the `rent` listings first, then it scrapes `auction` listings and finally `sale` listings, processing each province in parallel. After fetching the raw data (i.e., the HTML source code of the listing's webpage), the script loads it into a local instance of a MongoDB data lake.
In this data lake, each listing's MongoDB document contains the following information:
- the listing's unique ID (the same ID used for the listing on the website)
- the listing page's HTML source code
- the name and two-letter code of the province the listing is from
- the ID of the parent listing (if it exists)
- a list of the IDs of the child listings (if they exist)

"Parent" and "child" listings are used, for example, when there are multiple houses on sale in the same building (in this case, the parent listing is a listing that shows info about the building, while the child listings are the individual listings of each unit on sale).
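For illustration, a single data lake document might look like the sketch below. The key names are my paraphrase of the list above, and the database and collection names are placeholders, not necessarily the ones the script actually uses.

```python
# Illustrative sketch of one data lake document (key names are assumptions
# based on the field list above).
from pymongo import MongoClient

listing_document = {
    "_id": "12345678",             # same unique ID the listing has on the website
    "html": "<!DOCTYPE html>...",  # raw HTML source code of the listing's webpage
    "province_name": "Milano",     # name of the province the listing is from
    "province_code": "MI",         # two-letter province code
    "parent_listing": None,        # ID of the parent listing, if it exists
    "child_listings": ["12345679", "12345680"],  # IDs of the child listings, if any
}

client = MongoClient("mongodb://localhost:27017")
client["datalake"]["listings"].insert_one(listing_document)
```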
Here is a schematic representation of how Airflow orchestrates this part:

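In code, the dependency structure might look roughly like the following minimal sketch using Airflow's TaskFlow API. The DAG name, task IDs and province values are illustrative, not the project's actual code.

```python
from airflow.decorators import dag, task
import pendulum

@dag(schedule=None, start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def scrape_immobiliare():
    @task
    def scrape(province: str, listing_type: str):
        ...  # fetch the raw HTML of every listing page and load it into MongoDB

    for province in ["MI", "RM", "TO"]:  # in reality, all 107 provinces
        # rent -> auction -> sale runs sequentially within a province, while
        # the per-province chains are independent and therefore run in parallel
        rent = scrape.override(task_id=f"scrape_{province}_rent")(province, "rent")
        auction = scrape.override(task_id=f"scrape_{province}_auction")(province, "auction")
        sale = scrape.override(task_id=f"scrape_{province}_sale")(province, "sale")
        rent >> auction >> sale

scrape_immobiliare()
```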
The raw data for this project was scraped once in January and once in February 2025.
Relevant data extraction and MongoDB data warehouse population
After the extraction of the raw data, the MongoDB migration script uses Airflow to extract only the relevant information from each listing, perform some basic transformations, and load it into a non-relational data warehouse, again hosted as a local MongoDB instance.
These ETL pipelines are organized by listing type: Airflow executes the ETL pipelines for `rent`, `auction` and `sale` listings in parallel before logging the statistics of the newly populated data warehouse:

In this data warehouse, each listing's document contains the following information:
- `_id`: the unique MongoDB ID for each document, taken from the listing's ID on the website
- `data`: this field contains all the data relative to the listing. Since the same listing can in principle be scraped on different dates, this field is an array where each element is a dictionary containing the following information:
  - `scraping_date`: date the listing was scraped
  - `listing_features`: contains basic information on the listing:
    - `price`: asking price for `sale` listings, minimum offer for `auction` listings, and monthly rent for `rent` listings
    - `total_surface`: property's total surface, in square meters
    - `price_per_sq_mt`: `price` / `total_surface`
    - `listing_info`: contains some of the listing's metadata:
      - `listing_age`: number of days since the creation of the listing
      - `listing_last_update`: number of days since the listing was last modified
      - `is_listing_new`: `True` if the listing has not been modified since its creation, `False` otherwise
      - `has_parent`: `True` if the listing has a parent listing, `False` otherwise
      - `has_child`: `True` if the listing has at least one child listing, `False` otherwise
      - `seller_type`: what legal entity is selling/renting the property (e.g., private citizen, real estate agency, court, etc.)
      - `number_of_photos`: number of pictures in the listing
  - `additional_costs`: contains information about other potential additional costs:
    - `other_costs`:
      - `condominium_expenses`: if present, monthly condominium expenses (e.g., for elevator maintenance, gardening of common green areas, etc.), in euros
      - `heating_expenses`: yearly expenses for heating the property, in euros
    - `mortgage_rate`: average mortgage rate available from the website at the time of scraping
  - `auction_info`: for `auction` listings, this is a dictionary that contains information about the auction:
    - `auction_end_date`: date when the auction will end
    - `deposit_modality`: description of how the deposit for the property should be made
    - `auction_type`: type of auction (e.g., with/without competitive bidding, etc.)
    - `is_open`: `True` if the auction is still open, `False` otherwise
    - `minimum_offer`: auction's minimum offer
    - `procedure_number`: reference number for the auction's legal procedure
    - `auction_court`: name of the court responsible for this auction
  - `cadastral_info`: for `auction` listings, this is a dictionary containing cadastral information on the property:
    - `cadastral`: cadastral reference number for the property
    - `cadastral_additional_info`: if present, additional cadastral info
    - `sub_cadastral_info`: if present, more additional cadastral info
  - `surface_composition`: if not empty, this field is an array of dictionaries, each containing information on a different part of the property (i.e., the house, the backyard, the garage, etc.):
    - `element`: name of the current element (e.g., "house" or "garage")
    - `floor`: at what floor of the property the element is located
    - `surface`: its surface, in square meters
    - `percentage`: what percentage of the property the element makes up
    - `commercial_surface`: the surface of the ENTIRE element (e.g., including walls)
    - `surface_type`: this field can either be "principal" or "accessory"
  - `availability`: indicates whether or not the property is immediately available
  - `type_of_property`: this is a dictionary containing the following information:
    - `class`: information about the type of property (e.g., independent house or apartment)
    - `category`: indicates whether the property is (for example) residential or commercial real estate
    - `is_luxury`: `True` if the property is registered as luxury real estate (e.g., castles)
  - `condition`: condition of the property (e.g., "good", "needs renovation", etc.)
  - `rooms_info`: this is a dictionary that contains information on the property's rooms:
    - `bathrooms_number`: total number of bathrooms, ranging from 1 to 4 (where 4 means "more than 3")
    - `bedrooms_number`: total number of bedrooms
    - `total_room_number`: total number of rooms in the property, ranging from 1 to 6 (where 6 means "more than 5")
    - `rooms_description`: concise textual description of the property's rooms
    - `kitchen_status`: type of kitchen (e.g., open kitchen, kitchenette or cooking corner)
    - `garage`: if present, textual description of how many garages/parking spots are associated with the property
    - `property_floor`: at what floor the property is located; 0 is the ground floor
  - `additional_features`: a list of strings, each indicating a different feature that the listing has (e.g., fireplace, private garden, furniture, balcony, etc.)
  - `building_info`: this is a dictionary containing information about the building where the property is located:
    - `has_elevator`: `True` if the building has an elevator, `False` otherwise
    - `building_usage`: what the building is used for (e.g., apartments, offices, factory, hotel, etc.)
    - `building_year`: year the building was built
    - `total_building_floors`: how many floors the building has in total
    - `total_number_of_residential_units`: if the listing is of a residential building, the total number of units
    - `work_start_date`: for new constructions, the day the works started (or will start)
    - `work_end_date`: for new constructions, the day the works ended (or will end)
    - `work_completion`: a number that measures how far along the construction works are: it is 0 on `work_start_date` and 1 on `work_end_date`; negative numbers mean that the works haven't started yet, and numbers larger than 1 mean that the works have finished
  - `energy_info`: this is a dictionary containing information on the energy efficiency of the property:
    - `is_zero_energy_building`: `True` if the property's building is certified as being "zero energy" (i.e., extremely energy efficient)
    - `heating_type`: type of heating (e.g., independent vs. centralized, radiators vs. fan coils, gas powered vs. solar powered, etc.)
    - `energy_class`: EU energy label for the property. The classification system goes from G (the least energy efficient) to A (the most energy efficient); class A is further divided into four sub-classes, A1 to A4
    - `energy_consumption`: an estimate of the property's energy consumption, in kWh per square meter per year
    - `air_conditioning`: if present, the type of air conditioning (e.g., hot/cold, only cold or only hot, autonomous vs. centralized, etc.)
  - `location_info`: this is a dictionary containing information about the listing's location:
    - `latitude`: listing's latitude
    - `longitude`: listing's longitude
    - `region`: name of the Italian region where the listing is located
    - `province`: name of the Italian province where the listing is located
    - `province_code`: two-letter code of the listing's province
    - `city`: name of the city (or town) where the listing is located
    - `macrozone`: if present, name of the larger neighborhood where the listing is located
    - `microzone`: if present, name of the area within the neighborhood where the listing is located
    - `locality`: if present, name of the town or village where the property is located
    - `address`: name of the street where the property is located
    - `street_number`: the property's street number
  - `text_info`: this is a dictionary containing all the textual information on the listing:
    - `title`: listing's title
    - `caption`: listing's caption
    - `description`: full textual description of the listing
- `child_listings`: in case of a parent listing, this field is an array with the IDs of the child listings; otherwise it's `null`
- `parent_listing`: in case of a child listing, this is the ID of the parent listing
- `province_name`: name of the province from which the listing was scraped
- `province_code`: two-letter code of the province
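Putting the schema together, an abridged warehouse document might look like the following skeleton (values invented for illustration):

```python
# Abridged skeleton of one warehouse document (values are made up).
warehouse_document = {
    "_id": "12345678",
    "data": [
        {  # one entry per scraping date
            "scraping_date": "2025-01-15",
            "listing_features": {
                "price": 250_000,
                "total_surface": 100,
                "price_per_sq_mt": 2_500.0,
                "listing_info": {"listing_age": 42, "seller_type": "agency"},  # etc.
            },
            "additional_costs": {"other_costs": {"condominium_expenses": 120}},  # etc.
            "location_info": {"latitude": 45.4642, "longitude": 9.1900},  # etc.
            "text_info": {"title": "...", "caption": "...", "description": "..."},
            # ...plus the remaining fields described above...
        }
    ],
    "child_listings": None,
    "parent_listing": None,
    "province_name": "Milano",
    "province_code": "MI",
}
```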
PostgreSQL data warehouse population
At this point the data warehouse was migrated into a relational database, hosted as a local PostgreSQL instance, with the MongoDB to PostgreSQL migration script.
This warehouse was built with a snowflake schema to ensure data consistency and reduce data redundancy. Here is a graphical representation (ERD) of the PostgreSQL data warehouse's structure:

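To illustrate what the snowflake layout means in practice, here is a hypothetical fragment of such a schema: a central listings table points to normalized dimension tables, which in turn point to further dimensions (city → province → region). All table and column names here are invented for illustration; the actual structure is the one shown in the ERD above.

```python
# Hypothetical snowflake-schema fragment (names invented for illustration).
import psycopg2

DDL = """
CREATE TABLE regions   (region_id   SERIAL PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE provinces (province_id SERIAL PRIMARY KEY, name TEXT,
                        code CHAR(2), region_id INT REFERENCES regions);
CREATE TABLE cities    (city_id     SERIAL PRIMARY KEY, name TEXT,
                        province_id INT REFERENCES provinces);
CREATE TABLE listings  (listing_id  BIGINT PRIMARY KEY, listing_type TEXT,
                        price NUMERIC, total_surface NUMERIC,
                        city_id INT REFERENCES cities);
"""

with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```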
Several of the fields in this warehouse contain text or words in Italian. Therefore, this script also takes care of translating into English the fields that will be used in the remainder of the project. Translations are done automatically with a local instance of LibreTranslate, with some manual adjustments for real estate-specific jargon.
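For reference, a translation call against a local LibreTranslate instance looks roughly like this (the endpoint and fields follow LibreTranslate's standard API; the jargon dictionary below is an invented example, not the project's actual list of adjustments):

```python
import requests

# Illustrative manual fixes for real estate jargon (e.g., Italian "box"
# means a garage, which generic translation tends to miss).
JARGON_FIXES = {"box": "garage", "asta giudiziaria": "judicial auction"}

def translate(text: str) -> str:
    fixed = JARGON_FIXES.get(text.lower())
    if fixed is not None:
        return fixed
    response = requests.post(
        "http://localhost:5000/translate",  # assumed local LibreTranslate port
        json={"q": text, "source": "it", "target": "en", "format": "text"},
        timeout=30,
    )
    return response.json()["translatedText"]
```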
Since there are a lot of records in the database (~1 million), the script divides the data migration into batches of 10,000 records, using Airflow's dynamic task generation to process them.
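A minimal sketch of this batching pattern with Airflow's dynamic task mapping follows; the DAG name, task names and the MongoDB/PostgreSQL plumbing are placeholders.

```python
from airflow.decorators import dag, task
import pendulum

@dag(schedule=None, start_date=pendulum.datetime(2025, 1, 1), catchup=False)
def mongo_to_postgres():
    @task
    def make_batches(total: int, size: int = 10_000) -> list[dict]:
        # one dict of query bounds per batch of records
        return [{"skip": i, "limit": size} for i in range(0, total, size)]

    @task
    def migrate_batch(bounds: dict):
        ...  # read `limit` documents starting at `skip` from MongoDB,
             # transform them, and insert the resulting rows into PostgreSQL

    # dynamic task mapping: one migrate_batch task instance per batch
    migrate_batch.expand(bounds=make_batches(total=1_000_000))

mongo_to_postgres()
```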
Synthetic data generation
Once the PostgreSQL data warehouse has been populated and the useful fields have been translated into English, the synthetic data generation script extracts relevant information from the PostgreSQL warehouse and uses it to create a `pandas` dataframe, applying some data transformations in the process:
- `listing_id`: same as in the PostgreSQL warehouse
- `listing_type`: same as in the PostgreSQL warehouse
- `price`: same as in the PostgreSQL warehouse
- `surface`: same as in the PostgreSQL warehouse
- `seller_type`: same as in the PostgreSQL warehouse
- `condominium_monthly_expenses`: same as in the PostgreSQL warehouse
- `heating_yearly_expenses`: same as in the PostgreSQL warehouse
- `type_of_property`: same as in the PostgreSQL warehouse
- `condition`: same as in the PostgreSQL warehouse
- `category_name`: same as in the PostgreSQL warehouse
- `total_room_number`: same as in the PostgreSQL warehouse
- `bathrooms_number`: same as in the PostgreSQL warehouse
- `kitchen_status`: same as in the PostgreSQL warehouse
- `garage`: `1` if the property has at least one garage/parking spot, `0` otherwise
- `floor`: same as in the PostgreSQL warehouse
- `has_elevator`: same as in the PostgreSQL warehouse (with `1` for `True` and `0` for `False`)
- `building_usage`: same as in the PostgreSQL warehouse
- `building_year`: same as in the PostgreSQL warehouse
- `total_building_floors`: same as in the PostgreSQL warehouse
- `is_zero_energy`: same as in the PostgreSQL warehouse (with `1` for `True` and `0` for `False`)
- `heating_type`: same as in the PostgreSQL warehouse
- `energy_class`: same as in the PostgreSQL warehouse, but converted to numbers (from `G` → `10` to `B` → `5`, then `A1` → `4` and so on until `A4` → `1`)
- `air_conditioning`: same as in the PostgreSQL warehouse
- `latitude`: same as in the PostgreSQL warehouse
- `longitude`: same as in the PostgreSQL warehouse
- `region`: same as in the PostgreSQL warehouse
- `province`: same as in the PostgreSQL warehouse
- `features`: the dataframe then contains a series of one-hot encoded columns, one for each of the additional features that can potentially be found in a listing; if a column contains `1`, then that listing has that particular feature, otherwise it contains `0`
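A few of these transformations, sketched with `pandas` (column names as in the list above; the assumption that `features` arrives as one comma-separated string per listing is mine, and the exact code lives in the synthetic data generation script):

```python
import pandas as pd

# Ordinal encoding of the EU energy label, as described above:
# G (least efficient) -> 10, ..., B -> 5, then A1 -> 4 down to A4 -> 1.
ENERGY_CLASS_MAP = {
    "G": 10, "F": 9, "E": 8, "D": 7, "C": 6, "B": 5,
    "A1": 4, "A2": 3, "A3": 2, "A4": 1,
}

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["energy_class"] = df["energy_class"].map(ENERGY_CLASS_MAP)
    df["has_elevator"] = df["has_elevator"].astype(int)    # True/False -> 1/0
    df["is_zero_energy"] = df["is_zero_energy"].astype(int)
    df["garage"] = df["garage"].notna().astype(int)        # any garage info -> 1
    # one-hot encode the additional features into 0/1 columns
    features = df.pop("features").str.get_dummies(sep=",")
    return pd.concat([df, features], axis=1)
```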
This data is then used to generate synthetic data with a custom algorithm¹ (based on the K Nearest Neighbors algorithm) that I developed. Here is a schematic description of how the algorithm works (a simplified code sketch follows the list):
- For a given listing type (i.e., `rent`, `auction` or `sale`), the algorithm randomly draws one listing of that type
- The algorithm looks for the 5 nearest neighbors, where the distance metric is a weighted average of the distances in `price`, `surface`, `latitude` and `longitude`. In other words, we look for 5 physically close listings with a `price` and `surface` similar to those of the selected one
- A new synthetic listing is created by taking the weighted average of the numerical features (e.g., `price`, `surface`, `latitude` and `longitude`) of the neighbors, while for each categorical feature the most frequent value among the neighbors is chosen
- The generation of synthetic listings is repeated until the desired number is reached
- A post-processing step makes sure that all the listings' coordinates fall within the borders of the Italian provinces they are supposed to belong to
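A simplified sketch of one generation step is shown below, in plain NumPy with invented function and variable names. The inverse-distance weighting of the neighbors is one plausible reading of "weighted average"; the actual script runs the distance computations with TensorFlow and adds the coordinate post-processing step mentioned above.

```python
import numpy as np

def make_synthetic_listing(X_num, X_cat, weights, k=5, rng=np.random.default_rng()):
    """One generation step: X_num holds the numerical features (price, surface,
    latitude, longitude), X_cat the categorical ones, and `weights` the
    per-feature distance weights."""
    i = rng.integers(len(X_num))                      # randomly draw one listing
    # weighted distance between the drawn listing and every other listing
    d = np.sqrt((((X_num - X_num[i]) * weights) ** 2).sum(axis=1))
    neighbors = np.argsort(d)[1:k + 1]                # k nearest, excluding itself
    # numerical features: weighted average of the neighbors
    w = 1.0 / (d[neighbors] + 1e-9)                   # inverse-distance weighting
    new_num = (X_num[neighbors] * w[:, None]).sum(axis=0) / w.sum()
    # categorical features: most frequent value among the neighbors
    new_cat = []
    for column in X_cat[neighbors].T:
        values, counts = np.unique(column, return_counts=True)
        new_cat.append(values[counts.argmax()])
    return new_num, new_cat
```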
This algorithm, therefore, ensures that none of the synthetic listings contains exactly the same information as the original dataset, while the statistical properties of the data are left essentially unchanged.
The synthetic data generation script also takes advantage of TensorFlow's GPU acceleration.
I used this algorithm to generate a number of synthetic listings of each type that roughly matches the number in the original dataset.
ML rent predictor model training
Once the synthetic data is ready, the rent predictor script takes care of creating a Machine Learning model to predict the rental value of real estate listings. In other words, we extract the `rent` listings from the synthetic data to train a rent predictor model, with the idea of then applying it to `auction` and `sale` listings to determine the amount of monthly rent that could be earned after buying a particular property.
Since there are potentially many non-linear and non-trivial relationships between the listings’ features, I opted for a Random Forest Regressor model. All details can be found in the rent predictor script.
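In outline, the training step could look like the following sketch. The hyperparameters, the train/test split and the placeholder dataframes `rent_df` and `sale_and_auction_df` are illustrative; the actual choices are in the script.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# rent_df: the synthetic `rent` listings; here `price` is the monthly rent
X = rent_df.drop(columns=["price", "listing_id", "listing_type"])
y = rent_df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print("MAE on held-out rents:", mean_absolute_error(y_test, model.predict(X_test)))

# the trained model is then applied to the `sale` and `auction` listings
predicted_rent = model.predict(sale_and_auction_df[X.columns])
```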
After training the model and making sure it's doing an acceptable job, the value of the rent is predicted for all `sale` and `auction` listings, and the results are put in a dataset for visualization with Tableau.
Data visualization with a dashboard
The final result of this project is a Tableau dashboard, which can be found here.
Footnotes
1. I originally tried to train a generative model (CTGAN from the Synthetic Data Vault) to create the synthetic data. The task proved too difficult to tackle with the computational power I had available at the time I was working on this project (i.e., my own personal laptop): the CTGAN was only able to learn the distributions of the individual listing features, but not their correlations. Therefore, critical information like the relationship between location and price was completely lost. For this reason I took a step back and developed my own algorithm as described here. ↩