Italian real estate
1. Intro and general overview

This page contains only an introduction to this project and its general overview; the other parts of the project are covered in their own pages.
Why this project?
Introduction
When aspiring data scientists want to get their hands dirty analyzing data, most online resources list several “classic” problems that they are encouraged to tackle. One of the most popular is without doubt building a regression model to predict house prices. As with many similar problems, however, it is extremely difficult to find truly original datasets, and people end up using the same trite and/or old ones (such as the Ames, Boston and California housing datasets). This makes it really hard to showcase one’s ability to work with data, because, consciously or not, people end up following one of the countless tutorials already available online for these well-known datasets. With this project I therefore set out to approach the challenge by creating my own dataset from scratch, which I later use for analysis. Since I am from Italy, the project deals with data on Italy’s real estate market.
In this project, I take care of everything: the collection of the raw data, its clean-up, its insertion into a data warehouse, and the building of machine learning models and dashboards.
Aims
The final deliverable of this project is a comprehensive analytics dashboard designed specifically for real estate investors seeking data-driven decision support. The dashboard uses the machine learning model trained in the project to predict the potential rental income of properties listed for sale or auction, letting investors instantly visualize the key metrics that inform investment decisions.
The dashboard therefore allows users to identify areas of Italy with promising investment opportunities.
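To illustrate the kind of rental-income regression the dashboard could rely on, here is a minimal scikit-learn sketch. The feature set and the randomly generated data are purely hypothetical placeholders, not the project's actual schema or dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical listing features: surface (m^2), rooms, floor, distance from centre (km)
n = 500
X = np.column_stack([
    rng.uniform(30, 200, n),   # surface
    rng.integers(1, 6, n),     # rooms
    rng.integers(0, 10, n),    # floor
    rng.uniform(0, 15, n),     # distance
])
# Hypothetical monthly rent: roughly proportional to surface, discounted by distance
y = 8 * X[:, 0] + 50 * X[:, 1] - 20 * X[:, 3] + rng.normal(0, 50, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print(f"R^2 on held-out listings: {model.score(X_test, y_test):.2f}")
```

In a dashboard setting, `model.predict` would be run on properties currently listed for sale or auction to estimate the rent they could command.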
This project was also designed to showcase that I am able to:
- Handle a project end-to-end, from its design to its complete deployment
- Extract data of interest from its source
- Handle the creation and management of a database (both non-relational and relational) from scratch for a specific purpose
- Build and train custom AI models (used in this project to generate synthetic data)
- Analyze data and generate valuable insights for specific audiences, and communicate them through dashboards
Skills that I’ve honed with this project:
- Web scraping
- ETL pipeline deployment
- Workflow management (Airflow)
- Creation and management of relational (PostgreSQL) and non-relational (MongoDB) databases
- Data cleaning and processing
- Data warehousing
- Data modeling
- Data analysis
- Development of a custom algorithm for synthetic data generation
- Development of Machine Learning predictive models (scikit-learn)
- Dashboard creation for data visualization (Tableau)
General overview of the project
- The raw data is collected by scraping immobiliare.it, the largest online real estate marketplace in Italy. Generally speaking, the website contains three types of listings: sales, rents and auctions. Each category contains several types of real estate, from regular apartments/houses to whole buildings or commercial spaces. The raw data for every listing is loaded into a non-relational data lake, hosted on a local MongoDB instance
- An ETL pipeline then extracts all useful information on each listing from the data lake and loads it into a MongoDB data warehouse, again hosted on a local MongoDB instance
- Another ETL pipeline then extracts data from the MongoDB warehouse and loads it into a PostgreSQL data warehouse (with some data cleaning in the process)
- The data in the PostgreSQL warehouse is then fed to a custom algorithm for synthetic data generation that I developed
- This synthetic data is then analyzed and used to train the machine learning models
- The results of the analysis and the models' predictions are then used to build the dashboards described above
Here is a schematic representation of the project’s high-level structure:

For more information on how each step of the project works, you can look at the detailed description.
Code availability and disclaimer
The code I wrote for the project can be found here. Please note that some of the scripts (especially the one used for the raw data extraction) have been partially redacted to prevent out-of-the-box reproducibility.
The code is shared solely as a demonstration of my programming skills, problem-solving approach, and technical implementation abilities. It is not intended to enable others to scrape websites, but to showcase my understanding of data collection and handling techniques as part of my professional portfolio.