This week I scraped and analyzed property listings in Mexico City to build a price prediction model.

Tools I used:
Scraping:
A property listing website
Scrapy - an open source web scraping Python package
Selenium - automates a browser as part of a web scraper
user-agents.net - supplies common user agents
Webshare - proxy servers for rent
capsolver + python3-capsolver - captcha solving as a service
Model building:
A Jupyter notebook - to analyze the data
pandas, seaborn, xgboost, shap - Python packages
Context
For a long time I’ve wanted to try my hand at a prototypical data modeling problem: predicting house prices. I’ve tried a few times over the past few years, but I always got stuck on the scraping part, before I could even start modeling. I could have used an existing house price dataset, but the full end-to-end process is more appealing to me. This time I was determined to see it through to the end.
Process
There are three major components of this analysis:
Scraping - collecting the raw data from a property listing aggregator
Exploration and Modeling - cleaning, processing, feature engineering, model fitting
Insights - visualizing the data, interpreting it, drawing conclusions
Part 1 - Scraping
I committed to not giving up, and as a result I did a much deeper dive into web scraping than I had anticipated. I decided to use Scrapy - an open source Python library.
Web scraping is the process of programmatically browsing through a large number of URLs and saving data from each page. In my case, I’m crawling through thousands of property listings and storing the details of each - price, area, number of bathrooms, bedrooms, location, etc.
A lot of websites don’t want you to do that, so they employ various measures to make your life difficult. The cat and mouse game of web scraping is figuring out how to sneak through these barriers at scale.
The first challenge is understanding the URL tree structure of the website.
Problem: For the property listing website I was using, there were ~65,000 for-sale listings in Mexico City. They displayed 30 results per page, so 65,000/30 ≈ 2,167 results pages to crawl through. However, I quickly discovered they don’t let you browse past page 166 (even manually), meaning the maximum number of listings I could get that way would be 30*166 = 4,980.
Solution: The breadcrumb structure of the website is organized like this:
City > Suburb > Neighborhood

Each neighborhood has <4,980 listings, so by splitting the crawl at that level and scraping each neighborhood in turn, we avoid that first (easy) obstacle.
I used CrawlerRunner to spin up dozens of simultaneous spiders, one per neighborhood.
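Roughly, the runner setup looked like the sketch below. This is a minimal illustration, not the real code: the spider body, site URL, and neighborhood slugs are placeholders.

```python
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class NeighborhoodSpider(scrapy.Spider):
    # Placeholder spider: the real one walks the results pages and yields one item per listing.
    name = "neighborhood"

    def __init__(self, neighborhood=None, **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [f"https://listing-site.example/{neighborhood}/"]  # placeholder URL

    def parse(self, response):
        yield {"url": response.url}  # real spider extracts price, area, bedrooms, location, etc.

configure_logging()
runner = CrawlerRunner()

for slug in ["polanco", "condesa", "roma-norte"]:        # placeholder neighborhood slugs
    runner.crawl(NeighborhoodSpider, neighborhood=slug)  # schedule one spider per neighborhood

runner.join().addBoth(lambda _: reactor.stop())  # join() fires once every spider has finished
reactor.run()                                    # block until all crawls are done
```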
Problem: A Python script without adjustments doesn’t look like a human browsing from a web browser, and it’s easy to detect and block. Sending headers that look like a common browser is a start, but a normal browser doesn’t send dozens of concurrent requests from a single IP.
Solution:
I found 10 common user agents on user-agents.net and used a downloader middleware to cycle through them for each request (see the sketch after this list). Here are two examples of user agents:
- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
- As well as rotating user agents, we can rotate through proxy IPs. I found webshare.io, which has a free tier that gives you 10 rotating proxy IPs and 1 GB/month of bandwidth. I used ~300 MB testing many different approaches.
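Here’s a minimal sketch of the rotation middleware, assuming a standard Scrapy project layout. The Webshare endpoint and credentials are placeholders - the real host/port come from your Webshare dashboard.

```python
# middlewares.py - minimal sketch of user agent + proxy rotation
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    # ...the other 8 from user-agents.net
]

class RotateUserAgentMiddleware:
    """Downloader middleware: pick a random user agent (and proxy) for every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Webshare rotating proxy endpoint; credentials and host here are placeholders
        request.meta["proxy"] = "http://username:password@p.webshare.io:80"
        return None  # let Scrapy continue processing the request as normal

# settings.py (excerpt) - register the middleware; the priority number is arbitrary
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RotateUserAgentMiddleware": 543,
# }
```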
Problem: Even with the above methods, the site sometimes threw up a reCAPTCHA (specifically reCAPTCHA v2, shown below).

Solution: Capsolver - an API service that solves captchas for a small fee

Solved about 400 captchas, at a cost of $0.32.
I think I could have avoided hitting captchas at all if I had been smarter about user agents + request throttling.
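For reference, here’s a rough sketch of what a reCAPTCHA v2 solve looks like against Capsolver’s HTTP API (in practice I went through the python3-capsolver wrapper). The endpoint and field names follow Capsolver’s createTask/getTaskResult docs as I understand them, and the key, page URL, and site key are placeholders - treat this as a sketch, not a drop-in.

```python
import time
import requests

API_KEY = "CAP-XXXX"  # placeholder Capsolver client key

def solve_recaptcha_v2(page_url: str, site_key: str) -> str:
    # Submit the captcha as a task...
    task = requests.post("https://api.capsolver.com/createTask", json={
        "clientKey": API_KEY,
        "task": {
            "type": "ReCaptchaV2TaskProxyLess",
            "websiteURL": page_url,
            "websiteKey": site_key,
        },
    }).json()

    # ...then poll until the solver returns a token
    while True:
        time.sleep(3)
        result = requests.post("https://api.capsolver.com/getTaskResult", json={
            "clientKey": API_KEY,
            "taskId": task["taskId"],
        }).json()
        if result.get("status") == "ready":
            # the token gets submitted as the g-recaptcha-response of the blocked page
            return result["solution"]["gRecaptchaResponse"]
```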
Problem: Crawling became slow with all of the above in place. It was going to take 24 hours if I didn’t make some changes. It’s trivial to make many concurrent requests, but the more concurrent requests you make, the higher the risk of being blocked, which ultimately slows things down further.
Solution: AutoThrottle, a nice Scrapy feature that dynamically adjusts the request rate based on how the server responds, and keeps adjusting throughout the crawl.
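The relevant settings look roughly like this - the numbers are illustrative, not the exact values I ended up with:

```python
# settings.py (excerpt) - AutoThrottle plus the surrounding concurrency knobs
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # ceiling when the server slows down or pushes back
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0  # average requests Scrapy aims to keep in flight per server
AUTOTHROTTLE_DEBUG = False             # set True to log every throttling decision

CONCURRENT_REQUESTS = 32               # hard cap on concurrency across the whole crawl
RETRY_TIMES = 3                        # retry requests that get blocked or time out
```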

After many hours of tweaking, I managed to scrape 50k listings in a few minutes of scraping time. (If I need to scrape a website again, I’ll be much faster).
I missed 15k listings; I think there were some funky URLs that weren’t included in the map and were only accessible from a different part of the website. Still, 50k out of 65k is good enough to move on to the next part.
Part 2 - Exploration and Modeling
First, I wanted to check whether the listings I had gave good geographical coverage. When mapping (e.g. using GIS software), it’s common practice to base things on official administrative zones, e.g. postcodes, city boundaries, etc.
In Mexico City there are alcaldías/delegaciones (boroughs). On the left we have the 16 boroughs, and on the right 1,102 postcodes.
Then I plotted them on a map of the city. The listings didn’t come with a street address, so all listings in the same neighborhood have the same lat/long:

It does look like there are some gaps in the eastern part of the map. If I revisit this project, I’ll try to get a more complete dataset.
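For reference, a minimal sketch of how the neighborhood-level map can be drawn with pandas + seaborn - the file name and column names (lat, lon, price, neighborhood) are assumptions about my scraped output:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("listings.csv")  # hypothetical export of the scraped listings

# One point per neighborhood, since every listing in a neighborhood shares the same coordinates
neigh = df.groupby("neighborhood").agg(
    lat=("lat", "first"),
    lon=("lon", "first"),
    n_listings=("price", "size"),
    median_price=("price", "median"),
).reset_index()

sns.scatterplot(data=neigh, x="lon", y="lat",
                size="n_listings", hue="median_price", palette="viridis")
plt.title("Scraped listings by neighborhood, Mexico City")
plt.show()
```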
Data cleaning is so much faster with ChatGPT to co-write the laborious, repeatable code. Love it. I spent a bunch of time working through the data in a Jupyter notebook; here are the highlights of what I did:
Data Preprocessing:
used k-NN imputation to fill in missing values for features like bedrooms, bathrooms, and area + removed records that couldn’t be imputed
log transformed and scaled ‘price’ (the target variable)
identified and removed outliers
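A condensed sketch of those preprocessing steps - the column names and thresholds are assumptions about my data, not a faithful copy of the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv("listings.csv")                  # hypothetical export of the scraped listings
num_cols = ["bedrooms", "bathrooms", "area_m2"]   # assumed column names

# k-NN imputation for the sparse numeric features, then drop rows that are still unusable
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])
df = df.dropna(subset=["price"])

# log-transform the target to tame the heavy right tail of prices
df["log_price"] = np.log1p(df["price"])

# crude outlier removal: keep the 1st-99th percentile of log price
low, high = df["log_price"].quantile([0.01, 0.99])
df = df[df["log_price"].between(low, high)]
```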
Feature Engineering:
extracted textual features from the ‘description’ field - eg ‘alberca’, ‘jardín’, ‘terraza’
created polynomial features and interaction terms
performed frequency encoding on ‘agency_name’ and ‘neighborhood’, one-hot encoding for ‘property type’, and label encoding for ‘condition’
processed the ‘date’ field and generated a ‘days_since_earliest’ feature
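And a sketch of the feature engineering, under the same caveat that the file and column names are assumptions:

```python
import pandas as pd

df = pd.read_csv("listings_clean.csv")  # hypothetical output of the preprocessing step

# keyword flags pulled from the free-text description (pool, garden, terrace)
for kw in ["alberca", "jardín", "terraza"]:
    df[f"has_{kw}"] = df["description"].str.contains(kw, case=False, na=False).astype(int)

# frequency encoding: replace each category with how often it occurs in the data
for col in ["agency_name", "neighborhood"]:
    df[f"{col}_freq"] = df[col].map(df[col].value_counts(normalize=True))

# one-hot encode property type
df = pd.get_dummies(df, columns=["property_type"], prefix="type")

# days since the earliest listing date on the platform
df["date"] = pd.to_datetime(df["date"])
df["days_since_earliest"] = (df["date"] - df["date"].min()).dt.days
```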
Model Building and Evaluation:
trained an XGBoost regression model - initial R^2 = 0.58
identified the best parameters through hyperparameter tuning - R^2 = 0.83 (but it’s likely overfitting)
used SHAP values and correlation analyses to interpret feature importances
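A condensed sketch of the modeling step. I use a randomized search here for illustration; the parameter grid and the feature file are assumptions, not my exact setup:

```python
import pandas as pd
import shap
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV, train_test_split

df = pd.read_csv("listings_features.csv")  # hypothetical table of engineered, numeric features
X, y = df.drop(columns=["log_price"]), df["log_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# hyperparameter tuning over a small, illustrative search space
search = RandomizedSearchCV(
    xgb.XGBRegressor(objective="reg:squarederror"),
    param_distributions={
        "n_estimators": [200, 500, 1000],
        "max_depth": [4, 6, 8],
        "learning_rate": [0.01, 0.05, 0.1],
        "subsample": [0.7, 0.9, 1.0],
    },
    n_iter=20, cv=3, scoring="r2", random_state=42,
)
search.fit(X_train, y_train)
model = search.best_estimator_
print("held-out R^2:", model.score(X_test, y_test))

# SHAP values for interpreting feature importances (the summary plot shown in Part 3)
explainer = shap.TreeExplainer(model)
shap.summary_plot(explainer.shap_values(X_test), X_test)
```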
Without making further changes, I threw a bunch of other models at it; applied naively, none did better than the XGBoost model.
I ran out of time to do much more this week; I’ll have to revisit this project!
Part 3 - Insights
Looking at the SHAP values of the XGBoost model:

The way I interpret this chart is as follows:
for any given variable, higher values are shown in red and lower values in blue. For binary features like ‘has_jardín’, red means the listing has a garden, blue means it doesn’t
a positive SHAP value (to the right) means that value of the variable pushes the predicted price up, and a negative value (to the left) pushes it down
For example:
has_bano - a feature indicating that the word baño is mentioned in the description - has a strong effect in both directions. Descriptions that don’t mention baño tend to get lower predicted prices
agency_name_freq and neighbourhood_freq - this makes sense: listings in certain neighborhoods, or listed by certain agencies, are more or less expensive
days_since_earliest - the older the listing, the lower the predicted price. Probably listings that don’t get sold and remain on the platform have to lower their price over time
I really wanted to be able to publish something like this, taken from this blog article:

But sadly I ran out of time and energy, so I didn’t quite get there. Still, I’m happy with my effort and I can pick up where I left off later.
Learnings
My 52weeks mindset of “I have a week to get something good enough” is a great motivator and remover of obstacles
Web scraping is fun
I found a great analysis mapping gentrification of Mexico City
This project had too many rabbit holes to finish in a week
I think I need a data source with more features to build a decent model. For comparison, the blog post mentioned above used data from Kaggle, which came with 79 features, vs the ~10 or so I managed to scrape
I did this while doing a 5 day fast (technically a fasting mimicking diet). It really puts things on hard mode
Next steps
Start eating food again
Re-scrape to get the extra listings that I missed
Think more deeply about additional features to engineer
Enrich features - eg adding postcode level data from open source databases like
Finish the insights part and generate some sort of conclusion